machine-learning

Skill from pluginagentmarketplace/custom-plugin-ai-data-scientist
What it does

Builds, trains, and evaluates machine learning models for classification, regression, and clustering using scikit-learn's powerful algorithms and techniques.

Part of pluginagentmarketplace/custom-plugin-ai-data-scientist (12 items)

Installation

Add the marketplace to Claude Code:
/plugin marketplace add pluginagentmarketplace/custom-plugin-ai-data-scientist

Install the plugin from the marketplace:
/plugin install ai-data-scientist-plugin@pluginagentmarketplace-ai-data-scientist

Or clone the repository:
git clone https://github.com/pluginagentmarketplace/custom-plugin-ai-data-scientist.git

Then load the plugin in Claude Code:
/plugin load .

Skill Details

SKILL.md

Supervised and unsupervised learning, model selection, and evaluation with scikit-learn. Use for building classification, regression, or clustering models.

Overview

# Machine Learning with Scikit-Learn

Build, train, and evaluate ML models for classification, regression, and clustering.

Quick Start

Classification

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)

# Evaluate
print(classification_report(y_test, predictions))
```

Regression

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"R²: {r2_score(y_test, predictions):.3f}")
```

Clustering

```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Find optimal k (elbow method)
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

# Train with optimal k
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(X)
```
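
The elbow is often ambiguous; the silhouette score (higher is better) gives a complementary numeric criterion for choosing k. A minimal sketch, assuming the same feature matrix `X`:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Silhouette is only defined for 2 or more clusters
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```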

Model Selection Guide

Rules of thumb for picking an algorithm; an empirical comparison sketch follows the lists.

Classification:

  • Logistic Regression: Linear, interpretable, baseline
  • Random Forest: Non-linear, feature importance, robust
  • XGBoost: Best performance, handles missing data
  • SVM: Small datasets, kernel trick

Regression:

  • Linear Regression: Linear relationships, interpretable
  • Ridge/Lasso: Regularization, feature selection
  • Random Forest: Non-linear, robust to outliers
  • XGBoost: Best performance, often wins competitions

Clustering:

  • K-Means: Fast, spherical clusters
  • DBSCAN: Arbitrary shapes, handles noise
  • Hierarchical: Dendrogram, no k selection
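
When the guide leaves a tie, a quick cross-validated comparison settles it empirically. A minimal sketch for the classification case, assuming `X` and `y` are already loaded:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Compare baseline candidates with the same CV split and metric
candidates = {
    'logistic': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'svm': SVC(),
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring='f1_weighted')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```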

Evaluation Metrics

Classification:

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
roc_auc = roc_auc_score(y_true, y_pred_proba, multi_class='ovr')
cm = confusion_matrix(y_true, y_pred)
```

Regression:

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
```

Cross-Validation

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
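
To track several metrics in one pass, `cross_validate` accepts a list of scorers; a minimal sketch:

```python
from sklearn.model_selection import cross_validate

# Results dict keys follow the pattern 'test_<scorer name>'
results = cross_validate(model, X, y, cv=5, scoring=['accuracy', 'f1_weighted'])
print(f"Accuracy: {results['test_accuracy'].mean():.3f}")
print(f"F1: {results['test_f1_weighted'].mean():.3f}")
```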

Hyperparameter Tuning

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}

grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")

# Use best model
best_model = grid_search.best_estimator_
```
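
For larger search spaces, `RandomizedSearchCV` samples a fixed number of combinations instead of exhausting the grid; a sketch reusing `param_grid` from above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Samples 10 of the 27 combinations in param_grid
random_search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_grid,
    n_iter=10,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
```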

Feature Engineering

```python
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures

# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Encoding
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
```
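
Note that calling `fit_transform` on the full dataset leaks test-set statistics into training (see best practice 1 below). To keep the split honest, fit transformers on the training set only; a minimal sketch:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)        # apply the training statistics to the test set
```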

Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
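
Pipelines also compose with GridSearchCV; a step's parameters are addressed as `<step_name>__<param>`. A sketch tuning the pipeline above:

```python
from sklearn.model_selection import GridSearchCV

# 'classifier' is the step name given in the Pipeline above
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [5, 10, None]
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
```

A side benefit of this setup: the scaler is re-fit inside each CV fold, so the search itself avoids preprocessing leakage.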

Best Practices

  1. Always split data before preprocessing
  2. Use cross-validation for reliable estimates
  3. Scale features for distance-based models
  4. Handle class imbalance (SMOTE, class weights; sketch below)
  5. Check for overfitting (train vs. test performance)
  6. Save models with joblib or pickle (sketch below)
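
Minimal sketches for items 4 and 6; `'model.joblib'` is just an example filename:

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Item 4: weight classes inversely to their frequency instead of resampling
model = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Item 6: persist the trained model and reload it later
joblib.dump(model, 'model.joblib')
model = joblib.load('model.joblib')
```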

More from this repository (10)

  • reinforcement-learning (Skill): Trains intelligent agents to learn optimal behaviors through interaction with environments using reinforcement learning techniques.
  • computer-vision (Skill): Processes and analyzes images using deep learning models for classification, detection, and visual understanding tasks.
  • data-visualization (Skill): Generates interactive data visualizations and performs exploratory data analysis using Matplotlib, Seaborn, Plotly, and other visualization tools.
  • time-series (Skill): Performs time series analysis using ARIMA, SARIMA, and Prophet, detecting trends, seasonality, and anomalies for precise temporal predictions.
  • statistical-analysis (Skill): Performs rigorous statistical analysis using Python's SciPy, enabling hypothesis testing, A/B testing, and data validation across various statistical methods.
  • python-programming (Skill): Enables efficient Python programming for data science, covering fundamentals, data manipulation, and advanced library usage with NumPy and Pandas.
  • data-engineering (Skill): Builds scalable data pipelines and infrastructure using Apache Spark, Airflow, and big data processing techniques for efficient ETL workflows.
  • model-optimization (Skill): Optimizes machine learning models through techniques like quantization, pruning, hyperparameter tuning, and AutoML for improved performance and efficiency.
  • deep-learning (Skill): Develops neural network models using PyTorch and TensorFlow for advanced machine learning tasks like image classification, NLP, and pattern recognition.
  • nlp-processing (Skill): nlp-processing skill from pluginagentmarketplace/custom-plugin-ai-data-scientist.