scikit-learn integration guide#
Scikit-learn (also known as sklearn) is an open source machine learning framework commonly used for building predictive models. With the Neptune–scikit-learn integration, you can track your classifiers, regressors, and k-means clustering results, specifically:
- Classifier and regressor parameters
- Pickled model
- Test predictions and their probabilities
- Test scores
- Classifier and regressor visualizations, such as confusion matrix, precision–recall chart, and feature importance chart
- K-means cluster labels and clustering visualizations
- Code snapshots and Git information
- Custom model-building metadata
Before you start#
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- Have scikit-learn installed.
Installing the integration#
To use your preinstalled version of Neptune together with the integration:
pip install -U neptune-sklearn
To install both Neptune and the integration:
pip install -U "neptune[sklearn]"
Passing your Neptune credentials
Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.
To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.
Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.
On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt:
setx SOME_NEPTUNE_VARIABLE "some-value"
While it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune.
run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)
For more help, see Set Neptune credentials.
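Before initializing a run, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch that uses only the standard library. The helper name `missing_neptune_variables` is hypothetical and not part of the Neptune client.

```python
import os

# Names of the environment variables that this guide asks you to set.
REQUIRED_VARIABLES = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_variables(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to os.environ; pass a plain dict to try the check out.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_neptune_variables()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("Neptune credentials found in the environment.")
```

An empty list means both variables are set. Note that a variable set with setx only becomes visible in terminals opened afterwards.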
If you'd rather follow the guide without any setup, you can run the example in Colab.
scikit-learn logging example#
This example shows how to log and observe metadata as you train your model with scikit-learn.
- Create and fit an example estimator. The snippet below illustrates the idea:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
estimator = RandomForestRegressor(**parameters)

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
)

estimator.fit(X_train, y_train)
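- Create a Neptune run:
import neptune

run = neptune.init_run()
If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN and the public project "common/sklearn-integration" to init_run().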
- To log parameters of your model training run, pass them to the namespace of your choice.
For example, to log them under the namespace "params", assign the dictionary: run["params"] = parameters
- Similarly, log scores on the test data under the namespaces and fields of your choice.
- To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
- Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-115
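The parameter and score steps above can be sketched as a small helper. This is an illustration under stated assumptions rather than part of the integration: the function name `log_params_and_scores` is hypothetical, the score values are made up, and a plain dict stands in for the run object so that the resulting field paths can be inspected without connecting to Neptune.

```python
def log_params_and_scores(run, parameters, scores):
    """Store parameters under "params" and each score under "scores/<name>".

    `run` is anything that supports item assignment: a Neptune run in real
    use, or a plain dict when you only want to inspect the field paths.
    """
    run["params"] = dict(parameters)
    for name, value in scores.items():
        run[f"scores/{name}"] = float(value)
    return run


# Stand-in for a real run; the score values are made up for illustration.
fake_run = log_params_and_scores(
    run={},
    parameters={"n_estimators": 70, "max_depth": 7, "min_samples_split": 3},
    scores={"max_error": 2.5, "r2_score": 0.75},
)
print(sorted(fake_run))  # ['params', 'scores/max_error', 'scores/r2_score']
```

With a real run, the scores would typically come from sklearn.metrics functions such as max_error(y_test, estimator.predict(X_test)), and the script would end with run.stop().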
Logging estimator parameters#
To only log estimator parameters, use the get_estimator_params() function:
import neptune
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

run = neptune.init_run(name="only estimator params")  # name is optional
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rfc))
run.stop()
Logging pickled model#
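To log a fitted model as a pickled file, use the get_pickled_model() function:
import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Example data; any training set works here
X, y = load_digits(return_X_y=True)

rfc = RandomForestClassifier()
rfc.fit(X, y)

run = neptune.init_run(
    name="only pickled model",  # optional
)
run["estimator/pickled-model"] = npt_utils.get_pickled_model(rfc)
run.stop()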
Logging confusion matrix#
Use the create_confusion_matrix_chart() function to log a confusion matrix chart:
import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc.fit(X_train, y_train)

run = neptune.init_run(
    name="only confusion matrix",  # optional
)
run["confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
    rfc, X_train, X_test, y_train, y_test
)
run.stop()
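To make it easier to read the logged chart, here is a minimal sketch of the counts that a confusion matrix holds. It is plain Python written for illustration, not the code that the integration uses, and the labels are made up. Rows correspond to true labels and columns to predicted labels, which is the layout scikit-learn uses.

```python
def confusion_counts(y_true, y_pred, labels):
    """Return a matrix where row i, column j counts the samples whose true
    label is labels[i] and whose predicted label is labels[j]."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
matrix = confusion_counts(y_true, y_pred, labels=[0, 1, 2])

for row in matrix:
    print(row)
```

The diagonal holds the correctly classified samples, so off-diagonal entries show which classes the model confuses with each other.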
More options#
You can also log regressor, classifier, or k-means summary information to Neptune. The summary includes:
- All parameters
- Visualizations
- Logged metadata
- Code snapshot and Git metadata
- K-means: Cluster labels
- Classifier and regressor:
  - Pickled model
  - Test predictions and their probabilities
  - Test scores
Logging classification summary#
Start by preparing a fitted classifier.
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
parameters = {
"n_estimators": 120,
"learning_rate": 0.12,
"min_samples_split": 3,
"min_samples_leaf": 2,
}
gbc = GradientBoostingClassifier(**parameters)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
gbc.fit(X_train, y_train)
We'll use the gbc object to log metadata to the Neptune run.
- Create a run:
import neptune

run = neptune.init_run(
    name="classification example",  # optional
    tags=["GradientBoostingClassifier", "classification"],  # optional
)
If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN together with a public project name to init_run().
- In a namespace of your choice, log the classifier summary:
import neptune.integrations.sklearn as npt_utils

run["cls_summary"] = npt_utils.create_classifier_summary(
    gbc, X_train, X_test, y_train, y_test
)
In the snippet above, the namespace is cls_summary.
- To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
- Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-95
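If logging raises an exception, the stop() step described above may never be reached. One way to guard against that is a try/finally wrapper, sketched below. The `stopping` helper and the `FakeRun` class are hypothetical names used only for this illustration; `FakeRun` stands in for a real run so that the sketch works without a Neptune connection.

```python
from contextlib import contextmanager


@contextmanager
def stopping(run):
    """Yield `run` and call run.stop() afterwards, even if logging fails."""
    try:
        yield run
    finally:
        run.stop()


class FakeRun(dict):
    """Hypothetical stand-in for a Neptune run, used only in this sketch."""

    stopped = False

    def stop(self):
        self.stopped = True


fake_run = FakeRun()
with stopping(fake_run) as run:
    # In a real script, the classifier summary would be assigned here.
    run["cls_summary/note"] = "summary would be logged here"

print(fake_run.stopped)  # True
```

In a real script, you would pass the object returned by neptune.init_run() to the wrapper instead of FakeRun().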
Logging regression summary#
Start by preparing a fitted regressor.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
rfr = RandomForestRegressor(**parameters)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rfr.fit(X_train, y_train)
We'll use the rfr object to log metadata to the Neptune run.
- Create a run:
import neptune

run = neptune.init_run(
    name="regression example",  # optional
    tags=["RandomForestRegressor", "regression"],  # optional
)
If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN together with a public project name to init_run().
- In a namespace of your choice, log the regressor summary:
import neptune.integrations.sklearn as npt_utils

run["rfr_summary"] = npt_utils.create_regressor_summary(
    rfr, X_train, X_test, y_train, y_test
)
In the snippet above, the namespace is rfr_summary.
- To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
- Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-92
Logging k-means clustering summary#
Start by preparing a KMeans object and example data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
parameters = {"n_init": 11, "max_iter": 270}
km = KMeans(**parameters)
X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)
- Create a run:
import neptune

run = neptune.init_run(
    name="clustering-example",  # optional
    tags=["KMeans", "clustering"],  # optional
)
If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN together with a public project name to init_run().
- In a namespace of your choice, log the k-means summary:
import neptune.integrations.sklearn as npt_utils

run["kmeans_summary"] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)
In the snippet above, the namespace is kmeans_summary.
- To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
- Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-96
Manually logging metadata#
If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.
When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.
import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics inside loops
# (n_epochs, loss, and acc are placeholders that your training code defines)
for epoch in range(n_epochs):
    # Your training loop
    run["train/epoch/loss"].append(loss)  # Each append() call appends a value
    run["train/epoch/accuracy"].append(acc)

# Track artifact versions and metadata
run["train/images"].track_files("./datasets/images")

# Upload entire files
run["test/preds"].upload("path/to/test_preds.csv")

# Log text or other metadata, in a structure of your choosing
run["tokenizer"] = "regexp_tokenize"
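When planning a structure of your own, it can help to list the field paths that a nested dictionary would produce before you log anything. The sketch below does this in plain Python. The helper name `flatten_to_paths` and the sample metadata are hypothetical, and the slash-separated paths simply mirror the style of the fields used above, such as train/epoch/loss.

```python
def flatten_to_paths(metadata, prefix=""):
    """Turn a nested dict into {"a/b/c": value} pairs, one per leaf value."""
    paths = {}
    for key, value in metadata.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            paths.update(flatten_to_paths(value, path))
        else:
            paths[path] = value
    return paths


# Made-up metadata for illustration only.
layout = flatten_to_paths(
    {
        "tokenizer": "regexp_tokenize",
        "data": {"version": "v2", "split": {"test_size": 0.2}},
    }
)

for path, value in layout.items():
    print(f"{path} = {value!r}")
```

Each printed path corresponds to one field that you could assign on the run object, for example run["data/version"] = "v2".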
Related
- Add Neptune to your code
- What you can log and display
- Resume a run
- scikit-learn integration API reference
- neptune-sklearn repo on GitHub
- scikit-learn on GitHub