scikit-learn integration guide#

Scikit-learn (also known as sklearn) is an open source machine learning framework commonly used for building predictive models. With the Neptune–scikit-learn integration, you can track your classifiers, regressors, and k-means clustering results, specifically:

Classifier and regressor parameters
Pickled model
Test predictions and their probabilities
Test scores
Classifier and regressor visualizations, such as confusion matrix, precision–recall chart, and feature importance chart
K-means cluster labels and clustering visualizations
Code snapshots and Git information
Custom model-building metadata

Before you start#

Sign up at neptune.ai/register.
Create a project for storing your metadata.
Have scikit-learn installed.

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip install -U neptune-sklearn

To install both Neptune and the integration:

pip install -U "neptune[sklearn]"

Passing your Neptune credentials

Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

export NEPTUNE_PROJECT="ml-team/classification"

Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"
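
Before initializing a run, it can help to confirm that both variables are visible to Python. The following is a minimal sketch under the assumption that the only checks needed are "is it set" and "does the project name look like workspace-name/project-name". The helper check_neptune_env is hypothetical and not part of the Neptune client.

```python
import os


def check_neptune_env(environ=None):
    """Return a list of problems found with the Neptune environment variables.

    Hypothetical helper for illustration; not part of the Neptune client.
    """
    environ = os.environ if environ is None else environ
    problems = []

    if not environ.get("NEPTUNE_API_TOKEN"):
        problems.append("NEPTUNE_API_TOKEN is not set")

    project = environ.get("NEPTUNE_PROJECT", "")
    if not project:
        problems.append("NEPTUNE_PROJECT is not set")
    else:
        # The full project name has the form workspace-name/project-name.
        workspace, sep, name = project.partition("/")
        if not (sep and workspace and name) or "/" in name:
            problems.append(
                "NEPTUNE_PROJECT should look like workspace-name/project-name"
            )
    return problems


# Report anything that is missing or malformed in the current environment.
for problem in check_neptune_env():
    print(problem)
```

If the script prints nothing, both variables are set and the project name has the expected shape. It does not verify that the token itself is valid.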

While it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune.

import neptune

run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)

For more help, see Set Neptune credentials.

If you'd rather follow the guide without any setup, you can run the example in Colab.

scikit-learn logging example#

This example shows how to log and observe metadata as you train your model with scikit-learn.

Create and fit an example estimator. The snippet below illustrates the idea:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}

estimator = RandomForestRegressor(**parameters)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
)
estimator.fit(X_train, y_train)

Create a Neptune run:

import neptune

run = neptune.init_run()

If you haven't set up your credentials, you can log anonymously:

neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/sklearn-integration",
)

To log parameters of your model training run, pass them to the namespace of your choice.

For example, to log them under the namespace "params":
```
run["params"] = parameters
```
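
Assigning a dictionary creates one field per key under the chosen namespace, so the parameters above end up at paths such as params/n_estimators. The sketch below only illustrates that resulting layout in plain Python. flatten_to_paths is a hypothetical helper, not part of the Neptune client, and it makes no claim about how Neptune stores the data internally.

```python
def flatten_to_paths(metadata, prefix=""):
    """Flatten a nested dict into {"namespace/field": value} pairs.

    Hypothetical helper that mimics the path layout for illustration only.
    """
    paths = {}
    for key, value in metadata.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Nested dictionaries become nested namespaces.
            paths.update(flatten_to_paths(value, path))
        else:
            paths[path] = value
    return paths


parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
fields = flatten_to_paths({"params": parameters})

for path, value in fields.items():
    print(path, "=", value)
```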

Similarly, log scores on the test data under the namespaces and fields of your choice.

from sklearn.metrics import max_error, mean_absolute_error, r2_score

y_pred = estimator.predict(X_test)

run["scores/max_error"] = max_error(y_test, y_pred)
run["scores/mean_absolute_error"] = mean_absolute_error(y_test, y_pred)
run["scores/r2_score"] = r2_score(y_test, y_pred)
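
For reference, the three scores logged above can be written out by hand. The following is a minimal pure-Python sketch for single-output targets, assuming the standard definitions of maximum residual error, mean absolute error, and the coefficient of determination. The function names with the _by_hand suffix are hypothetical. In practice you would use the sklearn.metrics functions, which also handle edge cases such as constant targets.

```python
def max_error_by_hand(y_true, y_pred):
    # Largest absolute difference between a true value and its prediction.
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_absolute_error_by_hand(y_true, y_pred):
    # Average absolute difference across all samples.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_by_hand(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares).
    # Assumes y_true is not constant, so the denominator is nonzero.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, made up purely for illustration.
y_true_toy = [3.0, 5.0, 2.0, 7.0]
y_pred_toy = [2.5, 5.0, 4.0, 8.0]

print(max_error_by_hand(y_true_toy, y_pred_toy))            # 2.0
print(mean_absolute_error_by_hand(y_true_toy, y_pred_toy))  # 0.875
print(round(r2_score_by_hand(y_true_toy, y_pred_toy), 4))   # 0.6441
```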

To stop the connection to Neptune and sync all data, call the stop() method:
```
run.stop()
```
Run your script.

To open the run, click the Neptune link that appears in the console output.

Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-115

Logging estimator parameters#

To only log estimator parameters, use the get_estimator_params() function:

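import neptune
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

run = neptune.init_run(name="only estimator params")  # name is optional

# stringify_unsupported() converts parameter values of unsupported types to strings
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rfc))

run.stop()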

Logging pickled model#

To log a fitted model as a pickled file, use the get_pickled_model() function:

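import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Example data; replace with your own training set
X, y = load_digits(return_X_y=True)

rfc = RandomForestClassifier()
rfc.fit(X, y)

run = neptune.init_run(
    name="only pickled model",  # optional
)

run["estimator/pickled-model"] = npt_utils.get_pickled_model(rfc)

run.stop()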

Logging confusion matrix#

Use the create_confusion_matrix_chart() function to log a confusion matrix chart:

import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc.fit(X_train, y_train)

run = neptune.init_run(
    name="only confusion matrix",  # optional
)

run["confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
    rfc, X_train, X_test, y_train, y_test
)

run.stop()
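
To make the chart easier to read, here is a small sketch of the counts that a confusion matrix holds, assuming the common convention that rows are true labels and columns are predicted labels. count_confusions is a hypothetical pure-Python helper for illustration only. It is not how the integration builds the chart.

```python
def count_confusions(y_true, y_pred, labels):
    """Return a matrix where cell [i][j] counts samples whose true label is
    labels[i] and whose predicted label is labels[j].

    Hypothetical helper for illustration; not part of the integration.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


# Toy labels, made up purely for illustration.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

for row in count_confusions(y_true_toy, y_pred_toy, labels=[0, 1, 2]):
    print(row)
```

Correct predictions land on the diagonal, so off-diagonal cells show which classes the model confuses with each other.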

More options#

You can also log regressor, classifier, or k-means summary information to Neptune. The summary includes:

All parameters
Visualizations
Logged metadata
Code snapshot and Git metadata
K-means: Cluster labels
Classifier and regressor:
- Pickled model
- Test predictions and their probabilities
- Test scores

Logging classification summary#

Start by preparing a fitted classifier.

Example

from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

parameters = {
    "n_estimators": 120,
    "learning_rate": 0.12,
    "min_samples_split": 3,
    "min_samples_leaf": 2,
}

gbc = GradientBoostingClassifier(**parameters)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbc.fit(X_train, y_train)

We'll use the gbc object to log metadata to the Neptune run.

Create a run:

import neptune

run = neptune.init_run(
    name="classification example",  # optional
    tags=["GradientBoostingClassifier", "classification"],  # optional
)

If you haven't set up your credentials, you can log anonymously:

neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/sklearn-integration",
)

In a namespace of your choice, log the classifier summary:

import neptune.integrations.sklearn as npt_utils

run["cls_summary"] = npt_utils.create_classifier_summary(
    gbc, X_train, X_test, y_train, y_test
)

In the snippet above, the namespace is cls_summary.

To stop the connection to Neptune and sync all data, call the stop() method:
```
run.stop()
```
Run your script.

To open the run, click the Neptune link that appears in the console output.

Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-95

Logging regression summary#

Start by preparing a fitted regressor.

Example

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}

rfr = RandomForestRegressor(**parameters)

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfr.fit(X_train, y_train)

We'll use the rfr object to log metadata to the Neptune run.

Create a run:

import neptune

run = neptune.init_run(
    name="regression example",  # optional
    tags=["RandomForestRegressor", "regression"],  # optional
)

If you haven't set up your credentials, you can log anonymously:

neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/sklearn-integration",
)

In a namespace of your choice, log the regressor summary:

import neptune.integrations.sklearn as npt_utils

run["rfr_summary"] = npt_utils.create_regressor_summary(
    rfr, X_train, X_test, y_train, y_test
)

In the snippet above, the namespace is rfr_summary.

To stop the connection to Neptune and sync all data, call the stop() method:
```
run.stop()
```
Run your script.

To open the run, click the Neptune link that appears in the console output.

Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-92

Logging k-means clustering summary#

Start by preparing a KMeans object and example data.

Example

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

parameters = {"n_init": 11, "max_iter": 270}

km = KMeans(**parameters)
X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)

Create a run:

import neptune

run = neptune.init_run(
    name="clustering-example",  # optional
    tags=["KMeans", "clustering"],  # optional
)

If you haven't set up your credentials, you can log anonymously:

neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/sklearn-integration",
)

In a namespace of your choice, log the k-means clustering summary:

import neptune.integrations.sklearn as npt_utils

run["kmeans_summary"] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)

In the snippet above, the namespace is kmeans_summary.

To stop the connection to Neptune and sync all data, call the stop() method:
```
run.stop()
```
Run your script.

To open the run, click the Neptune link that appears in the console output.

Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-96

Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics inside loops
# (n_epochs, loss, and acc are placeholders for values from your own training code)
for epoch in range(n_epochs):
    # Your training loop

    run["train/epoch/loss"].append(loss)  # Each append() call adds one value to the series
    run["train/epoch/accuracy"].append(acc)

# Track artifact versions and metadata
run["train/images"].track_files("./datasets/images")

# Upload entire files
run["test/preds"].upload("path/to/test_preds.csv")

# Log text or other metadata, in a structure of your choosing
run["tokenizer"] = "regexp_tokenize"