scikit-learn integration guide#

Custom dashboard displaying metadata logged with scikit-learn

Scikit-learn (also known as sklearn) is an open source machine learning framework commonly used for building predictive models. With the Neptune–scikit-learn integration, you can track your classifiers, regressors, and k-means clustering results, specifically:

  • Classifier and regressor parameters
  • Pickled model
  • Test predictions and their probabilities
  • Test scores
  • Classifier and regressor visualizations, such as confusion matrix, precision–recall chart, and feature importance chart
  • K-means cluster labels and clustering visualizations
  • Code snapshots and Git information
  • Custom model-building metadata

Before you start#

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip install -U neptune-sklearn

To install both Neptune and the integration:

pip install -U "neptune[sklearn]"

Passing your Neptune credentials

Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

export NEPTUNE_PROJECT="ml-team/classification"

Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"

Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune:

run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)

For more help, see Set Neptune credentials.
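
Before initializing a run, it can be useful to confirm that both environment variables are set and that the project name has the workspace-name/project-name form. The following is a minimal sketch using only the standard library. The helper name check_neptune_env is hypothetical and not part of the Neptune client library.

```python
import os


def check_neptune_env(env=None):
    """Return a list of problems found in the Neptune-related environment variables.

    check_neptune_env is a hypothetical helper for illustration only;
    it is not part of the Neptune client library.
    """
    env = os.environ if env is None else env
    problems = []

    # Both variables are expected to be set and non-empty
    for name in ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    # The full project name has the form workspace-name/project-name
    project = env.get("NEPTUNE_PROJECT", "")
    if project:
        workspace, sep, name = project.partition("/")
        if not (workspace and sep and name) or "/" in name:
            problems.append(
                "NEPTUNE_PROJECT should look like workspace-name/project-name"
            )

    return problems


if __name__ == "__main__":
    # Print any problems found in the current environment
    for problem in check_neptune_env():
        print(problem)
```

An empty result means both variables look usable. The check does not verify that the token or project is actually valid on the Neptune side.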

If you'd rather follow the guide without any setup, you can run the example in Colab.

scikit-learn logging example#

This example shows how to log and observe metadata as you train your model with scikit-learn.

  1. Create and fit an example estimator.

    Prepare a fitted estimator. The snippet below illustrates the idea:

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
    estimator = RandomForestRegressor(**parameters)

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Fit the estimator on the training split
    estimator.fit(X_train, y_train)

  2. Create a Neptune run:

    import neptune

    run = neptune.init_run()

    If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN and a project that accepts anonymous logging to init_run().

  3. To log parameters of your model training run, pass them to the namespace of your choice.

    For example, to log them under the namespace "params":

    run["params"] = parameters
  4. Similarly, log scores on the test data under the namespaces and fields of your choice.

    from sklearn.metrics import max_error, mean_absolute_error, r2_score

    y_pred = estimator.predict(X_test)
    run["scores/max_error"] = max_error(y_test, y_pred)
    run["scores/mean_absolute_error"] = mean_absolute_error(y_test, y_pred)
    run["scores/r2_score"] = r2_score(y_test, y_pred)
  5. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()

  6. Run your script.

    To open the run, click the Neptune link that appears in the console output.


Logging estimator parameters#

To log only the estimator parameters, use the get_estimator_params() function:

import neptune
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

run = neptune.init_run(name="only estimator params")  # name is optional

# stringify_unsupported() converts values that Neptune can't store natively to strings
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rfc))
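
Estimator parameters often contain values such as None, which is why the snippet above wraps them in stringify_unsupported(). As a rough picture of the idea only (not the library's actual implementation), the sketch below converts any value that isn't a plain number, string, or Boolean to its string form. The helper name stringify_non_basic and the example dictionary are assumptions made for illustration.

```python
def stringify_non_basic(params):
    """Return a copy of params where non-basic values are replaced by strings.

    Hypothetical stand-in that only illustrates the idea behind
    stringify_unsupported(); it is not Neptune's implementation.
    """
    basic_types = (bool, int, float, str)
    return {
        key: value if isinstance(value, basic_types) else str(value)
        for key, value in params.items()
    }


# Made-up parameters resembling what an estimator might report
example_params = {
    "n_estimators": 100,
    "max_depth": None,
    "max_features": "sqrt",
    "ccp_alpha": 0.0,
}

print(stringify_non_basic(example_params))
```

Only the None value changes here (it becomes the string "None"). The numbers and the string pass through untouched.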


Logging pickled model#

To log a fitted model as a pickled file, use the get_pickled_model() function:

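import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.ensemble import RandomForestClassifier

# X and y are your training features and labels
rfc = RandomForestClassifier()
rfc.fit(X, y)

run = neptune.init_run(
    name="only pickled model",  # optional
)

run["estimator/pickled-model"] = npt_utils.get_pickled_model(rfc)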
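
Pickling serializes a Python object to bytes so that it can be stored and restored later. The sketch below shows that round trip with the standard-library pickle module, using a plain dictionary as a stand-in for a fitted estimator. How get_pickled_model() packages the file internally is not shown here.

```python
import pickle

# A plain dictionary stands in for a fitted estimator in this illustration
model_stand_in = {"kind": "RandomForestClassifier", "n_estimators": 100}

# Serialize the object to bytes, as you would before storing it as a file
payload = pickle.dumps(model_stand_in)

# Restore an equivalent object from the bytes
restored = pickle.loads(payload)

print(restored == model_stand_in)
```

Unpickling can execute arbitrary code, so only load pickled models from sources you trust.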


Logging confusion matrix#

Use the create_confusion_matrix_chart() function to log a confusion matrix chart:

import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc.fit(X_train, y_train)

run = neptune.init_run(
    name="only confusion matrix",  # optional
)

run["confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
    rfc, X_train, X_test, y_train, y_test
)

More options#

You can also log regressor, classifier, or k-means summary information to Neptune. The summary includes:

  • All parameters
  • Visualizations
  • Logged metadata
  • Code snapshot and Git metadata
  • K-means: Cluster labels
  • Classifier and regressor:
    • Pickled model
    • Test predictions and their probabilities
    • Test scores
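
Test scores are plain functions of the test targets and the model's predictions. To make the values concrete, the sketch below computes by hand the three regression scores logged manually earlier in this guide (max error, mean absolute error, and R²), using made-up numbers. The manual_* function names are illustrative. In practice you would use the scikit-learn metrics.

```python
def manual_max_error(y_true, y_pred):
    """Largest absolute difference between a target and its prediction."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def manual_mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def manual_r2_score(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions for illustration
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

scores = {
    "max_error": manual_max_error(y_true, y_pred),
    "mean_absolute_error": manual_mean_absolute_error(y_true, y_pred),
    "r2_score": manual_r2_score(y_true, y_pred),
}
print(scores)
```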

Logging classification summary#

Start by preparing a fitted classifier.

from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

parameters = {
    "n_estimators": 120,
    "learning_rate": 0.12,
    "min_samples_split": 3,
    "min_samples_leaf": 2,
}

gbc = GradientBoostingClassifier(**parameters)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the classifier on the training split
gbc.fit(X_train, y_train)

We'll use the gbc object to log metadata to the Neptune run.

  1. Create a run:

    import neptune

    run = neptune.init_run(
        name="classification example",  # optional
        tags=["GradientBoostingClassifier", "classification"],  # optional
    )

    If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN and a project that accepts anonymous logging to init_run().

  2. In a namespace of your choice, log the classifier summary:

    import neptune.integrations.sklearn as npt_utils

    run["cls_summary"] = npt_utils.create_classifier_summary(
        gbc, X_train, X_test, y_train, y_test
    )

    In the snippet above, the namespace is cls_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.


Logging regression summary#

Start by preparing a fitted regressor.

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}

rfr = RandomForestRegressor(**parameters)

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the regressor on the training split
rfr.fit(X_train, y_train)

We'll use the rfr object to log metadata to the Neptune run.

  1. Create a run:

    import neptune

    run = neptune.init_run(
        name="regression example",  # optional
        tags=["RandomForestRegressor", "regression"],  # optional
    )

    If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN and a project that accepts anonymous logging to init_run().

  2. In a namespace of your choice, log the regressor summary:

    import neptune.integrations.sklearn as npt_utils

    run["rfr_summary"] = npt_utils.create_regressor_summary(
        rfr, X_train, X_test, y_train, y_test
    )

    In the snippet above, the namespace is rfr_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.


Logging k-means clustering summary#

Start by preparing a KMeans object and example data.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

parameters = {"n_init": 11, "max_iter": 270}

km = KMeans(**parameters)
X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)

  1. Create a run:

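    import neptune

    run = neptune.init_run(
        name="clustering-example",  # optional
        tags=["KMeans", "clustering"],  # optional
    )

    If you haven't set up your credentials, you can log anonymously by passing api_token=neptune.ANONYMOUS_API_TOKEN and a project that accepts anonymous logging to init_run().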

  2. In a namespace of your choice, log the k-means clustering summary:

    import neptune.integrations.sklearn as npt_utils
    run["kmeans_summary"] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)

    In the snippet above, the namespace is kmeans_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.


Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics inside loops
for epoch in range(n_epochs):
    # Your training loop

    run["train/epoch/loss"].append(loss)  # Each append() call appends a value

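# Track artifact versions and metadata (placeholder field name and path)
run["train/images"].track_files("./datasets/images")

# Upload entire files (placeholder field name and path)
run["test/preds"].upload("path/to/test_preds.csv")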

# Log text or other metadata, in a structure of your choosing
run["tokenizer"] = "regexp_tokenize"
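
The field paths used throughout this guide, such as "scores/max_error", use slashes to separate namespaces from field names. As a mental model only, the sketch below arranges such paths into nested dictionaries. The helper nest_fields is hypothetical, the values are made up, and the result does not reflect how Neptune stores metadata internally.

```python
def nest_fields(flat_fields):
    """Arrange slash-separated field paths into nested dictionaries.

    Hypothetical helper for illustration; not part of the Neptune client.
    """
    nested = {}
    for path, value in flat_fields.items():
        # Everything before the last slash is a namespace; the rest is the field
        *namespaces, field = path.split("/")
        node = nested
        for namespace in namespaces:
            node = node.setdefault(namespace, {})
        node[field] = value
    return nested


# Made-up values for illustration
example_fields = {
    "params/n_estimators": 70,
    "scores/max_error": 1.0,
    "scores/r2_score": 0.84,
    "tokenizer": "regexp_tokenize",
}

print(nest_fields(example_fields))
```

Fields that share a prefix, such as the two scores, end up grouped under the same namespace. That grouping is what makes a consistent naming scheme worthwhile.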