Working with scikit-learn#

Custom dashboard displaying metadata logged with scikit-learn

Scikit-learn (also known as sklearn) is an open-source machine learning framework commonly used for building predictive models.

With the Neptune–scikit-learn integration, you can track your classifiers, regressors, and k-means clustering results, specifically:

  • Classifier and regressor parameters
  • Pickled model
  • Test predictions and their probabilities
  • Test scores
  • Classifier and regressor visualizations, such as confusion matrix, precision–recall chart, and feature importance chart
  • K-means cluster labels and clustering visualizations
  • Code snapshots and Git information
  • Custom model-building metadata

Before you start#

Tip

If you'd rather follow the guide without any setup, you can run the example in Colab.

Installing the Neptune–scikit-learn integration#

On the command line or in a terminal app, such as Command Prompt, enter one of the following, depending on your package manager.

With pip:

pip install neptune-sklearn

With conda:

conda install -c conda-forge neptune-sklearn
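
To confirm that the installation succeeded, you can ask Python for the installed version of the package. The following is a minimal sketch that uses only the standard library; the helper name installed_version is made up for this illustration and is not part of Neptune.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if the integration is installed in this environment.
version = installed_version("neptune-sklearn")
print(version if version else "neptune-sklearn is not installed")
```

If the check reports that the package is missing, make sure that you ran the install command in the same environment that runs your script or notebook.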

scikit-learn logging example#

This example shows how to log and observe metadata as you train your model with scikit-learn.

  1. Create and fit an example estimator.

    Prepare a fitted estimator. The snippet below illustrates the idea:

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    
    parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
    
    estimator = RandomForestRegressor(**parameters)
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.20,
    )
    estimator.fit(X_train, y_train)
    
  2. Create a Neptune run:

    import neptune.new as neptune
    
    run = neptune.init_run()  # (1)
    
    1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
  3. To log parameters of your model training run, pass them to the namespace of your choice.

    For example, to log them under the namespace "params":

    run["params"] = parameters
    
  4. Similarly, log scores on the test data under the namespaces and fields of your choice.

    from sklearn.metrics import max_error, mean_absolute_error, r2_score
    
    y_pred = estimator.predict(X_test)
    
    run["scores/max_error"] = max_error(y_test, y_pred)
    run["scores/mean_absolute_error"] = mean_absolute_error(y_test, y_pred)
    run["scores/r2_score"] = r2_score(y_test, y_pred)
    
  5. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
    Warning

    Always call stop() in interactive environments, such as a Python interpreter or Jupyter notebook. The connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.

    If you're running a script, the connection is stopped automatically when the script finishes executing. However, it's a best practice to call stop() when the connection is no longer needed.

  6. Run your script.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-115
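
For intuition about the three test scores logged in step 4, the sketch below computes them by hand in plain Python. The sample values and the *_by_hand function names are made up for this illustration; in practice you would keep using the sklearn.metrics functions.

```python
def max_error_by_hand(y_true, y_pred):
    """Largest absolute difference between a target and its prediction."""
    return max(abs(t - p) for t, p in zip(y_true, y_pred))


def mean_absolute_error_by_hand(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score_by_hand(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares).

    Assumes y_true is not constant, otherwise the denominator is zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Same field names as in step 4, collected in a plain dictionary
scores = {
    "max_error": max_error_by_hand(y_true, y_pred),
    "mean_absolute_error": mean_absolute_error_by_hand(y_true, y_pred),
    "r2_score": r2_score_by_hand(y_true, y_pred),
}
print(scores)
```

Lower values are better for the two error metrics, whereas an R² closer to 1 indicates a better fit.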

Logging estimator parameters#

To only log estimator parameters, use the get_estimator_params() method:

import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

run = neptune.init_run(name="only estimator params")  # name is optional

run["estimator/parameters"] = npt_utils.get_estimator_params(rfc)

run.stop()

Logging pickled model#

To log a fitted model as a pickled file, use the get_pickled_model() method:

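import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)  # example data; any training set works

rfc = RandomForestClassifier()
rfc.fit(X, y)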

run = neptune.init_run(
    name="only pickled model",  # optional
)

run["estimator/pickled-model"] = npt_utils.get_pickled_model(rfc)

run.stop()
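
If "pickled" is unfamiliar: pickling serializes a Python object to bytes so that it can be stored as a file and restored later. The sketch below shows the round trip with the standard pickle module. The dictionary is only a stand-in for a fitted estimator, and the file name is made up for this illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted estimator: any picklable Python object behaves the same way.
model_stand_in = {"n_estimators": 70, "max_depth": 7}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name

    # Serialize the object to a file
    with open(path, "wb") as f:
        pickle.dump(model_stand_in, f)

    # Restore it later, for example after downloading the file
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_stand_in)
```

Only unpickle files that you trust, because loading a pickle can execute arbitrary code. For real estimators, it is also safest to load the model with the same scikit-learn version that was used to save it.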

Logging confusion matrix#

Use the create_confusion_matrix_chart() method to log a confusion matrix chart:

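import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()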
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc.fit(X_train, y_train)

run = neptune.init_run(
    name="only confusion matrix",  # optional
)

run["confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
    rfc, X_train, X_test, y_train, y_test
)

run.stop()
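
For intuition about what the chart summarizes, the sketch below tallies a confusion matrix by hand: rows are true labels, columns are predicted labels, and each cell counts how often that pair occurred. The labels are made up and the confusion_counts helper is hypothetical; this is not how the integration builds the chart.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return the sorted labels and a matrix of (true label, predicted label) counts."""
    labels = sorted(set(y_true) | set(y_pred))
    pair_counts = Counter(zip(y_true, y_pred))
    # Row index = true label, column index = predicted label
    matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Made-up labels, for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

labels, matrix = confusion_counts(y_true, y_pred)
for label, row in zip(labels, matrix):
    print(label, row)
```

Counts on the diagonal are correct predictions; everything off the diagonal is a misclassification.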

More options#

You can also log regressor, classifier, or k-means summary information to Neptune. The summary includes:

  • All parameters
  • Visualizations
  • Logged metadata
  • Code snapshot and Git metadata
  • K-means: Cluster labels
  • Classifier and regressor:
    • Pickled model
    • Test predictions and their probabilities
    • Test scores

Logging classification summary#

Start by preparing a fitted classifier.

Example
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

parameters = {
    "n_estimators": 120,
    "learning_rate": 0.12,
    "min_samples_split": 3,
    "min_samples_leaf": 2,
}

gbc = GradientBoostingClassifier(**parameters)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbc.fit(X_train, y_train)

We'll use the gbc object to log metadata to the Neptune run.

  1. Create a run:

    import neptune.new as neptune
    
    run = neptune.init_run(  # (1)
        name="classification example",  # optional
        tags=["GradientBoostingClassifier", "classification"],  # optional
    )
    
    1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
  2. In a namespace of your choice, log the classifier summary:

    import neptune.new.integrations.sklearn as npt_utils
    
    run["cls_summary"] = npt_utils.create_classifier_summary(
        gbc, X_train, X_test, y_train, y_test
    )
    

    In the snippet above, the namespace is cls_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
    Warning

    Always call stop() in interactive environments, such as a Python interpreter or Jupyter notebook. The connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.

    If you're running a script, the connection is stopped automatically when the script finishes executing. However, it's a best practice to call stop() when the connection is no longer needed.

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-95

In the left pane, switch between the different sections to view the logged metadata.
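
Because the warning in step 3 asks you to always call stop(), one way to make sure it happens even when logging fails is a try/finally block. The sketch below demonstrates the pattern with a stand-in object, so it runs without a Neptune connection. StandInRun, log_summary_and_stop, and the summary values are hypothetical and only mimic the two operations used in this guide (item assignment and stop()).

```python
class StandInRun(dict):
    """Hypothetical stand-in for a Neptune run; records whether stop() was called."""

    def __init__(self):
        super().__init__()
        self.stopped = False

    def stop(self):
        self.stopped = True


def log_summary_and_stop(run, summary):
    """Assign the summary to a namespace and always stop the run afterwards."""
    try:
        run["cls_summary"] = summary
    finally:
        # Runs whether or not the assignment above raised an exception
        run.stop()


run = StandInRun()
log_summary_and_stop(run, {"note": "placeholder summary"})
print(run.stopped)
```

With a real run, you would pass the object returned by neptune.init_run() instead of the stand-in.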

Logging regression summary#

Start by preparing a fitted regressor.

Example
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}

rfr = RandomForestRegressor(**parameters)

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfr.fit(X_train, y_train)

We'll use the rfr object to log metadata to the Neptune run.

  1. Create a run:

    import neptune.new as neptune
    
    run = neptune.init_run(  # (1)
        name="regression example",  # optional
        tags=["RandomForestRegressor", "regression"],  # optional
    )
    
    1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
  2. In a namespace of your choice, log the regressor summary:

    import neptune.new.integrations.sklearn as npt_utils
    
    run["rfr_summary"] = npt_utils.create_regressor_summary(
        rfr, X_train, X_test, y_train, y_test
    )
    

    In the snippet above, the namespace is rfr_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
    Warning

    Always call stop() in interactive environments, such as a Python interpreter or Jupyter notebook. The connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.

    If you're running a script, the connection is stopped automatically when the script finishes executing. However, it's a best practice to call stop() when the connection is no longer needed.

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-92

In the left pane, switch between the different sections to view the logged metadata.

Logging k-means clustering summary#

Start by preparing a KMeans object and example data.

Example
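from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

parameters = {"n_init": 11, "max_iter": 270}

km = KMeans(**parameters)
X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)

We'll use the km object to log metadata to the Neptune run.

  1. Create a run: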

    import neptune.new as neptune
    
    run = neptune.init_run(  # (1)
        name="clustering-example",  # optional
        tags=["KMeans", "clustering"],  # optional
    )
    
    1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
  2. In a namespace of your choice, log the k-means clustering summary:

    import neptune.new.integrations.sklearn as npt_utils
    
    run["kmeans_summary"] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)
    

    In the snippet above, the namespace is kmeans_summary.

  3. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
    Warning

    Always call stop() in interactive environments, such as a Python interpreter or Jupyter notebook. The connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.

    If you're running a script, the connection is stopped automatically when the script finishes executing. However, it's a best practice to call stop() when the connection is no longer needed.

  4. Run your script.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-96

In the left pane, switch between the different sections to view the logged metadata.
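
For intuition about the cluster labels included in the summary: once k-means has settled on its centroids, each sample's label is the index of its nearest centroid. The sketch below shows that assignment step in plain Python with made-up points and centroids. The nearest_centroid_labels helper is hypothetical and is not how scikit-learn or the integration implements it.

```python
import math


def nearest_centroid_labels(points, centroids):
    """Label each point with the index of its closest centroid (Euclidean distance)."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


# Made-up 2D data, for illustration only
points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]

labels = nearest_centroid_labels(points, centroids)
print(labels)
```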

Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library (neptune-client).

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune.new as neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics or other values inside loops
for epoch in range(n_epochs):
    ...  # Your training loop

    run["train/epoch/loss"].log(loss)  # Each log() appends a value
    run["train/epoch/accuracy"].log(acc)

# Upload files
run["test/preds"].upload("path/to/test_preds.csv")

# Track and version artifacts
run["train/images"].track_files("./datasets/images")

# Record numbers or text
run["tokenizer"] = "regexp_tokenize"
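
The field paths in the snippet above use slashes to separate namespaces, so "train/epoch/loss" is the field loss inside the namespace train/epoch. The sketch below shows the resulting hierarchy by expanding a few paths into a nested dictionary. It is plain Python for illustration only, not Neptune's implementation, and the nest_fields helper and sample values are made up.

```python
def nest_fields(flat_fields):
    """Expand slash-separated field paths into a nested dictionary."""
    tree = {}
    for path, value in flat_fields.items():
        *namespaces, field = path.split("/")
        node = tree
        for namespace in namespaces:
            # Create the namespace on first use, then descend into it
            node = node.setdefault(namespace, {})
        node[field] = value
    return tree


# Made-up values, for illustration only
fields = {
    "train/epoch/loss": [0.9, 0.5],
    "train/epoch/accuracy": [0.6, 0.8],
    "tokenizer": "regexp_tokenize",
}

structure = nest_fields(fields)
print(structure)
```

Grouping related fields under a shared namespace, such as train/epoch, keeps them together when you browse the run's metadata.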