Working with scikit-learn#
Scikit-learn (also known as sklearn) is an open-source machine learning framework commonly used for building predictive models.
With the Neptune–scikit-learn integration, you can track your classifiers, regressors, and k-means clustering results, specifically:
- Classifier and regressor parameters
- Pickled model
- Test predictions and their probabilities
- Test scores
- Classifier and regressor visualizations, such as confusion matrix, precision–recall chart, and feature importance chart
- K-means cluster labels and clustering visualizations
- Code snapshots and Git information
- Custom model-building metadata
Related
- API reference ≫ scikit-learn
- neptune-sklearn repo on GitHub
- scikit-learn on GitHub
- For other types of metadata you can track, see What you can log and display.
Before you start#
Tip
If you'd rather follow the guide without any setup, you can run the example in Colab.
- Set up Neptune.
Installing the Neptune–scikit-learn integration#
On the command line or in a terminal app, such as Command Prompt, enter the following:
pip install -U neptune-sklearn
scikit-learn logging example#
This example shows how to log and observe metadata as you train your model with scikit-learn.
-
Create and fit an example estimator. The snippet below illustrates the idea:
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
estimator = RandomForestRegressor(**parameters)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
)
estimator.fit(X_train, y_train)
-
Create a Neptune run:
import neptune
run = neptune.init_run()
- If you haven't set up your credentials, you can log anonymously:
neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
-
To log parameters of your model training run, pass them to the namespace of your choice.
For example, to log them under the namespace "params":
run["params"] = parameters
-
Similarly, log scores on the test data under the namespaces and fields of your choice.
-
To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
Using stop() is especially important in Jupyter Notebook or other interactive sessions, as the connection otherwise remains open until the session ends completely.
-
Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-115
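The parameter and score logging steps above can be sketched as follows. This is a minimal sketch under several assumptions: a synthetic dataset from make_regression stands in for the California housing data so that nothing has to be downloaded, the choice of metrics and the "scores/..." field names are illustrative, and a plain dictionary stands in for the run object so that the snippet runs without Neptune credentials. With a run created by neptune.init_run(), the same assignments would be made on the run object instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import max_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption) so the sketch needs no download.
X, y = make_regression(n_samples=300, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
estimator = RandomForestRegressor(random_state=0, **parameters)
estimator.fit(X_train, y_train)

# Stand-in for the object returned by neptune.init_run() (assumption):
# a dict accepts the same run["path"] = value assignments.
run = {}

# Log the training parameters under the "params" namespace.
run["params"] = parameters

# Log test scores under fields of your choice (names are illustrative).
y_pred = estimator.predict(X_test)
run["scores/max_error"] = float(max_error(y_test, y_pred))
run["scores/mean_absolute_error"] = float(mean_absolute_error(y_test, y_pred))
run["scores/r2_score"] = float(r2_score(y_test, y_pred))

# With a real Neptune run, you would finish with run.stop().
print(sorted(run))
```

Keeping the scores under a shared "scores" prefix groups them in one namespace, which mirrors how the parameters are grouped under "params".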
Logging estimator parameters#
To log only the estimator parameters, use the get_estimator_params() function:
import neptune
import neptune.integrations.sklearn as npt_utils
from neptune.utils import stringify_unsupported
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
run = neptune.init_run(name="only estimator params")  # name is optional
run["estimator/params"] = stringify_unsupported(npt_utils.get_estimator_params(rfc))
run.stop()
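To get a feel for the values that end up under estimator/params, you can inspect the estimator with scikit-learn's own get_params() method. The sketch below assumes that the integration's output is comparable to this dictionary, which this guide does not confirm. It lists the parameters whose values are not plain numbers, strings, or Booleans, which is presumably the kind of value stringify_unsupported() is meant to handle.

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

# get_params() is part of scikit-learn's estimator API.
params = rfc.get_params()

# Collect values that are not plain numbers, strings, or Booleans.
non_basic = {
    name: value
    for name, value in params.items()
    if not isinstance(value, (bool, int, float, str))
}
print(non_basic)
```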
Logging pickled model#
To log a fitted model as a pickled file, use the get_pickled_model() function:
import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
X, y = load_digits(return_X_y=True)
rfc = RandomForestClassifier()
rfc.fit(X, y)
run = neptune.init_run(
    name="only pickled model",  # optional
)
run["estimator/pickled-model"] = npt_utils.get_pickled_model(rfc)
run.stop()
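As background on what pickling a fitted model involves, the sketch below round-trips an estimator through Python's standard pickle module. It assumes that the file uploaded by the integration is a regular pickle, which this guide does not state explicitly. The smaller n_estimators value is chosen only to keep the example quick.

```python
import pickle

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
rfc = RandomForestClassifier(n_estimators=20, random_state=0)
rfc.fit(X, y)

# Serialize the fitted model to bytes and load it back.
payload = pickle.dumps(rfc)
restored = pickle.loads(payload)

# The restored model should predict exactly like the original.
same = bool((restored.predict(X[:50]) == rfc.predict(X[:50])).all())
print(same)
```

Only unpickle files from sources you trust, and load them with a scikit-learn version compatible with the one used for training.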
Logging confusion matrix#
Use the create_confusion_matrix_chart() function to log a confusion matrix chart:
import neptune
import neptune.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
rfc = RandomForestClassifier()
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc.fit(X_train, y_train)
run = neptune.init_run(
    name="only confusion matrix",  # optional
)
run["confusion-matrix"] = npt_utils.create_confusion_matrix_chart(
    rfc, X_train, X_test, y_train, y_test
)
run.stop()
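If you want to check the numbers behind such a chart, the raw matrix can be computed with scikit-learn directly. This is a minimal sketch that uses sklearn.metrics.confusion_matrix on the same digits split. It is not necessarily how the integration builds its chart.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=28743
)
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, rfc.predict(X_test), labels=rfc.classes_)
print(cm.shape)
print(cm.trace() / cm.sum())  # share of correct predictions
```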
More options#
You can also log regressor, classifier, or k-means summary information to Neptune. The summary includes:
- All parameters
- Visualizations
- Logged metadata
- Code snapshot and Git metadata
- K-means: Cluster labels
- Classifier and regressor:
  - Pickled model
  - Test predictions and their probabilities
  - Test scores
Logging classification summary#
Start by preparing a fitted classifier.
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
parameters = {
    "n_estimators": 120,
    "learning_rate": 0.12,
    "min_samples_split": 3,
    "min_samples_leaf": 2,
}
gbc = GradientBoostingClassifier(**parameters)
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
gbc.fit(X_train, y_train)
We'll use the gbc object to log metadata to the Neptune run.
-
Create a run:
import neptune
run = neptune.init_run(
    name="classification example",  # optional
    tags=["GradientBoostingClassifier", "classification"],  # optional
)
- If you haven't set up your credentials, you can log anonymously:
neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
-
In a namespace of your choice, log the classifier summary:
import neptune.integrations.sklearn as npt_utils
run["cls_summary"] = npt_utils.create_classifier_summary(
    gbc, X_train, X_test, y_train, y_test
)
In the snippet above, the namespace is cls_summary.
-
To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
Using stop() is especially important in Jupyter Notebook or other interactive sessions, as the connection otherwise remains open until the session ends completely.
-
Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-95
In the left pane, switch between the different sections to view the logged metadata.
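To relate the summary to plain scikit-learn calls, the sketch below computes test predictions, their probabilities, and a test score by hand. Mapping these summary items to predict(), predict_proba(), and score() is an assumption made for illustration. A lighter RandomForestClassifier is swapped in for the GradientBoostingClassifier above to keep the run short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)  # one label per test sample
y_proba = clf.predict_proba(X_test)  # one probability per class and sample
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on the test split

print(y_pred.shape, y_proba.shape)
```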
Logging regression summary#
Start by preparing a fitted regressor.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
parameters = {"n_estimators": 70, "max_depth": 7, "min_samples_split": 3}
rfr = RandomForestRegressor(**parameters)
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
rfr.fit(X_train, y_train)
We'll use the rfr object to log metadata to the Neptune run.
-
Create a run:
import neptune
run = neptune.init_run(
    name="regression example",  # optional
    tags=["RandomForestRegressor", "regression"],  # optional
)
- If you haven't set up your credentials, you can log anonymously:
neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
-
In a namespace of your choice, log the regressor summary:
import neptune.integrations.sklearn as npt_utils
run["rfr_summary"] = npt_utils.create_regressor_summary(
    rfr, X_train, X_test, y_train, y_test
)
In the snippet above, the namespace is rfr_summary.
-
To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
Using stop() is especially important in Jupyter Notebook or other interactive sessions, as the connection otherwise remains open until the session ends completely.
-
Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-92
In the left pane, switch between the different sections to view the logged metadata.
Logging k-means clustering summary#
Start by preparing a KMeans object and example data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
parameters = {"n_init": 11, "max_iter": 270}
km = KMeans(**parameters)
X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)
-
Create a run:
import neptune
run = neptune.init_run(
    name="clustering-example",  # optional
    tags=["KMeans", "clustering"],  # optional
)
- If you haven't set up your credentials, you can log anonymously:
neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/sklearn-integration")
-
In a namespace of your choice, log the k-means clustering summary:
import neptune.integrations.sklearn as npt_utils
run["kmeans_summary"] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)
In the snippet above, the namespace is kmeans_summary.
-
To stop the connection to Neptune and sync all data, call the stop() method:
run.stop()
Using stop() is especially important in Jupyter Notebook or other interactive sessions, as the connection otherwise remains open until the session ends completely.
-
Run your script.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/o/common/org/sklearn-integration/e/SKLEAR-96
In the left pane, switch between the different sections to view the logged metadata.
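For reference, cluster labels of the kind the summary reports can be produced with scikit-learn alone. The sketch below fits KMeans on the same blobs. Setting n_clusters=7 to match the number of generated centers and fixing random_state are assumptions made for this example, not settings taken from the integration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)

# n_clusters=7 matches the number of generated blob centers (assumption).
km = KMeans(n_clusters=7, n_init=11, max_iter=270, random_state=0)

# fit_predict() fits the model and returns one cluster index per sample.
labels = km.fit_predict(X)

print(labels.shape)
print(km.cluster_centers_.shape)
```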
Manually logging metadata#
If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.
When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.
import neptune
# Create a new Neptune run
run = neptune.init_run()
# Log metrics or other values inside loops
for epoch in range(n_epochs):
    ...  # Your training loop
    run["train/epoch/loss"].append(loss)  # Each append() appends a value
    run["train/epoch/accuracy"].append(acc)
# Upload files
run["test/preds"].upload("path/to/test_preds.csv")
# Track and version artifacts
run["train/images"].track_files("./datasets/images")
# Record numbers or text
run["tokenizer"] = "regexp_tokenize"
Related
- What you can log and display
- Resuming a run
- Adding Neptune to your code
- API reference ≫ scikit-learn