scikit-learn is an open source machine learning framework commonly used for building predictive models. Neptune helps with keeping track of model training metadata.
With the Neptune + scikit-learn integration you can track your classifiers, regressors, and k-means clustering results. Specifically, you can:
log classifier and regressor parameters,
log pickled model,
log test predictions,
log test predictions probabilities,
log test scores,
log classifier and regressor visualizations, like confusion matrix, precision-recall chart and feature importance chart,
log KMeans cluster labels and clustering visualizations,
log metadata including git summary info.
You can also log many other types of run metadata, such as interactive charts, video, audio, and more. See the full list.
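For example, here is a minimal sketch of logging a few other metadata types. It assumes a run created with neptune.init, as shown in the snippets below; the field names and the sample.mp4 file are hypothetical examples:

import matplotlib.pyplot as plt
from neptune.new.types import File

# a numeric series, e.g. a score logged repeatedly over time
run['metrics/accuracy'].log(0.92)

# a matplotlib chart, uploaded as an image
fig = plt.figure()
plt.plot([0.5, 0.7, 0.9])
run['visuals/example-chart'].upload(File.as_image(fig))

# arbitrary media files, e.g. a video clip
run['media/sample-video'].upload('sample.mp4')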
This integration is tested with scikit-learn==0.23.2, neptune-client==0.9.4, and neptune-sklearn==0.9.1.
To get started with this integration, follow the quickstart below (recommended as a first step).
You can also go to the demonstrations of the functions that log regressor, classifier, or K-Means summary information to Neptune. Such a summary includes parameters, the pickled model, visualizations, and much more.
Finally, if you want to log only specific information to Neptune, you can make use of the convenience functions, such as get_estimator_params, get_pickled_model, and create_prediction_error_chart.
If you want to try things out and focus only on the code, you can either:
You have Python 3.x and the following libraries installed:
neptune-client (see the neptune-client installation guide),
scikit-learn (see the scikit-learn installation guide),
neptune-sklearn.

pip install scikit-learn==0.23.2 neptune-client==0.9.4 neptune-sklearn==0.9.1
You also need minimal familiarity with scikit-learn. Have a look at this scikit-learn guide to get started.
This quickstart will show you how to use Neptune with sklearn:
Create the first run in your project,
Log estimator parameters and scores,
Explore results in the Neptune UI.
Prepare a fitted estimator that will later be used to log its summary. The snippet below shows the idea:
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {'n_estimators': 70,
              'max_depth': 7,
              'min_samples_split': 3}

estimator = RandomForestRegressor(**parameters)

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

estimator.fit(X_train, y_train)
Add to your script (at the top):
import neptune.new as neptune

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS')
This opens a new “run” in Neptune to which you can log various objects.
You need to tell Neptune who you are and where you want to log things. To do that, you specify:
project=my_workspace/my_project: your workspace name and project name,
api_token=YOUR_API_TOKEN: your Neptune API token.
If you have configured your Neptune API token as described in this docs page, you can skip the api_token argument.
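For example, a minimal sketch of creating a run without passing api_token explicitly. It assumes the NEPTUNE_API_TOKEN environment variable is set, and my_workspace/my_project is a placeholder for your own project:

import neptune.new as neptune

# the API token is read from the NEPTUNE_API_TOKEN environment variable,
# so it does not need to be passed here
run = neptune.init(project='my_workspace/my_project')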
If you are using .py scripts for training, Neptune will also log your training script automatically.
To log parameters of your model training run, you just need to pass them to the base_namespace of your choice.
run['parameters'] = parameters
Log scores on the test data under the base_namespace of your choice.
from sklearn.metrics import max_error, mean_absolute_error, r2_score

y_pred = estimator.predict(X_test)

run['scores/max_error'] = max_error(y_test, y_pred)
run['scores/mean_absolute_error'] = mean_absolute_error(y_test, y_pred)
run['scores/r2_score'] = r2_score(y_test, y_pred)
The Neptune-scikit-learn integration also lets you log regressor, classifier, or K-Means summary information to Neptune. Such a summary includes parameters, the pickled model, visualizations, and much more.
Finally, if you want to log only specific information to Neptune, you can make use of the convenience functions, such as get_estimator_params, get_pickled_model, and create_prediction_error_chart.
You can log a classification summary that includes parameters, the pickled model, test predictions and their probabilities, test scores, and classifier visualizations such as the confusion matrix, precision-recall chart, and feature importance chart.
Prepare a fitted classifier that will later be used to log its summary. The snippet below shows the idea:
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

parameters = {'n_estimators': 120,
              'learning_rate': 0.12,
              'min_samples_split': 3,
              'min_samples_leaf': 2}

gbc = GradientBoostingClassifier(**parameters)

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbc.fit(X_train, y_train)
The gbc object will later be used to log various metadata to the run.
Add the following snippet at the top of your script.
import neptune.new as neptune

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='classification-example',
                   tags=['GradientBoostingClassifier', 'classification'])
This creates a link to the run. Open the link in a new tab.
The run will currently be empty, but keep the window open. You will be able to see the estimator summary there.
When you create a run, Neptune will look for the .git directory in your project and save the last commit information.
If you are using .py scripts for training, Neptune will also log your training script automatically.
Log the classifier summary under the base_namespace of your choice.
import neptune.new.integrations.sklearn as npt_utils

run['cls_summary'] = npt_utils.create_classifier_summary(gbc, X_train, X_test, y_train, y_test)
Once the data is logged, you can switch to the Neptune tab that you opened previously to explore the results. You can check the logged parameters, pickled model, test predictions, scores, and visualizations.
Remember that you can try it out with zero setup.
You can log a regression summary that includes parameters, the pickled model, test predictions, test scores, and regressor visualizations such as the prediction error chart and feature importance chart.
Prepare a fitted regressor that will later be used to log its summary. The snippet below shows the idea:
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

parameters = {'n_estimators': 70,
              'max_depth': 7,
              'min_samples_split': 3}

rfr = RandomForestRegressor(**parameters)

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

rfr.fit(X_train, y_train)
The rfr object will later be used to log various metadata to the run.
Add the following snippet at the top of your script.
import neptune.new as neptune

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='regression-example',
                   tags=['RandomForestRegressor', 'regression'])
This creates a link to the run. Open the link in a new tab.
The run will currently be empty, but keep the window open. You will be able to see the estimator summary there.
When you create a run, Neptune will look for the .git directory in your project and save the last commit information.
If you are using .py scripts for training, Neptune will also log your training script automatically.
Log the regressor summary under the base_namespace of your choice.
import neptune.new.integrations.sklearn as npt_utils

run['rfr_summary'] = npt_utils.create_regressor_summary(rfr, X_train, X_test, y_train, y_test)
Once the data is logged, you can switch to the Neptune tab that you opened previously to explore the results. You can check the logged parameters, pickled model, test predictions, scores, and visualizations.
Remember that you can try it out with zero setup.
You can log a K-Means clustering summary that includes parameters, cluster labels, and clustering visualizations.
Prepare a K-Means object and example data that will later be used to log its summary. The snippet below shows the idea:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

parameters = {'n_init': 11,
              'max_iter': 270}

km = KMeans(**parameters)

X, y = make_blobs(n_samples=579, n_features=17, centers=7, random_state=28743)
Add the following snippet at the top of your script.
import neptune.new as neptune

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='clustering-example',
                   tags=['KMeans', 'clustering'])
This creates a link to the run. Open the link in a new tab.
The run will currently be empty, but keep the window open. You will be able to see the estimator summary there.
When you create a run, Neptune will look for the .git directory in your project and save the last commit information.
If you are using .py scripts for training, Neptune will also log your training script automatically.
Log the K-Means clustering summary under the base_namespace of your choice.
import neptune.new.integrations.sklearn as npt_utils

run['kmeans_summary'] = npt_utils.create_kmeans_summary(km, X, n_clusters=17)
Once the data is logged, you can switch to the Neptune tab that you opened previously to explore the results. You can check the logged parameters, cluster labels, and clustering visualizations.
Remember that you can try it out with zero setup.
You can choose to only log estimator parameters.
import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='other-options')

run['estimator/parameters'] = npt_utils.get_estimator_params(rfc)
You can choose to log a fitted model as a pickled file.
import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# example training data, as in the classification example above
X, y = load_digits(return_X_y=True)

rfc = RandomForestClassifier()
rfc.fit(X, y)

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='other-options')

run['estimator/pickled-model'] = npt_utils.get_pickled_model(rfc)
You can choose to log a confusion matrix chart.
import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rfc = RandomForestClassifier()

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=28743)

rfc.fit(X_train, y_train)

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='other-options')

run['confusion-matrix'] = npt_utils.create_confusion_matrix_chart(rfc, X_train, X_test, y_train, y_test)
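The same pattern works for the other convenience functions mentioned above. For example, here is a minimal sketch of logging a prediction error chart for a regressor with create_prediction_error_chart. The argument order shown here mirrors create_confusion_matrix_chart and is an assumption, as is reusing the regression data from earlier:

import neptune.new as neptune
import neptune.new.integrations.sklearn as npt_utils
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rfr = RandomForestRegressor()

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=28743)

rfr.fit(X_train, y_train)

run = neptune.init(project='common/sklearn-integration',
                   api_token='ANONYMOUS',
                   name='other-options')

# create_prediction_error_chart is one of the convenience functions listed above;
# the argument order is assumed to match create_confusion_matrix_chart
run['prediction-error'] = npt_utils.create_prediction_error_chart(rfr, X_train, X_test, y_train, y_test)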
Remember that you can try it out with zero setup.
Please visit the Getting help page. Everything regarding support is there.
You may also like these two integrations: