
Logging with Neptune in a sequential pipeline#

This tutorial shows how to log all the metadata from a sequence of steps in an ML pipeline to the same run.

In this example, our pipeline consists of a handful of scripts. We want to track metadata from three of them. To do that:

  • We'll set a custom run ID so that we can access the same run from each step.
  • We'll use namespace handlers to create a namespace (folder) for each step, so that the metadata is organized by step inside the run.

We'll set up the steps and namespaces as follows:

  1. data_preprocessing.py → run["preprocessing/..."]
  2. model_training.py → run["training/..."]
  3. model_validation.py → run["validation/..."]

Additionally, we provide an example script that does model promotion: model_promotion.py.

The metadata structure of the resulting run will be:

All metadata
monitoring
|-- preprocessing
    |-- cpu
    |-- memory
    |-- ...
|-- training
|-- validation
preprocessing
|-- <preprocessing metadata>
source_code
sys
training
|-- <training metadata>
validation
|-- <validation metadata>

See example run in Neptune  See full code example on GitHub 

Before you start#

Installing Neptune and the scikit-learn integration#

If you want to reproduce the example pipeline, install the Neptune–scikit-learn integration (and, if needed, Neptune itself).

To use your preinstalled version of Neptune together with the integration:

pip
pip install -U neptune-sklearn
conda
conda install -c conda-forge neptune-sklearn

To install both Neptune and the integration:

pip
pip install -U "neptune[sklearn]"
conda
conda install -c conda-forge neptune neptune-sklearn

Preparation: Exporting a custom run ID#

By exporting a custom run ID as an environment variable, you ensure that every time a run is created by the pipeline, it uses the same ID. This way, the same run is initialized in each step, instead of a new run being created each time.

Linux
export NEPTUNE_CUSTOM_RUN_ID=`date +"%Y%m%d%H%M%s%N" | md5sum`
macOS
export NEPTUNE_CUSTOM_RUN_ID=`date +"%Y%m%d%H%M%s%N" | md5`
Windows
set NEPTUNE_CUSTOM_RUN_ID=`date +"%Y%m%d%H%M%s%N"`

For example, you can have the following in your .sh script, before the model training scripts are executed:

Ubuntu example
export NEPTUNE_CUSTOM_RUN_ID=`date +"%Y%m%d%H%M%s%N" | md5sum`
export NEPTUNE_PROJECT="workspace-name/project-name" # (1)!
...
  1. The full project name. For example, "ml-team/classification".

    To copy it to your clipboard, navigate to the project settings in the top right and select Edit project details.

For the full run_examples.sh, requirements.txt, and utils.py, see the example scripts in the neptune-ai/examples repository.
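
If you prefer to set the ID in code rather than through an environment variable, neptune.init_run() also accepts a custom_run_id argument. A minimal sketch; the ID scheme and the way the value is shared between the scripts are up to you:

import uuid

import neptune

# Generate one ID per pipeline execution and share it with every step,
# for example through a config file or a command-line argument.
pipeline_id = uuid.uuid4().hex  # illustrative ID scheme

run = neptune.init_run(custom_run_id=pipeline_id)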

Example step#

We'll break down one step (script) of the pipeline. The other steps follow the same idea.

In the first step, we download and process some data.

import neptune
from sklearn.datasets import fetch_lfw_people
from utils import *

dataset = fetch_lfw_people(min_faces_per_person=70, resize=0.4)

Next, we create a new Neptune run. To organize the monitoring metrics per step, we also set a custom name for the monitoring namespace.

run = neptune.init_run(
    monitoring_namespace="monitoring/preprocessing",
    ..., # (1)!
)
  1. We recommend saving your API token and project name as environment variables.

    If needed, you can pass them as arguments when initializing Neptune:

    neptune.init_run(
        project="workspace-name/project-name",
        api_token="YourNeptuneApiToken",
    )
    

We then specify the dataset configuration:

dataset_config = {
    "target_names": str(dataset.target_names.tolist()),
    "n_classes": dataset.target_names.shape[0],
    "n_samples": dataset.images.shape[0],
    "height": dataset.images.shape[1],
    "width": dataset.images.shape[2],
}

Next, we set up a "preprocessing" namespace inside the run. This will be the base namespace where all the preprocessing metadata is logged.

preprocessing_handler = run["preprocessing"]

From here on, we'll use the namespace handler for logging. The other steps of the pipeline will use their own namespaces, so that the metadata from each step is separate but still logged inside the same run.

First, we log the dataset details we specified earlier:

preprocessing_handler["dataset/config"] = dataset_config

Next, we preprocess the dataset:

dataset_transform = Preprocessing(
    dataset,
    dataset_config["n_samples"],
    dataset_config["target_names"],
    dataset_config["n_classes"],
    (dataset_config["height"], dataset_config["width"]),
)
path_to_scaler = dataset_transform.scale_data()
path_to_features = dataset_transform.create_and_save_features(
    data_filename="features"
)
dataset_transform.describe()

Finally, we upload the scaler and features files:

preprocessing_handler["dataset/scaler"].upload(path_to_scaler)
preprocessing_handler["dataset/features"].upload(path_to_features)

We're ready to run the script.
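
Because the next scripts will reconnect to this same run, you can also close the connection explicitly once the step is done (Neptune otherwise stops the run automatically when the script exits):

run.stop()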

If Neptune can't find your project name or API token

As a best practice, you should save your Neptune API token and project name as environment variables:

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jv...Yh3Kb8"
export NEPTUNE_PROJECT="ml-team/classification"

Alternatively, you can pass the information when using a function that takes api_token and project as arguments:

run = neptune.init_run( # (1)!
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jv...Yh3Kb8",  # your token here
    project="ml-team/classification",  # your full project name here
)
  1. Also works for init_model(), init_model_version(), init_project(), and integrations that create Neptune runs under the hood, such as NeptuneLogger or NeptuneCallback.

  2. API token: In the bottom-left corner, expand the user menu and select Get my API token.

  3. Project name: You can copy the path from the project details (Edit project details).

If you haven't registered, you can log anonymously to a public project:

api_token=neptune.ANONYMOUS_API_TOKEN
project="common/quickstarts"

Make sure not to publish sensitive data through your code!

Other steps#

The next step in our example pipeline is the training stage.

Note

For all the imports, see the full code example on GitHub .

import neptune
import neptune.integrations.sklearn as npt_utils
from utils import get_data_features
...

We again initialize a run object with a custom monitoring namespace. Because we have exported a custom run ID, that identifier will be used to connect to the same run that was created in the previous step.

run = neptune.init_run(monitoring_namespace="monitoring/training")

We can download the features that were logged in the preprocessing step:

run["preprocessing/dataset/features"].download()

Next, we set up the "training" namespace inside the run. This way, the training metadata is logged to the same run as in the previous step, but organized under a different namespace.

training_handler = run["training"]

We continue with the training as normal:

dataset = get_data_features("features.npz")
...

Logging metadata to the training namespace looks like this:

training_handler["metrics/scores"] = npt_utils.get_scores(clf, X_train_pca, y_train)

You can use this same pattern to assign any other kind of metadata to the namespace handler:

training_handler["some/structure"] = some_metadata

Result

You have the metadata from your entire pipeline logged to the same run, organized by step.

Structure of run namespaces
monitoring                       <-- System metrics, hardware consumption
|-- preprocessing                <-- Each step has its own monitoring namespace
    |-- cpu
    |-- memory
    |-- ...
|-- training
|-- validation
preprocessing                    <-- Base namespace for the preprocessing metadata
|-- <preprocessing metadata>
source_code                      <-- Auto-generated source code namespace
sys                              <-- Auto-generated system namespace, with basic info about the run
training                         <-- Base namespace for the training metadata
|-- <training metadata>
validation                       <-- Base namespace for the validation metadata
|-- <validation metadata>
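
Any later script can reconnect to the run in the same way and read logged values back. For instance, a promotion step such as model_promotion.py could fetch the validation scores before deciding whether to promote the model. A minimal sketch, assuming the validation step logged its scores under validation/metrics/scores:

import neptune

run = neptune.init_run(monitoring_namespace="monitoring/promotion")

# fetch() returns the simple-typed fields under the namespace as a dictionary
val_scores = run["validation/metrics/scores"].fetch()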

Tip

You can create custom dashboards to quickly view important metadata for all the steps. The example run has three custom dashboards, one each for the preprocessing, training, and validation steps.

See full code example on GitHub