Logging with Neptune in a sequential pipeline#
This tutorial shows how to log all the metadata from a sequence of steps in an ML pipeline to the same run.
In this example, our pipeline consists of a handful of scripts. We want to track metadata from three of them. To do that:
- We'll set a custom run ID so that we can access the same run from each step.
- We'll use namespace handlers to create a namespace (folder) for each step, so that the metadata is organized by step inside the run.
We'll set up the steps and namespaces as follows:
- Preprocessing script → "preprocessing" namespace
- Training script → "training" namespace
- Validation script → "validation" namespace
Additionally, we provide an example script that does model promotion.
The metadata structure of the resulting run is shown at the end of this tutorial.
Before you start#
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- To follow this tutorial, you need scikit-learn installed.
Installing Neptune and the scikit-learn integration#
If you want to reproduce the example pipeline, install the Neptune–scikit-learn integration (and, if needed, Neptune itself).
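For example, with pip (assuming a standard Python environment):

```shell
# Installs the Neptune client and the scikit-learn integration
pip install -U neptune neptune-sklearn
```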
Preparation: Exporting a custom run ID#
By exporting a custom run ID as an environment variable, you ensure that every time a run is created by the pipeline, it uses the same ID. This way, the same run is initialized in each step, instead of a new run being created each time.
For example, you can include the following in your .sh script, before the model training scripts are executed:
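A minimal sketch of such a script (the step script names are placeholders):

```shell
# Generate one ID per pipeline execution; any unique string works.
# Neptune picks up the NEPTUNE_CUSTOM_RUN_ID environment variable automatically.
export NEPTUNE_CUSTOM_RUN_ID=$(date +%Y%m%d%H%M%S)

# Then run the step scripts; each one connects to the same run:
#   python data_preprocessing.py
#   python model_training.py
#   python model_validation.py
```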
Use the full project name, in the form workspace-name/project-name. To copy it, navigate to the project settings in the top-right menu and select Edit project details.
For the full utils.py, see the example scripts in the neptune-ai/examples repository.
We'll break down one step (script) of the pipeline. The other steps follow the same idea.
In the first step, we download and process some data.
Next, we create a new Neptune run. To organize monitoring metrics per step, we also set a custom name for the monitoring namespace.
If needed, you can pass the custom run ID and monitoring namespace as arguments when initializing Neptune:
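A minimal sketch of the run initialization, assuming the NEPTUNE_CUSTOM_RUN_ID, NEPTUNE_PROJECT, and NEPTUNE_API_TOKEN environment variables are set (the monitoring namespace name mirrors the step):

```python
import neptune

# Connects to the existing run if a run with this custom ID already exists;
# otherwise, a new run with that ID is created.
run = neptune.init_run(
    monitoring_namespace="monitoring/preprocessing",  # per-step monitoring metrics
)
```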
Specify the dataset configuration.
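For instance, a configuration dictionary with the fields the preprocessing step uses (all values here are illustrative):

```python
# Hypothetical dataset configuration; adjust to your own data.
dataset_config = {
    "n_samples": 1288,
    "target_names": ["class_a", "class_b"],  # illustrative class names
    "n_classes": 2,
    "height": 47,
    "width": 62,
}
```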
Next, we set up a "preprocessing" namespace inside the run. This will be the base namespace where all the preprocessing metadata is logged.
From here on, we'll use the namespace handler for logging. The other steps of the pipeline will use their own namespaces, so that the metadata from each step is separate but still logged inside the same run.
First, we log the dataset details we specified earlier.
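A sketch, assuming `run` is the Run object initialized above and `dataset_config` is the configuration dictionary (the field path is an assumption):

```python
# Namespace handler: everything logged through it lands under "preprocessing/".
preprocessing_handler = run["preprocessing"]

# Logged under "preprocessing/dataset/config" inside the run.
preprocessing_handler["dataset/config"] = dataset_config
```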
Next, we preprocess the dataset:
```python
dataset_transform = Preprocessing(
    dataset,
    dataset_config["n_samples"],
    dataset_config["target_names"],
    dataset_config["n_classes"],
    (dataset_config["height"], dataset_config["width"]),
)
path_to_scaler = dataset_transform.scale_data()
path_to_features = dataset_transform.create_and_save_features(data_filename="features")
dataset_transform.describe()
```
Finally, we log scaler and features files:
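For example, using the namespace handler's file upload (the field names below are assumptions):

```python
# Upload the saved scaler and features files to the "preprocessing" namespace.
preprocessing_handler["dataset/scaler"].upload(path_to_scaler)
preprocessing_handler["dataset/features"].upload(path_to_features)
```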
We're ready to run the script.
If Neptune can't find your project name or API token
As a best practice, you should save your Neptune API token and project name as environment variables:
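For example (the values below are placeholders):

```shell
# Substitute your own credentials.
export NEPTUNE_API_TOKEN="your-api-token"
export NEPTUNE_PROJECT="workspace-name/project-name"
```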
You can, however, also pass them as arguments when initializing Neptune:
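A sketch with placeholder values:

```python
import neptune

# Substitute your own project name and API token.
run = neptune.init_run(
    project="workspace-name/project-name",
    api_token="YourNeptuneApiToken",
)
```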
- API token: In the bottom-left corner, expand the user menu and select Get my API token.
- Project name: In the top-right menu, select Edit project details.
If you haven't registered, you can also log anonymously to a public project (make sure not to publish sensitive data through your code!):
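A sketch using the anonymous API token constant (the project name below is a hypothetical public project; substitute one you can write to):

```python
import neptune

# Anonymous logging: no account needed.
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common-workspace/public-project",  # hypothetical
)
```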
The next step in our example pipeline is the training stage.
For all the imports, see the full code example on GitHub.
We again initialize a run object with a custom monitoring namespace. Because we have exported a custom run ID, that identifier will be used to connect to the same run that was created in the previous step.
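A sketch, assuming the same environment variables as in the preprocessing step are still set:

```python
import neptune

# NEPTUNE_CUSTOM_RUN_ID is still exported, so this connects to the run
# created by the preprocessing step instead of starting a new one.
run = neptune.init_run(
    monitoring_namespace="monitoring/training",
)
```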
We can access and fetch the features from the preprocessing stage:
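For example, assuming the features file was uploaded under `preprocessing/dataset/features` (a hypothetical field path):

```python
# Downloads the uploaded features file to the current working directory.
run["preprocessing/dataset/features"].download()
```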
Next, we set up the "training" namespace inside the run. This way, the training metadata is logged to the same run as in the previous step, but organized under a different namespace.
We continue with the training as normal:
Logging metadata to the training namespace will look like this:
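For instance (all field names and values below are illustrative; substitute your own training loop):

```python
# Logged under "training/params" in the run.
training_handler["params"] = {"max_iter": 100, "solver": "lbfgs"}

# Series fields collect a sequence of values, one per append() call.
for loss in [0.8, 0.5, 0.3]:
    training_handler["metrics/loss"].append(loss)
```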
You can use this same pattern to assign any other kind of metadata to the namespace handler:
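A few hedged examples of other metadata types; the field names and the model file path are assumptions:

```python
training_handler["metrics/accuracy"] = 0.92                      # a single value
training_handler["model/estimator"].upload("model.pkl")          # a file
training_handler["notes"] = "trained on preprocessed features"   # free-form text
```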
You now have the metadata from your entire pipeline logged to the same run, organized by step.
```
monitoring       <-- System metrics, hardware consumption
|-- preprocessing    <-- Each step has its own monitoring namespace
|   |-- cpu
|   |-- memory
|   |-- ...
|-- training
|-- validation
preprocessing    <-- Base namespace for the preprocessing metadata
|-- <preprocessing metadata>
source_code      <-- Auto-generated source code namespace
sys              <-- Auto-generated system namespace, with basic info about the run
training         <-- Base namespace for the training metadata
|-- <training metadata>
validation       <-- Base namespace for the validation metadata
|-- <validation metadata>
```