# Logging with Neptune in a sequential pipeline
This tutorial shows how to log all the metadata from a sequence of steps in an ML pipeline to the same run.
In this example, our pipeline consists of a handful of scripts. We want to track metadata from three of them. To do that:
- We'll set a custom run ID so that we can access the same run from each step.
- We'll use namespace handlers to create a namespace (folder) for each step, so that the metadata is organized by step inside the run.
We'll set up the steps and namespaces as follows:
- `data_preprocessing.py` → `run["preprocessing/..."]`
- `model_training.py` → `run["training/..."]`
- `model_validation.py` → `run["validation/..."]`
- Additionally, we provide an example script that does model promotion: `model_promotion.py`.
The metadata structure of the resulting run will be:
```
monitoring
|-- preprocessing
|   |-- cpu
|   |-- memory
|   |-- ...
|-- training
|-- validation
preprocessing
|-- <preprocessing metadata>
source_code
sys
training
|-- <training metadata>
validation
|-- <validation metadata>
```
See the example run in Neptune · See the full code example on GitHub
## Before you start
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- To follow this tutorial, make sure you have scikit-learn installed.
## Installing Neptune and the scikit-learn integration
If you want to reproduce the example pipeline, install the Neptune–scikit-learn integration (and, if needed, Neptune itself).
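For example, with pip (assuming the current PyPI package names for the client and the integration):

```bash
pip install -U neptune neptune-sklearn
```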
## Preparation: Exporting a custom run ID
By exporting a custom run ID as an environment variable, you ensure that every time a run is created by the pipeline, it uses the same ID. This way, the same run is initialized in each step, instead of a new run being created each time.
For example, you can have the following in your .sh script, before the pipeline scripts are executed:

```bash
export NEPTUNE_CUSTOM_RUN_ID=`date +"%Y%m%d%H%M%s%N" | md5sum`
export NEPTUNE_PROJECT="workspace-name/project-name" # (1)!
...
```
1. The full project name. For example, "ml-team/classification".
    - You can copy the name from the project details ( → Details & privacy).
    - You can also find a pre-filled project string in Experiments → Create a new run.
For the full `run_examples.sh`, `requirements.txt`, and `utils.py`, see the example scripts in the neptune-ai/examples repository.
## Example step
We'll break down one step (script) of the pipeline. The other steps follow the same idea.
In the first step, we download and process some data.
```python
import neptune
from sklearn.datasets import fetch_lfw_people
from utils import *

dataset = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
```
Next, we create a new Neptune run. To organize monitoring metrics per step, we also set a custom name for the monitoring namespace:
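A minimal sketch of this initialization (the `monitoring_namespace` argument is part of `init_run()`; the project and custom run ID are picked up from the environment variables exported earlier):

```python
run = neptune.init_run(
    monitoring_namespace="monitoring/preprocessing",  # per-step monitoring namespace
)
```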
Note
We recommend saving your API token and project name as environment variables. If needed, you can pass them as arguments when initializing Neptune, as shown under "If Neptune can't find your project name or API token" below.
Specify the dataset configuration:
```python
dataset_config = {
    "target_names": str(dataset.target_names.tolist()),
    "n_classes": dataset.target_names.shape[0],
    "n_samples": dataset.images.shape[0],
    "height": dataset.images.shape[1],
    "width": dataset.images.shape[2],
}
```
Next, we set up a "preprocessing"
namespace inside the run. This will be the base namespace where all the preprocessing metadata is logged.
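In Neptune, you obtain a namespace handler by indexing into the run. A minimal sketch (the `preprocessing_handler` name matches the upload calls later in this step):

```python
preprocessing_handler = run["preprocessing"]
```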
From here on, we'll use the namespace handler for logging. The other steps of the pipeline will use their own namespaces, so that the metadata from each step is separate but still logged inside the same run.
First, we log the dataset details we specified earlier:
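For example (the `dataset/config` field name is illustrative, not prescribed by the tutorial):

```python
preprocessing_handler["dataset/config"] = dataset_config
```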
Next, we preprocess the dataset:
```python
dataset_transform = Preprocessing(
    dataset,
    dataset_config["n_samples"],
    dataset_config["target_names"],
    dataset_config["n_classes"],
    (dataset_config["height"], dataset_config["width"]),
)
path_to_scaler = dataset_transform.scale_data()
path_to_features = dataset_transform.create_and_save_features(
    data_filename="features"
)
dataset_transform.describe()
```
Finally, we log the scaler and features files:
preprocessing_handler["dataset/scaler"].upload(path_to_scaler)
preprocessing_handler["dataset/features"].upload(path_to_features)
We're ready to run the script.
If Neptune can't find your project name or API token
As a best practice, you should save your Neptune API token and project name as environment variables:
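For example, on Linux or macOS (the token below is a placeholder, not a real token):

```bash
export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jv...Yh3Kb8"
export NEPTUNE_PROJECT="ml-team/classification"
```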
Alternatively, you can pass the information when using a function that takes `api_token` and `project` as arguments:
```python
run = neptune.init_run(
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jv...Yh3Kb8",  # (1)!
    project="ml-team/classification",  # (2)!
)
```
1. In the bottom-left corner, expand the user menu and select Get my API token.
2. You can copy the path from the project details ( → Details & privacy).
If you haven't registered, you can log anonymously to a public project:
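A sketch using Neptune's built-in anonymous token; the project name below is assumed to be a public example project:

```python
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # built-in anonymous token
    project="common/quickstarts",  # hypothetical public project name
)
```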
Make sure not to publish sensitive data through your code!
## Other steps
The next step in our example pipeline is the training stage.
Note
For all the imports, see the full code example on GitHub.
```python
import neptune
import neptune.integrations.sklearn as npt_utils
from utils import get_data_features

...
```
We again initialize a run object with a custom monitoring namespace. Because we have exported a custom run ID, that identifier will be used to connect to the same run that was created in the previous step.
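The initialization mirrors the preprocessing step, only with this step's own monitoring namespace (a sketch, same assumptions as before):

```python
run = neptune.init_run(
    monitoring_namespace="monitoring/training",
)
```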
We can access and fetch the features from the preprocessing stage:
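A sketch of fetching the uploaded file: the `download()` call is Neptune's API for file fields, while the filename and the `get_data_features()` usage are hypothetical (the helper comes from the tutorial's `utils` module):

```python
# Download the features file that the preprocessing step uploaded to the run
run["preprocessing/dataset/features"].download()

# Hypothetical loading call; see utils.py in the examples repository
X, y = get_data_features("features.npz")
```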
Next, we set up the "training"
namespace inside the run. This way, the training metadata is logged to the same run as in the previous step, but organized under a different namespace.
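As in the preprocessing step, a minimal sketch:

```python
training_handler = run["training"]
```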
We continue with the training as normal:
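A minimal training sketch, assuming `X` and `y` are the features and labels fetched above; the model choice and hyperparameters are placeholders rather than the tutorial's exact values:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hold out a test set from the features fetched from the preprocessing step
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = SVC(kernel="rbf", class_weight="balanced")
model.fit(X_train, y_train)
```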
Logging metadata to the training namespace will look like this:
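For example, using the scikit-learn integration imported as `npt_utils` (the `cls_summary` field name is illustrative):

```python
# Log a classifier summary (parameters, test predictions, charts) in one call
training_handler["cls_summary"] = npt_utils.create_classifier_summary(
    model, X_train, X_test, y_train, y_test
)
```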
You can use this same pattern to assign any other kind of metadata to the namespace handler:
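For instance (the field names are illustrative):

```python
training_handler["params"] = {"kernel": "rbf", "class_weight": "balanced"}
training_handler["scores/accuracy"] = model.score(X_test, y_test)
```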
Result
You have the metadata from your entire pipeline logged to the same run, organized by step.
```
monitoring                    <-- System metrics, hardware consumption
|-- preprocessing             <-- Each step has its own monitoring namespace
|   |-- cpu
|   |-- memory
|   |-- ...
|-- training
|-- validation
preprocessing                 <-- Base namespace for the preprocessing metadata
|-- <preprocessing metadata>
source_code                   <-- Auto-generated source code namespace
sys                           <-- Auto-generated system namespace, with basic info about the run
training                      <-- Base namespace for the training metadata
|-- <training metadata>
validation                    <-- Base namespace for the validation metadata
|-- <validation metadata>
```