How to monitor ML runs live: step-by-step guide

Introduction

This guide will show you how to:

  • Monitor training and evaluation metrics and losses live

  • Monitor hardware resources during training

By the end of it, you will monitor your metrics, losses, and hardware live in Neptune!

Before you start

Make sure you meet the following prerequisites before starting:

Step 1: Create a basic training script

As an example, I’ll use a script that trains a Keras model on the MNIST dataset.

You don’t have to use Keras to monitor your training runs live with Neptune; it simply makes for an easy-to-follow example.

There are links to integrations with other ML frameworks and useful articles about monitoring in the text.

Create a file train.py and copy the script below.

train.py
from tensorflow import keras
PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}
mnist = keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
    keras.layers.Dropout(PARAMS['dropout']),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])
optimizer = keras.optimizers.SGD(learning_rate=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'])

Run the script to make sure that training works correctly.

python train.py

Step 2: Install psutil

To monitor hardware consumption in Neptune you need to have psutil installed.

pip
pip install psutil

conda
conda install -c anaconda psutil
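
Before moving on, it can help to confirm that psutil imports and returns sensible readings. The sketch below only illustrates the kind of values psutil exposes; how Neptune collects hardware metrics internally is not covered in this guide, and the function name hardware_snapshot is made up for this example.

```python
def hardware_snapshot():
    """Return current CPU and memory usage, or None if psutil is missing."""
    try:
        import psutil
    except ImportError:
        return None
    return {
        # CPU utilisation in percent, measured over a short blocking interval
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        # Share of physical memory currently in use, in percent
        "memory_percent": psutil.virtual_memory().percent,
    }


if __name__ == "__main__":
    print(hardware_snapshot())
```

If this prints None, psutil is not importable from the Python environment you are using, and the install command above should be rerun in that environment.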

Step 3: Connect Neptune to your script

At the top of your script, add:

import neptune.new as neptune

run = neptune.init(project='common/quickstarts',
                   api_token='ANONYMOUS')

This opens a new “run” in Neptune to which you can log various objects.

You need to tell Neptune who you are and where you want to log things. To do that you specify:

  • project='my_workspace/my_project': your workspace name and project name,

  • api_token='YOUR_API_TOKEN': your Neptune API token.

If you configured your Neptune API token correctly, as described in this docs page, you can skip the api_token argument.
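
One way to keep the token out of the script is to read it from environment variables. The sketch below assumes the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT variables are the ones you have set, and falls back to the anonymous quickstart values used above; the helper name neptune_init_kwargs is hypothetical.

```python
import os


def neptune_init_kwargs(env=None):
    """Build keyword arguments for neptune.init() from environment variables."""
    env = os.environ if env is None else env
    return {
        # Fall back to the public quickstart project from this guide
        "project": env.get("NEPTUNE_PROJECT", "common/quickstarts"),
        # Fall back to the anonymous token from this guide
        "api_token": env.get("NEPTUNE_API_TOKEN", "ANONYMOUS"),
    }


if __name__ == "__main__":
    kwargs = neptune_init_kwargs()
    # Show which project will be used without printing the token itself
    print(kwargs["project"])
```

In train.py this could be used as run = neptune.init(**neptune_init_kwargs()).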

Step 5: Add logging for metrics and losses

To log a metric or loss during training you should use:

loss = ...
run["train/loss"].log(loss)

A few explanations:

  • "train/loss" is the name of the log; the "/" gives it a hierarchical structure.

  • "train/loss" is a series of values: you can log multiple values to this log.

  • You can have one or multiple log names (like 'train/acc', 'val/f1_score', 'train/log-loss', 'test/acc').

  • The argument of the log() method is the actual value you want to log.
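
To make the naming scheme concrete, here is a toy stand-in that mimics the run["name"].log(value) call pattern with plain Python containers. It is not how Neptune stores data; InMemoryRun and Series are hypothetical names used only to show that each name holds an ordered series and that the "/" separator groups names into namespaces.

```python
from collections import defaultdict


class Series:
    """An ordered list of values logged under one name."""

    def __init__(self):
        self.values = []

    def log(self, value):
        self.values.append(value)


class InMemoryRun:
    """Toy stand-in for a run: maps log names to Series objects."""

    def __init__(self):
        self._series = defaultdict(Series)

    def __getitem__(self, name):
        return self._series[name]

    def namespaces(self):
        """Group log names by the part before the first '/'."""
        groups = defaultdict(list)
        for name in self._series:
            prefix, _, leaf = name.partition("/")
            groups[prefix].append(leaf)
        return dict(groups)


run = InMemoryRun()
for loss in (0.9, 0.5, 0.3):
    run["train/loss"].log(loss)   # three values appended to one series
run["train/acc"].log(0.81)
run["val/f1_score"].log(0.77)

print(run["train/loss"].values)   # [0.9, 0.5, 0.3]
print(run.namespaces())           # {'train': ['loss', 'acc'], 'val': ['f1_score']}
```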

Typically, training involves some sort of loop in which those losses are computed. You can simply call run["train/loss"].log(loss) once at each step; every call appends a new value to the series.

for i in range(epochs):
    ...
    run["train/loss"].log(loss)
    run["train/acc"].log(accuracy)
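
A complete, if tiny, version of such a loop might look like the following. The model is a one-parameter linear fit trained with plain gradient descent, chosen only so that the example runs without any ML framework; the log callable is a hypothetical hook that you would wire to run[name].log(value) in a real script.

```python
def train(log, epochs=20, lr=0.05):
    """Fit y = w * x by gradient descent and report the loss every epoch."""
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]  # generated with w = 2
    w = 0.0
    for _ in range(epochs):
        # Mean squared error and its gradient with respect to w
        errors = [w * x - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / len(xs)
        grad = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        # One logged value per epoch, computed before the weight update
        log("train/loss", loss)
        w -= lr * grad
    return w


# Collect logged values in a list instead of sending them anywhere
history = []
w = train(lambda name, value: history.append((name, value)))
print(len(history), round(w, 3))
```

With Neptune, the same function could be called as train(lambda name, value: run[name].log(value)), so the loop itself stays independent of the logging backend.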

Many frameworks, like Keras, let you create a callback that is executed inside of the training loop.

Now that you know all this, you can add logging to the example script.

Steps for Keras

Create a Neptune callback.

class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # logs defaults to None, so fall back to an empty dict
        for metric_name, metric_value in (logs or {}).items():
            run["train/{}".format(metric_name)].log(metric_value)

Pass callback to the model.fit() method:

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])
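
If you also pass validation_data to model.fit(), Keras adds validation metrics to logs under keys prefixed with val_ (for example val_loss), and the callback above would log them as train/val_loss. A small helper along these lines could route them to a separate namespace instead; the function name neptune_log_name is an assumption made for this sketch.

```python
def neptune_log_name(metric_name):
    """Map a Keras logs key to a hierarchical log name."""
    prefix = "val_"
    if metric_name.startswith(prefix):
        # Validation metrics go under "val/", with the prefix stripped
        return "val/" + metric_name[len(prefix):]
    # Everything else is treated as a training metric
    return "train/" + metric_name


# Example keys as they appear in the logs dict passed to on_epoch_end
for key in ("loss", "accuracy", "val_loss", "val_accuracy"):
    print(key, "->", neptune_log_name(key))
```

Inside the callback, the logging line would then read run[neptune_log_name(metric_name)].log(metric_value).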

You don’t actually have to implement this callback yourself: you can use the callback that we created for Keras. It is one of many integrations with ML frameworks that Neptune has.

You may want to read our article on monitoring ML/DL runs:

Step 6: Run your script and see results in Neptune

Run the training script.

python train.py

If it worked correctly, you should see:

  • a link to the Neptune run; click on it to go to the app,

  • metrics and losses in the Logs and Charts sections of the UI,

  • hardware consumption and console logs in the Monitoring section of the UI.

What's next?

Now that you know how to create runs and log metrics, you can learn:

Other useful articles