How to monitor ML runs live: a step-by-step guide

Introduction

This guide will show you how to:

  • Monitor training and evaluation metrics and losses live

  • Monitor hardware resources during training

By the end of it, you will monitor your metrics, losses, and hardware live in Neptune!

Before you start

Make sure you meet the following prerequisites before starting: you have Python installed, along with the neptune client and keras libraries (the example script imports both).
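If either library is missing, a typical installation (a sketch assuming pip and the legacy neptune-client package, which provides the API used throughout this guide) looks like this:

pip install neptune-client keras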

Note

You can run this how-to on Google Colab with zero setup.

Just click the Open in Colab button at the top of the page.

Step 1: Create a basic training script

As an example, I'll use a script that trains a Keras model on the MNIST dataset.

Note

You don't have to use Keras to monitor your training runs live with Neptune.

I am using it as an easy-to-follow example.

Throughout the text, you will find links to Neptune's integrations with other ML frameworks and to useful articles about monitoring.

  1. Create a file train.py and copy the script below.

train.py

import keras

# hyperparameters
PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}

# load MNIST and scale pixel values to [0, 1]
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# a small fully connected classifier
model = keras.models.Sequential([
  keras.layers.Flatten(),
  keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
  keras.layers.Dropout(PARAMS['dropout']),
  keras.layers.Dense(10, activation=keras.activations.softmax)
])

# plain SGD with Nesterov momentum
optimizer = keras.optimizers.SGD(learning_rate=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'])

  2. Run the training to make sure that it works correctly.

python train.py

Step 2: Install psutil

To monitor hardware consumption in Neptune, you need to have psutil installed.

pip

pip install psutil

conda

conda install -c anaconda psutil
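To sanity-check that psutil works in your environment, you can query a couple of readings yourself. This is only a quick check; once psutil is installed, Neptune gathers hardware metrics automatically:

import psutil

# CPU utilization over a 1-second window, in percent
print(psutil.cpu_percent(interval=1))

# fraction of RAM currently in use, in percent
print(psutil.virtual_memory().percent)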

Step 3: Connect Neptune to your script

At the top of your script, add:

import neptune

neptune.init(project_qualified_name='shared/onboarding',
             api_token='ANONYMOUS',
             )

You need to tell Neptune who you are and where you want to log things.

To do that, you specify:

  • project_qualified_name=USERNAME/PROJECT_NAME: your Neptune username and project name

  • api_token=YOUR_API_TOKEN: your Neptune API token

Note

If you configured your Neptune API token correctly, as described in Configure Neptune API token on your system, you can skip the api_token argument:

neptune.init(project_qualified_name='YOUR_USERNAME/YOUR_PROJECT_NAME')
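A common way to configure the token, assuming a bash-like shell, is to export it as the NEPTUNE_API_TOKEN environment variable, which the client picks up automatically:

export NEPTUNE_API_TOKEN='YOUR_API_TOKEN'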

Step 4: Create an experiment

neptune.create_experiment(name='great-idea')

This opens a new “experiment” namespace in Neptune to which you can log various objects.
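create_experiment also accepts optional arguments. For example, to record the hyperparameters from Step 1 alongside your metrics, you can pass the PARAMS dictionary via the params argument (a minimal sketch; check the argument list for your client version):

neptune.create_experiment(name='great-idea',
                          params=PARAMS)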

Step 5: Add logging for metrics and losses

To log a metric or loss to Neptune, use the neptune.log_metric method:

neptune.log_metric('loss', 0.26)

The first argument is the name of the log; you can have one or many log names (such as ‘acc’, ‘f1_score’, ‘log-loss’, ‘test-acc’). The second argument is the value to log.

Typically, training involves some sort of loop in which those losses are logged. You can simply call neptune.log_metric multiple times with the same log name to log a value at each step.

for i in range(epochs):
    ...
    neptune.log_metric('loss', loss)
    neptune.log_metric('metric', accuracy)
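By default, each call appends the value at the next consecutive step. If you want to control the x-axis yourself, for example to plot against the epoch number, log_metric also accepts explicit x and y arguments in this client version (a sketch; verify against your installed version):

for epoch in range(epochs):
    ...
    neptune.log_metric('loss', x=epoch, y=loss)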

Many frameworks, like Keras, let you create a callback that is executed inside the training loop.

Now that you know all this, here is how to put it together for Keras.

Steps for Keras

  1. Create a Neptune callback.

class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # send every metric that Keras collected this epoch to Neptune
        for metric_name, metric_value in (logs or {}).items():
            neptune.log_metric(metric_name, metric_value)

  2. Pass the callback to the model.fit() method:

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])

Note

You don't actually have to implement this callback yourself; you can use the callback that we created for Keras. It is one of Neptune's many integrations with ML frameworks.
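For reference, switching to the ready-made callback is a one-line swap. The sketch below assumes the neptune-contrib package (pip install neptune-contrib), where the Keras integration lived for this client version:

from neptunecontrib.monitoring.keras import NeptuneMonitor

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])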

Tip

You may want to read our article on monitoring ML/DL experiments.

Step 6: Run your script and see results in Neptune

Run the training script:

python train.py

If it worked correctly, you should see:

  • a link to the Neptune experiment; click it to open the app

  • metrics and losses in the Logs and Charts sections of the UI

  • hardware consumption and console logs in the Monitoring section of the UI

Full script

import keras
import neptune

# set project
neptune.init(api_token='ANONYMOUS',
             project_qualified_name='shared/onboarding')

# parameters
PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}

# start experiment
neptune.create_experiment(name='great-idea')

class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # send every metric that Keras collected this epoch to Neptune
        for metric_name, metric_value in (logs or {}).items():
            neptune.log_metric(metric_name, metric_value)

# load MNIST and scale pixel values to [0, 1]
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# a small fully connected classifier
model = keras.models.Sequential([
  keras.layers.Flatten(),
  keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
  keras.layers.Dropout(PARAMS['dropout']),
  keras.layers.Dense(10, activation=keras.activations.softmax)
])

# plain SGD with Nesterov momentum
optimizer = keras.optimizers.SGD(learning_rate=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])