Monitor model training live

Introduction

You can log any ML model metadata to Neptune and see it live in the Neptune UI.

In this guide, you will learn how to do basic model monitoring of your training process. You will:

  • See the training progress by looking at learning curves for loss and accuracy

  • Monitor hardware consumption (GPU, CPU, and memory) during training

By the end of it, you will see your metrics, losses, and hardware consumption live in Neptune!

Before you start

Make sure you meet the following prerequisites before starting:

Step 1: Create a basic training script

As an example, I’ll use a script that trains a Keras model on the MNIST dataset.

You don’t have to use Keras to monitor your training runs live with Neptune. You can use it with any machine learning framework, any optimization framework, and any other code.

Create a file train.py and copy the script below.

train.py
from tensorflow import keras

# Hyperparameters for the example run
PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}

# Load MNIST and scale pixel values to the [0, 1] range
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Simple fully connected classifier
model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
    keras.layers.Dropout(PARAMS['dropout']),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])

# `learning_rate` replaces the deprecated `lr` argument
optimizer = keras.optimizers.SGD(learning_rate=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'])

Run training to make sure that it works correctly.

python train.py

Step 2: Install psutil

To monitor hardware consumption in Neptune, you need to have psutil installed.

pip
pip install psutil
conda
conda install -c anaconda psutil

This has been tested with psutil==5.8.0.
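
To get a feel for the kind of readings psutil exposes, the sketch below takes a single CPU and memory sample. It does not show how Neptune collects its monitoring data internally. The function name `hardware_snapshot` is made up for illustration, and the function returns `None` when psutil is not installed.

```python
import importlib.util


def hardware_snapshot():
    """Return one CPU/memory reading, or None if psutil is not installed."""
    if importlib.util.find_spec("psutil") is None:
        return None
    import psutil

    return {
        # CPU utilization in percent, measured over a 0.1 s window
        "cpu_percent": psutil.cpu_percent(interval=0.1),
        # Share of physical memory currently in use, in percent
        "memory_percent": psutil.virtual_memory().percent,
    }


print(hardware_snapshot())
```

If this prints `None`, psutil is not importable from the current environment and the installation step should be repeated.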

Step 3: Connect Neptune to your script

At the top of your script, add a snippet that connects it to Neptune:

import neptune.new as neptune

run = neptune.init(
    api_token='<YOUR_API_TOKEN>',
    project='<YOUR_PROJECT_NAME>',
)  # your credentials

You can use api_token='ANONYMOUS' and project='common/quickstarts' to explore without creating a Neptune account.

Executing this snippet gives you a link like this one: https://app.neptune.ai/o/common/org/quickstarts/e/QUI-28177, with common/quickstarts replaced by your_workspace/your_project_name and QUI-28177 replaced by your Run ID.
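
If you would rather not hard-code credentials in train.py, one option is to read them from environment variables. The sketch below assumes the variables are named `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT`, and falls back to the anonymous quickstart values mentioned above. The helper `neptune_credentials` is a hypothetical name, not part of the Neptune client.

```python
import os


def neptune_credentials(env=None):
    """Build keyword arguments for neptune.init() from environment variables.

    Falls back to the anonymous token and the public quickstart project
    when the variables are not set.
    """
    env = os.environ if env is None else env
    return {
        "api_token": env.get("NEPTUNE_API_TOKEN", "ANONYMOUS"),
        "project": env.get("NEPTUNE_PROJECT", "common/quickstarts"),
    }


creds = neptune_credentials()
# run = neptune.init(**creds)  # same call as in the snippet above
print(creds["project"])  # print only the project, never the token
```

Keeping the token out of the script also makes it safer to commit train.py to version control.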

Step 4: Add logging for metrics and losses

To log a metric or loss during training, use:

loss = ...
run["train/loss"].log(loss)

A few explanations here:

  • "train/loss" is the path, called a namespace, under which the values are stored in Neptune. You can log metrics, losses, and many other kinds of model-building metadata to a namespace.

  • You can have multiple namespaces, such as 'train/acc', 'val/f1_score', 'train/log-loss', and 'test/acc'.

  • The argument of the log() method is the actual value you want to log.

Typically, during training, you log those metrics in a loop after every iteration. Call run["train/loss"].log(loss) at each step.

for i in range(epochs):
    ...  # run one training step and compute loss and accuracy
    run["train/loss"].log(loss)
    run["train/acc"].log(accuracy)
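
To try the logging pattern without a Neptune connection, the sketch below uses a small local stand-in for the run object. `FakeRun` and `FakeSeries` are hypothetical classes written for this example only. They mimic the `run["path"].log(value)` call shape and keep the values in memory, and the loss and accuracy numbers are placeholders rather than real training results.

```python
from collections import defaultdict


class FakeSeries:
    """Collects logged values in a list (stand-in for a Neptune series)."""

    def __init__(self):
        self.values = []

    def log(self, value):
        self.values.append(value)


class FakeRun:
    """Hypothetical local stand-in for a Neptune run; nothing is sent anywhere."""

    def __init__(self):
        self._series = defaultdict(FakeSeries)

    def __getitem__(self, path):
        # Each namespace path gets its own series, created on first use
        return self._series[path]


run = FakeRun()
epochs = 3
for i in range(epochs):
    loss = 1.0 / (i + 1)   # placeholder value, not a real loss
    accuracy = i / epochs  # placeholder value, not a real accuracy
    run["train/loss"].log(loss)
    run["train/acc"].log(accuracy)

print(run["train/loss"].values)
```

Each call to log() appends one point to the series stored under that namespace, which is why calling it once per iteration produces a learning curve.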

Many frameworks, like Keras, let you create a callback that is executed inside of the training loop.

Now that you know all this, follow the steps below for Keras.

Steps for Keras

Create a Neptune callback.

class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Keras passes the epoch's metrics in `logs`; guard against None
        for metric_name, metric_value in (logs or {}).items():
            run["train/{}".format(metric_name)].log(metric_value)

Pass the callback to the model.fit() method:

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])
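
The callback above logs every metric under train/. If you also pass validation data to model.fit(), Keras reports the validation metrics with a val_ prefix (for example val_loss), and you may prefer to send those to a separate val/ namespace. The sketch below shows one possible mapping. The helper `namespace_for` is a hypothetical name, and the values in `logs` are made up for illustration.

```python
def namespace_for(metric_name):
    """Map a Keras log key to a namespace path such as 'train/loss' or 'val/loss'."""
    prefix = "val_"
    if metric_name.startswith(prefix):
        # Validation metric: drop the prefix and file it under val/
        return "val/" + metric_name[len(prefix):]
    return "train/" + metric_name


# Made-up example of the dict Keras passes to on_epoch_end
logs = {"loss": 0.31, "accuracy": 0.91, "val_loss": 0.35, "val_accuracy": 0.90}
for name, value in logs.items():
    print(namespace_for(name), value)
```

Inside on_epoch_end, the path returned by the helper could be used in place of the "train/{}" format string.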

You don’t actually have to implement this callback yourself. Instead, you can import our callback implementation from the TensorFlow / Keras integration.

Step 5: Stop logging

Once you are done logging, stop tracking the Run with the stop() method. This is required only when logging from a notebook environment. When logging from a script, Neptune automatically stops tracking once the script has finished.

run.stop()
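
In a notebook, it is easy to skip the stop() call when a cell fails halfway through. One way to guard against that is a small context manager that always calls stop() on exit. This is a sketch only: `stopping` and `DummyRun` are hypothetical names, and `DummyRun` stands in for a real Neptune run so the example works offline.

```python
from contextlib import contextmanager


@contextmanager
def stopping(run):
    """Yield the run and call run.stop() on exit, even if an error occurred."""
    try:
        yield run
    finally:
        run.stop()


class DummyRun:
    """Hypothetical stand-in that only records whether stop() was called."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


dummy = DummyRun()
try:
    with stopping(dummy) as active_run:
        raise RuntimeError("training failed")  # simulate a crash mid-training
except RuntimeError:
    pass

print(dummy.stopped)  # True: stop() ran despite the exception
```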

Step 6: Run your script and see results in Neptune

Run the training script.

python train.py

If it worked correctly, you should see:

  • a link to the Neptune run. Click it to open the app,

  • metrics and losses in the Logs and Charts sections of the UI,

  • hardware consumption and console logs in the Monitoring section of the UI.

What's next?