How to monitor ML runs live: step-by-step guide¶
Introduction¶
This guide will show you how to:
Monitor training and evaluation metrics and losses live
Monitor hardware resources during training
By the end of it, you will monitor your metrics, losses, and hardware live in Neptune!
Before you start¶
Make sure you meet the following prerequisites before starting:
Have Python 3.x installed
Have TensorFlow 2.x with Keras installed
Note
You can run this how-to on Google Colab with zero setup: just click the Open in Colab button at the top of the page.
Step 1: Create a basic training script¶
As an example, I'll use a script that trains a Keras model on the MNIST dataset.
Note
You don’t have to use Keras to monitor your training runs live with Neptune; I am using it as an easy-to-follow example.
There are links to integrations with other ML frameworks and useful articles about monitoring in the text.
Create a file train.py and copy the script below.
train.py
import keras

PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}

mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
    keras.layers.Dropout(PARAMS['dropout']),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])

optimizer = keras.optimizers.SGD(lr=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'])
Run the training script to make sure that it works correctly:
python train.py
Step 2: Install psutil¶
To monitor hardware consumption in Neptune, you need to have psutil installed.
pip
pip install psutil
conda
conda install -c anaconda psutil
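As a quick sanity check that psutil is working, here is a minimal sketch (independent of Neptune) that reads the same kinds of hardware stats Neptune monitors for you:

```python
import psutil

# CPU utilization sampled over a short interval, as a percentage
cpu = psutil.cpu_percent(interval=0.1)

# System-wide memory usage
mem = psutil.virtual_memory()

print(f"CPU: {cpu:.1f}%  RAM: {mem.percent:.1f}% of {mem.total / 1e9:.1f} GB")
```

If this script prints sensible numbers, Neptune will be able to collect hardware metrics in the background.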
Step 3: Connect Neptune to your script¶
At the top of your script add:

import neptune

neptune.init(project_qualified_name='shared/onboarding',
             api_token='ANONYMOUS',
             )
You need to tell Neptune who you are and where you want to log things.
To do that, you specify:

project_qualified_name=USERNAME/PROJECT_NAME: your Neptune username and project name
api_token=YOUR_API_TOKEN: your Neptune API token
Note
If you configured your Neptune API token correctly, as described in Configure Neptune API token on your system, you can skip the api_token argument:
neptune.init(project_qualified_name='YOUR_USERNAME/YOUR_PROJECT_NAME')
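For your own projects, a common pattern (a sketch, not a Neptune-specific API) is to read the token from an environment variable so it never lands in your script or version control. The environment-variable name below is an assumption for illustration:

```python
import os

# Fall back to the anonymous token when NEPTUNE_API_TOKEN is not set
# (variable name chosen for illustration).
token = os.getenv('NEPTUNE_API_TOKEN', 'ANONYMOUS')

# You would then pass it along, e.g.:
# neptune.init(project_qualified_name='shared/onboarding', api_token=token)
```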
Step 4. Create an experiment¶
neptune.create_experiment(name='great-idea')
This opens a new “experiment” namespace in Neptune to which you can log various objects.
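If you also want your hyperparameters recorded alongside the metrics, the legacy neptune-client create_experiment accepts a params dict, so you can pass the PARAMS from Step 1. A sketch (trimmed dict; the Neptune call is shown commented since it needs a live connection):

```python
# Hyperparameters from Step 1 (trimmed for brevity)
PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005}

# Passing params records them with the experiment in the Neptune UI:
# neptune.create_experiment(name='great-idea', params=PARAMS)
```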
Step 5. Add logging for metrics and losses¶
To log a metric or loss to Neptune, use the neptune.log_metric method:
neptune.log_metric('loss', 0.26)
The first argument is the name of the log. You can have one or multiple log names (like ‘acc’, ‘f1_score’, ‘log-loss’, ‘test-acc’). The second argument is the value of the log.
Typically, during training there will be some sort of loop where those losses are logged.
You can simply call neptune.log_metric multiple times with the same log name to log a value at each step.
for i in range(epochs):
    ...
    neptune.log_metric('loss', loss)
    neptune.log_metric('metric', accuracy)
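If you log both per-batch and per-epoch values, log_metric also accepts an explicit x coordinate (log_metric(log_name, x, y)), so you control the chart's x-axis. A small sketch with a hypothetical global_step helper (not part of Neptune) that keeps the x-axis increasing across epochs; the Neptune call is commented since it needs a live experiment:

```python
def global_step(epoch, batch, batches_per_epoch):
    # Hypothetical helper: flatten (epoch, batch) into one increasing counter,
    # so per-batch points from different epochs don't overlap on the chart.
    return epoch * batches_per_epoch + batch

# Inside the training loop you would then log:
# neptune.log_metric('batch_loss',
#                    x=global_step(epoch, batch, n_batches),
#                    y=loss)
```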
Many frameworks, like Keras, let you create a callback that is executed inside the training loop.
Steps for Keras
Create a Neptune callback.
class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        for metric_name, metric_value in logs.items():
            neptune.log_metric(metric_name, metric_value)
Pass the callback to the model.fit() method:
model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])
Note
You don’t actually have to implement this callback yourself; you can use the callback that we created for Keras. It is one of the many integrations that Neptune has with ML frameworks.
Check our TensorFlow / Keras integration
Tip
You may want to read our article on monitoring ML/DL experiments:
Step 6. Run your script and see results in Neptune¶
Run the training script:
python train.py
If it worked correctly, you should see:

a link to the Neptune experiment; click on it to go to the app
metrics and losses in the Logs and Charts sections of the UI
hardware consumption and console logs in the Monitoring section of the UI
What’s next¶
Now that you know how to create experiments and log metrics, you can learn more:
Check our integrations with other frameworks
Other useful articles: