This guide will show you how to:
Monitor training and evaluation metrics and losses live
Monitor hardware resources during training
By the end of it, you will monitor your metrics, losses, and hardware live in Neptune!
Make sure you meet the following prerequisites before starting:
Have Python 3.x installed
Have TensorFlow 2.x with Keras installed
As an example, I’ll use a script that trains a Keras model on the MNIST dataset.
You don’t have to use Keras to monitor your training runs live with Neptune; I’m using it here as an easy-to-follow example.
There are links to integrations with other ML frameworks and useful articles about monitoring in the text.
Create a file train.py and copy the script below.
train.py

from tensorflow import keras

PARAMS = {'epoch_nr': 100,
          'batch_size': 256,
          'lr': 0.005,
          'momentum': 0.4,
          'use_nesterov': True,
          'unit_nr': 256,
          'dropout': 0.05}

mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = keras.models.Sequential([
    keras.layers.Flatten(),
    keras.layers.Dense(PARAMS['unit_nr'], activation=keras.activations.relu),
    keras.layers.Dropout(PARAMS['dropout']),
    keras.layers.Dense(10, activation=keras.activations.softmax)
])

optimizer = keras.optimizers.SGD(learning_rate=PARAMS['lr'],
                                 momentum=PARAMS['momentum'],
                                 nesterov=PARAMS['use_nesterov'])

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'])
Run training to make sure that it works correctly.
python train.py
To monitor hardware consumption in Neptune, you need to have psutil installed:
pip install psutil
conda install -c anaconda psutil
At the top of your script add
import neptune.new as neptune

run = neptune.init(project='common/quickstarts',
                   api_token='ANONYMOUS')
This opens a new “run” in Neptune to which you can log various objects.
You need to tell Neptune who you are and where you want to log things. To do that, you specify:
project=my_workspace/my_project: your workspace name and project name,
api_token=YOUR_API_TOKEN: your Neptune API token.
If you configured your Neptune API token correctly, as described in this docs page, you can skip the api_token argument.
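For instance, a common setup is to export the token as the NEPTUNE_API_TOKEN environment variable so that it never appears in your script (YOUR_API_TOKEN below is a placeholder for the token you copy from the Neptune UI):

```shell
# Store the token in an environment variable instead of hard-coding it.
# Replace YOUR_API_TOKEN with the token copied from the Neptune UI.
export NEPTUNE_API_TOKEN="YOUR_API_TOKEN"
```

With the variable set, you can call neptune.init(project='my_workspace/my_project') without passing api_token.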
To log a metric or loss during training, use:
loss = ...
run["train/loss"].log(loss)
A few explanations:
"train/loss" is the name of the log; names can have a hierarchical structure.
"train/loss" is a series of values - you can log multiple values to this log.
You can have one or multiple log names (like 'train/acc', 'val/f1_score', 'train/log-loss', 'test/acc').
The argument of the log() method is the actual value you want to log.
Typically, during training there will be some sort of loop where those losses are logged. You can simply call run["train/loss"].log(loss) multiple times, at each step:
for i in range(epochs):
    ...
    run["train/loss"].log(loss)
    run["train/acc"].log(accuracy)
Many frameworks, like Keras, let you create a callback that is executed inside the training loop.
Now that you know all this, create a Neptune callback:
class NeptuneMonitor(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        for metric_name, metric_value in logs.items():
            run["train/{}".format(metric_name)].log(metric_value)
Pass the callback to the model.fit() method:
model.fit(x_train, y_train,
          epochs=PARAMS['epoch_nr'],
          batch_size=PARAMS['batch_size'],
          callbacks=[NeptuneMonitor()])
You don’t actually have to implement this callback yourself; you can use the callback that we created for Keras. It is one of many integrations with ML frameworks that Neptune has.
Check our TensorFlow / Keras integration.
You may want to read our article on monitoring ML/DL runs.
Run the training script:
python train.py
If it worked correctly, you should see:
a link to the Neptune run - click on it to go to the app,
metrics and losses in the Logs and Charts sections of the UI,
hardware consumption and console logs in the Monitoring section of the UI.
Now that you know how to create runs and log metrics, you can learn: