Neptune-PyTorch Lightning Integration

What will you get?

PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research. With Neptune integration you can:

  • monitor experiments as they are running,

  • log training, validation, and testing metrics, and visualize them in the Neptune UI,

  • log experiment parameters,

  • monitor hardware usage,

  • log any additional metrics of your choice,

  • log performance charts and images,

  • save model checkpoints.

Where to start?

To get started with this integration, follow the Quickstart below. You can also skip the basics and take a look at the advanced options.

If you want to try things out and focus only on the code you can either:

  1. Open Colab notebook (badge-link below) with quickstart code and run it as a “neptuner” user - zero setup, it just works,

  2. View quickstart code as a plain Python script on GitHub.

You can also check this public project with example experiments: PyTorch Lightning integration.

Note

This integration is tested with pytorch-lightning==1.0.0 and current latest, and neptune-client==0.4.123 and current latest.

Quickstart

This quickstart will show you how to log PyTorch Lightning experiments to Neptune using NeptuneLogger (part of the pytorch-lightning library).

As a result you will have an experiment logged to Neptune. It will have train loss and epoch (visualized as charts), parameters, hardware utilization charts and experiment metadata.

Before you start

You have Python 3.x and the following libraries installed: torch, torchvision, pytorch-lightning, and neptune-client.
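
If any of these are missing, you can install them with pip. The command below is only a sketch; it installs the latest versions, which may differ from the tested versions listed in the Note above.

pip install torch torchvision pytorch-lightning neptune-client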

You also need minimal familiarity with PyTorch Lightning. Have a look at the “Lightning in 2 steps” guide to get started.

Step 1: Import Libraries

Import necessary libraries.

import os

import torch
from torch.nn import functional as F
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision import transforms

import pytorch_lightning as pl

Notice the pytorch_lightning import at the bottom.

Step 2: Define Hyper-Parameters

Define Python dictionary with hyper-parameters for model training.

PARAMS = {'max_epochs': 3,
          'learning_rate': 0.005,
          'batch_size': 32}

This dictionary will later be passed to the Neptune logger (you will see how in Step 4), so that the hyper-parameters appear in the experiment's Parameters tab.

Step 3: Define LightningModule and DataLoader

Implement minimal example of the pl.LightningModule and simple DataLoader.

# pl.LightningModule
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=PARAMS['learning_rate'])

# DataLoader
train_loader = DataLoader(MNIST(os.getcwd(), download=True, transform=transforms.ToTensor()),
                          batch_size=PARAMS['batch_size'])

A few explanations:

  • Cross-entropy loss logging is defined in the training_step method in this way:

self.log('train_loss', loss)

This loss will be logged to Neptune during training as train_loss. You will see it in the experiment's Charts tab (as the “train_loss” chart) and in the Logs tab (as raw numeric values).

  • DataLoader is a pure PyTorch object.

  • Notice that you pass learning_rate and batch_size from the PARAMS dictionary - all params will be logged as experiment parameters.

Step 4: Create NeptuneLogger

Instantiate NeptuneLogger with necessary parameters.

from pytorch_lightning.loggers.neptune import NeptuneLogger

neptune_logger = NeptuneLogger(
    api_key="ANONYMOUS",
    project_name="shared/pytorch-lightning-integration",
    params=PARAMS)

NeptuneLogger is an object that integrates Neptune with PyTorch Lightning, allowing you to track experiments. It is part of the lightning library. In this minimal example we use the public user “neptuner”, whose public token is “ANONYMOUS”.

Tip

You can also use your API token. Read more about how to securely set Neptune API token.
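
For example, here is a minimal sketch of using your own token read from an environment variable (the variable name and the project name below are placeholders):

import os

neptune_logger = NeptuneLogger(
    api_key=os.getenv('NEPTUNE_API_TOKEN'),
    project_name='your-workspace/your-project',
    params=PARAMS)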

Step 5: Pass NeptuneLogger to the Trainer

Pass instantiated NeptuneLogger to the pl.Trainer.

trainer = pl.Trainer(max_epochs=PARAMS['max_epochs'],
                     logger=neptune_logger)

Simply pass neptune_logger to the Trainer, so that Lightning will use this logger. Notice that max_epochs comes from the PARAMS dictionary.

Step 6: Run experiment

Fit model to the data.

model = LitModel()

trainer.fit(model, train_loader)

At this point you are all set to fit the model. The Neptune logger will collect metrics and show them in the UI.

Explore Results

You just learned how to start logging PyTorch Lightning experiments to Neptune using NeptuneLogger, which is part of the lightning library.

The training above is logged to Neptune in near real-time. Click the link printed to the console, or go here, to explore an experiment similar to yours. In particular, check:

  1. metrics,

  2. logged parameters,

  3. hardware usage statistics,

  4. experiment metadata, including git summary info.

Check this experiment here or view quickstart code as a plain Python script on GitHub.

PyTorchLightning neptune.ai integration

Advanced options

To learn more about the advanced options that the Neptune logger offers, follow the sections below; each describes one feature.

If you want to try things out and focus only on the code you can either:

  1. Open Colab notebook (badge-link below) and run advanced example as a “neptuner” user - zero setup, it just works,

  2. View advanced example code as a plain Python script on GitHub.

You can also check this public project with example experiments: PyTorch Lightning integration.

Before you start

In addition to the contents of the “Before you start” section in Quickstart, you also need to have scikit-learn and scikit-plot installed.

pip install scikit-learn==0.23.2 scikit-plot==0.3.7

Check the scikit-learn installation guide or the scikit-plot GitHub project for more info.

Advanced NeptuneLogger options

Create NeptuneLogger with advanced parameters.

from pytorch_lightning.loggers.neptune import NeptuneLogger

ALL_PARAMS = {...}

neptune_logger = NeptuneLogger(
    api_key="ANONYMOUS",
    project_name="shared/pytorch-lightning-integration",
    close_after_fit=False,
    experiment_name="train-on-MNIST",
    params=ALL_PARAMS,
    tags=['1.x', 'advanced'],
)

In the NeptuneLogger, besides the required api_key and project_name, you can specify other options, notably:

  • params - passed as a Python dict; see example experiment parameters.

  • experiment_name and tags - you will use them later in the UI for experiment searching and filtering.

  • close_after_fit=False - lets you log more data after the Trainer.fit() and Trainer.test() calls.

Tip

Use neptune_logger.experiment.ABC to call methods that you would use when working with the neptune client directly, for example:

  • neptune_logger.experiment.log_metric

  • neptune_logger.experiment.log_image

  • neptune_logger.experiment.set_property

Check more methods here: experiment methods.
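
For example, because close_after_fit=False keeps the experiment open after Trainer.fit() and Trainer.test() have finished, you can log extra data at the end of your script. A minimal sketch (the metric and property names below are illustrative):

# the experiment is still open here thanks to close_after_fit=False
neptune_logger.experiment.log_metric('final_accuracy', 0.92)          # illustrative value
neptune_logger.experiment.set_property('data_version', 'MNIST-v1')    # illustrative property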

Log loss during train, validation and test

In the pl.LightningModule, implement loss logging for train, validation, and test.

class LitModel(pl.LightningModule):
    (...)

    def training_step(self, batch, batch_idx):
        (...)
        loss = ...
        self.log('train_loss', loss, prog_bar=False)

    def validation_step(self, batch, batch_idx):
        (...)
        loss = ...
        self.log('val_loss', loss, prog_bar=False)

    def test_step(self, batch, batch_idx):
        (...)
        loss = ...
        self.log('test_loss', loss, prog_bar=False)

Loss values will be tracked in Neptune automatically.

Tip

The Trainer parameter log_every_n_steps controls how frequently metrics are logged. Keep this parameter relatively high, say >100, for longer experiments.
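
For example, a minimal sketch (the value 100 is just an illustrative choice):

trainer = pl.Trainer(logger=neptune_logger,
                     log_every_n_steps=100)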

PyTorch Lightning train and validation loss

Log accuracy score after train, validation and test epoch

In the pl.LightningModule, compute the accuracy score and log it.

from sklearn.metrics import accuracy_score

class LitModel(pl.LightningModule):
    (...)

    def training_epoch_end(self, outputs):
        for output in outputs:
            (...)
        acc = accuracy_score(y_true, y_pred)
        self.log('train_acc', acc)

    def validation_epoch_end(self, outputs):
        for output in outputs:
            (...)
        acc = accuracy_score(y_true, y_pred)
        self.log('val_acc', acc)

    def test_epoch_end(self, outputs):
        for output in outputs:
            (...)
        acc = accuracy_score(y_true, y_pred)
        self.log('test_acc', acc)

Accuracy score will be calculated and logged after every train, validation and test epoch.
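
The (...) parts above elide how y_true and y_pred are collected. One possible approach (a sketch, not necessarily the exact code from the example project) is to return predictions from each step and aggregate them in the *_epoch_end hook:

class LitModel(pl.LightningModule):
    (...)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)
        loss = F.cross_entropy(y_hat, y)
        self.log('val_loss', loss)
        return {'loss': loss, 'y_true': y, 'y_pred': y_hat.argmax(dim=1)}

    def validation_epoch_end(self, outputs):
        # concatenate the per-batch labels and predictions returned by validation_step
        y_true = torch.cat([o['y_true'] for o in outputs]).cpu().numpy()
        y_pred = torch.cat([o['y_pred'] for o in outputs]).cpu().numpy()
        self.log('val_acc', accuracy_score(y_true, y_pred))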

PyTorch Lightning train and validation acc

Tip

You can find the full implementation of all metrics logging in the example on GitHub.

Log learning rate changes

Implement a learning rate monitor as a Callback.

from torch.optim.lr_scheduler import LambdaLR
from pytorch_lightning.callbacks import LearningRateMonitor

# Add scheduler to the optimizer
class LitModel(pl.LightningModule):
    (...)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        scheduler = LambdaLR(optimizer, lambda epoch: self.decay_factor ** epoch)
        return [optimizer], [scheduler]

# Instantiate LearningRateMonitor Callback
lr_logger = LearningRateMonitor(logging_interval='epoch')

# Pass lr_logger to the pl.Trainer as callback
trainer = pl.Trainer(logger=neptune_logger,
                     callbacks=[lr_logger])

The learning rate scheduler is defined in configure_optimizers. It changes the lr value after each epoch, and these values are logged to Neptune automatically.

PyTorch Lightning lr-Adam chart

Log misclassified images for the test set

In the pl.LightningModule implement logic for identifying and logging misclassified images.

import numpy as np

class LitModel(pl.LightningModule):
    (...)

    def test_step(self, batch, batch_idx):
        x, y = batch
        (...)
        y_true = ...
        y_pred = ...
        for j in np.where(np.not_equal(y_true, y_pred))[0]:
            img = np.squeeze(x[j].cpu().detach().numpy())
            img[img < 0] = 0
            img = (img / img.max()) * 256
            neptune_logger.experiment.log_image(
                'test_misclassified_images',
                img,
                description='y_pred={}, y_true={}'.format(y_pred[j], y_true[j]))

  • As a result, you will automatically log misclassified images to Neptune during testing.

  • Take a look at these misclassified images - look for the 'test_misclassified_images' tile.

PyTorch Lightning misclassified images

Log gradient norm

Set pl.Trainer to log gradient norm.

trainer = pl.Trainer(logger=neptune_logger,
                     track_grad_norm=2)

Neptune will visualize gradient norm automatically.

Tip

When you use track_grad_norm, it is recommended to also set log_every_n_steps to something >100, so that you avoid logging a large amount of data.
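
For example, a sketch combining both settings (the value 100 is illustrative):

trainer = pl.Trainer(logger=neptune_logger,
                     track_grad_norm=2,
                     log_every_n_steps=100)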

PyTorch Lightning gradient norm

Log model checkpoints

Use ModelCheckpoint to make checkpoints during training, then log the saved checkpoints to Neptune.

from pytorch_lightning.callbacks import ModelCheckpoint

# Instantiate ModelCheckpoint
model_checkpoint = ModelCheckpoint(filepath='my_model/checkpoints/{epoch:02d}-{val_loss:.2f}',
                                   save_weights_only=True,
                                   save_top_k=3,
                                   monitor='val_loss',
                                   period=1)

# Pass it to the pl.Trainer
trainer = pl.Trainer(logger=neptune_logger,
                     checkpoint_callback=model_checkpoint)

# Log model checkpoint to Neptune
for k in model_checkpoint.best_k_models.keys():
    model_name = 'checkpoints/' + k.split('/')[-1]
    neptune_logger.experiment.log_artifact(k, model_name)

# Log score of the best model checkpoint.
neptune_logger.experiment.set_property('best_model_score', model_checkpoint.best_model_score.tolist())

  • model_checkpoint will keep the top three models according to the 'val_loss' metric.

  • When training and testing are done, simply upload the model checkpoints to Neptune to keep them with the experiment.

  • The score of the best model checkpoint is shown in the Details tab.

PyTorch Lightning model checkpoint

Tip

You can find the full example implementation on GitHub.

Log confusion matrix

Log a confusion matrix after testing.

import numpy as np
import matplotlib.pyplot as plt
from scikitplot.metrics import plot_confusion_matrix

model.freeze()
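# note: `dm` below is assumed to be a LightningDataModule that provides the test DataLoader (from the advanced example)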
test_data = dm.test_dataloader()
y_true = np.array([])
y_pred = np.array([])

for i, (x, y) in enumerate(test_data):
    y = y.cpu().detach().numpy()
    y_hat = model.forward(x).argmax(axis=1).cpu().detach().numpy()

    y_true = np.append(y_true, y)
    y_pred = np.append(y_pred, y_hat)

fig, ax = plt.subplots(figsize=(16, 12))
plot_confusion_matrix(y_true, y_pred, ax=ax)
neptune_logger.experiment.log_image('confusion_matrix', fig)

PyTorch Lightning confusion matrix

Log auxiliary info

Log model summary and number of GPUs used in the experiment.

# Log model summary
for chunk in str(model).split('\n'):
    neptune_logger.experiment.log_text('model_summary', chunk)

# Log number of GPU units used
neptune_logger.experiment.set_property('num_gpus', trainer.num_gpus)

  • You will find the model summary in the Logs tab and num_gpus in the Details tab.

  • In a similar way, you can log any other information that you feel is relevant to your experimentation; see the sketch below.
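
A couple of hedged examples (the property and log names are illustrative, not part of the example project):

# optimizer name as an experiment property
neptune_logger.experiment.set_property('optimizer', 'Adam')

# free-form notes about the run
neptune_logger.experiment.log_text('notes', 'baseline run on MNIST')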


Stop Neptune logger (Notebooks only)

Close Neptune logger and experiment once everything is logged.

neptune_logger.experiment.stop()

NeptuneLogger was created with close_after_fit=False, so you need to close the Neptune experiment explicitly at the end. Again, this is only needed in notebooks; in scripts the logger is closed automatically at the end of the script execution.

Explore Results

You just learned how to log PyTorch Lightning experiments to Neptune using NeptuneLogger, which is part of the lightning library.

The training above is logged to Neptune in near real-time. Click the link printed to the console to explore an experiment similar to yours.

In particular, check the metrics (losses, accuracy, learning rate), misclassified images, model checkpoints, confusion matrix, and model summary.

Check this experiment (charts) or view the above code snippets as a plain Python script on GitHub.

Common problems

This integration is tested with pytorch-lightning==1.0.0 and current latest, and neptune-client==0.4.123 and current latest. Make sure that you use correct versions.

How to ask for help?

Please visit the Getting help page. Everything regarding support is there.

Other integrations you may like

Here are other integrations with libraries from the PyTorch ecosystem:

  1. PyTorch

  2. PyTorch Ignite

  3. Catalyst

  4. skorch

You may also like these two integrations:

  1. optuna

  2. plotly