PyTorch integration guide#

Custom dashboard displaying metadata logged with PyTorch

Info

Neptune also integrates with several other libraries from the PyTorch ecosystem.

This guide walks you through tracking your model training metadata when using PyTorch. We'll use the NeptuneLogger class to:

  • Log training metrics
  • Upload model checkpoints
  • Log model predictions

Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Have PyTorch installed.
  • To follow the example, you'll also need to have torchvision, numpy, and torchviz installed.

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip
pip install -U neptune-pytorch
conda
conda install -c conda-forge neptune-pytorch

To install both Neptune and the integration:

pip
pip install -U "neptune[pytorch]"
conda
conda install -c conda-forge neptune neptune-pytorch
How do I save my credentials as environment variables?

Set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

For example:

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"
export NEPTUNE_PROJECT="ml-team/classification"

On Windows, the command is set instead of export.

Finding your credentials:

  • API token: In the bottom-left corner of the Neptune app, expand your user menu and select Get your API token.
  • Project: Your full project name has the form workspace-name/project-name. To copy the name, click the menu in the top-right corner and select Edit project details.

If you're working in Colab, you can set your credentials with the os and getpass libraries:

import os
from getpass import getpass
os.environ["NEPTUNE_API_TOKEN"] = getpass("Enter your Neptune API token: ")
os.environ["NEPTUNE_PROJECT"] = "workspace-name/project-name"
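
Before creating a run, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch using only the standard library. The helper names missing_neptune_vars and looks_like_full_project_name are hypothetical and not part of Neptune.

```python
import os

REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def looks_like_full_project_name(name):
    """Check for the workspace-name/project-name shape described above."""
    workspace, sep, project = name.partition("/")
    return bool(workspace and sep and project and "/" not in project)


# Example with an explicit mapping instead of the real environment.
example_env = {"NEPTUNE_PROJECT": "ml-team/classification"}
print(missing_neptune_vars(example_env))  # ['NEPTUNE_API_TOKEN']
print(looks_like_full_project_name(example_env["NEPTUNE_PROJECT"]))  # True
```

Calling missing_neptune_vars() with no argument inspects os.environ, so it can run at the top of a training script before any Neptune call.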

If you'd rather follow the guide without any setup, you can run the example in Colab.

Basic logging example#

Set up the model and training config#

  1. Import the needed libraries:

    import torch
    from torch import nn
    from torch import optim
    from torchvision import transforms, datasets
    import numpy as np
    
  2. Create a Neptune run:

    import neptune
    
    run = neptune.init_run()
    
    If you haven't set up your credentials, you can log anonymously:

      run = neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/pytorch-integration",
      )
      
  3. Define your hyperparameters:

    parameters = {
        "lr": 1e-2,
        "bs": 128,
        "input_sz": 32 * 32 * 3,
        "n_classes": 10,
        "model_filename": "basemodel",
        "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
        "epochs": 2,
    }
    
  4. Set up the model:

    class Model(nn.Module):
        def __init__(self, input_sz, hidden_dim, n_classes):
            super(Model, self).__init__()
            self.seq_model = nn.Sequential(
                nn.Linear(input_sz, hidden_dim * 2),
                nn.ReLU(),
                nn.Linear(hidden_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim // 2),
                nn.ReLU(),
                nn.Linear(hidden_dim // 2, n_classes),
            )
    
        def forward(self, input):
            x = input.view(-1, 32 * 32 * 3)
            return self.seq_model(x)
    
    
    model = Model(
        parameters["input_sz"], parameters["input_sz"], parameters["n_classes"]
    ).to(parameters["device"])
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=parameters["lr"])
    
  5. Download and transform the data for training:

    data_dir = "data/CIFAR10"
    compressed_ds = "./data/CIFAR10/cifar-10-python.tar.gz"
    data_tfms = {
        "train": transforms.Compose(
            [
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        ),
        "val": transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        ),
    }
    
    trainset = datasets.CIFAR10(
        data_dir, transform=data_tfms["train"], download=True
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=parameters["bs"], shuffle=True, num_workers=0
    )
    validset = datasets.CIFAR10(
        data_dir, train=False, transform=data_tfms["val"], download=True
    )
    validloader = torch.utils.data.DataLoader(
        validset, batch_size=parameters["bs"], num_workers=0
    )
    
    classes = [
        "airplane",
        "automobile",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    ]
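
Because step 4 passes parameters["input_sz"] as both input_sz and hidden_dim, the resulting network is fairly wide. As a rough size check, the sketch below tallies weights and biases for each nn.Linear layer in plain Python, without needing PyTorch. The helper names are made up for illustration, and the count assumes the default bias=True on every layer.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_layer_sizes(input_sz, hidden_dim, n_classes):
    """(in, out) pairs matching the four nn.Linear layers in Model."""
    return [
        (input_sz, hidden_dim * 2),
        (hidden_dim * 2, hidden_dim),
        (hidden_dim, hidden_dim // 2),
        (hidden_dim // 2, n_classes),
    ]


input_sz = 32 * 32 * 3  # 3072 values per CIFAR-10 image
layers = mlp_layer_sizes(input_sz, hidden_dim=input_sz, n_classes=10)
total_params = sum(linear_params(i, o) for i, o in layers)

for i, o in layers:
    print(f"{i:>5} -> {o:<5} {linear_params(i, o):>12,} parameters")
print(f"total: {total_params:,}")  # total: 42,493,450
```

Passing a smaller hidden_dim to Model would shrink this count considerably, which may matter when checkpoints are uploaded after every epoch.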
    

Add Neptune logging#

  1. Create a NeptuneLogger instance:

    from neptune_pytorch import NeptuneLogger
    
    npt_logger = NeptuneLogger(
        run=run,
        model=model,
        log_model_diagram=True,
        log_gradients=True,
        log_parameters=True,
        log_freq=30,
    )
    
  2. Log the hyperparameters from earlier:

    from neptune.utils import stringify_unsupported
    
    run[npt_logger.base_namespace]["hyperparams"] = stringify_unsupported(
        parameters
    )
    
    Using the logger's base_namespace attribute as the key keeps your own metadata under the same base namespace as everything the logger records.
  3. Log metrics while training.

    In this example, the metrics are logged under the "batch" namespace every 30 steps.

    for epoch in range(parameters["epochs"]):
        for i, (x, y) in enumerate(trainloader, 0):
            x, y = x.to(parameters["device"]), y.to(parameters["device"])
            optimizer.zero_grad()
            outputs = model(x)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, y)
            acc = torch.sum(preds == y) / len(x)
    
            # Log after every 30 steps
            if i % 30 == 0:
                run[npt_logger.base_namespace]["batch/loss"].append(loss.item())
                run[npt_logger.base_namespace]["batch/acc"].append(acc.item())
    
            loss.backward()
            optimizer.step()
    
        npt_logger.log_checkpoint()
    
    The checkpoint number is automatically incremented with each call:

      • Call 1 → ckpt_1.pt
      • Call 2 → ckpt_2.pt
  4. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
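
Step 2 wraps the hyperparameters in stringify_unsupported() because the dictionary contains a torch.device object rather than a plain number or string. The sketch below only imitates the idea of that helper: values of other types are replaced by their string form. It is not Neptune's implementation, the set of "supported" types is an assumption made for this example, and PurePosixPath stands in for torch.device so the snippet runs without PyTorch.

```python
from pathlib import PurePosixPath

# Assumption for this sketch only: these types can be stored as-is.
SUPPORTED_SCALARS = (bool, int, float, str)


def stringify_unsupported_sketch(value):
    """Recursively replace values of other types with their str() form."""
    if isinstance(value, dict):
        return {key: stringify_unsupported_sketch(item) for key, item in value.items()}
    if isinstance(value, SUPPORTED_SCALARS):
        return value
    return str(value)


params = {"lr": 1e-2, "bs": 128, "data_dir": PurePosixPath("data/CIFAR10")}
print(stringify_unsupported_sketch(params))
# {'lr': 0.01, 'bs': 128, 'data_dir': 'data/CIFAR10'}
```

In the real script, keep using the stringify_unsupported() import from neptune.utils shown in step 2.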
    

Run your script as you normally would. To open the run and explore the metrics, parameters, and predictions, click the Neptune link that appears in the console output.
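
When you open the batch/loss and batch/acc charts, it can help to know roughly how many points to expect. The following sketch works that out for the settings in this guide, assuming the 50,000-image CIFAR-10 training split and the DataLoader default of keeping the last partial batch. The function name logged_steps is hypothetical.

```python
import math


def logged_steps(n_samples, batch_size, log_every):
    """Batch indices i within one epoch where i % log_every == 0."""
    # The last, smaller batch is kept, so round the batch count up.
    n_batches = math.ceil(n_samples / batch_size)
    return [i for i in range(n_batches) if i % log_every == 0]


steps = logged_steps(n_samples=50_000, batch_size=128, log_every=30)
print(len(steps), steps[0], steps[-1])  # 14 0 390
```

With epochs set to 2, that comes to 28 appended values per metric over the whole run.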

Sample output

https://app.neptune.ai/workspace-name/project-name/e/RUN-100/metadata

The general format is https://app.neptune.ai/<workspace>/<project> followed by the Neptune ID of the initialized object.
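
To make the format concrete, here is a small sketch that assembles a run URL from its parts. The /e/ path segment is inferred from the sample output above rather than from a documented rule, and run_url is a hypothetical helper, not a Neptune function.

```python
def run_url(workspace, project, run_id, base="https://app.neptune.ai"):
    """Compose a run URL in the shape of the sample output above."""
    return f"{base}/{workspace}/{project}/e/{run_id}"


print(run_url("workspace-name", "project-name", "RUN-100"))
# https://app.neptune.ai/workspace-name/project-name/e/RUN-100
```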

More options#

Saving checkpoint per epoch#

You can save a model checkpoint at the end of each epoch by calling log_checkpoint() after the inner batch loop:

for epoch in range(parameters["epochs"]):
    for i, (x, y) in enumerate(trainloader, 0):
        ...

        # Log after every 30 steps
        if i % 30 == 0:
            run[npt_logger.base_namespace]["batch/loss"].append(loss.item())
            run[npt_logger.base_namespace]["batch/acc"].append(acc.item())

        loss.backward()
        optimizer.step()

    npt_logger.log_checkpoint()

The checkpoint number is automatically incremented with each call:

  • First call → ckpt_1.pt
  • Second call → ckpt_2.pt
  • And so on.
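
The numbering scheme can be pictured as a simple call counter. The class below is a hypothetical stand-in written for illustration. It mirrors the file names listed above but says nothing about how NeptuneLogger tracks the count internally.

```python
class CheckpointNamer:
    """Hands out ckpt_1.pt, ckpt_2.pt, ... on successive calls."""

    def __init__(self, prefix="ckpt", extension="pt"):
        self.prefix = prefix
        self.extension = extension
        self.calls = 0

    def next_name(self):
        # Increment first so that numbering starts at 1, as in the list above.
        self.calls += 1
        return f"{self.prefix}_{self.calls}.{self.extension}"


namer = CheckpointNamer()
names = [namer.next_name() for _ in range(2)]  # one call per epoch, epochs=2
print(names)  # ['ckpt_1.pt', 'ckpt_2.pt']
```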

Logging model predictions#

You can log the predictions made by the model as follows:

from neptune.types import File

dataiter = iter(validloader)
images, labels = next(dataiter)

# Predict batch of n_samples
n_samples = 10
imgs = images[:n_samples].to(parameters["device"])
probs = torch.nn.functional.softmax(model(imgs), dim=1)

# Decode probs and log tensors as image
for i, ps in enumerate(probs):
    pred = classes[torch.argmax(ps)]
    ground_truth = classes[labels[i]]
    description = f"pred: {pred} | ground truth: {ground_truth}"

    # Log series of tensors as image and predictions
    run[npt_logger.base_namespace]["predictions"].append(
        # Reorder channels-first (C, H, W) to (H, W, C) before converting to an image
        File.as_image(imgs[i].cpu().squeeze().permute(1, 2, 0).clip(0, 1)),
        name=f"{i}_{pred}_{ground_truth}",
        description=description,
    )

The predictions are logged as a series of images under the "predictions" namespace.
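
The decoding step in the loop above (softmax, then argmax, then a lookup in classes) can be followed by hand on a single example. This is a plain-Python sketch with made-up scores. The softmax and describe helpers are illustrative stand-ins for the torch calls, not part of the integration.

```python
import math

CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits, label_index):
    """Build the same description string the logging loop uses."""
    probs = softmax(logits)
    pred = CLASSES[probs.index(max(probs))]  # argmax over the probabilities
    ground_truth = CLASSES[label_index]
    return f"pred: {pred} | ground truth: {ground_truth}"


logits = [0.1, 0.2, 0.0, 2.5, 0.3, 1.9, 0.0, 0.1, 0.0, 0.2]  # made-up scores
print(describe(logits, label_index=5))  # pred: cat | ground truth: dog
```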

Saving the final model#

Before stopping the run, you can save the final model as model.pt:

npt_logger.log_model("model")

run.stop()
