
PyTorch integration guide#


Custom dashboard displaying metadata logged with PyTorch

Info

Neptune also integrates with several other libraries from the PyTorch ecosystem:

This guide walks you through keeping track of your model training metadata when using PyTorch. We'll use the NeptuneLogger class to:

  • Log training metrics
  • Upload model checkpoints
  • Log model predictions


Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Have PyTorch installed.
  • To follow the example, you'll also need to have torchvision, numpy, and torchviz installed.

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip
pip install -U neptune-pytorch

To install both Neptune and the integration:

pip
pip install -U "neptune[pytorch]"
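
Before moving on, it can help to confirm that the packages from the prerequisites and the integration are importable. The following is a minimal sketch using only the standard library; the helper name missing_modules and the REQUIRED_MODULES list are illustrative assumptions, not part of Neptune or PyTorch.

```python
from importlib.util import find_spec

# Import names (not pip package names) of the libraries used in this guide.
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "numpy",
    "torchviz",
    "neptune",
    "neptune_pytorch",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, install it with pip before continuing.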
How do I save my credentials as environment variables?

Set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

Linux or macOS
export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...ifQ=="
export NEPTUNE_PROJECT="ml-team/classification"

Windows
setx NEPTUNE_API_TOKEN "h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...ifQ=="
setx NEPTUNE_PROJECT "ml-team/classification"

You can also navigate to Settings → Edit the system environment variables and add the variables there.

Jupyter Notebook
%env NEPTUNE_API_TOKEN=h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...ifQ==
%env NEPTUNE_PROJECT=ml-team/classification

To find your credentials:

  • API token: In the bottom-left corner of the Neptune app, expand your user menu and select Get your API token. If you need the token of a service account, go to the workspace or project settings and enter the Service accounts settings.
  • Project name: Your full project name has the form workspace-name/project-name. You can copy it from the project menu (Details & privacy).

If you're working in Google Colab, you can set your credentials with the os and getpass libraries:

import os
from getpass import getpass
os.environ["NEPTUNE_API_TOKEN"] = getpass("Enter your Neptune API token: ")
os.environ["NEPTUNE_PROJECT"] = "workspace-name/project-name"
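
Whichever method you use, you can check that both variables are visible to Python before creating a run. The sketch below relies only on the standard library; the helper names missing_credentials and looks_like_full_project_name are hypothetical and exist only for this illustration.

```python
import os

REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_credentials(env=os.environ):
    """Return the names of the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def looks_like_full_project_name(name):
    """Check for the workspace-name/project-name form described above."""
    workspace, sep, project = name.partition("/")
    return bool(workspace and sep and project and "/" not in project)


missing = missing_credentials()
if missing:
    print("Not set:", ", ".join(missing))
elif not looks_like_full_project_name(os.environ["NEPTUNE_PROJECT"]):
    print("NEPTUNE_PROJECT should have the form workspace-name/project-name.")
else:
    print("Credentials found in the environment.")
```

This only checks that the values are present and well formed; it does not verify that the token is valid.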

If you'd rather follow the guide without any setup, you can run the example in Colab.

Basic logging example#

Set up the model and training config#

  1. Import the needed libraries:

    import torch
    from torch import nn
    from torch import optim
    from torchvision import transforms, datasets
    import numpy as np
    
  2. Create a Neptune run:

    import neptune
    
    run = neptune.init_run()
    
    If you haven't set up your credentials, you can log anonymously:

      run = neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/pytorch-integration",
      )
      
  3. Define your hyperparameters.

    parameters = {
        "lr": 1e-2,
        "bs": 128,
        "input_sz": 32 * 32 * 3,
        "n_classes": 10,
        "model_filename": "basemodel",
        "device": torch.device("cuda" if torch.cuda.is_available() else "cpu"),
        "epochs": 2,
    }
    
  4. Set up the model:

    class Model(nn.Module):
        def __init__(self, input_sz, hidden_dim, n_classes):
            super(Model, self).__init__()
            self.seq_model = nn.Sequential(
                nn.Linear(input_sz, hidden_dim * 2),
                nn.ReLU(),
                nn.Linear(hidden_dim * 2, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim // 2),
                nn.ReLU(),
                nn.Linear(hidden_dim // 2, n_classes),
            )
    
        def forward(self, input):
            x = input.view(-1, 32 * 32 * 3)
            return self.seq_model(x)
    
    
    model = Model(
        parameters["input_sz"], parameters["input_sz"], parameters["n_classes"]
    ).to(parameters["device"])
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=parameters["lr"])
    
  5. Download and transform the data for training:

    data_dir = "data/CIFAR10"
    compressed_ds = "./data/CIFAR10/cifar-10-python.tar.gz"
    data_tfms = {
        "train": transforms.Compose(
            [
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        ),
        "val": transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ]
        ),
    }
    
    trainset = datasets.CIFAR10(
        data_dir, transform=data_tfms["train"], download=True
    )
    trainloader = torch.utils.data.DataLoader(
        trainset, batch_size=parameters["bs"], shuffle=True, num_workers=0
    )
    validset = datasets.CIFAR10(
        data_dir, train=False, transform=data_tfms["val"], download=True
    )
    validloader = torch.utils.data.DataLoader(
        validset, batch_size=parameters["bs"], num_workers=0
    )
    
    classes = [
        "airplane",
        "automobile",
        "bird",
        "cat",
        "deer",
        "dog",
        "frog",
        "horse",
        "ship",
        "truck",
    ]
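
Note that step 4 passes parameters["input_sz"] as the hidden_dim argument, so the first hidden layer is twice as wide as the flattened input. As a rough sanity check, the sketch below counts the trainable parameters of the four nn.Linear layers (weights plus biases) with plain arithmetic. The helper names linear_params and mlp_params, and the alternative hidden_dim of 512, are assumptions made for illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(input_sz, hidden_dim, n_classes):
    """Total parameters of the four-layer network defined in step 4."""
    widths = [input_sz, hidden_dim * 2, hidden_dim, hidden_dim // 2, n_classes]
    return sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))


input_sz = 32 * 32 * 3

# Configuration used in the guide: hidden_dim equals input_sz
total = mlp_params(input_sz, hidden_dim=input_sz, n_classes=10)
print(f"hidden_dim={input_sz}: {total:,} parameters")

# A hypothetical smaller configuration, for comparison only
smaller = mlp_params(input_sz, hidden_dim=512, n_classes=10)
print(f"hidden_dim=512: {smaller:,} parameters")
```

With the guide's settings the model has roughly 42 million parameters, which affects both training time and the size of each uploaded checkpoint.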
    

Add Neptune logging#

  1. Create a NeptuneLogger instance:

    from neptune_pytorch import NeptuneLogger
    
    npt_logger = NeptuneLogger(
        run=run,
        model=model,
        log_model_diagram=True,
        log_gradients=True,
        log_parameters=True,
        log_freq=30,
    )
    
  2. Log the hyperparameters from earlier:

    from neptune.utils import stringify_unsupported
    
    run[npt_logger.base_namespace]["hyperparams"] = stringify_unsupported(
        parameters
    )
    
    The base_namespace attribute of the logger holds the namespace that the logger itself writes to. Using it as a prefix keeps your own metadata under that same namespace.
  3. Log metrics while training.

    In this example, the metrics are logged under the "batch" namespace every 30 steps.

    for epoch in range(parameters["epochs"]):
        for i, (x, y) in enumerate(trainloader, 0):
            x, y = x.to(parameters["device"]), y.to(parameters["device"])
            optimizer.zero_grad()
            outputs = model(x)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, y)
            acc = (torch.sum(preds == y.data)) / len(x)
    
            # Log after every 30 steps
            if i % 30 == 0:
                run[npt_logger.base_namespace]["batch/loss"].append(loss.item())
                run[npt_logger.base_namespace]["batch/acc"].append(acc.item())
    
            loss.backward()
            optimizer.step()
    
        npt_logger.log_checkpoint()
    
    The checkpoint number is incremented automatically on each call:

      • Call 1 → ckpt_1.pt
      • Call 2 → ckpt_2.pt
  4. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
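
Step 2 wraps the hyperparameters in stringify_unsupported because the dictionary contains a torch.device object, which is not a plain number, string, or Boolean. The sketch below illustrates the general idea with the standard library only. It is not Neptune's implementation: the stringify_values helper and the FakeDevice class are hypothetical stand-ins, and the set of types treated as supported is an assumption.

```python
class FakeDevice:
    """Hypothetical stand-in for an object such as torch.device("cpu")."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


# Assumption for this sketch: only these types can be logged as-is.
SUPPORTED = (bool, int, float, str)


def stringify_values(params):
    """Return a copy of params with unsupported values replaced by str(value)."""
    result = {}
    for key, value in params.items():
        if isinstance(value, dict):
            result[key] = stringify_values(value)  # handle nested dictionaries
        elif isinstance(value, SUPPORTED):
            result[key] = value
        else:
            result[key] = str(value)
    return result


example = {"lr": 1e-2, "bs": 128, "device": FakeDevice("cpu")}
cleaned = stringify_values(example)
print(cleaned)
```

In the guide itself, keep using stringify_unsupported from neptune.utils; the sketch only shows why the device entry ends up as a string in the logged hyperparameters.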
    

Run your script as you normally would. To open the run and explore the metrics, parameters, and predictions, click the Neptune link that appears in the console output.

Sample output

[neptune] [info ] Neptune initialized. Open in the app: https://app.neptune.ai/workspace/project/e/RUN-1


More options#

Saving checkpoint per epoch#

You can save a model checkpoint at the end of each epoch by calling log_checkpoint() after the inner batch loop.

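for epoch in range(parameters["epochs"]):
    for i, (x, y) in enumerate(trainloader, 0):
        ...  # Forward pass and loss computation, as in the basic example

        # Log after every 30 steps
        if i % 30 == 0:
            run[npt_logger.base_namespace]["batch/loss"].append(loss.item())
            run[npt_logger.base_namespace]["batch/acc"].append(acc.item())

        loss.backward()
        optimizer.step()

    # Save a checkpoint once per epoch
    npt_logger.log_checkpoint()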

The checkpoint number is incremented automatically on each call:

  • First call → ckpt_1.pt
  • Second call → ckpt_2.pt
  • And so on.
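
To get a feel for how much metadata this loop produces, the following sketch works out the number of logged metric points and checkpoint files for the settings used in this guide. It assumes the 50,000-image CIFAR-10 training split and a DataLoader that keeps the last partial batch; the helper name logged_steps is made up for the illustration.

```python
import math


def logged_steps(n_batches, every):
    """Batch indices for which the condition `i % every == 0` is true."""
    return [i for i in range(n_batches) if i % every == 0]


n_train = 50_000   # size of the CIFAR-10 training split
batch_size = 128   # parameters["bs"]
epochs = 2         # parameters["epochs"]

# The last, smaller batch is kept, so round up
n_batches = math.ceil(n_train / batch_size)

steps = logged_steps(n_batches, every=30)
points = len(steps) * epochs

# One log_checkpoint() call per epoch, numbered from 1
checkpoints = [f"ckpt_{n}.pt" for n in range(1, epochs + 1)]

print(f"{n_batches} batches per epoch")
print(f"{len(steps)} logged steps per epoch, {points} points per metric in total")
print("Checkpoints:", ", ".join(checkpoints))
```

Because the batch index restarts at zero in every epoch, the first batch of each epoch is always logged.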

Logging model predictions#

You can log the predictions made by the model as follows:

from neptune.types import File

dataiter = iter(validloader)
images, labels = next(dataiter)

# Predict batch of n_samples
n_samples = 10
imgs = images[:n_samples].to(parameters["device"])
probs = torch.nn.functional.softmax(model(imgs), dim=1)

# Decode probs and log tensors as image
for i, ps in enumerate(probs):
    pred = classes[torch.argmax(ps)]
    ground_truth = classes[labels[i]]
    description = f"pred: {pred} | ground truth: {ground_truth}"

    # Log series of tensors as image and predictions
    run[npt_logger.base_namespace]["predictions"].append(
        # Reorder channels-first (C, H, W) to channels-last (H, W, C) so the image is not transposed
        File.as_image(imgs[i].cpu().squeeze().permute(1, 2, 0).clip(0, 1)),
        name=f"{i}_{pred}_{ground_truth}",
        description=description,
    )

The predictions are logged as a series of images under the "predictions" namespace.
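
The decoding step above turns each row of model outputs into a class name. The sketch below reproduces that logic in plain Python so it can be followed without a trained model. The logits are made-up values, and the softmax and describe helpers are hypothetical names used only here.

```python
import math

classes = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits, label_index):
    """Build the description string used when logging a prediction."""
    probs = softmax(logits)
    pred_index = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return f"pred: {classes[pred_index]} | ground truth: {classes[label_index]}"


# Made-up scores for one image whose true label is "cat" (index 3)
logits = [0.1, 0.3, 0.2, 2.5, 0.0, 1.9, 0.4, 0.1, 0.2, 0.3]
print(describe(logits, label_index=3))
```

Since softmax preserves the ordering of the scores, taking the argmax of the probabilities gives the same class as taking the argmax of the raw outputs.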

Saving the final model#

Before stopping the run, you can save the final model as model.pt:

npt_logger.log_model("model")

run.stop()
