Rerunning a failed experiment

If a model training script that's tracked in Neptune fails partway through, you can rerun it with the same metadata, such as hyperparameters, dataset, and code version.

In this guide, you'll learn how to:

  1. Resume a failed Neptune run and fetch the metadata needed to rerun the training.
  2. Log all metadata from the model training (or validation or testing) to a new run, to save results you didn't get from the crashed or incomplete run.

Before you start

Assumptions

  • You have neptune installed and your Neptune credentials saved as environment variables.

    For details, see Install Neptune.

  • You have an existing (failed) run in a Neptune project that you have access to.

Fetching metadata from the failed run

Obtain the ID of the failed run

To resume the failed run and query its metadata, we need the run's Neptune ID. It's a string made up of the project key and a counter, for example, SHOW-3305.
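If you pass run IDs around in scripts, a quick format check can catch typos before any network call is made. The sketch below relies only on the shape described above (project key, hyphen, counter). The exact set of characters allowed in a project key is an assumption, and `split_run_id` is a hypothetical helper, not part of the Neptune API.

```python
import re
from typing import Tuple

# Assumed shape: uppercase letters/digits for the project key, "-", then a counter.
_RUN_ID_PATTERN = re.compile(r"^([A-Z0-9]+)-(\d+)$")


def split_run_id(run_id: str) -> Tuple[str, int]:
    """Split a run ID such as "SHOW-3305" into (project_key, counter)."""
    match = _RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"Not a valid run ID: {run_id!r}")
    return match.group(1), int(match.group(2))


project_key, counter = split_run_id("SHOW-3305")
print(project_key, counter)
```

A malformed value such as "show3305" raises a ValueError instead of being sent to the server.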

Getting the ID from the web app

The ID is displayed:

  • In the leftmost column of the table views.
  • In the run information, which you can access through the menu next to the run ID. From this view, you can copy the run ID directly to your clipboard.

Finding failed runs and accessing run information

Getting the ID programmatically

You can also obtain the ID programmatically: fetch the project's runs table, restricted to inactive runs, then filter the resulting table on the sys/failed field to keep only the failed runs.

import neptune

# Fetch project
project = neptune.init_project(
    project="workspace-name/project-name",  # full project name, for example "ml-team/classification"
)

# Fetch only inactive runs
runs_table_df = project.fetch_runs_table(state="inactive").to_pandas()

To copy the full project name to your clipboard, navigate to the project settings in the top-right and select Edit project details.

If the project contains multiple failed runs, appending .values[0] selects the first row of the filtered table, which is the most recent failed run as long as the table is sorted newest first:

failed_run_id = runs_table_df[runs_table_df["sys/failed"]==True]["sys/id"].values[0]
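The one-liner above raises an IndexError when no run has failed, and it depends on the row order of the table. A minimal sketch of a more defensive selection follows. It works on plain dictionaries so that it runs without Neptune or pandas; with a real table, the rows could come from `runs_table_df.to_dict("records")`. The helper name `latest_failed_run_id` and the sample rows are hypothetical, and the sketch assumes each row carries the `sys/id`, `sys/failed`, and `sys/creation_time` fields.

```python
from datetime import datetime
from typing import Any, Dict, List, Optional


def latest_failed_run_id(rows: List[Dict[str, Any]]) -> Optional[str]:
    """Return the ID of the most recently created failed run, or None."""
    # "== True" (rather than truthiness) keeps missing/NaN values out.
    failed = [row for row in rows if row.get("sys/failed") == True]  # noqa: E712
    if not failed:
        return None
    newest = max(failed, key=lambda row: row["sys/creation_time"])
    return newest["sys/id"]


# Hand-written stand-in rows for illustration only.
rows = [
    {"sys/id": "SHOW-3303", "sys/failed": True,
     "sys/creation_time": datetime(2023, 5, 1, 9, 0)},
    {"sys/id": "SHOW-3304", "sys/failed": False,
     "sys/creation_time": datetime(2023, 5, 2, 9, 0)},
    {"sys/id": "SHOW-3305", "sys/failed": True,
     "sys/creation_time": datetime(2023, 5, 3, 9, 0)},
]

print(latest_failed_run_id(rows))
```

Returning None for an empty result lets the calling script decide whether to exit quietly or raise its own error.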

Related

Learn more about the system namespace: sys/state and sys/failed

Resume the failed run

Initialize the failed run by passing its ID to the with_id argument.

failed_run = neptune.init_run(
    with_id="SHOW-3305",  # replace with the ID of your failed run
    mode="read-only",
)

Read-only mode

We're not logging new data, so we can resume the run in read-only mode.

You can do this whenever you're initializing an existing Neptune object that you only want to query metadata from.

Query the relevant metadata

Next, we'll fetch the metadata we need to replicate the training run.

In this example, we have logged the following:

  • A parameters dictionary, under the "config/hyperparameters" field.
  • A dataset as an artifact, under the "artifacts/cifar-10" field.

We'll fetch the hyperparameters used in the failed run so that we can instantiate a model with the same configuration, then download the tracked dataset artifact.

Retrieve hyperparameters

Use the fetch() method to retrieve hyperparameters:

# Fetch hyperparameters 
failed_run_params = failed_run["config/hyperparameters"].fetch()
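Before building the model, it can help to check that the fetched values contain everything the training code expects. The following is a minimal sketch, assuming the fetched value is a flat dictionary. The keys `lr`, `batch_size`, and `epochs`, the `TrainConfig` class, and the sample dictionary are all hypothetical; substitute the names you actually logged.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical hyperparameter names; replace with the ones you logged.
    lr: float
    batch_size: int
    epochs: int

    @classmethod
    def from_params(cls, params: Dict[str, Any]) -> "TrainConfig":
        """Build a config from a fetched dictionary, ignoring extra keys."""
        expected = [field.name for field in fields(cls)]
        missing = [name for name in expected if name not in params]
        if missing:
            raise KeyError(f"Missing hyperparameters: {missing}")
        return cls(
            lr=float(params["lr"]),
            batch_size=int(params["batch_size"]),
            epochs=int(params["epochs"]),
        )


# Stand-in for the dictionary returned by fetch()
example_params = {"lr": 0.01, "batch_size": 64, "epochs": 10, "optimizer": "sgd"}
config = TrainConfig.from_params(example_params)
print(config)
```

Failing early on a missing key is usually cheaper than discovering it several epochs into the rerun.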

Retrieve tracked dataset files

Use the download() method to retrieve the dataset artifact to your local disk:

failed_run["artifacts/cifar-10"].download()

Creating a new run

We're ready to create a new Neptune run that will be used to log all the metadata in the rerun session.

new_run = neptune.init_run()

You can now continue logging metadata to this new run.

new_run["config/hyperparameters"] = failed_run_params

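for epoch in range(epochs):  # epochs: number of training epochs, defined by your script
    for i, (x, y) in enumerate(trainloader, 0):
        # ... run the training step here and compute `loss` and `acc` for the batch ...

        # Log batch loss
        new_run["training/batch/loss"].append(loss)

        # Log batch accuracy
        new_run["training/batch/acc"].append(acc)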

If there is a Neptune integration available for your framework, you can use the integrated logging instead of manually assigning metadata to the run object.

An integration will typically look like this:

neptune_callback = NeptuneCallback(run=new_run)
callbacks = [..., neptune_callback]
fit(..., callbacks = callbacks)

Once you're done logging to the new run and fetching metadata from the failed run, stop both runs using the stop() method.

failed_run.stop()
new_run.stop()
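Since this guide deals with training that can fail midway, it may be worth making sure that stop() is called even if the rerun raises an exception. The sketch below uses `contextlib.ExitStack` from the standard library. `DummyRun` is a hypothetical stand-in that exposes only a stop() method, so the example runs without Neptune; with real runs you would register `failed_run.stop` and `new_run.stop` instead.

```python
from contextlib import ExitStack


class DummyRun:
    """Hypothetical stand-in exposing only a stop() method."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


failed_run_stub = DummyRun("failed")
new_run_stub = DummyRun("new")

try:
    with ExitStack() as stack:
        # Callbacks run in reverse registration order when the block exits,
        # whether it exits normally or through an exception.
        stack.callback(failed_run_stub.stop)
        stack.callback(new_run_stub.stop)
        raise RuntimeError("simulated crash during training")
except RuntimeError as error:
    print(f"Training failed: {error}")

print(failed_run_stub.stopped, new_run_stub.stopped)
```

A plain try/finally block achieves the same result; ExitStack is simply convenient when the number of open runs varies.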

Checking results of the new training run

Once you run the Python script, a link to the Neptune web app is printed to the console output.

Sample output

[neptune] [info ] Neptune initialized. Open in the app: https://app.neptune.ai/workspace/project/e/RUN-1

In the above example, the run ID is RUN-1.

Click on the link to open the run in Neptune and watch the training progress.
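If you capture the console output, for example in a CI log, the run ID can be recovered from the printed URL. This is a small sketch that assumes the URL ends with the run ID, as in the sample output above; `run_id_from_url` is a hypothetical helper, not a Neptune function.

```python
from urllib.parse import urlparse


def run_id_from_url(url: str) -> str:
    """Return the last path segment of a run URL, assumed to be the run ID."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


sample_url = "https://app.neptune.ai/workspace/project/e/RUN-1"
print(run_id_from_url(sample_url))
```

Inside the training script itself, reading the "sys/id" field of the run object is a more direct way to get the same value.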