
Re-running a failed experiment#


If a model training script tracked in Neptune fails partway through, you can re-run it with the same metadata: the hyperparameters, data, and code version of the failed run.

In this guide, you'll learn how to:

  1. Resume a failed Neptune run and fetch the metadata needed to re-run the training.
  2. Log all metadata from the model training (or validation or testing) to a new run, to save results you didn't get from the crashed or incomplete run.


Before you start#

Assumptions

  • You have neptune-client installed and your Neptune credentials saved as environment variables.

    For details, see Installing Neptune.

  • You have an existing (failed) run in a Neptune project that you have access to.

Fetching metadata from the failed run#

Obtain the ID of the failed run#

To resume the failed run and query its metadata, we need to know the run's Neptune ID. It's a string composed of the project key and a counter – for example, SHOW-3305.
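To illustrate the format, you can split an ID into its two parts with standard string operations (the ID below is just an example):

```python
run_id = "SHOW-3305"

# The part before the hyphen is the project key; the part after is the run counter
project_key, counter = run_id.split("-")
print(project_key, counter)  # SHOW 3305
```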

Getting the ID from the web app#

The ID is displayed:

  • In the leftmost column of the table views.
  • In the run details view, which you can access by clicking on the run and selecting Details in the left pane. From this view, you can copy the run ID directly to your clipboard.

Example query for finding failed runs in the runs table

Getting the ID programmatically#

You can also obtain the ID programmatically: fetch the runs table from the project, limit it to inactive runs, then filter the result for the runs that failed.

import neptune.new as neptune

# Fetch project
project = neptune.init_project(project="workspace-name/project-name")  # (1)!

# Fetch only inactive runs
runs_table_df = project.fetch_runs_table(state="idle").to_pandas()
  1. The full project name. For example, "ml-team/classification". To copy it, navigate to the project settings → Properties.

If there are multiple failed runs in the project, appending .values[0] picks the first matching row – with the default table order, that's the most recently created failed run:

failed_run_id = runs_table_df[runs_table_df["sys/failed"]==True]["sys/id"].values[0]
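To see what that filter does, here is the same expression applied to a stand-in DataFrame in place of the fetched runs table (the IDs below are made up):

```python
import pandas as pd

# Stand-in for fetch_runs_table(...).to_pandas(); with the default
# table order, newer runs appear first
runs_table_df = pd.DataFrame({
    "sys/id": ["SHOW-3305", "SHOW-3290", "SHOW-3281"],
    "sys/failed": [True, False, True],
})

# Keep only failed runs, then take the ID of the first (newest) one
failed_run_id = runs_table_df[runs_table_df["sys/failed"] == True]["sys/id"].values[0]
print(failed_run_id)  # SHOW-3305
```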

Related

Learn more about the system namespace: sys/state and sys/failed

Resume the failed run#

Initialize the failed run by passing its ID to the with_id argument.

failed_run = neptune.init_run(
    with_id="SHOW-3305",  # replace with the ID of your failed run
    mode="read-only",
)

Read-only mode

We're not logging new data, so we can resume the run in read-only mode.

You can do this whenever you're initializing an existing Neptune object that you only want to query metadata from.

Query the relevant metadata#

Next, we'll fetch the metadata we need to replicate the training run.

In this example, we have logged the following:

  • A parameters dictionary, under the "config/hyperparameters" field.
  • A dataset as an artifact, under the "artifacts/cifar-10" field.

We'll download the hyperparameters used in the failed run to instantiate a model with the same configuration, then download the dataset and its path.

Retrieve hyperparameters#

Use the fetch() method to retrieve hyperparameters:

# Fetch hyperparameters 
failed_run_params = failed_run["config/hyperparameters"].fetch()
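fetch() returns the parameters as a plain Python dictionary, so you can feed them straight back into your model setup. A minimal sketch with a stand-in dict – the keys here are hypothetical; use whatever you logged under "config/hyperparameters":

```python
# Stand-in for failed_run["config/hyperparameters"].fetch()
failed_run_params = {"lr": 0.001, "batch_size": 64, "epochs": 10}

# Re-use the exact configuration of the failed run
lr = failed_run_params["lr"]
batch_size = failed_run_params["batch_size"]
epochs = failed_run_params["epochs"]
```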

Retrieve tracked dataset files#

Use the download() method to retrieve the dataset artifact to your local disk:

failed_run["artifacts/cifar-10"].download()

Creating a new run#

We're ready to create a new Neptune run that will be used to log all the metadata in the re-run session.

new_run = neptune.init_run()

You can now continue logging metadata to this new run.

new_run["config/hyperparameters"] = failed_run_params

for epoch in range(epochs):

    for i, (x, y) in enumerate(trainloader, 0):

        ...  # your training step, producing `loss` and `acc`

        # Log batch loss
        new_run["training/batch/loss"].append(loss)

        # Log batch accuracy
        new_run["training/batch/acc"].append(acc)

If there is a Neptune integration available for your framework, you can use the integrated logging instead of manually assigning metadata to the run object.

An integration will typically look like this:

neptune_callback = NeptuneCallback(run=new_run)
callbacks = [..., neptune_callback]
fit(..., callbacks=callbacks)

Once you're done fetching metadata from the failed run and logging to the new one, stop tracking both runs with the stop() method.

failed_run.stop()
new_run.stop()

When to stop runs manually?

Stopping objects with stop() is needed only while logging from a Jupyter Notebook or other interactive environments.

When logging through a script, Neptune automatically stops tracking once the script has completed execution.

Checking results of the new training run#

Once you run the Python script, a link to the Neptune web app is printed to the console output.

Sample output

https://app.neptune.ai/workspace-name/project-name/e/RUN-100/

The general format is https://app.neptune.ai/<workspace>/<project> followed by the Neptune ID of the initialized object.
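If you need the link programmatically, you can assemble it from its parts – a sketch, assuming the format described above and placeholder workspace, project, and run ID values:

```python
workspace = "workspace-name"
project = "project-name"
run_id = "RUN-100"

# <base>/<workspace>/<project>/e/<run ID>/
run_url = f"https://app.neptune.ai/{workspace}/{project}/e/{run_id}/"
print(run_url)
```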

Click on the link to open the run in Neptune and watch the training progress.

Summary#

You've learned how to:

  • Open a failed run in order to fetch the metadata needed to re-run the training.
  • Use the fetched metadata to parametrize a new run with the same training loop.