Re-running a failed experiment#
If a model training script that's tracked in Neptune fails partway through, you can re-run it with the same metadata, such as hyperparameters, data, and code version.
In this guide, you'll learn how to:
- Resume a failed Neptune run and fetch the metadata needed to re-run the training.
- Log all metadata from the model training (or validation or testing) to a new run, to save results you didn't get from the crashed or incomplete run.
Before you start#
You have neptune-client installed and your Neptune credentials saved as environment variables.
For details, see Installing Neptune.
You have an existing (failed) run in a Neptune project that you have access to.
Fetching metadata from the failed run#
Obtain the ID of the failed run#
To resume the failed run and query the metadata we need from it, we need to know the Neptune ID of the run. It's a string that includes the project key and a counter – for example,
Getting the ID from the web app#
The ID is displayed:
- In the leftmost column of the table views.
- In the run details view, which you can access by clicking on the run and selecting Details in the left pane. From this view, you can copy the run ID directly to your clipboard.
Getting the ID programmatically#
You can also obtain the ID programmatically: fetch the runs table from the project, then filter it to keep only the runs that failed.
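The lookup described above can be sketched roughly as follows. This is a minimal sketch under a few assumptions: it assumes the `neptune` 1.x package (older `neptune-client` releases import `neptune.new` instead), the system fields `sys/id`, `sys/failed`, and `sys/creation_time`, and pandas being available for `to_pandas()`. The helper names `fetch_run_rows` and `latest_failed_run_id` are hypothetical and not part of the Neptune API.

```python
def fetch_run_rows(project_name):
    """Return one dict per run in the project (needs Neptune credentials)."""
    # Assumption: neptune 1.x. With older neptune-client releases,
    # use `import neptune.new as neptune` instead.
    import neptune

    project = neptune.init_project(project=project_name, mode="read-only")
    try:
        runs_df = project.fetch_runs_table().to_pandas()
    finally:
        project.stop()
    return runs_df.to_dict("records")


def latest_failed_run_id(rows):
    """Return the ID of the most recently created failed run, or None."""
    # `== True` is deliberate: it also accepts NumPy booleans coming from
    # pandas and rejects missing values such as NaN.
    failed = [row for row in rows if row.get("sys/failed") == True]  # noqa: E712
    if not failed:
        return None
    newest = max(failed, key=lambda row: row["sys/creation_time"])
    return newest["sys/id"]
```

Splitting the network call from the filtering keeps the selection logic easy to check offline: `latest_failed_run_id(fetch_run_rows("ml-team/classification"))` would return the newest failed run's ID, or `None` if no run failed.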
- The full project name. For example, "ml-team/classification". To copy it, navigate to the project settings → Properties.
If there are multiple failed runs in the project, you can ensure that you get the last run that failed by appending
Resume the failed run#
Initialize the failed run by passing its ID to the
We're not logging new data, so we can resume the run in read-only mode.
You can do this whenever you're initializing an existing Neptune object that you only want to query metadata from.
- For details, see Connection modes: Read-only mode
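As a rough illustration of resuming a run in read-only mode, the sketch below assumes the `neptune` 1.x `init_run()` function with its `with_id` and `mode` arguments; older client versions may name these differently. The helper `open_failed_run` and its `init_run` parameter are hypothetical, added only so the call can be exercised with a stub.

```python
def open_failed_run(run_id, project_name, init_run=None):
    """Reopen an existing run for querying only; nothing new is logged."""
    if init_run is None:
        # Assumption: neptune 1.x. With older neptune-client releases,
        # use `import neptune.new as neptune` instead.
        import neptune

        init_run = neptune.init_run
    return init_run(
        project=project_name,
        with_id=run_id,      # ID of the existing (failed) run
        mode="read-only",    # query metadata without modifying the run
    )
```

Because the run is opened read-only, an accidental assignment to one of its fields should not alter what the failed run recorded.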
Query the relevant metadata#
Next, we'll fetch the metadata we need to replicate the training run.
In this example, we have logged the following:
- A parameters dictionary, under the
- A dataset as an artifact, under the
We'll download the hyperparameters used in the failed run to instantiate a model with the same configuration, then download the dataset and its path.
Use the fetch() method to retrieve the hyperparameters:
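A minimal sketch of this step is shown here. It relies on `fetch()` returning a dictionary when called on a namespace; the namespace path is left to the caller because it depends on where your parameters were logged, and the helper `fetch_params` with its `required` check is a hypothetical addition.

```python
def fetch_params(run, namespace, required=()):
    """Fetch the parameters dictionary logged under `namespace`.

    `run` is a Neptune run opened for reading. `required` optionally
    lists keys that must be present before training is restarted.
    """
    params = run[namespace].fetch()
    missing = [key for key in required if key not in params]
    if missing:
        raise KeyError(f"Missing parameters in failed run: {missing}")
    return params
```

Checking for required keys up front is optional, but it surfaces an incomplete parameters dictionary before the re-run starts rather than midway through training.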
Retrieve tracked dataset files#
Use the download() method to retrieve the dataset artifact to your local disk:
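The download step might look like the sketch below. It assumes the artifact field's `download()` method accepts a `destination` argument; the helper name `download_dataset` and the default `./data` folder are hypothetical, and the namespace is passed in because it depends on where the dataset was tracked.

```python
from pathlib import Path


def download_dataset(run, namespace, destination="./data"):
    """Download the dataset artifact tracked under `namespace`.

    Returns the local directory that the files were downloaded to.
    """
    dest = Path(destination)
    # Create the target folder up front so the download has somewhere to land.
    dest.mkdir(parents=True, exist_ok=True)
    run[namespace].download(destination=str(dest))
    return dest
```

The returned path can then be handed to the data-loading code of the re-run, so that it reads exactly the files referenced by the failed run.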
Creating a new run#
We're ready to create a new Neptune run that will be used to log all the metadata in the re-run session.
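One way to set up the new run is sketched below, assuming the `neptune` 1.x `init_run()` function. The helper `start_rerun`, the `config/params` namespace, and the `rerun/source_run_id` field are hypothetical choices for illustration; use whatever structure your project already follows.

```python
def start_rerun(project_name, params, failed_run_id, init_run=None):
    """Create a fresh run and record the configuration it re-uses."""
    if init_run is None:
        # Assumption: neptune 1.x. With older neptune-client releases,
        # use `import neptune.new as neptune` instead.
        import neptune

        init_run = neptune.init_run
    new_run = init_run(project=project_name)
    # Log the same hyperparameters that the failed run used.
    new_run["config/params"] = params
    # Hypothetical field: note which failed run this one replaces.
    new_run["rerun/source_run_id"] = failed_run_id
    return new_run
```

Recording the ID of the failed run is not required, but it makes the relationship between the two runs easy to trace later in the runs table.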
You can now continue logging metadata to this new run.
If there is a Neptune integration available for your framework, you can use the integrated logging instead of manually assigning metadata to the
An integration will typically look like this:
Once you're done fetching the metadata from the open runs, you should stop tracking the runs using the stop() method.
When to stop runs manually?
Stopping objects with stop() is needed only while logging from a Jupyter Notebook or other interactive environments.
When logging through a script, Neptune automatically stops tracking once the script has completed execution.
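In an interactive session, where stopping is manual, a small wrapper can make sure that every open run is stopped even if the training code raises. The following is a sketch only: the `stopping` helper is hypothetical and relies on nothing more than each object having a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(*runs):
    """Yield the given runs and call stop() on each one afterwards."""
    try:
        yield runs
    finally:
        # Runs are stopped whether the block finished or raised an error.
        for run in runs:
            run.stop()
```

For example, `with stopping(failed_run, new_run): ...` would wrap the fetching and training steps, where `failed_run` and `new_run` stand for the two run objects opened earlier.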
Checking results of the new training run#
Once you run the Python script, a link to the Neptune web app is printed to the console output.
The general format is https://app.neptune.ai/<workspace>/<project> followed by the Neptune ID of the initialized object.
Click on the link to open the run in Neptune and watch the training progress.
You've learned how to:
- Open a failed run in order to fetch the metadata needed to re-run the training.
- Use the fetched metadata to parametrize a new run with the same training loop.