Rerunning a failed experiment#
If a model training script that's tracked in Neptune fails partway through, you can rerun it with the same metadata, such as hyperparameters, data, and code version.
In this guide, you'll learn how to:
- Resume a failed Neptune run and fetch the metadata needed to rerun the training.
- Log all metadata from the model training (or validation or testing) to a new run, to save results you didn't get from the crashed or incomplete run.
Before you start#
Assumptions
- You have neptune installed and your Neptune credentials saved as environment variables. For details, see Install Neptune.
- You have an existing (failed) run in a Neptune project that you have access to.
Fetching metadata from the failed run#
Obtain the ID of the failed run#
To resume the failed run and query its metadata, we need the Neptune ID of the run. It's a string that includes the project key and a counter, for example, SHOW-3305.
Getting the ID from the web app#
The ID is displayed:
- In the leftmost column of the table views.
- In the run information, which you can access through the menu next to the run ID. From this view, you can copy the run ID directly to your clipboard.
Getting the ID programmatically#
You can also obtain the ID by fetching the experiments table from the project and filtering the results by state. We're looking for failed runs, which we can obtain by further filtering the fetched experiments table.
import neptune
# Fetch project
project = neptune.init_project(project="workspace-name/project-name")  # full project name
# Fetch only inactive runs
runs_table_df = project.fetch_runs_table(state="inactive").to_pandas()
The project argument takes the full project name. For example, "ml-team/classification".
- You can copy the name from the project details, under Details & privacy.
- You can also find a pre-filled project string in Experiments → Create a new run.
If the project contains multiple failed runs, you can make sure you get the most recent one by appending .values[0] to the filtered ID column.
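The filtering step could look like the sketch below. It assumes the fetched table contains the sys/id and sys/failed columns and that rows are sorted newest first; latest_failed_run_id is a hypothetical helper name, not part of the Neptune API.

```python
import pandas as pd


def latest_failed_run_id(runs_table_df: pd.DataFrame) -> str:
    """Return the ID in the first row whose sys/failed flag is set."""
    # Keep only rows flagged as failed (missing values count as not failed).
    failed_runs = runs_table_df[runs_table_df["sys/failed"].eq(True)]
    if failed_runs.empty:
        raise LookupError("No failed runs found in the table.")
    # .values[0] picks the first matching row; this is the most recent
    # failure only if the table is sorted newest first (an assumption here).
    return str(failed_runs["sys/id"].values[0])
```

With the table fetched above, you would call it as failed_run_id = latest_failed_run_id(runs_table_df).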
Related
Learn more about the system namespace: sys/state and sys/failed
Resume the failed run#
Initialize the failed run by passing its ID to the with_id argument.
failed_run = neptune.init_run(
    with_id="SHOW-3305",  # replace with the ID of your failed run
    mode="read-only",
)
Read-only mode
We're not logging new data, so we can resume the run in read-only mode.
You can do this whenever you're initializing an existing Neptune object that you only want to query metadata from.
- For details, see Connection modes: Read-only mode
Query the relevant metadata#
Next, we'll fetch the metadata we need to replicate the training run.
In this example, we have logged the following:
- A parameters dictionary, under the "config/hyperparameters" field.
- A dataset as an artifact, under the "artifacts/cifar-10" field.
We'll download the hyperparameters used in the failed run to instantiate a model with the same configuration, then download the dataset and its path.
Retrieve hyperparameters#
Use the fetch() method to retrieve the hyperparameters.
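As a minimal sketch, the fetch could be wrapped as follows. The helper name fetch_hyperparameters is hypothetical; run stands for the read-only failed_run object initialized earlier.

```python
from typing import Any, Dict


def fetch_hyperparameters(
    run: Any, field: str = "config/hyperparameters"
) -> Dict[str, Any]:
    """Fetch the parameters dictionary logged under `field`."""
    # Calling fetch() on a namespace returns its fields as a dictionary.
    return run[field].fetch()
```

For example, failed_run_params = fetch_hyperparameters(failed_run) gives you the dictionary to instantiate the model with.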
Retrieve tracked dataset files#
Use the download() method to retrieve the dataset artifact to your local disk.
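The download and path lookup might be sketched like this. The helper name download_dataset and the "./data" destination are assumptions for illustration, and reading file_path from the entries returned by fetch_files_list() is one possible way to recover the dataset path; check it against your Neptune version.

```python
from typing import Any, List


def download_dataset(
    run: Any,
    field: str = "artifacts/cifar-10",
    destination: str = "./data",  # hypothetical local directory
) -> List[str]:
    """Download the tracked artifact and return the paths it contains."""
    artifact = run[field]
    # Download the files tracked by the artifact into `destination`.
    artifact.download(destination=destination)
    # Each entry describes one tracked file; collect their paths.
    return [file_data.file_path for file_data in artifact.fetch_files_list()]
```

Calling download_dataset(failed_run) would fetch the files and return the list of tracked paths.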
Creating a new run#
We're ready to create a new Neptune run that will be used to log all the metadata in the rerun session.
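A minimal sketch of carrying the fetched metadata over to the new run is shown below. It assumes new_run was created with neptune.init_run() and reuses the field names from the failed run; log_rerun_metadata is a hypothetical helper name.

```python
from typing import Any, Dict


def log_rerun_metadata(
    new_run: Any, params: Dict[str, Any], dataset_path: str
) -> None:
    """Log the failed run's configuration and dataset to the new run."""
    # Assigning a dictionary stores each key as a field in the namespace.
    new_run["config/hyperparameters"] = params
    # Track the dataset as an artifact under the same field as before.
    new_run["artifacts/cifar-10"].track_files(dataset_path)
```

Keeping the same field names in both runs makes them easy to compare side by side in the web app.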
You can now continue logging metadata to this new run.
If there is a Neptune integration available for your framework, you can use the integrated logging instead of manually assigning metadata to the run object. An integration typically takes the run object and logs the training metadata for you, for example through a callback or logger.
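To illustrate the pattern, here is a hand-written stand-in for what an integration callback does. EpochLogger and its base_namespace argument are hypothetical and not a real Neptune integration; the actual integrations are framework-specific and documented separately.

```python
from typing import Any, Dict


class EpochLogger:
    """Hypothetical stand-in for an integration callback."""

    def __init__(self, run: Any, base_namespace: str = "training"):
        self.run = run
        self.base_namespace = base_namespace

    def on_epoch_end(self, epoch: int, logs: Dict[str, float]) -> None:
        # Append each metric to its own series, using the epoch as the step.
        for name, value in logs.items():
            self.run[f"{self.base_namespace}/{name}"].append(value, step=epoch)
```

In a training loop, you would create it once with EpochLogger(run=new_run) and call on_epoch_end(epoch, {"loss": loss}) after each epoch; a real integration hooks into the framework so that you don't have to make that call yourself.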
Once you're done fetching metadata from the failed run and logging to the new one, stop tracking both runs using the stop() method.
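Stopping both runs can be sketched as follows; stop_runs is a hypothetical convenience helper.

```python
from typing import Any


def stop_runs(*runs: Any) -> None:
    """Stop every run that is still open."""
    for run in runs:
        # stop() ends the connection to Neptune for that run.
        run.stop()
```

For this guide, that would be stop_runs(failed_run, new_run) at the end of the script.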
Checking results of the new training run#
Once you run the Python script, a link to the Neptune web app is printed to the console output.
Sample output
[neptune] [info ] Neptune initialized. Open in the app:
https://app.neptune.ai/workspace/project/e/RUN-1
Click on the link to open the run in Neptune and watch the training progress.