Re-running a failed experiment#
If a model training script that's tracked in Neptune fails partway through, you can re-run it with the same metadata, such as hyperparameters, data, and code version.
In this guide, you'll learn how to:
- Resume a failed Neptune run and fetch the metadata needed to re-run the training.
- Log all metadata from the model training (or validation or testing) to a new run, to save results you didn't get from the crashed or incomplete run.
Before you start#
Assumptions
- You have neptune-client installed and your Neptune credentials saved as environment variables. For details, see Installing Neptune.
- You have an existing (failed) run in a Neptune project that you have access to.
Fetching metadata from the failed run#
Obtain the ID of the failed run#
To resume the failed run and query the metadata we need from it, we need to know the Neptune ID of the run. It's a string that includes the project key and a counter – for example, SHOW-3305.
Getting the ID from the web app#
The ID is displayed:
- In the leftmost column of the table views.
- In the run information, which you can access through the menu next to the run title. From this view, you can copy the run ID directly to your clipboard.
Getting the ID programmatically#
You can also obtain the ID by fetching the runs table from the project. A failed run is no longer active, so we first fetch only the inactive runs, then filter that table further to keep the runs that failed.
import neptune
# Fetch project
project = neptune.init_project(project="workspace-name/project-name") # (1)!
# Fetch only inactive runs
runs_table_df = project.fetch_runs_table(state="inactive").to_pandas()
(1) The full project name. For example, "ml-team/classification". To copy it, navigate to the project settings in the top-right and select Properties.
If there are multiple failed runs in the project, you can make sure that you get the most recent failed run by appending .values[0]:
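The filtering step can be sketched as follows. This is a minimal illustration, assuming `runs_table_df` is the pandas DataFrame fetched above and that it contains the `sys/failed` and `sys/id` columns. The helper name `latest_failed_run_id` is hypothetical, and the sketch relies on the table being returned newest-first, which is worth checking for your Neptune version.

```python
def latest_failed_run_id(runs_table_df):
    """Return the ID from the first row whose sys/failed flag is True.

    Assumes the runs table is ordered newest-first, so the first
    matching row belongs to the most recent failed run.
    """
    # Keep only the rows where the run is marked as failed.
    failed_runs = runs_table_df[runs_table_df["sys/failed"] == True]  # noqa: E712
    # .values[0] picks the first remaining ID; it raises IndexError
    # if the project has no failed runs.
    return failed_runs["sys/id"].values[0]
```

You would call it as `failed_run_id = latest_failed_run_id(runs_table_df)` and pass the result to `with_id` when resuming the run.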
Related
Learn more about the system namespace: sys/state and sys/failed
Resume the failed run#
Initialize the failed run by passing its ID to the with_id argument.
failed_run = neptune.init_run(
    with_id="SHOW-3305",  # replace with the ID of your failed run
    mode="read-only",
)
Read-only mode
We're not logging new data, so we can resume the run in read-only mode.
You can do this whenever you're initializing an existing Neptune object that you only want to query metadata from.
For details, see Connection modes: Read-only mode.
Query the relevant metadata#
Next, we'll fetch the metadata we need to replicate the training run.
In this example, we have logged the following:
- A parameters dictionary, under the "config/hyperparameters" field.
- A dataset as an artifact, under the "artifacts/cifar-10" field.
We'll fetch the hyperparameters used in the failed run so that we can instantiate a model with the same configuration, then download the tracked dataset to a local path.
Retrieve hyperparameters#
Use the fetch() method to retrieve hyperparameters:
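A minimal sketch of this step, assuming `failed_run` is the run resumed in read-only mode and that the parameters were logged under "config/hyperparameters" as in this example. The helper names `fetch_hyperparameters` and `describe` are hypothetical.

```python
def fetch_hyperparameters(failed_run):
    """Return the values logged under "config/hyperparameters" as a dict."""
    # Calling fetch() on a namespace collects the fields stored beneath it.
    return failed_run["config/hyperparameters"].fetch()


def describe(params):
    """Format the fetched parameters for a quick visual check."""
    return ", ".join(f"{name}={value}" for name, value in sorted(params.items()))
```

For example, `failed_run_params = fetch_hyperparameters(failed_run)` followed by `print(describe(failed_run_params))` lets you confirm the configuration before you start the re-run.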
Retrieve tracked dataset files#
Use the download() method to retrieve the dataset artifact to your local disk:
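A minimal sketch of the download, assuming the dataset was tracked under "artifacts/cifar-10" as in this example. The helper name `download_dataset` and the default `./data` directory are assumptions made for illustration.

```python
from pathlib import Path


def download_dataset(failed_run, destination="./data"):
    """Download the tracked dataset artifact and return the local directory."""
    target = Path(destination)
    # Create the directory up front so the download has somewhere to land.
    target.mkdir(parents=True, exist_ok=True)
    failed_run["artifacts/cifar-10"].download(destination=str(target))
    return target
```

The returned path can then be passed to your data loading code in the re-run.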
Creating a new run#
We're ready to create a new Neptune run that will be used to log all the metadata in the re-run session.
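A minimal sketch of this step, assuming your project name and API token are available as environment variables (see the assumptions at the top) and that you want to reuse the field names from the failed run. The helper name `start_rerun` and its arguments are hypothetical.

```python
def start_rerun(failed_run_params, dataset_path):
    """Create a new run and log the metadata carried over from the failed run."""
    # Imported here so that this helper can be defined without touching Neptune.
    import neptune

    # With no arguments, the project and credentials come from the environment.
    new_run = neptune.init_run()

    # Log the same hyperparameters that the failed run used.
    new_run["config/hyperparameters"] = failed_run_params

    # Track the downloaded dataset as an artifact of the new run.
    new_run["artifacts/cifar-10"].track_files(str(dataset_path))
    return new_run
```

Calling `new_run = start_rerun(failed_run_params, "./data")` gives you a run object that already carries the configuration of the failed run.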
You can now continue logging metadata to this new run.
If there is a Neptune integration available for your framework, you can use the integrated logging instead of manually assigning metadata to the run object.
An integration will typically look like this:
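Which integration applies depends on your framework. As one illustration, the sketch below assumes TensorFlow/Keras with the Neptune Keras integration installed. The helper name `fit_with_neptune`, the `"training"` namespace, and the `"epochs"` and `"batch_size"` parameter keys are assumptions, not something the failed run is guaranteed to contain.

```python
def fit_with_neptune(model, x_train, y_train, new_run, params):
    """Train a Keras model while the Neptune callback logs to new_run."""
    # Imported here because the integration is an optional extra package.
    from neptune.integrations.tensorflow_keras import NeptuneCallback

    # The callback logs training metadata under the given namespace of the run.
    neptune_callback = NeptuneCallback(run=new_run, base_namespace="training")

    return model.fit(
        x_train,
        y_train,
        epochs=params["epochs"],  # hypothetical key from the fetched parameters
        batch_size=params["batch_size"],  # hypothetical key as well
        callbacks=[neptune_callback],
    )
```

Other integrations follow a similar pattern: you create the run yourself and hand it to a logger or callback object provided by the integration.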
Once you're done fetching the metadata from the open runs, you should stop tracking the runs using the stop() method.
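One way to make sure that stop() is always called, even if the re-run itself crashes, is to wrap the training call in try/finally. This is a sketch of that pattern. The helper name `run_then_stop` is hypothetical, and it only assumes that each object passed in has a stop() method, as Neptune runs and projects do.

```python
def run_then_stop(training_fn, *neptune_objects):
    """Call training_fn, then stop every Neptune object, even on failure."""
    try:
        return training_fn()
    finally:
        # Runs regardless of whether training_fn returned or raised.
        for neptune_object in neptune_objects:
            neptune_object.stop()
```

For example, `run_then_stop(train, failed_run, new_run, project)` stops both runs and the project object once the hypothetical `train` function finishes or raises.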
Checking results of the new training run#
Once you run the Python script, a link to the Neptune web app is printed to the console output.
Sample output
https://app.neptune.ai/workspace-name/project-name/e/RUN-100/metadata
The general format is https://app.neptune.ai/<workspace>/<project> followed by the Neptune ID of the initialized object.
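To make the format concrete, the sketch below assembles the link from its parts. The helper name `run_url` is hypothetical, and the `/e/` path segment is inferred from the sample output above rather than from a documented URL scheme. The run object can also report its own link through its get_url() method, if your client version provides it.

```python
def run_url(workspace, project, run_id):
    """Build the web app link for a run from the workspace, project, and run ID."""
    # The "/e/" segment mirrors the sample output shown above.
    return f"https://app.neptune.ai/{workspace}/{project}/e/{run_id}"
```

For the sample output above, `run_url("workspace-name", "project-name", "RUN-100")` gives the part of the link that precedes `/metadata`.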
Click on the link to open the run in Neptune and watch the training progress.