Track artifacts#

Previewing artifact metadata in Neptune

Instead of uploading entire files, you can track and version them in Neptune as artifacts. With the track_files() method, you can log metadata about datasets, models, and any other artifacts that can be stored as files.

An artifact can refer to a file as well as a collection of files. For instance, if you track a folder with files inside, Neptune logs the metadata of each individual file and the whole folder.

You can track the following for each artifact:

  • The URL and file path
  • The MD5 hash
  • The file size
  • The last modification time
About the MD5 hash

The hash of the artifact is calculated based on the file contents and metadata, such as the path, size, and last modification time. A change to any of these will result in a different hash, even if the file contents are exactly the same.

For details, see the API reference: Artifact.

Example#

  1. Pass the path to a file or folder as an argument to the track_files() method (a complete script is shown after these steps):

    Single file
    run["train/dataset"].track_files("./datasets/train.csv")
    
    Folder
    run["train/images"].track_files("./datasets/images")
    
  2. In the Neptune web app, open the run and navigate to the Artifacts tab.

  3. Select artifacts to preview them and inspect the metadata.
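
Putting the steps together, here's a minimal end-to-end script. It assumes the local paths exist and that your Neptune credentials are set (for example, through the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables):

import neptune

# Start a run; credentials and project are read from the environment
run = neptune.init_run()

# Track a single file and a whole folder as artifacts
run["train/dataset"].track_files("./datasets/train.csv")
run["train/images"].track_files("./datasets/images")

# Stop the run to make sure all metadata is synced
run.stop()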

See example in Neptune 

Passing an absolute Windows file path#

The file path is expected to be a URI, such as file://c:/path/to/file.

To correctly parse an absolute Windows path (C:\Path\to\file):

  1. Work around the backslashes in one of the following ways:
    1. Escape any backslashes
    2. Convert the backslashes to forward slashes
    3. Convert the file path to a raw string
  2. Prepend file:// to the path.

For example:

import neptune

run = neptune.init_run()

path1 = "C:\\Path\\to\\file"  # Backslashes escaped
path2 = "C:/Path/to/file"     # Backslashes converted to forward slashes
path3 = r"C:\Path\to\file"    # Raw string

run["artifact1"].track_files(f"file://{path1}")
run["artifact2"].track_files(f"file://{path2}")
run["artifact3"].track_files(f"file://{path3}")

Tracking artifacts from S3-compatible storage#

You can version datasets or models stored on Amazon S3 or compatible storage (s3://...), such as MinIO or Google Cloud Storage (GCS).

Amazon S3#

You need to store your credentials for Amazon Web Services (AWS) as environment variables.

For example, for Amazon S3, you can attach the AWS managed policy AmazonS3ReadOnlyAccess to an IAM user or group.

Then, export the user access keys:

Linux and macOS:

export AWS_ACCESS_KEY_ID='Your_AWS_ID_here'
export AWS_SECRET_ACCESS_KEY='Your_AWS_key_here'

Windows:

setx AWS_ACCESS_KEY_ID 'Your_AWS_ID_here'
setx AWS_SECRET_ACCESS_KEY 'Your_AWS_key_here'
Where to enter the command
  • Linux: Command line
  • macOS: Terminal app
  • Windows: PowerShell or Command Prompt
  • Jupyter Notebook: In a cell, prefixed with an exclamation mark: ! your-command-here
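
If you can't export shell variables (for example, in a hosted notebook), a minimal sketch that sets them from Python instead. This assumes the variables are in place before the first track_files() call, so the S3 client is created with them set:

import os

# Placeholder values; substitute your own credentials
os.environ["AWS_ACCESS_KEY_ID"] = "Your_AWS_ID_here"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Your_AWS_key_here"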

For more information, see the AWS documentation.

Google Cloud Storage#

For GCS, set the S3_ENDPOINT_URL environment variable to the storage endpoint URL (https://storage.googleapis.com).

Linux and macOS:

export S3_ENDPOINT_URL='https://storage.googleapis.com'

Windows (setx sets the variable permanently):

setx S3_ENDPOINT_URL 'https://storage.googleapis.com'

Also store your GCS HMAC credentials in the following environment variables:

export AWS_ACCESS_KEY_ID='Your_GCS_service_account_key_here'
export AWS_SECRET_ACCESS_KEY='Your_GCS_service_account_secret_here'

To find your information:

  1. On the Google Cloud console, go to the Cloud Storage Buckets page.
  2. Navigate to Settings > Interoperability.
  3. The Storage URI is the value you need for the S3_ENDPOINT_URL environment variable.
  4. Check the HMAC key identifiers:
    • The access key is the value for AWS_ACCESS_KEY_ID.
    • The secret is the value for AWS_SECRET_ACCESS_KEY.

For details, see the Google Cloud docs.

When specifying the URL to the GCS asset to track with Neptune, use the S3 protocol:

run["asset"].track_files("s3://path/to/asset")

Other providers#

To access other S3-compatible storage providers, you need to set the storage endpoint URL to an environment variable named S3_ENDPOINT_URL.

export S3_ENDPOINT_URL='https://your/storage/endpoint.com'

Example#

Once you've set up your credentials (and possibly endpoint), pass the S3 path to the track_files() method:

Single file
run["train_dataset"].track_files("s3://datasets/train.csv")
Folder
run["train/images"].track_files("s3://datasets/images")

See example in Neptune 

Logging a custom hash#

Apart from the default information tracked with the track_files() method, you can log additional metadata for your artifact.

For example, to log a custom hash, use:

run["train/dataset"].track_files("./datasets/train.csv")
run["train/latest_custom_hash"] = "custom hash"

If you log the custom hash to the same namespace as the artifact, the MD5 hash and the custom hash appear together in the All metadata tab in the Neptune app:

Custom hash displayed in the All metadata tab in the Neptune app

You can also include the logged metadata in custom dashboards.

Querying artifact metadata#

For how to download artifact metadata via API, see Download artifact metadata.
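
As a quick sketch of what that looks like, assuming a read-only connection to an existing run (the run ID below is hypothetical) and the fetch_hash() and fetch_files_list() methods on artifact fields; see the linked page for the exact API in your client version:

import neptune

run = neptune.init_run(with_id="EXAMPLE-1", mode="read-only")

# Retrieve the artifact's hash and the per-file metadata
artifact_hash = run["train/dataset"].fetch_hash()
files = run["train/dataset"].fetch_files_list()

for f in files:
    print(f.file_path, f.size)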

