Tracking artifacts#

Instead of uploading entire files, you can track and version them in Neptune as artifacts.

With the track_files() method, you can log metadata about datasets, models, and any other artifacts that can be stored as files.

You can track the following for each artifact:

  • The URL and file path
  • The MD5 hash
  • Size
  • Last modified
About the MD5 hash

The hash of the artifact depends on the file contents and metadata, such as the path, size, and last modification time. A change to any of these will result in a different hash, even if the file contents are exactly the same.
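
To illustrate why identical file contents can still produce a different hash, here is a minimal sketch. It is not Neptune's actual hashing routine: the function name illustrative_artifact_hash and the way the fields are combined are assumptions made only for demonstration.

```python
import hashlib


def illustrative_artifact_hash(content: bytes, path: str, size: int, last_modified: str) -> str:
    """Toy hash over contents AND metadata (NOT Neptune's real algorithm)."""
    digest = hashlib.md5()
    digest.update(content)
    # Mixing in the metadata means a changed path or timestamp changes the result.
    for field in (path, str(size), last_modified):
        digest.update(field.encode("utf-8"))
    return digest.hexdigest()


data = b"id,label\n1,cat\n"
original = illustrative_artifact_hash(data, "datasets/train.csv", len(data), "2022-09-30 10:50:40")
moved = illustrative_artifact_hash(data, "datasets/v2/train.csv", len(data), "2022-09-30 10:50:40")

# Same bytes, different path: the two hashes differ.
print(original != moved)
```

In practice, this means that moving or re-saving a file can register as a new artifact version even when the data itself is unchanged.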

For details, see API reference > track_files().

Example#

  1. Pass the path to a file or folder as an argument to the track_files() method:

    # Single file
    run["train/dataset"].track_files("./datasets/train.csv")
    
    # Folder
    run["train/images"].track_files("./datasets/images")
    
  2. Navigate to the run view.

  3. In the left pane, select Artifacts.
  4. Select artifacts to preview them and inspect the metadata.
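
If you track several local paths, step 1 can be wrapped in a small helper. The sketch below is hypothetical (track_local_artifacts is not part of the Neptune API). It assumes that run is a Neptune run object, or any object that supports run[field].track_files(path), and it skips paths that do not exist.

```python
from pathlib import Path


def track_local_artifacts(run, artifacts: dict) -> list:
    """Track each existing local path under its field; return the tracked field names.

    Hypothetical helper: `run` is assumed to support run[field].track_files(path).
    """
    tracked = []
    for field, path in artifacts.items():
        if not Path(path).exists():
            # Skip missing paths instead of calling track_files() on them.
            print(f"Skipping {field}: {path} not found")
            continue
        run[field].track_files(str(path))
        tracked.append(field)
    return tracked
```

With a real run, it could be called as track_local_artifacts(run, {"train/dataset": "./datasets/train.csv", "train/images": "./datasets/images"}).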

Tracking artifacts from S3-compatible storage#

You can version datasets or models stored on Amazon S3 or compatible storage (s3://...), such as MinIO or Google Cloud Storage (GCS).

Amazon S3#

You need to store your credentials for Amazon Web Services (AWS) as environment variables.

For example, on Amazon S3, configure an IAM group policy with "S3ReadAccessOnly" permissions.

Then, export the user access keys:

Linux and macOS:

export AWS_SECRET_ACCESS_KEY='Your_AWS_key_here'
export AWS_ACCESS_KEY_ID='Your_AWS_ID_here'

Windows:

set AWS_SECRET_ACCESS_KEY='Your_AWS_key_here'
set AWS_ACCESS_KEY_ID='Your_AWS_ID_here'

To set the variables permanently on Windows:

setx AWS_SECRET_ACCESS_KEY 'Your_AWS_key_here'
setx AWS_ACCESS_KEY_ID 'Your_AWS_ID_here'
Where to enter the command
  • Linux: Command line
  • macOS: Terminal app
  • Windows: PowerShell or Command Prompt
  • Jupyter Notebook: In a cell, prefixed with an exclamation mark: ! your-command-here
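
Before tracking S3 artifacts from a script, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch using only the standard library. The helper name missing_s3_credentials is an assumption for illustration and not part of Neptune.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_s3_credentials(env=None) -> list:
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_s3_credentials()
if missing:
    print("Set these variables before tracking S3 artifacts:", ", ".join(missing))
```

The check only verifies that the variables are present. It does not validate the keys against AWS.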

For more information, see the AWS documentation.

Google Cloud Storage#

For GCS, set an environment variable named S3_ENDPOINT_URL to the storage endpoint URL (https://storage.googleapis.com).

Linux and macOS:

export S3_ENDPOINT_URL='https://storage.googleapis.com'

Windows:

set S3_ENDPOINT_URL='https://storage.googleapis.com'

To set the variable permanently on Windows:

setx S3_ENDPOINT_URL 'https://storage.googleapis.com'

Also store your GCS credentials in the following environment variables:

export AWS_ACCESS_KEY_ID='Your_GCS_service_account_key_here'
export AWS_SECRET_ACCESS_KEY='Your_GCS_service_account_secret_here'

To find your information:

  1. On the Google Cloud console, go to the Cloud Storage Buckets page.
  2. Navigate to Settings > Interoperability.
  3. The Storage URI is the value you need for the S3_ENDPOINT_URL environment variable.
  4. Check the HMAC key identifiers:
    • The access key is the value for AWS_ACCESS_KEY_ID.
    • The secret is the value for AWS_SECRET_ACCESS_KEY.

For details, see the Google Cloud docs.

When specifying the URL to the GCS asset to track with Neptune, use the S3 protocol:

run["asset"].track_files("s3://path/to/asset")
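
If your existing paths use the gs:// scheme, they need to be rewritten before they are passed to track_files(). A minimal sketch, assuming the bucket name and object key stay the same and only the scheme changes (the helper gcs_to_s3_url is hypothetical):

```python
def gcs_to_s3_url(url: str) -> str:
    """Rewrite a gs://bucket/key URL as s3://bucket/key; leave s3:// URLs untouched."""
    if url.startswith("s3://"):
        return url
    if url.startswith("gs://"):
        return "s3://" + url[len("gs://"):]
    raise ValueError(f"Expected a gs:// or s3:// URL, got: {url}")


# The bucket name below is a placeholder.
print(gcs_to_s3_url("gs://my-bucket/datasets/train.csv"))
```

The result can then be passed on, for example run["asset"].track_files(gcs_to_s3_url(path)).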

Other providers#

To access other S3-compatible storage providers, set an environment variable named S3_ENDPOINT_URL to the storage endpoint URL.

export S3_ENDPOINT_URL='https://your/storage/endpoint.com'
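
The endpoint can also be set from within Python, which may be convenient in notebooks. This sketch assumes the variable is read when track_files() is called, so it must be set before that call. The helper set_s3_endpoint and the example address are hypothetical.

```python
import os
from urllib.parse import urlparse


def set_s3_endpoint(url: str) -> None:
    """Validate the endpoint URL and set S3_ENDPOINT_URL for the current process."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not a valid HTTP(S) endpoint: {url!r}")
    os.environ["S3_ENDPOINT_URL"] = url


# Hypothetical address of a locally running S3-compatible server.
set_s3_endpoint("http://localhost:9000")
```

A value set this way applies only to the current Python process and its child processes. It does not persist in your shell.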

Example#

Once you've set up your credentials (and possibly endpoint), pass the S3 path to the track_files() method:

# Single file
run["train_dataset"].track_files("s3://datasets/train.csv")

# Folder
run["train/images"].track_files("s3://datasets/images")

Querying artifact metadata#

To obtain the hash of the artifact, use the fetch_hash() method on the artifact field:

>>> run["data_versions/train"].fetch_hash() 
'4e2f79947dfc5ca977c507f905792fae98c49a4b1df795d81e80279e3ce7be8c'
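
A common use of the hash is to check whether two runs were trained on the same artifact version. The helper below is a hypothetical sketch that relies only on fetch_hash(). It assumes both arguments are run objects with an artifact logged at the given field.

```python
def same_artifact_version(run_a, run_b, field: str) -> bool:
    """Return True if both runs hold an artifact with the same hash at `field`."""
    return run_a[field].fetch_hash() == run_b[field].fetch_hash()
```

For example, same_artifact_version(run, baseline_run, "data_versions/train") would be True only if both runs tracked an artifact with an identical hash, where baseline_run is a placeholder for a second run.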

You can fetch other metadata about the artifact by using the fetch_files_list() method. This returns a list of ArtifactFileData objects, one per file, each with the following properties:

  • file_hash: Hash of the file.
  • file_path: Path of the file, relative to the root of the virtual artifact directory.
  • size: Size of the file, in kilobytes.
  • metadata: Dictionary with the keys:
    • file_path: URL of the file (absolute path in local or S3-compatible storage).
    • last_modified: When the file was last modified.
Example
>>> import neptune.new as neptune
>>> run = neptune.init_run(with_id="CLS-45", mode="read-only")
https://app.neptune.ai/ml-team/classification/e/CLS-45 ...
>>> artifact_list = run["data_versions"].fetch_files_list()
>>> artifact_list[0].file_hash
'e54fdfced68d7e057eda168a05910fe609fc27f5'
>>> artifact_list[0].file_path
'train/sample.csv'
>>> artifact_list[0].metadata["last_modified"]
'2022-09-30 10:50:40'
>>> artifact_list[0].metadata["file_path"]
'file:///home/jackie/projects/text-classification/datasets/train/sample.csv'
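
Because each entry exposes file_path and file_hash, two file lists can be compared to see what changed between artifact versions. The following is a minimal sketch. diff_file_lists is a hypothetical helper, and it assumes only that the entries have the file_path and file_hash properties listed above.

```python
def diff_file_lists(old_files, new_files) -> dict:
    """Compare two lists of ArtifactFileData-like objects by file_path and file_hash."""
    old = {f.file_path: f.file_hash for f in old_files}
    new = {f.file_path: f.file_hash for f in new_files}
    return {
        # Paths present only in the new list.
        "added": sorted(set(new) - set(old)),
        # Paths present only in the old list.
        "removed": sorted(set(old) - set(new)),
        # Paths present in both lists whose hash differs.
        "changed": sorted(p for p in set(old) & set(new) if old[p] != new[p]),
    }
```

The inputs would typically be the results of fetch_files_list() from two different runs.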

For the full list of ways to interact with an artifact field, see API > Field types: Artifact.

Related

Tutorials > Data versioning