
Working with artifacts: Versioning datasets in runs#

Open in Colab

In this tutorial, we'll look at how we can use artifacts to:

  • Track datasets
  • Query the dataset version used in a run
  • Assert whether two runs used the same dataset version

We'll train a few models, making sure that the same dataset version was used for each one.

See in Neptune  Code examples 

Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Install Neptune:

    pip install neptune
    
    conda install -c conda-forge neptune
    
    Installing through Anaconda Navigator

    To find neptune, you may need to update your channels and index.

    1. In the Navigator, select Environments.
    2. In the package view, click Channels.
    3. Click Add..., enter conda-forge, and click Update channels.
    4. In the package view, click Update index... and wait until the update is complete. This can take several minutes.
    5. You should now be able to search for neptune.

    Note: The displayed version may be outdated. The latest version of the package will be installed.

    Note: On Bioconda, there is a "neptune" package available that is not the Neptune client library. Make sure to specify the "conda-forge" channel when installing Neptune.

    Passing your Neptune credentials

    Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

    export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"
    

    To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

    export NEPTUNE_PROJECT="ml-team/classification"
    

    Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Edit project details.

    On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"


    While not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune.

    run = neptune.init_run(
        project="ml-team/classification",  # your full project name here
        api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
    )
    

    For more help, see Set Neptune credentials.

  • Have a couple of sample CSV files handy: train.csv and test.csv. (If you don't have suitable files, see the sketch after this list.)

  • Have the scikit-learn Python library installed.

    What if I don't use scikit-learn?

    No worries, we're just using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.
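
If you don't have suitable CSV files at hand, the sketch below creates train.csv and test.csv from scikit-learn's built-in Iris dataset, renaming the columns to match the ones used later in this tutorial. The file names and split ratio are only examples.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as a DataFrame and rename the columns
# to the names this tutorial expects
iris = load_iris(as_frame=True).frame
iris = iris.rename(
    columns={
        "sepal length (cm)": "sepal.length",
        "sepal width (cm)": "sepal.width",
        "petal length (cm)": "petal.length",
        "petal width (cm)": "petal.width",
        "target": "variety",
    }
)

# Split into training and test sets and save them as CSV files
train, test = train_test_split(iris, test_size=0.2, random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)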

Prepare a model training script#

To start, create a training script train.py where you:

  • Specify dataset paths for training and testing
  • Define model parameters
  • Calculate the score on the test set
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Add tracking of dataset version#

Create a Neptune run:

import neptune

run = neptune.init_run() # (1)!
  1. If you haven't set up your credentials, you can log anonymously:

    neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Log the dataset files as Neptune artifacts with the track_files() method:

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Project-level metadata

To make collaboration easier, you can log metadata at the project level.

This logs the dataset version under the Project metadata tab in Neptune:

project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)

For a detailed example, see Sharing dataset versions on project level.

Tracking folders

You can also version an entire dataset folder:

Example
run["dataset_tables"].track_files("../datasets/")

Train model and log metadata#

  1. Log parameters to Neptune:

    run["parameters"] = PARAMS
    
  2. Log the score on the test set to Neptune:

    run["metrics/test_score"] = score
    
  3. Get the run ID of your model training from Neptune.

    This will come in handy when asserting that the baseline run and the new run used the same dataset versions.

    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    

    This outputs the Neptune ID of the run. For example, 'DAT-58' if you're using the sample project.

  4. Stop the run:

    run.stop()
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameter and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
  5. Run the training:

    python train.py

Add version check for training and testing datasets#

Next, we'll fetch the dataset version hash from the baseline run and compare it with the current version of the dataset.

  1. Create a new Neptune run and track the dataset version:

    new_run = neptune.init_run() # (1)!
    
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    1. If you haven't set up your credentials, you can log anonymously:

      neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/data-versioning",
      )
      
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameter and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
  2. Get the Neptune run object for the baseline model:

    baseline_run = neptune.init_run(
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameter and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
  3. Fetch the dataset version with the fetch_hash() method and compare the current dataset version with the baseline version:

    baseline_run["datasets/train"].fetch_hash()
    new_run.wait() # (1)!
    
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
    1. Use wait() to ensure that all logging operations are finished. See also the wait() entry in the API reference.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameter and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    # Fetch the dataset version of the baseline model training run
    baseline_run["datasets/train"].fetch_hash()
    new_run.wait()
    
    # Check if dataset versions changed or not between the runs
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
About the MD5 hash

The hash of the artifact depends on the file contents and metadata like path, size, and last modification time.

A change to any of these will result in a different hash, even if the file contents are exactly the same.
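
To see this in practice, you can copy a tracked file and track the copy under a separate field: the contents are identical, but the path and modification time change, so the hashes differ. A minimal sketch, assuming an active run and a hypothetical datasets/train_copy field:

import shutil

# Copy the training file: same contents, new path and modification time
shutil.copy(TRAIN_DATASET_PATH, "train_copy.csv")

new_run["datasets/train_copy"].track_files("train_copy.csv")
new_run.wait()

# The artifact hashes differ even though the file contents are identical
assert (
    new_run["datasets/train"].fetch_hash()
    != new_run["datasets/train_copy"].fetch_hash()
)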

Run model training with new parameters#

Let's create some differences in the metadata, then run the model training again.

  1. Change the parameters:

    PARAMS = {
        "n_estimators": 10,
        "max_depth": 3,
        "max_features": 2,
    }
    new_run["parameters"] = PARAMS
    
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    new_run["metrics/test_score"] = score
    
  2. Stop logging to active Neptune runs at the end of your script:

    new_run.stop()
    baseline_run.stop()
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameter and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    # Fetch the dataset version of the baseline model training run
    baseline_run["datasets/train"].fetch_hash()
    new_run.wait()
    
    # Check if dataset versions changed or not between the runs
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
    # Define new parameters and log them to the new run
    PARAMS = {
        "n_estimators": 10,
        "max_depth": 3,
        "max_features": 2,
    }
    new_run["parameters"] = PARAMS
    
    # Calculate the test score and log it to the new run
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    new_run["metrics/test_score"] = score
    
    # Stop logging to the active Neptune Run
    new_run.stop()
    baseline_run.stop()
    
    #
    # Go to Neptune to see how the results changed while making sure that the training dataset versions stayed the same!
    #
    
  3. Rerun the training:

    python train.py
    

Explore results in Neptune#

It's time to explore the logged runs in Neptune.

  1. Navigate to the Runs table.
  2. To add datasets/train, metrics/test_score, and parameters/* as columns in the runs table, click Add column and enter the name of each field.
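
If you prefer to inspect the same fields programmatically, you can fetch the runs table into a pandas DataFrame. A minimal sketch, assuming the public common/data-versioning project; replace the project name and column list with your own:

import neptune

project = neptune.init_project(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    mode="read-only",
)

# Fetch selected fields for the runs in the project as a DataFrame
runs_df = project.fetch_runs_table(
    columns=["sys/id", "datasets/train", "metrics/test_score"],
).to_pandas()
print(runs_df.head())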

See in Neptune 

Next steps#