
Working with artifacts: Versioning datasets in runs#

Open in Colab

In this tutorial, we'll look at how we can use artifacts to:

  • Track datasets
  • Query the dataset version used in a run
  • Assert whether two runs used the same dataset version

We'll train a few models, making sure that the same dataset was used for each model.

See in Neptune  Code examples 

Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Install Neptune:

    pip install neptune
    
    Passing your Neptune credentials

    Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

    export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"
    

    To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

    export NEPTUNE_PROJECT="ml-team/classification"
    

    Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

    On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"


    Although it's not recommended (especially for the API token), you can also pass your credentials in the code when initializing Neptune:

    run = neptune.init_run(
        project="ml-team/classification",  # your full project name here
        api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
    )
    

    For more help, see Set Neptune credentials.

  • Have a couple of sample CSV files handy: train.csv and test.csv. (If you need to create them, see the sketch after this list.)

  • Have the scikit-learn Python library installed.

    What if I don't use scikit-learn?

    No worries, we're just using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.
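
Don't have suitable CSV files? Here's a minimal sketch (our own helper, not part of the official tutorial) that creates train.csv and test.csv from scikit-learn's built-in Iris data, using the same column names (sepal.length, sepal.width, petal.length, petal.width, variety) that the training script below expects:

# create_sample_data.py -- hypothetical helper for generating the sample CSV files
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)

# Rename the scikit-learn columns to match the training script
df = iris.frame.rename(
    columns={
        "sepal length (cm)": "sepal.length",
        "sepal width (cm)": "sepal.width",
        "petal length (cm)": "petal.length",
        "petal width (cm)": "petal.width",
        "target": "variety",
    }
)

# Map the numeric target to species names, so "variety" holds readable labels
df["variety"] = df["variety"].map(dict(enumerate(iris.target_names)))

# Split and save the two CSV files used throughout the tutorial
train, test = train_test_split(df, test_size=0.3, random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)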

Prepare a model training script#

To start, create a training script train.py where you:

  • Specify dataset paths for training and testing
  • Define model parameters
  • Calculate the score on the test set
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Add tracking of dataset version#

Create a Neptune run:

import neptune

run = neptune.init_run() # (1)!
  1. If you haven't set up your credentials, you can log anonymously:

    neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Log the dataset files as Neptune artifacts with the track_files() method:

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Project-level metadata

To make collaboration easier, you can log metadata at the project level.

This logs the dataset version under the Project metadata tab in Neptune:

project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)

For a detailed example, see Sharing dataset versions on project level.

Tracking folders

You can also version an entire dataset folder:

Example
run["dataset_tables"].track_files("../datasets/")

Train model and log metadata#

  1. Log parameters to Neptune:

    run["parameters"] = PARAMS
    
  2. Log the score on the test set to Neptune:

    run["metrics/test_score"] = score
    
  3. Get the run ID of your model training from Neptune.

    This will be useful when asserting that the baseline run and the new run used the same dataset versions.

    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    

    This outputs the Neptune ID of the run. For example, 'DAT-58' if you're using the sample project.

  4. Stop the run:

    run.stop()
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log the dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
  5. Run the training:

python train.py

Add version check for training and testing datasets#

Next, we'll fetch the dataset version hash from the baseline and compare it with the current version of the dataset.

  1. Create a new Neptune run and track the dataset version:

    new_run = neptune.init_run() # (1)!
    
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    1. If you haven't set up your credentials, you can log anonymously:

      neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/data-versioning",
      )
      
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log the dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
  2. Get the Neptune run object for the baseline model:

    baseline_run = neptune.init_run(
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log the dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
  3. Fetch the dataset version with the fetch_hash() method and compare the current dataset version with the baseline version:

    baseline_run["datasets/train"].fetch_hash()
    new_run.wait() # (1)!
    
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
    1. Use wait() to ensure that all logging operations are finished. See also the wait() entry in the API reference.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log the dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    # Fetch the dataset version of the baseline model training run
    baseline_run["datasets/train"].fetch_hash()
    new_run.wait()
    
    # Check if dataset versions changed or not between the runs
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
About the MD5 hash

The hash of the artifact depends on the file contents and metadata like path, size, and last modification time.

A change to any of these will result in a different hash, even if the file contents are exactly the same.
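
To see this in practice, here's a minimal sketch (our own illustration, reusing the anonymous project from this tutorial). It rewrites train.csv with byte-for-byte identical contents and compares the hashes of the two tracked versions, which are expected to differ because the modification time changed:

import pathlib

import neptune

run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track the file once
run["hash_check/before"].track_files("train.csv")

# Rewrite the file with identical contents -- only the modification time changes
contents = pathlib.Path("train.csv").read_bytes()
pathlib.Path("train.csv").write_bytes(contents)

# Track the rewritten file and wait for all tracking operations to finish
run["hash_check/after"].track_files("train.csv")
run.wait()

# Expected to print False, since the hash also covers file metadata such as the modification time
print(run["hash_check/before"].fetch_hash() == run["hash_check/after"].fetch_hash())

run.stop()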

Run model training with new parameters#

Let's create some differences in the metadata, then run the model training again.

  1. Change the parameters:

    PARAMS = {
        "n_estimators": 10,
        "max_depth": 3,
        "max_features": 2,
    }
    new_run["parameters"] = PARAMS
    
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    new_run["metrics/test_score"] = score
    
  2. Stop logging to active Neptune runs at the end of your script:

    new_run.stop()
    baseline_run.stop()
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log the dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Get Neptune Run ID of the first, baseline model training run
    baseline_run_id = run["sys/id"].fetch()
    print(baseline_run_id)
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
    # Create a new Neptune run and start logging
    new_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    new_run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Resume the baseline Neptune run
    baseline_run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
        with_id=baseline_run_id,
        mode="read-only",
    )
    
    # Fetch the dataset version of the baseline model training run
    baseline_run["datasets/train"].fetch_hash()
    new_run.wait()
    
    # Check if dataset versions changed or not between the runs
    assert (
        baseline_run["datasets/train"].fetch_hash()
        == new_run["datasets/train"].fetch_hash()
    )
    assert (
        baseline_run["datasets/test"].fetch_hash()
        == new_run["datasets/test"].fetch_hash()
    )
    
    # Define new parameters and log them to the new run
    PARAMS = {
        "n_estimators": 10,
        "max_depth": 3,
        "max_features": 2,
    }
    new_run["parameters"] = PARAMS
    
    # Calculate the test score and log it to the new run
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    new_run["metrics/test_score"] = score
    
    # Stop logging to the active Neptune Run
    new_run.stop()
    baseline_run.stop()
    
    #
    # Go to Neptune to see how the results changed, while making sure that the training dataset versions were the same!
    #
    
  3. Rerun the training:

    python train.py
    

Explore results in Neptune#

It's time to explore the logged runs in Neptune.

  1. Navigate to the Experiments tab.
  2. To add datasets/train, metrics/test_score, and parameters/* to the experiments table, click Add column and enter the name of each field.

See in Neptune 
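
If you prefer to inspect the same metadata from code, the sketch below fetches the runs table programmatically. It assumes the fetch_runs_table() method of the project object and that artifact fields such as datasets/train can be requested as table columns; adjust the column list to what your Neptune version supports.

import neptune

# Read-only connection to the project used in this tutorial
project = neptune.init_project(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    mode="read-only",
)

# Fetch selected fields for all runs as a pandas DataFrame
runs_df = project.fetch_runs_table(
    columns=[
        "sys/id",
        "datasets/train",  # assumption: artifact fields are returned as columns
        "metrics/test_score",
        "parameters/n_estimators",
    ],
).to_pandas()

print(runs_df.head())

project.stop()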

Next steps#