
Working with artifacts: Comparing datasets between runs#


In this tutorial, we'll look at how we can use artifacts to:

  • Group runs by dataset version used for training.
  • Compare dataset metadata between runs.

We'll train a few models and explore the runs in the Neptune app.

Tip

If you already took the dataset versioning tutorial, you can use the same script.


Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Install Neptune:

    pip install neptune
    
    Passing your Neptune credentials

    Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

    export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"
    

    To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

    export NEPTUNE_PROJECT="ml-team/classification"
    

    Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

    On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"
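
    For example, to set both variables from Command Prompt with the sample values shown above:

    setx NEPTUNE_API_TOKEN "h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"
    setx NEPTUNE_PROJECT "ml-team/classification"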


    While it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune.

    run = neptune.init_run(
        project="ml-team/classification",  # your full project name here
        api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
    )
    

    For more help, see Set Neptune credentials.

  • Have a couple of sample CSV files handy: train.csv and test.csv. (If you don't have any, see the sketch after this list.)

  • Have the scikit-learn Python library installed.

    What if I don't use scikit-learn?

    No worries, we're just using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.
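
If you don't have sample CSV files handy, here's a minimal sketch for generating them, assuming you're fine with using scikit-learn's built-in Iris dataset. It renames the columns to match the feature and target names used in the training script below; the split ratio and random seed are arbitrary.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as a DataFrame and rename columns to match the tutorial
iris = load_iris(as_frame=True)
df = iris.frame.rename(
    columns={
        "sepal length (cm)": "sepal.length",
        "sepal width (cm)": "sepal.width",
        "petal length (cm)": "petal.length",
        "petal width (cm)": "petal.width",
        "target": "variety",
    }
)

# Split into training and test sets and save them as CSV files
train, test = train_test_split(df, test_size=0.3, random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)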

Prepare a model training script#

To start, create a training script train.py where you:

  • Specify dataset paths for training and testing
  • Define model parameters
  • Calculate the score on the test set
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Add tracking of dataset version#

Create a Neptune run:

import neptune

run = neptune.init_run() # (1)!
  1. If you haven't set up your credentials, you can log anonymously:

    neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Log the dataset files as Neptune artifacts with the track_files() method:

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Project-level metadata

To make collaboration easier, you can log metadata at the project level.

This logs the dataset version under the Project metadata tab in Neptune:

project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)

For a detailed example, see Sharing dataset versions on project level.

Tracking folders

You can also version an entire dataset folder:

Example
run["dataset_tables"].track_files("../datasets/")

Run model training and log metadata to Neptune#

  1. Log parameters to Neptune:

    run["parameters"] = PARAMS
    
  2. Log the score on the test set to Neptune:

    run["metrics/test_score"] = score
    
  3. Stop the run:

    run.stop()
    
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log the test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
  4. Run the training:

python train.py

Change the training dataset#

To have runs with different artifact metadata, let's change the file path to the training dataset:

TRAIN_DATASET_PATH = "train_v2.csv"
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log the test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#

TRAIN_DATASET_PATH = "train_v2.csv"

Train model on new dataset#

Create a new Neptune run and log the new dataset versions and metadata:

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log the test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#

TRAIN_DATASET_PATH = "train_v2.csv"

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()

#
# Go to Neptune to see how the datasets changed between training runs!
#

Then rerun the script:

python train.py

Explore results in Neptune#

It's time to explore the logged runs in Neptune.

See example results in Neptune 

Viewing runs grouped by dataset version#

  1. Navigate to the Experiments tab.
  2. To add the datasets/train, metrics/test_score, and parameters fields to the experiments table, click Add column and enter the name of each field.
  3. Near the Experiments tab, switch to group mode.
  4. Click the Group button to change the grouping.
  5. Enter the name of the field to group the runs by.
  6. To group the runs by the training dataset version, select the datasets/train field.
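
If you prefer to inspect the grouping from code, here's a minimal sketch that fetches the runs table and groups it by the training dataset version. It assumes the datasets/train artifact field is returned as its hash string in the table.

import neptune

# Open the project in read-only mode
project = neptune.init_project(
    project="common/data-versioning",  # replace with your own project
    api_token=neptune.ANONYMOUS_API_TOKEN,
    mode="read-only",
)

# Fetch the dataset version and test score columns for all runs
runs_df = project.fetch_runs_table(
    columns=["datasets/train", "metrics/test_score"]
).to_pandas()

# Group the runs by training dataset version
print(runs_df.groupby("datasets/train")["metrics/test_score"].describe())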

Comparing runs with different dataset versions#

You can compare the metadata of logged artifacts between two selected runs.

  1. Toggle the eye icons to select two runs with different dataset versions.
  2. Switch to the Artifacts tab.

    • There should be no difference between the datasets/test artifacts.
    • There should be at least one difference between the datasets/train artifacts.

      Click on the file names to view the detailed diff.

See artifact comparison view in Neptune
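
You can also check from code whether two runs trained on the same dataset version by comparing the artifact hashes. A minimal sketch, assuming two existing runs in the example project (the run IDs below are placeholders):

import neptune

# Resume two existing runs in read-only mode; replace the IDs with your own
run_a = neptune.init_run(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id="DAT-1",  # placeholder run ID
    mode="read-only",
)
run_b = neptune.init_run(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id="DAT-2",  # placeholder run ID
    mode="read-only",
)

# Compare the hashes of the tracked training dataset artifacts
same_train_data = (
    run_a["datasets/train"].fetch_hash() == run_b["datasets/train"].fetch_hash()
)
print("Same training dataset version:", same_train_data)

run_a.stop()
run_b.stop()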