
Working with artifacts: Comparing datasets between runs#

Open in Colab

In this tutorial, we'll look at how we can use artifacts to:

  • Group runs by dataset version used for training.
  • Compare dataset metadata between runs.

We'll train a few models and explore the runs in the Neptune app.

Tip

If you already took the dataset versioning tutorial, you can use the same script.

See in Neptune  Code examples 

Before you start#

What if I don't use scikit-learn?

No worries, we're only using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.

Prepare a model training script#

To start, create a training script train.py where you:

  • Specify dataset paths for training and testing
  • Define model parameters
  • Calculate the score on the test set
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Add tracking of dataset version#

Create a Neptune run:

import neptune

run = neptune.init_run()  # (1)!
  1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/data-versioning")

Full script

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Log the dataset files as Neptune artifacts with the track_files() method:

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

Full script

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

Project-level metadata

To make collaboration easier, you can log metadata at the project level.

This logs the dataset version under the Project metadata tab in Neptune:

project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)

For a detailed example, see Sharing dataset versions on project-level.

Tracking folders

You can also version an entire dataset folder:

Example
run["dataset_tables"].track_files("../datasets/")

Run model training and log metadata to Neptune#

  1. Log parameters to Neptune:

    run["parameters"] = PARAMS
    
  2. Log the score on the test set to Neptune:

    run["metrics/test_score"] = score
    
  3. Stop the run:

    run.stop()
    
    Full script

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    
    import neptune
    
    TRAIN_DATASET_PATH = "train.csv"
    TEST_DATASET_PATH = "test.csv"
    PARAMS = {
        "n_estimators": 5,
        "max_depth": 1,
        "max_features": 2,
    }
    
    
    def train_model(params, train_path, test_path):
        train = pd.read_csv(train_path)
        test = pd.read_csv(test_path)
    
        FEATURE_COLUMNS = [
            "sepal.length",
            "sepal.width",
            "petal.length",
            "petal.width",
        ]
        TARGET_COLUMN = ["variety"]
        X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
        X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]
    
        rf = RandomForestClassifier(**params)
        rf.fit(X_train, y_train)
    
        score = rf.score(X_test, y_test)
        return score
    
    
    #
    # Run model training and log dataset version, parameters, and test score to Neptune
    #
    
    # Create a Neptune run and start logging
    run = neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/data-versioning",
    )
    
    # Track dataset version
    run["datasets/train"].track_files(TRAIN_DATASET_PATH)
    run["datasets/test"].track_files(TEST_DATASET_PATH)
    
    # Log parameters
    run["parameters"] = PARAMS
    
    # Calculate and log test score
    score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
    run["metrics/test_score"] = score
    
    # Stop logging to the active Neptune run
    run.stop()
    
    #
    # Run model training with different parameters and log metadata to Neptune
    #
    
  4. Run the training:

python train.py

Change the training dataset#

To have runs with different artifact metadata, let's change the file path to the training dataset:

TRAIN_DATASET_PATH = "train_v2.csv"

Full script

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#

TRAIN_DATASET_PATH = "train_v2.csv"

Train model on new dataset#

Create a new Neptune run and log the new dataset versions and metadata:

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()

Full script

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"
PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#

TRAIN_DATASET_PATH = "train_v2.csv"

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()

#
# Go to Neptune to see how the datasets changed between training runs!
#

Then rerun the script:

python train.py

Explore results in Neptune#

It's time to explore the logged runs in Neptune.

See example results in Neptune 

Viewing runs grouped by dataset version#

  1. Navigate to the runs table by clicking the Runs tab.
  2. To add the datasets/train, metrics/test_score, and parameters/* columns to the runs table, click Add column and enter the name of each field.
  3. To group the runs by the training dataset version, click Group by and select datasets/train.
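
You can also fetch the same fields programmatically and group them with pandas. Below is a minimal sketch, assuming the example project and a client version whose fetch_runs_table() accepts a columns argument:

import neptune

# Connect to the project in read-only mode
project = neptune.init_project(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    mode="read-only",
)

# Fetch selected fields of the runs table as a pandas DataFrame
runs_df = project.fetch_runs_table(
    columns=["datasets/train", "metrics/test_score"]
).to_pandas()

# Group runs by training dataset version (the artifact field appears as its hash)
print(runs_df.groupby("datasets/train")["metrics/test_score"].describe())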

Comparing runs with different dataset versions#

In the comparison view, you can contrast the metadata of logged artifacts.

  1. In the runs table, use the eye icons to select two runs with different dataset versions.
  2. In the left pane, select Compare runs.
  3. In the left pane, select Artifacts.

    • There should be no difference between the datasets/test artifacts.
    • There should be at least one difference between the datasets/train artifacts.

      Click on the file names to view the detailed diff.
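
If you'd rather verify this from code, you can resume both runs in read-only mode and compare the artifact hashes directly. A minimal sketch; the run IDs below are hypothetical placeholders:

import neptune

# Resume both runs without logging anything new
# (replace the IDs with the runs you selected in the table)
run_a = neptune.init_run(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id="DAT-1",
    mode="read-only",
)
run_b = neptune.init_run(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id="DAT-2",
    mode="read-only",
)

# Equal hashes mean the runs trained on the same dataset version
print(run_a["datasets/test"].fetch_hash() == run_b["datasets/test"].fetch_hash())    # expected: True
print(run_a["datasets/train"].fetch_hash() == run_b["datasets/train"].fetch_hash())  # expected: False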

See artifact comparison view in Neptune