Working with artifacts: Comparing datasets between runs#
In this tutorial, we'll look at how we can use artifacts to:
- Group runs by dataset version used for training.
- Compare dataset metadata between runs.
We'll train a few models and explore the runs in the Neptune app.
Tip
If you already took the dataset versioning tutorial, you can use the same script.
- We don't need the assertion code anymore, so feel free to remove it.
- Jump straight to Change the training dataset.
Before you start#
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- Install Neptune:
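For example, with pip:
pip install neptune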
Passing your Neptune credentials
Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.
To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.
Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.
On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt:
setx SOME_NEPTUNE_VARIABLE "some-value"
Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune:
run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)
For more help, see Set Neptune credentials.
- Have a couple of sample CSV files handy: train.csv and test.csv. If you don't have any, see the sketch after this list for one way to generate them.
- Have the scikit-learn Python library installed.
  What if I don't use scikit-learn?
  No worries, we're just using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.
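If you need sample files, here's a minimal sketch (not part of the tutorial script) that generates train.csv and test.csv from scikit-learn's built-in Iris dataset, using the column names that the training script below expects. The 80/20 split and the random seed are arbitrary choices for illustration.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset as a DataFrame and rename columns to match the tutorial
iris = load_iris(as_frame=True)
df = iris.frame.rename(
    columns={
        "sepal length (cm)": "sepal.length",
        "sepal width (cm)": "sepal.width",
        "petal length (cm)": "petal.length",
        "petal width (cm)": "petal.width",
    }
)

# Replace the numeric target with species names in a "variety" column
df["variety"] = df["target"].map(dict(enumerate(iris.target_names)))
df = df.drop(columns="target")

# Write an 80/20 train/test split to CSV files
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)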
Prepare a model training script#
To start, create a training script train.py where you:
- Specify dataset paths for training and testing
- Define model parameters
- Calculate the score on the test set
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Add tracking of dataset version#
Create a Neptune run:
- If you haven't set up your credentials, you can log anonymously:
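The relevant lines, shown in context in the full script below:
import neptune

run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)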
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Log the dataset files as Neptune artifacts with the track_files() method:
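These are the two new lines, shown in context in the full script below:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)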
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
Project-level metadata
To make collaboration easier, you can log metadata at the project level. This logs the dataset version under the Project metadata tab in Neptune:
project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)
For a detailed example, see Sharing dataset versions on project level.
Tracking folders
You can also version an entire dataset folder:
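For example, assuming your dataset files live in a local datasets/ folder (the folder name is just an illustration), tracking the folder logs a single artifact that covers all files in it:
# Track every file in the folder as one artifact
run["datasets"].track_files("datasets/")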
Run model training and log metadata to Neptune#
- Log the parameters to Neptune.
- Log the score on the test set to Neptune.
- Stop the run.
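Together, these steps add the following lines; the full script with everything in place follows.
# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()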
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Next: change the training data and run model training again
#
- Run the training:
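For example, assuming you saved the script as train.py:
python train.py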
Change the training dataset#
To have runs with different artifact metadata, let's change the file path to the training dataset:
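The only code change, shown at the end of the full script below, is the training dataset path:
TRAIN_DATASET_PATH = "train_v2.csv"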
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#
TRAIN_DATASET_PATH = "train_v2.csv"
Train model on new dataset#
Create a new Neptune run and log the new dataset versions and metadata:
# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data
# Run model training and log dataset version, parameters, and test score to Neptune
#
TRAIN_DATASET_PATH = "train_v2.csv"

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Log dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()

#
# Go to Neptune to see how the datasets changed between training runs!
#
Then rerun the script:
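As before, assuming the script is saved as train.py:
python train.py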
Explore results in Neptune#
It's time to explore the logged runs in Neptune.
See example results in Neptune 
Viewing runs grouped by dataset version#
- Navigate to the Experiments tab.
- To add datasets/train, test_score, and parameters/* to the experiments table, click Add column and enter the name of each field.
- Near the Experiments tab, switch to group mode.
- Click the Group button to change the grouping.
- Enter the name of the field to group the runs by.
- To group the runs by the training dataset version, select the datasets/train field.
Comparing runs with different dataset versions#
You can compare the metadata of logged artifacts between a source and target run.
- Toggle the eye icons to select two runs with different dataset versions.
- Switch to the Artifacts tab.
  - There should be no difference between the datasets/test artifacts.
  - There should be at least one difference between the datasets/train artifacts. Click the file names to view the detailed diff.
See artifact comparison view in Neptune 