Working with artifacts: Versioning datasets in runs#
In this tutorial, we'll look at how we can use artifacts to:
- Track datasets
- Query the dataset version used in a run
- Assert whether two runs used the same dataset version
We'll train a few models, making sure that the same dataset was used for each model.
Before you start#
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- Install the Neptune client library.
Passing your Neptune credentials
Once you've registered and created a project, set your Neptune API token and full project name to the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables, respectively.
To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.
Your full project name has the form `workspace-name/project-name`. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.
On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt:
```
setx SOME_NEPTUNE_VARIABLE "some-value"
```
Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune:
```python
run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)
```
For more help, see Set Neptune credentials.
- Have a couple of sample CSV files handy: `train.csv` and `test.csv`.
- Have the `scikit-learn` Python library installed.
What if I don't use scikit-learn?
No worries, we're just using it for demonstration purposes. You can use any framework you like, and Neptune has integrations with various popular frameworks. For details, see the Integrations tab.
Prepare a model training script#
To start, create a training script `train.py` where you:
- Specify dataset paths for training and testing
- Define model parameters
- Calculate the score on the test set
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"  # replace with your own if needed
TEST_DATASET_PATH = "test.csv"  # replace with your own if needed

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
```
Add tracking of dataset version#
Create a Neptune run. If you haven't set up your credentials, you can log anonymously:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
```
Log the dataset files as Neptune artifacts with the `track_files()` method:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
```
Project-level metadata
To make collaboration easier, you can log metadata on the project level.
This logs the dataset version under the Project metadata tab in Neptune:
```python
project = neptune.init_project(
    project="workspace-name/project-name",  # replace with your own
)
project["datasets/train"].track_files(TRAIN_DATASET_PATH)
```
For a detailed example, see Sharing dataset versions on project level.
Tracking folders
You can also version an entire dataset folder:
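For example, here is a minimal sketch; the `datasets/tables` field name and the `data/tables/` folder path are placeholders for your own:
```python
# Track the contents of an entire folder as a single artifact
# (the field name and folder path below are placeholders)
run["datasets/tables"].track_files("data/tables/")
```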
Train model and log metadata#
- Log the parameters to Neptune.
- Log the score on the test set to Neptune.
- Get the run ID of your model training from Neptune. This will be useful when asserting the same dataset versions on the baseline and new datasets. Printing the ID outputs something like 'DAT-58' if you're using the sample project.
- Stop the run.
The logging calls for these steps are shown together in the snippet below, and in context in the full script that follows.
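Taken together, these steps add the following logging calls (they appear again in context in the full script below):
```python
# Log parameters
run["parameters"] = PARAMS

# Calculate and log the test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the baseline training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()
```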
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the first, baseline model training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()

#
# Run model training with different parameters and log metadata to Neptune
#
```
- Run the training:
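If you saved the script as `train.py`, you can start the training from your terminal with `python train.py`.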
Add version check for training and testing datasets#
Next, we'll fetch the dataset version hash from the baseline and compare it with the current version of the dataset.
- Create a new Neptune run and track the dataset version:
```python
new_run = neptune.init_run()
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)
```
If you haven't set up your credentials, you can log anonymously:
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the first, baseline model training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()

#
# Run model training with different parameters and log metadata to Neptune
#

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)
```
- Get the Neptune run object for the baseline model:
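The call that does this, shown in context in the full script below, reopens the baseline run in read-only mode using the run ID fetched earlier:
```python
# Resume the baseline Neptune run in read-only mode
baseline_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
    run=baseline_run_id,
    mode="read-only",
)
```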
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the first, baseline model training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()

#
# Run model training with different parameters and log metadata to Neptune
#

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Resume the baseline Neptune run
baseline_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
    run=baseline_run_id,
    mode="read-only",
)
```
- Fetch the dataset version with the `fetch_hash()` method and compare the current dataset version with the baseline version:
```python
baseline_run["datasets/train"].fetch_hash()
new_run.wait()

assert (
    baseline_run["datasets/train"].fetch_hash()
    == new_run["datasets/train"].fetch_hash()
)
assert (
    baseline_run["datasets/test"].fetch_hash()
    == new_run["datasets/test"].fetch_hash()
)
```
Use `wait()` to ensure that all logging operations are finished. See also: API reference ≫ `wait()`
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the first, baseline model training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()

#
# Run model training with different parameters and log metadata to Neptune
#

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Resume the baseline Neptune run
baseline_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
    run=baseline_run_id,
    mode="read-only",
)

# Fetch the dataset version of the baseline model training run
baseline_run["datasets/train"].fetch_hash()
new_run.wait()

# Check if dataset versions changed or not between the runs
assert (
    baseline_run["datasets/train"].fetch_hash()
    == new_run["datasets/train"].fetch_hash()
)
assert (
    baseline_run["datasets/test"].fetch_hash()
    == new_run["datasets/test"].fetch_hash()
)
```
About the MD5 hash
The hash of the artifact depends on the file contents and metadata like path, size, and last modification time.
A change to any of these will result in a different hash, even if the file contents are exactly the same.
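As a rough illustration, assuming the behavior described above, updating only a file's modification time should produce a different hash even though the contents are unchanged. The `datasets/train_touched` field below is just a throwaway name for this check:
```python
import os
import time

# Update only the modification time of the training file; contents stay the same
os.utime(TRAIN_DATASET_PATH, (time.time(), time.time()))

# Track the touched file under a throwaway field and wait for logging to finish
new_run["datasets/train_touched"].track_files(TRAIN_DATASET_PATH)
new_run.wait()

# If the hash includes file metadata, the two versions should differ
assert (
    new_run["datasets/train"].fetch_hash()
    != new_run["datasets/train_touched"].fetch_hash()
)
```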
Run model training with new parameters#
Let's create some differences in the metadata, then run the model training again.
- Change the parameters.
- Stop logging to the active Neptune runs at the end of your script.
Both changes are shown in the snippet below, and in context in the full updated script that follows.
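In code, these changes amount to the following additions at the end of the script:
```python
# Define new parameters and log them to the new run
PARAMS = {
    "n_estimators": 10,
    "max_depth": 3,
    "max_features": 2,
}
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune runs
new_run.stop()
baseline_run.stop()
```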
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import neptune

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

PARAMS = {
    "n_estimators": 5,
    "max_depth": 1,
    "max_features": 2,
}


def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = [
        "sepal.length",
        "sepal.width",
        "petal.length",
        "petal.width",
    ]
    TARGET_COLUMN = ["variety"]
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score


#
# Run model training and log dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Get the Neptune run ID of the first, baseline model training run
baseline_run_id = run["sys/id"].fetch()
print(baseline_run_id)

# Stop logging to the active Neptune run
run.stop()

#
# Run model training with different parameters and log metadata to Neptune
#

# Create a new Neptune run and start logging
new_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
)

# Track dataset version
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Resume the baseline Neptune run
baseline_run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,
    project="common/data-versioning",
    run=baseline_run_id,
    mode="read-only",
)

# Fetch the dataset version of the baseline model training run
baseline_run["datasets/train"].fetch_hash()
new_run.wait()

# Check if dataset versions changed or not between the runs
assert (
    baseline_run["datasets/train"].fetch_hash()
    == new_run["datasets/train"].fetch_hash()
)
assert (
    baseline_run["datasets/test"].fetch_hash()
    == new_run["datasets/test"].fetch_hash()
)

# Define new parameters and log them to the new run
PARAMS = {
    "n_estimators": 10,
    "max_depth": 3,
    "max_features": 2,
}
new_run["parameters"] = PARAMS

# Calculate the test score and log it to the new run
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune runs
new_run.stop()
baseline_run.stop()

#
# Go to Neptune to see how the results changed, making sure that the training dataset versions were the same!
#
```
- Rerun the training.
Explore results in Neptune#
It's time to explore the logged runs in Neptune.
- Navigate to the Experiments tab.
- To add `datasets/train`, `test_score`, and `parameters/*` to the experiments table, click Add column and enter the name of each field.
Next steps#
- Learn how to compare artifacts: Compare dataset versions between runs
- Learn how to work with datasets stored in project metadata: Share dataset versions on project level