Compare datasets between runs

You can version datasets, models, and other file objects as Artifacts in Neptune.

This guide shows how to:

  • Keep track of the dataset version with Neptune artifacts

  • See if models were trained on the same dataset version

  • Compare datasets in the Neptune UI to see what changed

By the end of this guide, you will train a few models on different dataset versions and compare those versions in the Neptune UI.

See this example in Neptune

Compare dataset versions in the Neptune UI

Keywords: Compare dataset versions, Data versioning, Data version control, Track dataset version

Before you start

Make sure you meet the following prerequisites:

To use artifacts, you need neptune-client version 0.10.10 or later:

pip install "neptune-client>=0.10.10"
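
If you are not sure which version you have installed, you can check it, for example, with:

pip show neptune-client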

Step 1: Prepare a model training script

Create a training script 'train_model.py' where you:

  • Specify dataset paths for training and testing

  • Define model parameters

  • Calculate the score on the test set

For example:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = '../datasets/tables/train.csv'
TEST_DATASET_PATH = '../datasets/tables/test.csv'
PARAMS = {'n_estimators': 5,
          'max_depth': 1,
          'max_features': 2,
          }

def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
    TARGET_COLUMN = ['variety']
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)

For reference, here is the complete train_model.py script that you will build over the course of this guide:
import neptune.new as neptune
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = '../datasets/tables/train.csv'
TEST_DATASET_PATH = '../datasets/tables/test.csv'
PARAMS = {'n_estimators': 7,
          'max_depth': 2,
          'max_features': 2,
          }

def train_model(params, train_path, test_path):
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)

    FEATURE_COLUMNS = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
    TARGET_COLUMN = ['variety']
    X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
    X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

    rf = RandomForestClassifier(**params)
    rf.fit(X_train, y_train)

    score = rf.score(X_test, y_test)
    return score

#
# Run model training and log the dataset version, parameters, and test score to Neptune
#

# Create a Neptune run and start logging
run = neptune.init(project='common/data-versioning',
                   api_token='ANONYMOUS')

# Track dataset versions
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
run["parameters"] = PARAMS

# Calculate and log the test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score

# Stop logging to the active Neptune run
run.stop()

#
# Change the training data, then run model training again and log the dataset version,
# parameters, and test score to a new Neptune run
#
TRAIN_DATASET_PATH = '../datasets/tables/train_v2.csv'

# Create a new Neptune run and start logging
new_run = neptune.init(project='common/data-versioning',
                       api_token='ANONYMOUS')

# Track dataset versions
new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)

# Log parameters
new_run["parameters"] = PARAMS

# Calculate and log the test score
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score

# Stop logging to the active Neptune run
new_run.stop()

#
# Go to Neptune to see how the datasets changed between training runs!
#

Step 2: Add tracking of the dataset version

Create a Neptune run:
import neptune.new as neptune

run = neptune.init(project='common/data-versioning',
                   api_token='ANONYMOUS')

Then track the versions of your training and test datasets:
run["datasets/train"].track_files(TRAIN_DATASET_PATH)
run["datasets/test"].track_files(TEST_DATASET_PATH)

You can also version the entire dataset folder by running:

run["dataset_tables"].track_files('../datasets/tables')

Step 3: Run model training and log parameters and metrics to Neptune

  • Log parameters to Neptune

PARAMS = {'n_estimators': 5,
          'max_depth': 1,
          'max_features': 2,
          }

run["parameters"] = PARAMS
  • Log score on the test set to Neptune:

score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
run["metrics/test_score"] = score
  • Stop logging to the current Neptune Run

run.stop()
  • Run training

python train_model.py

Step 4: Change training dataset

Change the file path to the training dataset:

TRAIN_DATASET_PATH = '../datasets/tables/train_v2.csv'

Step 5: Run model training on a new training dataset

  • Create a new Neptune Run:

new_run = neptune.init(project="common/data-versioning",
                       api_token="ANONYMOUS")
  • Log new dataset versions

new_run["datasets/train"].track_files(TRAIN_DATASET_PATH)
new_run["datasets/test"].track_files(TEST_DATASET_PATH)
  • Log parameters and test score

new_run["parameters"] = PARAMS
score = train_model(PARAMS, TRAIN_DATASET_PATH, TEST_DATASET_PATH)
new_run["metrics/test_score"] = score
  • Stop logging to the current Neptune Run at the end of your script

new_run.stop()
  • Run training

python train_model.py

Step 6: Compare model training runs in the Neptune UI

You can compare dataset versions between runs and see exactly what changed in the Artifacts compare view.

To do that:

  • Go to the Runs table in the Neptune UI

See this example in Neptune

Runs table with Runs for different data versions
  • Find the Runs with the same parameters but different dataset versions and select them by clicking the Eye Icon next to the Run ID.

Finding Runs with the same parameters but different dataset versions

To quickly find the diff between runs, use the Side-by-side compare view.

  • Go to the Compare runs > Artifacts view

Finding Compare runs > Artifacts in the Neptune UI
  • See the difference between the 'datasets/train' versions. Notice that there was no difference for the 'datasets/test' datasets.

See this example in Neptune

Compare dataset versions in the Neptune UI

You can see that the change in the model's test score was due to the different training set used: 'train_v2.csv' instead of 'train.csv'.

If the file content under the same file path had changed, you would see that in this view as well.
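
If you prefer to script this comparison instead of (or in addition to) using the UI, a rough sketch along the following lines can work, assuming every run in the project logged the fields used in this guide:

import neptune.new as neptune

# Download the project's runs table as a pandas DataFrame
project = neptune.get_project(name='common/data-versioning', api_token='ANONYMOUS')
runs_df = project.fetch_runs_table().to_pandas()

# For each run, fetch the training dataset hash and the test score
for run_id in runs_df['sys/id']:
    run = neptune.init(project='common/data-versioning', api_token='ANONYMOUS',
                       run=run_id, mode='read-only')
    train_hash = run["datasets/train"].fetch_hash()
    score = run["metrics/test_score"].fetch()
    print(run_id, train_hash, score)
    run.stop()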

Summary

In this guide you learned how to:

  • Track dataset versions with Neptune artifacts

  • Check whether models were trained on the same dataset version

  • Compare dataset versions in the Neptune UI

See also