# Working with artifacts: Sharing dataset versions on project level
You can log and query metadata at the project level, including dataset and model versions, text notes, images, notebook files, and anything else you can log to a single run.
In this tutorial, we'll track artifacts as project metadata. The flow includes:
- Logging versions of all the datasets used in a project.
- Organizing dataset version metadata in the Neptune app.
- Sharing all the currently used dataset versions with your team.
- Asserting that you're training on the latest dataset version available.
## Before you start
- Sign up at neptune.ai/register.
- Create a project for storing your metadata.
- Install Neptune:
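A minimal install command, assuming pip and the `neptune` package from PyPI:

```sh
pip install neptune
```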
Passing your Neptune credentials

Once you've registered and created a project, set your Neptune API token and full project name to the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables, respectively.

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

Your full project name has the form `workspace-name/project-name`. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt:
```sh
setx SOME_NEPTUNE_VARIABLE 'some-value'
```
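On Linux and macOS, you can export the variables in your shell instead; a minimal sketch, assuming a bash-compatible shell (replace the placeholder values with your own):

```sh
export NEPTUNE_API_TOKEN="your-api-token"
export NEPTUNE_PROJECT="workspace-name/project-name"
```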
Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune:
```python
run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)
```
For more help, see Set Neptune credentials.
- Have a couple of sample CSV files handy: `train.csv` and `test.csv`.
- Have the `scikit-learn` Python library installed.
Tip: If you already took the dataset versioning tutorial, or just want to check out the artifact logging without training any models, you can skip to the Track several dataset versions in project metadata section and run the snippets from there.
## Prepare a model training script

To start, create a training script `train.py`:
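A minimal sketch of `train.py`, assembled from the training code used later in this tutorial (it assumes Iris-style `train.csv` and `test.csv` files with the feature columns and `variety` target shown):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

TRAIN_DATASET_PATH = "train.csv"
TEST_DATASET_PATH = "test.csv"

# Model parameters
PARAMS = {
    "n_estimators": 8,
    "max_depth": 3,
    "max_features": 2,
}

# Load the data
train = pd.read_csv(TRAIN_DATASET_PATH)
test = pd.read_csv(TEST_DATASET_PATH)

FEATURE_COLUMNS = ["sepal.length", "sepal.width", "petal.length", "petal.width"]
TARGET_COLUMN = ["variety"]
X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

# Train the model and evaluate it on the test set
rf = RandomForestClassifier(**PARAMS)
rf.fit(X_train, y_train)

score = rf.score(X_test, y_test)
print("Test score:", score)
```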
## Track several dataset versions in project metadata
To log project metadata through the API, initialize the project as a Neptune object. You can log metadata to it just as you would to a run:
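A minimal sketch of the initialization; the `init_project` call matches the one used in the full script later, and the placeholder project name is yours to replace:

```python
import neptune

# Connect to the project to read and write project-level metadata
project = neptune.init_project(project="workspace-name/project-name")  # (1)!
```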
1. The full project name. For example, `"ml-team/classification"`. To find the project string in the Neptune app, in the Project metadata section, click Add new field.
Save a few dataset versions as Neptune artifacts to the project:

```python
train = pd.read_csv("train.csv")

for i in range(5):
    train_sample = train.sample(frac=0.5 + 0.1 * i)
    train_sample.to_csv("train_sampled.csv", index=None)
    project[f"datasets/train_sampled/v{i}"].track_files(
        "train_sampled.csv", wait=True  # (1)!
    )

print(project.get_structure())
```
1. Use `wait=True` to ensure that all logging operations are finished. See also: API reference ≫ `wait()`
Save the latest dataset version as a new artifact called `"latest"`:

```python
def get_latest_version():
    # Get the highest version number among the tracked dataset artifacts
    artifact_names = project.get_structure()["datasets"]["train_sampled"].keys()
    versions = [
        int(version.replace("v", ""))
        for version in artifact_names
        if version != "latest"
    ]
    latest_version = max(versions)
    return latest_version


latest_version = get_latest_version()
print("latest version", latest_version)

# Save the latest version as the "latest" artifact
project["datasets/train_sampled/latest"].assign(
    project[f"datasets/train_sampled/v{latest_version}"].fetch(),
    wait=True,
)
```
## Access dataset versions via API

You can now list the available dataset versions with the `get_structure()` method:

```python
print(project.get_structure()["datasets"])
```

Sample output:

```
{'train_sampled': {'latest': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF367D0DC0>,
                   'v0': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF367A7700>,
                   'v1': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF36806DA0>,
                   'v2': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF36806E00>,
                   'v3': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF36806E60>,
                   'v4': <neptune.attributes.atoms.artifact.Artifact object at 0x000001BF36806EC0>}}
```

For reference, the full script up to this point:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

import neptune

# Initialize Neptune project
project = neptune.init_project(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
)

# Create a few versions of a dataset and save them to Neptune
train = pd.read_csv("train.csv")

for i in range(5):
    train_sample = train.sample(frac=0.5 + 0.1 * i)
    train_sample.to_csv("train_sampled.csv", index=None)
    project[f"datasets/train_sampled/v{i}"].track_files(
        "train_sampled.csv", wait=True
    )

print(project.get_structure())


def get_latest_version():
    # Get the highest version number among the tracked dataset artifacts
    artifact_names = project.get_structure()["datasets"]["train_sampled"].keys()
    versions = [
        int(version.replace("v", ""))
        for version in artifact_names
        if version != "latest"
    ]
    latest_version = max(versions)
    return latest_version


latest_version = get_latest_version()
print("latest version", latest_version)

# Save the latest version as the "latest" artifact
project["datasets/train_sampled/latest"].assign(
    project[f"datasets/train_sampled/v{latest_version}"].fetch(), wait=True
)

print(project.get_structure()["datasets"])
```
## View dataset versions in the app
To view the dataset versions in the Neptune app:
- Select the Project metadata tab.
- Click the `datasets` namespace, then the `train_sampled` namespace.
- Select each artifact in the list to preview the metadata on the right.
## Going further: Assert that you're training on the latest dataset
In this last part, we'll show an example of how you can interact with the tracked artifacts.
We'll fetch the dataset version marked as "latest" and assert that we're using that same version to train our model.
Create a Neptune run. If you haven't set up your credentials, you can log anonymously:

```python
run = neptune.init_run(
    project="common/data-versioning",
    api_token=neptune.ANONYMOUS_API_TOKEN,
)
```

Log the current dataset as an artifact:

```python
TRAIN_DATASET_PATH = "train_sampled.csv"
run["datasets/train"].track_files(TRAIN_DATASET_PATH, wait=True)
```

Assert that the current dataset is the latest version:

```python
assert (
    run["datasets/train"].fetch_hash()
    == project["datasets/train_sampled/latest"].fetch_hash()
)
```

Train the model and log the metadata to Neptune:

```python
TEST_DATASET_PATH = "test.csv"

# Log parameters
PARAMS = {
    "n_estimators": 8,
    "max_depth": 3,
    "max_features": 2,
}
run["parameters"] = PARAMS

# Train the model
train = pd.read_csv(TRAIN_DATASET_PATH)
test = pd.read_csv(TEST_DATASET_PATH)

FEATURE_COLUMNS = [
    "sepal.length",
    "sepal.width",
    "petal.length",
    "petal.width",
]
TARGET_COLUMN = ["variety"]
X_train, y_train = train[FEATURE_COLUMNS], train[TARGET_COLUMN]
X_test, y_test = test[FEATURE_COLUMNS], test[TARGET_COLUMN]

rf = RandomForestClassifier(**PARAMS)
rf.fit(X_train, y_train)

# Save the score
score = rf.score(X_test, y_test)
run["metrics/test_score"] = score
```
Stop the active Neptune objects:
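A minimal sketch of the teardown, assuming the `run` and `project` objects created in the previous steps:

```python
# Synchronize all queued operations and close the connections
run.stop()
project.stop()
```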
To view the run in Neptune, click the link in the console output.
Sample output:

```
[neptune] [info   ] Neptune initialized. Open in the app: https://app.neptune.ai/workspace/project/e/RUN-1
```