Compare Kedro pipelines
You can log, monitor, and compare metrics, parameters, dataset versions, and other metadata from Kedro pipelines in Neptune.
This guide shows how to:
    Log data versions, parameters, and metrics for every Kedro pipeline execution
    See the diff between Kedro pipeline executions in the Neptune UI
    Group Kedro pipeline executions by dataset versions and compare them
By the end of this guide, you will have logged metadata from a few Kedro pipeline executions, grouped them by dataset version, and compared the results in the Neptune UI.
Kedro pipelines grouped by the dataset version in the Neptune UI.
Keywords: Kedro Neptune, Compare Kedro pipelines

Before you start

Make sure you meet the following prerequisites before starting:

Step 1: Add logging of data versions and parameters

    Define model training parameters in conf/base/parameters.yml. Once defined there, the parameters are logged to Neptune automatically.
parameters.yml (snippet)
    # Random forest parameters
    rf_max_depth: 3
    rf_max_features: 3
    rf_n_estimators: 25
parameters.yml (full file)
    # Parameters for the example pipeline. Feel free to delete these once you
    # remove the example pipeline from hooks.py and the example nodes in
    # `src/pipelines/`

    # Data split parameters
    example_test_data_ratio: 0.2

    # Random forest parameters
    rf_max_depth: 3
    rf_max_features: 3
    rf_n_estimators: 25

    # MLP parameters
    mlp_alpha: 0.02
    mlp_max_iter: 50
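The keys in parameters.yml reach a node as a plain dictionary. The following is a minimal sketch, assuming Kedro has already loaded the file into a dict, of pulling out the Random Forest settings and failing early when a key is missing. The helper name select_rf_params is hypothetical and is not part of Kedro or Neptune.

```python
from typing import Any, Dict

# Keys the Random Forest training node expects, as defined in parameters.yml.
RF_KEYS = ("rf_max_depth", "rf_max_features", "rf_n_estimators")


def select_rf_params(parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Return the Random Forest settings with the 'rf_' prefix removed.

    Hypothetical helper for illustration; raises KeyError naming missing keys.
    """
    missing = [key for key in RF_KEYS if key not in parameters]
    if missing:
        raise KeyError(f"Missing parameters: {', '.join(missing)}")
    return {key[len("rf_"):]: parameters[key] for key in RF_KEYS}


# Stand-in for the dictionary built from conf/base/parameters.yml.
parameters = {
    "example_test_data_ratio": 0.2,
    "rf_max_depth": 3,
    "rf_max_features": 3,
    "rf_n_estimators": 25,
    "mlp_alpha": 0.02,
    "mlp_max_iter": 50,
}

rf_kwargs = select_rf_params(parameters)
print(rf_kwargs)
```

The returned dictionary uses the argument names of RandomForestClassifier, so it could be unpacked into the constructor instead of reading each key by hand.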
    Define training datasets in conf/base/catalog.yml or another catalog file. Once defined, the dataset metadata is logged to Neptune automatically.
catalog.yml (snippet)
    example_iris_data:
      type: pandas.CSVDataSet
      filepath: data/01_raw/iris.csv
catalog.yml (full file)
    # Here you can define all your data sets by using simple YAML syntax.
    #
    # Documentation for this file format can be found in "The Data Catalog"
    # Link: https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html
    #
    # We support interacting with a variety of data stores including local file systems, cloud, network and HDFS
    #
    # An example data set definition can look as follows:
    #
    #bikes:
    #  type: pandas.CSVDataSet
    #  filepath: "data/01_raw/bikes.csv"
    #
    #weather:
    #  type: spark.SparkDataSet
    #  filepath: s3a://your_bucket/data/01_raw/weather*
    #  file_format: csv
    #  credentials: dev_s3
    #  load_args:
    #    header: True
    #    inferSchema: True
    #  save_args:
    #    sep: '|'
    #    header: True
    #
    #scooters:
    #  type: pandas.SQLTableDataSet
    #  credentials: scooters_credentials
    #  table_name: scooters
    #  load_args:
    #    index_col: ['name']
    #    columns: ['name', 'gear']
    #  save_args:
    #    if_exists: 'replace'
    #    # if_exists: 'fail'
    #    # if_exists: 'append'
    #
    # The Data Catalog supports being able to reference the same file using two different DataSet implementations
    # (transcoding), templating and a way to reuse arguments that are frequently repeated. See more here:
    # https://kedro.readthedocs.io/en/stable/05_data/01_data_catalog.html
    #
    # This is a data set used by the "Hello World" example pipeline provided with the project
    # template. Please feel free to remove it once you remove the example pipeline.

    example_iris_data:
      type: pandas.CSVDataSet
      filepath: data/01_raw/iris_v2.csv

    rf_model:
      type: kedro.extras.datasets.pickle.PickleDataSet
      filepath: data/06_models/rf_model.pkl

    mlp_model:
      type: kedro.extras.datasets.pickle.PickleDataSet
      filepath: data/06_models/mlp_model.pkl

    predictions:
      type: kedro.extras.datasets.json.JSONDataSet
      filepath: data/07_model_output/predictions.json

      type: kedro_neptune.NeptuneFileDataSet
      filepath: data/07_model_output/predictions.json
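Grouping runs by dataset version depends on being able to tell two versions of a file apart, such as iris.csv and iris_v2.csv. The sketch below shows one way to do that offline by hashing file contents with Python's hashlib. It is an illustration only: treating a content hash as the version identifier is an assumption, and the code does not claim to reproduce what the Kedro-Neptune plugin logs. The function name file_fingerprint and the sample file contents are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_fingerprint(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Two placeholder files standing in for iris.csv and iris_v2.csv.
    v1 = Path(tmp) / "iris.csv"
    v2 = Path(tmp) / "iris_v2.csv"
    v1.write_text("sepal_length,species\n5.1,setosa\n")
    v2.write_text("sepal_length,species\n5.1,setosa\n4.9,setosa\n")

    fingerprint_v1 = file_fingerprint(v1)
    fingerprint_v2 = file_fingerprint(v2)

# Files with different contents produce different fingerprints.
print(fingerprint_v1 != fingerprint_v2)
```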

Step 2: Add model training and prediction nodes

    Create a model training node in the src/KEDRO_PROJECT/pipelines/data_science/nodes.py.
    Use parameters you defined in conf/base/parameters.yml.
    This node should output a trained model.
nodes.py (snippet)
    def train_rf_model(train_x: pd.DataFrame,
                       train_y: pd.DataFrame,
                       parameters: Dict[str, Any]):

        max_depth = parameters["rf_max_depth"]
        n_estimators = parameters["rf_n_estimators"]
        max_features = parameters["rf_max_features"]

        clf = RandomForestClassifier(max_depth=max_depth,
                                     n_estimators=n_estimators,
                                     max_features=max_features)
        clf.fit(train_x, train_y.idxmax(axis=1))

        return clf
nodes.py (full file)
    # Copyright 2021 QuantumBlack Visual Analytics Limited
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
    # OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND
    # NONINFRINGEMENT. IN NO EVENT WILL THE LICENSOR OR OTHER CONTRIBUTORS
    # BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN
    # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN
    # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
    #
    # The QuantumBlack Visual Analytics Limited ("QuantumBlack") name and logo
    # (either separately or in combination, "QuantumBlack Trademarks") are
    # trademarks of QuantumBlack. The License does not grant you any right or
    # license to the QuantumBlack Trademarks. You may not use the QuantumBlack
    # Trademarks or any confusingly similar mark as a trademark for your product,
    # or use the QuantumBlack Trademarks in any other manner that might cause
    # confusion in the marketplace, including but not limited to in advertising,
    # on websites, or on software.
    #
    # See the License for the specific language governing permissions and
    # limitations under the License.

    """Example code for the nodes in the example pipeline. This code is meant
    just for illustrating basic Kedro features.

    Delete this when you start working on your own Kedro project.
    """
    # pylint: disable=invalid-name

    import logging
    import matplotlib.pyplot as plt
    import neptune.new as neptune
    import numpy as np
    import pandas as pd
    from scikitplot.metrics import plot_roc_curve, plot_precision_recall_curve
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.neural_network import MLPClassifier
    from typing import Any, Dict


    def train_rf_model(
        train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
    ):
        """Node for training Random Forest model"""
        max_depth = parameters["rf_max_depth"]
        n_estimators = parameters["rf_n_estimators"]
        max_features = parameters["rf_max_features"]

        clf = RandomForestClassifier(max_depth=max_depth,
                                     n_estimators=n_estimators,
                                     max_features=max_features)
        clf.fit(train_x, train_y.idxmax(axis=1))

        return clf


    def train_mlp_model(
        train_x: pd.DataFrame, train_y: pd.DataFrame, parameters: Dict[str, Any]
    ):
        """Node for training MLP model"""
        alpha = parameters["mlp_alpha"]
        max_iter = parameters["mlp_max_iter"]

        clf = MLPClassifier(alpha=alpha,
                            max_iter=max_iter)
        clf.fit(train_x, train_y)

        return clf


    def get_predictions(rf_model: RandomForestClassifier, mlp_model: MLPClassifier,
                        test_x: pd.DataFrame) -> Dict[str, Any]:
        """Node for making predictions given a pre-trained model and a test set."""
        predictions = {}
        for name, model in zip(['rf', 'mlp'], [rf_model, mlp_model]):
            y_pred = model.predict_proba(test_x).tolist()
            predictions[name] = y_pred

        return predictions


    def evaluate_models(predictions: dict, test_y: pd.DataFrame,
                        neptune_run: neptune.run.Handler):
        """Node for evaluating Random Forest and MLP models and creating ROC and Precision-Recall Curves"""

        for name, y_pred in predictions.items():
            y_true = test_y.to_numpy().argmax(axis=1)
            y_pred = np.array(y_pred)

            accuracy = accuracy_score(y_true, y_pred.argmax(axis=1).ravel())
            neptune_run[f'nodes/evaluate_models/metrics/accuracy_{name}'] = accuracy

            fig, ax = plt.subplots()
            plot_roc_curve(test_y.idxmax(axis=1), y_pred, ax=ax, title=f'ROC curve {name}')
            neptune_run['nodes/evaluate_models/plots/plot_roc_curve'].log(fig)

            fig, ax = plt.subplots()
            plot_precision_recall_curve(test_y.idxmax(axis=1), y_pred, ax=ax, title=f'PR curve {name}')
            neptune_run['nodes/evaluate_models/plots/plot_precision_recall_curve'].log(fig)


    def ensemble_models(predictions: dict, test_y: pd.DataFrame,
                        neptune_run: neptune.run.Handler) -> np.ndarray:
        """Node for averaging predictions of Random Forest and MLP models"""
        y_true = test_y.to_numpy().argmax(axis=1)
        # np.stack expects a sequence, so convert the dict view to a list first.
        y_pred_averaged = np.stack(list(predictions.values())).mean(axis=0)

        accuracy = accuracy_score(y_true, y_pred_averaged.argmax(axis=1).ravel())
        neptune_run['nodes/ensemble_models/metrics/accuracy_ensemble'] = accuracy
In this example, you will create a Kedro pipeline that trains two models, Random Forest and MLPClassifier, and ensembles their predictions.
For simplicity, this guide shows only the Random Forest training snippet. See the full nodes.py for the MLPClassifier.
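Ensembling here means averaging the class probabilities predicted by the two models and then picking the most likely class. The following is a minimal pure-Python sketch of that idea, using made-up probabilities rather than real model output. The function names average_predictions and predicted_classes are hypothetical; the full nodes.py performs the same averaging with NumPy.

```python
from typing import Dict, List

Probabilities = List[List[float]]  # one row per sample, one column per class


def average_predictions(predictions: Dict[str, Probabilities]) -> Probabilities:
    """Average class probabilities element-wise across all models."""
    models = list(predictions.values())
    n_models = len(models)
    return [
        [sum(values) / n_models for values in zip(*rows)]
        for rows in zip(*models)
    ]


def predicted_classes(probabilities: Probabilities) -> List[int]:
    """Return the index of the highest probability in every row."""
    return [row.index(max(row)) for row in probabilities]


# Made-up probabilities for two samples and three classes.
predictions = {
    "rf": [[0.5, 0.25, 0.25], [0.0, 0.5, 0.5]],
    "mlp": [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75]],
}

averaged = average_predictions(predictions)
print(averaged)                     # [[0.5, 0.375, 0.125], [0.0, 0.375, 0.625]]
print(predicted_classes(averaged))  # [0, 2]
```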
    Create a model prediction node in src/KEDRO_PROJECT/pipelines/data_science/nodes.py. This node should output a dictionary with predictions from the two models, Random Forest and MLPClassifier.
nodes.py (snippet)
    def get_predictions(rf_model: RandomForestClassifier,
                        mlp_model: MLPClassifier,
                        test_x: pd.DataFrame):
        """Node for making predictions given a pre-trained model and a test set."""
        predictions = {}
        for name, model in zip(['rf', 'mlp'], [rf_model, mlp_model]):
            y_pred = model.predict_proba(test_x).tolist()
            predictions[name] = y_pred

        return predictions

Step 3: Add evaluation node and log accuracy score to Neptune

    Import the Neptune client near the top of nodes.py.
nodes.py (snippet)
    import neptune.new as neptune
    Add a neptune_run argument of type neptune.run.Handler to the evaluate_models function.
nodes.py (snippet)
    def evaluate_models(predictions: dict, test_y: pd.DataFrame,
                        neptune_run: neptune.run.Handler):
        ...
You can treat neptune_run like a normal Neptune Run and log any ML metadata to it.
To use the Neptune Run handler in Kedro pipelines, you must refer to it by the special string "neptune_run".
    Calculate the accuracy and log it to the 'nodes/evaluate_models/metrics/accuracy_{model_name}' namespace.
nodes.py (snippet)
    def evaluate_models(predictions: dict, test_y: pd.DataFrame,
                        neptune_run: neptune.run.Handler):
        ...

        for name, y_pred in predictions.items():
            y_true = test_y.to_numpy().argmax(axis=1)
            y_pred = np.array(y_pred)

            accuracy = accuracy_score(y_true, y_pred.argmax(axis=1).ravel())
            neptune_run[f'nodes/evaluate_models/metrics/accuracy_{name}'] = accuracy
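The logging line in the snippet assigns one accuracy value per model under a path built from the model name. The sketch below repeats that computation without NumPy or a Neptune connection. A plain dictionary stands in for neptune_run (an assumption made so the example runs offline), the labels are one-hot rows, and all sample values are made up.

```python
from typing import Dict, List


def argmax(row: List[float]) -> int:
    """Index of the largest value in a row."""
    return row.index(max(row))


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of positions where the prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Stand-in for the Neptune run handler: it only needs run[path] = value.
fake_run: Dict[str, float] = {}

# Made-up one-hot labels and per-model class probabilities (4 samples, 3 classes).
test_y = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
predictions = {
    "rf": [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
    "mlp": [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.2, 0.5, 0.3]],
}

y_true = [argmax(row) for row in test_y]
for name, y_pred in predictions.items():
    score = accuracy(y_true, [argmax(row) for row in y_pred])
    fake_run[f"nodes/evaluate_models/metrics/accuracy_{name}"] = score

print(fake_run)
```

Because each model writes to its own path, the two accuracy values appear as separate fields and can be compared side by side across runs.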

Step 4: Add Neptune Run handler to the Kedro pipeline

    Go to the pipeline definition in src/KEDRO_PROJECT/pipelines/data_science/pipelines.py.
    Add the neptune_run Run handler as an input to the evaluate_models node.
pipelines.py (snippet)
    node(
        evaluate_models,
        dict(predictions="predictions",
             test_y="example_test_y",
             neptune_run="neptune_run"),
        None,
        name="evaluate_models",
    ),
pipelines/data_science/pipelines.py (full file)
    # Copyright 2021 QuantumBlack Visual Analytics Limited
    #
    # Licensed under the Apache License, Version 2.0 (the "License");
    # you may not use this file except in compliance with the License.
    # You may obtain a copy of the License at
    #
    # http://www.apache.org/licenses/LICENSE-2.0
    #
    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
    # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
    # OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND
    # NONINFRINGEMENT. IN NO EVENT WILL THE LICENSOR OR OTHER CONTRIBUTORS
    # BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN
    # ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF, OR IN
    # CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
    #
    # The QuantumBlack Visual Analytics Limited ("QuantumBlack") name and logo
    # (either separately or in combination, "QuantumBlack Trademarks") are
    # trademarks of QuantumBlack. The License does not grant you any right or
    # license to the QuantumBlack Trademarks. You may not use the QuantumBlack
    # Trademarks or any confusingly similar mark as a trademark for your product,
    # or use the QuantumBlack Trademarks in any other manner that might cause
    # confusion in the marketplace, including but not limited to in advertising,
    # on websites, or on software.
    #
    # See the License for the specific language governing permissions and
    # limitations under the License.

    """Example code for the nodes in the example pipeline. This code is meant
    just for illustrating basic Kedro features.

    Delete this when you start working on your own Kedro project.
    """

    from kedro.pipeline import Pipeline, node

    from .nodes import predict, report_accuracy, train_model


    def create_pipeline(**kwargs):
        return Pipeline(
            [
                node(
                    train_model,
                    ["example_train_x", "example_train_y", "parameters"],
                    "example_model",
                    name="train",
                ),
                node(
                    predict,
                    dict(model="example_model", test_x="example_test_x"),
                    "example_predictions",
                    name="predict",
                ),
                node(
                    report_accuracy,
                    ["example_predictions", "example_test_y", "neptune_run"],
                    None,
                    name="report",
                ),
            ]
        )
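The dict passed to node() maps each function argument to the name of a dataset, and "neptune_run" is one such name. The sketch below illustrates that name-to-argument mapping with a plain dictionary as the catalog. It shows the idea only and is not how Kedro or the Kedro-Neptune plugin is implemented; call_with_inputs, the toy node body, and the sample values are hypothetical.

```python
from typing import Any, Callable, Dict


def call_with_inputs(
    func: Callable[..., Any],
    inputs: Dict[str, str],
    catalog: Dict[str, Any],
) -> Any:
    """Look up each dataset name in the catalog and pass it as a keyword argument."""
    kwargs = {
        arg_name: catalog[dataset_name]
        for arg_name, dataset_name in inputs.items()
    }
    return func(**kwargs)


def evaluate_models(predictions, test_y, neptune_run):
    """Toy node: records how many models were evaluated."""
    neptune_run["nodes/evaluate_models/n_models"] = len(predictions)


# Placeholder catalog; a dict stands in for the Neptune run handler.
catalog = {
    "predictions": {"rf": [[1.0, 0.0]], "mlp": [[0.5, 0.5]]},
    "example_test_y": [[1, 0]],
    "neptune_run": {},
}

call_with_inputs(
    evaluate_models,
    dict(predictions="predictions", test_y="example_test_y", neptune_run="neptune_run"),
    catalog,
)
print(catalog["neptune_run"])
```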

Step 5: Run training with different parameters and dataset versions

    Go to conf/base/parameters.yml and change the model training hyperparameters.
parameters.yml (snippet)
    # Random forest parameters
    rf_max_depth: 3
    rf_max_features: 3
    rf_n_estimators: 25

    # MLP parameters
    mlp_alpha: 0.02
    mlp_max_iter: 50
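After several executions with different settings, the comparison comes down to two operations: finding which fields differ between runs, and grouping runs by dataset version. Neptune does this in its UI; the sketch below only mirrors the idea offline on hypothetical run records. The field names, the dataset paths used as versions, and the parameter values are placeholders, not output from real runs.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical run records; values are placeholders, not real results.
runs = [
    {"id": "run-1", "dataset": "data/01_raw/iris.csv", "rf_max_depth": 3, "rf_n_estimators": 25},
    {"id": "run-2", "dataset": "data/01_raw/iris.csv", "rf_max_depth": 5, "rf_n_estimators": 25},
    {"id": "run-3", "dataset": "data/01_raw/iris_v2.csv", "rf_max_depth": 3, "rf_n_estimators": 25},
]


def diff_runs(left: Dict[str, Any], right: Dict[str, Any]) -> Dict[str, tuple]:
    """Return {field: (left_value, right_value)} for fields that differ, ignoring 'id'."""
    fields = (set(left) | set(right)) - {"id"}
    return {
        field: (left.get(field), right.get(field))
        for field in sorted(fields)
        if left.get(field) != right.get(field)
    }


def group_by_dataset(records: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Map each dataset version to the ids of the runs that used it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[record["dataset"]].append(record["id"])
    return dict(groups)


print(diff_runs(runs[0], runs[1]))
print(group_by_dataset(runs))
```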
parameters.yml (full file)
    # remove the example pipeline from hooks.py and the example nodes in
    # `src/pipelines/`

    # Data split parameters
    example_test_data_ratio: 0.2

    # Random forest parameters