XGBoost

Learn how to log XGBoost metadata to Neptune

What will you get with this integration?

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.

The Neptune + XGBoost integration lets you automatically log many types of metadata during training.

What is logged?

  • metrics,

  • parameters,

  • learning rate,

  • pickled model,

  • visualizations (feature importance chart and tree visualizations),

  • hardware consumption (CPU, GPU, memory),

  • stdout and stderr logs,

  • training code and Git commit information.

Example dashboard with train-valid metrics and selected parameters

Where to start?

To get started with this integration, follow the quickstart below. You can also check the More options section to see what else you can do with this integration.

If you want to try it out right away, jump straight to the Quickstart below.

Quickstart

This quickstart will show you how to:

  • install required libraries,

  • log metadata during training (metrics, parameters, pickled model, etc.),

  • check results in the Neptune app.

At the end of this quickstart, you will be able to add Neptune to your XGBoost scripts and use it in your experimentation.

Install requirements

Before you start, make sure that your environment meets the requirements below.

Install neptune-client, xgboost and neptune-xgboost

This integration works with xgboost>=1.3.0, the release that introduced the new-style Python callback API.
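
For context, the new-style API is built around subclassing xgboost.callback.TrainingCallback, and NeptuneCallback is implemented on top of this interface. Here is a minimal sketch of such a callback (the class and the after_iteration hook are part of the xgboost>=1.3 API; the printing logic is just for illustration):

import xgboost as xgb

class PrintEvalCallback(xgb.callback.TrainingCallback):
    # Toy new-style callback that prints the latest evaluation result
    # after each boosting round.
    def after_iteration(self, model, epoch, evals_log):
        # evals_log is a nested dict: {dataset_name: {metric_name: [values]}}
        for data_name, metrics in evals_log.items():
            for metric_name, values in metrics.items():
                print(f"[{epoch}] {data_name}-{metric_name}: {values[-1]:.4f}")
        # Returning False keeps training going; True would stop it early
        return False

You would pass it to xgb.train via callbacks=[PrintEvalCallback()], exactly like neptune_callback in the snippets below.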

Depending on your operating system, open a terminal or CMD and run one of the commands below. All required libraries are available via pip and conda:

This integration is tested with:

  • xgboost==1.4.2

  • neptune-xgboost==0.9.8

  • neptune-client==0.9.16

We recommend installing these versions.

pip

pip install neptune-client==0.9.16 xgboost==1.4.2 neptune-xgboost==0.9.8

conda

conda install -c conda-forge \
    neptune-client==0.9.16 xgboost==1.4.2 neptune-xgboost==0.9.8

For more help with installing neptune-client, check the neptune-client installation guide.

Install psutil (optional, you can skip it)

If you want hardware monitoring logged (recommended), you should additionally install psutil.

pip

pip install psutil==5.8.0

conda

conda install psutil==5.8.0

Install graphviz (optional, you can skip it)

If you want to log visualized trees after training (recommended), you need to install graphviz.

Note that the installation below covers only the pure-Python interface to the Graphviz software. You need to install the Graphviz binaries separately; check the Graphviz docs for installation help.

pip

pip install graphviz==0.16

conda

conda install -c conda-forge python-graphviz==0.16
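
To check that both the Python package and the Graphviz binaries are visible from your environment, you can run a quick sanity check (a minimal sketch; graphviz.version() calls the installed Graphviz backend and raises graphviz.ExecutableNotFound when the binaries are missing):

import graphviz

try:
    # Queries the Graphviz backend (equivalent to "dot -V")
    print("Graphviz backend version:", graphviz.version())
except graphviz.ExecutableNotFound:
    print("The Python package works, but the Graphviz binaries are missing.")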

Log metadata during training

To start logging metadata (metrics, parameters, etc.) during training, you need to use NeptuneCallback.

core code

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init(project="my_workspace/my_project")

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, parameters, etc.
...

# Train the model and log metadata to the run in Neptune
xgb.train(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)
full script

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/xgboost-integration",
    api_token="ANONYMOUS",
    name="xgb-train",
    tags=["xgb-integration", "train"],
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Train the model and log metadata to the run in Neptune
xgb.train(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    evals=evals,
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99 ** epoch),
        xgb.callback.EarlyStopping(rounds=30),
    ],
)

Read the docstrings of NeptuneCallback to learn more about its parameters.
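
For orientation, a callback configured with more of its options could look like the sketch below. The parameter set reflects neptune-xgboost 0.9.8 at the time of writing; treat it as an assumption and confirm against the docstrings:

# Parameter names assumed from neptune-xgboost 0.9.8; verify in the docstrings
neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="training",  # namespace where all metadata is logged
    log_model=True,             # log the pickled model after training
    log_importance=True,        # log the feature importance chart
    max_num_features=10,        # max features shown on the importance chart
    log_tree=[0, 1, 2, 3],      # indices of the trees to visualize
)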

In the snippet above, you:

  • import NeptuneCallback, which handles the metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the train function.

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison and sharing:

python main.py

In the Neptune app, it will look similar to this:

The logged metadata includes parameters and train/valid metrics.

More about the run above

The run has the following metadata:

  • booster_config: All parameters for the booster.

  • early_stopping: best_score and best_iteration (logged only if early stopping was activated).

  • epoch: Epochs (visualized as a chart from 1 to the last epoch).

  • learning_rate: Learning rate visualized as a chart.

  • pickled_model: Trained model logged as a pickled file.

  • plots: Feature importance and visualized trees.

  • train: Training metrics.

  • valid: Validation metrics.

Note that all the metadata above is logged under the common namespace "training". You can change this namespace when creating the neptune_callback, like this:

core code

import neptune.new as neptune
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init(project="my_workspace/my_project")

# Create neptune callback with custom base_namespace
neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="my_custom_name",
)
full script

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/xgboost-integration",
    api_token="ANONYMOUS",
    name="xgb-train",
    tags=["xgb-integration", "train"],
)

# Create neptune callback
neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="my_custom_name",
    log_tree=[0, 1, 2, 3],
)

# Prepare data
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Train the model and log metadata to the run in Neptune
xgb.train(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    evals=evals,
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99 ** epoch),
        xgb.callback.EarlyStopping(rounds=30),
    ],
)

What next?

You can run the example presented above yourself or see it in Neptune.

More options

CV

You can use NeptuneCallback in the xgb.cv function:

core code

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init(project="my_workspace/my_project")

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, parameters, etc.
...

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)
full script

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/xgboost-integration",
    api_token="ANONYMOUS",
    name="xgb-cv",
    tags=["xgb-integration", "cv"],
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    nfold=7,
    callbacks=[neptune_callback],
)

Read the docstrings of NeptuneCallback to learn more about its parameters.

In the snippet above, you:

  • import NeptuneCallback, which handles the metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the xgb.cv function.

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison and sharing:

python main.py

In Neptune it will look similar to this:

Example dashboard with cross-validation results.

More about this run

The run above has the following metadata:

  • epoch: Epochs (visualized as a chart from 1 to the last epoch).

  • learning_rate: Learning rate visualized as a chart.

  • test: Test metrics.

  • train: Train metrics.

Also, this run has fold_0 to fold_6 namespaces. This is because for each fold of the cross-validation (7-fold CV in this case), Neptune logs the following metadata:

fold_N
|—— booster_config
|—— pickled_model
|—— plots
     |—— importance
     |—— trees

  • booster_config: All parameters for the booster.

  • pickled_model: Trained model logged as a pickled file.

  • plots: Feature importance and visualized trees.
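
Once the run has finished, you can fetch per-fold artifacts through these paths. A minimal sketch (run["..."].download() is the standard neptune.new way to retrieve file fields; the destination filename is just an illustration):

# Fetch the pickled model logged for the first CV fold
run["fold_0/pickled_model"].download(destination="fold_0_model.pkl")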

What next?

You can run the example presented above yourself or see it in Neptune.

Scikit-learn API

You can use NeptuneCallback with the XGBoost sklearn API:

core code

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init(project="my_workspace/my_project")

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, parameters, etc.
...

# Create regressor object
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)
full script

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/xgboost-integration",
    api_token="ANONYMOUS",
    name="xgb-sklearn-api",
    tags=["xgb-integration", "sklearn-api"],
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
boston = load_boston()
y = boston["target"]
X = boston["data"]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)

# Define parameters
model_params = {
    "n_estimators": 70,
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    early_stopping_rounds=30,
    eval_metric=["mae", "rmse"],
    eval_set=[(X_train, y_train), (X_test, y_test)],
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99 ** epoch),
    ],
)

Read the docstrings of NeptuneCallback to learn more about its parameters.

In the snippet above, you:

  • import NeptuneCallback, which handles the metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the .fit() method of the regressor (sklearn API).

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison and sharing:

python main.py

In Neptune it will look similar to this:

Example dashboard where the sklearn API was used.

More about the run above

The run has the following metadata:

  • booster_config: All parameters for the booster.

  • early_stopping: best_score and best_iteration (logged only if early stopping was activated).

  • epoch: Epochs (visualized as a chart from 1 to the last epoch).

  • learning_rate: Learning rate visualized as a chart.

  • pickled_model: Trained model logged as a pickled file.

  • plots: Feature importance and visualized trees.

  • validation_0: Metrics on the 1st validation set passed to the eval_set parameter of model.fit() (sklearn API).

  • validation_1: Metrics on the 2nd validation set passed to the eval_set parameter of model.fit() (sklearn API).

What next?

You can run the example presented above yourself or see it in Neptune.

Resume run

You can resume a run that you created before and continue logging to it. This comes in handy when you train an XGBoost model in multiple training sessions. Here is how to do it:
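
A minimal sketch, assuming a previously created run with the (hypothetical) ID "PRO-123" in your project; the run parameter of neptune.init is what reconnects to an existing run:

import neptune.new as neptune
from neptune.new.integrations.xgboost import NeptuneCallback

# Resume an existing run by its ID instead of creating a new one
run = neptune.init(
    project="my_workspace/my_project",
    run="PRO-123",  # hypothetical ID of the run to resume
)

# Continue logging to the same run in a new training session
neptune_callback = NeptuneCallback(run=run)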

Log other metadata

If you have other types of metadata that are not covered by this integration, you can still log them by using neptune-client directly. When you create the run, you have the my_run handle:

# Create new Neptune run
my_run = neptune.init(project="my_workspace/my_project")

You can use the my_run object to log any additional metadata.
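
For example (a minimal sketch; the field names under my_run are arbitrary, while assignment, .log(), and .upload() are standard neptune.new logging methods):

# Log a single value under an arbitrary field name
my_run["data/train_size"] = 354

# Log a series of values, one per call (e.g. inside a loop)
my_run["custom/score"].log(0.97)

# Upload a file
my_run["model/diagram"].upload("model_diagram.png")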

How to ask for help?

Please visit the Getting help page. Everything regarding support is there.

Other pages you may like

You may also find the following pages useful.