Working with XGBoost#
XGBoost is an optimized distributed gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework.
With the Neptune–XGBoost integration, the following metadata is logged automatically:
- Metrics
- Parameters
- The pickled model
- The feature importance chart
- Visualized trees
- Hardware consumption metrics
- stdout and stderr streams
- Training code and Git information
Before you start#
Tip
If you'd rather follow the guide without any setup, you can run the example in Colab.

- Set up Neptune. For instructions, see Adding Neptune to your code.
- Ensure that you have at least version 1.3.0 of XGBoost installed:
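```sh
# Upgrades XGBoost to a version compatible with the integration
pip install -U "xgboost>=1.3.0"
```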
Installing the Neptune–XGBoost integration#
On the command line or in a terminal app, such as Command Prompt, enter the following:
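```sh
# neptune-xgboost is the integration's PyPI package name
pip install -U neptune-xgboost
```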
If you want to log visualized trees after training (recommended), additionally install Graphviz:
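```sh
pip install -U graphviz
```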
Note
The command above installs only the pure Python interface to Graphviz. You need to install the Graphviz software itself separately.

For installation help, see the Graphviz documentation.
XGBoost logging example#
This example walks you through logging metadata as you train your model with XGBoost.
You can log metadata during training with `NeptuneCallback`.
Logging metadata during training#
- Start a run:
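    A minimal sketch; `init_run()` picks up your credentials from the `NEPTUNE_API_TOKEN` and `NEPTUNE_PROJECT` environment variables:

    ```python
    import neptune

    run = neptune.init_run()
    ```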
    If you haven't set up your credentials, you can log anonymously:

    ```python
    neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/xgboost-integration")
    ```
- Initialize the Neptune callback:
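    ```python
    from neptune.integrations.xgboost import NeptuneCallback

    # log_tree selects which tree indices to visualize (optional)
    neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])
    ```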
- Prepare your data, parameters, and so on. For example (a sketch using the California Housing dataset, as in the full scripts below):
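    ```python
    import xgboost as xgb
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_test, label=y_test)

    model_params = {"objective": "reg:squarederror", "eval_metric": ["mae", "rmse"]}
    evals = [(dtrain, "train"), (dval, "valid")]
    ```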
- Pass the callback to the `train()` function and train the model:
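    ```python
    # Train the booster; the callback logs metadata to the run as training progresses
    booster = xgb.train(
        params=model_params,
        dtrain=dtrain,
        evals=evals,
        callbacks=[neptune_callback],
    )
    ```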
- Run your script as you normally would.
To open the run, click the Neptune link that appears in the console output.
Example link: https://app.neptune.ai/common/xgboost-integration/e/XGBOOST-84
Stop the run when done
Once you are done logging, you should stop the Neptune run. You need to do this manually when logging from a Jupyter notebook or other interactive environment:
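```python
run.stop()
```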
If you're running a script, the connection is stopped automatically when the script finishes executing. In notebooks, however, the connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.
Exploring results in Neptune#
In the run view, you can see the logged metadata organized into folder-like namespaces.
| Name | Description |
| --- | --- |
| `booster_config` | All parameters for the booster. |
| `early_stopping` | `best_score` and `best_iteration` (logged if early stopping was activated). |
| `epoch` | Epochs (visualized as a chart from first to last epoch). |
| `learning_rate` | Learning rate visualized as a chart. |
| `pickled_model` | Trained model logged as a pickled file. |
| `plots` | Feature importance and visualized trees. |
| `train` | Training metrics. |
| `valid` | Validation metrics. |
More options#
Changing the base namespace#
By default, the metadata is logged under the `training` namespace.

You can change the namespace when creating the Neptune callback:
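A sketch, assuming the `base_namespace` argument that Neptune's integrations use for this purpose (see the API reference); the namespace name here is arbitrary:

```python
# "experiments" is an arbitrary example namespace
neptune_callback = NeptuneCallback(run=run, base_namespace="experiments")
```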
Using Neptune callback with CV function#
You can use `NeptuneCallback` in the `xgboost.cv()` function. Neptune then logs additional metadata for each fold of the cross-validation.

Pass the Neptune callback to the `callbacks` argument of `xgb.cv()`:
```python
import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, params, etc.
...

# Run cross-validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()
```
The full script below puts these pieces together:

```python
import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # (1)!
    project="common/xgboost-integration",  # (2)!
    name="xgb-cv",  # optional
    tags=["xgb-integration", "cv"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Run cross-validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    nfold=7,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()
```
1. The `api_token` argument is included to enable anonymous logging. Once you register, you should leave the token out of your script and instead save it as an environment variable.
2. Projects in the `common` workspace are public and can be used for testing. To log to your own workspace, pass the full name of your Neptune project: `workspace-name/project-name`. For example, `"ml-team/classification"`. To copy it, navigate to the project settings → Properties.
In the All metadata section of the run view, you can see a `fold_n` namespace for each fold in an n-fold CV.

Namespaces inside each `fold_n` namespace:
| Name | Description |
| --- | --- |
| `booster_config` | All parameters for the booster. |
| `pickled_model` | Trained model logged as a pickled file. |
| `plots` | Feature importance and visualized trees. |
Working with scikit-learn API#
You can use `NeptuneCallback` in the scikit-learn API of XGBoost.

Pass the Neptune callback to the `fit()` method of the regressor from the scikit-learn API:
```python
import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
...

# Create regressor object
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()
```
The full script below puts these pieces together:

```python
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

import neptune
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # (1)!
    project="common/xgboost-integration",  # (2)!
    name="xgb-sklearn-api",  # optional
    tags=["xgb-integration", "sklearn-api"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
data = fetch_california_housing()
y = data["target"]
X = data["data"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Define parameters
model_params = {
    "n_estimators": 70,
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    early_stopping_rounds=30,
    eval_metric=["mae", "rmse"],
    eval_set=[(X_train, y_train), (X_test, y_test)],
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99**epoch),
    ],
)

# Stop run
run.stop()
```
1. The `api_token` argument is included to enable anonymous logging. Once you register, you should leave the token out of your script and instead save it as an environment variable.
2. Projects in the `common` workspace are public and can be used for testing. To log to your own workspace, pass the full name of your Neptune project: `workspace-name/project-name`. For example, `"ml-team/classification"`. To copy it, navigate to the project settings → Properties.
The following new namespaces appear in the run metadata:
| Name | Description |
| --- | --- |
| `validation_0` | Metrics on the first validation set passed to the `eval_set` parameter of the `fit()` method. |
| `validation_1` | Metrics on the second validation set passed to the `eval_set` parameter of the `fit()` method. |
Manually logging metadata#
If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.
When you initialize the run, you get a `run` object, to which you can assign different types of metadata in a structure of your own choosing.
```python
import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics or other values inside loops
for epoch in range(n_epochs):
    ...  # Your training loop
    run["train/epoch/loss"].append(loss)  # Each append() appends a value
    run["train/epoch/accuracy"].append(acc)

# Upload files
run["test/preds"].upload("path/to/test_preds.csv")

# Track and version artifacts
run["train/images"].track_files("./datasets/images")

# Record numbers or text
run["tokenizer"] = "regexp_tokenize"
```
Related
- What you can log and display
- Resuming a run
- Adding Neptune to your code
- API reference ≫ XGBoost