XGBoost integration guide#

Open in Colab

[Image: Custom dashboard displaying metadata logged with XGBoost]

XGBoost is an optimized distributed gradient boosting library that implements machine learning algorithms under the Gradient Boosting framework. With the Neptune-XGBoost integration, the following metadata is logged automatically:

  • Metrics
  • Parameters
  • The pickled model
  • The feature importance chart
  • Visualized trees
  • Hardware consumption metrics
  • stdout and stderr streams
  • Training code and Git information

See in Neptune  Code examples 

Before you start#

  • Sign up at neptune.ai/register.
  • Create a project for storing your metadata.
  • Ensure that you have at least version 1.3.0 of XGBoost installed:

    pip install "xgboost>=1.3.0"
    
    conda install -c conda-forge "xgboost>=1.3.0"
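    
    To confirm that the installed version meets the requirement, you can run a quick check in Python:
    
    import xgboost
    print(xgboost.__version__)  # should print 1.3.0 or higher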
    

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip
pip install -U neptune-xgboost
conda
conda install -c conda-forge neptune-xgboost

To install both Neptune and the integration:

pip
pip install -U "neptune[xgboost]"
conda
conda install -c conda-forge neptune neptune-xgboost
Passing your Neptune credentials

Once you've registered and created a project, set your Neptune API token and full project name to the NEPTUNE_API_TOKEN and NEPTUNE_PROJECT environment variables, respectively.

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

export NEPTUNE_PROJECT="ml-team/classification"

Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Edit project details.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"


Although it's not recommended (especially for the API token), you can also pass your credentials in the code when initializing Neptune.

import neptune

run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)

For more help, see Set Neptune credentials.

If you want to log visualized trees after training (recommended), additionally install Graphviz:

pip
pip install -U graphviz
conda
conda install -c conda-forge python-graphviz

Note

This installs only the pure-Python interface to the Graphviz software. You need to install Graphviz itself separately.

For installation help, see the Graphviz documentation.
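
To check that both the Python package and the Graphviz binaries are available, you can render a trivial graph. A quick sanity check:

import graphviz

# Rendering fails with ExecutableNotFound if the Graphviz binaries are missing from PATH
dot = graphviz.Digraph()
dot.edge("A", "B")
dot.pipe(format="svg")
print("Graphviz version:", graphviz.version())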

If you'd rather follow the guide without any setup, you can run the example in Colab.

XGBoost logging example#

This example walks you through logging metadata as you train your model with XGBoost.

You can log metadata during training with NeptuneCallback.

Logging metadata during training#

  1. Start a run:

    import neptune
    run = neptune.init_run() # (1)!
    
    1. If you haven't set up your credentials, you can log anonymously:

      neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/xgboost-integration",
      )
      
  2. Initialize the Neptune callback:

    from neptune.integrations.xgboost import NeptuneCallback
    neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])
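
    The log_tree argument takes the indices of the trees to visualize (here, the first four trees).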
    
  3. Prepare your data, parameters, and so on.
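
    For example, a minimal sketch (the dataset and parameter values here are illustrative, mirroring the full example later in this guide):

    import xgboost as xgb
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    # Load a sample regression dataset and wrap it in DMatrix objects
    X, y = fetch_california_housing(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dval = xgb.DMatrix(X_test, label=y_test)

    # Booster parameters (values are illustrative)
    model_params = {"eta": 0.3, "max_depth": 6, "objective": "reg:squarederror"}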

  4. Pass the callback to the train() function and train the model:

    xgb.train(
        params=model_params,
        dtrain=dtrain,
        callbacks=[neptune_callback],
    )
    
  5. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
  6. Run your script as you normally would.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/common/xgboost-integration/e/XGBOOST-84

Exploring results in Neptune#

In the run view, you can see the logged metadata organized into folder-like namespaces.

Name            Description
booster_config  All parameters for the booster.
early_stopping  best_score and best_iteration (logged if early stopping was activated).
epoch           Epochs, visualized as a chart from first to last epoch.
learning_rate   Learning rate, visualized as a chart.
pickled_model   Trained model, logged as a pickled file.
plots           Feature importance and visualized trees.
train           Training metrics.
valid           Validation metrics.

See example in Neptune 
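
If you want to read any of this metadata back programmatically, you can reopen the run in read-only mode and query fields by path. A minimal sketch (the paths assume the default training base namespace):

import neptune

# Reopen an existing run in read-only mode and query fields by path
run = neptune.init_run(
    project="common/xgboost-integration",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    with_id="XGBOOST-84",  # ID of the run to read
    mode="read-only",
)
booster_config = run["training/booster_config"].fetch()  # namespaces fetch as dicts
run["training/pickled_model"].download()  # saves the pickled model locally
run.stop()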

More options#

Changing the base namespace#

By default, the metadata is logged under the namespace training.

You can change the namespace when creating the Neptune callback:

neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="my_custom_name",
)
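
The metadata then appears under my_custom_name instead of training (for example, my_custom_name/train for the training metrics).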

Using Neptune callback with CV function#

You can use NeptuneCallback in the xgboost.cv function. Neptune will log additional metadata for each fold in CV.

Pass the Neptune callback to the callbacks argument of xgb.cv():

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, params, etc.
...

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Full script:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN, # (1)!
    project="common/xgboost-integration", # (2)!
    name="xgb-cv",  # optional
    tags=["xgb-integration", "cv"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"]
}
num_round = 57

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    nfold=7,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

  1. The api_token argument is included to enable anonymous logging.

    Once you register, you should leave the token out of your script and instead save it as an environment variable.

  2. Projects in the common workspace are public and can be used for testing.

    To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification".

    You can copy the name from the project details. Click the menu in the top-right corner and select Edit project details.

In the All metadata section of the run view, you can see a fold_n namespace for each fold in an n-fold CV:

fold_n
  |—— booster_config
  |—— pickled_model
  |—— plots
        |—— importance
        |—— trees

Namespaces inside the fold_n namespace:

Name            Description
booster_config  All parameters for the booster.
pickled_model   Trained model, logged as a pickled file.
plots           Feature importance and visualized trees.

See in Neptune 

Working with scikit-learn API#

You can use NeptuneCallback in the scikit-learn API of XGBoost.

Pass the Neptune callback to the fit() method of the regressor from the scikit-learn API:

import neptune
import xgboost as xgb
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
...

# Create regressor object
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Full script:

import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

import neptune
from neptune.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN, # (1)!
    project="common/xgboost-integration", # (2)!
    name="xgb-sklearn-api",  # optional
    tags=["xgb-integration", "sklearn-api"],  # optional
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
data = fetch_california_housing()
y = data["target"]
X = data["data"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Define parameters
model_params = {
    "n_estimators": 70,
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}

reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    early_stopping_rounds=30,
    eval_metric=["mae", "rmse"],
    eval_set=[(X_train, y_train), (X_test, y_test)],
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99**epoch),
    ],
)

# Stop run
run.stop()
  1. The api_token argument is included to enable anonymous logging.

    Once you register, you should leave the token out of your script and instead save it as an environment variable.

  2. Projects in the common workspace are public and can be used for testing.

    To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification".

    You can copy the name from the project details. Click the menu in the top-right corner and select Edit project details.
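
Note that in XGBoost 1.6 and later, fit() parameters such as callbacks, eval_metric, and early_stopping_rounds are deprecated in favor of the estimator constructor. A sketch of the equivalent setup (reusing the names from the example above):

# Pass the callback and early stopping to the constructor instead of fit()
reg = xgb.XGBRegressor(
    **model_params,
    early_stopping_rounds=30,
    callbacks=[neptune_callback],
)
reg.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
)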

The following new namespaces appear in the run metadata:

Name          Description
validation_0  Metrics on the first validation set passed to the eval_set parameter of the fit() method.
validation_1  Metrics on the second validation set passed to the eval_set parameter of the fit() method.

See in Neptune 

Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library.

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics inside loops
for epoch in range(n_epochs):
    # Your training loop

    run["train/epoch/loss"].append(loss)  # Each append() call appends a value
    run["train/epoch/accuracy"].append(acc)

# Track artifact versions and metadata
run["train/images"].track_files("./datasets/images")

# Upload entire files
run["test/preds"].upload("path/to/test_preds.csv")

# Log text or other metadata, in a structure of your choosing
run["tokenizer"] = "regexp_tokenize"
