Working with XGBoost#

Custom dashboard displaying metadata logged with XGBoost

XGBoost is an optimized distributed library that implements machine learning algorithms under the Gradient Boosting framework.

With the Neptune–XGBoost integration, the following metadata is logged automatically:

  • Metrics
  • Parameters
  • The pickled model
  • The feature importance chart
  • Visualized trees
  • Hardware consumption metrics
  • stdout and stderr streams
  • Training code and Git information

Before you start#

Tip

If you'd rather follow the guide without any setup, you can run the example in Colab.

Installing the Neptune–XGBoost integration#

On the command line or in a terminal app, such as Command Prompt, enter the following:

# With pip
pip install neptune-xgboost

# With conda
conda install -c conda-forge neptune-xgboost

If you want to log visualized trees after training (recommended), additionally install Graphviz:

# With pip
pip install graphviz

# With conda
conda install -c conda-forge python-graphviz

Note

The above installation is only for the pure Python interface to the Graphviz software. You need to install Graphviz separately.

For installation help, see the Graphviz documentation.
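
Because the Python package only wraps the Graphviz executables, it can help to confirm that the system-level install is visible before you train. The following is a minimal sketch using only the standard library. It assumes the `dot` executable is the one that needs to be on your PATH, and the helper name `graphviz_binary_available` is made up for illustration.

```python
import shutil


def graphviz_binary_available(executable: str = "dot") -> bool:
    """Return True if the given Graphviz executable is found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if graphviz_binary_available():
        print("Graphviz found: tree visualizations can be rendered.")
    else:
        print("Graphviz not found: install it before logging trees.")
```

If the check fails, installing the Python package again will not help; the Graphviz software itself is missing or not on the PATH.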

XGBoost logging example#

This example walks you through logging metadata as you train your model with XGBoost.

You can log metadata during training with NeptuneCallback.

Logging metadata during training#

  1. Start a run:

    import neptune.new as neptune
    run = neptune.init_run()  # (1)
    
    1. If you haven't set up your credentials, you can log anonymously: neptune.init_run(api_token=neptune.ANONYMOUS_API_TOKEN, project="common/xgboost-integration")
  2. Initialize the Neptune callback:

    from neptune.new.integrations.xgboost import NeptuneCallback
    neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])
    
  3. Prepare your data, parameters, and so on.

  4. Pass the callback to the train() function and train the model:

    import xgboost as xgb

    xgb.train(
        params=model_params,
        dtrain=dtrain,
        callbacks=[neptune_callback],
    )
    
  5. Run your script as you normally would.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/common/xgboost-integration/e/XGBOOST-84

Stop the run when done

Once you are done logging, you should stop the Neptune run. You need to do this manually when logging from a Jupyter notebook or other interactive environment:

run.stop()

If you're running a script, the connection is stopped automatically when the script finishes executing. In notebooks, however, the connection to Neptune is not stopped when the cell has finished executing, but rather when the entire notebook stops.
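
In interactive sessions it is easy to forget the final `run.stop()` call, especially when a cell raises an exception. One possible way to guard against that is a small context manager that always stops the run on exit. This is a sketch, not part of the integration: `stopping` and `FakeRun` are hypothetical names, and `FakeRun` only stands in for a real run object so that the example works offline.

```python
from contextlib import contextmanager


@contextmanager
def stopping(run):
    """Yield the run and always call run.stop() afterwards, even on errors."""
    try:
        yield run
    finally:
        run.stop()


class FakeRun:
    """Hypothetical stand-in for a Neptune run, so the sketch runs offline."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


fake_run = FakeRun()
with stopping(fake_run) as run:
    pass  # Training and logging would happen here

print(fake_run.stopped)
```

With a real run, you would pass the object returned by `neptune.init_run()` instead of `FakeRun()`.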

Exploring results in Neptune#

In the run view, you can see the logged metadata organized into folder-like namespaces.

Name Description
booster_config All parameters for the booster.
early_stopping best_score and best_iteration (logged if early stopping was activated)
epoch Epochs (visualized as a chart from first to last epoch).
learning_rate Learning rate visualized as a chart.
pickled_model Trained model logged as a pickled file.
plots Feature importance and visualized trees.
train Training metrics.
valid Validation metrics.

More options#

Changing the base namespace#

By default, the metadata is logged under the namespace training.

You can change the namespace when creating the Neptune callback:

neptune_callback = NeptuneCallback(
    run=run,
    base_namespace="my_custom_name",
)
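
To see what changing the base namespace means for field paths, the sketch below builds the paths you would expect for a few of the namespaces listed in the table above. It assumes those namespaces sit directly under the base namespace and are joined with a slash; the helper `full_path` is hypothetical and only illustrates the naming, it does not call Neptune.

```python
def full_path(base_namespace: str, field: str) -> str:
    """Join a base namespace and a field name with a single '/'."""
    return f"{base_namespace.strip('/')}/{field.strip('/')}"


fields = ["booster_config", "pickled_model", "plots"]

# Paths under the default base namespace
default_paths = [full_path("training", f) for f in fields]

# Paths after passing base_namespace="my_custom_name" to the callback
custom_paths = [full_path("my_custom_name", f) for f in fields]

for before, after in zip(default_paths, custom_paths):
    print(f"{before} -> {after}")
```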

Using Neptune callback with CV function#

You can use NeptuneCallback in the xgboost.cv function. Neptune will log additional metadata for each fold in CV.

Pass the Neptune callback to the callbacks argument of xgb.cv():

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data, params, etc.
...

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Full script:

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback
from sklearn.datasets import load_california_housing
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # (1)
    project="common/xgboost-integration",  # (2)
    name="xgb-cv",  # optional
    tags=["xgb-integration", "cv"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
X, y = load_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123
)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_test, label=y_test)

# Define parameters
model_params = {
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"]
}
evals = [(dtrain, "train"), (dval, "valid")]
num_round = 57

# Run cross validation and log metadata to the run in Neptune
xgb.cv(
    params=model_params,
    dtrain=dtrain,
    num_boost_round=num_round,
    nfold=7,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()
  1. The api_token argument is included to enable anonymous logging. Once you register, you should leave the token out of your script and instead save it as an environment variable.
  2. Projects in the common workspace are public and can be used for testing. To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, "ml-team/classification". To copy it, navigate to the project settings → Properties.
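
Once the token lives in an environment variable, the script no longer needs the api_token argument. The sketch below checks for the `NEPTUNE_API_TOKEN` variable, which the Neptune client reads when no token is passed explicitly. The helper name `neptune_credentials_configured` is hypothetical and the check itself is optional.

```python
import os


def neptune_credentials_configured(env=None) -> bool:
    """Return True if a non-empty NEPTUNE_API_TOKEN is set in the environment."""
    env = os.environ if env is None else env
    return bool(env.get("NEPTUNE_API_TOKEN"))


if __name__ == "__main__":
    if neptune_credentials_configured():
        print("Token found: neptune.init_run() can be called without api_token.")
    else:
        print("No token found: set NEPTUNE_API_TOKEN or log anonymously.")
```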

In the All metadata section of the run view, you can see a fold_n namespace for each fold in an \(n\)-fold CV:

fold_n
  |—— booster_config
  |—— pickled_model
  |—— plots
        |—— importance
        |—— trees

Namespaces inside the fold_n namespace:

Name Description
booster_config All parameters for the booster.
pickled_model Trained model logged as a pickled file.
plots Feature importance and visualized trees.

Working with scikit-learn API#

You can use NeptuneCallback in the scikit-learn API of XGBoost.

Pass the Neptune callback to the fit() method of the regressor from the scikit-learn API:

import neptune.new as neptune
import xgboost as xgb
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run()

# Create neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
...

# Create regressor object
reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

Full script:

import xgboost as xgb
from sklearn.datasets import load_california_housing
from sklearn.model_selection import train_test_split

import neptune.new as neptune
from neptune.new.integrations.xgboost import NeptuneCallback

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN,  # (1)
    project="common/xgboost-integration",  # (2)
    name="xgb-sklearn-api",  # optional
    tags=["xgb-integration", "sklearn-api"],  # optional
)

# Create neptune callback
neptune_callback = NeptuneCallback(run=run, log_tree=[0, 1, 2, 3])

# Prepare data
data = load_california_housing()
y = data["target"]
X = data["data"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Define parameters
model_params = {
    "n_estimators": 70,
    "eta": 0.7,
    "gamma": 0.001,
    "max_depth": 9,
    "objective": "reg:squarederror",
    "eval_metric": ["mae", "rmse"],
}

reg = xgb.XGBRegressor(**model_params)

# Fit the model and log metadata to the run in Neptune
reg.fit(
    X_train,
    y_train,
    early_stopping_rounds=30,
    eval_metric=["mae", "rmse"],
    eval_set=[(X_train, y_train), (X_test, y_test)],
    callbacks=[
        neptune_callback,
        xgb.callback.LearningRateScheduler(lambda epoch: 0.99**epoch),
    ],
)

# Stop run
run.stop()
  1. The api_token argument is included to enable anonymous logging. Once you register, you should leave the token out of your script and instead save it as an environment variable.
  2. Projects in the common workspace are public and can be used for testing. To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, "ml-team/classification". To copy it, navigate to the project settings → Properties.

The following new namespaces appear in the run metadata:

Name Description
validation_0 Metrics on the first validation set passed to the eval_set parameter of the fit() method.
validation_1 Metrics on the second validation set passed to the eval_set parameter of the fit() method.
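
The numbering follows the position of each dataset in eval_set, counting from zero. As a small illustration (the helper `eval_set_namespaces` is hypothetical and the tuples contain placeholder strings rather than real data), the names can be derived like this:

```python
def eval_set_namespaces(eval_set):
    """Return one 'validation_<index>' name per entry in eval_set, in order."""
    return [f"validation_{i}" for i in range(len(eval_set))]


# Placeholder entries; in the full script these are (X, y) array pairs
eval_set = [("X_train", "y_train"), ("X_test", "y_test")]
names = eval_set_namespaces(eval_set)

for name, (features, labels) in zip(names, eval_set):
    print(f"{name}: metrics computed on ({features}, {labels})")
```

In the full script above, this means validation_0 holds metrics on the training split and validation_1 holds metrics on the test split.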

Manually logging metadata#

If you have other types of metadata that are not covered in this guide, you can still log them using the Neptune client library (neptune-client).

When you initialize the run, you get a run object, to which you can assign different types of metadata in a structure of your own choosing.

import neptune.new as neptune

# Create a new Neptune run
run = neptune.init_run()

# Log metrics or other values inside loops
for epoch in range(n_epochs):
    ...  # Your training loop

    run["train/epoch/loss"].log(loss)  # Each log() appends a value
    run["train/epoch/accuracy"].log(acc)

# Upload files
run["test/preds"].upload("path/to/test_preds.csv")

# Track and version artifacts
run["train/images"].track_files("./datasets/images")

# Record numbers or text
run["tokenizer"] = "regexp_tokenize"