LightGBM integration guide#

Custom dashboard displaying metadata logged with LightGBM

LightGBM is a gradient-boosting framework that uses tree-based learning algorithms. With the Neptune-LightGBM integration, the following metadata is logged automatically:

  • Training and validation metrics
  • Parameters
  • Feature names, num_features, and num_rows for the train set
  • Hardware consumption metrics
  • stdout and stderr streams
  • Training code and Git commit information

You can also log the trained LightGBM booster summary, which can contain:

  • The pickled model
  • The feature importance chart (gain and split)
  • Visualized trees
  • Trees saved as DataFrames
  • Confusion matrix (for classification problems)

Before you start#

Installing the integration#

To use your preinstalled version of Neptune together with the integration:

pip install -U neptune-lightgbm

To install both Neptune and the integration:

pip install -U "neptune[lightgbm]"

Passing your Neptune credentials

Once you've registered and created a project, store your Neptune API token in the NEPTUNE_API_TOKEN environment variable and your full project name in the NEPTUNE_PROJECT environment variable.

export NEPTUNE_API_TOKEN="h0dHBzOi8aHR0cHM.4kl0jvYh3Kb8...6Lc"

To find your API token: In the bottom-left corner of the Neptune app, expand the user menu and select Get my API token.

export NEPTUNE_PROJECT="ml-team/classification"

Your full project name has the form workspace-name/project-name. You can copy it from the project settings: Click the menu in the top-right → Details & privacy.

On Windows, navigate to Settings → Edit the system environment variables, or enter the following in Command Prompt: setx SOME_NEPTUNE_VARIABLE "some-value"
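
Before starting a run, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name missing_neptune_credentials is hypothetical and not part of the Neptune API.

```python
import os

# The two variable names described above.
REQUIRED_VARS = ("NEPTUNE_API_TOKEN", "NEPTUNE_PROJECT")


def missing_neptune_credentials(env=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_neptune_credentials()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Neptune credentials found in the environment.")
```

If the list is empty, calling neptune.init_run() without arguments should pick up the credentials from the environment.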


Although it's not recommended, especially for the API token, you can also pass your credentials in the code when initializing Neptune.

run = neptune.init_run(
    project="ml-team/classification",  # your full project name here
    api_token="h0dHBzOi8aHR0cHM6Lkc78ghs74kl0jvYh...3Kb8",  # your API token here
)

For more help, see Set Neptune credentials.

To log visualized trees after training (recommended), additionally install Graphviz:

pip install -U graphviz

Note

The above installation is only for the pure Python interface to the Graphviz software. You need to install Graphviz separately.

For installation help, see the Graphviz documentation.
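
Because the Python package and the Graphviz software are installed separately, a quick check of both can save a failed run. This is a rough sketch using only the standard library; the function name graphviz_status is hypothetical, and it assumes the Graphviz executable is exposed on your PATH as `dot`.

```python
import importlib.util
import shutil


def graphviz_status():
    """Report whether the Python package and the `dot` executable are found.

    Hypothetical helper: it only checks discoverability, not versions.
    """
    return {
        # True if `import graphviz` would find the pure Python interface.
        "python_package": importlib.util.find_spec("graphviz") is not None,
        # True if the Graphviz `dot` binary is on the PATH.
        "dot_executable": shutil.which("dot") is not None,
    }


if __name__ == "__main__":
    for component, found in graphviz_status().items():
        print(f"{component}: {'found' if found else 'missing'}")
```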

Quickstart#

Tip

This section is for LightGBM users who are familiar with Neptune and LightGBM callbacks.

The integration has two core components:

  • NeptuneCallback for logging metadata during training, such as parameters and metrics.
  • create_booster_summary() for logging additional metadata after training, such as visualizations and the pickled model.

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import (
    NeptuneCallback, create_booster_summary
)

# Create run
run = neptune.init_run() # (1)!

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
...

# Pass the callback to the train function and train the model
gbm = lgb.train(params, lgb_train, callbacks=[neptune_callback])

# Compute test predictions
y_pred = ...

# Log summary metadata under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

# When done logging, stop the run
run.stop()

  1. If you haven't set up your credentials, you can log anonymously:

    neptune.init_run(
        api_token=neptune.ANONYMOUS_API_TOKEN,
        project="common/lightgbm-integration",
    )
    
import lightgbm as lgb
import neptune
import numpy as np
from neptune.integrations.lightgbm import (
    NeptuneCallback, create_booster_summary
)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    project="common/lightgbm-integration",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    name="train-cls",
    tags=["lgbm-integration", "train", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

y_pred = np.argmax(gbm.predict(X_test), axis=1)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

# When done logging, stop the run
run.stop()
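
In the full script, the multiclass booster's predict() call returns one row of class probabilities per sample, and np.argmax(..., axis=1) turns each row into a single predicted label before it is passed as y_pred. The sketch below shows the same step in plain Python on made-up probabilities; the helper argmax_rows and the toy values are illustrative only.

```python
def argmax_rows(probabilities):
    """Return, for each row, the index of its largest value.

    Plain-Python equivalent of np.argmax(probabilities, axis=1).
    """
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Made-up probabilities for three samples and three classes.
toy_probabilities = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

labels = argmax_rows(toy_probabilities)
print(labels)
```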


Full walkthrough#

This example walks you through logging metadata as you train your model with LightGBM.

You can log metadata during training with NeptuneCallback, and after training with the create_booster_summary() function.

Logging metadata during training#

  1. Start a run:

    import neptune
    
    run = neptune.init_run() # (1)!
    
    1. If you haven't set up your credentials, you can log anonymously:

      neptune.init_run(
          api_token=neptune.ANONYMOUS_API_TOKEN,
          project="common/lightgbm-integration",
      )
      
  2. Initialize the Neptune callback:

    from neptune.integrations.lightgbm import NeptuneCallback
    
    neptune_callback = NeptuneCallback(run=run)
    
  3. Pass the callback to the train() function and train the model:

    gbm = lgb.train(params, lgb_train, callbacks=[neptune_callback])
    
  4. To stop the connection to Neptune and sync all data, call the stop() method:

    run.stop()
    
  5. Run your script as you normally would.

    To open the run, click the Neptune link that appears in the console output.

    Example link: https://app.neptune.ai/common/lightgbm-integration/e/LGBM-85

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback

run = neptune.init_run()
neptune_callback = NeptuneCallback(run=run)

# Prepare data, parameters, etc.
...

gbm = lgb.train(params, lgb_train, callbacks=[neptune_callback])

run.stop()

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create Neptune run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN, # (1)!
    project="common/lightgbm-integration", # (2)!
    name="train-cls",  # optional
    tags=["lgbm-integration", "train", "cls"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

run.stop()

  1. The api_token argument is included to enable anonymous logging.

    Once you've registered, leave the token out of your script and instead save it as an environment variable.

  2. Projects in the common workspace are public and can be used for testing.

    To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification".

    You can copy the name from the project details (Details & privacy).

Exploring results in Neptune#

In the run view, you can see the logged metadata organized into folder-like namespaces.

Name            Description
feature_names   Names of features in the train set.
monitoring      Hardware monitoring charts, stdout, and stderr.
params          LightGBM model parameters.
source_code     Python sources associated with this run.
sys             Basic run metadata, like creation time, tags, description, and owner.
train_set       num_features and num_rows in the train set.
training        Training metrics.
validation      Validation metrics.

Logging booster summary after training#

To log additional metadata that describes the trained model, you can use the create_booster_summary() function.

To have all the data in the same place, you can use the Neptune callback and create the booster summary in the same script. This way, you'll log all metadata to the same run in Neptune.

Related

You can also resume logging to a previously created run by passing the run ID to the initialization function: neptune.init_run(with_id="CLS-13").

To learn more, see Resume a run.

In the snippet below, we train the model and log summary information after training:

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import create_booster_summary

# Create new run
run = neptune.init_run()

# Prepare data and parameters
...

# Train the model
gbm = lgb.train(params, lgb_train)

# Compute test predictions
y_pred = ...

# Log summary metadata under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

run.stop()

import lightgbm as lgb
import neptune
import numpy as np
from neptune.integrations.lightgbm import (
    NeptuneCallback, create_booster_summary
)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    project="common/lightgbm-integration",
    api_token=neptune.ANONYMOUS_API_TOKEN,
    name="train-cls",
    tags=["lgbm-integration", "train", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

y_pred = np.argmax(gbm.predict(X_test), axis=1)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

run.stop()

The create_booster_summary() function returns a regular Python dictionary that can be directly assigned to a namespace in the run. This keeps all the summary metadata – like visualizations and the pickled model – under a common path.

The script is now ready to run with the additional metadata logging. To view the run in Neptune, click the Neptune app link in the console output.

This run has one extra path – lgbm_summary – with the following metadata organization:

lgbm_summary
    |—— pickled_model
    |—— trees_as_dataframe
    |—— visualizations
        |—— confusion_matrix
        |—— trees
        |—— feature_importances
            |—— gain
            |—— split
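
The layout above can be sketched with a plain nested dictionary: each level of nesting becomes one segment of a slash-separated field path. The helper flatten_paths is hypothetical, and the placeholder strings stand in for the real files and images that create_booster_summary() produces.

```python
def flatten_paths(tree, prefix=""):
    """Yield slash-separated paths for every leaf of a nested dictionary."""
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into sub-namespaces.
            yield from flatten_paths(value, path)
        else:
            yield path


# Placeholder values only; the real summary holds files, images, and tables.
summary = {
    "pickled_model": "<file>",
    "trees_as_dataframe": "<table>",
    "visualizations": {
        "confusion_matrix": "<image>",
        "trees": "<images>",
        "feature_importances": {"gain": "<image>", "split": "<image>"},
    },
}

paths = list(flatten_paths({"lgbm_summary": summary}))
for path in paths:
    print(path)
```

A path printed here, such as lgbm_summary/visualizations/feature_importances/gain, corresponds to the location of that field in the run.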

The lgbm_summary namespace contains the following:

Name                 Description
pickled_model        Pickled model (booster).
trees_as_dataframe   Trees represented as a DataFrame. Learn more in the LightGBM docs.
confusion_matrix     Confusion matrix for test data logged as image.
trees                Selected trees visualized as graphs.
gain                 Model's feature importances (total gains of splits that use the feature).
split                Model's feature importances (number of times the feature is used in a model).
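
To make the difference between the two importance types concrete, here is a toy calculation in plain Python. The split records and feature names are made up for illustration; in practice, LightGBM computes these values from the trained trees.

```python
from collections import defaultdict

# Hypothetical split records: (feature used by the split, gain of the split).
splits = [
    ("pixel_3", 4.0),
    ("pixel_7", 1.5),
    ("pixel_3", 2.5),
    ("pixel_1", 0.5),
]


def importances(split_records):
    """Return (split counts, total gains) per feature."""
    split_count = defaultdict(int)
    total_gain = defaultdict(float)
    for feature, gain in split_records:
        split_count[feature] += 1    # "split": how often the feature is used
        total_gain[feature] += gain  # "gain": summed gain of those splits
    return dict(split_count), dict(total_gain)


split_importance, gain_importance = importances(splits)
print(split_importance)
print(gain_importance)
```

A feature can rank differently under the two measures: one that is used in many low-gain splits scores high on split but low on gain.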

More options#

Using Neptune callback with CV function#

You can use NeptuneCallback in the lightgbm.cv function.

Pass the Neptune callback to the callbacks argument of lgb.cv():

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback

# Create run
run = neptune.init_run()

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, etc.
...

# Pass the callback to the CV function
gbm_cv = lgb.cv(params, lgb_train, callbacks=[neptune_callback])

# Stop run
run.stop()

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import NeptuneCallback
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN, # (1)!
    project="common/lightgbm-integration", # (2)!
    name="cv-cls",  # optional
    tags=["lgbm-integration", "cv", "cls"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Run CV
gbm_cv = lgb.cv(
    params,
    lgb_train,
    num_boost_round=200,
    nfold=7,
    callbacks=[neptune_callback],
)

# Stop run
run.stop()

  1. The api_token argument is included to enable anonymous logging.

    Once you've registered, leave the token out of your script and instead save it as an environment variable.

  2. Projects in the common workspace are public and can be used for testing.

    To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification".

    You can copy the name from the project details (Details & privacy).
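
lgb.cv() returns a dictionary that maps metric names to per-round lists of values, so a common follow-up is to pick the round with the lowest mean loss. The sketch below assumes a key named "valid multi_logloss-mean"; the exact key name depends on your LightGBM version, and the numbers are made up for illustration. The helper best_round is hypothetical.

```python
def best_round(cv_results, key):
    """Return (1-based round, value) with the lowest value for a loss-like metric."""
    values = cv_results[key]
    best_index = min(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up values; inspect gbm_cv.keys() to find the key names in your version.
toy_results = {"valid multi_logloss-mean": [1.90, 1.42, 1.10, 1.13, 1.21]}

round_number, loss = best_round(toy_results, "valid multi_logloss-mean")
print(round_number, loss)
```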

Working with scikit-learn API#

You can use NeptuneCallback and create_booster_summary() in the scikit-learn API of LightGBM:

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import (
    NeptuneCallback, create_booster_summary
)

# Create run
run = neptune.init_run()

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data, params, and create instance of the classifier object
...
gbm = lgb.LGBMClassifier(**params)

# Fit model and log metadata
gbm.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)

# Compute test predictions
y_pred = ...

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

# Stop run
run.stop()

import lightgbm as lgb
import neptune
from neptune.integrations.lightgbm import (
    NeptuneCallback, create_booster_summary
)
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init_run(
    api_token=neptune.ANONYMOUS_API_TOKEN, # (1)!
    project="common/lightgbm-integration", # (2)!
    name="sklearn-api-cls",  # optional
    tags=["lgbm-integration", "sklearn-api", "cls"],  # optional
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=123,
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
    "n_estimators": 207,
}

# Create instance of the classifier object
gbm = lgb.LGBMClassifier(**params)

# Fit model and log metadata
gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_names=["training", "validation"],
    eval_metric=["multi_logloss", "multi_error"],
    callbacks=[neptune_callback],
)

y_pred = gbm.predict(X_test)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

# Stop run
run.stop()

  1. The api_token argument is included to enable anonymous logging.

    Once you've registered, leave the token out of your script and instead save it as an environment variable.

  2. Projects in the common workspace are public and can be used for testing.

    To log to your own workspace, pass the full name of your Neptune project: workspace-name/project-name. For example, project="ml-team/classification".

    You can copy the name from the project details (Details & privacy).

Related