LightGBM

Learn how to log LightGBM metadata to Neptune

What will you get with this integration?

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. The Neptune + LightGBM integration lets you:

  1. automatically log many types of metadata during training,

  2. log a model summary after training.

Automatically log metadata during training

What is logged?

  • training and validation metrics

  • parameters

  • feature names, num_features, and num_rows for the train_set

  • hardware consumption (CPU, GPU, memory)

  • stdout and stderr logs

  • training code and Git commit information

Example dashboard with train-valid metrics and selected parameters

Log model summary after training

You can also log a summary of the trained LightGBM booster to Neptune. The summary can include:

  • pickled model

  • feature importance charts (gain and split)

  • visualized trees

  • trees saved as DataFrame

  • confusion matrix (only for classification problems)

Example dashboard with model summary

Where to start?

To get started with this integration, follow the quickstart below (recommended). If you are an experienced LightGBM user, you can check the TL;DR section, which gives fast-track information on how the integration works.

If you want to try it out now, you can either:

Quickstart

This quickstart will show you how to:

  • install required libraries,

  • log metadata during training (metrics, parameters, etc.),

  • log booster summary (visualizations, confusion matrix, pickled model, etc.) after training,

  • check results in the Neptune app.

At the end of this quickstart, you will be able to add Neptune to your LightGBM scripts and use it in your experimentation.

Install requirements

Before you start, make sure that:

Install neptune-client, lightgbm and neptune-lightgbm

Depending on your operating system, open a terminal or CMD and run the command for your package manager. All required libraries are available via pip and conda:

pip:
pip install neptune-client lightgbm neptune-lightgbm

conda:
conda install -c conda-forge neptune-client lightgbm neptune-lightgbm

This integration is tested with lightgbm==3.2.1, neptune-lightgbm==0.9.10, and neptune-client==0.9.16.

For more help with the neptune-client installation, check:

Install psutil (optional, you can skip it)

If you want hardware monitoring logged (recommended), you should additionally install psutil.

pip:
pip install psutil

conda:
conda install psutil

Install graphviz (optional, you can skip it)

If you want to log visualized trees after training (recommended), you need to install graphviz.

The installation below covers only the pure Python interface to the graphviz software. You need to install the graphviz binaries separately; check the graphviz docs for installation help.

pip:
pip install graphviz

conda:
conda install -c conda-forge python-graphviz
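
After installing, you can sanity-check that both the Python package and the system binaries are in place. A minimal sketch (it renders a trivial graph in memory and raises graphviz.ExecutableNotFound if the binaries are missing):

import graphviz

# Rendering needs the system graphviz binaries, not just the Python package
dot = graphviz.Digraph()
dot.edge("train", "predict")
dot.pipe(format="svg")  # raises graphviz.ExecutableNotFound if "dot" is not on PATH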

Log metadata during training

To start logging metadata (metrics, parameters, etc.) during training, you need to use NeptuneCallback.

core code:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback

# Create run
my_run = neptune.init(project="my_workspace/my_project")

# Create Neptune callback
neptune_callback = NeptuneCallback(run=my_run)

# Prepare data, params, etc.
...

# Pass the callback to the train function and train the model
gbm = lgb.train(
    params,
    lgb_train,
    callbacks=[neptune_callback],
)
full script:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/lightgbm-integration",
    api_token="ANONYMOUS",
    name="train-cls",
    tags=["lgbm-integration", "train", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

Read the docstring of NeptuneCallback to learn more about its parameters; a hedged example of one such parameter follows the list below.

In the snippet above, you:

  • import NeptuneCallback, which handles metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the train function.
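
If you want more control over where the callback writes, some versions of neptune-lightgbm accept a base_namespace argument that prefixes everything the callback logs. A minimal sketch (base_namespace is an assumption here; confirm it in the docstring of your installed version):

# Assumption: base_namespace may not be available in every neptune-lightgbm version
neptune_callback = NeptuneCallback(run=my_run, base_namespace="lgbm_training")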

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison, and sharing:

python main.py

In the Neptune app, it will look similar to this:

Logged metadata include parameters and train/valid metrics.

The run above contains the following namespaces:

  • feature names: names of features in the train set.

  • monitoring: hardware monitoring charts, plus stdout and stderr logs.

  • params: LightGBM model parameters.

  • source_code: Python sources associated with this run.

  • sys: basic run metadata, like creation time, tags, run owner, etc.

  • train_set: num_features and num_rows for the train set.

  • training: training metrics.

  • validation: validation metrics.

What next?

You can run the example presented above by yourself or see it in Neptune:

Log booster summary after training

To log additional metadata that describes the trained model, you can use the create_booster_summary() function.

You can log the summary to a new run, or to the same run that you used for logging model training. The second option can be very useful, because it keeps all the information in a single run.
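
For instance, a minimal sketch of the first option, logging the summary to a separate, dedicated run (this assumes a trained gbm booster, like the ones produced in the snippets below, and uses a placeholder project name):

import neptune.new as neptune
from neptune.new.integrations.lightgbm import create_booster_summary

# Create a dedicated run just for the model summary
summary_run = neptune.init(project="my_workspace/my_project")
summary_run["lgbm_summary"] = create_booster_summary(booster=gbm)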

In the snippet below, you will train the model and log the summary information after training:

core code:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import create_booster_summary

# Create run
my_run = neptune.init(project="my_workspace/my_project")

# Prepare data, params, and train the model
...
gbm = lgb.train(params, lgb_train)

# Compute test predictions
y_pred = ...

# Log summary metadata under the "lgbm_summary" namespace
my_run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)
full script:

import lightgbm as lgb
import neptune.new as neptune
import numpy as np
from neptune.new.integrations.lightgbm import NeptuneCallback, create_booster_summary
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/lightgbm-integration",
    api_token="ANONYMOUS",
    name="train-cls",
    tags=["lgbm-integration", "train", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

y_pred = np.argmax(gbm.predict(X_test), axis=1)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

Read the docstring of create_booster_summary to learn more about its parameters.

About the snippet above

create_booster_summary() returns a regular Python dictionary that can be directly assigned to a namespace in the run. This way, you can organize the run so that all the summary metadata, like visualizations and the pickled model, sit under a common path.
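
Because it is just a dictionary, you are free to pick the target path. A minimal sketch (the "model/summary" path is an arbitrary choice):

# Any namespace path works; the dictionary's keys become sub-paths under it
run["model/summary"] = create_booster_summary(booster=gbm)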

Run the script with additional metadata logging:

python main.py

It will look like this:

Logged metadata including training summary under namespace "lgbm_summary".

More about the run above

This run has one extra path "lgbm_summary", with the following metadata organization:

lgbm_summary
|—— pickled_model
|—— trees_as_dataframe
|—— visualizations
|       |—— confusion_matrix
|       |—— trees
|—— feature_importances
        |—— gain
        |—— split

  • pickled_model: pickled model (booster).

  • trees_as_dataframe: trees represented as a DataFrame.

  • confusion_matrix: confusion matrix for the test data, logged as an image.

  • trees: selected trees visualized as graphs.

  • gain: the model's feature importances (total gain of the splits that use the feature).

  • split: the model's feature importances (number of times the feature is used in the model).

You can use both NeptuneCallback and create_booster_summary() in the same script and log all metadata to the same run in Neptune.

Stop Logging

Once you are done logging, you should stop tracking the run using the stop() method. This is needed only while logging from a notebook environment. While logging through a script, Neptune automatically stops tracking once the script has completed execution.

run.stop()

What next?

You can run the example presented above by yourself or see it in Neptune:

TL;DR for the LightGBM users

This section is for LightGBM users who are familiar with Neptune and LightGBM callbacks. If you haven't worked with Neptune or LightGBM callbacks before, jump to the quickstart.

Install requirements

pip:
pip install -q neptune-client lightgbm neptune-lightgbm psutil

conda:
conda install -c conda-forge \
    neptune-client==0.9.16 lightgbm==3.2.1 neptune-lightgbm==0.9.9 psutil==5.8.0

This integration is tested with lightgbm==3.2.1, neptune-lightgbm==0.9.12, and neptune-client==0.9.16.

Install graphviz to log visualized trees after training.

The installation below covers only the pure Python interface to the graphviz software. You need to install the graphviz binaries separately; check the graphviz docs for installation help.

pip:
pip install -q graphviz

conda:
conda install -q -c conda-forge python-graphviz

Log metadata during and after training

For metadata logging during training, use NeptuneCallback; for logging an additional model summary after training, use the create_booster_summary() function:

core code:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback, create_booster_summary

# Create run
my_run = neptune.init(project="my_workspace/my_project")

# Create Neptune callback
neptune_callback = NeptuneCallback(run=my_run)

# Prepare data, params, etc.
...

# Pass the callback to the train function and train the model
gbm = lgb.train(
    params,
    lgb_train,
    callbacks=[neptune_callback],
)

# Compute test predictions
y_pred = ...

# Log summary metadata under the "lgbm_summary" namespace
my_run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)
full script:

import lightgbm as lgb
import neptune.new as neptune
import numpy as np
from neptune.new.integrations.lightgbm import NeptuneCallback, create_booster_summary
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/lightgbm-integration",
    api_token="ANONYMOUS",
    name="train-cls",
    tags=["lgbm-integration", "train", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Train the model
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=200,
    valid_sets=[lgb_train, lgb_eval],
    valid_names=["training", "validation"],
    callbacks=[neptune_callback],
)

y_pred = np.argmax(gbm.predict(X_test), axis=1)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

Read the docstrings of NeptuneCallback and create_booster_summary to learn more about their parameters.

In the snippet above, you:

  • use NeptuneCallback to log training metadata such as parameters and metrics,

  • use create_booster_summary() to log additional metadata (visualizations, pickled model) after training is done.

Run the script to log both training and booster summary metadata:

python main.py

It will look like this:

Example dashboard with booster summary

Stop Logging

Once you are done logging, you should stop tracking the run using the stop() method. This is needed only while logging from a notebook environment. While logging through a script, Neptune automatically stops tracking once the script has completed execution.

run.stop()

What next?

You can run the example presented above by yourself or see it in Neptune:

More options

CV

You can use NeptuneCallback in the lightgbm.cv function:

core code:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback

# Create run
my_run = neptune.init(project="my_workspace/my_project")

# Create Neptune callback
neptune_callback = NeptuneCallback(run=my_run)

# Prepare data, params, etc.
...

# Pass the callback to the cv function
gbm_cv = lgb.cv(
    params,
    lgb_train,
    callbacks=[neptune_callback],
)
full script:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/lightgbm-integration",
    api_token="ANONYMOUS",
    name="cv-cls",
    tags=["lgbm-integration", "cv", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
lgb_train = lgb.Dataset(X_train, y_train)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "metric": ["multi_logloss", "multi_error"],
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
}

# Run CV
gbm_cv = lgb.cv(
    params,
    lgb_train,
    num_boost_round=200,
    nfold=7,
    callbacks=[neptune_callback],
)

Read the docstring of NeptuneCallback to learn more about its parameters.

In the snippet above, you:

  • import NeptuneCallback, which handles metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the lightgbm.cv function.

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison, and sharing:

python main.py

In Neptune, it will look similar to this:

Example dashboard with cross-validation results.

What next?

You can run the example presented above by yourself or see it in Neptune:

Scikit-learn API

You can use NeptuneCallback and create_booster_summary() when working with the scikit-learn API of LightGBM.

core code:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback, create_booster_summary

# Create run
my_run = neptune.init(project="my_workspace/my_project")

# Create Neptune callback
neptune_callback = NeptuneCallback(run=my_run)

# Prepare data, params, and create an instance of the classifier
...
gbm = lgb.LGBMClassifier(**params)

# Fit the model and log metadata
gbm.fit(
    X_train,
    y_train,
    callbacks=[neptune_callback],
)

# Compute test predictions
y_pred = ...

# Log summary metadata to the same run under the "lgbm_summary" namespace
my_run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)
full script:

import lightgbm as lgb
import neptune.new as neptune
from neptune.new.integrations.lightgbm import NeptuneCallback, create_booster_summary
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Create run
run = neptune.init(
    project="common/lightgbm-integration",
    api_token="ANONYMOUS",
    name="sklearn-api-cls",
    tags=["lgbm-integration", "sklearn-api", "cls"],
)

# Create Neptune callback
neptune_callback = NeptuneCallback(run=run)

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Define parameters
params = {
    "boosting_type": "gbdt",
    "objective": "multiclass",
    "num_class": 10,
    "num_leaves": 21,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "max_depth": 12,
    "n_estimators": 207,
}

# Create an instance of the classifier
gbm = lgb.LGBMClassifier(**params)

# Fit the model and log metadata
gbm.fit(
    X_train,
    y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    eval_names=["training", "validation"],
    eval_metric=["multi_logloss", "multi_error"],
    callbacks=[neptune_callback],
)

y_pred = gbm.predict(X_test)

# Log summary metadata to the same run under the "lgbm_summary" namespace
run["lgbm_summary"] = create_booster_summary(
    booster=gbm,
    log_trees=True,
    list_trees=[0, 1, 2, 3, 4],
    log_confusion_matrix=True,
    y_pred=y_pred,
    y_true=y_test,
)

Read the docstrings of NeptuneCallback and create_booster_summary to learn more about their parameters.

In the snippet above, you:

  • import NeptuneCallback, which handles metadata logging,

  • create a new run in Neptune,

  • pass the run object to NeptuneCallback,

  • pass the created neptune_callback to the fit() method (scikit-learn API).

At this point, your script is ready to use Neptune as a logger.

Now, you can run your script and have metadata logged to Neptune for further inspection, comparison, and sharing:

python main.py

In Neptune, it will look similar to the screen below. Note that the result is the same as with the train() function:

Example run, where sklearn API was used.

What next?

You can run the example presented above by yourself or see it in Neptune:

Resume run

You can resume a run that you created before and continue logging to it. This comes in handy when you train a LightGBM model in multiple sessions. Here is how to do it:
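
A minimal sketch (the run ID "MYPROJ-123" is a placeholder; the ID of your run is visible in the Neptune app):

import neptune.new as neptune

# Passing the ID of an existing run re-connects to it instead of creating a new one
my_run = neptune.init(
    project="my_workspace/my_project",
    run="MYPROJ-123",  # placeholder run ID
)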

Log other metadata

If you have other types of metadata that are not covered by this integration, you can still log them using neptune-client directly. When you create the run, you get the my_run handle:

# Create new Neptune run
my_run = neptune.init(project="my_workspace/my_project")

You can use the my_run object to log metadata, for example:
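
The snippets below use arbitrary field paths like "metrics/accuracy"; pick whatever structure suits your project:

# Assign a single value
my_run["data/version"] = "v1.0"

# Log a series of values, one call per step
my_run["metrics/accuracy"].log(0.97)

# Upload a file, e.g. a saved model
my_run["artifacts/model"].upload("model.pkl")

Here is more info about it: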

How to ask for help?

Please visit the Getting help page. Everything regarding support is there:

Other pages you may like

You may also find the following pages useful: