API Reference

Functions

lightshap.explain_tree

explain_tree(model, X)

Calculate TreeSHAP for XGBoost, LightGBM, and CatBoost models.

The following model types are supported:

  • xgboost.Booster
  • xgboost.XGBModel
  • xgboost.XGBRegressor
  • xgboost.XGBClassifier
  • xgboost.XGBRFClassifier
  • xgboost.XGBRFRegressor
  • lightgbm.Booster
  • lightgbm.LGBMModel
  • lightgbm.LGBMRanker
  • lightgbm.LGBMRegressor
  • lightgbm.LGBMClassifier
  • catboost.CatBoost
  • catboost.CatBoostClassifier
  • catboost.CatBoostRanker
  • catboost.CatBoostRegressor

Parameters:

Name Type Description Default
model XGBoost, LightGBM, or CatBoost model

A fitted model.

required
X array-like

The input data for which SHAP values are to be computed.

required

Returns:

Type Description
Explanation

An Explanation object.

Examples:

>>> # Example 1: XGBoost regression
>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import explain_tree
>>>
>>> import xgboost as xgb
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
...     {
...         "X1": rng.normal(0, 1, 100),
...         "X2": rng.uniform(-2, 2, 100),
...         "X3": rng.choice([0, 1, 2], 100),
...     }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> model = xgb.train({"learning_rate": 0.1}, xgb.DMatrix(X, label=y))
>>>
>>> explanation = explain_tree(model, X)
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()
>>>
>>> # Example 2: LightGBM multi-class classification
>>> import numpy as np
>>> import pandas as pd
>>> from lightgbm import LGBMClassifier
>>> from lightshap import explain_tree
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
...     {
...         "X1": rng.normal(0, 1, 100),
...         "X2": rng.uniform(-2, 2, 100),
...         "X3": rng.choice([0, 1, 2], 100),
...     }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> y = pd.cut(y, bins=3, labels=[0, 1, 2])
>>> model = LGBMClassifier(max_depth=3, verbose=-1)
>>> model.fit(X, y)
>>>
>>> # SHAP analysis
>>> explanation = explain_tree(model, X)
>>> explanation.set_output_names(["Class 0", "Class 1", "Class 2"])
>>> explanation.plot.bar()
>>> explanation.plot.scatter(which_output=0)  # Class 0

lightshap.explain_any

explain_any(predict, X, bg_X=None, bg_w=None, bg_n=200, method=None, how=None, max_iter=None, tol=0.01, random_state=None, n_jobs=1, verbose=True)

SHAP values for any model

Calculate SHAP values for any model using either Kernel SHAP or Permutation SHAP. By default, it uses Permutation SHAP for p <= 8 features and a hybrid between exact and sampling Kernel SHAP for p > 8 features.
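To make the idea of exact SHAP values concrete, the following is a minimal brute-force sketch for a single row. It is not lightshap's implementation: the helper `exact_shap_row` and the toy `predict` function are hypothetical, and the cost grows exponentially in p. Features that are "switched off" are integrated out by averaging predictions over the background rows.

```python
from itertools import combinations
from math import comb

import numpy as np


def exact_shap_row(predict, x, bg_X):
    """Brute-force Shapley values for one row x (illustration only)."""
    p = x.shape[0]

    def value(subset):
        # Features in `subset` take their values from x; all others are
        # integrated out by averaging predictions over the background rows.
        data = bg_X.copy()
        cols = list(subset)
        if cols:
            data[:, cols] = x[cols]
        return predict(data).mean()

    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for size in range(p):
            # Shapley weight |S|! (p - |S| - 1)! / p! for coalitions of this size
            weight = 1.0 / (p * comb(p - 1, size))
            for subset in combinations(others, size):
                phi[j] += weight * (value(subset + (j,)) - value(subset))
    return phi, value(())


def predict(X):
    # Hypothetical toy model: additive in feature 1, interaction between 0 and 2
    return X[:, 0] + 2 * X[:, 1] + X[:, 0] * X[:, 2]


rng = np.random.default_rng(0)
bg_X = rng.standard_normal((50, 3))
x = np.array([1.0, -0.5, 2.0])

phi, baseline = exact_shap_row(predict, x, bg_X)
prediction = predict(x[None, :])[0]
```

In this sketch, `phi.sum() + baseline` reproduces `prediction`, and the purely additive feature 1 receives twice the difference between its value and its background mean.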

Parameters:

Name Type Description Default
predict callable

A callable to get predictions, e.g., model.predict, model.predict_proba, or lambda x: scipy.special.logit(model.predict_proba(x)[:, -1]).

required
X pd.DataFrame, pl.DataFrame, or np.ndarray

Input data for which explanations are to be generated. Should contain only the p feature columns. Must be compatible with predict.

required
bg_X pd.DataFrame, pl.DataFrame, np.ndarray, or None

Background data used to integrate out "switched off" features, typically a representative sample of the training data with 100 to 500 rows. Should contain the same columns as X, and be compatible with predict. If None, up to bg_n rows of X are randomly selected.

None
bg_w pd.Series, pl.Series, np.ndarray, or None

Weights for the background data. If None, equal weights are used. If bg_X is None, bg_w must have the same length as X.

None
bg_n int

Number of rows randomly selected from X as background data when bg_X is None. Values between 50 and 500 are recommended.

200
method str or None

Either "kernel", "permutation", or None. If None, it is set to "permutation" when p <= 8, and to "kernel" otherwise.

None
how str or None

If "exact", exact SHAP values are computed. If "sampling", iterative sampling is used to approximate SHAP values. For Kernel SHAP, hybrid approaches between "sampling" and "exact" options are available: "h1" uses exact calculations for coalitions of size 1 and p-1, whereas "h2" uses exact calculations for coalitions of size 1, 2, p-2, and p-1. If None, it is set to "exact" when p <= 8. Otherwise, if method=="permutation", it is set to "sampling". For Kernel SHAP, if 8 < p <= 16, it is set to "h2", and to "h1" when p > 16.

None
max_iter int or None

Maximum number of iterations for non-exact algorithms. Each iteration represents a forward and backward pass through a random permutation. For Permutation SHAP, one iteration evaluates Shapley's formula 2*p times (twice per feature). Each of p subsequent iterations starts with a different value, which speeds up convergence. If None, it is set to 10 * p.

None
tol float

Tolerance for convergence. The algorithm stops when all estimated standard errors are less than or equal to tol * range(shap_values) for each output dimension. Not used when how=="exact".

0.01
random_state int or None

Integer random seed to initialize numpy's random generator. Required for non-exact algorithms, and to subsample the background data if bg_X is None.

None
n_jobs int

Number of parallel jobs to run via joblib. If 1, no parallelization is used. If -1, all available cores are used.

1
verbose bool

If True, prints information and shows a tqdm progress bar.

True
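The defaults and the stopping rule described in the parameter table can be summarized in a short sketch. Both helpers below, `resolve_defaults` and `has_converged`, are hypothetical and only mirror the documented behavior; they are not part of lightshap. In particular, taking the range across the features of one observation is an assumption about how range(shap_values) is computed.

```python
import numpy as np


def resolve_defaults(p, method=None, how=None, max_iter=None):
    """Hypothetical helper mirroring the documented defaults."""
    if method is None:
        method = "permutation" if p <= 8 else "kernel"
    if how is None:
        if p <= 8:
            how = "exact"
        elif method == "permutation":
            how = "sampling"
        else:
            # Kernel SHAP hybrids: "h2" for 8 < p <= 16, "h1" for p > 16
            how = "h2" if p <= 16 else "h1"
    if max_iter is None:
        max_iter = 10 * p
    return method, how, max_iter


def has_converged(shap_values, standard_errors, tol=0.01):
    """Stopping rule for one observation; inputs have shape (p, n_outputs)."""
    sv = np.asarray(shap_values, dtype=float)
    se = np.asarray(standard_errors, dtype=float)
    value_range = sv.max(axis=0) - sv.min(axis=0)  # one range per output
    return bool(np.all(se <= tol * value_range))


small = resolve_defaults(p=5)
medium = resolve_defaults(p=12)
large = resolve_defaults(p=20)

sv = np.array([[1.0], [-1.0], [0.5]])  # range is 2.0, so the threshold is 0.02
done = has_converged(sv, np.full((3, 1), 0.01))
not_done = has_converged(sv, np.full((3, 1), 0.05))
```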

Returns:

Type Description
Explanation

An Explanation object.

Examples:

Example 1: Working with Numpy input

>>> import numpy as np
>>> from lightshap import explain_any
>>>
>>> # Create synthetic data
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((1000, 4))
>>>
>>> # In practice, you would use model.predict, model.predict_proba,
>>> # or a function thereof, e.g.,
>>> # lambda X: scipy.special.logit(model.predict_proba(X))
>>> def predict_function(X):
...     linear = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.5 * X[:, 3]
...     interactions = X[:, 0] * X[:, 1] - X[:, 1] * X[:, 2]
...     return (linear + interactions).reshape(-1, 1)
>>>
>>> # Explain with numpy array (no feature names initially)
>>> explanation = explain_any(
...     predict=predict_function,
...     X=X[:100],  # Explain first 100 rows
... )
>>>
>>> # Set meaningful feature names
>>> feature_names = ["temperature", "pressure", "humidity", "wind_speed"]
>>> explanation = explanation.set_feature_names(feature_names)
>>>
>>> # Generate plots
>>> explanation.plot.bar()
>>> explanation.plot.scatter(["temperature", "humidity"])
>>> explanation.plot.waterfall(row_id=0)

Example 2: Polars input with categorical features

>>> import numpy as np
>>> import polars as pl
>>> from lightshap import explain_any
>>>
>>> rng = np.random.default_rng(0)
>>> n = 800
>>>
>>> df = pl.DataFrame({
...     "age": rng.uniform(18, 80, n).round(),
...     "income": rng.exponential(50000, n).round(-3),
...     "education": rng.choice(["high_school", "college", "graduate", "phd"], n),
...     "region": rng.choice(["north", "south", "east", "west"], n),
... }).with_columns([
...     pl.col("education").cast(pl.Categorical),
...     pl.col("region").cast(pl.Categorical),
... ])
>>>
>>> # Again, in practice you would use a fitted model's predict instead
>>> def predict_function(X):
...     pred = X["age"] / 50 + X["income"] / 100_000 * (
...         1 + 0.5 * X["education"].is_in(["graduate", "phd"])
...     )
...     return pred
>>>
>>> explanation = explain_any(
...     predict=predict_function,
...     X=df[:200],  # Explain first 200 rows
...     bg_X=df[200:400],  # Pass background dataset or use (subset) of X
... )
>>>
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()

Classes

lightshap.Explanation

SHAP Explanation object that encapsulates model explanations.

The Explanation class provides a comprehensive framework for storing, analyzing, and visualizing SHAP (SHapley Additive exPlanations) values, which help interpret machine learning model predictions. This class supports both single-output and multi-output models, handles feature importance analysis, and offers various visualization methods.

The class stores SHAP values along with the associated data points, baseline values, and optionally includes standard errors, convergence indicators, and iteration counts for approximation methods. It provides methods to select subsets of the data, calculate feature importance, and create various visualizations including waterfall plots, dependence plots, summary plots, and importance plots.

Parameters:

Name Type Description Default
shap_values ndarray

numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models.

required
X pd.DataFrame, pl.DataFrame, or np.ndarray

Feature values corresponding to shap_values. The columns must be in the same order.

required
baseline float or ndarray

The baseline value(s) representing the expected model output when all features are missing. For single-output models, either a scalar or a numpy.ndarray of shape (1, ). For multi-output models, an array of shape (n_outputs,).

0.0
feature_names list or None

Feature names. If None and X is a pandas DataFrame, column names are used. If None and X is not a DataFrame, default names are generated.

None
output_names list or None

Names of the outputs for multi-output models. If None, default names are generated.

None
standard_errors ndarray or None

Standard errors of the SHAP values. Must have the same shape as shap_values, or None. Only relevant for approximate methods.

None
converged ndarray or None

Boolean array indicating the convergence status per observation. Only relevant for approximate methods.

None
n_iter ndarray or None

Number of iterations per observation. Only relevant for approximate methods.

None

Attributes:

Name Type Description
shap_values ndarray

numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models.

X DataFrame

The feature values corresponding to shap_values. Note that the index is reset to the values 0 to n_obs - 1.

baseline ndarray

Baseline value(s). Has shape (1, ) for single-output models, and shape (n_outputs, ) for multi-output models.

standard_errors ndarray or None

Standard errors of the SHAP values of the same shape as shap_values (if available).

converged ndarray or None

Convergence indicators of shape (n_obs, ) (if available).

n_iter ndarray or None

Iteration counts of shape (n_obs, ) (if available).

shape tuple

Shape of shap_values.

ndim int

Number of dimensions of the SHAP values (2 or 3).

feature_names list

Feature names.

output_names list or None

Output names for multi-output models. None for single-output models.
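The array shapes listed above fit together as in the following sketch, which uses plain NumPy with made-up numbers rather than a real Explanation object. It relies on the additivity property of SHAP values: summing over the feature axis and adding the baseline gives the explained predictions.

```python
import numpy as np

n_obs, n_features, n_outputs = 5, 3, 2
rng = np.random.default_rng(1)

# Multi-output SHAP values: one (n_obs, n_features) slice per output
shap_values = rng.normal(size=(n_obs, n_features, n_outputs))
baseline = np.array([0.2, 0.8])  # shape (n_outputs,)

# Standard errors, when available, share the shape of shap_values
standard_errors = np.zeros_like(shap_values)

# Additivity: sum over the feature axis, then add the baseline.
# The result has shape (n_obs, n_outputs).
predictions = shap_values.sum(axis=1) + baseline
```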

Examples:

>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import Explanation
>>>
>>> # Example data
>>> X = pd.DataFrame({'feature1': [1, 2, 3], 'feature2': [4, 5, 6]})
>>> shap_values = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
>>>
>>> explanation = Explanation(shap_values, X, baseline=0.5)
>>>
>>> # Waterfall plot of first observation
>>> explanation.plot.waterfall(row_id=0)

plot property

Accessor for plotting methods.

Examples:

>>> explanation.plot.bar()
>>> explanation.plot.waterfall(row_id=0)
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter(features=["feature1", "feature2"])

filter(indices)

Filter the SHAP values by array-like.

Parameters:

Name Type Description Default
indices array-like

Integer or boolean array-like to filter the SHAP values and data.

required

Returns:

Type Description
Explanation

A new Explanation object with filtered SHAP values and data.
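The two supported index types select rows in the same way that NumPy and pandas do. The sketch below uses plain arrays as stand-ins for the SHAP values and data held by an Explanation; resetting the row index afterwards is an assumption based on the description of the X attribute.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"feature1": [1, 2, 3, 4], "feature2": [5, 6, 7, 8]})
shap_values = np.array([[0.1, 0.2], [0.3, -0.4], [0.5, 0.6], [-0.7, 0.8]])

# Boolean mask: keep rows where feature1 > 2
mask = (X["feature1"] > 2).to_numpy()
shap_bool = shap_values[mask]
X_bool = X.loc[mask].reset_index(drop=True)

# Integer positions: keep rows 2 and 3 (the same rows as the mask)
idx = [2, 3]
shap_int = shap_values[idx]
X_int = X.iloc[idx].reset_index(drop=True)
```

With an Explanation object, the corresponding calls would be `explanation.filter(mask)` and `explanation.filter(idx)`.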

select_output(index)

Select specific output dimension from the SHAP values. Useful if predictions are multi-output.

Parameters:

Name Type Description Default
index int or str

Index or name of the output dimension to select.

required

Returns:

Type Description
Explanation

A new Explanation object with only the selected output.
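Conceptually, selecting an output slices the last axis of the three-dimensional SHAP array and keeps the matching baseline entry. The function below is a hypothetical stand-in written with NumPy, not the method itself; it shows how an integer position and an output name can refer to the same slice.

```python
import numpy as np


def select_output(shap_values, baseline, output_names, index):
    """Hypothetical stand-in: slice one output from 3D SHAP values."""
    if isinstance(index, str):
        index = output_names.index(index)  # resolve a name to its position
    # The result is 2D; the baseline keeps shape (1,) as for single-output models
    return shap_values[:, :, index], baseline[[index]]


rng = np.random.default_rng(2)
shap_values = rng.normal(size=(4, 3, 2))  # (n_obs, n_features, n_outputs)
baseline = np.array([0.3, 0.7])           # (n_outputs,)
output_names = ["Class 0", "Class 1"]

by_name, base_by_name = select_output(shap_values, baseline, output_names, "Class 1")
by_pos, base_by_pos = select_output(shap_values, baseline, output_names, 1)
```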

set_feature_names(feature_names)

Set feature names of 'X'.

Parameters:

Name Type Description Default
feature_names list or array-like

Feature names to set.

required

set_output_names(output_names=None)

If predictions are multi-output, set names of the additional dimension.

Parameters:

Name Type Description Default
output_names list or array-like

Output names to set.

None

set_X(X)

Set X and self.feature_names.

X is converted to pandas. String and object columns are converted to categoricals, while numeric columns are left unchanged. Other column types will raise a TypeError.

Parameters:

Name Type Description Default
X np.ndarray, pd.DataFrame, or pl.DataFrame

New data to set. Columns must match the order of SHAP values.

required
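The column conversion rules described for set_X can be sketched with pandas alone. The helper `convert_columns` is hypothetical and covers only pandas input; leaving existing categorical columns untouched is an assumption, and the conversion of NumPy or polars input to pandas is not shown.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_object_dtype, is_string_dtype


def convert_columns(df):
    """Hypothetical sketch of the documented dtype handling."""
    out = df.reset_index(drop=True).copy()
    for name in out.columns:
        col = out[name]
        # Numeric columns stay as they are; categoricals are assumed to be fine
        if isinstance(col.dtype, pd.CategoricalDtype) or is_numeric_dtype(col):
            continue
        if is_object_dtype(col) or is_string_dtype(col):
            out[name] = col.astype("category")
        else:
            raise TypeError(f"Unsupported dtype {col.dtype} in column {name!r}")
    return out


df = pd.DataFrame({"age": [30, 41], "region": ["north", "south"]})
converted = convert_columns(df)

# A datetime column is neither numeric nor string-like, so it is rejected
try:
    convert_columns(pd.DataFrame({"t": pd.to_datetime(["2024-01-01"])}))
    raised = False
except TypeError:
    raised = True
```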

importance(which_output=None)

Calculate mean absolute SHAP values for each feature (and output dimension).

Parameters:

Name Type Description Default
which_output int or str

Index or name of the output dimension to calculate importance for. If None, all outputs are considered. Only relevant for multi-output models.

None

Returns:

Type Description
Series or DataFrame

Series containing mean absolute SHAP values sorted by importance. In case of multi-output models, it returns a DataFrame, and the sort order is determined by the average importance across all outputs.
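The calculation can be reproduced with NumPy and pandas on a small hand-made array. This is a sketch of the described computation, not the method's source; the feature and output names are illustrative, and sorting in descending order is an assumption.

```python
import numpy as np
import pandas as pd

feature_names = ["X1", "X2", "X3"]
output_names = ["Class 0", "Class 1"]

# SHAP values of shape (n_obs=2, n_features=3, n_outputs=2)
shap_values = np.array(
    [
        [[0.1, -0.1], [0.5, 0.2], [-0.3, 0.0]],
        [[-0.3, 0.1], [0.7, -0.4], [0.3, 0.2]],
    ]
)

# Single output: mean absolute SHAP value per feature, most important first
imp_single = pd.Series(
    np.abs(shap_values[:, :, 0]).mean(axis=0), index=feature_names
).sort_values(ascending=False)

# All outputs: one column per output, rows ordered by the average importance
imp_multi = pd.DataFrame(
    np.abs(shap_values).mean(axis=0), index=feature_names, columns=output_names
)
order = imp_multi.mean(axis=1).sort_values(ascending=False).index
imp_multi = imp_multi.loc[order]
```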

interaction_heuristic(features=None, color_features=None)

Interaction heuristic.

For each feature/color_feature combination, the weighted average absolute Pearson correlation coefficient between the SHAP values of the feature and the values of the color_feature is calculated. The larger the value, the higher the potential interaction.

Notes:

  • Non-numeric color features are converted to numeric, which does not always make sense.
  • Missing values in the color feature are currently discarded.
  • The number of non-missing color values in each bin is used as the weight when computing the weighted average.

Parameters:

Name Type Description Default
features list

List of feature names. If None, all features are used.

None
color_features list

List of color feature names. If None, all features are used.

None

Returns:

Type Description
DataFrame

DataFrame with interaction heuristics. feature_names serve as index, and color_features as columns.
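One plausible reading of the heuristic for a single feature/color_feature pair is sketched below. The helper `interaction_score` is hypothetical, and several details are assumptions not stated above: observations are binned by quantiles of the feature, four bins are used, and the color feature is already numeric.

```python
import numpy as np
import pandas as pd


def interaction_score(shap_col, feature_col, color_col, n_bins=4):
    """Weighted average of |Pearson correlation| within bins of the feature."""
    df = pd.DataFrame({"shap": shap_col, "x": feature_col, "color": color_col})
    df = df.dropna(subset=["color"])  # missing color values are discarded
    bins = pd.qcut(df["x"], q=n_bins, duplicates="drop")

    corrs, weights = [], []
    for _, grp in df.groupby(bins, observed=True):
        # Skip bins where a correlation is undefined
        if len(grp) < 3 or grp["shap"].std() == 0 or grp["color"].std() == 0:
            continue
        corrs.append(abs(np.corrcoef(grp["shap"], grp["color"])[0, 1]))
        weights.append(len(grp))  # bin size serves as the weight
    return float(np.average(corrs, weights=weights)) if corrs else 0.0


rng = np.random.default_rng(3)
x1, x2, x3 = rng.standard_normal((3, 400))

# Made-up SHAP values of x1 that depend on x2 (an interaction) but not on x3
shap_x1 = x1 * x2

score_x2 = interaction_score(shap_x1, x1, x2)
score_x3 = interaction_score(shap_x1, x1, x3)
```

Here `score_x2` comes out clearly larger than `score_x3`, which is the pattern the heuristic is meant to flag.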