API Reference
Functions
lightshap.explain_tree
explain_tree(model, X)
Calculate TreeSHAP for XGBoost, LightGBM, and CatBoost models.
The following model types are supported:
- xgboost.Booster
- xgboost.XGBModel
- xgboost.XGBRegressor
- xgboost.XGBClassifier
- xgboost.XGBRFClassifier
- xgboost.XGBRFRegressor
- lightgbm.Booster
- lightgbm.LGBMModel
- lightgbm.LGBMRanker
- lightgbm.LGBMRegressor
- lightgbm.LGBMClassifier
- catboost.CatBoost
- catboost.CatBoostClassifier
- catboost.CatBoostRanker
- catboost.CatBoostRegressor
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | XGBoost, LightGBM, or CatBoost model | A fitted model. | required |
X | array-like | The input data for which SHAP values are to be computed. | required |
Returns:
Type | Description |
---|---|
Explanation | An Explanation object. |
Examples:
>>> # Example 1: XGBoost regression
>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import explain_tree
>>>
>>> import xgboost as xgb
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
... {
... "X1": rng.normal(0, 1, 100),
... "X2": rng.uniform(-2, 2, 100),
... "X3": rng.choice([0, 1, 2], 100),
... }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> model = xgb.train({"learning_rate": 0.1}, xgb.DMatrix(X, label=y))
>>>
>>> explanation = explain_tree(model, X)
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()
>>> # Example 2: LightGBM Multi-Class Classification
>>> import numpy as np
>>> import pandas as pd
>>> from lightgbm import LGBMClassifier
>>> from lightshap import explain_tree
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
... {
... "X1": rng.normal(0, 1, 100),
... "X2": rng.uniform(-2, 2, 100),
... "X3": rng.choice([0, 1, 2], 100),
... }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> y = pd.cut(y, bins=3, labels=[0, 1, 2])
>>> model = LGBMClassifier(max_depth=3, verbose=-1)
>>> model.fit(X, y)
>>>
>>> # SHAP analysis
>>> explanation = explain_tree(model, X)
>>> explanation.set_output_names(["Class 0", "Class 1", "Class 2"])
>>> explanation.plot.bar()
>>> explanation.plot.scatter(which_output=0) # Class 0
lightshap.explain_any
explain_any(predict, X, bg_X=None, bg_w=None, bg_n=200, method=None, how=None, max_iter=None, tol=0.01, random_state=None, n_jobs=1, verbose=True)
SHAP values for any model
Calculate SHAP values for any model using either Kernel SHAP or Permutation SHAP. By default, it uses Permutation SHAP for p <= 8 features and a hybrid between exact and sampling Kernel SHAP for p > 8 features.
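The defaults can be overridden via the `method` and `how` arguments described below. A minimal sketch (hedged; `predict_function` and `X` stand in for the objects defined in the examples further down):
>>> # Force degree-1 hybrid Kernel SHAP with a reproducible seed
>>> explanation = explain_any(
...     predict=predict_function,
...     X=X,
...     method="kernel",
...     how="h1",
...     max_iter=50,
...     random_state=0,
... )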
Parameters:
Name | Type | Description | Default |
---|---|---|---|
predict | callable | A callable to get predictions, e.g., `model.predict`, `model.predict_proba`, or a function thereof. | required |
X | pd.DataFrame, pl.DataFrame, or np.ndarray | Input data for which explanations are to be generated. Should contain only the p feature columns. Must be compatible with `predict`. | required |
bg_X | pd.DataFrame, pl.DataFrame, np.ndarray, or None | Background data used to integrate out "switched off" features, typically a representative sample of the training data with 100 to 500 rows. Should contain the same columns as `X`. | None |
bg_w | pd.Series, pl.Series, np.ndarray, or None | Weights for the background data. If None, equal weights are used. If `bg_X` is None, the weights are assumed to correspond to the rows of `X`. | None |
bg_n | int | If `bg_X` is None, `bg_n` rows are sampled from `X` to serve as background data. | 200 |
method | str or None | Either "kernel", "permutation", or None. If None, it is set to "permutation" when p <= 8, and to "kernel" otherwise. | None |
how | str or None | If "exact", exact SHAP values are computed. If "sampling", iterative sampling is used to approximate SHAP values. For Kernel SHAP, hybrids between "sampling" and "exact" are available: "h1" uses exact calculations for coalitions of size 1 and p-1, whereas "h2" uses exact calculations for coalitions of size 1, 2, p-2, and p-1. If None, it is set to "exact" when p <= 8. Otherwise, if method == "permutation", it is set to "sampling". For Kernel SHAP, it is set to "h2" when 8 < p <= 16, and to "h1" when p > 16. | None |
max_iter | int or None | Maximum number of iterations for non-exact algorithms. Each iteration represents a forward and backward pass through a random permutation. For Permutation SHAP, one iteration evaluates Shapley's formula 2*p times (twice per feature); subsequent iterations start at different positions for faster convergence. If None, it is set to 10 * p. | None |
tol | float | Tolerance for convergence. The algorithm stops when the estimated standard errors are all smaller than or equal to `tol`. | 0.01 |
random_state | int or None | Integer random seed to initialize numpy's random generator. Required for non-exact algorithms, and to subsample the background data if `bg_X` is None. | None |
n_jobs | int | Number of parallel jobs to run via joblib. If 1, no parallelization is used. If -1, all available cores are used. | 1 |
verbose | bool | If True, prints information and the tqdm progress bar. | True |
Returns:
Type | Description |
---|---|
Explanation | An Explanation object. |
Examples:
Example 1: Working with NumPy input
>>> import numpy as np
>>> from lightshap import explain_any
>>>
>>> # Create synthetic data
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((1000, 4))
>>>
>>> # In practice, you would use model.predict, model.predict_proba,
>>> # or a function thereof, e.g.,
>>> # lambda X: scipy.special.logit(model.predict_proba(X))
>>> def predict_function(X):
... linear = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.5 * X[:, 3]
... interactions = X[:, 0] * X[:, 1] - X[:, 1] * X[:, 2]
... return (linear + interactions).reshape(-1, 1)
>>>
>>> # Explain with numpy array (no feature names initially)
>>> explanation = explain_any(
... predict=predict_function,
... X=X[:100], # Explain first 100 rows
... )
>>>
>>> # Set meaningful feature names
>>> feature_names = ["temperature", "pressure", "humidity", "wind_speed"]
>>> explanation = explanation.set_feature_names(feature_names)
>>>
>>> # Generate plots
>>> explanation.plot.bar()
>>> explanation.plot.scatter(["temperature", "humidity"])
>>> explanation.plot.waterfall(row_id=0)
Example 2: Polars input with categorical features
>>> import numpy as np
>>> import polars as pl
>>> from lightshap import explain_any
>>>
>>> rng = np.random.default_rng(0)
>>> n = 800
>>>
>>> df = pl.DataFrame({
... "age": rng.uniform(18, 80, n).round(),
... "income": rng.exponential(50000, n).round(-3),
... "education": rng.choice(["high_school", "college", "graduate", "phd"], n),
... "region": rng.choice(["north", "south", "east", "west"], n),
... }).with_columns([
... pl.col("education").cast(pl.Categorical),
... pl.col("region").cast(pl.Categorical),
... ])
>>>
>>> # Again, in practice you would use a fitted model's predict instead
>>> def predict_function(X):
... pred = X["age"] / 50 + X["income"] / 100_000 * (
... 1 + 0.5 * X["education"].is_in(["graduate", "phd"])
... )
... return pred
>>>
>>> explanation = explain_any(
... predict=predict_function,
... X=df[:200], # Explain first 200 rows
...     bg_X=df[200:400],  # Pass background data; otherwise a subset of X is used
... )
>>>
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()
Classes
lightshap.Explanation
SHAP Explanation object that encapsulates model explanations.
The Explanation class provides a comprehensive framework for storing, analyzing, and visualizing SHAP (SHapley Additive exPlanations) values, which help interpret machine learning model predictions. This class supports both single-output and multi-output models, handles feature importance analysis, and offers various visualization methods.
The class stores SHAP values along with the associated data points, baseline values, and optionally includes standard errors, convergence indicators, and iteration counts for approximation methods. It provides methods to select subsets of the data, calculate feature importance, and create various visualizations including waterfall plots, dependence plots, summary plots, and importance plots.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
shap_values | ndarray | numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models. | required |
X | pd.DataFrame, pl.DataFrame, or np.ndarray | Feature values corresponding to `shap_values`. | required |
baseline | float or ndarray | The baseline value(s) representing the expected model output when all features are missing. For single-output models, either a scalar or a numpy.ndarray of shape (1,). For multi-output models, an array of shape (n_outputs,). | 0.0 |
feature_names | list or None | Feature names. If None and X is a pandas DataFrame, column names are used. If None and X is not a DataFrame, default names are generated. | None |
output_names | list or None | Names of the outputs for multi-output models. If None, default names are generated. | None |
standard_errors | ndarray or None | Standard errors of the SHAP values. Must have the same shape as shap_values, or None. Only relevant for approximate methods. | None |
converged | ndarray or None | Boolean array indicating the convergence status per observation. Only relevant for approximate methods. | None |
n_iter | ndarray or None | Number of iterations per observation. Only relevant for approximate methods. | None |
Attributes:
Name | Type | Description |
---|---|---|
shap_values | ndarray | numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models. |
X | DataFrame | The feature values corresponding to `shap_values`, stored as a pandas DataFrame. |
baseline | ndarray | Baseline value(s). Has shape (1,) for single-output models, and shape (n_outputs,) for multi-output models. |
standard_errors | ndarray or None | Standard errors of the SHAP values, of the same shape as `shap_values` (if available). |
converged | ndarray or None | Convergence indicators of shape (n_obs,) (if available). |
n_iter | ndarray or None | Iteration counts of shape (n_obs,) (if available). |
shape | tuple | Shape of `shap_values`. |
ndim | int | Number of dimensions of the SHAP values (2 or 3). |
feature_names | list | Feature names. |
output_names | list or None | Output names for multi-output models. None for single-output models. |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import Explanation
>>>
>>> # Example data
>>> X = pd.DataFrame({'feature1': [1, 2, 3], 'feature2': [4, 5, 6]})
>>> shap_values = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
>>>
>>> explanation = Explanation(shap_values, X, baseline=0.5)
>>>
>>> # Waterfall plot of first observation
>>> explanation.plot.waterfall(row_id=0)
plot
property
Access to the plotting methods, e.g., `explanation.plot.bar()`, `explanation.plot.beeswarm()`, `explanation.plot.scatter()`, and `explanation.plot.waterfall()`, as used in the examples above.
filter(indices)
Filter the SHAP values and data by an integer or boolean array-like.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices | array-like | Integer or boolean array-like to filter the SHAP values and data. | required |
Returns:
Type | Description |
---|---|
Explanation | A new Explanation object with filtered SHAP values and data. |
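Example (a minimal sketch, assuming an `explanation` object built as in the examples above):
>>> import numpy as np
>>> # Keep only the first 10 observations
>>> subset = explanation.filter(np.arange(10))
>>> # Or keep rows via a boolean mask, e.g., observations where the first
>>> # feature's SHAP value is positive (single-output case)
>>> subset = explanation.filter(explanation.shap_values[:, 0] > 0)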
select_output(index)
Select a specific output dimension from the SHAP values. Useful if predictions are multi-output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or str | Index or name of the output dimension to select. | required |
Returns:
Type | Description |
---|---|
Explanation | A new Explanation object with only the selected output. |
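Example (a minimal sketch, assuming a multi-output `explanation` such as in the LightGBM classification example above, with output names already set):
>>> class0 = explanation.select_output(0)  # by index
>>> class0 = explanation.select_output("Class 0")  # by name
>>> class0.plot.beeswarm()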
set_feature_names(feature_names)
Set feature names of `X`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_names | list or array-like | Feature names to set. | required |
set_output_names(output_names=None)
If predictions are multi-output, set names of the additional dimension.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_names | list or array-like | Output names to set. | None |
set_X(X)
Set `X` and `self.feature_names`. `X` is converted to pandas. String and object columns are converted to categoricals, while numeric columns are left unchanged. Other column types will raise a TypeError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | np.ndarray, pd.DataFrame, or pl.DataFrame | New data to set. Columns must match the order of SHAP values. | required |
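Example (a minimal sketch; the replacement data is hypothetical and must match the column order of the SHAP values):
>>> import numpy as np
>>> X_new = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
>>> explanation.set_X(X_new)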
importance(which_output=None)
Calculate mean absolute SHAP values for each feature (and output dimension).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
which_output | int or str | Index or name of the output dimension to calculate importance for. If None, all outputs are considered. Only relevant for multi-output models. | None |
Returns:
Type | Description |
---|---|
Series or DataFrame | Series containing mean absolute SHAP values sorted by importance. For multi-output models, a DataFrame is returned, and the sort order is determined by the average importance across all outputs. |
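Example (a minimal sketch, assuming the `explanation` objects from the examples above):
>>> explanation.importance()  # Series of mean absolute SHAP values, sorted
>>> # Multi-output models: restrict to a single output by index or name
>>> explanation.importance(which_output="Class 0")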
interaction_heuristic(features=None, color_features=None)
Interaction heuristic.
For each feature/color_feature combination, the weighted average absolute Pearson correlation coefficient between the SHAP values of the feature and the values of the color_feature is calculated. The larger the value, the higher the potential interaction.
Notes:
- Non-numeric color features are converted to numeric, which does not always make sense.
- Missing values in the color feature are currently discarded.
- The number of non-missing color values in the bins is used as the weight when computing the weighted average.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
features | list | List of feature names. If None, all features are used. | None |
color_features | list | List of color feature names. If None, all features are used. | None |
Returns:
Type | Description |
---|---|
DataFrame | DataFrame with interaction heuristics. |
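Example (a minimal sketch; feature names borrowed from the explain_tree example above):
>>> # Heuristics for all feature/color_feature combinations
>>> heur = explanation.interaction_heuristic()
>>> # Restrict to selected combinations
>>> heur = explanation.interaction_heuristic(
...     features=["X1"], color_features=["X2", "X3"]
... )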