API Reference
Functions
lightshap.explain_tree
explain_tree(model, X)
Calculate TreeSHAP for XGBoost, LightGBM, and CatBoost models.
The following model types are supported:
- xgboost.Booster
- xgboost.XGBModel
- xgboost.XGBRegressor
- xgboost.XGBClassifier
- xgboost.XGBRFClassifier
- xgboost.XGBRFRegressor
- lightgbm.Booster
- lightgbm.LGBMModel
- lightgbm.LGBMRanker
- lightgbm.LGBMRegressor
- lightgbm.LGBMClassifier
- catboost.CatBoost
- catboost.CatBoostClassifier
- catboost.CatBoostRanker
- catboost.CatBoostRegressor
Parameters:
Name | Type | Description | Default |
---|---|---|---|
model | XGBoost, LightGBM, or CatBoost model | A fitted model. | required |
X | array-like | The input data for which SHAP values are to be computed. | required |
Returns:
Type | Description |
---|---|
Explanation | An Explanation object. |
Examples:
>>> # Example 1: XGBoost regression
>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import explain_tree
>>>
>>> import xgboost as xgb
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
... {
... "X1": rng.normal(0, 1, 100),
... "X2": rng.uniform(-2, 2, 100),
... "X3": rng.choice([0, 1, 2], 100),
... }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> model = xgb.train({"learning_rate": 0.1}, xgb.DMatrix(X, label=y))
>>>
>>> explanation = explain_tree(model, X)
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()
>>> # Example 2: LightGBM Multi-Class Classification
>>> import numpy as np
>>> import pandas as pd
>>> from lightgbm import LGBMClassifier
>>> from lightshap import explain_tree
>>>
>>> rng = np.random.default_rng(seed=42)
>>> X = pd.DataFrame(
... {
... "X1": rng.normal(0, 1, 100),
... "X2": rng.uniform(-2, 2, 100),
... "X3": rng.choice([0, 1, 2], 100),
... }
... )
>>> y = X["X1"] + X["X2"] ** 2 + X["X3"] + rng.normal(0, 0.1, 100)
>>> y = pd.cut(y, bins=3, labels=[0, 1, 2])
>>> model = LGBMClassifier(max_depth=3, verbose=-1)
>>> model.fit(X, y)
>>>
>>> # SHAP analysis
>>> explanation = explain_tree(model, X)
>>> explanation.set_output_names(["Class 0", "Class 1", "Class 2"])
>>> explanation.plot.bar()
>>> explanation.plot.scatter(which_output=0) # Class 0
lightshap.explain_any
explain_any(predict, X, bg_X=None, bg_w=None, bg_n=200, method=None, how=None, max_iter=None, tol=0.01, random_state=None, n_jobs=1, verbose=True)
SHAP values for any model
Calculate SHAP values for any model using either Kernel SHAP or Permutation SHAP. By default, it uses Permutation SHAP for p <= 8 features and a hybrid between exact and sampling Kernel SHAP for p > 8 features.
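The defaults can be overridden via the `method` and `how` arguments described below. A minimal sketch (hedged; `predict_function` and `X` stand in for the objects defined in the examples further down):
>>> # Force degree-1 hybrid Kernel SHAP with a reproducible seed
>>> explanation = explain_any(
...     predict=predict_function,
...     X=X,
...     method="kernel",
...     how="h1",
...     max_iter=50,
...     random_state=0,
... )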
Parameters:
Name | Type | Description | Default |
---|---|---|---|
predict | callable | A callable to get predictions, e.g., `model.predict`, `model.predict_proba`, or a function thereof. | required |
X | pd.DataFrame, pl.DataFrame, or np.ndarray | Input data for which explanations are to be generated. Should contain only the p feature columns. Must be compatible with `predict`. | required |
bg_X | pd.DataFrame, pl.DataFrame, np.ndarray, or None | Background data used to integrate out "switched off" features, typically a representative sample of the training data with 100 to 500 rows. Should contain the same columns as `X`. | None |
bg_w | pd.Series, pl.Series, np.ndarray, or None | Weights for the background data. If None, equal weights are used. If `bg_X` is None, the weights are assumed to correspond to the rows of `X`. | None |
bg_n | int | If `bg_X` is None, `bg_n` rows are sampled from `X` to serve as background data. | 200 |
method | str or None | Either "kernel", "permutation", or None. If None, it is set to "permutation" when p <= 8, and to "kernel" otherwise. | None |
how | str or None | If "exact", exact SHAP values are computed. If "sampling", iterative sampling is used to approximate SHAP values. For Kernel SHAP, hybrids between "sampling" and "exact" are available: "h1" uses exact calculations for coalitions of size 1 and p-1, whereas "h2" uses exact calculations for coalitions of size 1, 2, p-2, and p-1. If None, it is set to "exact" when p <= 8. Otherwise, if method == "permutation", it is set to "sampling". For Kernel SHAP, it is set to "h2" when 8 < p <= 16, and to "h1" when p > 16. | None |
max_iter | int or None | Maximum number of iterations for non-exact algorithms. Each iteration represents a forward and backward pass through a random permutation. For Permutation SHAP, one iteration evaluates Shapley's formula 2*p times (twice per feature); subsequent iterations start at different positions for faster convergence. If None, it is set to 10 * p. | None |
tol | float | Tolerance for convergence. The algorithm stops when the estimated standard errors are all smaller than or equal to `tol`. | 0.01 |
random_state | int or None | Integer random seed to initialize numpy's random generator. Required for non-exact algorithms, and to subsample the background data if `bg_X` is None. | None |
n_jobs | int | Number of parallel jobs to run via joblib. If 1, no parallelization is used. If -1, all available cores are used. | 1 |
verbose | bool | If True, prints information and the tqdm progress bar. | True |
Returns:
Type | Description |
---|---|
Explanation | An Explanation object. |
Examples:
Example 1: Working with NumPy input
>>> import numpy as np
>>> from lightshap import explain_any
>>>
>>> # Create synthetic data
>>> rng = np.random.default_rng(0)
>>> X = rng.standard_normal((1000, 4))
>>>
>>> # In practice, you would use model.predict, model.predict_proba,
>>> # or a function thereof, e.g.,
>>> # lambda X: scipy.special.logit(model.predict_proba(X))
>>> def predict_function(X):
... linear = X[:, 0] + 2 * X[:, 1] - X[:, 2] + 0.5 * X[:, 3]
... interactions = X[:, 0] * X[:, 1] - X[:, 1] * X[:, 2]
... return (linear + interactions).reshape(-1, 1)
>>>
>>> # Explain with numpy array (no feature names initially)
>>> explanation = explain_any(
... predict=predict_function,
... X=X[:100], # Explain first 100 rows
... )
>>>
>>> # Set meaningful feature names
>>> feature_names = ["temperature", "pressure", "humidity", "wind_speed"]
>>> explanation = explanation.set_feature_names(feature_names)
>>>
>>> # Generate plots
>>> explanation.plot.bar()
>>> explanation.plot.scatter(["temperature", "humidity"])
>>> explanation.plot.waterfall(row_id=0)
Example 2: Polars input with categorical features
>>> import numpy as np
>>> import polars as pl
>>> from lightshap import explain_any
>>>
>>> rng = np.random.default_rng(0)
>>> n = 800
>>>
>>> df = pl.DataFrame({
... "age": rng.uniform(18, 80, n).round(),
... "income": rng.exponential(50000, n).round(-3),
... "education": rng.choice(["high_school", "college", "graduate", "phd"], n),
... "region": rng.choice(["north", "south", "east", "west"], n),
... }).with_columns([
... pl.col("education").cast(pl.Categorical),
... pl.col("region").cast(pl.Categorical),
... ])
>>>
>>> # Again, in practice you would use a fitted model's predict instead
>>> def predict_function(X):
... pred = X["age"] / 50 + X["income"] / 100_000 * (
... 1 + 0.5 * X["education"].is_in(["graduate", "phd"])
... )
... return pred
>>>
>>> explanation = explain_any(
... predict=predict_function,
... X=df[:200], # Explain first 200 rows
...     bg_X=df[200:400],  # Pass background data; otherwise a subset of X is used
... )
>>>
>>> explanation.plot.beeswarm()
>>> explanation.plot.scatter()
Classes
lightshap.Explanation
SHAP Explanation object that encapsulates model explanations.
The Explanation class provides a comprehensive framework for storing, analyzing, and visualizing SHAP (SHapley Additive exPlanations) values, which help interpret machine learning model predictions. This class supports both single-output and multi-output models, handles feature importance analysis, and offers various visualization methods.
The class stores SHAP values along with the associated data points, baseline values, and optionally includes standard errors, convergence indicators, and iteration counts for approximation methods. It provides methods to select subsets of the data, calculate feature importance, and create various visualizations including waterfall plots, dependence plots, summary plots, and importance plots.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
shap_values | ndarray | numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models. | required |
X | pd.DataFrame, pl.DataFrame, or np.ndarray | Feature values corresponding to `shap_values`. | required |
baseline | float or ndarray | The baseline value(s) representing the expected model output when all features are missing. For single-output models, either a scalar or a numpy.ndarray of shape (1,). For multi-output models, an array of shape (n_outputs,). | 0.0 |
feature_names | list or None | Feature names. If None and X is a pandas DataFrame, column names are used. If None and X is not a DataFrame, default names are generated. | None |
output_names | list or None | Names of the outputs for multi-output models. If None, default names are generated. | None |
standard_errors | ndarray or None | Standard errors of the SHAP values. Must have the same shape as shap_values, or None. Only relevant for approximate methods. | None |
converged | ndarray or None | Boolean array indicating the convergence status per observation. Only relevant for approximate methods. | None |
n_iter | ndarray or None | Number of iterations per observation. Only relevant for approximate methods. | None |
Attributes:
Name | Type | Description |
---|---|---|
shap_values | ndarray | numpy.ndarray of shape (n_obs, n_features) for single-output models, and of shape (n_obs, n_features, n_outputs) for multi-output models. |
X | DataFrame | The feature values corresponding to `shap_values`, stored as a pandas DataFrame. |
baseline | ndarray | Baseline value(s). Has shape (1,) for single-output models, and shape (n_outputs,) for multi-output models. |
standard_errors | ndarray or None | Standard errors of the SHAP values, of the same shape as `shap_values` (if available). |
converged | ndarray or None | Convergence indicators of shape (n_obs,) (if available). |
n_iter | ndarray or None | Iteration counts of shape (n_obs,) (if available). |
shape | tuple | Shape of `shap_values`. |
ndim | int | Number of dimensions of the SHAP values (2 or 3). |
feature_names | list | Feature names. |
output_names | list or None | Output names for multi-output models. None for single-output models. |
Examples:
>>> import numpy as np
>>> import pandas as pd
>>> from lightshap import Explanation
>>>
>>> # Example data
>>> X = pd.DataFrame({'feature1': [1, 2, 3], 'feature2': [4, 5, 6]})
>>> shap_values = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
>>>
>>> explanation = Explanation(shap_values, X, baseline=0.5)
>>>
>>> # Waterfall plot of first observation
>>> explanation.plot.waterfall(row_id=0)
plot
property
Access to the plotting methods, e.g., `explanation.plot.bar()`, `explanation.plot.beeswarm()`, `explanation.plot.scatter()`, and `explanation.plot.waterfall()`, as used in the examples above.
filter(indices)
Filter the SHAP values and data by an integer or boolean array-like.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices | array-like | Integer or boolean array-like to filter the SHAP values and data. | required |
Returns:
Type | Description |
---|---|
Explanation | A new Explanation object with filtered SHAP values and data. |
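Example (a minimal sketch, assuming an `explanation` object built as in the examples above):
>>> import numpy as np
>>> # Keep only the first 10 observations
>>> subset = explanation.filter(np.arange(10))
>>> # Or keep rows via a boolean mask, e.g., observations where the first
>>> # feature's SHAP value is positive (single-output case)
>>> subset = explanation.filter(explanation.shap_values[:, 0] > 0)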
select_output(index)
Select a specific output dimension from the SHAP values. Useful if predictions are multi-output.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or str | Index or name of the output dimension to select. | required |
Returns:
Type | Description |
---|---|
Explanation | A new Explanation object with only the selected output. |
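Example (a minimal sketch, assuming a multi-output `explanation` such as in the LightGBM classification example above, with output names already set):
>>> class0 = explanation.select_output(0)  # by index
>>> class0 = explanation.select_output("Class 0")  # by name
>>> class0.plot.beeswarm()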
set_feature_names(feature_names)
Set feature names of `X`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
feature_names | list or array-like | Feature names to set. | required |
set_output_names(output_names=None)
If predictions are multi-output, set names of the additional dimension.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_names | list or array-like | Output names to set. | None |
set_X(X)
Set `X` and `self.feature_names`. `X` is converted to pandas. String and object columns are converted to categoricals, while numeric columns are left unchanged. Other column types will raise a TypeError.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | np.ndarray, pd.DataFrame, or pl.DataFrame | New data to set. Columns must match the order of SHAP values. | required |
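Example (a minimal sketch; the replacement data is hypothetical and must match the column order of the SHAP values):
>>> import numpy as np
>>> X_new = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]])
>>> explanation.set_X(X_new)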
importance(which_output=None)
Calculate mean absolute SHAP values for each feature (and output dimension).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
which_output | int or str | Index or name of the output dimension to calculate importance for. If None, all outputs are considered. Only relevant for multi-output models. | None |
Returns:
Type | Description |
---|---|
Series or DataFrame | Series containing mean absolute SHAP values sorted by importance. For multi-output models, a DataFrame is returned, and the sort order is determined by the average importance across all outputs. |
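Example (a minimal sketch, assuming the `explanation` objects from the examples above):
>>> explanation.importance()  # Series of mean absolute SHAP values, sorted
>>> # Multi-output models: restrict to a single output by index or name
>>> explanation.importance(which_output="Class 0")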
interaction_heuristic(features=None, color_features=None)
Interaction heuristic.
For each feature/color_feature combination, the weighted average absolute Pearson correlation coefficient between the SHAP values of the feature and the values of the color_feature is calculated. The larger the value, the higher the potential interaction.
Notes:
- Non-numeric color features are converted to numeric, which does not always make sense.
- Missing values in the color feature are currently discarded.
- The number of non-missing color values in the bins is used as the weight when computing the weighted average.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
features | list | List of feature names. If None, all features are used. | None |
color_features | list | List of color feature names. If None, all features are used. | None |
Returns:
Type | Description |
---|---|
DataFrame | DataFrame with interaction heuristics. |
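Example (a minimal sketch; feature names borrowed from the explain_tree example above):
>>> # Heuristics for all feature/color_feature combinations
>>> heur = explanation.interaction_heuristic()
>>> # Restrict to selected combinations
>>> heur = explanation.interaction_heuristic(
...     features=["X1"], color_features=["X2", "X3"]
... )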