1 Introduction

A decision tree is a simple, easy-to-interpret modeling technique for both regression and classification problems. Compared to other methods, decision trees usually do not perform very well. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods. For more background, the reader is referred to the main references Hastie, Tibshirani, and Friedman (2001) and James et al. (2021).

2 Decision Trees

2.1 How they work

On our journey to estimate the model f by \hat f, we have so far mainly considered linear functions f. We now turn to another class of functions: decision trees. They were introduced in Breiman et al. (1984) and are sometimes called “Classification and Regression Trees” (CART).

(Binary) decision trees are computed recursively by partitioning the data into two parts. Partitions are chosen to optimize total loss by asking the best “yes/no” question about the covariates, e.g.: Is the age of the driver \le 22?

A regression tree for claims frequency

For regression problems, the most frequently used loss function L is the squared error. For classification, it is “information” (= cross-entropy = log loss = half the unit logistic deviance) or the very similar Gini impurity.

Predictions are computed by sending an observation down the tree, starting with the question at the “trunk” and ending in a leaf, say in leaf j. The prediction is the value \gamma_j associated with that leaf. In regression situations, \gamma_j usually corresponds to the average response of all training observations that fall into that leaf. In classification situations, it may be the most frequent class in the leaf or the vector of class probabilities.

More formally, let R_1, \dots, R_J denote the terminal regions corresponding to the J leaves in the sense that \boldsymbol x falls in leaf j \Leftrightarrow \boldsymbol x \in R_j. Then, the prediction function can be expressed as \hat f(\boldsymbol x) = \sum_{j = 1}^{J} \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}.

The concept of a decision tree is best understood with an example.

2.2 Example

We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area and agecat.

library(rpart)
library(rpart.plot)
library(insuranceData)

data(dataCar)

fit <- rpart(
  clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
  data = dataCar,
  method = "class",
  parms = list(split = "information"),
  xval = 0,
  cp = -1,
  maxdepth = 3
)

prp(fit,
  type = 2, extra = 7, shadow.col = "gray",
  faclen = 0, box.palette = "auto", branch.type = 4,
  varlen = 0, cex = 0.9, digits = 3, split.cex = 0.8
)

dataCar[1, c("agecat", "veh_value", "veh_body")]

predict(fit, dataCar[1, ])

##           0          1
## 1 0.9330357 0.06696429

Comments

The first observation belongs to a person in age category 2 and has a $10’600 hatchback: the first question sends us to the right, the second to the left and the third to the right. This gives us a claim probability of 6.7%.
For example, how was the first question (agecat >= 5) selected? The algorithm searches all covariates for all possible split positions and selects the one with the best total loss improvement. In this case, the split on the covariate agecat at the value 5 reduced the total loss the most. As loss function, we used “information”, which is equivalent to (half) the unit logistic deviance.

2.3 Properties

Here, we list a couple of properties of decision trees.

Outliers: In contrast to linear models, outliers in covariates are not an issue because the algorithm only takes into account the sort order of the values. Similarly, taking logarithms or other monotone transformations of covariates has no effect. Both statements do not hold for the response variable.
Missing values: Some implementations can deal with missing values in the input. Alternatively, missing values are often replaced by a typical value or a value smaller/larger than the smallest non-missing value (such as -1 for a positive variable).
Categorical covariates: Unordered categorical covariates are difficult to split, since \ell levels theoretically give rise to 2^\ell possible partitions. Try grouping small categories together, or consider representing the levels by ordered categories (even if it does not make too much sense). One-hot-encoding is also an option. Some algorithms provide ways to handle unordered categories internally.
Greedy: Partitions are made in a greedy way to optimize the objective in one split. Looking ahead more than one split would lead to better models but is not feasible from a computational point of view.
Interactions: By their flexible structure, a decision tree can automatically capture interaction effects of any order (and other non-linear effects), at least if the data set is large and the tree is deep enough. This is a clear advantage of tree-based methods compared to linear models, where these elements must be carefully and manually selected. In the next chapter, we will meet another model class with this advantage: neural nets.
Extrapolation: By construction, a decision tree cannot extrapolate. Thus, even a diamond weighing 10 carats cannot get a higher price prediction than the most expensive diamond in the training data. In fact, it will be even lower (why?).
Instability: A small change in the data can result in a completely different tree. The reason for this is that a change in an early split also affects later splits. Averaging many trees reduces this variance, which leads us to random forests.

These properties (except the last one) translate to combinations of trees like random forests or boosted trees.

Before we look into these tree ensemble techniques, here is a short video of Jerome Friedman explaining how he tried to efficiently approximate nearest neighbor and came up with a new method: decision trees.

3 Random Forests

In 2001, Leo Breiman introduced a powerful tree-based algorithm called random forest (Breiman 2001). A random forest consists of many decision trees. The trees differ for two reasons. One of them is an ensembling technique called “bagging”. Before studying random forests, we first have a look at bagging.

3.1 Bagging

Combining multiple models (base learners) into a single one is a common technique to slightly improve the overall performance of a model. This is referred to as model ensembling. For example, one could create an ensemble model by running multiple k-nearest neighbor regression models, each with a different value for k (e.g., 5, 6, 7), and then averaging the resulting predictions. The combined model predictions tend to have lower variance and thus often a better out-of-sample performance, just like a diversified stock portfolio.

Another way of creating an ensemble model is based on the Bootstrap: Bagging (from “bootstrap aggregating) draws B bootstrap samples and fits a model on each of those samples. The resulting B models are then combined to a single one.

Algorithm: bagging (for regression)

Select B bootstrapped training data sets from the original training data.
Fit models \hat f^{*j}(\boldsymbol x) on them.
Return the bagged model \hat f(\boldsymbol x) = \frac{1}{B} \sum_{j = 1}^B \hat f^{*j}(\boldsymbol x)

For classification, the last step is replaced by taking averages for each predicted probability or doing majority voting.

3.1.1 Example

Let’s try out the idea of bagging using diamonds data. We first fit a relatively deep decision tree on the training data without bagging. This is our reference model. Then, we fit a tree on each of 20 bootstrapped versions of the training data. In order to evaluate the results, we store the validation predictions of each of the 20 bootstrapped models. We evaluate them by taking their successive means (averaging the predictions of the first b models for b = 1, \dots, 20).

library(rpart)
library(tidyverse)
library(withr)

RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

diamonds <- transform(
  diamonds,
  log_price = log(price),
  log_carat = log(carat)
)

# Split diamonds into 80% for "training" and 20% for validation
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

train <- diamonds[ix, ]
valid <- diamonds[-ix, ]

# Tree without bagging for comparison
form <- log_price ~ log_carat + clarity + color + cut
fit_unbagged <- rpart(
  form,
  data = train,
  xval = 0,
  cp = -1,
  maxdepth = 8
)
rmse_unbagged <- RMSE(valid$log_price, predict(fit_unbagged, valid))

# Fit B models on B bootstrap samples and store their predictions
# on the validation data (one column per bootstrap sample)
B <- 20
valid_pred <- matrix(nrow = nrow(valid), ncol = B)
for (j in 1:B) {
  data_boot <- train[sample(nrow(train), replace = TRUE), ]
  fit_boot <- rpart(
    form,
    data = data_boot,
    xval = 0,
    cp = -1,
    maxdepth = 8
  )
  valid_pred[, j] <- predict(fit_boot, valid)
}

# Very high correlations across model predictions
# (subset of three models shown)
round(cor(valid_pred[, 1:3]), 3)

##       [,1]  [,2]  [,3]
## [1,] 1.000 0.998 0.997
## [2,] 0.998 1.000 0.997
## [3,] 0.997 0.997 1.000

# Row-wise successive mean predictions
pred_bagged <- t(
  apply(valid_pred, 1, function(z) cumsum(z) / 1:length(z))
)

# Validation RMSE per column
rmse_bagged <- apply(
  pred_bagged, 2, function(z) RMSE(valid$log_price, z)
)

# Visualization
rmse_bagged |>
  as_tibble_col("Validation RMSE") |>
  mutate(`Bagging round` = row_number()) |>
  ggplot(aes(`Bagging round`, `Validation RMSE`)) +
  geom_point(color = "chartreuse4") +
  geom_hline(yintercept = rmse_unbagged)

Comments

The more bagging rounds, the better the validation performance.
Already after two bagging rounds, performance is better compared with the unbagged model (straight black line).
The effect tails off.
The predictions of the 20 models are extremely strongly correlated. Still, they are not perfectly correlated, so there is some diversification going on.

Remarks on bagging

Base learners: Bagging works best with unstable models like deep decision trees as base learners. In principle, it can also be applied to any other modeling technique, e.g., linear regression.
OOB validation: In each tree, about 1/3 of the observations are not included in the bootstrap sample. Put differently: each observation is used in about 2/3 of all models. If its prediction is calculated from the other 1/3 of the models, we get an “out-of-sample” prediction, also called “out-of-bag” (OOB) prediction. If rows are independent, the performance measure derived from these OOB predictions is usually a good estimate of true performance, replacing (cross-)validation. Do not trust OOB results when rows are dependent, such as with grouped rows.
Parallel computing: Since the models can be fitted fully independently, bagging is ideal for parallel computing.
Performance versus complexity: A bagged model consisting of B base models takes B times as long to train, is B times as large, and is usually more difficult to interpret than a single model. Still, the performance gain is sometimes worth the effort. The same tradeoff applies to ensembling techniques in general.

3.1.2 Bias-Variance Trade-Off

We have mentioned that ensemble models try to reduce variance to improve their test performance. But what does variance mean in this context? And why does a smaller variance leads to better test performance? For the MSE as performance measure, we have the following decomposition of the generalization error (James et al. (2021)):

\underbrace{\mathbb E(y_o - \hat f(x_o))^2}_{\text{Expected test squared error of } x_o} = \text{Var}(\hat f(x_o)) + \left[\text{Bias}(\hat f(x_o))\right]^2 + \underbrace{\text{Var}(\varepsilon)}_{\text{Irreducible error}}, where

(y_o, x_o) is a specific observation,
expectations and variances are calculated over a large number of training sets, and
“Bias” is the error introduced by approximating the true model by f.

Models with low bias (e.g., deep decision trees or k-nearest-neighbor with small k) usually have high variance. On the other hand, models with low variance (like linear regression) tend to have high bias. This is the famous Bias-Variance Trade-Off. A bagged decision tree tries to have low bias (deep trees) and low variance (from bagging).

3.2 From bagging to random forests

To turn a bagged decision tree into a random forest, a second source of randomness is injected: Each split scans only a random selection “mtry” of the p covariates to find the best split, usually \sqrt{p} or p/3. “mtry” is one of the main tuning parameters of a random forest. This extra randomness further decorrelates the bagged trees, which improves the diversification effect. Properties of bagging also apply to random forests.

Algorithm: random forest (for regression)

Select B bootstrapped training data sets from the original training data.
Fit (usually deep) decision trees \hat f^{*j}(\boldsymbol x) on these. During each split, consider only a random subset of features.
Return the model \hat f(\boldsymbol x) = \frac{1}{B} \sum_{j = 1}^B \hat f^{*j}(\boldsymbol x).

For classification, the last step is again replaced by taking averages for each predicted probability or doing majority voting.

Comments about random forests

Number of trees: Usually, 100-1000 trees are being grown. The more, the better. More trees also mean longer training time and larger models.
Deep trees and diversification: Single trees in a random forest are usually deep, thus they overfit on the training data. It is the diversity across trees that produces a good and stable model, just like a well-diversified stock portfolio.
Never trust performance on the training set for random forests. The reason is the overfitting caused by the deep decision trees. Besides using the standard approaches to assess true performance (simple or cross-validation), we can study OOB results.
Parameter tuning: Random forests offer many tuning parameters. Since the results typically do not depend too much on their choice, untuned random forests are often great benchmark models, see also this short video by Adele Cutler on working with Leo Breiman on random forests. Besides “mtry”, it can make sense to tune the tree depth.

3.2.1 Example

Let us now fit a random forest for (logarithmic) diamond prices with default parameters and 500 trees. 80% of the data is used for training, the other 20% we use for evaluating the performance. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)

library(ggplot2)
library(withr)
library(ranger)
library(MetricsWeighted)

options(ranger.num.threads = 8)

diamonds <- transform(
  diamonds,
  log_price = log(price),
  log_carat = log(carat)
)

y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")

# Train/test split
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

fit <- ranger(
  reformulate(xvars, y),
  num.trees = 500,
  data = diamonds[ix, ],
  importance = "impurity",
  seed = 83
)
fit

## Ranger result
## 
## Call:
##  ranger(reformulate(xvars, y), num.trees = 500, data = diamonds[ix,      ], importance = "impurity", seed = 83) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      43152 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         impurity 
## Splitrule:                        variance 
## OOB prediction error (MSE):       0.01080754 
## R squared (OOB):                  0.9894831

# Performance on test data
pred <- predict(fit, diamonds[-ix, ])$predictions
y_test <- diamonds[-ix, y]
rmse(y_test, pred) # 0.1020

## [1] 0.1020477

# Reference (intercept only) model "fitted" on training data
r_squared(y_test, pred, reference_mean = mean(diamonds[ix, y])) # 0.99

## [1] 0.989959

# Size of the fitted model
format(object.size(fit), units = "auto")

## [1] "125.1 Mb"

Comments

Performance is much better than with bagging alone. Besides the random feature selection, the main reason for this is that very deep decision trees are grown.
The OOB estimate of performance is extremely close to the test set performance.
How can we interpret a complex model like this?
The model fills about 130 MB of RAM. This is a lot, given the fact that the diamonds data is not too large.

3.3 Interpreting a black box

In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to not only study

performance, but also
variable importance, and
feature effects.

A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model and its creator. The field of ML/statistics caring about interpreting complex models is called XAI (from eXplainable Artificial Intelligence). We present some of the methods here, and some other later in the chapter on neural networks.

Most of the text of this section is adapted from the responsible ML lecture.

3.3.1 Variable importance

There are different approaches to measure the importance of a covariate. Since no general mathematical definition of “importance” exists, the results of different approaches might be inconsistent across each other. For tree-based methods, a usual approach is to measure how many times a covariate X^{(j)} was used in a split or how much total loss improvement came from splitting on X^{(j)}. For a GLM, we could study partial Chi-squared tests (model with feature X^{(j)} versus model without).

Approaches that work for any supervised model include permutation importance (see later) and SHAP importance (not covered).

3.3.1.1 Example: Random forest (continued)

Let us study relative split gain importance regarding MSE improvement for the features in our random forest.

imp <- sort(importance(fit))
imp <- imp / sum(imp)
barplot(imp, horiz = TRUE, col = "chartreuse4", las = 2)

Comment: As expected, logarithmic carat is by far the most important feature.

3.3.2 Effects

One of the advantages of an additive linear regression over a complex ML model is its simple structure. The coefficient(s) of a feature X^{(j)} immediately tells us how predictions will react on changes in X^{(j)}, keeping all other covariates fixed (“Ceteris Paribus”). Due to additivity, the effect of X^{(j)} is the same for all observations.

But what if the linear regression contains complex interaction effects? Or if the model is an ML model such as a random forest or k-nearest neighbor? Options include studying individual conditional expectations and/or partial dependence plots.

3.3.2.1 Individual conditional expectations

The individual conditional expectation (ICE) function (Goldstein et al. 2015) for covariate X^{(j)} of model f and an observation with covariate vector \boldsymbol x \in {\mathbb R}^p is given by \text{ICE}_j: v \in {\mathbb R} \mapsto f(v, \boldsymbol x_{\setminus j}), where \boldsymbol x_{\setminus j} denotes all but the j-th component of \boldsymbol x, which is being replaced by the value v.

The corresponding ICE curve represents the graph (v, \text{ICE}_j(v)) for a grid of values v \in \mathbb R. An ICE plot finally shows the ICE curves of many observations.

Remarks

The more different the curves in an ICE plot for covariate X^{(j)} (ignoring vertical shifts), the stronger the interaction effects with other covariates.
If there are no interaction effects between X^{(j)} and the other covariates (additive feature), the curves are parallel. If such covariate is represented by a linear function with slope \beta, the ICE curves will have slope \beta as well.
ICE curves can be vertically centered to focus on interaction strength. These curves are called “c-ICE” curves. For a model additive in X^{(j)}, the c-ICE plot of X^{(j)} shows a single curve.
If the model f involves a link function, ICE curves are often plotted on the link scale (why?).

Pros and Cons

Pro: Very simple to compute.
Pro: Easy to interpret thanks to Ceteris Paribus logic.
Pro: Gives an impression about associated interaction strength.
Con: Suboptimal when Ceteris Paribus clause is unnatural.
Con: Model is applied to rare/impossible feature value combinations.
Con: One does not see which variable is interacting.

3.3.2.1.1 Example: Random forest (continued)

How do ICE and c-ICE plots for two selected features look?

library(hstats)

# ICE and c-ICE plots of log_carat and clarity
v <- "log_carat"
ic <- ice(fit, v = v, X = diamonds[ix, ])

plot(ic, color = "chartreuse4") +
  ggtitle(paste("ICE plot for", v))

plot(ic, color = "chartreuse4", center = TRUE) +
  ggtitle(paste("c-ICE plot for", v))

v <- "clarity"
ic <- ice(fit, v = v, X = diamonds[ix, ])

plot(ic, color = "chartreuse4") +
  ggtitle(paste("ICE plot for", v))

plot(ic, color = "chartreuse4", center = TRUE) +
  ggtitle(paste("c-ICE plot for", v))

Comments

ICE plot for log_carat: The curves are increasing, which makes sense: everything else fixed, the expected (log) price increases with (log) carat. For very large values of log carat, predictions tail off. This is not realistic but a consequence of the fact that tree-based models cannot extrapolate.
c-ICE plot for log_carat: Vertical scatter is not strong, indicating weak interaction effects. If our model would use price as response, the interaction effects would be much stronger (why?).
clarity: Better clarity is associated with higher predictions. This trend is easier to see in the centered ICE plot. There, we can also see strong interaction effects (some curves are much steeper than others, etc.).

3.3.2.2 Partial dependence

The pointwise average of many ICE curves is called a partial dependence plot (PDP), originally introduced by Friedman (2001) in his seminal work on gradient boosting. It describes the effect of the variable X^{(j)}, averaged over all interaction effects and holding all other variables fixed. The empirical partial dependence function for model \hat f and feature X^{(j)} is given by \text{PD}_j(v) = \frac{1}{n} \sum_{i = 1}^n \hat f(v, \boldsymbol x_{i,\setminus j}), where \boldsymbol x_{i,\setminus j} denotes the feature vector without j-th component of observation i in some reference dataset, often the training or the test data.

The corresponding PDP represents the graph (v, \text{PD}_j(v)) for a grid of values v \in \mathbb R.

Pros and Cons

Pro: Simple to compute.
Pro: Easy to interpret thanks to Ceteris Paribus logic.
Con: Suboptimal when Ceteris Paribus clause is unnatural.
Con: Model is applied to rare/impossible feature value combinations.
Con: No information about interactions.

An alternative to the partial dependence plot, the accumulated local effects (ALE) plot (Apley and Zhu 2016), attempts to overcome the first two (related) downsides. However, ALE plots are more difficult to calculate and to interpret. We will not consider them here.

3.3.2.2.1 Example: Random forest (continued)

Let us now study the partial dependence plots of all features.

for (v in xvars) {
  p <- partial_dep(fit, v = v, X = diamonds[ix, ]) |>
    plot(color = "chartreuse4") +
    ggtitle(paste("PDP for", v))
  print(p)
}

Comment: All effects as assessed by partial dependence plots make sense.

4 Gradient Boosted Trees

Another way of combining decision trees to an ensemble model are gradient boosted trees. They belong to the most successful ML models. We will first explain what boosting is. Then, we introduce the gradient boosted trees algorithm.

4.1 Boosting

The idea of boosting was investigated, e.g., in Schapire (1990) and could be described as follows:

Fit simple model \hat f to data.
For k = 1, \dots, K do:
1. Find simple model \hat f^k that corrects the mistakes of \hat f.
2. Update: \hat f \leftarrow \hat f + \hat f^k.

As base learners (or weak learners) f^k, often short decision trees are used. But how to find them specifically?

AdaBoost: In 1995, Freund and Schapire proposed the famous AdaBoost algorithm for binary classification. It combines decision trees with a particular reweighting scheme.
Gradient boosting: While AdaBoost is still in use, the most common way to find the updates \hat f^k goes back to Jerome Friedman and his article on Gradient Boosting Machines (Friedman 2001). The key idea of gradient boosting is to choose \hat f^k to make the total loss Q(\hat f + \hat f^k) = \sum_{i = 1}^n L(y_i, \hat f(\boldsymbol x_i) + \hat f^k (\boldsymbol x_i)) smaller.

4.2 Gradient boosting

Gradient boosting is gradient descent in function space. Gradient descent is a numerical method to find the (local) minimum of a smooth function h: \mathbb R^n \to \mathbb R. It works as follows:

Start at some value \hat x \in \mathbb R^n.
Repeat: \hat x \leftarrow \hat x - \lambda g.

\lambda > 0 is the step size or learning rate, and g = \left[\frac{\partial h(x)}{\partial x}\right]_{x = \hat x} is the gradient of h at \hat x.

It points in the direction of the steepest ascent, so the update step goes in the opposite direction to make h smaller. Imagine you are standing on a hill in foggy weather: which direction should you go to get down? You look for the direction in which the hill is steepest, and then take a step in the opposite direction, and so on.

The following picture on Wikipedia illustrates the situation:

https://en.wikipedia.org/wiki/Gradient_descent

Let’s apply gradient descent to the total loss Q(f) = \sum_{i = 1}^n L(y_i, f(\boldsymbol x_i)) viewed as a function of the n parameters f = \left(f(\boldsymbol x_1), \dots, f(\boldsymbol x_n)\right) \in \mathbb R^n, i.e., of the n predictions on the training data.

We proceed as in standard gradient descent:

Start at some value \hat f.
Repeat: \hat f \leftarrow \hat f - \lambda g with g = \left[\frac{\partial Q(f)}{\partial f}\right]_{f = \hat f} having components g_i = \left[\frac{\partial L(y_i, f(\boldsymbol x_i))}{\partial f(\boldsymbol x_i)}\right]_{f(\boldsymbol x_i) = \hat f(\boldsymbol x_i)}.

For the squared error loss L(y, z) = (y-z)^2/2, these components equal g_i = -(\underbrace{y_i - \hat f(\boldsymbol x_i)}_{\text{Residual } r_i}), i.e., the gradient descent updates equal \hat f(\boldsymbol x_i) \leftarrow \hat f(\boldsymbol x_i) + \underbrace{\lambda r_i}_{\hat f^k \text{?}}. Does this mean that the update steps in gradient boosting are \hat f^k = -\lambda g_i? This would not make sense for two major reasons:

y_i and thus g_i are unknown at prediction stage.
The model should be applicable not only to \boldsymbol x \in \{\boldsymbol x_1, \dots, \boldsymbol x_n\}. Rather we would like to apply it to any \boldsymbol x.

The solution is to replace -g_i by predictions of a decision tree (or of another base learner) with covariates \boldsymbol X. This leads us to the following algorithms (slightly modified from Hastie, Tibshirani, and Friedman (2001)):

Algorithm: Gradient boosted trees for the squared error loss

Initialize \hat f(\boldsymbol x) = \frac{1}{n}\sum_{i = 1}^{n} y_i
For k = 1, \dots, K do:
1. For i = 1, \dots, n, calculate residuals r_i = y_i - \hat f(\boldsymbol x_i)
2. Model the r_i as a function of the \boldsymbol x_i by fitting a regression tree \hat f^k with squared error loss
3. Update: \hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \hat f^k(\boldsymbol x)
Output \hat f(\boldsymbol x)

For general loss functions, the algorithm is a bit more complicated:

The residuals are replaced by negative gradients of the loss function L.
Leaf values might be suboptimal \rightarrow replace by optimal values.

Algorithm: Gradient boosted trees for general loss functions

Initialize \hat f(\boldsymbol x) = \text{argmin}_\gamma \sum_{i = 1}^{n} L(y_i, \gamma)
For k = 1, \dots, K do:
1. For i = 1, \dots, n, calculate negative gradients (pseudo-residuals) r_i = -\left[\frac{\partial L(y_i, f(\boldsymbol x_i))}{\partial f(\boldsymbol x_i)}\right]_{f(\boldsymbol x_i) = \hat f(\boldsymbol x_i)}
2. Model r_i as function of \boldsymbol x_i by regression tree \hat f^k with terminal regions R_1, \dots, R_J, using squared error loss
3. For each j = 1, \dots, J, use line-search to find the optimal leaf value \gamma_j = \text{argmin}_{\gamma} \sum_{\boldsymbol x_i \in R_j} L(y_i, \hat f(\boldsymbol x_i) + \gamma)
4. Update: \hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \underbrace{\sum_{j = 1}^{J} \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}}_{\text{modified tree}}
Output \hat f(\boldsymbol x)

Remarks

Predictions are made by the final model \hat f(\boldsymbol x), which equals a sum of decision trees whose leaf values have been replaced.
While a random forest equals the average of many deep trees, a boosted trees model equals the sum of many shallow trees.
We will see later how to select the learning rate \lambda, the number of trees K, and further parameters.
For classification models with more than two categories, the algorithm needs to be slightly extended, but the idea is the same.
AdaBoost (Freund and Schapire, 1995) is gradient tree boosting with a special (exponential) loss.

4.2.1 Illustration

To illustrate the boosting process, we will try to approximate a (univariate) sine function with 1, 2, 10, 100, and 1000 boosted tree stumps (single-split trees).

library(xgboost)
library(tidyverse)

x <- seq(0, 2 * pi, length = 1000)
y <- sin(x)

insample <- 1:750
x_train <- x[insample]
y_train <- y[insample]

# Boosted tree stumps. Or use deeper trees for faster convergence
fit <- xgb.train(
  params = list(learning_rate = 1, max_depth = 1),
  data = xgb.DMatrix(cbind(x = x_train), label = y_train),
  nrounds = 1000
)

# Get predictions for different number of rounds
df_plot <- data.frame(x = x, Truth = y)
N <- c(1, 2, 10, 100, 1000)
for (n in N) {
  df_plot[as.character(n)] <-
    predict(fit, cbind(x = x), iteration = c(1, n + 1))
}

# Transpose for plotting
df_plot_long <- df_plot |>
  pivot_longer(-x, values_to = "Prediction") |>
  mutate(name = factor(name, levels = c(N, "Truth")))

ggplot(df_plot_long, aes(x = x, y = Prediction)) +
  geom_line(color = "chartreuse4") +
  facet_wrap("name")

Comments: The approximation works very well, but only within the data range used for training, i.e., in [0, 3\pi/2].

4.3 Modern implementations

Modern implementations of gradient boosted trees include XGBoost, LightGBM, and CatBoost. These are the predominant algorithms in ML competitions on tabular data, check this comparison for differences with a screenshot as per February 7, 2025:

They differ from Friedman’s original approach in multiple ways, e.g.:

Second order gradients: XGBoost introduced the use of second order gradients, mainly removing the line-search step for all loss functions, see Chen and Guestrin (2016) or this slide deck from Tianqi Chen.
Histogram binning: LightGBM (Ke et al. 2017), a very fast competitor of XGBoost, introduced histogram binning: Continuous covariates are binned into a histogram (e.g., in B = 255 bins) before training, and their values are replaced by short unsigned integers representing the corresponding bin number. During tree growth, to determine the best split position of a given covariate, at most B split positions must be evaluated instead of evaluating the loss gain for all unique feature values. This significantly speeds up tree construction. Nowadays, histogram binning is standard, e.g., also in XGBoost.
Possibility to penalize the objective function.

4.3.1 Example: XGBoost

As an initial example on gradient boosting and XGBoost, we fit a model for logarithmic diamond prices with squared error as loss function. The number of rounds/trees is initially chosen by cross-validation and early stopping, i.e., until CV validation (R)MSE stops improving for a couple or rounds. The learning rate (weight of each tree) is chosen by trial and error in order to end up with a reasonably small/large number of trees, see the next section for more details.

library(ggplot2)
library(withr)
library(xgboost)
library(MetricsWeighted)

diamonds <- transform(
  diamonds,
  log_price = log(price),
  log_carat = log(carat)
)

y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")

# Split into train and valid
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

y_train <- diamonds[ix, y]
X_train <- diamonds[ix, xvars]

y_test <- diamonds[[y]][-ix]
X_test <- diamonds[-ix, xvars]

# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)

# Minimal set of parameters
params <- list(
  objective = "reg:squarederror",
  learning_rate = 0.1
)

# Add trees until CV validation MSE stops improving over
# last 20 rounds
cvm <- xgb.cv(
  params = params,
  data = dtrain,
  nrounds = 5000,
  nfold = 5,
  early_stopping_rounds = 20,
  showsd = FALSE,
  print_every_n = 50
)

## [1]  train-rmse:6.618138 test-rmse:6.618111 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
## 
## [51] train-rmse:0.106212 test-rmse:0.108808 
## [101]    train-rmse:0.097096 test-rmse:0.101262 
## [151]    train-rmse:0.095865 test-rmse:0.101083 
## [201]    train-rmse:0.094934 test-rmse:0.101009 
## Stopping. Best iteration:
## [190]    train-rmse:0.095084+0.000183    test-rmse:0.100988+0.000864

# Fit model on full training data using optimal number of rounds
fit <- xgb.train(
  params = params,
  data = dtrain,
  print_every_n = 50,
  nrounds = cvm$best_iteration
)

# Test performance
rmse(y_test, predict(fit, data.matrix(X_test))) # 0.0988

## [1] 0.09879191

Comments:

More trees would mean better training performance but worse CV performance.
Test performance is better than the random forest. Can we do even better?
Compared to simple validation, cross-validation would provide an even better way to find the optimal number of boosting rounds/trees.

4.4 Parameter tuning

Gradient boosted trees offer many parameters to select from. Unlike with random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:

Number of boosting rounds K: In contrast to random forests, more trees/rounds is not always beneficial because the model begins to overfit after some time. The optimal number of rounds is usually found by early stopping, i.e., one lets the algorithm stop as soon as the (cross-)validation performance stops improving, see the example above.
Learning rate \lambda: The learning rate determines the training speed and the impact of each tree to the final model. Typical values are between 0.001 and 1. In practical applications, it is set to a value that leads to a reasonable amount of trees (100 - 1000). Usually, halving the learning rate means twice as much boosting rounds for comparable performance.
Regularization parameters: Depending on the implementation, additional parameters are
- the tree depth (often 1 - 7) or the number of leaves (often 2 - 63),
- the strength of the L1 and/or L2 penalties added to the objective function (often between 0 and 5),
- the per-tree row subsampling rate (often between 0.8 and 1),
- the per-tree or per-split column subsampling rate (often between 0.6 and 1),
- …
Reasonable regularization parameters are chosen by trial and error or systematically by randomized/grid search CV. Usually, it takes a couple of manual iterations until the range of the parameter values have been set appropriately.

Overall, the modeling strategy could be summarized as follows:

Using default regularization parameters, set the learning rate to get a reasonable number of trees with CV-based early stopping.
Use randomized search CV with early stopping to tune regularization parameters such as tree depth. Iterate if needed.
Use the best parameter combination (incl. number of trees) to fit the model on the training data. “Best” typically means in terms of CV performance. As mentioned in the last chapter and depending on the situation, it can also mean “good CV performance and not too heavy overfit compared to insample performance” or some other reasonable criterion.

Remark: Since the learning rate, the number of boosting rounds, and other regularization parameters are highly interdependent, a large randomized grid search to select the learning rate, boosting rounds, and regularization parameters is often not ideal. Above suggestion (fix learning rate, select number of rounds by early stopping and do grid search only on regularization parameters) is more focused, see also the example below.

4.4.1 Example: XGBoost

We will use XGBoost to fit diamond prices using the squared error as loss function and RMSE as corresponding evaluation metric, now using a sophisticated tuning strategy.

# This code is pretty long. It may serve as a general template to fit
# a high-performance XGBoost model
library(ggplot2)
library(withr)
library(xgboost)
library(MetricsWeighted)
library(hstats)

diamonds <- transform(
  diamonds,
  log_price = log(price),
  log_carat = log(carat)
)

y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")

# Split into train and test
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

y_train <- diamonds[ix, y]
X_train <- diamonds[ix, xvars]

y_test <- diamonds[-ix, y]
X_test <- diamonds[-ix, xvars]

# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)

# If grid search is to be run again, set tune <- TRUE
tune <- FALSE

if (tune) {
  # Use default parameters to set learning rate with suitable
  # number of rounds
  params <- list(
    objective = "reg:squarederror",
    learning_rate = 0.1
  )

  # Cross-validation
  cvm <- xgb.cv(
    params = params,
    data = dtrain,
    nrounds = 5000,
    nfold = 5,
    early_stopping_rounds = 20,
    showsd = FALSE,
    print_every_n = 50
  )
  cvm # -> a lr of 0.1 provides ~200 trees, which seems convenient

  # Final grid search after some iterations
  grid <- expand.grid(
    iteration = NA,
    cv_score = NA,
    train_score = NA,
    objective = "reg:squarederror",
    learning_rate = 0.1,
    max_depth = 6:7,
    reg_lambda = c(0, 2.5, 5, 7.5),
    reg_alpha = c(0, 4),
    colsample_bynode = c(0.8, 1),
    subsample = c(0.8, 1),
    min_child_weight = c(1, 10),
    min_split_loss = c(0, 1e-04),
    stringsAsFactors = FALSE
  )

  # Grid search or randomized search if grid is too large
  max_size <- 32
  grid_size <- nrow(grid)
  if (grid_size > max_size) {
    grid <- grid[sample(grid_size, max_size), ]
    grid_size <- max_size
  }

  # Loop over grid and fit XGBoost with 5-fold CV and early stopping
  pb <- txtProgressBar(0, grid_size, style = 3)
  for (i in seq_len(grid_size)) {
    cvm <- xgb.cv(
      params = as.list(grid[i, -(1:2)]),
      data = dtrain,
      nrounds = 5000,
      nfold = 5,
      early_stopping_rounds = 20,
      verbose = 0
    )

    # Store result
    grid[i, 1] <- cvm$best_iteration
    grid[i, 2:3] <- cvm$evaluation_log[, c(4, 2)][cvm$best_iteration]
    setTxtProgressBar(pb, i)

    # Save grid to survive hard crashs
    saveRDS(grid, file = "simulation/diamonds_xgb.rds")
  }
}

# Load grid and select best iteration
grid <- readRDS("simulation/diamonds_xgb.rds")
grid <- grid[order(grid$cv_score), ]
head(grid)

# Fit final, tuned model
fit <- xgb.train(
  params = as.list(grid[1, -(1:3)]),
  data = dtrain,
  nrounds = grid[1, "iteration"]
)

Now, the model is ready to be inspected by evaluating

its test performance,
by looking at split gain importance and
partial dependence plots.

# Performance on test data
pred <- predict(fit, data.matrix(X_test))
rmse(y_test, pred) # 0.0986

## [1] 0.09857134

r_squared(y_test, pred, reference_mean = mean(y_train)) # 0.9906

## [1] 0.9906315

# Variable importance regarding MSE improvement
imp <- xgb.importance(model = fit)
xgb.plot.importance(imp, col = "chartreuse4")

# Partial dependence plots
pred_fun <- function(m, X) predict(m, data.matrix(X))
for (v in xvars) {
  p <- partial_dep(fit, v = v, X = X_train, pred_fun = pred_fun) |>
    plot("chartreuse4") +
    ggtitle(paste("PDP for", v))
  print(p)
}

Comment: Compared to the random forest, the test performance is slightly better. Effects seem to be comparable.

4.4.2 Example: Taxi

We will now tune and fit a LightGBM model to predict (logarithmic) taxi trip durations, using the same covariates as in previous taxi models, and following the same basic tuning strategy as demonstrated for XGBoost. Instead of max_depth, the main regularization parameter of LightGBM is the number of leaf nodes.

Note: Since model tuning is slow, and the data is large, we could replace the 5-fold cross-validation by simple validation, e.g., using a 60%/20%/20% data split.

library(arrow)
library(ggplot2)
library(dplyr)
library(data.table)
library(withr)
library(lightgbm)
library(MetricsWeighted)
library(hstats)

df <- read_parquet("taxi/yellow_tripdata_2018-01.parquet")

setDT(df)
head(df)

# ====================================================================
# Prepare data for modeling
# ====================================================================

df[
  ,
  duration := as.numeric(
    difftime(
      tpep_dropoff_datetime,
      tpep_pickup_datetime,
      units = "mins"
    )
  )
]
df <- df[between(trip_distance, 0.2, 100) & between(duration, 1, 120)]
df[
  ,
  `:=`(
    pu_hour = factor(data.table::hour(tpep_pickup_datetime)),
    weekday = factor(data.table::wday(tpep_pickup_datetime)), # 1 = Su
    pu_loc = forcats::fct_lump_min(factor(PULocationID), 1e5),
    log_duration = log(duration),
    log_distance = log(trip_distance)
  )
]

xvars <- c("log_distance", "weekday", "pu_hour", "pu_loc")
y <- "log_duration"

# Random split
with_seed(
  1,
  ix <- sample(nrow(df), 0.8 * nrow(df))
)

y_train <- df[[y]][ix]
X_train <- df[ix, xvars, with = FALSE]

y_test <- df[[y]][-ix]
X_test <- df[-ix, xvars, with = FALSE]

# Data interface of LGB
dtrain <- lgb.Dataset(
  data.matrix(X_train),
  label = y_train,
  params = list(feature_pre_filter = FALSE)
)

grid_file <- "taxi/taxi_lgb.rds"

# ====================================================================
# Tune model using strategy outlined above (use as LightGBM template)
# ====================================================================

tune <- FALSE

if (tune) {
  # Select learning rate so that optimal number of rounds is
  # somewhere between 100 and 1000
  params <- list(
    learning_rate = 1,
    objective = "mse",
    num_threads = 7 # choose number of threads
  )

  cvm <- lgb.cv(
    params = params,
    data = dtrain,
    nrounds = 1000,
    nfold = 5,
    early_stopping_rounds = 20,
    eval_freq = 50
  )

  # Random search
  grid <- expand.grid(
    iteration = NA,
    score = NA,
    objective = "mse",
    learning_rate = 1,
    num_leaves = c(15, 31, 63),
    lambda_l2 = c(0, 2.5, 5, 7.5),
    lambda_l1 = c(0, 4),
    colsample_bynode = c(0.8, 1),
    bagging_fraction = c(0.8, 1),
    min_data_in_leaf = c(20, 50, 100),
    min_sum_hessian_in_leaf = c(0, 0.001, 0.01),
    stringsAsFactors = FALSE
  )

  max_size <- 10
  grid_size <- nrow(grid)
  if (grid_size > max_size) {
    grid <- grid[sample(grid_size, max_size), ]
    grid_size <- max_size
  }

  pb <- txtProgressBar(0, grid_size, style = 3)
  for (i in seq_len(grid_size)) { # i <- 1
    cvm <- lgb.cv(
      params = as.list(grid[i, -(1:2)]),
      data = dtrain,
      nrounds = 5000,
      nfold = 5,
      early_stopping_rounds = 20,
      verbose = -1
    )

    # Store result
    grid[i, 1:2] <- as.list(cvm)[c("best_iter", "best_score")]
    setTxtProgressBar(pb, i)

    # Save grid to survive hard crashs
    saveRDS(grid, file = grid_file)
  }
}

# ====================================================================
# Fit model on best parameter combination
# ====================================================================

grid <- readRDS(grid_file) |>
  arrange(score)
head(grid)

# Fit model on best parameter combination
if (FALSE) { # Set to true to refit
  fit_lgb <- lgb.train(
    params = as.list(grid[1, -(1:2)]),
    data = dtrain,
    nrounds = grid[1, "iteration"]
  )
  lgb.save(fit_lgb, "taxi/fit_lgb.txt")
} else {
  fit_lgb <- lgb.load("taxi/fit_lgb.txt")
}

# ====================================================================
# Inspect
# ====================================================================

# Performance on test data
pred <- predict(fit_lgb, data.matrix(X_test))
rmse(y_test, pred) # 0.3014

## [1] 0.3013554

r_squared(y_test, pred, reference_mean = mean(y_train))

## [1] 0.8232967

# Variable importance regarding MSE improvement
imp <- lgb.importance(model = fit_lgb)
lgb.plot.importance(imp)

# PDPs
for (v in xvars) {
  p <- partial_dep(fit_lgb, v = v, X = data.matrix(X_train)) |>
    plot("chartreuse4") +
    ggtitle(paste("PDP for", v))
  print(p)
}

Comments

It takes long to tune a boosted trees model on many million data rows.
The resulting model is better than the linear regression, which had an R-squared (without data split) of about 78%, but not dramatically.
The effect of the dominating feature (log_distance) is realistic. Again, as with diamonds prices, we see that tree-based models do not have the capability to extrapolate a linear relationship. Depending on the application, this can be a problem.
The effect of the hour of day is not intuitive. Why are the predictions at 12:00 as high as during rush-hour?

5 Exercises

Explain in simple words: What is bagging? What is a random forest? What is out-of-bag performance?
Try to improve the random forest fitted on the diamonds data by playing with the mtry parameter. What does it do? Pick the mtry with best OOB performance. Does this improve the test performance?
Fit a random forest on the claims data for the binary variable clm using covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing OOB error on the training data. Make sure to fit a probability random forest, i.e., predicting probabilities, not classes. Evaluate the final model on an independent test data set. (Note that the “ranger” package uses the “Brier score” as the evaluation metric for probabilistic predictions. In the binary case, is the same as the MSE.) Interpret the results by split gain importance and partial dependence plots.
Consider the fully tuned XGBoost model for the diamonds data in the lecture notes above. Study the online documentation of XGBoost to figure out how to make the model monotonically increasing in log_carat. Test your approach without repeating the grid search part. How does the partial dependence plot for log_carat look now?
In the gradient boosted trees algorithm for the squared error, why do we fit a regression tree to the residuals instead of simply adding as new model the residuals?
(Optional) Develop an XGBoost model for the claims data set with binary response clm, and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss as loss function and evaluation metric. Interpret its results. You don’t need to write all the code from scratch, but rather modify the XGBoost code from the lecture notes.

6 Summary

In this chapter, we learned about decision trees, random forests, and tree boosting. Individual decision trees are very easy to interpret, but do not perform too well. Tree ensembles such as a random forest or gradient boosting trees, on the other hand, are usually very powerful but difficult to interpret. We have introduced interpretation tools to look inside such “black boxes”. The main reason why random forests and boosted trees often provide more accurate models than a linear model is their ability to automatically learn interactions and other nonlinear effects.

References

Apley, Daniel W., and Jingyu Zhu. 2016. “Visualizing the Effects of Predictor Variables in Black Box Supervised Learning Models.” https://doi.org/10.48550/arXiv.1612.08468.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Breiman, Leo, Jerome Friedman, Charles J. Stone, and R. A. Olshen. 1984. Classification and Regression Trees. Chapman; Hall/CRC.

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. KDD ’16. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/2939672.2939785.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Ann. Stat. 29 (5): 1189–1232. https://doi.org/10.1214/aos/1013203451.

Goldstein, Alex, Adam Kapelner, Justin Bleich, and Emil Pitkin. 2015. “Peeking Inside the Black Box: Visualizing Statistical Learning with Plots of Individual Conditional Expectation.” Journal of Computational and Graphical Statistics 24 (1): 44–65. https://doi.org/10.1080/10618600.2014.907095.

Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning. 2nd ed. Springer Series in Statistics. New York, NY, USA: Springer New York Inc. https://web.stanford.edu/~hastie/ElemStatLearn/.

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in r. Springer Texts in Statistics. Springer US.

Ke, Guolin, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. “LightGBM: A Highly Efficient Gradient Boosting Decision Tree.” In Advances in Neural Information Processing Systems, 30:3149–57. Curran Associates, Inc.

Schapire, Robert E. 1990. “The Strength of Weak Learnability.” Machine Learning 5 (2): 197–227. https://doi.org/10.1023/A:1022648800760.

5. Trees

Statistical Computing

Michael Mayer

2025-05-10

1 Introduction

2 Decision Trees

2.1 How they work

2.2 Example

2.3 Properties

3 Random Forests

3.1 Bagging

3.1.1 Example

3.1.2 Bias-Variance Trade-Off

3.2 From bagging to random forests

3.2.1 Example

3.3 Interpreting a black box

3.3.1 Variable importance

3.3.1.1 Example: Random forest (continued)

3.3.2 Effects

3.3.2.1 Individual conditional expectations

3.3.2.1.1 Example: Random forest (continued)

3.3.2.2 Partial dependence

3.3.2.2.1 Example: Random forest (continued)

4 Gradient Boosted Trees

4.1 Boosting

4.2 Gradient boosting

4.2.1 Illustration

4.3 Modern implementations

4.3.1 Example: XGBoost

4.4 Parameter tuning

4.4.1 Example: XGBoost

4.4.2 Example: Taxi

5 Exercises

6 Summary

References