A decision tree is a simple, easy-to-interpret modeling technique for both regression and classification problems. Compared to other methods, decision trees usually do not perform very well. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods.
On our journey to estimate the model $f$ by $\hat f$, we have so far considered mainly linear functions. We now move to a different function class: decision trees. They were introduced in 1984 by Leo Breiman, Jerome Friedman, and others [1] and are sometimes called "Classification and Regression Trees" (CART).
(Binary) decision trees are calculated recursively by partitioning the data into two pieces. Partitions are chosen to optimize the given average loss by asking the best "yes/no" question about the covariates, e.g., "is carat < 1?" or "is the color better than F?".
For regression problems, the most frequently used loss function is the squared error. For classification, it is typically the "information" (= cross-entropy = log loss = half the unit logistic deviance) or the very similar Gini impurity.
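As a small illustration (not part of the original example; function names and toy data are my own), the following sketch computes the Gini impurity and the cross-entropy ("information") of a vector of class labels. Both measure how "pure" a potential child node would be.

# Sketch of the two classification impurity measures (base R)
gini_impurity <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}

cross_entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(p * log(p))
}

y <- c(0, 0, 0, 1, 1)   # toy node with 40% claims
gini_impurity(y)        # 0.48
cross_entropy(y)        # about 0.673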
Predictions are calculated by sending an observation through the tree, starting with the question at the "trunk" and ending in a "leaf". The prediction is the value associated with that leaf. In regression situations, the leaf value typically equals the average response of all training observations in the leaf. In classification settings, it may be the most frequent class in the leaf or the vector of all class probabilities.
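To make the recursive partitioning concrete, here is a minimal sketch (base R only, squared error loss; names and toy data are my own) of how the best split on a single numeric covariate is found. A real implementation repeats this search over all covariates and split positions and then recurses into the two resulting pieces.

# Find the best "x < s?" split of a numeric covariate x for response y,
# measured by the reduction in total squared error (illustrative sketch)
best_split <- function(x, y) {
  sse <- function(z) sum((z - mean(z))^2)
  candidates <- sort(unique(x))[-1]  # thresholds leaving both pieces non-empty
  loss <- sapply(candidates, function(s) sse(y[x < s]) + sse(y[x >= s]))
  c(split = candidates[which.min(loss)], improvement = sse(y) - min(loss))
}

# Toy usage: the response jumps at x = 0.5
set.seed(1)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.1)
best_split(x, y)  # split position close to 0.5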
The concept of a decision tree is best understood with an example.
We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area, and agecat.
library(rpart)
library(rpart.plot)
library(insuranceData)

data(dataCar)

# Fit a classification tree of depth 3, using "information" (cross-entropy)
# as split criterion; cp = -1 disables cost-complexity pruning
fit <- rpart(
  clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
  data = dataCar,
  method = "class",
  parms = list(split = "information"),
  xval = 0,
  cp = -1,
  maxdepth = 3
)

# Visualize the tree
prp(fit, type = 2, extra = 7, shadow.col = "gray",
    faclen = 0, box.palette = "auto", branch.type = 4,
    varlen = 0, cex = 0.9, digits = 3, split.cex = 0.8)
dataCar[1, c("agecat", "veh_value", "veh_body")]
predict(fit, dataCar[1, ])
## 0 1
## 1 0.9330357 0.06696429
Comments

The first split asks whether agecat >= 5. How was this split chosen? The algorithm scans all covariates for all possible split positions and picks the one with the best average loss improvement. In this case, splitting on the covariate agecat at the value 5 reduced the average loss most.

Properties of decision trees
These properties typically translate 1:1 to combinations of trees like random forests or boosted trees.
In 2001, Leo Breiman introduced a very powerful tree-based algorithm called the random forest, see [2]. A random forest consists of many decision trees. To ensure that the trees differ, two sources of randomness are injected: each tree is grown on a bootstrap sample of the training data, and each split considers only a random subset of the covariates.
Predictions are found by pooling the predictions of all trees, e.g., by averaging or majority voting.
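The following toy sketch (my own illustration on the built-in mtcars data, not part of the lecture example) mimics these two sources of randomness with rpart trees and pools the predictions by averaging. Note the simplification: a real random forest redraws the covariate subset at every split, not once per tree.

# Hand-rolled mini "random forest" for mpg (illustrative sketch only)
library(rpart)

n_trees <- 100
n <- nrow(mtcars)
preds <- matrix(NA, nrow = n, ncol = n_trees)

set.seed(1)
for (b in seq_len(n_trees)) {
  rows <- sample(n, replace = TRUE)                        # randomness 1: bootstrap rows
  vars <- sample(c("cyl", "disp", "hp", "wt", "qsec"), 3)  # randomness 2: random covariate subset
  tree <- rpart(reformulate(vars, response = "mpg"),
                data = mtcars[rows, ], cp = 0, minsplit = 5)
  preds[, b] <- predict(tree, mtcars)
}

# Pooling: the forest prediction is the average over all trees
rf_pred <- rowMeans(preds)
head(cbind(mtcars$mpg, rf_pred))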
Comments about random forests
Let us now fit a random forest for diamond prices with typical parameters and 500 trees. 80% of the data is used for training, the remaining 20% for evaluating the performance. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)
library(ggplot2)
library(withr)
library(ranger)
library(MetricsWeighted)
library(hstats)
# Train/test split
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
# Fit a random forest with 500 trees and impurity-based variable importance
fit <- ranger(
price ~ carat + color + cut + clarity,
num.trees = 500,
data = diamonds[ix, ],
importance = "impurity",
seed = 83
)
fit
## Ranger result
##
## Call:
## ranger(price ~ carat + color + cut + clarity, num.trees = 500, data = diamonds[ix, ], importance = "impurity", seed = 83)
##
## Type: Regression
## Number of trees: 500
## Sample size: 43152
## Number of independent variables: 4
## Mtry: 2
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 308191.3
## R squared (OOB): 0.9804707
# Performance on test data
pred <- predict(fit, diamonds[-ix, ])$predictions
rmse(diamonds$price[-ix], pred) # 553 USD
## [1] 553.37
train_mean <- mean(diamonds[["price"]][ix])
r_squared(diamonds$price[-ix], pred, reference_mean = train_mean) # 0.9814
## [1] 0.981392
Comments
In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to study at least variable importance and the strongest effects, not just its performance. A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model.
There are different approaches to measuring the importance of a covariate. Since there is no general mathematical definition of "importance", the results of different approaches might be inconsistent with each other. For tree-based methods, a usual approach is to measure how many times a covariate X was used in a split or how much total loss improvement came from splits on X.
Approaches that work for any supervised model (including neural nets) include permutation importance and SHAP importance.
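As a minimal sketch of permutation importance (reusing the fit, ix, and packages from the random forest example above; the helper name and the number of repetitions are my own choices): a covariate is shuffled in the test data, and the resulting deterioration in test RMSE measures its importance.

# Permutation importance sketch: increase in test RMSE after shuffling a covariate
perm_importance <- function(model, X, y, vars, n_rep = 3) {
  base_rmse <- rmse(y, predict(model, X)$predictions)
  sapply(vars, function(v) {
    mean(replicate(n_rep, {
      X_perm <- X
      X_perm[[v]] <- sample(X_perm[[v]])  # break the link between v and the response
      rmse(y, predict(model, X_perm)$predictions) - base_rmse
    }))
  })
}

perm_importance(
  fit, X = diamonds[-ix, ], y = diamonds$price[-ix],
  vars = c("carat", "color", "cut", "clarity")
)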
One of the main reasons for the success of modern methods like random forests is the fact that they automatically learn interactions between two or more covariates. Thus, the effect of a covariate X typically depends on the values of other covariates. In the extreme case, the effect of X is different for each observation. The best we can do is to study the average effect of X over many observations, i.e., to average the effect over the interactions. This leads us to partial dependence plots. They work for any supervised ML model and are constructed as follows: A couple of observations are selected. Then, their average prediction is visualized against X while sliding their value of X over a reasonable grid of values, keeping all other covariates fixed. The more natural this Ceteris Paribus clause, the more reliable the partial dependence plot.
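As a sketch of this construction (reusing fit, ix, and the packages from the random forest example above; sample size and grid are arbitrary choices), the partial dependence of the diamond model on carat could be computed by hand as follows. The partial_dep() calls further below do essentially the same, with more care.

# Hand-rolled partial dependence of the random forest on "carat"
set.seed(1)
X_small <- diamonds[sample(ix, 500), c("carat", "color", "cut", "clarity")]
grid <- seq(0.3, 2.5, by = 0.1)

pd <- sapply(grid, function(g) {
  X_mod <- X_small
  X_mod$carat <- g                        # slide carat to the grid value, all else fixed
  mean(predict(fit, X_mod)$predictions)   # average prediction over the sample
})

plot(grid, pd, type = "l", xlab = "carat", ylab = "Average prediction")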
Remark: The partial dependence plot of a covariate in a linear regression (without interactions) is simply a straight line whose slope equals the corresponding coefficient.
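A quick sketch illustrating this remark (toy data and variable names of my own choosing, base R only):

# In a linear model without interactions, the PDP of x1 is a straight line
# whose slope equals the fitted coefficient of x1
set.seed(1)
df <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- 2 * df$x1 - df$x2 + rnorm(200, sd = 0.1)
fit_lm <- lm(y ~ x1 + x2, data = df)

grid <- seq(0, 1, by = 0.25)
pd <- sapply(grid, function(g) mean(predict(fit_lm, transform(df, x1 = g))))
diff(pd) / diff(grid)  # constant, equal to coef(fit_lm)["x1"] (about 2)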
Alternatives to partial dependence plots include accumulated local effect plots and SHAP dependence plots. Both relax the Ceteris Paribus clause.
For our last example, we will now look at variable importance and partial dependence plots.
# Variable importance regarding MSE improvement
imp <- sort(importance(fit))
imp <- imp / sum(imp)
barplot(imp, horiz = TRUE, col = "chartreuse4")
# Partial dependence plots
for (v in c("carat", "color", "cut", "clarity")) {
p <- partial_dep(fit, v = v, X = diamonds[ix, ]) |>
plot() +
ggtitle(paste("PDP for", v))
print(p)
}
Comments

carat is the most important predictor.

Exercise: Model the binary claim indicator clm with a random forest, using the covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing the OOB error on the training data. Make sure to fit a probability random forest, i.e., one predicting probabilities, not classes. Evaluate the final model on an independent test data set. (Note that the "ranger" package uses the "Brier score" as the evaluation metric for probabilistic predictions; in the binary case, it is the same as the MSE.) Interpret the results by split gain importance and partial dependence plots.

The idea of boosting was introduced by Schapire in 1990 [3] and roughly works as follows: A simple model is fit to the data. Then, another simple model is added that tries to correct the errors of the first model. This process is repeated until some stopping criterion triggers. As simple models or base learners, small decision trees are usually used, an idea pushed by Jerome Friedman in his famous 2001 article on the very general framework of gradient boosting machines [4].
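The following toy sketch (my own illustration on mtcars with squared error loss; learning rate and number of rounds are arbitrary) shows the core idea: each small tree is fit to the residuals, i.e., the errors of the current model, and its prediction is added with a small weight.

# Toy boosting sketch with squared error loss and small rpart trees
library(rpart)

lr <- 0.1        # learning rate: weight of each new tree
n_rounds <- 100
dat <- mtcars
pred <- rep(mean(dat$mpg), nrow(dat))   # start from the global mean

for (k in seq_len(n_rounds)) {
  dat$res <- dat$mpg - pred                                 # current errors (residuals)
  stump <- rpart(res ~ cyl + disp + hp + wt, data = dat, maxdepth = 2)
  pred <- pred + lr * predict(stump, dat)                   # small correction step
}

# In-sample RMSE shrinks as rounds are added (beware of overfitting!)
sqrt(mean((dat$mpg - pred)^2))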
Modern variants of such gradient boosted trees are XGBoost, LightGBM, and CatBoost. These are the predominant algorithms in ML competitions on tabular data; see this comparison for differences (screenshot as of Feb. 7, 2025).
Predictions are calculated similarly to random forests, i.e., by combining the predictions of all trees. As loss/objective function, one can choose among many possibilities. Often, using the same loss function as the corresponding GLM is a good choice.
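For instance (parameter lists are illustrative and not attached to data; the objective strings are those documented by XGBoost), typical insurance-flavored choices would be:

# Objectives mirroring common GLM families (sketch only)
params_poisson <- list(objective = "count:poisson", learning_rate = 0.05)   # claim counts ~ Poisson GLM
params_gamma   <- list(objective = "reg:gamma", learning_rate = 0.05)       # claim severity ~ Gamma GLM
params_binary  <- list(objective = "binary:logistic",                       # claim probability ~ logistic GLM
                       eval_metric = "logloss", learning_rate = 0.05)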
As an initial example on gradient boosting and XGBoost, we fit a model for diamond prices with the squared error as loss function. The number of rounds/trees is initially chosen by cross-validation and early stopping, i.e., trees are added until the CV validation (R)MSE stops improving for a couple of rounds. The learning rate (the weight of each tree) is chosen by trial and error in order to end up with a reasonable number of trees, see the next section for more details.
library(ggplot2)
library(withr)
library(xgboost)
library(MetricsWeighted)
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[[y]][ix]
X_train <- diamonds[ix, xvars]
y_test <- diamonds[[y]][-ix]
X_test <- diamonds[-ix, xvars]
# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)
# Minimal set of parameters
params <- list(
objective = "reg:squarederror",
learning_rate = 0.02
)
# Add trees until CV validation MSE stops improving over the last 20 rounds
cvm <- xgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
showsd = FALSE,
print_every_n = 50
)
## [1] train-rmse:5467.195833 test-rmse:5466.731231
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
##
## [51] train-rmse:2110.914866 test-rmse:2114.931255
## [101] train-rmse:956.782913 test-rmse:969.281048
## [151] train-rmse:618.278608 test-rmse:641.135898
## [201] train-rmse:536.683784 test-rmse:566.338548
## [251] train-rmse:514.616839 test-rmse:549.390725
## [301] train-rmse:505.662478 test-rmse:545.010866
## [351] train-rmse:498.994457 test-rmse:543.668008
## Stopping. Best iteration:
## [349] train-rmse:499.215916+2.259174 test-rmse:543.632124+9.772028
# Fit model on full training data using optimal number of boosting rounds
fit <- xgb.train(
params = params, data = dtrain, print_every_n = 50, nrounds = cvm$best_iteration
)
# Test performance
rmse(y_test, predict(fit, data.matrix(X_test))) # 541.6
## [1] 541.6217
Comments:
Gradient boosted trees offer quite a lot of parameters. Unlike random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:
Number of boosting rounds: In contrast to random forests, more trees/rounds is not always beneficial because the model begins to overfit after some time. The optimal number of rounds is usually found by early stopping, i.e., one lets the algorithm stop as soon as the (cross-)validation performance stops improving, see the example above.
Learning rate: The learning rate determines the training speed and the impact of each tree on the final model. Typical values are between 0.01 and 0.5. In practical applications, it is set to a value that leads to a reasonable number of trees (100-1000). Usually, halving the learning rate requires roughly twice as many boosting rounds for comparable performance.
Regularization parameters: Depending on the implementation, additional parameters include the maximum tree depth, the minimum leaf size (or minimum child weight), L1/L2 penalties, row and column subsampling rates, and the minimum loss reduction required for a split (see the grid search example below).
Reasonable regularization parameters are chosen by trial and error or systematically by randomized or grid search CV. Usually, it takes a couple of iterations until the range of the parameter values has been set appropriately.
Overall, the modelling strategy is as follows: First, choose the loss function and a learning rate that leads to a reasonable number of boosting rounds, determined by early stopping on cross-validation. Then, tune the regularization parameters by randomized or grid search CV, again with early stopping. Finally, refit the model on the full training data with the best parameter combination and the corresponding number of rounds.
Note: Since the learning rate, the number of boosting rounds, and the regularization parameters are heavily interdependent, a "big" randomized grid search CV over all of them at once is often not ideal. The above suggestion (fix the learning rate, select the number of rounds by early stopping, and grid search only the regularization parameters) is more focused, see also the example below.
We will use XGBoost to fit diamond prices using the squared error as loss function and RMSE as evaluation metric, now using the tuning strategy outlined above.
library(ggplot2)
library(withr)
library(xgboost)
library(MetricsWeighted)
library(hstats)
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[[y]][ix]
X_train <- diamonds[ix, xvars]
y_test <- diamonds[[y]][-ix]
X_test <- diamonds[-ix, xvars]
# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)
# If grid search is to be run again, set tune <- TRUE
# Note that if run as rmarkdown, the path to the grid is "gridsearch",
# otherwise it is "r/gridsearch"
tune <- FALSE
if (tune) {
# Use default parameters to set learning rate with suitable number of rounds
params <- list(
objective = "reg:squarederror",
learning_rate = 0.02
)
# Cross-validation
cvm <- xgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
showsd = FALSE,
print_every_n = 50
)
cvm # -> a lr of 0.02 provides about 370 trees, which is a convenient amount
# Final grid search after some iterations
grid <- expand.grid(
iteration = NA,
cv_score = NA,
train_score = NA,
objective = "reg:squarederror",
learning_rate = 0.02,
max_depth = 6:7,
reg_lambda = c(0, 2.5, 5, 7.5),
reg_alpha = c(0, 4),
colsample_bynode = c(0.8, 1),
subsample = c(0.8, 1),
min_split_loss = c(0, 1e-04),
min_child_weight = c(1, 10)
)
# Grid search or randomized search if grid is too large
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
grid <- grid[sample(grid_size, max_size), ]
grid_size <- max_size
}
# Loop over grid and fit XGBoost with five-fold CV and early stopping
pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) {
cvm <- xgb.cv(
params = as.list(grid[i, -(1:3)]),
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
verbose = 0
)
# Store result
grid[i, 1] <- cvm$best_iteration
grid[i, 2:3] <- cvm$evaluation_log[, c(4, 2)][cvm$best_iteration]
setTxtProgressBar(pb, i)
# Save grid to survive hard crashes
saveRDS(grid, file = "gridsearch/diamonds_xgb.rds")
}
}
# Load grid and select best iteration
grid <- readRDS("gridsearch/diamonds_xgb.rds")
grid <- grid[order(grid$cv_score), ]
head(grid)
# Fit final, tuned model
fit <- xgb.train(
params = as.list(grid[1, -(1:3)]),
data = dtrain,
nrounds = grid[1, "iteration"]
)
Now, the model is ready to be inspected by evaluating its test performance, variable importance, and partial dependence plots.
# Performance on test data
pred <- predict(fit, data.matrix(X_test))
rmse(y_test, pred) # 539.6
## [1] 539.6096
r_squared(y_test, pred, reference_mean = mean(y_train)) # 0.9823
## [1] 0.9823059
# Variable importance regarding MSE improvement
imp <- xgb.importance(model = fit)
xgb.plot.importance(imp)
# Partial dependence plots
pred_fun <- function(m, X) predict(m, data.matrix(X))
for (v in xvars) {
p <- partial_dep(fit, v = v, X = X_train, pred_fun = pred_fun) |>
plot() +
ggtitle(paste("PDP for", v))
print(p)
}
Comment: The resulting model seems comparable to the random forest with slightly better performance. The grid search did not improve the results in this case.
Exercise 1: Study the online documentation of XGBoost to figure out how to make the model monotonically increasing in carat. Test your insights without rerunning the grid search in our last example, i.e., just by refitting the final model. How does the partial dependence plot for carat look now?
Exercise 2: Develop an XGBoost model for the claims data set with binary response clm and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss as loss function and evaluation metric. Interpret the results. You don't need to write all the code from scratch, but rather modify the XGBoost code from the lecture notes.
Exercise 3 (optional): Study the documentation of LightGBM. Use LightGBM to develop a competitor to the XGBoost claims model from Exercise 2. The XGBoost code needs to be slightly adapted. Compare grid search times.
In this chapter, we have met decision trees, random forests and tree boosting. Single decision trees are very easy to interpret but do not perform too well. On the other hand, tree ensembles like the random forest or gradient boosted trees usually perform very well but are tricky to interpret. We have introduced interpretation tools to look into such “black boxes”. The main reason why random forests and boosted trees often provide more accurate models than a linear model lies in their ability to automatically learn interactions and other non-linear effects.
[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone, "Classification and Regression Trees", Wadsworth, Belmont, CA, 1984.
[2] L. Breiman, "Random Forests", Machine Learning, 45(1), 2001.
[3] R. Schapire, "The Strength of Weak Learnability", Machine Learning, 5(2), 1990.
[4] J. Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", Annals of Statistics, 29(5), 2001.