A decision tree is a simple, easy-to-interpret modeling technique for both regression and classification problems. Compared to other methods, decision trees usually do not perform very well. Their relevance lies in the fact that they are the building blocks of two of the most successful ML algorithms: random forests and gradient boosted trees. In this chapter, we will introduce these tree-based methods. For more background, the reader is referred to the main references Hastie, Tibshirani, and Friedman (2001) and James et al. (2021).
On our journey to estimate the model f by \hat f, we have so far mainly considered linear functions f. We now turn to another class of functions: decision trees. They were introduced in Breiman et al. (1984) and are sometimes called “Classification and Regression Trees” (CART).
(Binary) decision trees are computed recursively by partitioning the data into two parts. Partitions are chosen to optimize total loss by asking the best “yes/no” question about the covariates, e.g.: Is the age of the driver \le 22?
For regression problems, the most frequently used loss function L is the squared error. For classification, it is “information” (= cross-entropy = log loss = half the unit logistic deviance) or the very similar Gini impurity.
Predictions are computed by sending an observation down the tree, starting with the question at the “trunk” and ending in a leaf, say in leaf j. The prediction is the value \gamma_j associated with that leaf. In regression situations, \gamma_j usually corresponds to the average response of all training observations that fall into that leaf. In classification situations, it may be the most frequent class in the leaf or the vector of class probabilities.
More formally, let R_1, \dots, R_J denote the terminal regions corresponding to the J leaves in the sense that \boldsymbol x falls in leaf j \Leftrightarrow \boldsymbol x \in R_j. Then, the prediction function can be expressed as \hat f(\boldsymbol x) = \sum_{j = 1}^{J} \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}.
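To illustrate the formula, here is a minimal sketch with J = 2 hypothetical regions on a single covariate (regions and leaf values are invented for illustration):
# Prediction function f_hat(x) = sum_j gamma_j * 1{x in R_j}
# with hypothetical regions R_1 = {x <= 22} and R_2 = {x > 22}
gamma <- c(0.10, 0.05)
f_hat <- function(x) gamma[1] * (x <= 22) + gamma[2] * (x > 22)
f_hat(c(19, 30))  # 0.10 0.05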
The concept of a decision tree is best understood with an example.
We will use the dataCar data set to predict the claim probability with a decision tree. As features, we use veh_value, veh_body, veh_age, gender, area, and agecat.
library(rpart)
library(rpart.plot)
library(insuranceData)
data(dataCar)
fit <- rpart(
clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
data = dataCar,
method = "class",
parms = list(split = "information"),
xval = 0,
cp = -1,
maxdepth = 3
)
prp(fit, type = 2, extra = 7, shadow.col = "gray",
faclen = 0, box.palette = "auto", branch.type = 4,
varlen = 0, cex = 0.9, digits = 3, split.cex = 0.8)
dataCar[1, c("agecat", "veh_value", "veh_body")]
predict(fit, dataCar[1, ])
## 0 1
## 1 0.9330357 0.06696429
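As a quick check of the statement that leaf values correspond to class frequencies (with rpart's default priors), here is a small sketch using the fitted tree from above; fit$where stores the leaf assigned to each training row:
# Claim frequency in the leaf of the first observation
leaf_1 <- fit$where[1]
mean(dataCar$clm[fit$where == leaf_1])  # should match the class-1 probability shown above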
Comments

Why was the first split (agecat >= 5) selected? The algorithm searches all covariates for all possible split positions and selects the one with the best total loss improvement. In this case, the split on the covariate agecat at the value 5 reduced the total loss the most. As loss function, we used “information”, which is equivalent to (half) the unit logistic deviance.

Here, we list a couple of properties of decision trees.
These properties (except the last one) translate to combinations of trees like random forests or boosted trees.
Before we delve into these tree ensemble techniques, here is a short video of Jerome Friedman explaining how he tried to efficiently approximate nearest neighbor and came up with a new method: decision trees.
In 2001, Leo Breiman introduced a powerful tree-based algorithm called random forest (Breiman 2001). A random forest consists of many decision trees. The trees differ for two reasons. One of them is an ensembling technique called “bagging”. Before studying random forests, we first have a look at bagging.
Combining multiple models (base learners) into a single one is a common technique to slightly improve the overall performance of a model. This is referred to as model ensembling. For example, one could create an ensemble model by running multiple k-nearest neighbor regression models, each with a different value for k (e.g., 5, 6, 7), and then averaging the resulting predictions. The combined model predictions tend to have lower variance and thus often a better out-of-sample performance, just like a diversified stock portfolio.
Another way of creating an ensemble model is based on the Bootstrap: Bagging (from “bootstrap aggregating”) draws B bootstrap samples and fits a model on each of those samples. The resulting B models are then combined into a single one.
Algorithm: bagging (for regression)

1. Draw B bootstrap samples from the training data.
2. For b = 1, \dots, B: fit a model \hat f^{*b} on the b-th bootstrap sample.
3. Combine the B models by averaging their predictions: \hat f(\boldsymbol x) = \frac{1}{B} \sum_{b = 1}^{B} \hat f^{*b}(\boldsymbol x).
For classification, the last step is replaced by taking averages for each predicted probability or doing majority voting.
Let’s try out the idea of bagging using the diamonds data. We first fit a relatively deep decision tree on the training data without bagging. This is our reference model. Then, we fit a tree on each of 20 bootstrapped versions of the training data. In order to evaluate the results, we store the validation predictions of each of the 20 bootstrapped models. We evaluate them by taking their successive means (averaging the predictions of the first b models for b = 1, \dots, 20).
library(rpart)
library(tidyverse)
library(withr)
RMSE <- function(y, pred) {
sqrt(mean((y - pred)^2))
}
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
# Split diamonds into 80% for "training" and 20% for validation
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
train <- diamonds[ix, ]
valid <- diamonds[-ix, ]
# Tree without bagging for comparison
form <- log_price ~ log_carat + clarity + color + cut
fit_unbagged <- rpart(form, data = train, xval = 0, cp = -1, maxdepth = 8)
rmse_unbagged <- RMSE(valid$log_price, predict(fit_unbagged, valid))
# Fit B models on B bootstrap samples and store their predictions on the validation data
# (one column per bootstrap sample)
B <- 20
valid_pred <- matrix(nrow = nrow(valid), ncol = B)
for (j in 1:B) {
data_boot <- train[sample(nrow(train), replace = TRUE), ]
fit_boot <- rpart(form, data = data_boot, xval = 0, cp = -1, maxdepth = 8)
valid_pred[, j] <- predict(fit_boot, valid)
}
# Very high correlations across models (subset of three models shown)
round(cor(valid_pred[, 1:3]), 3)
## [,1] [,2] [,3]
## [1,] 1.000 0.997 0.997
## [2,] 0.997 1.000 0.997
## [3,] 0.997 0.997 1.000
# Rowwise successive mean predictions
pred_bagged <- t(apply(valid_pred, 1, function(z) cumsum(z) / 1:length(z)))
# Validation RMSE per column
rmse_bagged <- apply(pred_bagged, 2, function(z) RMSE(valid$log_price, z))
# Visualization
rmse_bagged |>
as_tibble_col("Validation RMSE") |>
mutate(`Bagging round` = row_number()) |>
ggplot(aes(`Bagging round`, `Validation RMSE`)) +
geom_point(color = "chartreuse4") +
geom_hline(yintercept = rmse_unbagged)
Comments
Remarks on bagging
We have mentioned that ensemble models try to reduce variance to improve their test performance. But what does variance mean in this context? And why does a smaller variance lead to better test performance? For the MSE as performance measure, we have the following decomposition of the generalization error (James et al. 2021):
\underbrace{\mathbb E(y_o - \hat f(x_o))^2}_{\text{Expected test MSE at } x_o} = \text{Var}(\hat f(x_o)) + \left[\text{Bias}(\hat f(x_o))\right]^2 + \underbrace{\text{Var}(\varepsilon)}_{\text{Irreducible error}}, where y_o = f(x_o) + \varepsilon is the response of a test observation with covariate vector x_o, and variance and bias refer to the variability and the systematic error of \hat f(x_o) over different training samples.
Models with low bias (e.g. deep decision trees or k-nearest-neighbor with small k) usually have high variance. On the other hand, models with low variance (like linear regression) tend to have high bias. This is the famous Bias-Variance Trade-Off. A bagged decision tree tries to have low bias (deep trees) and low variance (from bagging).
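To make the decomposition tangible, here is a minimal simulation sketch with an invented data-generating process (sin(2\pi x) plus noise): a very deep tree has low bias but high variance at a test point, a stump the opposite.
library(rpart)
set.seed(1)
x0 <- 0.5                        # test point
truth <- sin(2 * pi * x0)        # true f(x0)
pred_deep <- pred_stump <- numeric(200)
for (b in 1:200) {               # 200 independent training sets
  x <- runif(100)
  dat <- data.frame(x = x, y = sin(2 * pi * x) + rnorm(100, sd = 0.3))
  pred_deep[b] <- predict(rpart(y ~ x, data = dat, cp = -1, minsplit = 2, xval = 0), data.frame(x = x0))
  pred_stump[b] <- predict(rpart(y ~ x, data = dat, maxdepth = 1, xval = 0), data.frame(x = x0))
}
c(var_deep = var(pred_deep), bias2_deep = (mean(pred_deep) - truth)^2)
c(var_stump = var(pred_stump), bias2_stump = (mean(pred_stump) - truth)^2)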
To turn a bagged decision tree into a random forest, a second source of randomness is injected: Each split scans only a random selection of “mtry” of the p covariates to find the best split; typical choices are \sqrt{p} or p/3. “mtry” is the main tuning parameter of a random forest. This extra randomness further decorrelates the bagged trees, which improves the diversification effect. Properties of bagging also apply to random forests.
Algorithm: random forest (for regression)

1. Draw B bootstrap samples from the training data.
2. For b = 1, \dots, B: grow a (typically deep) decision tree \hat f^{*b} on the b-th bootstrap sample, where each split search considers only mtry randomly selected covariates.
3. Combine the B trees by averaging their predictions: \hat f(\boldsymbol x) = \frac{1}{B} \sum_{b = 1}^{B} \hat f^{*b}(\boldsymbol x).
For classification, the last step is again replaced by taking averages for each predicted probability or doing majority voting.
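To illustrate the role of mtry, here is a minimal sketch on the built-in mtcars data (chosen only for illustration, not one of the lecture data sets), comparing out-of-bag MSE for a few values of mtry:
library(ranger)
# OOB mean squared error as a function of mtry (mtcars has ten covariates)
sapply(1:5, function(m) {
  ranger(mpg ~ ., data = mtcars, mtry = m, num.trees = 500, seed = 1)$prediction.error
})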
Comments about random forests
Let us now fit a random forest for (logarithmic) diamond prices with default parameters and 500 trees. 80% of the data is used for training, the other 20% we use for evaluating the performance. (Throughout the rest of the lecture, we will ignore the problematic aspect of having repeated rows for some diamonds.)
library(ggplot2)
library(withr)
library(ranger)
library(MetricsWeighted)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")
# Train/test split
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
fit <- ranger(
reformulate(xvars, y),
num.trees = 500,
data = diamonds[ix, ],
importance = "impurity",
seed = 83
)
fit
## Ranger result
##
## Call:
## ranger(reformulate(xvars, y), num.trees = 500, data = diamonds[ix, ], importance = "impurity", seed = 83)
##
## Type: Regression
## Number of trees: 500
## Sample size: 43152
## Number of independent variables: 4
## Mtry: 2
## Target node size: 5
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 0.01080754
## R squared (OOB): 0.9894831
# Performance on test data
pred <- predict(fit, diamonds[-ix, ])$predictions
y_test <- diamonds[-ix, y]
rmse(y_test, pred) # 0.1020
## [1] 0.1020477
# Reference (intercept only) model "fitted" on training data
r_squared(y_test, pred, reference_mean = mean(diamonds[ix, y])) # 0.9900
## [1] 0.989959
# Size of the fitted model
format(object.size(fit), units = "auto")
## [1] "125.1 Mb"
Comments
In contrast to a single decision tree or a linear model, a combination of many trees is not easy to interpret. It is good practice for any ML model to not only study its predictive performance, but also to shed light on how the model uses its covariates.
A pure prediction machine is hardly of any interest and might even contain mistakes like using covariates derived from the response. Model interpretation helps to fight such problems and thus also to increase trust in a model and its creator. The field of ML/statistics caring about the interpretation of complex models is called XAI (from eXplainable Artificial Intelligence). We present some of the methods here, and some others later in the chapter on neural networks.
Most of the text of this section is adapted from the responsible ML lecture.
There are different approaches to measure the importance of a covariate. Since no general mathematical definition of “importance” exists, the results of different approaches may be inconsistent with each other. For tree-based methods, a usual approach is to measure how many times a covariate X^{(j)} was used in a split or how much total loss improvement came from splitting on X^{(j)}. For a GLM, we could study partial Chi-squared tests (model with feature X^{(j)} versus model without).
Approaches that work for any supervised model include permutation importance (see later) and SHAP importance (not covered).
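For the GLM case just mentioned, a partial Chi-squared test could look as follows (a sketch on the dataCar claims data from the decision tree example; the choice of covariates is arbitrary):
library(insuranceData)
data(dataCar)
full  <- glm(clm ~ veh_value + veh_age + agecat, data = dataCar, family = binomial())
small <- glm(clm ~ veh_value + veh_age, data = dataCar, family = binomial())
anova(small, full, test = "Chisq")  # deviance reduction attributable to agecat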
Let us study relative split gain importance regarding MSE improvement for the features in our random forest.
imp <- sort(importance(fit))
imp <- imp / sum(imp)
barplot(imp, horiz = TRUE, col = "chartreuse4")
Comment: As expected, logarithmic carat is by far the most important feature.
One of the advantages of an additive linear regression over a complex ML model is its simple structure. The coefficient(s) of a feature X^{(j)} immediately tells us how predictions will react to changes in X^{(j)}, keeping all other covariates fixed (“Ceteris Paribus”). Due to additivity, the effect of X^{(j)} is the same for all observations.
But what if the linear regression contains complex interaction effects? Or if the model is an ML model such as a random forest or k-nearest neighbor? Options include studying individual conditional expectations and/or partial dependence plots.
The individual conditional expectation (ICE) function (Goldstein et al. 2015) for covariate X^{(j)} of model f and an observation with covariate vector \boldsymbol x \in {\mathbb R}^p is given by \text{ICE}_j: v \in {\mathbb R} \mapsto f(v, \boldsymbol x_{\setminus j}), where \boldsymbol x_{\setminus j} denotes all but the j-th component of \boldsymbol x, which is being replaced by the value v.
The corresponding ICE curve represents the graph (v, \text{ICE}_j(v)) for a grid of values v \in \mathbb R. An ICE plot finally shows the ICE curves of many observations.
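To connect the definition with code, here is a manual ICE sketch for log_carat, assuming the ranger fit, the training indices ix, and xvars from the random forest section (the grid of values v is chosen for illustration):
library(ranger)
obs <- diamonds[ix[1:5], xvars]          # five observations
grid <- seq(-1.5, 1.5, by = 0.5)         # grid of log_carat values v
ice_manual <- sapply(grid, function(v) {
  obs_mod <- obs
  obs_mod$log_carat <- v                 # replace the j-th component by v
  predict(fit, obs_mod)$predictions      # ICE_j(v) for each of the five observations
})
ice_manual                               # one row per observation, one column per grid value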
Remarks
Pros and Cons
How do ICE and c-ICE plots for two selected features look?
library(hstats)
# ICE and c-ICE plots of log_carat and clarity
v <- "log_carat"
ic <- ice(fit, v = v, X = diamonds[ix, ])
plot(ic, color = "chartreuse4") +
ggtitle(paste("ICE plot for", v))
plot(ic, color = "chartreuse4", center = TRUE) +
ggtitle(paste("c-ICE plot for", v))
v <- "clarity"
ic <- ice(fit, v = v, X = diamonds[ix, ])
plot(ic, color = "chartreuse4") +
ggtitle(paste("ICE plot for", v))
plot(ic, color = "chartreuse4", center = TRUE) +
ggtitle(paste("c-ICE plot for", v))
Comments

- log_carat: The curves are increasing, which makes sense: everything else fixed, the expected (log) price increases with (log) carat. For very large values of log carat, predictions tail off. This is not realistic, but a consequence of the fact that tree-based models cannot extrapolate.
- log_carat: Vertical scatter is not strong, indicating weak interaction effects. If our model used price as the response, the interaction effects would be much stronger (why?).
- clarity: Better clarity is associated with higher predictions. This trend is easier to see in the centered ICE plot. There, we can also see strong interaction effects (some curves are much steeper than others, etc.).

The pointwise average of many ICE curves is called a partial dependence plot (PDP), originally introduced by Friedman (2001) in his seminal work on gradient boosting. It describes the effect of the variable X^{(j)}, averaged over all interaction effects and holding all other variables fixed. The empirical partial dependence function for model \hat f and feature X^{(j)} is given by \text{PD}_j(v) = \frac{1}{n} \sum_{i = 1}^n \hat f(v, \boldsymbol x_{i,\setminus j}), where \boldsymbol x_{i,\setminus j} denotes the feature vector without the j-th component of observation i in some reference dataset, often the training or the test data.
The corresponding PDP represents the graph (v, \text{PD}_j(v)) for a grid of values v \in \mathbb R.
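The empirical PD function can also be computed by hand. A minimal sketch for log_carat, again assuming the ranger fit and the diamonds training data from above:
# PD_j(v): average prediction with log_carat set to v over all training rows
X_ref <- diamonds[ix, xvars]
pd_manual <- function(v) {
  X_mod <- X_ref
  X_mod$log_carat <- v
  mean(predict(fit, X_mod)$predictions)
}
sapply(c(-1, 0, 1), pd_manual)  # PD at three grid values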
Pros and Cons
An alternative to the partial dependence plot, the accumulated local effects (ALE) plot (Apley and Zhu 2016), attempts to overcome the first two (related) downsides. However, ALE plots are more difficult to calculate and to interpret. We will not consider them here.
Let us now study the partial dependence plots of all features.
for (v in xvars) {
p <- partial_dep(fit, v = v, X = diamonds[ix, ]) |>
plot(color = "chartreuse4") +
ggtitle(paste("PDP for", v))
print(p)
}
Comment: All effects as assessed by partial dependence plots make sense.
Another way of combining decision trees into an ensemble model is gradient boosting. Gradient boosted trees are among the most successful ML models. We will first explain what boosting is. Then, we introduce the gradient boosted trees algorithm.
The idea of boosting was investigated e.g. in Schapire (1990) and could be described as follows: starting from a simple initial model \hat f, one sequentially adds small updates (base learners) \hat f^k via \hat f \leftarrow \hat f + \hat f^k, where each update is chosen to improve the current combined model.
As base learners (or weak learners) f^k, often short decision trees are used. But how to find them specifically?
AdaBoost: In 1995, Freund and Schapire proposed the famous AdaBoost algorithm for binary classification. It combines decision trees with a particular reweighting scheme.
Gradient boosting: While AdaBoost is still in use, the most common way to find the updates \hat f^k goes back to Jerome Friedman and his article on Gradient Boosting Machines (Friedman 2001). The key idea of gradient boosting is to choose \hat f^k to make the total loss Q(\hat f + \hat f^k) = \sum_{i = 1}^n L(y_i, \hat f(\boldsymbol x_i) + \hat f^k (\boldsymbol x_i)) smaller.
Gradient boosting is gradient descent in function space. Gradient descent is a numerical method to find the (local) minimum of a smooth function h: \mathbb R^n \to \mathbb R. It works as follows: starting from an initial value \hat x, one repeatedly applies the update step \hat x \leftarrow \hat x - \lambda g until convergence. Here, \lambda > 0 is the step size or learning rate, and g = \left[\frac{\partial h(x)}{\partial x}\right]_{x = \hat x} is the gradient of h at \hat x.
It points in the direction of the steepest ascent, so the update step goes in the opposite direction to make h smaller. Imagine you are standing on a hill in foggy weather: which direction should you go to get down? You look for the direction in which the hill is steepest, and then take a step in the opposite direction, and so on.
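Here is a minimal gradient descent sketch for the function h(x_1, x_2) = x_1^2 + 2 x_2^2 (an invented example with minimum at the origin):
h_grad <- function(x) c(2 * x[1], 4 * x[2])  # gradient of h
x_hat <- c(4, -3)                            # starting value
lambda <- 0.1                                # step size / learning rate
for (k in 1:50) {
  x_hat <- x_hat - lambda * h_grad(x_hat)    # step against the gradient
}
x_hat                                        # close to the minimum (0, 0)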
The following picture on Wikipedia illustrates the situation:
Let’s apply gradient descent to the total loss Q(f) = \sum_{i = 1}^n L(y_i, f(\boldsymbol x_i)) viewed as a function of the n parameters f = \left(f(\boldsymbol x_1), \dots, f(\boldsymbol x_n)\right) \in \mathbb R^n, i.e. of the n predictions on the training data.
We proceed as in standard gradient descent: starting from an initial model, we repeatedly update the predictions via \hat f(\boldsymbol x_i) \leftarrow \hat f(\boldsymbol x_i) - \lambda g_i, where g_i = \left[\frac{\partial L(y_i, z)}{\partial z}\right]_{z = \hat f(\boldsymbol x_i)} are the components of the gradient of Q at the current predictions.
For the squared error loss L(y, z) = (y-z)^2/2, these components equal g_i = -(\underbrace{y_i - \hat f(\boldsymbol x_i)}_{\text{Residual } r_i}), i.e., the gradient descent updates equal \hat f(\boldsymbol x_i) \leftarrow \hat f(\boldsymbol x_i) + \underbrace{\lambda r_i}_{\hat f^k \text{?}}. Does this mean that the update steps in gradient boosting are \hat f^k(\boldsymbol x_i) = -\lambda g_i = \lambda r_i? This would not make sense for two major reasons: the values -\lambda g_i are only available for the training observations, so we could not predict new data, and simply adding the scaled residuals would amount to memorizing the training data.
The solution is to replace -g_i by predictions of a decision tree (or of another base learner) with covariates \boldsymbol X. This leads us to the following algorithms (slightly modified from Hastie, Tibshirani, and Friedman (2001)):
Algorithm: Gradient boosted trees for the squared error loss

1. Initialize \hat f(\boldsymbol x) = \bar y, the average response on the training data.
2. For k = 1, \dots, K:
(a) Compute the residuals r_i = y_i - \hat f(\boldsymbol x_i), i = 1, \dots, n.
(b) Fit a regression tree \hat f^k to the residuals r_i, using the covariates \boldsymbol X.
(c) Update \hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \hat f^k(\boldsymbol x), where \lambda > 0 is the learning rate.
3. Return \hat f.
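To see the algorithm at work, here is a hand-rolled sketch on the diamonds data with shallow rpart trees (for illustration only, not a production implementation; the choices of lambda, K, and maxdepth are arbitrary):
library(rpart)
library(ggplot2)  # provides the diamonds data
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
lambda <- 0.1
K <- 50
pred <- rep(mean(diamonds$log_price), nrow(diamonds))  # Step 1: start with the mean response
for (k in 1:K) {
  dat <- data.frame(
    r = diamonds$log_price - pred,                     # Step 2a: residuals
    diamonds[c("log_carat", "clarity", "color", "cut")]
  )
  tree_k <- rpart(r ~ ., data = dat, maxdepth = 3, xval = 0, cp = -1)  # Step 2b
  pred <- pred + lambda * predict(tree_k, dat)         # Step 2c: f_hat <- f_hat + lambda * f^k
}
sqrt(mean((diamonds$log_price - pred)^2))              # in-sample RMSE after K rounds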
For general loss functions, the algorithm is a bit more complicated:
Algorithm: Gradient boosted trees for general loss functions

1. Initialize \hat f(\boldsymbol x) = \hat \gamma with \hat \gamma = \text{argmin}_{\gamma} \sum_{i = 1}^n L(y_i, \gamma).
2. For k = 1, \dots, K:
(a) Compute the pseudo-residuals r_i = -\left[\frac{\partial L(y_i, z)}{\partial z}\right]_{z = \hat f(\boldsymbol x_i)}, i = 1, \dots, n.
(b) Fit a regression tree to the r_i (using the covariates \boldsymbol X), giving terminal regions R_1, \dots, R_J.
(c) For each region R_j, compute \hat \gamma_j = \text{argmin}_{\gamma} \sum_{i:\, \boldsymbol x_i \in R_j} L(y_i, \hat f(\boldsymbol x_i) + \gamma).
(d) Update \hat f(\boldsymbol x) \leftarrow \hat f(\boldsymbol x) + \lambda \sum_{j = 1}^{J} \hat \gamma_j \boldsymbol 1\{\boldsymbol x \in R_j\}.
3. Return \hat f.
Remarks
Modern implementations of gradient boosted trees include XGBoost, LightGBM, and CatBoost. These are the predominant algorithms in ML competitions on tabular data; check this comparison for differences (screenshot as of October 10, 2022).
They differ from Friedman’s original approach in multiple ways, e.g., by using histogram binning of the covariate values to speed up split finding (available in XGBoost via tree_method = "hist").
As an initial example on gradient boosting and XGBoost, we fit a model for logarithmic diamond prices with squared error as loss function. The number of rounds/trees is initially chosen by cross-validation and early stopping, i.e., until the CV validation (R)MSE stops improving for a couple of rounds. The learning rate (weight of each tree) is chosen by trial and error in order to end up with neither too small nor too large a number of trees, see the next section for more details.
library(ggplot2)
library(withr)
library(xgboost)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")
# Split into train and valid
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[ix, y]
X_train <- diamonds[ix, xvars]
y_test <- diamonds[[y]][-ix]
X_test <- diamonds[-ix, xvars]
# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)
# Minimal set of parameters
params <- list(
objective = "reg:squarederror",
learning_rate = 0.1
)
# Add trees until CV validation MSE stops improving over the last 20 rounds
cvm <- xgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
showsd = FALSE,
print_every_n = 50
)
## [1] train-rmse:6.618142 test-rmse:6.618132
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 20 rounds.
##
## [51] train-rmse:0.106174 test-rmse:0.108990
## [101] train-rmse:0.097156 test-rmse:0.101336
## [151] train-rmse:0.095869 test-rmse:0.101035
## [201] train-rmse:0.094924 test-rmse:0.100912
## Stopping. Best iteration:
## [183] train-rmse:0.095223+0.000278 test-rmse:0.100907+0.001007
# Fit model on full training data using the optimal number of boosting rounds
fit <- xgb.train(
params = params,
data = dtrain,
print_every_n = 50,
nrounds = cvm$best_iteration
)
# Test performance
rmse(y_test, predict(fit, data.matrix(X_test))) # 0.0988
## [1] 0.09881612
Comments:
Gradient boosted trees offer many parameters to select from. Unlike with random forests, they need to be tuned to achieve good results. It would be naive to use an algorithm like XGBoost without parameter tuning. Here is a selection:
Number of boosting rounds K: In contrast to random forests, more trees/rounds is not always beneficial because the model begins to overfit after some time. The optimal number of rounds is usually found by early stopping, i.e., one lets the algorithm stop as soon as the (cross-)validation performance stops improving, see the example above.
Learning rate \lambda: The learning rate determines the training speed and the impact of each tree on the final model. Typical values are between 0.001 and 1. In practical applications, it is set to a value that leads to a reasonable number of trees (100 - 1000). Usually, halving the learning rate requires roughly twice as many boosting rounds for comparable performance (see the short sketch after this list).
Regularization parameters: Depending on the implementation, additional parameters are, e.g., the tree depth (max_depth) or the number of leaves (num_leaves), the strength of the L1 and L2 penalties (reg_alpha, reg_lambda or lambda_l1, lambda_l2), row and column subsampling rates (subsample or bagging_fraction, colsample_bynode), and minimal leaf sizes or split gains (min_child_weight, min_data_in_leaf, min_split_loss).
Reasonable regularization parameters are chosen by trial and error or systematically by randomized/grid search CV. Usually, it takes a couple of manual iterations until the range of the parameter values has been set appropriately.
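To see the rule of thumb about the learning rate in action, here is a short sketch assuming the dtrain object and the squared error objective from the first XGBoost example above (runtime a few minutes):
# Compare the optimal number of rounds found by early stopping for two learning rates
for (lr in c(0.2, 0.1)) {
  cvm_lr <- xgb.cv(
    params = list(objective = "reg:squarederror", learning_rate = lr),
    data = dtrain,
    nrounds = 5000,
    nfold = 5,
    early_stopping_rounds = 20,
    verbose = 0
  )
  cat("learning_rate:", lr, "-> best iteration:", cvm_lr$best_iteration, "\n")
}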
Overall, the modeling strategy could be summarized as follows:

1. Choose the loss function and evaluation metric.
2. Fix the learning rate such that early stopping yields a reasonable number of boosting rounds (roughly 100 - 1000).
3. Run a randomized/grid search CV over the regularization parameters, selecting the number of rounds by early stopping for each combination.
4. Use the best parameter combination (including its number of rounds) to fit the final model on the training data.
Remark: Since the learning rate, the number of boosting rounds, and other regularization parameters are highly interdependent, a large randomized/grid search selecting the learning rate, the number of boosting rounds, and the regularization parameters all at once is often not ideal. The above suggestion (fix the learning rate, select the number of rounds by early stopping, and run the grid search only on the regularization parameters) is more focused, see also the example below.
We will use XGBoost to fit diamond prices using the squared error as loss function and RMSE as corresponding evaluation metric, now using a sophisticated tuning strategy.
# This code is pretty long. It may serve as a general template to fit a
# high-performance XGBoost model
library(ggplot2)
library(withr)
library(xgboost)
library(MetricsWeighted)
library(hstats)
diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))
y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
y_train <- diamonds[ix, y]
X_train <- diamonds[ix, xvars]
y_test <- diamonds[-ix, y]
X_test <- diamonds[-ix, xvars]
# XGBoost data interface
dtrain <- xgb.DMatrix(data.matrix(X_train), label = y_train)
# If grid search is to be run again, set tune <- TRUE
tune <- FALSE
if (tune) {
# Use default parameters to set learning rate with suitable number of rounds
params <- list(
objective = "reg:squarederror",
learning_rate = 0.1
)
# Cross-validation
cvm <- xgb.cv(
params = params,
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
showsd = FALSE,
print_every_n = 50
)
cvm # -> a lr of 0.1 provides about 200 trees, which is a convenient amount
# Final grid search after some iterations
grid <- expand.grid(
iteration = NA,
cv_score = NA,
train_score = NA,
objective = "reg:squarederror",
learning_rate = 0.1,
max_depth = 6:7,
reg_lambda = c(0, 2.5, 5, 7.5),
reg_alpha = c(0, 4),
colsample_bynode = c(0.8, 1),
subsample = c(0.8, 1),
min_child_weight = c(1, 10),
min_split_loss = c(0, 1e-04)
)
# Grid search or randomized search if grid is too large
max_size <- 32
grid_size <- nrow(grid)
if (grid_size > max_size) {
grid <- grid[sample(grid_size, max_size), ]
grid_size <- max_size
}
# Loop over grid and fit XGBoost with five-fold CV and early stopping
pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) {
cvm <- xgb.cv(
params = as.list(grid[i, -(1:3)]),
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
verbose = 0
)
# Store result
grid[i, 1] <- cvm$best_iteration
grid[i, 2:3] <- cvm$evaluation_log[, c(4, 2)][cvm$best_iteration]
setTxtProgressBar(pb, i)
# Save grid to survive hard crashes
saveRDS(grid, file = "simulation/diamonds_xgb.rds")
}
}
# Load grid and select best iteration
grid <- readRDS("simulation/diamonds_xgb.rds")
grid <- grid[order(grid$cv_score), ]
head(grid)
# Fit final, tuned model
fit <- xgb.train(
params = as.list(grid[1, -(1:3)]),
data = dtrain,
nrounds = grid[1, "iteration"]
)
Now, the model is ready to be inspected: we evaluate its performance on the test data, its variable importance, and its partial dependence plots.
# Performance on test data
pred <- predict(fit, data.matrix(X_test))
rmse(y_test, pred) # 0.0987
## [1] 0.09868352
r_squared(y_test, pred, reference_mean = mean(y_train)) # 0.9906
## [1] 0.9906101
# Variable importance regarding MSE improvement
imp <- xgb.importance(model = fit)
xgb.plot.importance(imp, col = "chartreuse4")
# Partial dependence plots
pred_fun <- function(m, X) predict(m, data.matrix(X))
for (v in xvars) {
p <- partial_dep(fit, v = v, X = X_train, pred_fun = pred_fun) |>
plot("chartreuse4") +
ggtitle(paste("PDP for", v))
print(p)
}
Comment: Compared to the random forest, the test performance is slightly better. Effects seem to be comparable.
We will now tune and fit a LightGBM model to predict (logarithmic) taxi trip durations, using the same covariates as in previous taxi models, and following the same basic tuning strategy as demonstrated for XGBoost. Instead of max_depth, the main regularization parameter of LightGBM is the number of leaf nodes (num_leaves).
Note: Since model tuning is slow, and the data is large, we could replace the 5-fold cross-validation by simple validation, e.g., using a 60%/20%/20% data split.
library(arrow)
library(ggplot2)
library(dplyr)
library(data.table)
library(withr)
library(lightgbm)
library(MetricsWeighted)
library(hstats)
df <- read_parquet("taxi/yellow_tripdata_2018-01.parquet")
setDT(df)
head(df)
#=======================================================================
# Prepare data for modeling
#=======================================================================
df[, duration := as.numeric(
difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")
)]
df <- df[between(trip_distance, 0.2, 100) & between(duration, 1, 120)]
df[, `:=`(
pu_hour = factor(data.table::hour(tpep_pickup_datetime)),
weekday = factor(data.table::wday(tpep_pickup_datetime)), # 1 = Sunday
pu_loc = forcats::fct_lump_min(factor(PULocationID), 1e5),
log_duration = log(duration),
log_distance = log(trip_distance)
)]
xvars <- c("log_distance", "weekday", "pu_hour", "pu_loc")
y <- "log_duration"
# Random split
with_seed(1,
ix <- sample(nrow(df), 0.8 * nrow(df))
)
y_train <- df[[y]][ix]
X_train <- df[ix, xvars, with = FALSE]
y_test <- df[[y]][-ix]
X_test <- df[-ix, xvars, with = FALSE]
# Data interface of LGB
dtrain <- lgb.Dataset(
data.matrix(X_train),
label = y_train,
params = list(feature_pre_filter = FALSE)
)
grid_file <- "taxi/taxi_lgb.rds"
#=======================================================================
# Tune model using strategy outlined above (use as LightGBM template)
#=======================================================================
tune <- FALSE
if (tune) {
# Select learning rate so that optimal number of rounds is
# somewhere between 100 and 1000
params <- list(
learning_rate = 1,
objective = "mse",
num_threads = 7 # choose number of threads in line with your system
)
cvm <- lgb.cv(
params = params,
data = dtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 20,
eval_freq = 50
)
# Random search
grid <- expand.grid(
iteration = NA,
score = NA,
objective = "mse",
learning_rate = 1,
num_leaves = c(15, 31, 63),
lambda_l2 = c(0, 2.5, 5, 7.5),
lambda_l1 = c(0, 4),
colsample_bynode = c(0.8, 1),
bagging_fraction = c(0.8, 1),
min_data_in_leaf = c(20, 50, 100),
min_sum_hessian_in_leaf = c(0, 0.001, 0.01),
stringsAsFactors = FALSE
)
max_size <- 10
grid_size <- nrow(grid)
if (grid_size > max_size) {
grid <- grid[sample(grid_size, max_size), ]
grid_size <- max_size
}
pb <- txtProgressBar(0, grid_size, style = 3)
for (i in seq_len(grid_size)) { # i <- 1
cvm <- lgb.cv(
params = as.list(grid[i, -(1:2)]),
data = dtrain,
nrounds = 5000,
nfold = 5,
early_stopping_rounds = 20,
verbose = -1
)
# Store result
grid[i, 1:2] <- as.list(cvm)[c("best_iter", "best_score")]
setTxtProgressBar(pb, i)
# Save grid to survive hard crashes
saveRDS(grid, file = grid_file)
}
}
#=======================================================================
# Fit model on best parameter combination
#=======================================================================
grid <- readRDS(grid_file) |>
arrange(score)
head(grid)
# Fit model on best parameter combination
if (FALSE) { # Set to true to refit
fit_lgb <- lgb.train(
params = as.list(grid[1, -(1:2)]),
data = dtrain,
nrounds = grid[1, "iteration"]
)
lgb.save(fit_lgb, "taxi/fit_lgb.txt")
} else{
fit_lgb <- lgb.load("taxi/fit_lgb.txt")
}
#=======================================================================
# Inspect
#=======================================================================
# Performance on test data
pred <- predict(fit_lgb, data.matrix(X_test))
rmse(y_test, pred) # 0.3014
## [1] 0.3013554
r_squared(y_test, pred, reference_mean = mean(y_train))
## [1] 0.8232967
# Variable importance regarding MSE improvement
imp <- lgb.importance(model = fit_lgb)
lgb.plot.importance(imp)
# PDPs
for (v in xvars) {
p <- partial_dep(fit_lgb, v = v, X = data.matrix(X_train)) |>
plot("chartreuse4") +
ggtitle(paste("PDP for", v))
print(p)
}
Comments

- The effect of (logarithmic) trip distance (log_distance) is realistic. Again, as with diamond prices, we see that tree-based models do not have the capability to extrapolate a linear relationship. Depending on the application, this can be a problem.

Exercises

Explain in simple words: What is bagging? What is a random forest? What is out-of-bag performance?
Try to improve the random forest fitted on the diamonds data by playing with the mtry parameter. What does it do? Pick the mtry with best OOB performance. Does this improve the test performance?
Fit a random forest on the claims data for the binary variable clm using covariates veh_value, veh_body, veh_age, gender, area, and agecat. Choose a suitable tree depth either by cross-validation or by minimizing OOB error on the training data. Make sure to fit a probability random forest, i.e., predicting probabilities, not classes. Evaluate the final model on an independent test data set. (Note that the “ranger” package uses the “Brier score” as the evaluation metric for probabilistic predictions. In the binary case, it is the same as the MSE.) Interpret the results by split gain importance and partial dependence plots.
Consider the fully tuned XGBoost model for the diamonds data in the lecture notes above. Study the online documentation of XGBoost to figure out how to make the model monotonically increasing in log_carat. Test your approach without repeating the grid search part. How does the partial dependence plot for log_carat look now?
In the gradient boosted trees algorithm for the squared error, why do we fit a regression tree to the residuals instead of simply adding the residuals themselves as the new model?
(Optional) Develop an XGBoost model for the claims data set with binary response clm, and covariates veh_value, veh_body, veh_age, gender, area, and agecat. Use a clean cross-validation/test approach. Use log loss as loss function and evaluation metric. Interpret its results. You don’t need to write all the code from scratch, but rather modify the XGBoost code from the lecture notes.
In this chapter, we learned about decision trees, random forests, and tree boosting. Individual decision trees are very easy to interpret, but do not perform too well. Tree ensembles such as random forests or gradient boosted trees, on the other hand, are usually very powerful but difficult to interpret. We have introduced interpretation tools to look inside such “black boxes”. The main reason why random forests and boosted trees often provide more accurate models than a linear model is their ability to automatically learn interactions and other nonlinear effects.