In the previous chapter, we have met performance measures like the RMSE or the deviance to measure how good our models are. Unfortunately, we cannot fully rely on these values due to overfitting: The more our models overfit, the less we can trust in their “insample” performance, i.e., the performance on the data used to calculate the models. Selecting models based on their insample performance is equally bad. Overfitting should not be rewarded!
In this chapter, we will meet ways to estimate the performance of a model in a fair way and use it to select the best model among alternatives. They are all based on data splitting techniques, where the models are evaluated on fresh data not used for model calculation. A fantastic reference for this chapter is [1]. Before introducing these techniques, we will meet a competitor of the linear model.
Remark on performance measures: While loss functions are used by the algorithm to fit the model, a performance measure or an evaluation metric is a function used to monitor performance and to select models. Ideally, it is consistent with the loss function of the specific algorithm (for instance RMSE as metric and squared error as loss), but sometimes one makes an exception. For classification, besides monitoring average (multi-)log loss, e.g., one sometimes considers the confusion matrix and its derived measures like accuracy, precision, recall, F1-score etc. Wikipedia summarizes these concepts. The confusion matrix tabulates the combinations of actual classes and predicted classes. Note that measures derived from the confusion matrix do not enjoy good statistical properties and focussing on them might lead to suboptimal models. They should thus not be used for making decisions but rather as additional, easy to understand information.
A very simple and intuitive alternative to the linear model is the k-nearest-neighbour approach, originally introduced by Evelyn Fix and J. L. Hodges in an unpublished technical report in 1951. It can be applied for both regression and classification and works without fitting anything. The prediction for an observation is obtained by
By “nearest” we usually mean Euclidean distance in the covariate space. If covariates are not on the same scale, it makes sense to standardize them first by subtracting the mean and dividing by the standard deviation. Otherwise, distances would be dominated by the covariate on the largest scale. Categorical features need to be one-hot- or integer-encoded first. Note that one-hot-encoded covariates may or may not be standardized.
For regression tasks, the responses of the k nearest neighbours are often combined by computing their arithmetic mean. For classification tasks, they are condensed by their most frequent value or to class probabilities.
What prediction would we get with 5-nearest-neighbour regression for the 10’000th row of the diamonds data set?
library(ggplot2)
library(FNN)
# Covariate matrix
xvars <- c("carat", "color", "cut", "clarity")
X <- scale(data.matrix(diamonds[, xvars]))
# The 10'000th observation
diamonds[10000, c("price", xvars)]
# Its prediction
knn.reg(X, test = X[10000, ], k = 5, y = diamonds$price)
## Prediction:
## [1] 4907.8
# Its five nearest neighbours
neighbours <- c(knnx.index(X, X[10000, , drop = FALSE], k = 5))
diamonds[neighbours, ]
Comments
depth
, table
, x
, y
,
z
). These rows most certainly represent the same diamond,
introducing additional overfit. We need to keep this problematic aspect
of the diamonds data in mind.Motivation for this chapter: Insample, a 1-nearest-neighbour regression predicts without error, a consequence of massive overfitting. This hypothetical example indicates that insample performance is often not worth a penny. Models need to be evaluated on fresh, independent data not used for model calculation. This leads us to simple validation.
With simple validation, the original data set is partitioned into training data used to calculate the models and a separate validation data set used to evaluate model performance and/or to select models. Typically, 10%-30% of rows are used for validation.
We can use the validation performance to compare algorithms (regression versus k-nearest-neighbour etc.) and also to choose their hyperparameters like the “k” of k-nearest-neighbour.
Furthermore, the performance difference between training and validation data gives an impression of the amount of overfitting (or rather of the optimism).
We now use a 80%/20% split on the diamonds data to calculate RMSE of 5-nearest-neighbour on both training and validation data.
library(ggplot2)
library(FNN)
library(withr)
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
# Split diamonds into 80% for "training" and 20% for validation
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
X_train <- diamonds[ix, xvars]
X_valid <- diamonds[-ix, xvars]
y_train <- diamonds[[y]][ix]
y_valid <- diamonds[[y]][-ix]
# Standardize training data
X_train <- scale(data.matrix(X_train))
# Apply training scale to validation data
X_valid <- scale(
data.matrix(X_valid),
center = attr(X_train, "scaled:center"),
scale = attr(X_train, "scaled:scale")
)
# Performance
RMSE <- function(y, pred) {
sqrt(mean((y - pred)^2))
}
pred_train <- knn.reg(X_train, test = X_train, k = 5, y = y_train)$pred
cat("Training RMSE:", RMSE(y_train, pred_train))
## Training RMSE: 512.8187
pred_valid <- knn.reg(X_train, test = X_valid, k = 5, y = y_train)$pred
cat("Validation RMSE:", RMSE(y_valid, pred_valid))
## Validation RMSE: 608.5641
Comment: Validation RMSE is substantially worse than training RMSE, a clear sign of overfitting. However, it is still much better than the (full-sample) performance of linear regression (see Exercise 1, Chapter 1).
Can we find a k with better validation RMSE?
library(tidyr)
# Tuning grid with different values for parameter k
paramGrid <- data.frame(train = NA, valid = NA, k = 1:20)
# Calculate performance for each row in the parameter grid
for (i in 1:nrow(paramGrid)) {
k <- paramGrid[i, "k"]
# Performance on training data
pred_train <- knn.reg(X_train, test = X_train, k = k, y = y_train)$pred
paramGrid[i, "train"] <- RMSE(y_train, pred_train)
# Performance on valid data
pred_valid <- knn.reg(X_train, test = X_valid, k = k, y = y_train)$pred
paramGrid[i, "valid"] <- RMSE(y_valid, pred_valid)
}
# Best validation RMSE
head(paramGrid[order(paramGrid$valid), ], 2)
# Plot results
pivot_longer(paramGrid, cols = -k, values_to = "RMSE", names_to = "Data") |>
ggplot(aes(x = k, y = RMSE, group = Data, color = Data)) +
geom_point() +
geom_line()
Comments
If our data set is large and training takes long, then the simple validation strategy introduced above is usually good enough. For smaller data or if training is fast, there is a better alternative that uses the data in a more economic way and takes more robust decisions. It is called k-fold cross-validation and works as follows:
The CV performance is a good basis to choose the best and final model among alternatives. The final model is retrained on all folds.
Notes
We now use five-fold CV on the diamonds data to find the optimal k, i.e., to tune our nearest-neighbour approach.
library(ggplot2)
library(FNN)
library(withr)
RMSE <- function(y, pred) {
sqrt(mean((y - pred)^2))
}
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
# Scaled feature matrix
X <- scale(data.matrix(diamonds[xvars]))
# Split diamonds into folds
nfolds <- 5
with_seed(
9838,
fold_ix <- sample(1:nfolds, nrow(diamonds), replace = TRUE)
)
table(fold_ix)
## fold_ix
## 1 2 3 4 5
## 10798 10702 10807 10744 10889
# Tuning grid with different values for parameter k
paramGrid <- data.frame(RMSE = NA, k = 1:20)
# Calculate performance for each row in the parameter grid
for (i in 1:nrow(paramGrid)) {
k <- paramGrid[i, "k"]
scores <- numeric(nfolds)
for (fold in 1:nfolds) {
X_train <- X[fold_ix != fold, ]
X_valid <- X[fold_ix == fold, ]
y_train <- diamonds[[y]][fold_ix != fold]
y_valid <- diamonds[[y]][fold_ix == fold]
pred <- knn.reg(X_train, test = X_valid, k = k, y = y_train)$pred
scores[fold] <- RMSE(y_valid, pred)
}
paramGrid[i, "RMSE"] <- mean(scores)
}
# Best CV-scores
head(paramGrid[order(paramGrid$RMSE), ], 2)
ggplot(paramGrid, aes(x = k, y = RMSE)) +
geom_point(color = "chartreuse4") +
geom_line(color = "chartreuse4") +
ggtitle("Performance by cross-validation")
Comment: Using 6 neighbours seems to be the best choice with a CV RMSE of 618 USD. Again, the fact that certain diamonds appear multiple times leaves a slightly bad feeling. Should we really trust these results?
In the above example, we have systematically compared the CV-performance of k-nearest-neighbour by iterating over a grid of possible values for k. Such strategy to tune models, i.e., to select hyperparameters of a model is called grid search CV. In the next chapter, we will meet situations where multiple parameters have to be optimized simultaneously. Then, the number of parameter combinations and the grid size explode. To save time, we could evaluate only a random subset of parameter combinations, an approach called randomized search CV.
Often, modeling involves many decisions. Even if guided by (cross-)validation, each decision tends to make the resulting final model look better than it is, an effect that can be called overfitting on the validation data. As a consequence, we often do not know how well the final model will perform in reality. As a solution, we can set aside a small test data set used to assess the performance of the final model. A size of 5%-20% is usually sufficient. It is important to look at the test data just once at the very end of the modeling process - after each decision has been made.
Note: Such additional test data set is only necessary if one uses the validation data set to make decisions. If the validation data set is just used to assess the true performance of a model, then we do not need this extra data set. Then, we can use the terms “validation data” and “test data” interchangeably.
Depending on whether one does simple validation or cross-validation, the typical workflow is as follows:
Workflow A
Workflow B
The only difference between the two workflows is whether simple validation or cross-validation is used for decision making.
Remark: For simplicity, Workflow A is sometimes done without refitting on the combination of training and validation data. In that case, the final model is fitted on the training data only.
We will now go through Workflow B for our diamond price model. We will (1) tune the “k” of our nearest-neighbour regression and (2) compete with a linear regression. The model with best CV performance will be evaluated on the test data.
library(ggplot2)
library(FNN)
library(withr)
RMSE <- function(y, pred) {
sqrt(mean((y - pred)^2))
}
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
dia <- diamonds[, c(y, xvars)]
# dia <- unique(dia) # -> Exercise 1
# Split diamonds into 80% for training and 20% for testing
with_seed(
9838,
ix <- sample(nrow(dia), 0.8 * nrow(dia))
)
train <- dia[ix, ]
test <- dia[-ix, ]
y_train <- train[[y]]
y_test <- test[[y]]
# Standardize training data
X_train <- scale(data.matrix(train[, xvars]))
# Apply training scale to test data
X_test <- scale(
data.matrix(test[, xvars]),
center = attr(X_train, "scaled:center"),
scale = attr(X_train, "scaled:scale")
)
# Split training data into folds
nfolds <- 5
with_seed(
9838,
fold_ix <- sample(1:nfolds, nrow(train), replace = TRUE)
)
# Cross-validation performance of k-nearest-neighbor for k = 1-20
paramGrid <- data.frame(RMSE = NA, k = 1:20)
for (i in 1:nrow(paramGrid)) {
k <- paramGrid[i, "k"]
scores <- numeric(nfolds)
for (fold in 1:nfolds) {
X_train_cv <- X_train[fold_ix != fold, ]
y_train_cv <- y_train[fold_ix != fold]
X_valid_cv <- X_train[fold_ix == fold, ]
y_valid_cv <- y_train[fold_ix == fold]
pred <- knn.reg(X_train_cv, test = X_valid_cv, k = k, y = y_train_cv)$pred
scores[fold] <- RMSE(y_valid_cv, pred)
}
paramGrid[i, "RMSE"] <- mean(scores)
}
# Best CV performance
head(paramGrid[order(paramGrid$RMSE), ], 2)
# Cross-validation performance of linear regression
rmse_reg <- numeric(nfolds)
for (fold in 1:nfolds) {
fit <- lm(reformulate(xvars, y), data = train[fold_ix != fold, ])
pred <- predict(fit, newdata = train[fold_ix == fold, ])
rmse_reg[fold] <- RMSE(y_train[fold_ix == fold], pred)
}
(rmse_reg <- mean(rmse_reg))
## [1] 1154.322
# The overall best model is 6-nearest-neighbor
pred <- knn.reg(X_train, test = X_test, k = 6, y = y_train)$pred
# Test performance for the best model
RMSE(y_test, pred)
## [1] 610.4224
Comments
The data is often split randomly into partitions or folds. As long as the rows are independent, this leads to honest estimates of model performance.
However, if the rows are not independent, e.g. for time series data or grouped data, such a strategy is flawed and usually leads to overly optimistic results. This is a common mistake in modeling.
When data represents a time series, splitting is best done in a way that does not destroy the temporal order. For simple validation, e.g., the first 80% of rows could be used for training and the remaining 20% for validation. The specific strategy depends on how the model will be applied.
Often, data is grouped or clustered by some (hopefully known) ID variable, e.g.,
Then, instead of distributing rows into partitions, we should distribute groups/IDs in order to not destroy the data structure and to get honest performance estimates. We speak of grouped splitting and group k-fold CV.
In our example with diamonds data, it would be useful to have a column with diamond “id” that could be used for grouped splitting. (How would you create a proxy for this?)
If rows are independent, there is a variant of random splitting that often provides better results and is therefore frequently used: stratified splitting. With stratified splitting or stratified k-fold CV, rows are split to ensure approximately equal distribution of a key variable (the response or deciding covariate) across partitions/folds.
Regarding the problem that some diamonds seem to appear multiple
times in the data: As an alternative to grouped splitting,
repeat the last example also on data deduplicated by price
and all covariates. Do the results change? Which results do you trust
more?
Use 5-fold cross-validation to select the best polynomial degree
to represent log(carat)
in a Gamma GLM for diamonds prices
with log-link (with additional covariates color
,
cut
, and clarity
). Evaluate the result on test
data. Use the average Gamma deviance as performance measure
(e.g. function deviance_gamma()
in the package
“MetricsWeighted”). Again make sure that the code is fully
reproducible.
Optional: Compare the linear regression for price
(using log(carat)
, color
, cut
,
and clarity
as covariates) with a corresponding Gamma GLM
with log-link by simple validation. Use once (R)MSE for comparison and
once Gamma deviance. What do you observe?
In this chapter, we have met strategies to estimate model performance in a fair way. These strategies are also used for model selection and tuning. They are an essential part of the full modeling process. ML models without appropriate validation strategy are not to be trusted.
[1] T. Hastie, R., Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.