1 Introduction

In the previous chapter, we have met performance measures like the RMSE or the deviance to measure how good our models are. Unfortunately, we cannot fully rely on these values due to overfitting: The more our models overfit, the less we can trust their “insample” performance, i.e., the performance on the data used to calculate the models. Selecting models based on their insample performance is equally bad. Overfitting should not be rewarded!

In this chapter, we will meet ways to estimate the performance of a model in a fair way and use it to select the best model among alternatives. They are all based on data splitting techniques, where the models are evaluated on fresh data not used for model calculation. A fantastic reference for this chapter is [1]. Before introducing these techniques, we will meet a competitor of the linear model.

Remark on performance measures: While loss functions are used by the algorithm to fit the model, a performance measure or evaluation metric is a function used to monitor performance and to select models. Ideally, it is consistent with the loss function of the specific algorithm (for instance, RMSE as metric and squared error as loss), but sometimes one makes an exception. For classification, besides monitoring the average (multi-class) log loss, one sometimes considers the confusion matrix and measures derived from it, such as accuracy, precision, recall, or the F1-score. Wikipedia summarizes these concepts. The confusion matrix tabulates the combinations of actual and predicted classes. Note that measures derived from the confusion matrix do not enjoy good statistical properties, and focusing on them might lead to suboptimal models. They should thus not be used for making decisions but rather as additional, easy-to-understand information.
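
As a small, made-up illustration (the class labels below are invented for this purpose only), a confusion matrix and some of its derived measures can be computed in base R as follows:

# Invented vectors of actual and predicted classes (for illustration only)
actual <- factor(c("no", "no", "no", "yes", "yes", "yes", "yes", "yes"))
predicted <- factor(
  c("no", "yes", "no", "yes", "yes", "no", "yes", "yes"),
  levels = levels(actual)
)

# Confusion matrix: combinations of predicted and actual classes
(cm <- table(Predicted = predicted, Actual = actual))

# Measures derived from the confusion matrix ("yes" = positive class)
accuracy <- sum(diag(cm)) / sum(cm)
precision <- cm["yes", "yes"] / sum(cm["yes", ])  # share of correct "yes" predictions
recall <- cm["yes", "yes"] / sum(cm[, "yes"])     # share of actual "yes" found
f1 <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)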

2 Nearest-Neighbour

A very simple and intuitive alternative to the linear model is the k-nearest-neighbour approach, originally introduced by Evelyn Fix and J. L. Hodges in an unpublished technical report in 1951. It can be applied for both regression and classification and works without fitting anything. The prediction for an observation is obtained by

  1. searching the closest k neighbours in the data set and then
  2. combining their responses.

By “nearest” we usually mean Euclidean distance in the covariate space. If covariates are not on the same scale, it makes sense to standardize them first by subtracting the mean and dividing by the standard deviation. Otherwise, distances would be dominated by the covariate on the largest scale. Categorical features need to be one-hot- or integer-encoded first. Note that one-hot-encoded covariates may or may not be standardized.

For regression tasks, the responses of the k nearest neighbours are often combined by computing their arithmetic mean. For classification tasks, they are condensed to the most frequent class (majority vote) or to class probabilities (relative class frequencies).
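
As a minimal sketch of these two steps, the following toy example (using the iris data and treating its first row as the observation to predict) computes a nearest-neighbour prediction by hand:

# Standardized covariate matrix of a toy example
X <- scale(data.matrix(iris[, c("Sepal.Length", "Sepal.Width")]))
y_reg <- iris$Petal.Length  # numeric response -> regression
y_clf <- iris$Species       # categorical response -> classification

x_new <- X[1, ]  # treat the first row as the observation to predict
k <- 5

# Step 1: Euclidean distances to all rows, then the indices of the k closest
d <- sqrt(colSums((t(X) - x_new)^2))
neighbours <- order(d)[1:k]  # includes the observation itself (distance 0)

# Step 2: combine the responses of the k nearest neighbours
mean(y_reg[neighbours])                     # regression: arithmetic mean
names(which.max(table(y_clf[neighbours])))  # classification: most frequent class
prop.table(table(y_clf[neighbours]))        # classification: class probabilities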

2.1 Example: nearest-neighbour

What prediction would we get with 5-nearest-neighbour regression for the 10’000th row of the diamonds data set?

library(ggplot2)
library(FNN)

# Covariate matrix
xvars <- c("carat", "color", "cut", "clarity")
X <- scale(data.matrix(diamonds[, xvars]))

# The 10'000th observation
diamonds[10000, c("price", xvars)]
# Its prediction
knn.reg(X, test = X[10000, , drop = FALSE], k = 5, y = diamonds$price)
## Prediction:
## [1] 4907.8
# Its five nearest neighbours
neighbours <- c(knnx.index(X, X[10000, , drop = FALSE], k = 5))
diamonds[neighbours, ]

Comments

  • The five nearest diamonds are extremely similar. One of them is the observation of interest itself, which introduces a relevant amount of overfitting.
  • The average price of these five observations gives us the nearest-neighbour prediction for the 10’000th diamond.
  • Would we get better results for a different choice of the number of neighbours k?
  • Three rows are identical except for the variables not used as covariates (depth, table, x, y, z). These rows most certainly represent the same physical diamond, introducing additional overfitting. We need to keep this problematic aspect of the diamonds data in mind.

Motivation for this chapter: Insample, a 1-nearest-neighbour regression predicts without error, a consequence of massive overfitting. This hypothetical example indicates that insample performance is often not worth a penny. Models need to be evaluated on fresh, independent data not used for model calculation. This leads us to simple validation.

3 Simple Validation

With simple validation, the original data set is partitioned into training data used to calculate the models and a separate validation data set used to evaluate model performance and/or to select models. Typically, 10%-30% of rows are used for validation.

We can use the validation performance to compare algorithms (regression versus k-nearest-neighbour etc.) and also to choose their hyperparameters like the “k” of k-nearest-neighbour.

Furthermore, the performance difference between training and validation data gives an impression of the amount of overfitting (or rather of the optimism).

3.1 Example: simple validation

We now use an 80%/20% split of the diamonds data to calculate the RMSE of 5-nearest-neighbour regression on both the training and the validation data.

library(ggplot2)
library(FNN)
library(withr)

y <- "price"
xvars <- c("carat", "color", "cut", "clarity")

# Split diamonds into 80% for "training" and 20% for validation
with_seed(
  9838,
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

X_train <- diamonds[ix, xvars]
X_valid <- diamonds[-ix, xvars]

y_train <- diamonds[[y]][ix]
y_valid <- diamonds[[y]][-ix]

# Standardize training data
X_train <- scale(data.matrix(X_train))

# Apply training scale to validation data
X_valid <- scale(
  data.matrix(X_valid),
  center = attr(X_train, "scaled:center"),
  scale = attr(X_train, "scaled:scale")
)

# Performance
RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

pred_train <- knn.reg(X_train, test = X_train, k = 5, y = y_train)$pred
cat("Training RMSE:", RMSE(y_train, pred_train))
## Training RMSE: 512.8187
pred_valid <- knn.reg(X_train, test = X_valid, k = 5, y = y_train)$pred
cat("Validation RMSE:", RMSE(y_valid, pred_valid))
## Validation RMSE: 608.5641

Comment: Validation RMSE is substantially worse than training RMSE, a clear sign of overfitting. However, it is still much better than the (full-sample) performance of linear regression (see Exercise 1, Chapter 1).

Can we find a k with better validation RMSE?

library(tidyr)

# Tuning grid with different values for parameter k
paramGrid <- data.frame(train = NA, valid = NA, k = 1:20)
    
# Calculate performance for each row in the parameter grid
for (i in 1:nrow(paramGrid)) {
  k <- paramGrid[i, "k"]
  
  # Performance on training data
  pred_train <- knn.reg(X_train, test = X_train, k = k, y = y_train)$pred
  paramGrid[i, "train"] <- RMSE(y_train, pred_train)
  
  # Performance on validation data
  pred_valid <- knn.reg(X_train, test = X_valid, k = k, y = y_train)$pred
  paramGrid[i, "valid"] <- RMSE(y_valid, pred_valid)
}

# Best validation RMSE
head(paramGrid[order(paramGrid$valid), ], 2)
# Plot results
pivot_longer(paramGrid, cols = -k, values_to = "RMSE", names_to = "Data") |>
  ggplot(aes(x = k, y = RMSE, group = Data, color = Data)) +
  geom_point() +
  geom_line()

Comments

  • The amount of overfitting decreases for growing k, which makes sense.
  • Selecting k based on the training data would lead to a suboptimal model.
  • Based on the validation data, we would choose k=5. It has a minimal RMSE of 608 USD.
  • Why is the RMSE on the training data not 0 for 1-nearest-neighbour?
  • Why is it problematic that some diamonds appear multiple times in the dataset?

4 Cross-Validation (CV)

If our data set is large and training takes long, then the simple validation strategy introduced above is usually good enough. For smaller data or if training is fast, there is a better alternative that uses the data more economically and leads to more robust decisions. It is called k-fold cross-validation and works as follows:

  1. Split the data into k pieces D = \{D_1, \dots, D_k\} called “folds”. Typical values for k are five or ten.
  2. Set aside one of the pieces (D_j) for validation.
  3. Fit the model on all other pieces, i.e., on D \setminus D_j.
  4. Calculate the model performance on the validation data D_j.
  5. Repeat Steps 2-4 until each piece has been used for validation exactly once.
  6. The average of the k model performances yields the CV performance of the model.

The CV performance is a good basis to choose the best and final model among alternatives. The final model is retrained on all folds.

Notes

  • The “best” model is typically the one with the best CV performance. Depending on the situation, it could also be a model with “good CV performance and not too much overfitting compared to the insample performance” or some other reasonable criterion.
  • If cross-validation is fast, you can repeat the process for additional data splits. Such repeated cross-validation leads to even more robust results; a minimal sketch is given below.
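
The following sketch illustrates repeated cross-validation for a fixed k = 5 on the diamonds data. It is only an illustration: the seeds and the number of repetitions (three) are arbitrary choices, and the setup otherwise mirrors the examples of this chapter.

library(ggplot2)
library(FNN)
library(withr)

RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

xvars <- c("carat", "color", "cut", "clarity")
X <- scale(data.matrix(diamonds[xvars]))
y <- diamonds$price

# Three repetitions of 5-fold CV for 5-nearest-neighbour, each with its own fold assignment
nfolds <- 5
seeds <- c(1, 2, 3)
cv_rmse <- numeric(length(seeds))

for (r in seq_along(seeds)) {
  with_seed(
    seeds[r],
    fold_ix <- sample(1:nfolds, nrow(X), replace = TRUE)
  )
  scores <- numeric(nfolds)
  for (fold in 1:nfolds) {
    pred <- knn.reg(
      X[fold_ix != fold, ],
      test = X[fold_ix == fold, ],
      k = 5,
      y = y[fold_ix != fold]
    )$pred
    scores[fold] <- RMSE(y[fold_ix == fold], pred)
  }
  cv_rmse[r] <- mean(scores)
}
cv_rmse        # one CV RMSE per repetition
mean(cv_rmse)  # repeated CV RMSE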

4.1 Example: cross-validation

We now use five-fold CV on the diamonds data to find the optimal k, i.e., to tune our nearest-neighbour approach.

library(ggplot2)
library(FNN)
library(withr)

RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

y <- "price"
xvars <- c("carat", "color", "cut", "clarity")

# Scaled feature matrix
X <- scale(data.matrix(diamonds[xvars]))

# Split diamonds into folds
nfolds <- 5
with_seed(
  9838,
  fold_ix <- sample(1:nfolds, nrow(diamonds), replace = TRUE)
)
table(fold_ix)
## fold_ix
##     1     2     3     4     5 
## 10798 10702 10807 10744 10889
# Tuning grid with different values for parameter k
paramGrid <- data.frame(RMSE = NA, k = 1:20)
    
# Calculate performance for each row in the parameter grid
for (i in 1:nrow(paramGrid)) {
  k <- paramGrid[i, "k"]
  
  scores <- numeric(nfolds)
  
  for (fold in 1:nfolds) {
    X_train <- X[fold_ix != fold, ]
    X_valid <- X[fold_ix == fold, ]

    y_train <- diamonds[[y]][fold_ix != fold]
    y_valid <- diamonds[[y]][fold_ix == fold]

    pred <- knn.reg(X_train, test = X_valid, k = k, y = y_train)$pred
    scores[fold] <- RMSE(y_valid, pred)
  }
  paramGrid[i, "RMSE"] <- mean(scores)
}

# Best CV-scores 
head(paramGrid[order(paramGrid$RMSE), ], 2)
ggplot(paramGrid, aes(x = k, y = RMSE)) +
  geom_point(color = "chartreuse4") +
  geom_line(color = "chartreuse4") +
  ggtitle("Performance by cross-validation")

Comment: Using 6 neighbours seems to be the best choice with a CV RMSE of 618 USD. Again, the fact that certain diamonds appear multiple times leaves a slightly bad feeling. Should we really trust these results?

6 Test Data and Final Workflow

Often, modeling involves many decisions. Even if guided by (cross-)validation, each decision tends to make the resulting final model look better than it is, an effect that can be called overfitting on the validation data. As a consequence, we often do not know how well the final model will perform in reality. As a solution, we can set aside a small test data set used to assess the performance of the final model. A size of 5%-20% of the rows is usually sufficient. It is important to look at the test data just once, at the very end of the modeling process, after each decision has been made.

Note: Such an additional test data set is only necessary if the validation data set is used to make decisions. If the validation data set is just used to assess the true performance of a model, then we do not need this extra data set, and the terms “validation data” and “test data” can be used interchangeably.

Depending on whether one does simple validation or cross-validation, the typical workflow is as follows:

Workflow A

  1. Split data into train/valid/test, e.g., by ratios 70%/20%/10% (a sketch of such a split follows after the remark below).
  2. Train different models on the training data and assess their performance on the validation data. Choose the best model, retrain it on the combination of training and validation data and call it “final model”.
  3. Assess the performance of the final model on the test data.

Workflow B

  1. Split data into train/test, e.g., by ratios 90%/10%.
  2. Evaluate and tune different models by k-fold cross-validation on the training data. Choose the best model, retrain it on the full training data.
  3. Assess performance of the final model on the test data.

The only difference between the two workflows is whether simple validation or cross-validation is used for decision making.

Remark: For simplicity, Workflow A is sometimes done without refitting on the combination of training and validation data. In that case, the final model is fitted on the training data only.
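
To make the first step of Workflow A concrete, here is a minimal sketch of a train/valid/test split of the diamonds data. The seed is arbitrary, and random assignment with probabilities 70%/20%/10% yields partitions of approximately these sizes.

library(ggplot2)
library(withr)

# Assign each row to one of three partitions with probabilities 70%/20%/10%
with_seed(
  1,
  part <- sample(
    c("train", "valid", "test"), nrow(diamonds),
    replace = TRUE, prob = c(0.7, 0.2, 0.1)
  )
)

train <- diamonds[part == "train", ]
valid <- diamonds[part == "valid", ]
test <- diamonds[part == "test", ]

prop.table(table(part))  # partition sizes are close to the target ratios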

6.1 Example: test data

We will now go through Workflow B for our diamond price model. We will (1) tune the “k” of our nearest-neighbour regression and (2) let it compete against a linear regression. The model with the best CV performance will be evaluated on the test data.

library(ggplot2)
library(FNN)
library(withr)

RMSE <- function(y, pred) {
  sqrt(mean((y - pred)^2))
}

y <- "price"  
xvars <- c("carat", "color", "cut", "clarity")

dia <- diamonds[, c(y, xvars)]
# dia <- unique(dia)  #  -> Exercise 1

# Split diamonds into 80% for training and 20% for testing
with_seed(
  9838,
  ix <- sample(nrow(dia), 0.8 * nrow(dia))
)

train <- dia[ix, ]
test <- dia[-ix, ]

y_train <- train[[y]]
y_test <- test[[y]]

# Standardize training data
X_train <- scale(data.matrix(train[, xvars]))

# Apply training scale to test data
X_test <- scale(
  data.matrix(test[, xvars]),
  center = attr(X_train, "scaled:center"),
  scale = attr(X_train, "scaled:scale")
)

# Split training data into folds
nfolds <- 5
with_seed(
  9838,
  fold_ix <- sample(1:nfolds, nrow(train), replace = TRUE)
)

# Cross-validation performance of k-nearest-neighbour for k = 1-20
paramGrid <- data.frame(RMSE = NA, k = 1:20)

for (i in 1:nrow(paramGrid)) {
  k <- paramGrid[i, "k"]
  scores <- numeric(nfolds)
  
  for (fold in 1:nfolds) {
    X_train_cv <- X_train[fold_ix != fold, ]
    y_train_cv <- y_train[fold_ix != fold]
    
    X_valid_cv <- X_train[fold_ix == fold, ]
    y_valid_cv <- y_train[fold_ix == fold]
    
    pred <- knn.reg(X_train_cv, test = X_valid_cv, k = k, y = y_train_cv)$pred
    scores[fold] <- RMSE(y_valid_cv, pred)
  }
  paramGrid[i, "RMSE"] <- mean(scores)
}

# Best CV performance
head(paramGrid[order(paramGrid$RMSE), ], 2)
# Cross-validation performance of linear regression
rmse_reg <- numeric(nfolds)

for (fold in 1:nfolds) {
  fit <- lm(reformulate(xvars, y), data = train[fold_ix != fold, ])
  pred <- predict(fit, newdata = train[fold_ix == fold, ])
  rmse_reg[fold] <- RMSE(y_train[fold_ix == fold], pred)
}
(rmse_reg <- mean(rmse_reg))
## [1] 1154.322
# The overall best model is 6-nearest-neighbour
pred <- knn.reg(X_train, test = X_test, k = 6, y = y_train)$pred

# Test performance for the best model
RMSE(y_test, pred)
## [1] 610.4224

Comments

  • 6-nearest-neighbour regression performs much better than linear regression.
  • Its performance on the independent test data is even better than CV suggests. Could this be a consequence of the fact that certain diamonds appear multiple times in the data, introducing potential “leakage” from training to test data?

7 Random Splitting?

The data is often split randomly into partitions or folds. As long as the rows are independent, this leads to honest estimates of model performance.

However, if the rows are not independent, e.g. for time series data or grouped data, such a strategy is flawed and usually leads to overly optimistic results. This is a common mistake in modeling.

7.1 Time-series data

When data represents a time series, splitting is best done in a way that does not destroy the temporal order. For simple validation, e.g., the first 80% of rows could be used for training and the remaining 20% for validation. The specific strategy depends on how the model will be applied.
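
As a minimal sketch of such a temporal split, the following code uses the time-ordered economics data shipped with ggplot2 (chosen for illustration; any time-sorted data frame would do) and takes the oldest 80% of rows for training and the most recent 20% for validation:

library(ggplot2)  # provides the monthly, time-ordered "economics" data

n <- nrow(economics)
ix_train <- seq_len(floor(0.8 * n))

train <- economics[ix_train, ]   # oldest 80% of the observations
valid <- economics[-ix_train, ]  # most recent 20%

range(train$date)
range(valid$date)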

7.2 Grouped data

Often, data is grouped or clustered by some (hopefully known) ID variable, e.g.,

  • multiple rows belong to the same patient/customer or
  • duplicated rows (accidental or not).

Then, instead of distributing rows into partitions, we should distribute whole groups/IDs in order not to destroy the data structure and to obtain honest performance estimates. We speak of grouped splitting and group k-fold CV.
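
A minimal sketch of grouped splitting could look as follows; it assumes a data frame df with a hypothetical ID column group_id and puts all rows of a group into the same partition:

library(withr)

# df: data frame with a (hypothetical) ID column "group_id"
grouped_split <- function(df, group = "group_id", p_train = 0.8, seed = 1) {
  ids <- unique(df[[group]])
  with_seed(
    seed,
    train_ids <- sample(ids, floor(p_train * length(ids)))
  )
  in_train <- df[[group]] %in% train_ids
  list(train = df[in_train, ], valid = df[!in_train, ])
}

# Usage (hypothetical data frame df): parts <- grouped_split(df)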

In our example with diamonds data, it would be useful to have a column with diamond “id” that could be used for grouped splitting. (How would you create a proxy for this?)

7.3 Stratification

If rows are independent, there is a variant of random splitting that often provides better results and is therefore frequently used: stratified splitting. With stratified splitting or stratified k-fold CV, rows are assigned to partitions/folds in such a way that the distribution of a key variable (the response or an important covariate) is approximately the same across partitions/folds.
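
As a minimal sketch (assuming the diamonds data and an 80%/20% split), we could stratify on price quartiles so that both partitions show a similar price distribution:

library(ggplot2)
library(withr)

# Strata: price quartiles
strata <- cut(
  diamonds$price,
  breaks = quantile(diamonds$price, probs = 0:4 / 4),
  include.lowest = TRUE
)

# Sample 80% of the row indices within each stratum
with_seed(
  1,
  ix <- unlist(
    lapply(split(seq_along(strata), strata), function(i) sample(i, floor(0.8 * length(i))))
  )
)

train <- diamonds[ix, ]
valid <- diamonds[-ix, ]

summary(train$price)  # similar price distribution in both partitions
summary(valid$price)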

8 Exercises

  1. Regarding the problem that some diamonds seem to appear multiple times in the data: As an alternative to grouped splitting, repeat the last example also on data deduplicated by price and all covariates. Do the results change? Which results do you trust more?

  2. Use 5-fold cross-validation to select the best polynomial degree to represent log(carat) in a Gamma GLM for diamonds prices with log-link (with additional covariates color, cut, and clarity). Evaluate the result on test data. Use the average Gamma deviance as performance measure (e.g. function deviance_gamma() in the package “MetricsWeighted”). Again make sure that the code is fully reproducible.

  3. Optional: Compare the linear regression for price (using log(carat), color, cut, and clarity as covariates) with a corresponding Gamma GLM with log-link by simple validation. Use once (R)MSE for comparison and once Gamma deviance. What do you observe?

9 Chapter Summary

In this chapter, we have met strategies to estimate model performance in a fair way. These strategies are also used for model selection and tuning, and they are an essential part of the full modeling process. ML models without an appropriate validation strategy are not to be trusted.

10 Chapter References

[1] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.