This chapter on linear models is the starting point for the part “Statistical ML in Action”. We will first outline this part. Then, we will revisit linear regression, followed by one of its most important generalizations: the generalized linear model (GLM). In the last section, we will learn about technologies for modeling large data.
The remaining chapters are dedicated to Machine Learning (ML). ML can be viewed as a collection of statistical algorithms used to learn from data and to make predictions. Our focus is on supervised ML. Depending on whether we are predicting numbers or classes, we speak of regression or classification.
Most examples will be based on the diamonds and the dataCar datasets from the first chapter. Additionally, we will work with a large data set containing information about many millions of taxi trips in New York City in January 2018. Each row represents a taxi trip. The columns represent information like distance or start and end time. In the last section, “Modeling Large Data”, you will find the download link for this data set.
The material of this part is based on our online lecture.
Our general setup is as follows: a distributional property T of a response Y should be approximated by a model f: \boldsymbol x \in \mathbb R^p \mapsto \mathbb R of a p-dimensional feature vector \boldsymbol X = (X^{(1)}, \dots, X^{(p)}) with value \boldsymbol x = (x^{(1)}, \dots, x^{(p)}) \in \mathbb R^p, i.e., T(Y\mid \boldsymbol X = \boldsymbol x) \approx f(\boldsymbol x). For brevity, we write T(Y\mid \boldsymbol X = \boldsymbol x) = T(Y\mid \boldsymbol x). Examples of T are the expectation \mathbb E or a quantile q_\alpha. The model f is then estimated by \hat f from the training data by minimizing some objective function, typically of the form Q(f) = \sum_{i = 1}^n L(y_i, f(\boldsymbol x_i)) + \lambda \Omega(f), where L is a loss function measuring how far the predictions f(\boldsymbol x_i) are from the observed responses y_i, and \lambda \Omega(f) is an optional penalty term controlling the complexity of f.
Once found, \hat f serves as our prediction function that can be applied to new data. In addition, we can examine the structure of \hat f to gain insight into the relationship between response and covariates.
Remarks
In order to get used to the terms mentioned above, we will look at the mother of all supervised learning algorithms: (multiple) linear regression. It was first published by Adrien-Marie Legendre in 1805 and is still very frequently used thanks to its simplicity, interpretability, and flexibility. It further serves as a simple benchmark for more complex algorithms and is the starting point for extensions like the generalized linear model.
Linear regression postulates the model equation \mathbb E(Y \mid \boldsymbol x) = f(\boldsymbol x) = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}, where (\beta_0, \beta_1, \dots, \beta_p) \in \mathbb R^{p+1} is the parameter vector to be estimated from the data.
The model equation of the linear regression relates the covariates to the expected response \mathbb E(Y\mid \boldsymbol x) by a formula that is linear in the parameters \beta_0, \dots, \beta_p. The additive constant \beta_0 is called the intercept. The parameter \beta_j tells us by how much Y is expected to change when the value of feature X^{(j)} is increased by 1, keeping all other covariates fixed (“Ceteris Paribus”). The parameter \beta_j is called the effect of X^{(j)} on the expected response.
A linear regression with just one covariate X is called a simple linear regression with equation \mathbb E(Y \mid x) = \alpha + \beta x.
The estimate \hat f of f is found by minimizing, as objective function, the sum of squared prediction errors (residuals) \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2. Remember: y_i is the observed response of the ith data row and \hat y_i its prediction (or fitted value).
Once the model is fitted, we can use the coefficients \hat\beta_0, \dots, \hat\beta_p to make predictions and to study empirical effects of the covariates on the expected response.
To discuss the typical output of a linear regression, we will now model diamond prices by size. The model equation is \mathbb E(\text{price} \mid \text{carat}) = \alpha + \beta \cdot \text{carat}.
library(ggplot2)
fit <- lm(price ~ carat, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
intercept <- coef(fit)[[1]]
slope <- coef(fit)[[2]]
# Visualize the regression line
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.2, shape = ".") +
coord_cartesian(xlim = c(0, 3), ylim = c(-3000, 20000)) +
geom_abline(slope = slope, intercept = intercept, color = "chartreuse4", size = 1)
# Predictions for diamonds with 1.3 carat?
predict(fit, data.frame(carat = 1.3))
## 1
## 7826.993
# By hand
intercept + slope * 1.3
## [1] 7826.993
Comments
How good is a specific linear regression model? We may consider two aspects, namely how accurate its predictions are and whether its model assumptions are satisfied.
How accurate are the model predictions? That is, how well do the predictions match the observed response? In accordance with the least-squares approach, this is best quantified by the sum of squared prediction errors \sum_{i = 1}^n (y_i - \hat y_i)^2 or, equivalently, by the mean-squared error \text{MSE} = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat y_i)^2. To quantify the size of the typical prediction error on the same scale as Y, we can take the square-root of the MSE and study the root-mean-squared error (RMSE). Minimizing MSE also minimizes RMSE.
Besides an absolute performance measure like the RMSE, we gain additional insights by studying a relative performance measure like the R-squared. It measures the relative decrease in MSE compared to the MSE of the “empty” or “null” model consisting only of an intercept. Put differently, the R-squared measures the proportion of variability of Y explained by the covariates.
Let us calculate these performance measures for the simple linear regression above.
mse <- function(y, pred) {
mean((y - pred)^2)
}
(MSE <- mse(diamonds$price, predict(fit, diamonds)))
## [1] 2397955
(RMSE <- sqrt(MSE))
## [1] 1548.533
empty_model <- lm(price ~ 1, data = diamonds) # predictions equal mean(diamonds$price)
MSE_empty <- mse(diamonds$price, predict(empty_model, diamonds))
# R-squared
(MSE_empty - MSE) / MSE_empty
## [1] 0.8493305
Comments
The main assumption of linear regression is a correctly specified model equation \mathbb E(Y \mid \boldsymbol x) = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}. This means that predictions are not systematically too high or too low for certain values of the covariates.
How is this assumption checked in practice? In a simple regression setting, the points in the scatterplot should be located around the regression line for all covariate values. For a multiple linear regression, this translates to the empirical condition that residuals (differences between observed and fitted response) do not show bias if plotted against covariate values.
Additional assumptions like independence of rows, constant variance of the error term \varepsilon in the equation Y = f(\boldsymbol x) + \varepsilon, and normal distribution of \varepsilon guarantee optimality of the least-squares estimators \hat\beta_0, \dots, \hat\beta_p and the correctness of inferential statistics (standard errors, p-values, confidence intervals). In that case, we talk of the normal linear model. Its conditions are checked by studying diagnostic plots. We skip this part for brevity and because we are not digging into inferential statistics.
Looking at the scatter plot augmented with the regression line, we can see systematically too low (even negative!) predictions for very small diamonds. This indicates a misspecified model. Later we will see how to fix this.
In the following, we will list some problems that often occur in linear regression. We will only mention them without going into detail.
Like many other ML algorithms, linear regression cannot handle missing values. Rows with missing responses can be safely omitted, while missing values in covariates should usually be handled. The simplest (often too naive) approach is to fill in missing values with a typical value such as the mean or the most frequent value.
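As a hedged sketch (the helper name is ours, not from the original text), such a naive imputation could look as follows in R:
# Replace missing numeric values by the mean and missing factor/character
# values by the most frequent value (naive, but often used as a baseline)
impute_simple <- function(x) {
  if (is.numeric(x)) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)
  } else {
    x[is.na(x)] <- names(which.max(table(x)))
  }
  x
}
# Applied to all columns of a data frame `df` (illustration only):
# df[] <- lapply(df, impute_simple)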
Gross outliers in the covariates can distort the result of linear regression. Do not delete them, but try to reduce their effect by using logarithms or more robust regression techniques. Outliers in the response can also be problematic, especially for inferential statistics.
If too many parameters are used relative to the number of observations, the resulting model may look good but would not generalize well to new data. This is referred to as overfitting. A small amount of overfitting is not problematic. However, do not fit a model with p=100 parameters to a data set with only n=200 rows. The resulting model would be garbage. An n/p ratio greater than 50 is usually safe for stable parameter estimation.
When the association between two or more covariates is strong, their coefficients are difficult to interpret because the Ceteris Paribus clause is usually unnatural in such situations. For example, in a house price model, it is unnatural to examine the effect of an additional room while the living area remains unchanged. This is even more problematic for causally dependent covariates: Consider a model with covariates X and X^2. It would certainly not make sense to examine the effect of X while X^2 remains fixed.
Strong collinearity can be detected by looking at correlations across (numeric) covariates. It is mainly a problem when interpreting effects or for statistical inference of effects. Predictions or other “global” model characteristics like the R-squared are not affected.
Often, collinearity can be reduced by transforming the covariates so that the Ceteris Paribus clause becomes natural. For example, instead of using the number of rooms and the living area in a house price model, it might be helpful to represent the living area by the derived variable “living area per room”.
Note: Perfectly collinear covariates (for example X and 2X) cannot be used for algorithmic reasons.
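As a quick sketch (with ggplot2 and its diamonds data loaded), pairwise correlations of the numeric covariates can be inspected like this:
# Correlation matrix of the numeric diamonds covariates
round(cor(diamonds[c("carat", "depth", "table", "x", "y", "z")]), 2)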
Since algorithms usually only understand numbers, categorical variables have to be encoded by numbers. The standard approach is called one-hot-encoding (OHE) and works as follows: Each level x_k of the categorical variable X gets its own binary dummy variable D_k = \boldsymbol 1(X = x_k), indicating if X has this particular value or not. In linear models, one of the dummy variables (D_1, say) needs to be dropped due to perfect collinearity (for each row, the sum of OHE variables is always 1). Its level is then represented by the intercept. This variant of OHE is called dummy coding.
For our diamonds data set, OHE for the variable color looks as follows (the first column is the original categorical variable, the other columns are the dummy variables):
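The original table is not reproduced here. As a sketch, R's model.matrix() produces these columns; like in the modeling code below, we first turn the ordered factor into an unordered one so that treatment (dummy) coding is used:
# Dummy coding as used by lm(): the reference level "D" is absorbed by the intercept
dia <- transform(diamonds, color = factor(color, ordered = FALSE))
head(model.matrix(~ color, data = dia))
# Full one-hot encoding: one indicator per level, no intercept
head(model.matrix(~ color - 1, data = dia))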
Comments on categorical covariates
Let us now extend the simple linear regression for diamond prices by adding dummy variables for the categorical covariate color.
library(ggplot2)
# Turn ordered into unordered factor
diamonds <- transform(diamonds, color = factor(color, ordered = FALSE))
fit <- lm(price ~ carat + color, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = price ~ carat + color, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18345.1 -765.8 -72.8 558.5 12288.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2136.23 20.12 -106.162 < 2e-16 ***
## carat 8066.62 14.04 574.558 < 2e-16 ***
## colorE -93.78 23.25 -4.033 5.51e-05 ***
## colorF -80.26 23.40 -3.429 0.000605 ***
## colorG -85.54 22.67 -3.773 0.000161 ***
## colorH -732.24 24.35 -30.067 < 2e-16 ***
## colorI -1055.73 27.31 -38.657 < 2e-16 ***
## colorJ -1914.47 33.78 -56.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1472 on 53932 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.8639
## F-statistic: 4.893e+04 on 7 and 53932 DF, p-value: < 2.2e-16
Comments
Adding color has changed the slope of carat from 7756 in the simple linear regression to 8067. The effect of carat adjusted for color is thus 8067.
Each level of color except the reference level “D” has received its own coefficient. Switching from the reference color “D” to “E” is associated with an average price reduction of 94 USD. The effect of color “F” compared to color “D” is about -80 USD.
Linear regression is flexible regarding how variables are represented in the linear equation. Possibilities include non-linear terms, interactions, and transformations of the covariates or of the response. These elements are very important for making realistic models.
Important numeric covariates can be represented by more than one parameter (instead of a single linear slope) to model more flexible, non-linear associations with the response. For example, the addition of a quadratic term allows curvature. The addition of a sufficient number of polynomial terms can approximate any smooth relationship.
For example, the model equation of a cubic regression is \mathbb E(Y \mid x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3. An alternative to polynomial terms is regression splines, i.e., piecewise polynomials. Using non-linear terms complicates the interpretation of the regression coefficients. One option is to study predictions while sliding the covariate value over its range (systematic predictions).
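As a hedged sketch of the spline alternative just mentioned (using the base R package “splines”; not part of the original example):
library(splines)
# Natural (restricted) cubic spline for carat with 4 degrees of freedom
fit_spline <- lm(price ~ ns(carat, df = 4), data = diamonds)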
How would a cubic regression approximate the relationship between diamond prices and carat?
The polynomial terms used for modeling look as follows:
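The original display is not reproduced here; as a small sketch, the basis columns can be inspected directly. By default, poly() returns orthogonal polynomials; raw = TRUE would give plain carat, carat^2, carat^3:
head(poly(diamonds$carat, 3))             # orthogonal polynomial basis
head(poly(diamonds$carat, 3, raw = TRUE)) # raw terms carat, carat^2, carat^3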
library(tidyverse)
fit <- lm(price ~ poly(carat, 3), data = diamonds)
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 4.5, by = 0.1)) %>%
mutate(price = predict(fit, .)) %>%
ggplot(aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point()
Comments
Once fitted, the effect of a covariate does not depend on the values of the other covariates. This is a direct consequence of the additivity of the model equation. The additivity assumption is sometimes too strict. For example, a treatment effect might be larger for younger patients than for older ones. Or an extra 0.1 carat of diamond weight is worth more for a beautiful white diamond than for an unspectacular yellow one. In such cases, adding interaction terms provides the necessary flexibility. Mathematically, an interaction term between covariates X and Z equals their product, where categorical covariates are first replaced by their dummy variables. Practically, it means that the effect of X depends on the value of Z.
Comments
Let us now fit a linear regression for diamond prices with covariates carat and color, once without and once with interaction. We interpret the resulting models by looking at systematic predictions (sliding both carat and color over their range).
library(tidyverse)
# Turn all ordered factors into unordered
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
no_interaction <- lm(price ~ carat + color, data = diamonds)
with_interaction <- lm(price ~ carat * color, data = diamonds)
# Plot effect of carat grouped by color
to_plot <- expand.grid(
carat = seq(0.3, 2.5, by = 0.1),
color = levels(diamonds$color)
) %>%
mutate(
no_interaction = predict(no_interaction, .),
with_interaction = predict(with_interaction, .)
) %>%
pivot_longer(
ends_with("interaction"),
names_to = "model",
values_to = "prediction"
)
ggplot(to_plot, aes(x = carat, y = prediction, group = color, color = color)) +
geom_line() +
geom_point() +
facet_wrap(~ model)
Comments
In the model without interaction, the effect of carat does not depend on the color. Similarly, the effect of color does not depend on the size. This is not very realistic, as the color effects are likely to be greater for large diamonds.
Covariates are often transformed before entering the model, the logarithm being the most prominent example.
Not surprisingly, the coefficients explain how the transformed variables act on the expected response. For a log-transformed covariate X, we can even interpret the coefficient with respect to the untransformed X. In the model equation \mathbb E(Y\mid x) = \alpha + \beta \log(x), we can say: a 1% increase in feature X leads to an increase in \mathbb E(Y\mid x) of about \beta/100. Indeed, we have \mathbb E(Y\mid 1.01 \cdot x) - \mathbb E(Y\mid x) = \alpha + \beta \log (1.01 \cdot x) - \alpha - \beta \log(x) \\ = \beta \log\left(\frac{1.01 \cdot x}{x}\right)= \beta \log(1.01) \approx \beta/100. Thus, taking logarithms of covariates not only deals with outliers, it also offers us the possibility to talk about percentages.
What would a linear regression with logarithmic carat as single covariate give?
library(tidyverse)
fit <- lm(price ~ log(carat), data = diamonds)
fit
##
## Call:
## lm(formula = price ~ log(carat), data = diamonds)
##
## Coefficients:
## (Intercept) log(carat)
## 6238 5836
to_plot <- data.frame(carat = seq(0.3, 4.5, by = 0.1)) %>%
mutate(price = predict(fit, .))
# log-scale
ggplot(to_plot, aes(x = log(carat), y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
ggtitle("log-scale")
# original scale
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
ggtitle("Original scale")
Comments
An increase of log(carat) by 1 leads to an expected price increase of 5836 USD.
A 1% increase in carat is associated with an average price increase of about 5836/100 ≈ 58 USD.
We have seen that taking logarithms not only reduces outlier effects in covariates but also allows us to think in percentages. What happens if we log-transform the response variable? The model of a simple linear regression would be
\mathbb E(\log(Y) \mid x) = \alpha + \beta x.
Claim: The effect \beta tells us by what percentage we can expect Y to change when increasing the value of feature X by 1. Thus, a logarithmic response leads to a multiplicative instead of an additive model.
Proof
Assume for a moment that we can swap taking expectations and logarithms (disclaimer: we cannot). In that case, the model would be
\log(\mathbb E(Y\mid x)) = \alpha + \beta x or, after exponentiation, \mathbb E(Y\mid x) = e^{\alpha + \beta x}. The additive effect of increasing x by 1 would be \mathbb E(Y\mid x+1) - \mathbb E(Y\mid x) = e^{\alpha + \beta (x+1)} - e^{\alpha + \beta x} \\ = e^{\alpha + \beta x}e^\beta - e^{\alpha + \beta x} = e^{\alpha + \beta x}(e^\beta - 1) = \mathbb E(Y\mid x)(e^\beta - 1). Dividing both sides by \mathbb E(Y\mid x) gives \underbrace{\frac{\mathbb E(Y\mid x+1) - \mathbb E(Y\mid x)}{\mathbb E(Y\mid x)}}_{\text{Relative change in } \mathbb E(Y \mid x)} = e^\beta-1 \approx \beta = \beta \cdot 100\%. Indeed: A one point increase in feature X is associated with a relative increase in \mathbb E(Y\mid x) of about \beta \cdot 100\%.
Since expectations and logarithms cannot be swapped, the calculation is not 100% correct. One consequence of this imperfection is that predictions back-transformed to the scale of Y are biased. One of the motivations of the generalized linear model (GLM; see next section) will be to mend this problem in an elegant way.
What would our simple linear regression look like with log(price) as response?
library(tidyverse)
fit <- lm(log(price) ~ carat, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2844 -0.2449 0.0335 0.2578 1.5642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.215021 0.003348 1856 <2e-16 ***
## carat 1.969757 0.003608 546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3972 on 53938 degrees of freedom
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
## F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
to_plot <- data.frame(carat = seq(0.3, 2.5, by = 0.1)) %>%
mutate(price = exp(predict(fit, .)))
# log-scale
ggplot(to_plot, aes(x = carat, y = log(price))) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0, 3)) +
ggtitle("log-scale")
# original scale
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0, 3)) +
ggtitle("Original scale")
Comments
Using logarithms for either price or carat did not provide a satisfactory model yet. What about applying logarithms to both response and covariate at the same time?
library(tidyverse)
fit <- lm(log(price) ~ log(carat), data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50833 -0.16951 -0.00591 0.16637 1.33793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661 0.001365 6190.9 <2e-16 ***
## log(carat) 1.675817 0.001934 866.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF, p-value: < 2.2e-16
to_plot <- data.frame(carat = seq(0.3, 2.5, by = 0.1)) %>%
mutate(price = exp(predict(fit, .)))
# log-log-scale
ggplot(to_plot, aes(x = log(carat), y = log(price))) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = log(c(0.3, 3))) +
ggtitle("Log-log scale")
# Back-transformed
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0.3, 3)) +
ggtitle("Original scale")
# Relative bias on original scale
mean(exp(fitted(fit))) / mean(diamonds$price) - 1
## [1] -0.03039724
Comments
To end the section on linear regression, we extend the log-log example above by adding color, cut, and clarity as categorical covariates.
library(tidyverse)
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
fit <- lm(log(price) ~ log(carat) + color + cut + clarity, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ log(carat) + color + cut + clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01107 -0.08636 -0.00023 0.08341 1.94778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.856856 0.005758 1364.43 <2e-16 ***
## log(carat) 1.883718 0.001129 1668.75 <2e-16 ***
## colorE -0.054277 0.002118 -25.62 <2e-16 ***
## colorF -0.094596 0.002142 -44.16 <2e-16 ***
## colorG -0.160378 0.002097 -76.49 <2e-16 ***
## colorH -0.251071 0.002225 -112.85 <2e-16 ***
## colorI -0.372574 0.002492 -149.50 <2e-16 ***
## colorJ -0.510983 0.003074 -166.24 <2e-16 ***
## cutGood 0.080048 0.003890 20.57 <2e-16 ***
## cutVery Good 0.117215 0.003619 32.39 <2e-16 ***
## cutPremium 0.139345 0.003579 38.94 <2e-16 ***
## cutIdeal 0.161218 0.003548 45.44 <2e-16 ***
## claritySI2 0.427879 0.005178 82.64 <2e-16 ***
## claritySI1 0.592954 0.005149 115.17 <2e-16 ***
## clarityVS2 0.742164 0.005178 143.34 <2e-16 ***
## clarityVS1 0.812277 0.005257 154.52 <2e-16 ***
## clarityVVS2 0.947271 0.005418 174.83 <2e-16 ***
## clarityVVS1 1.018743 0.005575 182.73 <2e-16 ***
## clarityIF 1.113732 0.006030 184.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1338 on 53921 degrees of freedom
## Multiple R-squared: 0.9826, Adjusted R-squared: 0.9826
## F-statistic: 1.693e+05 on 18 and 53921 DF, p-value: < 2.2e-16
Comments
The linear regression model has many extensions:
This section covers the generalized linear model (GLM). It was introduced in Nelder and Wedderburn (1972).
The model equation of a generalized linear model with monotone link function g and inverse link g^{-1} is assumed to satisfy \mathbb E(Y \mid \boldsymbol x) = f(\boldsymbol x) = g^{-1}(\eta(\boldsymbol x)) = g^{-1}(\beta_0 + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}), or equivalently g(\mathbb E(Y \mid \boldsymbol x)) = \eta(\boldsymbol x) = \beta_0 + \beta_1 x^{(1)} + \dots + \beta_p x^{(p)}, where Y conditional on the covariates belongs to the so-called exponential dispersion family. The linear part \eta of the model is called the linear predictor.
Thus, a GLM has three components: a distributional assumption for Y from the exponential dispersion family, the linear predictor \eta, and the link function g connecting the two.
The following table lists some of the most commonly used GLMs.
Regression | Distribution | Range of Y | Natural link | Unit deviance |
---|---|---|---|---|
Linear | Normal | (-\infty, \infty) | Identity | (y - \hat y)^2 |
Logistic | Binary | \{0, 1\} | logit | -2(y\log(\hat y) + (1-y) \log(1-\hat y)) |
Poisson | Poisson | [0, \infty) | log | 2(y \log(y / \hat y) - (y - \hat y)) |
Gamma | Gamma | (0, \infty) | 1/x (typical: log) | 2((y - \hat y) / \hat y - \log(y / \hat y)) |
Multinomial | Multinomial | \{C_1, \dots, C_m\} | mlogit | -2\sum_{j = 1}^m 1(y = C_j)\log(\hat y_j) |
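To make the table concrete, here is a small sketch (our own helper functions, not from the original text) implementing two of the unit deviances in R:
# Unit deviances from the table above, vectorized over observations
deviance_normal <- function(y, pred) {
  (y - pred)^2
}
deviance_poisson <- function(y, pred) {
  xlogy <- ifelse(y == 0, 0, y * log(y / pred))  # convention: 0 * log(0) = 0
  2 * (xlogy - (y - pred))
}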
Some remarks
The normal linear model allows us to model \mathbb E(Y\mid \boldsymbol x) by an additive linear function. In principle, this would also work for count responses, binary responses, or right-skewed positive responses such as prices or claim amounts.
However, in such cases, an additive linear model equation is usually not very realistic, and the main assumption of the linear regression model is violated.
GLMs deal with such problems by using a suitable link function like the logarithm. At least for the first two examples, this could not be achieved by a linear regression with log response because \log(0) is not defined.
Further advantages of the GLM over the linear regression are:
The interpretation of model coefficients in GLMs is guided by the link function. (The expectations are always conditional).
Parameters of a GLM are estimated by Maximum-Likelihood. This amounts to minimizing the (total) deviance, which equals the sum Q(f, D) = \sum_{(y_i, \boldsymbol x_i) \in D} L(y_i, f(\boldsymbol x_i)) of the unit deviances over the model data D (possibly weighted by case weights). For the normal linear model, the total deviance is equal to n times the MSE. In fact, the total deviance plays the same role for GLMs as the MSE does for linear regression. Consequently, it is sometimes useful to consider as a relative performance measure the relative deviance improvement compared to an intercept-only model. For the normal linear regression model, this Pseudo-R-squared corresponds to the usual R-squared.
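As a minimal sketch (our own helper, not from the original text): for a model fitted with glm(), the deviance-based Pseudo-R-squared can be read off the returned object, which stores both the residual and the null deviance.
# Deviance-based Pseudo-R-squared: relative deviance improvement over the
# intercept-only model (assumes `fit` is an object returned by glm())
pseudo_r2 <- function(fit) {
  1 - fit$deviance / fit$null.deviance
}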
Outlook: The loss functions used in the context of GLMs are used one-to-one as loss functions for other ML methods such as gradient boosting or neural networks. There, the “appropriate” loss function is chosen from the context. For example, if the response is binary, one usually chooses the binary cross-entropy as the loss function.
We will now model the number of claims (numclaims) in the dataCar data from the insuranceData package by a Poisson GLM with its natural link function, the log. This ensures that we can interpret the effects of the covariates on a relative scale and that the predictions are positive. For simplicity, we do not take the exposure into account.
library(ggplot2)
library(insuranceData)
data(dataCar)
summary(dataCar)
## veh_value exposure clm numclaims
## Min. : 0.000 Min. :0.002738 Min. :0.00000 Min. :0.00000
## 1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 1.500 Median :0.446270 Median :0.00000 Median :0.00000
## Mean : 1.777 Mean :0.468651 Mean :0.06814 Mean :0.07276
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :34.560 Max. :0.999316 Max. :1.00000 Max. :4.00000
##
## claimcst0 veh_body veh_age gender area
## Min. : 0.0 SEDAN :22233 Min. :1.000 F:38603 A:16312
## 1st Qu.: 0.0 HBACK :18915 1st Qu.:2.000 M:29253 B:13341
## Median : 0.0 STNWG :16261 Median :3.000 C:20540
## Mean : 137.3 UTE : 4586 Mean :2.674 D: 8173
## 3rd Qu.: 0.0 TRUCK : 1750 3rd Qu.:4.000 E: 5912
## Max. :55922.1 HDTOP : 1579 Max. :4.000 F: 3578
## (Other): 2532
## agecat X_OBSTAT_
## Min. :1.000 01101 0 0 0:67856
## 1st Qu.:2.000
## Median :3.000
## Mean :3.485
## 3rd Qu.:5.000
## Max. :6.000
##
# Distribution of the claim count
ggplot(dataCar, aes(x = numclaims)) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'numclaims'")
fit <- glm(
numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
data = dataCar,
family = poisson(link = "log")
)
summary(fit)
##
## Call:
## glm(formula = numclaims ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = poisson(link = "log"), data = dataCar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.373692 0.328875 -4.177 2.95e-05 ***
## veh_value 0.046388 0.016394 2.830 0.004660 **
## veh_bodyCONVT -2.049451 0.669159 -3.063 0.002193 **
## veh_bodyCOUPE -0.780776 0.337177 -2.316 0.020579 *
## veh_bodyHBACK -1.048694 0.318527 -3.292 0.000994 ***
## veh_bodyHDTOP -0.891602 0.327733 -2.721 0.006518 **
## veh_bodyMCARA -0.514153 0.409172 -1.257 0.208909
## veh_bodyMIBUS -1.179985 0.350059 -3.371 0.000749 ***
## veh_bodyPANVN -0.809562 0.339102 -2.387 0.016969 *
## veh_bodyRDSTR -0.793357 0.659848 -1.202 0.229235
## veh_bodySEDAN -1.010881 0.317897 -3.180 0.001473 **
## veh_bodySTNWG -1.030325 0.317876 -3.241 0.001190 **
## veh_bodyTRUCK -1.035543 0.328328 -3.154 0.001611 **
## veh_bodyUTE -1.246375 0.322009 -3.871 0.000109 ***
## veh_age -0.012071 0.017616 -0.685 0.493202
## genderM -0.015962 0.030046 -0.531 0.595238
## areaB 0.062837 0.042795 1.468 0.142015
## areaC 0.008432 0.038990 0.216 0.828787
## areaD -0.111103 0.052961 -2.098 0.035918 *
## areaE -0.028520 0.057864 -0.493 0.622095
## areaF 0.101640 0.066141 1.537 0.124361
## agecat -0.078076 0.010238 -7.626 2.42e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 26768 on 67855 degrees of freedom
## Residual deviance: 26629 on 67834 degrees of freedom
## AIC: 36108
##
## Number of Fisher Scoring iterations: 6
# Bias on original scale?
mean(predict(fit, type = "response")) / mean(dataCar$numclaims) - 1
## [1] 6.017409e-14
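To back up the interpretation in the comments below, the multiplicative effect of veh_value can also be computed directly from the fitted coefficients (a small add-on, not part of the original code):
# Multiplicative effect of veh_value on the expected claim count
exp(coef(fit)["veh_value"]) - 1  # approximately 0.047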
Comments
Increasing veh_value by 1 (i.e., by 10'000 USD) increases the log of the expected count by 0.046. On the original count scale, this is an increase of approximately 4.6%. The exact effect is e^{0.046388}-1 \approx 0.047 = 4.7\%.
In order to illustrate logistic regression, we will now model the binary variable “claim yes=1/no=0” of the claims data by a logistic regression. Its logit link ensures that predicted probabilities are between 0 and 1 and that covariates act in a multiplicative way on the odds of having a claim.
library(ggplot2)
library(insuranceData)
data(dataCar)
# Distribution of the claim count
ggplot(dataCar, aes(x = factor(clm))) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'clm'")
fit <- glm(
clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
data = dataCar,
family = binomial(link = "logit")
)
summary(fit)
##
## Call:
## glm(formula = clm ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = binomial(link = "logit"), data = dataCar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.27962 0.38325 -3.339 0.000841 ***
## veh_value 0.05098 0.01787 2.853 0.004334 **
## veh_bodyCONVT -2.14392 0.70853 -3.026 0.002479 **
## veh_bodyCOUPE -0.89535 0.39235 -2.282 0.022486 *
## veh_bodyHBACK -1.13441 0.37287 -3.042 0.002347 **
## veh_bodyHDTOP -0.95742 0.38191 -2.507 0.012177 *
## veh_bodyMCARA -0.57311 0.46768 -1.225 0.220413
## veh_bodyMIBUS -1.27231 0.40312 -3.156 0.001599 **
## veh_bodyPANVN -0.92452 0.39409 -2.346 0.018978 *
## veh_bodyRDSTR -1.23888 0.82480 -1.502 0.133091
## veh_bodySEDAN -1.12596 0.37228 -3.025 0.002490 **
## veh_bodySTNWG -1.12698 0.37229 -3.027 0.002468 **
## veh_bodyTRUCK -1.15426 0.38272 -3.016 0.002562 **
## veh_bodyUTE -1.35734 0.37622 -3.608 0.000309 ***
## veh_age -0.01270 0.01896 -0.670 0.502834
## genderM -0.01048 0.03218 -0.326 0.744688
## areaB 0.09862 0.04598 2.145 0.031962 *
## areaC 0.04015 0.04189 0.959 0.337794
## areaD -0.08712 0.05653 -1.541 0.123285
## areaE -0.01270 0.06211 -0.204 0.837982
## areaF 0.10581 0.07182 1.473 0.140658
## agecat -0.08312 0.01097 -7.580 3.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33767 on 67855 degrees of freedom
## Residual deviance: 33632 on 67834 degrees of freedom
## AIC: 33676
##
## Number of Fisher Scoring iterations: 5
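As a small add-on (not in the original code), we can check the balance property of the canonical logit link: with an intercept in the model, the average fitted probability equals the observed claim frequency.
# Average predicted probability vs. observed claim frequency
c(mean(predict(fit, type = "response")), mean(dataCar$clm))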
Comments
Increasing veh_value by 1 (10'000 USD) increases the log-odds of having a claim by 0.051. On the odds scale, this is an increase of approximately 5.1%. The exact calculation is e^{0.05098}-1 \approx 0.052 = 5.2\%. Thus, for every additional 10'000 USD of vehicle value, the odds of having a claim increase by about 5.2%.
In the 1970s, working with data as big as the diamonds data would have been a major challenge. Today, primarily thanks to better hardware, we can solve linear regressions with millions of rows of data within seconds. Curiously, today's BLAS (Basic Linear Algebra Subprograms) and optimizers like BFGS have their roots in the 1970s.
While there are tools for working with data larger than RAM, our focus in this section is on in-memory computing. Thanks to cloud computing instances with 8 TB and more RAM, this is hardly a limitation, unless you are dealing with large image or video data.
The larger the data, the more we need to care about efficient data storage, data loading, preprocessing, and model fitting.
In the following example, we will use Parquet for data storage, load the data via Arrow, preprocess it with R’s “data.table”, and fit a linear regression with H2O. These technologies share three important properties: They are free to use, open-source, and are (currently) state of the art.
Before going through the example, here is a very brief introduction to the technologies mentioned.
Apache Parquet is a column-oriented data storage format introduced in 2013. Unlike csv files, it uses various methods to compress data, such as dictionary encoding, run-length encoding, and general-purpose compression codecs like Snappy or gzip.
Besides compression, a big advantage compared to a csv file is that Parquet also stores meta information like: “the values represent strings” or “the values represent dates”. A disadvantage is that the Parquet file is not human-readable.
Parquet files can be read via the (Py)Arrow library in R or Python. Many other technologies (Apache Spark, other database systems) also support working with Parquet files. We will get to know some of them later.
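A minimal round trip with the arrow package (the file name is made up for illustration):
library(arrow)
# Write a data frame to Parquet and read it back; column types are preserved
write_parquet(ggplot2::diamonds, "diamonds.parquet")
dia <- read_parquet("diamonds.parquet")
file.size("diamonds.parquet")  # typically much smaller than the equivalent csv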
Apache Arrow, available since 2016, is a language-independent standard for in-memory processing and transport of data. A core component of the Arrow library is its in-memory columnar data format (also called “Arrow”). For instance, an Arrow table in R has exactly the same representation in memory as in Python or Java, so there is no extra effort to transfer it from one system to the other.
Arrow is widely used in R and Python for reading and writing Parquet files. Furthermore, it plays well together with Apache Spark, a big data technology. We will meet Spark later.
The data.table package is an R package for working with large data in an efficient way. It was released in 2006 as an extension of R's data.frame and performs most operations in-place, i.e., without making copies. “data.table” is famous for its fast csv reader/writer. One of its biggest successes was that its sort algorithm became part of base R. There is also a Python package “datatable”, but it is less mature. At the time of writing, “Polars” seems to be a good alternative in Python.
H2O is an ML software bundle developed by h2o.ai for in-memory cluster computing. (A cluster consists of multiple independent computers.) H2O is written in Java but offers wrappers for R and Python. Its algorithms include (penalized) GLMs, random forests, gradient boosting, and deep neural networks.
H2O facilitates typical workflows like selecting models via cross-validation and putting final models into production. It has been available since (around) 2014.
In this example, we retrieve information about all yellow cab trips made in New York City in January 2018. Download the data from the official website and save it in the subfolder “taxi”. The data set comprises more than eight million trips. We will use it to build a simple linear regression model that predicts trip duration from trip distance, pickup hour, weekday, and pickup location.
For a dataset of this size, we could easily have worked with csv data, dplyr, and lm() instead of the Parquet/Arrow/data.table/H2O stack. However, imagine using all cab trips from all available months, which would result in a dataset with roughly 1 billion rows. In this case, the classic approach would be painfully slow.
Computation time refers to a single run on an Intel i7 CPU with four physical cores.
library(arrow)
library(data.table)
library(ggplot2)
library(h2o)
system.time( # 3 seconds
df <- read_parquet("taxi/yellow_tripdata_2018-01.parquet")
)
dim(df)
# 8'760'687 19
setDT(df)
head(df)
# fwrite(df, file = "taxi/test.csv") # 767 MB vs. 117 MB Parquet
# Data prep
system.time({ # 5 seconds
df[, duration := as.numeric(
difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")
)]
df = df[between(trip_distance, 0.2, 100) & between(duration, 1, 120)]
df[, `:=`(
pu_hour = factor(data.table::hour(tpep_pickup_datetime)),
weekday = factor(data.table::wday(tpep_pickup_datetime)), # 1 = Sunday
pu_loc = forcats::fct_lump_min(factor(PULocationID), 1e5),
log_duration = log(duration),
log_distance = log(trip_distance)
)]
})
# Fast group-by operations (instantaneous)
df[, .(.N, Mean_duration = mean(duration)), by = pu_loc]
# Partial output
# pu_loc N Mean_duration
# 1: Other 1147218 13.49798
# 2: 239 226155 10.68894
# 3: 262 108240 10.85351
# 4: 140 158934 11.92409
# 5: 246 117323 12.92450
# 6: 143 100424 11.09504
# Plot only subset
ggplot(df[sample(df[, .N], 1e4)], aes(log_duration, log_distance)) +
geom_point(alpha = 0.1, color = "chartreuse4", size = 1)
x <- c("log_distance", "weekday", "pu_hour", "pu_loc")
y <- "log_duration"
# Fit model with 68 parameters via lm(): 40 seconds
system.time(
fit <- lm(reformulate(x, y), data = df)
)
summary(fit)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 1.6153038 0.0011650 1386.474 < 2e-16 ***
# log_distance 0.7409562 0.0001484 4992.110 < 2e-16 ***
# weekday2 0.0368042 0.0004461 82.508 < 2e-16 ***
# weekday3 0.1137304 0.0004399 258.566 < 2e-16 ***
# weekday4 0.1367779 0.0004339 315.250 < 2e-16 ***
# weekday5 0.1691227 0.0004636 364.838 < 2e-16 ***
# weekday6 0.1860435 0.0004506 412.867 < 2e-16 ***
# weekday7 0.0910054 0.0004479 203.176 < 2e-16 ***
# pu_hour1 -0.0105750 0.0008793 -12.027 < 2e-16 ***
# pu_hour2 -0.0259596 0.0009716 -26.718 < 2e-16 ***
# pu_hour3 -0.0538456 0.0010722 -50.222 < 2e-16 ***
# pu_hour4 -0.0951928 0.0012106 -78.630 < 2e-16 ***
# pu_hour5 -0.1828248 0.0013401 -136.425 < 2e-16 ***
# pu_hour6 -0.2510334 0.0012945 -193.922 < 2e-16 ***
# pu_hour7 -0.1358578 0.0009603 -141.468 < 2e-16 ***
# pu_hour8 0.0846989 0.0008258 102.569 < 2e-16 ***
# pu_hour9 0.2584242 0.0007837 329.763 < 2e-16 ***
# pu_hour10 0.2955111 0.0007815 378.120 < 2e-16 ***
# pu_hour11 0.2858473 0.0007868 363.289 < 2e-16 ***
# pu_hour12 0.3019318 0.0007811 386.525 < 2e-16 ***
# pu_hour13 0.3034980 0.0007699 394.187 < 2e-16 ***
# pu_hour14 0.2929642 0.0007688 381.072 < 2e-16 ***
# pu_hour15 0.3076331 0.0007588 405.447 < 2e-16 ***
# pu_hour16 0.3099760 0.0007552 410.439 < 2e-16 ***
# pu_hour17 0.2830588 0.0007639 370.529 < 2e-16 ***
# pu_hour18 0.3006036 0.0007466 402.609 < 2e-16 ***
# pu_hour19 0.2710909 0.0007313 370.699 < 2e-16 ***
# pu_hour20 0.1863767 0.0007387 252.294 < 2e-16 ***
# pu_hour21 0.0967437 0.0007558 128.009 < 2e-16 ***
# pu_hour22 0.0543461 0.0007574 71.749 < 2e-16 ***
# pu_hour23 0.0252677 0.0007697 32.828 < 2e-16 ***
# pu_loc48 0.0193767 0.0011621 16.673 < 2e-16 ***
# pu_loc68 0.0639028 0.0012200 52.381 < 2e-16 ***
# ...
# pu_loc262 -0.1032492 0.0014011 -73.693 < 2e-16 ***
# pu_loc263 -0.1020595 0.0012625 -80.838 < 2e-16 ***
# pu_loc264 0.0168296 0.0013044 12.902 < 2e-16 ***
# pu_locOther -0.0488877 0.0010131 -48.255 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.3349 on 8640016 degrees of freedom
# Multiple R-squared: 0.7818, Adjusted R-squared: 0.7818
#====================================================================
# The same with H2O
#====================================================================
# Create single-node Java "cluster" locally on laptop
h2o.init(min_mem_size = "6G")
# Copy data to cluster
h2o_df <- as.h2o(df[, c(x, y), with = FALSE])
# Calculations are all done in Java
summary(h2o_df)
#
# log_distance weekday pu_hour pu_loc log_duration
# Min. :-1.60944 4:1470310 19:569055 Other:1147218 Min. :0.000
# 1st Qu.:-0.06683 3:1381955 20:537207 237 : 357807 1st Qu.:1.848
# Median : 0.45149 2:1245478 18:511166 161 : 351495 Median :2.346
# Mean : 0.56504 6:1209557 16:483808 236 : 342507 Mean :2.332
# 3rd Qu.: 1.05620 7:1188593 21:476472 162 : 305512 3rd Qu.:2.825
# Max. : 4.56101 5:1081250 15:472400 230 : 305349 Max. :4.787
system.time( # 5 s
fit_h2o <- h2o.glm(
x, "log_duration", training_frame = h2o_df, compute_p_values = TRUE, lambda = 0
)
)
fit_h2o
# Partial output (pu_loc values are different due to a different reference category)
# Coefficients: glm coefficients
# names coefficients std_error z_value p_value standardized_coefficients
# 1 Intercept 1.725249 0.001012 1705.107619 0.000000 2.143916
# 2 pu_loc.107 -0.049542 0.001077 -45.998056 0.000000 -0.049542
# 3 pu_loc.113 -0.032204 0.001189 -27.089905 0.000000 -0.032204
# 4 pu_loc.114 -0.005347 0.001258 -4.249429 0.000021 -0.005347
# 5 pu_loc.132 -0.439152 0.001141 -384.927728 0.000000 -0.439152
#
# ---
# names coefficients std_error z_value p_value standardized_coefficients
# 63 weekday.3 0.113730 0.000440 258.566151 0.000000 0.113730
# 64 weekday.4 0.136778 0.000434 315.249577 0.000000 0.136778
# 65 weekday.5 0.169123 0.000464 364.837805 0.000000 0.169123
# 66 weekday.6 0.186043 0.000451 412.866598 0.000000 0.186043
# 67 weekday.7 0.091005 0.000448 203.175671 0.000000 0.091005
# 68 log_distance 0.740956 0.000148 4992.109899 0.000000 0.663356
#
# H2ORegressionMetrics: glm
# ** Reported on training data. **
#
# MSE: 0.1121562
# RMSE: 0.3348973
# MAE: 0.2598114
# RMSLE: 0.1149344
# Mean Residual Deviance : 0.1121562
# R^2 : 0.7817626
Comments
The results of lm() and h2o.glm() are identical (up to using a different reference level for the pickup location).
h2o.glm() is faster than lm(). One of the reasons is that its solvers do not require internal dummy encoding of the factors.
For the first four exercises, start with this snippet to turn the ordered factors into unordered ones.
library(tidyverse)
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
Try to improve the model from Exercise 1 by adding interactions between our main predictor “carat” and “color”, between “carat” and “cut”, and also between “carat” and “clarity”. Why could this make sense? How many additional parameters are required? How does the RMSE react?
In the regression in Exercise 1, represent “carat” by a restricted cubic spline with four knots. What is a restricted cubic spline (check on the internet)? How much better does the RMSE get? Visualize the effect of “carat”. Hint: Restricted cubic splines are available in the R package “splines”.
Fit a Gamma regression with log-link to explain diamond prices by “log(carat)”, “color”, “cut”, and “clarity”. Compare the coefficients with those of a linear regression having the same covariates, but using “log(price)” as response. Calculate the relative bias of the average prediction. Why isn’t it 0?
Fit the Gamma GLM of Exercise 4 with H2O, optionally replacing the data preparation with “data.table”. Do you get the same results?
In this chapter, we have revisited multiple linear regression and some of its many aspects. Additionally, we met an important generalization of linear regression, namely the generalized linear model (GLM). It includes binary logistic regression and Poisson count regression as relevant special cases. In the last section, we met tools to work with large data.