Before heading into linear regression and the generalized linear model, we introduce some basic notation.
Structured data is organized as one or multiple tables. Each row of a table represents an observation, each column X a variable and each cell a value. A value in column X can be viewed as a realization of the random variable X.
Examples of unstructured data are images, text, audio or video data. We will deal with structured data only.
Throughout this lecture, we will consider the following two data sets:
diamonds: Diamond prices along with the four “C” variables Carat, Color, Cut, and Clarity. Each observation/row represents a diamond.

dataCar: Insurance claim data on vehicle insurance policies from 2004-2005. Some variables, like gender, describe the policy holder, others, like veh_age, the vehicle, and some carry information on the claims, if any. Each row represents a policy.
Let us have a look at the first six observations of the diamonds data set.
library(ggplot2)
head(diamonds)
Comment: The data set is neatly structured. It seems to be sorted by price.
We distinguish variables by their data type.
Data types are important in determining suitable analysis methods.
In the diamonds data set, we will consider the numeric variables price and carat and the following ordered categoricals:

- color with ordered categories D > E > F > G > H > I > J,
- cut with ordered categories Fair < Good < Very Good < Premium < Ideal, and
- clarity with ordered categories I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF.

There are no unordered categoricals or binary variables in this data set.
Statistical modeling always starts with a descriptive analysis of the data. This typically involves a numeric and/or graphical summary of each variable and of the most relevant variable pairs, e.g., histograms and bar charts per variable and scatterplots or boxplots for pairs. The more time we invest in the descriptive analysis, the more we learn about the data. Additionally, we might spot data problems such as outliers or missing values. During or after the descriptive analysis, we usually do the data preprocessing for the modeling task.
Let us summarize the diamonds data set.
library(ggplot2)
# Number of observations
nrow(diamonds) # 53940
## [1] 53940
# Univariate description
summary(diamonds[, c("price", "carat", "color", "cut", "clarity")])
## price carat color cut clarity
## Min. : 326 Min. :0.2000 D: 6775 Fair : 1610 SI1 :13065
## 1st Qu.: 950 1st Qu.:0.4000 E: 9797 Good : 4906 VS2 :12258
## Median : 2401 Median :0.7000 F: 9542 Very Good:12082 SI2 : 9194
## Mean : 3933 Mean :0.7979 G:11292 Premium :13791 VS1 : 8171
## 3rd Qu.: 5324 3rd Qu.:1.0400 H: 8304 Ideal :21551 VVS2 : 5066
## Max. :18823 Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
ggplot(diamonds, aes(x = price)) +
geom_histogram(fill = "chartreuse4") +
ggtitle("Distribution of 'price'")
ggplot(diamonds, aes(x = carat)) +
geom_histogram(fill = "chartreuse4") +
ggtitle("Distribution of 'carat'")
ggplot(diamonds, aes(x = color)) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'color'")
ggplot(diamonds, aes(x = cut)) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'cut'")
ggplot(diamonds, aes(x = clarity)) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'clarity'")
# Selected bivariate descriptions
ggplot(diamonds, aes(y = price, x = carat)) +
geom_point(alpha = 0.2, shape = ".", color = "chartreuse4") +
ggtitle("Price against carat")
ggplot(diamonds, aes(y = log(price), x = log(carat))) +
geom_point(alpha = 0.2, shape = ".", color = "chartreuse4") +
ggtitle("Price against carat (log-log-scale)")
ggplot(diamonds, aes(y = price, x = color)) +
geom_boxplot(fill = "chartreuse4", varwidth = TRUE) +
ggtitle("Price against color")
ggplot(diamonds, aes(y = price, x = cut)) +
geom_boxplot(fill = "chartreuse4", varwidth = TRUE) +
ggtitle("Price against cut")
ggplot(diamonds, aes(y = price, x = clarity)) +
geom_boxplot(fill = "chartreuse4", varwidth = TRUE) +
ggtitle("Price against clarity")
Comments: The distributions of price and carat are right-skewed. The extreme levels of color, cut, and clarity are rare.

The general modeling task is as follows: we want to approximate a response variable Y by a function f of m covariates X_1, \dots, X_m, i.e., Y \approx f(X_1, \dots, X_m). The function f is unknown, and we want to estimate it by \hat f from observed data.
Note: Think of the response Y and the covariates X_1, \dots, X_m as columns in a data set.
Normally, we are interested in modeling a specific property of Y, usually its expectation E(Y) (= theoretical mean). In that case, we can make the above approximate relationship more explicit by writing down the model equation E(Y) = f(X_1, \dots, X_m).
Once found, \hat f serves as our prediction function that can be applied to fresh data. Furthermore, we can investigate the structure of \hat f to gain insights about the relationship between response and covariates: Which variables are especially important? How do they influence the response?
Remark: Other terms for “response variable” are “output”, “target” or “dependent variable”. Other terms for “covariate” are “input”, “feature”, “independent variable” or “predictor”.
In order to get used to the terms mentioned above, we will look at the mother of all machine learning algorithms: (multiple) linear regression. It was first published by Adrien-Marie Legendre in 1805 [1] and is still very frequently used thanks to its simplicity, interpretability, and flexibility. It further serves as a simple benchmark for more complex algorithms and is the starting point for extensions like the generalized linear model.
The model equation of the linear regression is as follows: E(Y) = f(X_1, \dots, X_m) = \beta_0 + \beta_1 X_1 + \cdots + \beta_m X_m. It relates the covariates X_1, \dots, X_m to the expected response E(Y) by a linear formula in the parameters \beta_0, \dots, \beta_m. The additive constant \beta_0 is called the intercept. The parameter \beta_j tells us by how much Y is expected to change when X_j is increased by 1, keeping all other covariates fixed (“Ceteris Paribus”). Indeed: E(Y \mid X_j = x + 1) - E(Y \mid X_j = x) = \beta_j (x + 1) - \beta_j x = \beta_j. The parameter \beta_j is called the effect of X_j on the expected response E(Y).
A linear regression with just one covariate is called a simple linear regression with equation E(Y) = \alpha + \beta X.
The optimal \hat f to estimate f is found by minimizing the sum of squared prediction errors (residuals) \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \hat y_i)^2. y_i is the observed response of observation i and \hat y_i its prediction (or fitted value) \hat y_i = \hat f(\text{Values of covariates of observation } i).
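To make this concrete, here is a minimal sketch (ours, not part of the original example) showing that the least-squares solution can be computed directly from the normal equations; the names X, y, and beta_hat are our own.

library(ggplot2)  # provides the diamonds data

# Design matrix: a column of ones (intercept) plus 'carat'
X <- cbind(intercept = 1, carat = diamonds$carat)
y <- diamonds$price

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
beta_hat  # agrees with coef(lm(price ~ carat, data = diamonds))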
Once the model is fitted, we can use the coefficients \hat\beta_0, \dots, \hat\beta_m to make predictions and to study empirical effects of the covariates on the expected response.
In order to discuss the typical output of a linear regression, we will now model diamond prices by their size. The model equation is E(\text{price}) = \alpha + \beta \cdot \text{carat}.
library(ggplot2)
fit <- lm(price ~ carat, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
intercept <- coef(fit)[[1]]
slope <- coef(fit)[[2]]
# Visualize the regression line
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.2, shape = ".") +
coord_cartesian(xlim = c(0, 3), ylim = c(-3000, 20000)) +
geom_abline(slope = slope, intercept = intercept, color = "chartreuse4", size = 1)
# Predictions for diamonds with 1.3 carat?
predict(fit, data.frame(carat = 1.3))
## 1
## 7826.993
# By hand
intercept + slope * 1.3
## [1] 7826.993
Comments: Per additional carat, the expected price rises by about 7756 USD. The negative intercept of about -2256 USD implies negative predictions for very small diamonds, a first hint that the model equation is not ideal (more on this below).
How good is a specific linear regression model? We may consider two aspects, namely the precision of its predictions and the validity of its assumptions (discussed further below).
How precise are the model predictions? I.e., how well do predictions correspond with the observed response? In line with the least-squares approach, this is best quantified by the sum of squared prediction errors \sum_{i = 1}^n (y_i - \hat y_i)^2 or, equivalently, by the mean-squared-error \text{MSE} = \frac{1}{n}\sum_{i = 1}^n (y_i - \hat y_i)^2. To quantify the size of the typical prediction error on the same scale as Y, we can take the square-root of the MSE and study the root-mean-squared error (RMSE). Minimizing MSE also minimizes RMSE.
Remark: The squared error, i.e., the function that quantifies the prediction error or loss of a single observation, is an example of a loss function. Later, we will meet other loss functions. Generally speaking, supervised ML algorithms try to minimize as objective function the total (or average) loss over the training sample, possibly adding a regularization penalty against overfitting. Overfitting is the tendency of a model to learn the training data set by heart, i.e., it performs much better on the training data than on unseen data. The losses can be weighted in order to deal with case weights.
Besides an absolute performance measure like the RMSE, we gain additional insights by studying a relative performance measure like the R-squared. It measures the relative decrease in MSE compared to the MSE of the “empty” or “null” model consisting only of an intercept. Put differently, the R-squared measures the proportion of variability of Y explained by the covariates.
Let us calculate these performance measures for the simple linear regression above.
mse <- function(y, pred) {
mean((y - pred)^2)
}
(MSE <- mse(diamonds$price, predict(fit, diamonds)))
## [1] 2397955
(RMSE <- sqrt(MSE))
## [1] 1548.533
empty_model <- lm(price ~ 1, data = diamonds) # predictions equal mean(diamonds$price)
MSE_empty <- mse(diamonds$price, predict(empty_model, diamonds))
# R-squared
(MSE_empty - MSE) / MSE_empty
## [1] 0.8493305
Comments: The RMSE of about 1549 USD agrees with the “Residual standard error” in the regression summary, and the R-squared of 0.8493 reproduces the “Multiple R-squared” reported there.
The main assumption of linear regression is a correctly specified model equation in the sense that predictions are not systematically too high or too low for certain values of the covariates. In a simple regression setting, this means that the points in the scatterplot are located around the regression line for all covariate values. For a multiple linear regression, this translates to the empirical condition that the residuals (differences between observed and fitted response) do not show bias when plotted against covariate values.
Additional assumptions like independence of rows, constant variance of the error term \varepsilon in the equation Y = f(X_1, \dots, X_m) + \varepsilon and normal distribution of \varepsilon guarantee optimality of the least-squares estimator \hat \beta_0, \dots, \hat\beta_m and the correctness of inferential statistics (standard errors, p values, confidence intervals). In that case, we talk of the normal linear model. Its conditions are checked by studying diagnostic plots. We skip this part for brevity and since we are not digging into inferential statistics.
When looking at the scatter plot enhanced with the regression line from above, we can spot systematically too low (even negative!) predictions for very small diamonds. This indicates a wrongly specified model. Later, we will see ways to fix this.
Here, we list some problems that frequently occur with linear regression. We will only mention them without going into details.
Like many other ML algorithms, linear regression cannot deal with missing values. Rows with missing response can be safely dropped, while missing values in covariates should usually be dealt with. The simplest (often too naive) approach is to fill missing values with a typical value such as the mean or the most frequent value.
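A short sketch of the naive approach (the diamonds data is complete, so we introduce missing values artificially; the object names are ours):

library(ggplot2)

# The diamonds data has no missing values, so we poke holes into a copy
dia <- diamonds
dia$carat[c(1, 5)] <- NA
dia$color[3] <- NA

# Numeric covariate: fill with the mean
dia$carat[is.na(dia$carat)] <- mean(dia$carat, na.rm = TRUE)

# Categorical covariate: fill with the most frequent level
most_frequent <- names(which.max(table(dia$color)))
dia$color[is.na(dia$color)] <- most_frequent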
Gross outliers in covariates can distort the result of the linear regression. Do not delete them, but try to reduce their effect by taking logarithms or by using more robust regression techniques. Outliers in the response can be problematic as well, especially for inferential statistics.
If too many parameters are used relative to the number of observations, the resulting model might look good but will not generalize well to new data. This is called overfitting. A small amount of overfitting is unproblematic, but do not fit a model with m = 100 parameters on a data set with just n = 200 rows; the resulting model would be garbage. An n/m ratio of 50-100 is usually safe for a stable estimation of parameters.
If the association between two or more covariates is strong, their coefficients are difficult to interpret because the Ceteris Paribus clause is usually unnatural in such situations. In a house price model, for instance, it is unnatural to study the effect of an additional room, keeping living area fixed. This is even more problematic for causally dependent covariates: imagine a model with x and x^2 as covariates. It would certainly not make sense to study the effect of x while keeping x^2 fixed.
Strong collinearity can be detected by looking at correlations across (numeric) covariates. It is mainly a problem when interpreting effects or for statistical inference of effects. Predictions or other “global” model characteristics like the R-squared are not affected.
Often, collinearity can be reduced by transforming the covariates in a way that makes the Ceteris Paribus clause natural. Instead of, e.g., using number of rooms and living area in a house price model, it might help to represent living area by the derived variable “living area per room”.
Note: Perfectly collinear covariates (for example X and 2X) cannot be used for algorithmic reasons.
Since algorithms usually only understand numbers, categorical variables have to be encoded by numbers. The standard approach is called one-hot-encoding (OHE) and works as follows: Each level x_k of the categorical variable X gets its own binary dummy variable D_k = 1(X = x_k), indicating if X has this particular value or not. In linear models, one of the dummy variables (D_1, say) needs to be dropped due to perfect collinearity (for each row, the sum of OHE variables is always 1). Its level is automatically being represented by the intercept. This variant of OHE is called dummy coding.
For our diamonds data set, OHE for the variable color looks as follows (the first column is the original categorical variable, the other columns are the dummy variables):
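Since the table itself is not reproduced here, the encoding can be inspected directly via model.matrix(); note that we first drop the ordering of color because R would otherwise use polynomial contrasts instead of dummies. This sketch is ours.

library(ggplot2)

# Use an unordered factor so that lm()/model.matrix() apply dummy coding
dia <- transform(diamonds, color = factor(color, ordered = FALSE))

# Dummy coding: one indicator per level except the reference level "D"
head(model.matrix(~ color, data = dia))

# Full one-hot encoding: all levels, no intercept column
head(model.matrix(~ color - 1, data = dia))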
Let us now extend the simple linear regression for diamond prices by adding dummy variables for the categorical covariate color.
library(ggplot2)
# Turn ordered into unordered factor
diamonds <- transform(diamonds, color = factor(color, ordered = FALSE))
fit <- lm(price ~ carat + color, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = price ~ carat + color, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18345.1 -765.8 -72.8 558.5 12288.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2136.23 20.12 -106.162 < 2e-16 ***
## carat 8066.62 14.04 574.558 < 2e-16 ***
## colorE -93.78 23.25 -4.033 5.51e-05 ***
## colorF -80.26 23.40 -3.429 0.000605 ***
## colorG -85.54 22.67 -3.773 0.000161 ***
## colorH -732.24 24.35 -30.067 < 2e-16 ***
## colorI -1055.73 27.31 -38.657 < 2e-16 ***
## colorJ -1914.47 33.78 -56.679 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1472 on 53932 degrees of freedom
## Multiple R-squared: 0.864, Adjusted R-squared: 0.8639
## F-statistic: 4.893e+04 on 7 and 53932 DF, p-value: < 2.2e-16
Comments: Adding color has changed the slope of carat from 7756 in the simple linear regression to 8067. This is the effect of carat adjusted for color. Each non-reference level of color has received its own coefficient: switching from the reference color “D” to “E” is associated with a price reduction of 94 USD, and the effect of color “F” compared to color “D” is about -80 USD.

Linear regression is flexible regarding how variables are represented in the linear equation. Possibilities include non-linear terms (polynomials, regression splines), interactions, and transformations such as logarithms, all covered in the following subsections.
These elements are very important for making realistic models.
Important numeric covariates might be represented by more than just one parameter (= linear slope) to allow for more flexible, non-linear associations with the response. Adding a squared term, e.g., allows for curvature, and adding enough polynomial terms can approximate any smooth relationship.
For example, the model equation of a cubic regression is E(Y) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3. An alternative to polynomial terms are regression splines, i.e., piecewise polynomials. Using non-linear terms makes the interpretation of regression coefficients difficult. An option is to study predictions while sliding the covariate over its range (systematic predictions).
How would a cubic regression approximate the relationship between diamond prices and carat?
The polynomial terms used for modeling look as follows:
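A quick way to peek at them (our sketch): poly() generates an orthogonal polynomial basis by default; raw = TRUE would give plain powers of carat instead.

library(ggplot2)

# First rows of the degree-3 polynomial basis used by the model below
head(poly(diamonds$carat, 3))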
And here is the model, including systematic predictions.
library(tidyverse)
fit <- lm(price ~ poly(carat, 3), data = diamonds)
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 4.5, by = 0.1)) %>%
mutate(price = predict(fit, .)) %>%
ggplot(aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point()
Once fitted, the effect of a covariate does not depend on the values of the other covariates. This is a direct consequence of the additivity of the model equation. The additivity assumption is sometimes too strict: treatment effects might be larger for younger patients than for older ones, or an extra 0.1 carat of diamond weight might be worth more for a beautiful white diamond than for an unspectacular yellow one. In such cases, adding interaction terms provides the necessary flexibility. Mathematically, an interaction term between covariates X and Z equals their product. Practically, it means that the effect of X depends on the value of Z.
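A quick check of the “product” statement (our sketch, using the numeric covariate table from the diamonds data): specifying the interaction explicitly as a product gives identical coefficients.

library(ggplot2)

# 'carat * table' expands to carat + table + carat:table, where the
# interaction carat:table is just the product of the two columns
fit1 <- lm(price ~ carat * table, data = diamonds)
fit2 <- lm(price ~ carat + table + I(carat * table), data = diamonds)
all.equal(unname(coef(fit1)), unname(coef(fit2)))  # TRUE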
Let us now fit a linear regression for diamond prices with covariates carat and color, once without and once with interaction. We interpret the resulting models by looking at systematic predictions (sliding both carat and color over their ranges).
library(tidyverse)
# Turn all ordered factors into unordered
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
no_interaction <- lm(price ~ carat + color, data = diamonds)
with_interaction <- lm(price ~ carat * color, data = diamonds)
# Plot effect of carat grouped by color
to_plot <- expand.grid(
carat = seq(0.3, 2.5, by = 0.1),
color = levels(diamonds$color)
) %>%
mutate(
no_interaction = predict(no_interaction, .),
with_interaction = predict(with_interaction, .)
) %>%
pivot_longer(
ends_with("interaction"),
names_to = "model",
values_to = "prediction"
)
ggplot(to_plot, aes(x = carat, y = prediction, group = color, color = color)) +
geom_line() +
geom_point() +
facet_wrap(~ model)
Comments: Without the interaction, the effect of carat does not depend on the color and, similarly, the effect of color does not depend on the size. This is not very realistic, as color effects are expected to be larger for large diamonds. The model with interaction allows for exactly this.

Covariates are often transformed before entering the model, the most prominent example being the logarithm considered next.
Not surprisingly, coefficients explain how the transformed variable acts on the expected response. For a log-transformed covariate X, we can even interpret the coefficient regarding the untransformed X. In the model equation E(Y\mid X = x) = \alpha + \beta \log(x), we can say: A 1% increase in X leads to an increase in E(Y) of about \beta/100. Indeed, we have E(Y\mid X = 101\% \cdot x) - E(Y\mid X = x) = \alpha + \beta \log (1.01 \cdot x) - \alpha - \beta \log(x) \\ = \beta \log\left(\frac{1.01 \cdot x}{x}\right)= \beta \log(1.01) \approx \beta/100. Thus, taking logarithms of covariates not only deals with outliers, it also offers us the possibility to talk about percentages.
What would our simple linear regression provide with logarithmic carat as the single covariate?
library(tidyverse)
fit <- lm(price ~ log(carat), data = diamonds)
fit
##
## Call:
## lm(formula = price ~ log(carat), data = diamonds)
##
## Coefficients:
## (Intercept) log(carat)
## 6238 5836
to_plot <- data.frame(carat = seq(0.3, 4.5, by = 0.1)) %>%
mutate(price = predict(fit, .))
# log-scale
ggplot(to_plot, aes(x = log(carat), y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
ggtitle("log-scale")
# original scale
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
ggtitle("Original scale")
Comments: Increasing log(carat) by 1 leads to an expected price increase of 5836 USD. Equivalently, a 1% increase in carat is associated with a price increase of about 5836/100 \approx 58 USD.

We have seen that taking logarithms not only reduces outlier effects in covariates but also allows us to think in percentages. What happens if we log-transform the response variable? The model of a simple linear regression would then be E(\log(Y) \mid X = x) = \alpha + \beta x. Claim: The effect \beta tells us by what percentage we can expect Y to change when increasing X by 1. Thus, a logarithmic response leads to a multiplicative instead of an additive model.
Proof
Assume for a moment that we can swap taking expectations and logarithms (disclaimer: we cannot). In that case, the model would be
\log(E(Y\mid X = x)) = \alpha + \beta x or, after exponentiation, E(Y\mid X = x) =e^{\alpha + \beta x}. The additive effect of increasing X by 1 would be E(Y\mid X = x+1) - E(Y\mid X = x) = e^{\alpha + \beta (x+1)} - e^{\alpha + \beta x} \\ = e^{\alpha + \beta x}e^\beta - e^{\alpha + \beta x} = e^{\alpha + \beta x}(e^\beta - 1) = E(Y\mid X = x)(e^\beta - 1). Dividing both sides by E(Y\mid X = x) gives \underbrace{\frac{E(Y\mid X = x+1) - E(Y\mid X = x)}{E(Y\mid X = x)}}_{\text{Relative change in } E(Y)} = e^\beta-1 \approx \beta = \beta \cdot 100\%. Indeed: A one point increase in X is associated with a relative increase in E(Y) of about \beta \cdot 100\%.
Since expectations and logarithms cannot be swapped, the calculation is not 100% correct. One consequence of this imperfection is that predictions back-transformed to the scale of Y are biased. One motivation of the generalized linear model (GLM, see next section) will be to mend this problem in an elegant way.
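The bias is easy to demonstrate with simulated data (a sketch under our own assumptions: a log-normal response with known parameters). By Jensen’s inequality, E(Y \mid X = x) = e^{\alpha + \beta x} e^{\sigma^2/2}, so naively exponentiating the fitted values underestimates E(Y).

set.seed(1)
n <- 100000
x <- runif(n)
y <- exp(1 + 0.5 * x + rnorm(n, sd = 0.5))  # log-normal response

fit_log <- lm(log(y) ~ x)

# Relative bias of naively back-transformed predictions:
# close to exp(0.5^2 / 2) - 1, i.e., about +13%
mean(y) / mean(exp(fitted(fit_log))) - 1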
How would our simple linear regression look with log(price) as response?
library(tidyverse)
fit <- lm(log(price) ~ carat, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2844 -0.2449 0.0335 0.2578 1.5642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.215021 0.003348 1856 <2e-16 ***
## carat 1.969757 0.003608 546 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3972 on 53938 degrees of freedom
## Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
## F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
to_plot <- data.frame(carat = seq(0.3, 2.5, by = 0.1)) %>%
mutate(price = exp(predict(fit, .)))
# log-scale
ggplot(to_plot, aes(x = carat, y = log(price))) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0, 3)) +
ggtitle("log-scale")
# original scale
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0, 3)) +
ggtitle("Original scale")
Using logarithms for either price or carat did not provide a satisfactory model yet. What about applying logarithms to both response and covariate at the same time?
library(tidyverse)
fit <- lm(log(price) ~ log(carat), data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50833 -0.16951 -0.00591 0.16637 1.33793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661 0.001365 6190.9 <2e-16 ***
## log(carat) 1.675817 0.001934 866.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF, p-value: < 2.2e-16
to_plot <- data.frame(carat = seq(0.3, 2.5, by = 0.1)) %>%
mutate(price = exp(predict(fit, .)))
# log-log-scale
ggplot(to_plot, aes(x = log(carat), y = log(price))) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = log(c(0.3, 3))) +
ggtitle("Log-log scale")
# Back-transformed
ggplot(to_plot, aes(x = carat, y = price)) +
geom_point(data = diamonds, shape = ".", alpha = 0.2, color = "chartreuse4") +
geom_line() +
geom_point() +
coord_cartesian(x = c(0.3, 3)) +
ggtitle("Original scale")
# Relative bias on original scale
mean(diamonds$price) / mean(exp(fitted(fit))) - 1
## [1] 0.0313502
Comments: On the log-log scale, the association between price and carat is now close to linear, and the R-squared has increased to 0.933. However, due to the back-transformation bias discussed above, the predictions on the original US dollar scale are on average about 3% too low.
To end the section on linear regression, we extend the log-log example above by adding color, cut, and clarity as categorical covariates.
library(tidyverse)
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
fit <- lm(log(price) ~ log(carat) + color + cut + clarity, data = diamonds)
summary(fit)
##
## Call:
## lm(formula = log(price) ~ log(carat) + color + cut + clarity,
## data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01107 -0.08636 -0.00023 0.08341 1.94778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.856856 0.005758 1364.43 <2e-16 ***
## log(carat) 1.883718 0.001129 1668.75 <2e-16 ***
## colorE -0.054277 0.002118 -25.62 <2e-16 ***
## colorF -0.094596 0.002142 -44.16 <2e-16 ***
## colorG -0.160378 0.002097 -76.49 <2e-16 ***
## colorH -0.251071 0.002225 -112.85 <2e-16 ***
## colorI -0.372574 0.002492 -149.50 <2e-16 ***
## colorJ -0.510983 0.003074 -166.24 <2e-16 ***
## cutGood 0.080048 0.003890 20.57 <2e-16 ***
## cutVery Good 0.117215 0.003619 32.39 <2e-16 ***
## cutPremium 0.139345 0.003579 38.94 <2e-16 ***
## cutIdeal 0.161218 0.003548 45.44 <2e-16 ***
## claritySI2 0.427879 0.005178 82.64 <2e-16 ***
## claritySI1 0.592954 0.005149 115.17 <2e-16 ***
## clarityVS2 0.742164 0.005178 143.34 <2e-16 ***
## clarityVS1 0.812277 0.005257 154.52 <2e-16 ***
## clarityVVS2 0.947271 0.005418 174.83 <2e-16 ***
## clarityVVS1 1.018743 0.005575 182.73 <2e-16 ***
## clarityIF 1.113732 0.006030 184.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1338 on 53921 degrees of freedom
## Multiple R-squared: 0.9826, Adjusted R-squared: 0.9826
## F-statistic: 1.693e+05 on 18 and 53921 DF, p-value: < 2.2e-16
Comments: Adding the three categorical covariates has raised the R-squared to 0.9826. All effects point in the plausible direction: prices decrease from color “D” to “J” and increase with better cut and clarity. The coefficient of log(carat) of about 1.88 can be read as an elasticity: a 1% increase in carat is associated with a price increase of roughly 1.88%.
As an alternative to the multiple linear regression on diamond prices with logarithmic price and logarithmic carat, consider the same model without logarithms. Interpret the output of the model. Does it make sense from a practical perspective?
library(tidyverse)
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
The linear regression model has many extensions. The one most relevant for this lecture is the generalized linear model, to which we now turn.
This section covers the generalized linear model (GLM). It was introduced in 1972 by Nelder & Wedderburn [2].
The model equation of the GLM is g(E(Y)) = \beta_0 + \beta_1 X_1 + \dots + \beta_m X_m, or, equivalently, E(Y) = g^{-1}(\beta_0 + \beta_1 X_1 + \dots + \beta_m X_m), where Y, conditional on the covariates, follows a distribution from the so-called exponential dispersion family and g is a monotone transformation called the link function.
Thus, a GLM has three components:

- a linear predictor \beta_0 + \beta_1 X_1 + \dots + \beta_m X_m,
- a link function g connecting the linear predictor to the expected response E(Y), and
- a distribution of Y (conditional on the covariates) from the exponential dispersion family.
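In R, the three components map directly to a glm() call. A schematic example (the specific formula and family are purely illustrative, not a modeling recommendation):

library(ggplot2)

# linear predictor: the model formula
# distribution:     the 'family' argument
# link function g:  the 'link' argument inside the family
fit <- glm(price ~ carat, data = diamonds, family = Gamma(link = "log"))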
The following table lists some of the most commonly used GLMs.
| Regression | Distribution | Range of Y | Natural link | Unit deviance |
|---|---|---|---|---|
| Linear | Normal | (-\infty, \infty) | Identity | (y - \hat y)^2 |
| Logistic | Binary | \{0, 1\} | logit | -2(y\log(\hat y) + (1-y) \log(1-\hat y)) |
| Poisson | Poisson | [0, \infty) | log | 2(y \log(y / \hat y) - (y - \hat y)) |
| Gamma | Gamma | (0, \infty) | 1/x (typical: log) | 2((y - \hat y) / \hat y - \log(y / \hat y)) |
| Multinomial | Multinomial | \{C_1, \dots, C_m\} | mlogit | -2\sum_{j = 1}^m 1(y = C_j)\log(\hat y_j) |
The normal linear model allows us to model E(Y) by an additive linear function. In principle, this would also work for other types of response, such as claim counts, binary events, or positive claim amounts. However, in such cases, an additive linear model equation is usually not very realistic, so the main assumption of the linear regression model is violated: predictions can leave the natural range of Y (think of negative counts or probabilities outside [0, 1]).

GLMs deal with such problems by using a suitable link function like the logarithm. At least for the first two examples, this could not be achieved by a linear regression with log response because log(0) is not defined.
Further advantages of the GLM over the linear regression are that predictions are free of back-transformation bias and that the response distribution can be chosen to match the data type.
The interpretation of model coefficients in GLMs is guided by the link function.
Parameters of a GLM are estimated by Maximum-Likelihood. This amounts to minimizing the (total) deviance, which equals the sum of the unit deviances over the training data (possibly weighted by case weights). For the normal linear model, the total deviance equals n times the MSE. In fact, the total deviance plays the same role for GLMs as the MSE does for linear regression. Consequently, it is sometimes useful to consider, as a relative performance measure, the relative deviance improvement compared to an intercept-only model. For the normal linear regression model, this Pseudo-R-squared corresponds to the usual R-squared.
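As a sketch (the function name is ours), the Pseudo-R-squared can be computed from any fitted glm object, since both the residual and the null deviance are stored in the fit. Applied to the Poisson model further below, it gives (26768 - 26629) / 26768 \approx 0.005.

# Relative deviance improvement over the intercept-only model
pseudo_r2 <- function(fit) {
  1 - fit$deviance / fit$null.deviance
}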
Outlook: The loss functions used in the context of GLMs are used one-to-one as loss functions for other ML methods such as gradient boosting or neural networks. There, the “appropriate” loss function is chosen from the context. For example, if the response is binary, one usually chooses the binary cross-entropy as the objective function.
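For instance, the unit deviance of the logistic case in the table above equals twice the binary cross-entropy. A minimal implementation (ours) looks like this:

# Average logistic deviance = 2 * binary cross-entropy
# y: 0/1 response, pred: predicted probabilities in (0, 1)
deviance_logistic <- function(y, pred) {
  mean(-2 * (y * log(pred) + (1 - y) * log(1 - pred)))
}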
We now model the claim counts in the dataCar data (from the insuranceData package) by a Poisson GLM with its natural link function, the log. It makes sure that we can interpret covariate effects on a relative scale and that predictions are positive.
library(ggplot2)
library(insuranceData)
data(dataCar)
summary(dataCar)
## veh_value exposure clm numclaims
## Min. : 0.000 Min. :0.002738 Min. :0.00000 Min. :0.00000
## 1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 1.500 Median :0.446270 Median :0.00000 Median :0.00000
## Mean : 1.777 Mean :0.468651 Mean :0.06814 Mean :0.07276
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :34.560 Max. :0.999316 Max. :1.00000 Max. :4.00000
##
## claimcst0 veh_body veh_age gender area
## Min. : 0.0 SEDAN :22233 Min. :1.000 F:38603 A:16312
## 1st Qu.: 0.0 HBACK :18915 1st Qu.:2.000 M:29253 B:13341
## Median : 0.0 STNWG :16261 Median :3.000 C:20540
## Mean : 137.3 UTE : 4586 Mean :2.674 D: 8173
## 3rd Qu.: 0.0 TRUCK : 1750 3rd Qu.:4.000 E: 5912
## Max. :55922.1 HDTOP : 1579 Max. :4.000 F: 3578
## (Other): 2532
## agecat X_OBSTAT_
## Min. :1.000 01101 0 0 0:67856
## 1st Qu.:2.000
## Median :3.000
## Mean :3.485
## 3rd Qu.:5.000
## Max. :6.000
##
# Distribution of the claim count
ggplot(dataCar, aes(x = numclaims)) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'numclaims'")
fit <- glm(
numclaims ~ veh_value + veh_body + veh_age + gender + area + agecat,
data = dataCar,
family = poisson(link = "log")
)
summary(fit)
##
## Call:
## glm(formula = numclaims ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = poisson(link = "log"), data = dataCar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.373692 0.328875 -4.177 2.95e-05 ***
## veh_value 0.046388 0.016394 2.830 0.004660 **
## veh_bodyCONVT -2.049451 0.669159 -3.063 0.002193 **
## veh_bodyCOUPE -0.780776 0.337177 -2.316 0.020579 *
## veh_bodyHBACK -1.048694 0.318527 -3.292 0.000994 ***
## veh_bodyHDTOP -0.891602 0.327733 -2.721 0.006518 **
## veh_bodyMCARA -0.514153 0.409172 -1.257 0.208909
## veh_bodyMIBUS -1.179985 0.350059 -3.371 0.000749 ***
## veh_bodyPANVN -0.809562 0.339102 -2.387 0.016969 *
## veh_bodyRDSTR -0.793357 0.659848 -1.202 0.229235
## veh_bodySEDAN -1.010881 0.317897 -3.180 0.001473 **
## veh_bodySTNWG -1.030325 0.317876 -3.241 0.001190 **
## veh_bodyTRUCK -1.035543 0.328328 -3.154 0.001611 **
## veh_bodyUTE -1.246375 0.322009 -3.871 0.000109 ***
## veh_age -0.012071 0.017616 -0.685 0.493202
## genderM -0.015962 0.030046 -0.531 0.595238
## areaB 0.062837 0.042795 1.468 0.142015
## areaC 0.008432 0.038990 0.216 0.828787
## areaD -0.111103 0.052961 -2.098 0.035918 *
## areaE -0.028520 0.057864 -0.493 0.622095
## areaF 0.101640 0.066141 1.537 0.124361
## agecat -0.078076 0.010238 -7.626 2.42e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 26768 on 67855 degrees of freedom
## Residual deviance: 26629 on 67834 degrees of freedom
## AIC: 36108
##
## Number of Fisher Scoring iterations: 6
# Bias on frequency scale?
mean(predict(fit, type = "response")) / mean(dataCar$numclaims) - 1
## [1] 6.017409e-14
Comments: An additional unit of veh_value (10’000 USD) increases the log of the expected claim count by 0.046. On the original count scale, this is an increase of approximately 4.6%. The exact effect is e^{0.046388}-1 = 0.047 = 4.7\%.
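The exact multiplicative effect can also be read off the fitted model directly (using the fit object from above):

# Exact relative effect of one unit of veh_value on the expected count
exp(coef(fit)["veh_value"]) - 1  # ~ 0.047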
In order to illustrate logistic regression, we now model the binary variable clm (claim yes = 1 / no = 0) of the claims data. The logit link ensures that predicted probabilities are between 0 and 1 and that covariates act in a multiplicative way on the odds of having a claim.
library(ggplot2)
library(insuranceData)
data(dataCar)
# Distribution of the claim indicator
ggplot(dataCar, aes(x = factor(clm))) +
geom_bar(fill = "chartreuse4") +
ggtitle("Distribution of 'clm'")
fit <- glm(
clm ~ veh_value + veh_body + veh_age + gender + area + agecat,
data = dataCar,
family = binomial(link = "logit")
)
summary(fit)
##
## Call:
## glm(formula = clm ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = binomial(link = "logit"), data = dataCar)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.27962 0.38325 -3.339 0.000841 ***
## veh_value 0.05098 0.01787 2.853 0.004334 **
## veh_bodyCONVT -2.14392 0.70853 -3.026 0.002479 **
## veh_bodyCOUPE -0.89535 0.39235 -2.282 0.022486 *
## veh_bodyHBACK -1.13441 0.37287 -3.042 0.002347 **
## veh_bodyHDTOP -0.95742 0.38191 -2.507 0.012177 *
## veh_bodyMCARA -0.57311 0.46768 -1.225 0.220413
## veh_bodyMIBUS -1.27231 0.40312 -3.156 0.001599 **
## veh_bodyPANVN -0.92452 0.39409 -2.346 0.018978 *
## veh_bodyRDSTR -1.23888 0.82480 -1.502 0.133091
## veh_bodySEDAN -1.12596 0.37228 -3.025 0.002490 **
## veh_bodySTNWG -1.12698 0.37229 -3.027 0.002468 **
## veh_bodyTRUCK -1.15426 0.38272 -3.016 0.002562 **
## veh_bodyUTE -1.35734 0.37622 -3.608 0.000309 ***
## veh_age -0.01270 0.01896 -0.670 0.502834
## genderM -0.01048 0.03218 -0.326 0.744688
## areaB 0.09862 0.04598 2.145 0.031962 *
## areaC 0.04015 0.04189 0.959 0.337794
## areaD -0.08712 0.05653 -1.541 0.123285
## areaE -0.01270 0.06211 -0.204 0.837982
## areaF 0.10581 0.07182 1.473 0.140658
## agecat -0.08312 0.01097 -7.580 3.45e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33767 on 67855 degrees of freedom
## Residual deviance: 33632 on 67834 degrees of freedom
## AIC: 33676
##
## Number of Fisher Scoring iterations: 5
Comments: An additional unit of veh_value (10’000 USD) increases the log-odds of having a claim by 0.051. On the odds scale, this is an increase of approximately 5.1%. The exact calculation is e^{0.05098}-1 = 0.052 = 5.2\%. Thus, for every additional 10’000 USD of vehicle value, the odds of having a claim increase by 5.2%.

Fit a Gamma regression with log-link to explain diamond prices by log(carat), color, cut, and clarity. Compare the coefficients with those from the corresponding linear regression with log(price) as response. Use dummy coding for the three categorical variables. Calculate the relative bias of the predictions on the US dollar scale.
library(tidyverse)
# Turn ordered factors into unordered ones
diamonds <- mutate_if(diamonds, is.ordered, factor, ordered = FALSE)
In this first chapter, we have introduced basic notation. Then, we have revisited multiple linear regression and some of its many aspects. To round off the chapter, we met an important generalization of linear regression, namely the generalized linear model (GLM). It includes binary logistic regression and Poisson count regression as relevant special cases.
[1] A. Legendre, “Nouvelles méthodes pour la détermination des orbites des comètes”, 1805.
[2] J. A. Nelder and R. W. M. Wedderburn, “Generalized Linear Models”. Journal of the Royal Statistical Society. Series A, Vol. 135, No. 3, 1972.