1 Introduction

In this chapter, we dive into artificial neural networks, one of the main drivers of artificial intelligence.

Neural networks have been around for many decades. Perhaps the first such model was built by Marvin Minsky in 1951. He called his algorithm SNARC (“stochastic neural-analog reinforcement calculator”). Since then, neural networks have gone through several stages of development. One milestone was the idea in Werbos (1974) to efficiently calculate gradients in the optimization algorithm by an approach called “backpropagation”. Another milestone was the use of GPUs (graphics processing units) to greatly reduce computation time.

Artificial neural nets are extremely versatile and powerful. They can be used to

  1. fit simple models like GLMs,
  2. learn interactions and non-linear effects in an automatic way (like tree-based methods),
  3. optimize general loss functions,
  4. fit data much larger than RAM (e.g. images),
  5. learn “online” (update the model with additional data),
  6. fit multiple response variables at the same time,
  7. model input of dimension higher than two (e.g. images, videos),
  8. combine inputs of different dimensions (e.g. text and images),
  9. fit data with sequential structure in both in- and output (e.g. a text translator),
  10. model data with spatial structure (images),
  11. fit models with many millions of parameters,
  12. do non-linear dimension reduction,
  13. and many more.

In this chapter, we will mainly deal with the first three aspects. Since a lot of new terms are being used, a small glossary can be found in Section “Neural Network Slang”.

2 Understanding Neural Nets

To learn how and why neural networks work, we will go through three steps - each illustrated on the diamonds data:

  • Step 1: Linear regression as neural net
  • Step 2: Hidden layers
  • Step 3: Activation functions

After this, we will be ready to build more complex models.

2.1 Step 1: Linear regression as neural net

Let us revisit the simple linear regression \mathbb E(\text{price} \mid \text{carat}) = \alpha + \beta \cdot \text{carat} calculated on the full diamonds data. In Chapter 4 we have found the solution \hat\alpha = -2256.36 and \hat \beta = 7756.43 by linear least-squares.

The above situation can be viewed as a neural network with

  • an input layer with two nodes (carat and the intercept called “bias unit” with value 1),
  • a “fully connected” (= “dense”) output layer with one node (price). Fully connected means that each node of a layer is a linear function of all node values of the previous layer. Each linear function has parameters or weights to be estimated, in our simple case just \alpha and \beta.

Visualized as a graph, the situation looks as follows:

Some of the figures were created with this cool webtool.

To gain confidence in neural networks, we first show that the parameters estimated by a neural network are quite similar to those learned by the linear least squares method. We will use Google’s TensorFlow with its flexible functional Keras interface. A great book on Keras (and neural networks in general) is Chollet (2018).

2.1.1 Example: Simple linear regression

library(ggplot2)
library(keras)

# Input layer: we have 1 covariate
input <- layer_input(shape = 1)

# Output layer connected to the input layer
output <- input |> 
  layer_dense(units = 1)

# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
# summary(nn)

nn |> 
  compile(
    optimizer = optimizer_adam(learning_rate = 1),
    loss = "mse",
    metrics = metric_root_mean_squared_error()
  )

# Fit model - naive without validation
history <- nn |> 
  fit(
    x = diamonds$carat,
    y = diamonds$price,
    epochs = 30,
    batch_size = 100
  )
## Epoch 1/30
## 540/540 - 1s - loss: 27096620.0000 - root_mean_squared_error: 5205.4414 - 1s/epoch - 2ms/step
## Epoch 2/30
## 540/540 - 1s - loss: 20241934.0000 - root_mean_squared_error: 4499.1035 - 896ms/epoch - 2ms/step
## Epoch 3/30
## 540/540 - 1s - loss: 15495953.0000 - root_mean_squared_error: 3936.4900 - 859ms/epoch - 2ms/step
## Epoch 4/30
## 540/540 - 1s - loss: 12341229.0000 - root_mean_squared_error: 3513.0085 - 859ms/epoch - 2ms/step
## Epoch 5/30
## 540/540 - 1s - loss: 10354460.0000 - root_mean_squared_error: 3217.8347 - 867ms/epoch - 2ms/step
## Epoch 6/30
## 540/540 - 1s - loss: 9135827.0000 - root_mean_squared_error: 3022.5530 - 803ms/epoch - 1ms/step
## Epoch 7/30
## 540/540 - 1s - loss: 8336326.0000 - root_mean_squared_error: 2887.2695 - 853ms/epoch - 2ms/step
## Epoch 8/30
## 540/540 - 1s - loss: 7699245.5000 - root_mean_squared_error: 2774.7515 - 848ms/epoch - 2ms/step
## Epoch 9/30
## 540/540 - 1s - loss: 7087159.5000 - root_mean_squared_error: 2662.1719 - 916ms/epoch - 2ms/step
## Epoch 10/30
## 540/540 - 1s - loss: 6467482.0000 - root_mean_squared_error: 2543.1245 - 812ms/epoch - 2ms/step
## Epoch 11/30
## 540/540 - 1s - loss: 5853131.0000 - root_mean_squared_error: 2419.3245 - 840ms/epoch - 2ms/step
## Epoch 12/30
## 540/540 - 1s - loss: 5266740.5000 - root_mean_squared_error: 2294.9380 - 798ms/epoch - 1ms/step
## Epoch 13/30
## 540/540 - 1s - loss: 4726106.5000 - root_mean_squared_error: 2173.9609 - 933ms/epoch - 2ms/step
## Epoch 14/30
## 540/540 - 1s - loss: 4240382.0000 - root_mean_squared_error: 2059.2188 - 917ms/epoch - 2ms/step
## Epoch 15/30
## 540/540 - 1s - loss: 3815855.7500 - root_mean_squared_error: 1953.4215 - 882ms/epoch - 2ms/step
## Epoch 16/30
## 540/540 - 1s - loss: 3457884.2500 - root_mean_squared_error: 1859.5387 - 899ms/epoch - 2ms/step
## Epoch 17/30
## 540/540 - 1s - loss: 3165114.2500 - root_mean_squared_error: 1779.0768 - 867ms/epoch - 2ms/step
## Epoch 18/30
## 540/540 - 1s - loss: 2933515.0000 - root_mean_squared_error: 1712.7507 - 792ms/epoch - 1ms/step
## Epoch 19/30
## 540/540 - 1s - loss: 2757288.0000 - root_mean_squared_error: 1660.5083 - 814ms/epoch - 2ms/step
## Epoch 20/30
## 540/540 - 1s - loss: 2628982.5000 - root_mean_squared_error: 1621.4137 - 837ms/epoch - 2ms/step
## Epoch 21/30
## 540/540 - 1s - loss: 2540100.0000 - root_mean_squared_error: 1593.7692 - 853ms/epoch - 2ms/step
## Epoch 22/30
## 540/540 - 1s - loss: 2481616.7500 - root_mean_squared_error: 1575.3148 - 931ms/epoch - 2ms/step
## Epoch 23/30
## 540/540 - 1s - loss: 2445467.0000 - root_mean_squared_error: 1563.7990 - 873ms/epoch - 2ms/step
## Epoch 24/30
## 540/540 - 1s - loss: 2424014.0000 - root_mean_squared_error: 1556.9246 - 831ms/epoch - 2ms/step
## Epoch 25/30
## 540/540 - 1s - loss: 2411806.2500 - root_mean_squared_error: 1552.9991 - 840ms/epoch - 2ms/step
## Epoch 26/30
## 540/540 - 1s - loss: 2405250.7500 - root_mean_squared_error: 1550.8871 - 943ms/epoch - 2ms/step
## Epoch 27/30
## 540/540 - 1s - loss: 2401943.2500 - root_mean_squared_error: 1549.8204 - 910ms/epoch - 2ms/step
## Epoch 28/30
## 540/540 - 1s - loss: 2400009.0000 - root_mean_squared_error: 1549.1963 - 875ms/epoch - 2ms/step
## Epoch 29/30
## 540/540 - 1s - loss: 2399141.0000 - root_mean_squared_error: 1548.9160 - 840ms/epoch - 2ms/step
## Epoch 30/30
## 540/540 - 1s - loss: 2398710.5000 - root_mean_squared_error: 1548.7771 - 805ms/epoch - 1ms/step
plot(history, metrics = "root_mean_squared_error")

unlist(get_weights(nn))
## [1]  7723.831 -2211.618
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) |> 
  transform(price = predict(nn, carat, verbose = 0)) |> 
ggplot(aes(x = carat, y = price)) +
  geom_line(color = "chartreuse4") +
  geom_point(color = "chartreuse4")

Comment: The solution of the simple neural network is indeed quite similar to the OLS solution.
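
For reference, the least-squares solution cited above can be reproduced with lm(); this quick check is not part of the network code:

# Sanity check (not part of the network code): OLS coefficients for comparison
coef(lm(price ~ carat, data = diamonds))
# (Chapter 4: alpha = -2256.36, beta = 7756.43)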

2.1.2 The optimization algorithm

Neural nets are typically fitted by mini-batch gradient descent, using backpropagation to efficiently calculate gradients. It works as follows:

Let f_\beta denote a neural net with parameters \beta, and Q(f_\beta, D) = \sum_{(y_i, \boldsymbol x_i) \in D} L(y_i, f_\beta(\boldsymbol x_i)) its total loss on a data set D with respect to the loss function L.

  1. Initialize the parameter vector \beta with random values \hat \beta.
  2. Forward step: Calculate Q(f_{\hat\beta}, D_\text{batch}) on a batch D_\text{batch} of observations. This is a small subset of the training data.
  3. Backpropagation step: Modify \hat \beta to improve Q(f_{\hat\beta}, D_\text{batch}) by gradient descent: Calculate the vector of partial derivatives \nabla \hat \beta = \frac{\partial Q(f_\beta, D_\text{batch})}{\partial \beta}\Big|_{\beta = \hat \beta} at the current estimates \hat \beta. Use it to update \hat \beta \leftarrow \hat \beta - \lambda \nabla \hat \beta, where \lambda > 0 is a sufficiently small learning rate. In neural nets, parameters are organized in multiple layers, which makes it difficult to calculate \nabla \hat\beta. This is where backpropagation enters the game: it calculates the partial derivatives layer by layer using the chain rule, starting from the output layer.
  4. Repeat Steps 2 and 3 until each observation has appeared in a batch. This is called an epoch.
  5. Repeat Step 4 for multiple epochs until some stopping criterion triggers.

Gradient descent on batches of size 1 is called “stochastic gradient descent” (SGD). Further note that there is no guarantee that the algorithm reaches a global minimum.
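
To make these steps concrete, here is a minimal sketch of mini-batch gradient descent for the simple linear regression above, written in plain R. The learning rate, batch size, and number of epochs are illustrative choices; for a single linear layer, the gradient can be written down by hand, so no backpropagation is needed.

# Mini-batch gradient descent for E(price | carat) = alpha + beta * carat,
# using the MSE as loss. Illustrative sketch, not the Keras implementation.
library(ggplot2)  # provides the diamonds data

x <- diamonds$carat
y <- diamonds$price
n <- length(y)

beta <- c(0, 0)    # (intercept, slope), naive initialization
lr <- 0.1          # learning rate
batch_size <- 100

for (epoch in 1:30) {
  ix <- sample(n)  # reshuffle the rows in every epoch
  for (batch in split(ix, ceiling(seq_along(ix) / batch_size))) {
    res <- beta[1] + beta[2] * x[batch] - y[batch]  # residuals on the batch
    grad <- 2 * c(mean(res), mean(res * x[batch]))  # gradient of the batch MSE
    beta <- beta - lr * grad                        # gradient descent update
  }
}
beta  # roughly comparable to the OLS solution (-2256.36, 7756.43)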

2.2 Step 2: Hidden layers

Our first neural network above consisted only of an input layer and an output layer. By adding one or more hidden layers between in- and output, the network gains additional parameters and thus more flexibility. The nodes of a hidden layer can be viewed as latent variables representing the original covariates. The node values of a hidden layer are sometimes called an encoding. The closer a layer is to the output, the better suited its nodes are to predict the response variable. In this way, a neural network finds suitable transformations and interactions of its covariates automatically. The only ingredients are a large data set and a flexible enough network “architecture” (number of layers, nodes per layer).

Neural nets with more than one hidden layer are called “deep neural nets”.

We will now add a hidden layer with five nodes v_1, \dots, v_5 to our simple linear regression network. The architecture looks as follows:

This network has 16 parameters: 1 \cdot 5 + 5 = 10 for the hidden layer (weights plus bias units) and 5 \cdot 1 + 1 = 6 for the output layer. How much better will it be than our simple network with just two parameters?

2.2.1 Example

The following code is identical to the last one up to one extra line of code specifying the hidden layer.

library(ggplot2)
library(keras)

# Input layer: we have 1 covariate
input <- layer_input(shape = 1)

# One hidden layer
output <- input |>
  layer_dense(units = 5) |>  # the only new line of code!
  layer_dense(units = 1)

# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
# summary(nn)

nn |> 
  compile(
    optimizer = optimizer_adam(learning_rate = 1),
    loss = "mse",
    metrics = metric_root_mean_squared_error()
  )

# Fit model - naive without validation
nn |> 
  fit(
    x = diamonds$carat,
    y = diamonds$price,
    epochs = 30,
    batch_size = 100, 
    verbose = 0
  )

# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) |> 
  transform(price = predict(nn, carat, verbose = 0)) |> 
ggplot(aes(x = carat, y = price)) +
  geom_line(color = "chartreuse4") +
  geom_point(color = "chartreuse4")

Comment: Oops, it seems as if the extra hidden layer had no effect. The reason is that a linear function of a linear function is still a linear function. Adding the hidden layer did not really change the capabilities of the model. It just added a lot of unnecessary parameters.

2.3 Step 3: Activation functions

The missing magic component is the so-called activation function \sigma after each layer, which transforms the values of the nodes. So far, we have implicitly used “linear activations”, which - in neural network slang - is just the identity function.

Non-linear activation functions applied after hidden layers serve to introduce non-linear and interaction effects. Typical choices are

  • the hyperbolic tangent \sigma(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} (“S”-shaped function that maps real values to [-1, 1]),
  • the standard logistic function (“sigmoid”) \sigma(x) = 1 / (1 + e^{-x}) (“S”-shaped function that maps real values to [0, 1], shifted and scaled hyperbolic tangent),
  • the rectified linear unit (ReLU) \sigma(x) = \text{max}(0, x), which sets negative values to 0.
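
These activation functions are easy to write down in base R. Below is a minimal sketch (not part of the original text) that evaluates them on a grid and numerically confirms that the sigmoid is a shifted and scaled hyperbolic tangent:

# The three activations above, in base R
tanh_act    <- function(x) tanh(x)            # maps to [-1, 1]
sigmoid_act <- function(x) 1 / (1 + exp(-x))  # maps to [0, 1]
relu_act    <- function(x) pmax(0, x)         # sets negative values to 0

x <- seq(-4, 4, by = 0.5)
head(cbind(x, tanh = tanh_act(x), sigmoid = sigmoid_act(x), relu = relu_act(x)))

# Sigmoid as shifted and scaled hyperbolic tangent
all.equal(sigmoid_act(x), (tanh_act(x / 2) + 1) / 2)  # TRUE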

The activation function applied to the output layer has a different purpose: it plays the role of the inverse link function of a corresponding GLM, mapping predictions to the scale of the response:

  • identity/“linear” activation \rightarrow usual regression
  • logistic activation \rightarrow binary logistic regression (one probability)
  • softmax activation \rightarrow multinomial logistic regression (one probability per class)
  • exponential activation \rightarrow log-linear regression as with Poisson or Gamma regression

Let us add a hyperbolic tangent activation function (\sigma) after the hidden layer of our simple example.

2.3.1 Example

Again, the code is very similar to the last one, with the exception of using a hyperbolic tangent activation after the hidden layer (and different learning rate and number of epochs).

library(ggplot2)
library(keras)

# Input layer: we have 1 covariate
input <- layer_input(shape = 1)

# One hidden layer
output <- input |>
  layer_dense(units = 5, activation = "tanh") |> 
  layer_dense(units = 1, activation = "linear")

# Create and compile model
nn <- keras_model(inputs = input, outputs = output)

nn |> 
  compile(
    optimizer = optimizer_adam(learning_rate = 0.2),
    loss = "mse",
    metrics = metric_root_mean_squared_error()
  )

# Fit model - naive without validation
nn |> 
  fit(
    x = diamonds$carat,
    y = diamonds$price,
    epochs = 50,
    batch_size = 100, 
    verbose = 0
  )

# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) |> 
  transform(price = predict(nn, carat, verbose = 0)) |> 
ggplot(aes(x = carat, y = price)) +
  geom_line(color = "chartreuse4") +
  geom_point(color = "chartreuse4")

Comment: Adding the non-linear activation after the hidden layer has changed the model. The effect of carat now represents the association between carat and price by a non-linear function.

3 Practical Considerations

3.1 Validation and tuning of main parameters

So far, we have naively fitted the neural networks without splitting the data for test and validation. Don’t do this! Usually, one sets a small test dataset (e.g. 10% of rows) aside to assess the final model performance and uses simple (or cross-)validation for model tuning.

In order to choose the main tuning parameters, namely

  • network architecture,
  • activation functions,
  • learning rate,
  • batch size, and
  • number of epochs,

one often uses simple validation because cross-validation takes too much time.

3.2 Missing values

A neural net does not accept missing values in the input. They need to be filled, e.g., by a typical value or a value below the minimum.
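
A minimal sketch of such a fill-in step (the helper function and its default value are illustrative, not from the original text):

# Replace missing values by a constant below the observed minimum
fill_na <- function(x, value = min(x, na.rm = TRUE) - 1) {
  x[is.na(x)] <- value
  x
}

fill_na(c(1.2, NA, 0.7, 3.0))  # 1.2 -0.3 0.7 3.0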

3.3 Input standardization

Gradient descent starts from randomly initialized parameters. The default initialization schemes work best with standardized input, and standardization has to be done manually by either

  • min/max scaling the values of each input to the range [-1, 1],
  • standard scaling the values of each input to mean 0 and standard deviation 1, or
  • using relative ranks.

Note that the scaling transformation is calculated on the training data and then applied to the validation and test data. This usually requires a couple of lines of code.
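
A minimal sketch of this pattern with hypothetical numeric input matrices; the prep_nn() functions in the examples below use the same logic:

# Scaling statistics from the training data, applied to train and test alike
X_train <- matrix(rnorm(200), ncol = 2)  # hypothetical training input
X_test  <- matrix(rnorm(100), ncol = 2)  # hypothetical test input

temp <- scale(X_train)
sc <- list(
  center = attr(temp, "scaled:center"),
  scale = attr(temp, "scaled:scale")
)

X_train_scaled <- scale(X_train, center = sc$center, scale = sc$scale)
X_test_scaled  <- scale(X_test, center = sc$center, scale = sc$scale)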

3.4 Categorical input

There are three common ways to represent categorical input variables in a neural network.

  1. Binary and ordinal categoricals are best represented by integers and then treated as numeric.
  2. Unordered categoricals are either one-hot-encoded (i.e., each category is represented by a binary variable) or
  3. they are represented by a (categorical) embedding. To do so, the K categories are integer encoded and then condensed by a special embedding layer to m \ll K dense features. This requires a more complex network architecture but saves memory and preprocessing. This approach is heavily used when the input consists of words (a categorical variable with thousands of levels - one level per word). The embedding is represented by a fully parametrized (K \times m) matrix \beta estimated together with the other network parameters. Instead of representing X by K dummy variables \tilde X and then computing \tilde X \beta, the matrix multiplication is done implicitly by index slicing. For instance, if X_1 = j (the value of X for the first observation), then the first row of \tilde X \beta equals the j-th row of \beta.

For Option 2, input standardization is not required; for Option 3, it must not be applied, as the embedding layer expects integers.
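
A minimal sketch of the first two options on a toy data frame (Option 3, the embedding, is demonstrated in the taxi example below):

# Toy data with an ordinal and an unordered categorical
dat <- data.frame(
  clarity = factor(c("SI1", "VS2", "SI1"), levels = c("SI1", "VS2", "VVS1")),
  color = factor(c("D", "G", "E"))
)

# Option 1: ordinal categorical -> integer codes, then treated as numeric
as.integer(dat$clarity)  # 1 2 1

# Option 2: unordered categorical -> one-hot encoding (one dummy per category)
model.matrix(~ color - 1, data = dat)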

3.5 Callbacks

Sometimes, we want to take actions during training, such as

  • stop training when validation performance starts worsening,
  • reduce the learning rate when the optimization is stuck in a “plateau”, or
  • save the network weights between epochs.

Such monitoring tasks are called callbacks. We will see them in the example below.

3.6 Types of layers

So far, we have encountered only dense (= fully connected) layers and activation layers. Here are some further types:

  • Embedding layers to represent integer encoded categoricals.
  • Dropout layers to add regularization.
  • Convolutional and pooling layers for image data.
  • Recurrent layers (long-short-term memory LSTM, gated recurrent unit GRU) for sequence data.
  • Concatenation layers to combine different branches of the network (like in a directed graph).
  • Flatten layers to bring higher dimensional layers to dimension 1 (relevant, e.g., for embeddings, image and text data).

3.7 Optimizer

Pure gradient descent is rarely applied without tweaks because it tends to get stuck in local minima, especially in complex networks with many layers. Modern variants are “Adam”, “Nadam”, and “RMSProp”. These optimizers usually work out of the box, except for the learning rate, which has to be chosen manually.
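
The optimizer is selected in compile(); switching between the variants changes only one argument. The learning rates below are illustrative and assume a model nn as defined earlier:

# Choosing the optimizer in compile()
nn |> compile(optimizer = optimizer_adam(learning_rate = 0.05), loss = "mse")

# Alternatives with the same interface:
# optimizer_rmsprop(learning_rate = 0.01)
# optimizer_nadam(learning_rate = 0.01)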

3.8 Custom losses and evaluation metrics

Frameworks like Keras/TensorFlow offer many predefined loss functions and evaluation metrics. Choosing them is a crucial step, just as with tree boosting. Using TensorFlow’s backend functions, one can define own metrics and loss functions (see exercises).
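
As a small illustration (not the exercise solution), a custom loss written with the k_*() backend functions can be passed directly to compile(). Here a mean squared log error, assuming a strictly positive response and positive predictions (e.g., via an exponential output activation):

# Custom loss: mean squared log error, built from TensorFlow backend functions
loss_msle <- function(y_true, y_pred) {
  k_mean(k_square(k_log(y_true) - k_log(y_pred)))
}

# Usage sketch:
# nn |> compile(
#   optimizer = optimizer_adam(learning_rate = 0.01),
#   loss = loss_msle,
#   metrics = custom_metric("msle", loss_msle)
# )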

3.9 Overfitting and regularization

As with linear models, a model with too many parameters will overfit in an undesirable way. With about 50 to 100 observations per parameter, overfitting is usually not problematic. (Different rules apply to image and text data). Besides using fewer parameters, the main ways to reduce overfitting are as follows:

  • Pull the parameters of a layer slightly toward zero by applying L1 and/or L2 penalties to the objective function.
  • Add dropout layers. A dropout layer randomly sets some node values of the previous layer to 0, turning them off. Dropout is only applied during training. A short sketch of both options follows below.
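
Both options are specified layer by layer. A minimal sketch, reusing the architecture of the diamonds example below (penalty strength and dropout rate are illustrative, not tuned):

library(keras)

input <- layer_input(shape = 4)

output <- input |>
  layer_dense(
    units = 30, activation = "relu",
    kernel_regularizer = regularizer_l2(l = 0.001)  # L2 penalty on this layer's weights
  ) |>
  layer_dropout(rate = 0.2) |>  # randomly switches off 20% of the node values during training
  layer_dense(units = 15, activation = "relu") |>
  layer_dense(units = 1, activation = "linear")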

3.10 Choosing the architecture

How many layers and how many nodes per layer should be selected? For tabular data, up to three hidden layers are usually sufficient. If we start with p input variables, the number of nodes in the first hidden layer is usually higher than p and decreases in later layers. There should not be a “representational bottleneck”, i.e., an early hidden layer with too few nodes.

The number of parameters should not be too high compared to the number of rows, see “Overfitting and regularization” above.

3.11 Interpretation

Variable importance of covariates in neural networks can be assessed by permutation importance (see below) or SHAP importance (not covered). Covariate effects can be investigated, e.g., by partial dependence plots or SHAP dependence plots.

4 Example: Diamonds

We will now fit a neural net with two hidden layers (30 and 15 nodes) and a total of 631 parameters to model diamond prices. Learning rate, activation functions, and batch size were chosen manually by simple validation. The number of epochs is chosen automatically by an early-stopping callback.

library(ggplot2)
library(withr)
library(keras)
library(MetricsWeighted)
library(hstats)

diamonds <- transform(diamonds, log_price = log(price), log_carat = log(carat))

y <- "log_price"
xvars <- c("log_carat", "color", "cut", "clarity")

# Split into train and test
with_seed(
  9838, 
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

train <- diamonds[ix, ]
test <- diamonds[-ix, ]

X_train <- train[, xvars]
X_test <- test[, xvars]

y_train <- train[[y]]
y_test <- test[[y]]

# Standardization information using X_train
temp <- scale(data.matrix(X_train))
sc <- list(
  center = attr(temp, "scaled:center"), 
  scale = attr(temp, "scaled:scale")
)

# Function that maps data to scaled network input
prep_nn <- function(X, sel = xvars, scaling = sc) {
  X <- data.matrix(X[, sel, drop = FALSE])
  scale(X, center = scaling$center, scale = scaling$scale)
}

# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(499)

# Input layer: we have 4 covariates
input <- layer_input(shape = 4)

# Two hidden layers with contracting number of nodes
output <- input |>
  layer_dense(units = 30, activation = "relu") |> 
  layer_dense(units = 15, activation = "relu") |> 
  layer_dense(units = 1, activation = "linear")

# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
summary(nn)
## Model: "model"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  input_1 (InputLayer)               [(None, 4)]                     0           
##  dense_2 (Dense)                    (None, 30)                      150         
##  dense_1 (Dense)                    (None, 15)                      465         
##  dense (Dense)                      (None, 1)                       16          
## ================================================================================
## Total params: 631 (2.46 KB)
## Trainable params: 631 (2.46 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
nn |> 
  compile(
    optimizer = optimizer_adam(learning_rate = 0.05),
    loss = "mse",
    metrics = metric_root_mean_squared_error()
  )

# Callbacks
cb <- list(
  callback_early_stopping(patience = 20),
  callback_reduce_lr_on_plateau(patience = 5)
)
       
# Fit model
history <- nn |> 
  fit(
    x = prep_nn(X_train),
    y = y_train,
    epochs = 100,
    batch_size = 400, 
    validation_split = 0.2,
    callbacks = cb,
    verbose = 0
  )

plot(history, metrics = "root_mean_squared_error", smooth = FALSE) +
  coord_cartesian(xlim = c(2, length(history$metrics$loss)), ylim = c(0.1, 0.2))

# Interpret
pred_fun <- function(m, X) predict(m, prep_nn(X), batch_size = 1000, verbose = 0)

# Performance on validation data
pred <- pred_fun(nn, X_test)
rmse(y_test, pred)
## [1] 0.1184944
r_squared(y_test, pred, reference_mean = mean(y_train))
## [1] 0.9864616
# Variable importance?

# PDPs
for (v in xvars) {
  p <- partial_dep(nn, v = v, X = X_train, pred_fun = pred_fun) |> 
    plot(color = "chartreuse4") +
    ggtitle(paste("PDP for", v)) 
  print(p)
}

Comment: Performance is a bit lower than that of the tree-based models. This might partly be a consequence of the smoother effects, but also of the fact that, for simplicity, the model has not been refitted on the full training data (20% of the training rows are used for validation). Compared to our tree-based models, one important aspect of model interpretation is missing: variable importance.

5 Excursion: Permutation Importance

So far, we have measured variable importance by model-specific approaches such as split-gain importance for trees. A model-agnostic alternative is permutation importance, introduced in Breiman (2001) for random forests and later generalized by Fisher, Rudin, and Dominici (2018) to other models. It measures by how much a relevant performance measure S (e.g., the average deviance or the RMSE) of a fitted model \hat f, evaluated on a dataset D, worsens after randomly shuffling the values of the j-th feature. The larger the performance change, the more important the feature.

Formally, permutation importance \text{PVI}(j, D) of the j-th feature X^{(j)} and data D can be defined as follows: \text{PVI}(j, D) = S(\hat f, D^{(j)}) - S(\hat f, D), where D^{(j)} is a version of D with randomly permuted values in the column representing the j-th feature X^{(j)}. For simplicity, we assume that the performance measure S is defined such that smaller values mean better performance.

If there are p features and n is the sample size of D, then n(p+1) predictions have to be calculated. Compared to other methods from XAI, this is a relatively cheap operation. Note that the process can be repeated multiple times to increase (and assess) robustness of the results.
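
The definition translates almost directly into code. Here is a minimal sketch with the RMSE as performance measure S; the perm_importance() calls from the hstats package used in this chapter do essentially this, with optional repetitions:

# Minimal permutation importance for one feature v, using RMSE as measure S
perm_imp <- function(model, X, y, v, pred_fun = predict) {
  rmse <- function(y, pred) sqrt(mean((y - pred)^2))
  X_perm <- X
  X_perm[[v]] <- sample(X_perm[[v]])  # shuffle the values of feature v
  rmse(y, pred_fun(model, X_perm)) - rmse(y, pred_fun(model, X))
}

# Usage idea for the diamonds model above:
# sapply(xvars, function(v) perm_imp(nn, X_test, y_test, v, pred_fun = pred_fun))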

5.1 Remarks

  • During the process of calculating permutation importance, the model is never refitted.
  • Generally, variable importance measures struggle with strongly collinear (or even strongly causally dependent) features. The problem can usually be reduced by applying clever data transformations during feature construction, i.e., before modeling. Example: the age v of the driver and the “age” t of the driving license can often be decorrelated by considering v and v-t instead of the original features v and t. The use of such transformations contrasts with the bad habit of blindly throwing all available columns into the model.
  • Calculating permutation importance makes use of the response variable. In order to not influence the results by overfitting, it usually makes sense to evaluate it on a hold-out data set. Can you think of a situation where one would rather consider the in-sample version?
  • Different versions of permutation importance exist. For example, instead of studying absolute drops in performance, one could also use relative drops.

Most of the text of this section is taken from our responsible ML lecture.

5.2 Example: Diamonds (continued)

perm_importance(nn, X = X_test, y = y_test, pred_fun = pred_fun, verbose = FALSE) |> 
  plot(fill = "chartreuse4")

Comment: As in all our models on diamond prices, the dominating feature is the diamond size.

6 Example: Taxi

The last example on neural networks aims to show

  • how well such models perform on large data,
  • how fast neural nets are,
  • how flexible they are, and
  • how embedding layers are used to represent unordered categorical input.

To do so, we create a model for (log) taxi trip durations, using the same features as before. We use the same train/test split as with the corresponding boosted trees model. Slight tuning (mainly the learning rate and the architecture) is done manually. The number of epochs is chosen by an early stopping callback. Pick-up location is modeled by a ten-dimensional embedding layer.

library(arrow)
library(data.table)
library(withr)
library(ggplot2)
library(keras)
library(MetricsWeighted)
library(hstats)

df <- read_parquet("taxi/yellow_tripdata_2018-01.parquet")

setDT(df)
head(df)
#=======================================================================
# Prepare data for modeling
#=======================================================================

df[, duration := as.numeric(
    difftime(tpep_dropoff_datetime, tpep_pickup_datetime, units = "mins")
  )]
df <- df[between(trip_distance, 0.2, 100) & between(duration, 1, 120)]
df[, `:=`(
  pu_hour = factor(data.table::hour(tpep_pickup_datetime)),
  weekday = factor(data.table::wday(tpep_pickup_datetime)), # 1 = Sunday
  pu_loc = forcats::fct_lump_min(factor(PULocationID), 1e5),
  log_duration = log(duration),
  log_distance = log(trip_distance)
)]

y <- "log_duration"
x_dense <- c("log_distance", "weekday", "pu_hour")
xvars <- c(x_dense, "pu_loc")

# Random split
with_seed(1, 
  ix <- sample(nrow(df), 0.8 * nrow(df))
)
train <- as.data.frame(df[ix, c(y, xvars), with = FALSE])
test <- as.data.frame(df[-ix, c(y, xvars), with = FALSE])

y_train <- train[[y]]
y_test <- test[[y]]

X_train <- train[, xvars]
X_test <- test[, xvars]

# Standardization info using X_train
temp <- scale(data.matrix(X_train[, x_dense]))
sc <- list(
  center = attr(temp, "scaled:center"),
  scale = attr(temp, "scaled:scale")
)

# Function that maps data.frame to network input 
# (a list with a scaled matrix and an integer vector representing pu_loc)
prep_nn <- function(X, dense = x_dense, scaling = sc) {
  X_dense <- data.matrix(X[, dense, drop = FALSE])
  X_dense <- scale(X_dense, center = scaling$center, scale = scaling$scale)
  list(dense1 = X_dense, pu_loc = as.integer(X$pu_loc) - 1)
}

# Does prep_nn() really work?
prep_nn(head(test, 1))
## $dense1
##   log_distance weekday   pu_hour
## 1     0.478163  -1.031 -1.995948
## attr(,"scaled:center")
## log_distance      weekday      pu_hour 
##    0.5651471    4.0010674   14.7945473 
## attr(,"scaled:scale")
## log_distance      weekday      pu_hour 
##    0.8953113    1.9409001    6.4102621 
## 
## $pu_loc
## [1] 31
# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(469)

# Inputs
input_dense <- layer_input(shape = length(x_dense), name = "dense1")
input_pu_loc <- layer_input(shape = 1, name = "pu_loc")

# Embedding of pu_loc
emb_pu_loc <- input_pu_loc |> 
  layer_embedding(input_dim = nlevels(train$pu_loc) + 1, output_dim = 10) |> 
  layer_flatten()

# Combine dense input and embedding, and add dense layers
outputs <- list(input_dense, emb_pu_loc) |>
  layer_concatenate() |> 
  layer_dense(100, activation = "relu") |>
  layer_dense(10, activation = "relu") |> 
  layer_dense(1, activation = "linear")

# Input
inputs <- list(dense1 = input_dense, pu_loc = input_pu_loc)

# Create and compile model
nn <- keras_model(inputs = inputs, outputs = outputs)
summary(nn)
## Model: "model"
## ________________________________________________________________________________
##  Layer (type)           Output Shape            Param   Connected to            
##                                                  #                              
## ================================================================================
##  pu_loc (InputLayer)    [(None, 1)]             0       []                      
##  embedding (Embedding)  (None, 1, 10)           390     ['pu_loc[0][0]']        
##  dense1 (InputLayer)    [(None, 3)]             0       []                      
##  flatten (Flatten)      (None, 10)              0       ['embedding[0][0]']     
##  concatenate (Concaten  (None, 13)              0       ['dense1[0][0]',        
##  ate)                                                    'flatten[0][0]']       
##  dense_2 (Dense)        (None, 100)             1400    ['concatenate[0][0]']   
##  dense_1 (Dense)        (None, 10)              1010    ['dense_2[0][0]']       
##  dense (Dense)          (None, 1)               11      ['dense_1[0][0]']       
## ================================================================================
## Total params: 2811 (10.98 KB)
## Trainable params: 2811 (10.98 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
nn |> 
  compile(optimizer = optimizer_adam(learning_rate = 0.005), loss = "mse")

# Callbacks
cb <- list(
  callback_early_stopping(patience = 10),
  callback_reduce_lr_on_plateau(patience = 5)
)

# Fit model
if (FALSE) {  # Set to true to refit
  history <- nn |> 
    fit(
      x = prep_nn(X_train),
      y = y_train,
      epochs = 100,
      batch_size = 50000, 
      validation_split = 0.2,
      callbacks = cb,
      verbose = 0
    )
  plot(history, metrics = c("loss", "val_loss"), smooth = FALSE) +
    coord_cartesian(ylim = c(0, 0.15))
  save_model_weights_hdf5(nn, "taxi/taxi_nn.h5")
} else {
  load_model_weights_hdf5(nn, "taxi/taxi_nn.h5")
}

#=======================================================================
# Inspect
#=======================================================================

pred_fun <- function(m, X) predict(m, prep_nn(X), batch_size = 50000, verbose = 0)

# Performance on test data
pred <- pred_fun(nn, X_test)
rmse(y_test, pred)
## [1] 0.304573
r_squared(y_test, pred, reference_mean = mean(y_train))
## [1] 0.8195031
# Permutation importance
perm_importance(nn, X = X_test, y = y_test, pred_fun = pred_fun, verbose = FALSE) |> 
  plot(fill = "chartreuse4")

# PDPs
for (v in xvars) {
  p <- partial_dep(nn, v = v, X = X_train, pred_fun = pred_fun) |> 
    plot(color = "chartreuse4", rotate_x = (v == "pu_loc")) +
    ggtitle(paste("PDP for", v)) 
  print(p)
}

Comment: The resulting model performs quite similarly to the boosted trees model. As in the diamonds example above, we used an early-stopping callback on 20% of the training rows to automatically find a good number of epochs. And again, for simplicity, we did not refit the model on the full training data. Thus, the final model was trained on fewer rows than the LightGBM model.

7 Neural Network Slang

Here, we summarize some of the neural network slang.

  • Activation function: The transformation applied to the node values.
  • Architecture: The layout of layers and nodes.
  • Backpropagation: An efficient way to calculate gradients.
  • Batch: A small subset of data rows used for one mini-batch gradient descent step.
  • Callback: An action during training (save weights, reduce learning rate, stop training, …).
  • Epoch: The process of updating the network weights by gradient descent until each observation in the training set was used once.
  • Embedding: A numeric representation of categorical input as learned by the neural net.
  • Encoding: The values of latent variables of a hidden layer, usually the last.
  • Gradient descent: The basic optimization algorithm of neural networks.
  • Keras: User-friendly wrapper of TensorFlow.
  • Layer: Main organizational unit of a neural network.
  • Learning rate: Controls the step size of gradient descent, i.e., how aggressively the network learns.
  • Node: Nodes on the input layer are the covariates, nodes on the output layer are the response(s), and nodes on a hidden layer are latent variables representing the covariates for the task of predicting the response.
  • Optimizer: The specific variant of gradient descent.
  • PyTorch: An important implementation of neural networks.
  • Stochastic gradient descent (SGD): Mini-batch gradient descent with batches of size 1.
  • TensorFlow: An important implementation of neural networks.
  • Weights: The parameters of a neural net.

8 Excursion: Analysis Scheme X

Let T(Y) be a quantity of interest, e.g., an average diamond price, claims frequency, or the success probability of a medical intervention. Many interesting insights can be found by the following analysis scheme:

  1. Calculate T(Y) on the full data.
  2. Calculate T(Y) stratified by each covariate X^{(j)}. This will describe the bivariate associations between Y and the X^{(j)}. Continuous covariates need to be binned.
  3. Build an ML model T(Y \mid \boldsymbol x) \approx f(\boldsymbol x), using features X^{(1)}, \dots, X^{(p)}, and a clean validation strategy. It describes the multivariate association between Y and the covariates.
    1. Study model performance.
    2. Study variable importance and use them to sort the results of Step 2. Which of the associations are very strong/weak?
    3. For each feature, plot its main effect (e.g. partial or SHAP dependence plot). Use it to complement the marginal effects from Step 2: Are they quite similar or not? Ideally, search for strong interaction effects as well.

8.1 Example: Diamonds

Let’s apply the above scheme to diamond prices.

library(ggplot2)
library(ranger)
library(hstats)
library(withr)

# STEP 1
mean(diamonds$price)
## [1] 3932.8
# STEP 2
xvars <- c("carat", "color", "cut", "clarity")
for (v in xvars) {
  p <- ggplot(diamonds, aes(.data[[v]], price)) +
    stat_summary(fun = "mean", color = "chartreuse4") +
    if (!ggplot2:::is.discrete(diamonds[[v]])) scale_x_binned()
  print(p)
}

# STEP 3
with_seed(
  9838, 
  ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)

fit <- ranger(
  reformulate(xvars, "price"), 
  num.trees = 500,
  data = diamonds[ix, ], 
  seed = 83
)
fit
## Ranger result
## 
## Call:
##  ranger(reformulate(xvars, "price"), num.trees = 500, data = diamonds[ix,      ], seed = 83) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      43152 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       308191.3 
## R squared (OOB):                  0.9804707
# Importance wrt test MSE
imp <- perm_importance(
  fit, X = diamonds[-ix, xvars], y = diamonds$price[-ix], verbose = FALSE
)
plot(imp, fill = "chartreuse4")

# Add PDP to descriptive effect
for (v in xvars) {
  p <- partial_dep(fit, v = v, X = diamonds[ix, ]) |> 
    plot(color = "chartreuse4") +
    ggtitle(paste("PDP for", v))
  print(p)
}

Selected insights

  • Step 1: The average price is 3932.8 USD.
  • Step 2: The average price heavily depends on carat and the other factors. The associations from Step 2 are not intuitive, except for carat.
  • Step 3: The most important feature is carat, followed by clarity, color and cut. Conditional effects complement the marginal descriptive effects. They also explain some of the unintuitive descriptive effects. (Larger diamonds tend to be of reduced quality, introducing confounding.)

9 Exercises

  1. Explain why a neural network is conceptually closer to a GLM than to a random forest. Why are activation functions important?

  2. Fit a neural network to (non-logarithmic) diamond prices by minimizing Gamma deviance with log-link (-> exponential output activation). Use the custom loss function defined below. Tune the model by simple validation and evaluate it on a test dataset. Study test performance, permutation importance, and partial dependence plots. Hints: Play with activation functions. Furthermore, the response needs to be transformed from int to float for certain TensorFlow versions.

    loss_gamma <- function(y_true, y_pred) {
      -k_log(y_true / y_pred) + y_true / y_pred
    }
  3. (Optional) Craft a neural net for the “dataCar” dataset. The binary response is “clm”, and the features are “veh_value”, “veh_age”, “gender”, “area”, “agecat”, and “veh_body”. Represent the latter with a one-dimensional embedding layer. You can either write the code from scratch or modify above taxi trip code. Work with a clean train/test split. Study test performance, permutation importance, and partial dependence plots.

  4. Choose a data set of your interest and apply “Analysis Scheme X” to it.

10 Summary

In this chapter, we have glimpsed into the world of neural networks and deep learning. Step by step we have learned how a neural network works. We have used Keras and TensorFlow to build models brick by brick. Furthermore, we met a simple analysis scheme involving ML that can be applied to many analysis situations.

Here is a short comparison of some properties of the ML algorithms covered in this lecture (screenshot as of Sept. 7, 2020):

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Chollet, François. 2018. Deep Learning with R. Shelter Island, New York: Manning Publications.
Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “All Models Are Wrong, but Many Are Useful: Learning a Variable’s Importance by Studying an Entire Class of Prediction Models Simultaneously.” https://arxiv.org/abs/1801.01489.
Werbos, P. J. 1974. “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences.” PhD thesis, Harvard University.