In this chapter, we dive into artificial neural networks, one of the main drivers of artificial intelligence.
Neural networks have been around for many decades. Arguably the first such model was built by Marvin Minsky in 1951. He called his algorithm SNARC (“stochastic neural-analog reinforcement calculator”). Since then, neural networks have gone through several stages of development. One milestone was the idea of Paul J. Werbos in 1974 [1] to efficiently calculate gradients in the optimization algorithm by an approach called “backpropagation”. Another milestone was the use of GPUs (graphics processing units) to greatly reduce calculation time.
Artificial neural nets are extremely versatile and powerful. They can be used to
In this chapter, we will mainly deal with the first three aspects. Since a lot of new terms are being used, a small glossary can be found in Section “Neural Network Slang”.
To learn how and why neural networks work, we will go through three steps, each illustrated with the diamonds data:
After this, we will be ready to build more complex models.
Let us revisit the simple linear regression E(\text{price}) = \alpha + \beta \cdot \text{carat} calculated on the full diamonds data. In Chapter 1, we found the solution \hat\alpha = -2256.36 and \hat\beta = 7756.43 by ordinary least-squares.
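As a quick cross-check (not part of the original code), this OLS solution can be reproduced with lm():
library(ggplot2)  # provides the diamonds data
# Ordinary least-squares fit of price on carat
coef(lm(price ~ carat, data = diamonds))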
The above situation can be viewed as a neural network with
- an input layer with two nodes (carat and the intercept, called “bias unit”, with value 1), and
- a fully connected (“dense”) output layer with one node (price).
Fully connected means that each node of a layer is a linear function of all node values of the previous layer. Each linear function has parameters or weights to be estimated, in our simple case just \alpha and \beta.
Visualized as a graph, the situation looks as follows.
Some of the figures were created with this cool web tool.
To gain confidence in neural nets, we first show that parameters estimated by a neural network are quite similar to the ones learned by linear least-squares. To do so, we will use Google’s TensorFlow with its convenient (functional) Keras interface.
library(ggplot2)
library(keras)
# Input layer: we have 1 covariate
input <- layer_input(shape = 1)
# Output layer connected to the input layer
output <- input |>
layer_dense(units = 1)
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
# summary(nn)
nn |>
compile(
optimizer = optimizer_adam(learning_rate = 1),
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Fit model - naive without validation
history <- nn |>
fit(
x = diamonds$carat,
y = diamonds$price,
epochs = 30,
batch_size = 100
)
## Epoch 1/30
## 540/540 - 1s - loss: 27114768.0000 - root_mean_squared_error: 5207.1841 - 906ms/epoch - 2ms/step
## Epoch 2/30
## 540/540 - 1s - loss: 20254536.0000 - root_mean_squared_error: 4500.5039 - 687ms/epoch - 1ms/step
## Epoch 3/30
## 540/540 - 1s - loss: 15506468.0000 - root_mean_squared_error: 3937.8252 - 774ms/epoch - 1ms/step
## Epoch 4/30
## 540/540 - 1s - loss: 12346761.0000 - root_mean_squared_error: 3513.7959 - 767ms/epoch - 1ms/step
## Epoch 5/30
## 540/540 - 1s - loss: 10351682.0000 - root_mean_squared_error: 3217.4031 - 600ms/epoch - 1ms/step
## Epoch 6/30
## 540/540 - 1s - loss: 9132777.0000 - root_mean_squared_error: 3022.0486 - 699ms/epoch - 1ms/step
## Epoch 7/30
## 540/540 - 1s - loss: 8334773.5000 - root_mean_squared_error: 2887.0007 - 627ms/epoch - 1ms/step
## Epoch 8/30
## 540/540 - 1s - loss: 7698570.5000 - root_mean_squared_error: 2774.6299 - 642ms/epoch - 1ms/step
## Epoch 9/30
## 540/540 - 1s - loss: 7085740.5000 - root_mean_squared_error: 2661.9055 - 541ms/epoch - 1ms/step
## Epoch 10/30
## 540/540 - 1s - loss: 6464287.0000 - root_mean_squared_error: 2542.4961 - 785ms/epoch - 1ms/step
## Epoch 11/30
## 540/540 - 1s - loss: 5850314.5000 - root_mean_squared_error: 2418.7424 - 778ms/epoch - 1ms/step
## Epoch 12/30
## 540/540 - 1s - loss: 5263002.0000 - root_mean_squared_error: 2294.1233 - 726ms/epoch - 1ms/step
## Epoch 13/30
## 540/540 - 1s - loss: 4720861.5000 - root_mean_squared_error: 2172.7544 - 753ms/epoch - 1ms/step
## Epoch 14/30
## 540/540 - 1s - loss: 4235432.0000 - root_mean_squared_error: 2058.0166 - 740ms/epoch - 1ms/step
## Epoch 15/30
## 540/540 - 1s - loss: 3812718.0000 - root_mean_squared_error: 1952.6183 - 738ms/epoch - 1ms/step
## Epoch 16/30
## 540/540 - 1s - loss: 3457425.5000 - root_mean_squared_error: 1859.4154 - 773ms/epoch - 1ms/step
## Epoch 17/30
## 540/540 - 1s - loss: 3167502.7500 - root_mean_squared_error: 1779.7479 - 786ms/epoch - 1ms/step
## Epoch 18/30
## 540/540 - 1s - loss: 2937122.7500 - root_mean_squared_error: 1713.8036 - 723ms/epoch - 1ms/step
## Epoch 19/30
## 540/540 - 1s - loss: 2761225.2500 - root_mean_squared_error: 1661.6935 - 701ms/epoch - 1ms/step
## Epoch 20/30
## 540/540 - 1s - loss: 2633236.2500 - root_mean_squared_error: 1622.7250 - 565ms/epoch - 1ms/step
## Epoch 21/30
## 540/540 - 1s - loss: 2544190.2500 - root_mean_squared_error: 1595.0518 - 543ms/epoch - 1ms/step
## Epoch 22/30
## 540/540 - 1s - loss: 2485685.5000 - root_mean_squared_error: 1576.6057 - 529ms/epoch - 980us/step
## Epoch 23/30
## 540/540 - 0s - loss: 2448361.2500 - root_mean_squared_error: 1564.7240 - 489ms/epoch - 906us/step
## Epoch 24/30
## 540/540 - 1s - loss: 2425859.5000 - root_mean_squared_error: 1557.5172 - 504ms/epoch - 932us/step
## Epoch 25/30
## 540/540 - 1s - loss: 2413218.2500 - root_mean_squared_error: 1553.4536 - 652ms/epoch - 1ms/step
## Epoch 26/30
## 540/540 - 1s - loss: 2405978.7500 - root_mean_squared_error: 1551.1218 - 535ms/epoch - 991us/step
## Epoch 27/30
## 540/540 - 1s - loss: 2402259.0000 - root_mean_squared_error: 1549.9222 - 508ms/epoch - 941us/step
## Epoch 28/30
## 540/540 - 1s - loss: 2400198.5000 - root_mean_squared_error: 1549.2574 - 576ms/epoch - 1ms/step
## Epoch 29/30
## 540/540 - 1s - loss: 2399197.2500 - root_mean_squared_error: 1548.9342 - 548ms/epoch - 1ms/step
## Epoch 30/30
## 540/540 - 1s - loss: 2398820.5000 - root_mean_squared_error: 1548.8126 - 527ms/epoch - 975us/step
plot(history, metrics = "root_mean_squared_error")
unlist(get_weights(nn))
## [1] 7715.458 -2223.249
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) |>
transform(price = predict(nn, carat, verbose = 0)) |>
ggplot(aes(x = carat, y = price)) +
geom_line(color = "chartreuse4") +
geom_point(color = "chartreuse4")
Comment: The solution of the simple neural network is indeed quite similar to the OLS solution.
Neural nets are typically fitted by mini-batch gradient descent, using backpropagation to efficiently calculate the gradients. Roughly, it works as follows: the training data is shuffled and split into mini-batches; for each batch, predictions and the corresponding loss are computed, the gradients of the loss with respect to all parameters are obtained by backpropagation, and the parameters are updated by a small step (controlled by the learning rate) in the direction of the negative gradient. One pass through the full training data is called an epoch.
Gradient descent on batches of size 1 is called “stochastic gradient descent” (SGD).
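To make this concrete, here is a minimal sketch of mini-batch gradient descent for the simple linear regression above, written in plain R (learning rate, batch size, and number of epochs are ad hoc choices; the result should land close to, but not exactly at, the OLS solution):
library(ggplot2)  # diamonds data
x <- diamonds$carat
y <- diamonds$price
alpha <- 0  # initialization (a real network would initialize randomly)
beta <- 0
lr <- 0.1          # learning rate
batch_size <- 100
for (epoch in 1:30) {
  ix <- sample(length(y))  # shuffle rows
  batches <- split(ix, ceiling(seq_along(ix) / batch_size))
  for (b in batches) {
    resid <- alpha + beta * x[b] - y[b]  # predictions minus responses
    g_alpha <- 2 * mean(resid)           # gradient of the MSE wrt alpha
    g_beta <- 2 * mean(resid * x[b])     # gradient of the MSE wrt beta
    alpha <- alpha - lr * g_alpha        # gradient step
    beta <- beta - lr * g_beta
  }
}
c(alpha = alpha, beta = beta)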
The missing magic component is the so-called activation function \sigma after each layer, which transforms the values of the nodes. So far, we have implicitly used “linear activations”, which, in neural network slang, is just the identity function.
Applying non-linear activation functions after hidden layers serves to introduce non-linear and interaction effects. Typical such functions are the hyperbolic tangent (tanh), the logistic sigmoid, and the rectified linear unit (ReLU), \sigma(x) = \max(0, x); they are sketched in the code right below.
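As a small illustration (not part of the original code), these three activation functions can be plotted as follows:
library(ggplot2)
relu <- function(x) pmax(0, x)
sigmoid <- function(x) 1 / (1 + exp(-x))
X <- expand.grid(x = seq(-4, 4, by = 0.1), fun = c("tanh", "sigmoid", "ReLU"))
X$y <- ifelse(X$fun == "tanh", tanh(X$x),
              ifelse(X$fun == "sigmoid", sigmoid(X$x), relu(X$x)))
ggplot(X, aes(x = x, y = y, color = fun)) +
  geom_line()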
Activation functions applied to the output layer have a different purpose: they play the same role as the inverse link function of a corresponding GLM, mapping predictions to the scale of the response (e.g., the identity for regression on the real line, the logistic sigmoid for probabilities, or the exponential function for a log link).
Let us add a hidden layer with five nodes, followed by a hyperbolic tangent activation function (\sigma), to our simple example.
Again, the code is very similar to the previous one, except for the additional hidden layer with hyperbolic tangent activation (and a different learning rate and number of epochs).
library(ggplot2)
library(keras)
# Input layer: we have 1 covariate
input <- layer_input(shape = 1)
# One hidden layer
output <- input |>
layer_dense(units = 5, activation = "tanh") |>
layer_dense(units = 1, activation = "linear")
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
nn |>
compile(
optimizer = optimizer_adam(learning_rate = 0.2),
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Fit model - naive without validation
nn |>
fit(
x = diamonds$carat,
y = diamonds$price,
epochs = 50,
batch_size = 100,
verbose = 0
)
# Plot effect of carat on average price
data.frame(carat = seq(0.3, 3, by = 0.1)) |>
transform(price = predict(nn, carat, verbose = 0)) |>
ggplot(aes(x = carat, y = price)) +
geom_line(color = "chartreuse4") +
geom_point(color = "chartreuse4")
Comment: Adding the non-linear activation after the hidden layer has changed the model: the effect of carat on the average price is now represented by a non-linear function.
So far, we have naively fitted the neural networks without splitting the data for test and validation. Don’t do this! Usually, one sets a small test dataset (e.g., 10% of rows) aside to assess the final model performance and uses simple (or cross-)validation for model tuning.
In order to choose the main tuning parameters, namely the network architecture (number of layers and nodes per layer), the learning rate, the batch size, the number of epochs, and the amount of regularization, one often uses simple validation because cross-validation takes too much time.
A neural net does not accept missing values in the input. They need to be filled, e.g., by a typical value or a value below the minimum.
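For example (a minimal sketch with a made-up vector containing missing values):
# Fill missing values of a numeric column by a value below its minimum
x <- c(0.5, NA, 1.2, 0.8)
x[is.na(x)] <- min(x, na.rm = TRUE) - 1
x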
Gradient descent starts from a random initialization of the parameters. This step is optimized for standardized input. Standardization has to be done manually, either by min-max scaling the values to a range such as [-1, 1] or by standardizing them to zero mean and unit variance.
Note that the scaling transformation is calculated on the training data and then applied to the validation and test data. This usually requires a couple of lines of code.
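A minimal sketch (assuming numeric matrices or data frames X_train and X_test):
# Derive center and scale from the training data only
X_train_sc <- scale(X_train)
ctr <- attr(X_train_sc, "scaled:center")
scl <- attr(X_train_sc, "scaled:scale")
# Reuse the same transformation on validation/test data
X_test_sc <- scale(X_test, center = ctr, scale = scl)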
There are three common ways to represent categorical input variables in a neural network:
1. Integer (ordinal) encoding, treating the category codes like any other numeric covariate.
2. One-hot (dummy) encoding.
3. Embedding layers applied to integer-encoded categories.
For Option 2, input standardization is not required; for Option 3, it must not be applied, as the embedding layer expects integers. A sketch of the first two options follows; Option 3 is demonstrated in the claims example at the end of the chapter.
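# Sketch of Options 1 and 2 for the "color" column of diamonds
library(ggplot2)  # diamonds data
# Option 1: integer encoding (would be standardized with the other covariates)
int_enc <- data.matrix(diamonds["color"])
head(int_enc, 3)
# Option 2: one-hot encoding via model.matrix() on an unordered copy of the factor
col <- factor(diamonds$color, ordered = FALSE)
one_hot <- model.matrix(~ col - 1)
head(one_hot, 3)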
Sometimes, we want to take actions during training, such as stopping the process when the validation performance starts to deteriorate, or reducing the learning rate when the optimization is stuck on a plateau.
Such monitoring tasks are called callbacks. We will see them in the example below.
So far, we have encountered only dense (= fully connected) layers and activation layers. Further types that appear later in this chapter are embedding layers, flatten layers, and concatenation layers.
Pure gradient descent is rarely applied without tweaks because it tends to get stuck in local minima, especially for complex networks with non-convex objective surfaces. Modern variants are “Adam”, “Nadam”, and “RMSProp”. These optimizers usually work out of the box, except for the learning rate, which has to be chosen manually.
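For example, switching the optimizer only changes the compile() call (the learning rate is an illustrative value, not tuned):
library(keras)
nn |>
  compile(
    optimizer = optimizer_rmsprop(learning_rate = 0.01),
    loss = "mse"
  )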
Frameworks like Keras/TensorFlow offer many predefined loss functions and evaluation metrics. Choosing them is a crucial step, just as with tree boosting. Using TensorFlow’s backend functions, one can also define custom metrics and loss functions (see exercises and the sketch below).
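A minimal, hedged sketch of how such custom definitions could look with Keras backend functions (the names my_mae_loss and my_rmse are illustrative, not from the text):
library(keras)
# Custom loss: mean absolute error, built from backend functions
my_mae_loss <- function(y_true, y_pred) {
  k_mean(k_abs(y_true - y_pred))
}
# Custom metric: root mean squared error
my_rmse <- custom_metric("rmse", function(y_true, y_pred) {
  k_sqrt(k_mean(k_square(y_true - y_pred)))
})
# Both could then be passed to compile(), e.g.
# nn |> compile(optimizer = optimizer_adam(0.1), loss = my_mae_loss, metrics = my_rmse)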
As with linear models, a model with too many parameters will overfit in an undesired way. With about 50 to 100 observations per parameter, overfitting is usually unproblematic. (For image and text data, different rules apply.) Besides using fewer parameters, the main options to reduce overfitting are pulling the weights towards zero via L1/L2 penalties, adding dropout layers, and stopping the training early; the first two are sketched below.
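As a sketch (penalty strength and dropout rate are illustrative, not tuned), L2 penalties and dropout can be added like this; early stopping is handled by the callback shown further below:
library(keras)
input <- layer_input(shape = 4)
output <- input |>
  layer_dense(units = 30, activation = "relu",
              kernel_regularizer = regularizer_l2(l = 0.001)) |>
  layer_dropout(rate = 0.2) |>
  layer_dense(units = 1)
nn_reg <- keras_model(inputs = input, outputs = output)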
How many layers and how many nodes per layer should we select? For tabular data, one to three hidden layers are usually enough. If we start with m input variables, the number of nodes in the first hidden layer is usually higher than m and decreases for later layers. There should not be a “representational bottleneck”, i.e., an early hidden layer with too few nodes.
The number of parameters should not be too high compared to the number of rows, see “Overfitting and regularization” above.
Variable importance of covariates in neural networks can be assessed by permutation importance (how much performance is lost when shuffling column X?) or SHAP importance. Covariate effects can be investigated, e.g., by partial dependence plots or SHAP dependence plots.
We will now fit a neural net with two hidden layers (30 and 15 nodes) and a total of 631 parameters to model diamond prices. The learning rate, the activation functions, and the batch size were chosen manually by simple validation. The number of epochs is chosen automatically by an early-stopping callback.
library(ggplot2)
library(withr)
library(keras)
library(MetricsWeighted)
library(hstats)
y <- "price"
xvars <- c("carat", "color", "cut", "clarity")
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(diamonds), 0.8 * nrow(diamonds))
)
train <- diamonds[ix, ]
test <- diamonds[-ix, ]
X_train <- train[, xvars]
X_test <- test[, xvars]
y_train <- train[[y]]
y_test <- test[[y]]
# Standardize X using X_train
temp <- scale(data.matrix(X_train))
sc <- list(
center = attr(temp, "scaled:center"),
scale = attr(temp, "scaled:scale")
)
# Function that maps data to scaled network input
prep_nn <- function(X, sel = xvars, scaling = sc) {
X <- data.matrix(X[, sel, drop = FALSE])
scale(X, center = scaling$center, scale = scaling$scale)
}
# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(499)
# Input layer: we have 4 covariates
input <- layer_input(shape = 4)
# Two hidden layers with contracting number of nodes
output <- input |>
layer_dense(units = 30, activation = "relu") |>
layer_dense(units = 15, activation = "relu") |>
layer_dense(units = 1, activation = "linear")
# Create and compile model
nn <- keras_model(inputs = input, outputs = output)
summary(nn)
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## input_1 (InputLayer) [(None, 4)] 0
## dense_2 (Dense) (None, 30) 150
## dense_1 (Dense) (None, 15) 465
## dense (Dense) (None, 1) 16
## ================================================================================
## Total params: 631 (2.46 KB)
## Trainable params: 631 (2.46 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
nn |>
compile(
optimizer = optimizer_adam(learning_rate = 0.5),
loss = "mse",
metrics = metric_root_mean_squared_error()
)
# Callbacks
cb <- list(
callback_early_stopping(patience = 20),
callback_reduce_lr_on_plateau(patience = 5)
)
# Fit model
history <- nn |>
fit(
x = prep_nn(X_train),
y = y_train,
epochs = 100,
batch_size = 500,
validation_split = 0.2,
callbacks = cb,
verbose = 0
)
plot(history, metrics = "root_mean_squared_error", smooth = FALSE)
# Interpret
pred_fun <- function(m, X) predict(m, prep_nn(X), batch_size = 1000, verbose = 0)
# Performance on test data
pred <- pred_fun(nn, X_test)
rmse(y_test, pred)
## [1] 596.2767
r_squared(y_test, pred, reference_mean = mean(y_train))
## [1] 0.9783945
# Permutation importance wrt test data
perm_importance(nn, X = X_test, y = y_test, pred_fun = pred_fun, verbose = FALSE) |>
plot(fill = "chartreuse4")
# PDPs
for (v in xvars) {
p <- partial_dep(nn, v = v, X = X_train, pred_fun = pred_fun) |>
plot(color = "chartreuse4") +
ggtitle(paste("PDP for", v))
print(p)
}
Comment: Performance is lower than that of the tree-based models. This might partly be a consequence of the effects being smoother, but also of the fact that the model has not been refitted on the full training data for simplicity (20% of the training rows are used for validation).
Representing categorical input variables through embedding layers is extremely useful in practice. We will end this chapter with an example on how to do it with the claims data. This example also shows how flexible neural network structures are.
library(tidyverse)
library(keras)
library(withr)
library(insuranceData)
library(hstats)
library(MetricsWeighted)
data(dataCar)
# Response and covariates
y <- "clm"
x_emb <- "veh_body"
x_dense <- c("veh_value", "veh_age", "gender", "area", "agecat")
xvars <- c(x_dense, x_emb)
# Split into train and test
with_seed(
9838,
ix <- sample(nrow(dataCar), 0.8 * nrow(dataCar))
)
train <- dataCar[ix, ]
test <- dataCar[-ix, ]
y_train <- train[[y]]
y_test <- test[[y]]
X_train <- train[, xvars]
X_test <- test[, xvars]
# Standardization info using X_train
temp <- scale(data.matrix(X_train[, x_dense]))
sc <- list(
center = attr(temp, "scaled:center"),
scale = attr(temp, "scaled:scale")
)
# Function that maps data.frame to scaled network input (a list with a dense
# part and each embedding as a separate integer vector)
prep_nn <- function(X, dense = x_dense, emb = x_emb, scaling = sc) {
X_dense <- data.matrix(X[, dense, drop = FALSE])
X_dense <- scale(X_dense, center = scaling$center, scale = scaling$scale)
emb <- lapply(X[emb], function(x) as.integer(x) - 1)
c(list(dense1 = X_dense), emb)
}
# Trying to make things reproducible...
k_clear_session()
tensorflow::set_random_seed(469)
# Inputs
input_dense <- layer_input(shape = length(x_dense), name = "dense1")
input_veh_body <- layer_input(shape = 1, name = "veh_body")
# Embedding of veh_body
emb_veh_body <- input_veh_body |>
layer_embedding(input_dim = nlevels(dataCar$veh_body) + 1, output_dim = 1) |>
layer_flatten()
# Combine dense input and embedding
outputs <- list(input_dense, emb_veh_body) |>
layer_concatenate() |>
layer_dense(30, activation = "relu") |>
layer_dense(5, activation = "relu") |>
layer_dense(1, activation = "sigmoid")
# Input
inputs <- list(dense1 = input_dense, veh_body = input_veh_body)
# Create and compile model
nn <- keras_model(inputs = inputs, outputs = outputs)
summary(nn)
## Model: "model"
## ________________________________________________________________________________
## Layer (type) Output Shape Param Connected to
## #
## ================================================================================
## veh_body (InputLayer) [(None, 1)] 0 []
## embedding (Embedding) (None, 1, 1) 14 ['veh_body[0][0]']
## dense1 (InputLayer) [(None, 5)] 0 []
## flatten (Flatten) (None, 1) 0 ['embedding[0][0]']
## concatenate (Concaten (None, 6) 0 ['dense1[0][0]',
## ate) 'flatten[0][0]']
## dense_2 (Dense) (None, 30) 210 ['concatenate[0][0]']
## dense_1 (Dense) (None, 5) 155 ['dense_2[0][0]']
## dense (Dense) (None, 1) 6 ['dense_1[0][0]']
## ================================================================================
## Total params: 385 (1.50 KB)
## Trainable params: 385 (1.50 KB)
## Non-trainable params: 0 (0.00 Byte)
## ________________________________________________________________________________
nn |>
compile(
optimizer = optimizer_adam(learning_rate = 0.001),
loss = "binary_crossentropy"
)
# Callbacks
cb <- list(
callback_early_stopping(patience = 20),
callback_reduce_lr_on_plateau(patience = 5)
)
# Fit model
history <- nn |>
fit(
x = prep_nn(X_train),
y = y_train,
epochs = 100,
batch_size = 400,
validation_split = 0.2,
callbacks = cb,
verbose = 0
)
plot(history, metrics = c("loss", "val_loss"), smooth = FALSE)
# Interpret
pred_fun <- function(m, X) predict(m, prep_nn(X), batch_size = 1000, verbose = FALSE)
# Performance on test data
pred <- pred_fun(nn, X_test)
logLoss(y_test, pred)
## [1] 0.2533789
r_squared_bernoulli(y_test, pred, reference_mean = mean(y_train))
## [1] 0.002686216
# Permutation importance
perm_importance(
nn, X = X_test, y = y_test, loss = "logloss", pred_fun = pred_fun, verbose = FALSE) |>
plot(fill = "chartreuse4")
# Partial dependence
for (v in xvars) {
p <- partial_dep(nn, v = v, X = X_train, pred_fun = pred_fun) |>
plot(color = "chartreuse4", rotate_x = (v == "veh_body")) +
ggtitle(paste("PDP for", v))
print(p)
}
# Gamma deviance loss (up to constants), built from Keras backend functions
loss_gamma <- function(y_true, y_pred) {
  -k_log(y_true / y_pred) + y_true / y_pred
}
Here, we summarize some of the neural network slang.
In this chapter, we have taken a glimpse into the world of neural networks. Step by step, we have learned how a neural network works. We have used Keras and TensorFlow to build models brick by brick.
During this lecture, we have met many ML algorithms and principles. The best way to get used to them is to practice. Kaggle is a great place to do so and to learn from the best.
A summary and comparison of the algorithms can be found on GitHub.
Here is a screenshot as of September 7, 2020:
As a final task and motivation, try out my Analysis scheme X. It can be applied very frequently and works as follows:
What additional insights did you get from Step 3?
[1] P.J. Werbos, “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences”, Dissertation, 1974.