The aim of the first chapter is to refresh your R skills. It is split into two sections: “Data Analysis” and “Writing Functions”.
You can download R from the Comprehensive R Archive Network (CRAN).
In the first section, our focus is on data preparation and descriptive analysis, an essential part of every (good) analysis. Base R has hundreds of functions to help you with these aspects. We take additional support from the following contributed extension packages:
Package | Task | Initial CRAN release | Creator |
magrittr | pipe %>% | 2014 | Stefan Milton Bache |
dplyr | data preprocessing | 2014 | Hadley Wickham |
ggplot2 | beautiful plots | 2007 | Hadley Wickham |
rmarkdown | dynamic reports | 2014 | Yihui Xie |
Let’s start with a small selection of helpful functions in base R for data preparation and descriptive analysis:
- subset(): Select rows and columns of data frame
- transform(): Add or overwrite columns in data frame
- aggregate(), tapply(), by(): Grouped calculations
- ave(): Grouped transformations
- rbind(), cbind(): Bind rows/columns of data frame/matrix
- merge(): Join data frames by key
- expand.grid(): Cross-join lists/data frames
- head(), tail(): First/last few elements of an object
- nrow(), ncol(), dim(): Number of rows/columns of data frame/matrix
- order(), rank(): Sort indices, ranks
- rowSums(), rowMeans(): Row-wise sums/means of data frame/matrix
- colSums(), colMeans(): Column-wise sums/means of data frame/matrix
- cumsum(): Cumulative sums of vector (cummean() for cumulative means comes from “dplyr”, not base R)
- reshape(): Transposition/reshaping of data frame (tricky interface)
- lapply(): Apply function element-wise, e.g., per column of data frame
- str(): Structure of object, e.g., of a data frame
- summary(): Summarizes object, e.g., each column in a data frame
- mean(), median(), sd(), min(), max(): Univariate statistics
- table(), prop.table(): Absolute and relative counts
- cor(), cov(): Bivariate statistics
- hist(), barplot(), boxplot(): Some plot functions
To see some of these functions in action, we will peek into the diamonds data that is part of the “ggplot2” package. We are mainly interested in the column “price” and the four “C”-variables: Carat, Color, Cut, and Clarity. Each observation/row represents a diamond.
library(ggplot2)

nrow(diamonds)
## [1] 53940
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(subset(diamonds, select = c(price, carat, color, clarity, cut)))
## price carat color clarity cut
## Min. : 326 Min. :0.2000 D: 6775 SI1 :13065 Fair : 1610
## 1st Qu.: 950 1st Qu.:0.4000 E: 9797 VS2 :12258 Good : 4906
## Median : 2401 Median :0.7000 F: 9542 SI2 : 9194 Very Good:12082
## Mean : 3933 Mean :0.7979 G:11292 VS1 : 8171 Premium :13791
## 3rd Qu.: 5324 3rd Qu.:1.0400 H: 8304 VVS2 : 5066 Ideal :21551
## Max. :18823 Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
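The grouped functions from the list above (tapply(), aggregate(), ave()) are easiest to tell apart on a tiny example. The following is a minimal sketch on a made-up data frame `df`; its column names and values are assumptions for illustration and are not taken from the diamonds data.

```r
# Toy data frame (hypothetical values, not from the diamonds data)
df <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)

# Grouped calculation: one mean per group, returned as a named array
group_means <- tapply(df$value, df$group, mean)

# Same statistic via the formula interface, returned as a data frame
agg <- aggregate(value ~ group, data = df, FUN = mean)

# Grouped transformation: the group mean is repeated for every row
df$group_mean <- ave(df$value, df$group)

# Sort rows by descending value and keep the first two
top2 <- head(df[order(-df$value), ], 2)
```

A grouped calculation shrinks the data to one row per group, whereas a grouped transformation keeps the original number of rows.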
# Univariate plots
hist(diamonds$price, breaks = "FD", col = "chartreuse4")
hist(diamonds$carat, breaks = "FD", col = "chartreuse4")

for (x in c("color", "clarity", "cut")) {
  barplot(table(diamonds[[x]]), main = x, col = "chartreuse4")
}

# Bivariate plots
plot(price ~ carat, data = diamonds, col = "chartreuse4", pch = ".", xlim = c(0, 3))

plot(
  price ~ carat,
  data = diamonds,
  col = "chartreuse4",
  pch = ".",
  log = "xy",
  main = "log-log scale"
)

for (x in c("color", "clarity", "cut")) {
  boxplot(
    reformulate(x, "price"),
    varwidth = TRUE,
    data = diamonds,
    main = paste("price by levels of", x),
    col = "chartreuse4"
  )
}
The worst categories of color, cut, and clarity are rare.
One of the most downloaded contributed extension packages of all time is “magrittr”. It provides the forward pipe operator %>%. The pipe puts the object in front of it as the first argument in the function after it. Thus, X %>% f(...) is the same as f(X, ...). In this way, the pipe helps to turn a nested function call into a sequence of simple calls.
With %>%, the piped object can be referred to by a dot (.), e.g., X %>% f(y, .) is the same as f(y, X).
library(magrittr)

# Same as head(diamonds, 2)
diamonds %>%
  head(2)

# Not too spectacular. But what about these expressions?
diamonds %>%
  subset(select = c(price, carat, color, clarity, cut)) %>%
  summary()
## price carat color clarity cut
## Min. : 326 Min. :0.2000 D: 6775 SI1 :13065 Fair : 1610
## 1st Qu.: 950 1st Qu.:0.4000 E: 9797 VS2 :12258 Good : 4906
## Median : 2401 Median :0.7000 F: 9542 SI2 : 9194 Very Good:12082
## Mean : 3933 Mean :0.7979 G:11292 VS1 : 8171 Premium :13791
## 3rd Qu.: 5324 3rd Qu.:1.0400 H: 8304 VVS2 : 5066 Ideal :21551
## Max. :18823 Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
diamonds$color %>%
table() %>%
prop.table() %>%
barplot(col = "chartreuse4")
# The alternative to the last expression would be a nested construction like this:
# barplot(prop.table(table(diamonds$color)), col = "chartreuse4")
# Or a repeated assignment:
# x <- table(diamonds$color)
# x <- prop.table(x)
# barplot(x, col = "chartreuse4")
# Voila the new base R pipe:
iris |> head(2)
# Beautifully implemented:
quote(iris |> head(2))
## head(iris, 2)
The pipe shines when it comes to data preprocessing, as this often consists of several steps. We will see this in the following.
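To make the equivalence between nested and piped calls concrete, here is a small sketch that needs only base R. It assumes R 4.1 or later (for the native pipe `|>`), and the vector `x` is a made-up example.

```r
x <- c(1, 4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads from top to bottom, one step per line
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

# Both versions give the same result
identical(nested, piped)

# The parser rewrites the native pipe into the nested form
quote(x |> sqrt() |> mean())
```

Since the native pipe is resolved by the parser, it should not add any overhead at run time.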
Another helpful R package is “dplyr” (Grammar of Data Manipulation). It provides a rich set of data preprocessing functions. Here is a selection:
- select(): Select or drop columns
- filter(): Select rows by condition
- arrange(): Sort by one or more columns
- mutate(): Create or overwrite columns
- summarize(): Summary statistics
- rename(): Rename columns
- transmute(): Transform columns and select only those
- group_by(), ungroup(): Group the rows by levels of one or more columns. Plays well with other functions like summarize() or mutate()
- bind_rows(), bind_cols(): Bind data frames by row/column
- left_join(), inner_join(): Join by key
- slice(): Select rows by position
- pivot_wider(), pivot_longer(): Reshape/transpose (in “tidyr”, not “dplyr”)
These “verbs” always take a data frame as their first argument and return a data frame, which makes it easy to work with the pipe.
Together with “magrittr”, “tidyr” (for restructuring data), “ggplot2” (beautiful plots), and some other packages, “dplyr” is part of the tidyverse. Its packages can be loaded with the command library(tidyverse). See Wickham and Grolemund (2017) for a great reference on the tidyverse.
Let’s have some “dplyr” fun with diamonds.
library(tidyverse)

# Price and carat of the two most expensive diamonds
diamonds |>
  arrange(-price) |>
  select(price, carat) |>
  head(2)
# Select diamonds >2 carat and calculate log(price) and log(carat)
# Possible simplification: transmute() = mutate() + select()
diamonds |>
  filter(carat > 2) |>
  mutate(
    log_price = log(price),
    log_carat = log(carat)
  ) |>
  select(log_price, log_carat) |>
  head()
# Median carat and price
diamonds |>
  summarize(
    med_carat = median(carat),
    med_price = median(price)
  )

# Same grouped by clarity
med <- diamonds |>
  group_by(clarity) |>
  summarize(
    med_carat = median(carat),
    med_price = median(price)
  )
med
# Join medians to original data by using clarity as key
# We only use price and the four "C" variables
diamonds |>
  select(price, starts_with("c")) |>
  left_join(med, by = "clarity") |>
  head()

# Directly with a grouped mutate (imagine this without pipe...)
diamonds |>
  select(price, starts_with("c")) |>
  group_by(clarity) |>
  mutate(
    med_carat = median(carat),
    med_price = median(price)
  ) |>
  ungroup() |>
  head()
# Turn all ordered factors into unordered
dia <- diamonds |>
mutate_if(is.ordered, factor, ordered = FALSE)
# Stack price and carat using the function pivot_longer() in "tidyr"
# -> will need this later
diamonds_long <- diamonds |>
  select(price, carat) |>
  pivot_longer(everything())

head(diamonds_long, 4)
The following table helps to translate between base R and “dplyr”. It also includes other technologies that we will meet later: “data.table” for fast data processing, and the data query language SQL.
Task | base R | dplyr | data.table | SQL |
Pick columns | subset / [cols] | select | X[, cols] | |
Transform columns | transform / <- | mutate | X[, z := ...] | |
Rename a column | depends… | rename | setnames | AS |
Filter on condition | subset / [cond,] | filter | X[cond] | |
Bind rows | rbind | bind_rows | rbindlist / rbind | |
Bind columns | cbind | bind_cols | cbind | Join on row id |
Left join | merge(, all.x = TRUE) | left_join | merge / [] | |
Inner join | merge | inner_join | merge / [] | |
Grouped stats | aggregate | group_by + summarize | [, ..., by = ] | |
Grouped transformation | ave | group_by + mutate | [, := , by = ] | |
Reshape wide to long | reshape | pivot_longer (tidyr) | melt | |
Reshape long to wide | reshape | pivot_wider (tidyr) | dcast | |
Sort by z | X[order(), ] | arrange | X[order()] / setorder | |
Top m rows | head | slice | X[1:m] | |
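The base R column of the table can be tried out without any extra package. Below is a minimal sketch of the two join rows and the sort row. The tables `customers` and `orders` and their contents are made up for illustration.

```r
# Hypothetical toy tables sharing the key column "id"
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cid"))
orders <- data.frame(id = c(1, 3), amount = c(10, 30))

# Left join: keep all rows of the first table, unmatched rows get NA
left <- merge(customers, orders, by = "id", all.x = TRUE)

# Inner join: keep only ids that occur in both tables
inner <- merge(customers, orders, by = "id")

# Sort by descending amount; order() puts missing values last by default
left_sorted <- left[order(-left$amount), ]
```

Note that merge() sorts its result by the key columns, which is why an explicit order() is only needed for other sort criteria.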
“ggplot2” (Grammar of Graphics) is THE package for drawing beautiful figures. The main difference to the standard plotting functions in R is the use of + to build the plot layer by layer. This logic was suggested in the “grammar of graphics” (Wilkinson 2005) and then implemented by Hadley Wickham in “ggplot2”.
We will introduce “ggplot2” using examples. For more information, see again Wickham and Grolemund (2017).
Let’s use “ggplot2” to plot diamonds data.
# The minimum: a data set, an aesthetic mapping, and a geometry
# -> "make a bar plot, using color on the x axis"
ggplot(data = diamonds, mapping = aes(x = color)) +
geom_bar(fill = "chartreuse4")
# A histogram -> store as object "p" to modify it later
p <- ggplot(diamonds, aes(x = price)) +
geom_histogram(fill = "chartreuse4", bins = 30)
# "Add" title
p + ggtitle("Histogram of price")
# Boxplot of price per color
p <- ggplot(diamonds, mapping = aes(x = color, y = price)) +
geom_boxplot(fill = "chartreuse4", varwidth = TRUE)
# Same but with larger text. Other settings can be changed by + theme()
p + theme_gray(base_size = 15)
# Attention: + ylim() would clip the data *before* calculating statistics
# It is usually better to clip the coordinate system:
p + coord_cartesian(ylim = c(0, 8000))
# Scatterplot of price against carat (on log-log scale)
p <- ggplot(diamonds, mapping = aes(x = carat, y = price)) +
geom_point(color = "chartreuse4", alpha = 0.2, shape = ".") +
scale_x_log10() +
scale_y_log10() +
ggtitle("Scatterplot on log-log scale")
# A ggplot can use any number of geometries (even with different data sets each)
# Here, we add a scatterplot smoother (a type of regression)
p + geom_smooth()
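The remark on ylim() versus coord_cartesian() comes down to whether the statistics are computed on all data or only on the values inside the limits. The following base R sketch illustrates that principle on made-up numbers. It does not call “ggplot2”, and the names `price` and `upper` are assumptions for the example.

```r
# Hypothetical prices with one extreme value
price <- c(1000, 2000, 3000, 4000, 20000)
upper <- 8000

# Zooming the coordinate system: statistics are based on all values
median_all <- median(price)

# Removing values outside the limits first: statistics change
median_clipped <- median(price[price <= upper])

c(all = median_all, clipped = median_clipped)
```

Here, the median drops from 3000 to 2500 once the extreme value is removed before the calculation, which is the kind of distortion the comment in the code warns about.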
One of the strengths of “ggplot2” is grouped plots using facet_wrap() and facet_grid(). Creating a scatterplot per color? Just add a facet:
p + facet_wrap(~ color)
Combining faceting with data reshaping, we can draw histograms of multiple variables:
diamonds |>
select(price, carat) |>
pivot_longer(everything()) |>
ggplot(aes(value)) +
geom_histogram(fill = "chartreuse4", bins = 29) +
facet_wrap(~ name, scales = "free_x")
Another strength of “ggplot2” is that we can map columns not only to x and y coordinates, but also to aspects such as color, fill, alpha (transparency), shape, linetype and size:
ggplot(diamonds, mapping = aes(x = carat, y = price, color = clarity)) +
geom_point(alpha = 0.2) +
scale_x_log10() +
scale_y_log10() +
guides(colour = guide_legend(override.aes = list(alpha = 1))) +
theme(legend.position = "top")
Another fantastic plotting library is Plotly. It provides interactive plots. Plotly is written in JavaScript and is available in R through the “plotly” package. Unlike “ggplot2”, it is not based on the “grid” graphics system of R.
The “plotly” package offers two ways to create the plots: We can either work with the native Plotly syntax or use the function ggplotly() to translate (most) “ggplot” objects. We will show the latter:
library(plotly)

# Mind the outer parentheses
(ggplot(diamonds, mapping = aes(color)) +
  geom_bar(fill = "chartreuse4")) |>
  ggplotly()
These lecture notes are written with R Markdown. R Markdown combines Markdown text and R code and turns it into HTML, Word, or PDF. It is a wonderful tool to write reports.
Markdown is a simple markup language to format text. Unlike LaTeX, it is easy to read even in its raw form. Markdown is frequently used on technical question-and-answer forums or to write documentation (e.g., on GitHub). Markdown is not related to R.
Here are some basic syntax elements. The corresponding HTML created with RStudio is shown afterwards.
# Markdown
## Headers
Headers start with one or more `#`. The more, the smaller the title.
## Text highlighting
This is an *italic* text, this one is **bold**, and this one is ***both***.
## Lists
- The items of a *bulleted list* are created with an `-`.
- Numbered lists are initialized with a number as below:
1. First item
2. Second item
1. Third item (the actual number does not matter...)
Lists can also be nested.
## Code
To format text as inline code, set the code between `backticks`. Use three backticks to format longer code snippets.
## Formulas
Use LaTeX syntax to write formulas like this $e^{i \pi} = -1$ or this:
$$
e^{i \pi} = -1
$$
They are rendered by MathJax.
Screenshot of the rendered HTML file
In contrast to a Markdown file, R Markdown also contains R code chunks to be executed. Furthermore, it starts with a YAML header (“YAML Ain’t Markup Language”, originally “Yet Another Markup Language”) that specifies, among other things, the output format (HTML, PDF, Word).
A very simple R Markdown file looks like this:
---
title: "iris flowers"
output: html_document
---
## The data
The `iris` data contains information on `r nrow(iris)` flowers from three species. Here are the first three rows:
```{r}
head(iris, 3)
```
## A plot
To suppress code in the resulting file, set `echo=FALSE` in the chunk options:
```{r, echo=FALSE}
plot(Sepal.Width ~ Sepal.Length, col = Species, data = iris)
```
Screenshot of the resulting HTML file
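The basic workflow is as follows: write Markdown text and R code chunks in an .Rmd file, then knit it (e.g., with the “Knit” button in RStudio) to create the HTML/Word/PDF. What happens in the background? The “knitr” package searches the .Rmd file for code chunks, runs them, and “knits” the results with your Markdown text to a temporary Markdown file. This file is then converted to the final output format by Pandoc.
For more information, see Xie, Allaire, and Grolemund (2018).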
In this section, we review some aspects of writing our own functions.
We have already used different R functions, for instance ggplot2::ggplot(). (The :: indicates that ggplot() belongs to the “ggplot2” package.)
Writing our own functions helps to avoid code duplication and makes our code more readable. For many details and technical background, see Chapter 6 in Wickham (2015a).
The standard way to find the greatest common divisor (GCD) of two natural numbers is to multiply all common prime factors. For very large numbers, prime factorization becomes infeasible. How to proceed in such a case? A general solution is to use the Euclidean algorithm, one of the oldest non-trivial algorithms still in use. It uses the fact that subtracting the smaller number from the larger number does not change their GCD. Repeating this subtraction until the larger number drops below the smaller one amounts to taking the remainder (modulo), which leads to the following compact algorithm:
a <- 45
b <- 20
while (b > 0) {
  temp <- b
  b <- a %% b
  a <- temp
}
a # Result is 5

# Again with other numbers
a <- 335544 * 98734
b <- 335544 * 98733
while (b > 0) {
  temp <- b
  b <- a %% b
  a <- temp
}
a # Result is 335544
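The algorithm seems to work! But we see two problems: the code had to be duplicated for the second pair of numbers, and the inputs a and b are overwritten along the way.
Working with a function solves both issues: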
# Greatest common divisor
# a, b: positive integers
gcd <- function(a, b) {
  while (b > 0) {
    temp <- b
    b <- a %% b
    a <- temp
  }
  a
}

# Example
a <- 45
b <- 12
gcd(a, b) # 3
a         # unchanged 45
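The claim that the modulo step is just repeated subtraction can be checked directly. The sketch below compares a subtraction-only variant with the modulo variant from the text. The names `gcd_subtract()` and `gcd_modulo()` are hypothetical helpers introduced here, and both assume positive integer inputs.

```r
# Subtraction-only variant (hypothetical helper, for illustration):
# subtracting the smaller from the larger number keeps the GCD unchanged
gcd_subtract <- function(a, b) {
  while (a != b) {
    if (a > b) {
      a <- a - b
    } else {
      b <- b - a
    }
  }
  a
}

# Modulo variant as in the text: many subtractions collapsed into one step
gcd_modulo <- function(a, b) {
  while (b > 0) {
    temp <- b
    b <- a %% b
    a <- temp
  }
  a
}

gcd_subtract(45, 20)
gcd_modulo(45, 20)
```

Both calls return 5. The subtraction variant needs many more iterations when the two numbers differ strongly in size, which is why the modulo version is preferred in practice.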
Similarly, the least common multiple of two natural numbers is their product divided by the GCD. We can re-use gcd() to create a compact function lcm():
# Least common multiple
# a, b: positive integers
lcm <- function(a, b) {
  div <- gcd(a = a, b = b)
  return(a * b / div)
}

# Example
lcm(45, 18) # 90
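One payoff of wrapping the logic in functions is that they can be combined with other tools. As a sketch, the two-argument functions can be folded over a whole vector with base R's Reduce(). The definitions are repeated so that the snippet runs on its own, and the input vectors are arbitrary examples.

```r
# Definitions repeated from the text so that the snippet is self-contained
gcd <- function(a, b) {
  while (b > 0) {
    temp <- b
    b <- a %% b
    a <- temp
  }
  a
}

lcm <- function(a, b) {
  a * b / gcd(a, b)
}

# Reduce() applies a two-argument function successively:
# gcd(gcd(12, 18), 24)
gcd_many <- Reduce(gcd, c(12, 18, 24))

# Smallest number divisible by each of 1, 2, ..., 10
lcm_many <- Reduce(lcm, 1:10)

c(gcd_many, lcm_many)
```

This gives 6 and 2520 without changing the original functions.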
Whether we write functions or any other code: mind your code style. A compact list of rules can be found in Google’s R Style Guide.
Some tips:
- Put spaces around operators. However, spaces around = are optional in function arguments. No space before a comma, always one after a comma.
- Check inputs and signal problems, e.g., with stopifnot() and stop().
# Strange placement of curly braces, spaces, using ; etc.
lcm_bad_style <- function(a,b)
{
div <-gcd(a= a, b =b); return(a*b/div)
}
# Comment on "why" you are doing something...
b <- a %% b # GCD(a, b) unaffected by modulo
# ... rather than stating the obvious
b <- a %% b # a modulo b
Our final version of the function gcd() uses input checks. Note: for very large numbers, the function would start to return garbage because of floating point issues, so inputs are limited to values below 2^50.
# Greatest common divisor
# a, b: positive integers
gcd <- function(a, b) {
  stopifnot(
    length(a) == 1,
    length(b) == 1,
    a >= 1,
    b >= 1,
    a == trunc(a),
    b == trunc(b),
    max(a, b) < 2^50
  )
  while (b > 0) {
    temp <- b
    b <- a %% b
    a <- temp
  }
  a
}
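The floating point issue behind the upper limit can be made visible in a few lines. R stores such numbers as doubles, which on common platforms represent integers exactly only up to 2^53. The limit of 2^50 used in the function therefore appears to include a safety margin. A minimal sketch:

```r
# Number of binary digits in the significand of a double
# (53 on platforms with IEEE 754 doubles)
.Machine$double.digits

big <- 2^53

# Just below the limit, neighbouring integers are still distinguishable
below_ok <- (big - 1) == big

# At the limit, adding 1 is lost to rounding
above_lost <- (big + 1) == big

c(below_ok, above_lost)
```

The second comparison is TRUE although the two numbers differ mathematically, so any integer arithmetic on inputs of that size would silently go wrong.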
The larger a project becomes, the more functions you will write. Then, it often makes sense to move them to one or more separate R scripts containing only those functions. In your main R script or R Markdown report, after loading necessary packages, you can load them via source().
Depending on the situation, it could also make sense to combine your functions to an R package, especially if you want to share your code with others, see Wickham (2015b).
The R script “functions.R” contains the two functions gcd() and lcm() from above. We can load and use them like this:
# A Fantastic Analysis

# Loads gcd() and lcm()
source("functions.R")

# Least common multiple
lcm(3030, 5050) # 15150

# More code...
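To try the mechanism without creating a project, the following sketch writes a tiny function file to a temporary location and sources it. The file content and the function name `square()` are made up for the example. Unlike the plain source() call above, it loads the functions into a separate environment via the `local` argument, which is optional.

```r
# Write a small function file to a temporary location
# (stand-in for a script like "functions.R")
path <- tempfile(fileext = ".R")
writeLines(
  c(
    "square <- function(x) {",
    "  x^2",
    "}"
  ),
  path
)

# Evaluate the file in its own environment so nothing leaks into the workspace
env <- new.env()
source(path, local = env)

ls(env)
env$square(4)
```

With the default `local = FALSE`, the functions would instead be created in the global environment, as in the example from the text.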
Being able to use unquoted variable names is one of the reasons why “dplyr” or “ggplot2” code looks so smooth:
- select(diamonds, price, color), not: select(diamonds, c("price", "color"))
- ggplot(diamonds, aes(x = color)), not: ggplot(diamonds, aes(x = "color"))
- facet_grid(~ color)
However, this data-masking makes it tricky to use such functions in our own functions. If the object v contains a variable name like "color" to be shown on the x-axis of a bar plot, calling ggplot(diamonds, aes(v)) would not give the intended plot because there is no column “v” in diamonds. Similarly, writing facet_grid(~ v) or select(diamonds, v) would not work reliably. There are different solutions to this problem, see Chapter 20 “Evaluation” in Wickham (2015a) for details.
Solutions include:
- select(all_of(c("x1", "x2"))) instead of select(x1, x2)
- group_by(across(all_of(c("x1", "x2")))) instead of group_by(x1, x2)
- aes_string(x = "x1") instead of aes(x = x1) (deprecated in recent “ggplot2” versions)
- aes(x = .data[["x1"]]) instead of aes(x = x1)
- reformulate(c("x1", "x2")) instead of ~ x1 + x2
- reformulate(c("x1", "x2"), "y") instead of y ~ x1 + x2
Here, c("x1", "x2") or "x1" can be replaced by objects v = c("x1", "x2") or v = "x1".
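The reformulate() entries in the list can be tried with base R alone. The sketch below builds formulas from strings and uses them with aggregate() on the built-in mtcars data. The choice of columns is an arbitrary assumption for illustration.

```r
v <- c("cyl", "gear")   # column names of the built-in mtcars data

# One-sided formula: ~cyl + gear
rhs_only <- reformulate(v)

# Two-sided formula: mpg ~ cyl + gear
with_response <- reformulate(v, response = "mpg")

# The programmatically built formula behaves like a hand-written one
by_string <- aggregate(with_response, data = mtcars, FUN = mean)
by_hand <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)
all.equal(by_string, by_hand)

# Base R analogue of .data[[v]]: pick a column by a string
x <- "mpg"
mean(mtcars[[x]])
```

The same pattern carries over to functions that accept formulas, such as boxplot() or facet_wrap().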
Let’s create a function that plots a bar plot for diamonds data for any of the discrete “C” variables:
bar_plot <- function(xvar) {
  stopifnot(xvar %in% c("color", "clarity", "cut"))

  ggplot(data = diamonds, mapping = aes(x = .data[[xvar]])) +
    geom_bar(fill = "chartreuse4") +
    ggtitle(sprintf("Bar plot of '%s'", xvar))
}

bar_plot("cut")
As a second example, we write a function to show mean diamond prices per factor level, along with standard deviations.
error_bars <- function(xvar) {
  stopifnot(xvar %in% c("color", "clarity", "cut"))

  dat <- diamonds |>
    group_by(across(all_of(xvar))) |>
    summarize(mean_price = mean(price), sd_price = sd(price))

  ggplot(dat, aes(x = .data[[xvar]], y = mean_price)) +
    # geom_errorbar() and labs() are assumed here; the original call names were lost
    geom_errorbar(
      aes(ymin = mean_price - sd_price, ymax = mean_price + sd_price),
      color = "chartreuse4"
    ) +
    geom_point(size = 3) +
    labs(
      title = sprintf("Distribution of price by '%s'", xvar),
      y = "Price"
    )
}

error_bars("clarity")
R contains generic functions like plot(), summary(), and print(). Depending on the object class they are applied to, they do something different.
Let’s look at two of the many behaviors of plot():
# plot() applied to object of class "formula"
class(price ~ color)
## [1] "formula"
plot(price ~ color, data = diamonds)

# plot() applied to object of class "factor".
# Note: the first class "ordered" is skipped as it does not have its own plot method
class(diamonds$color)
## [1] "ordered" "factor"
plot(diamonds$color)
This so-called S3 object-oriented system works roughly like this: The generic function calls UseMethod(), which selects the class-specific method of the form generic.class(), which is then invoked. In the above situation, plot() calls UseMethod(), which then calls plot.factor().
Following this logic, we can write our own print()
methods, for example. It is also possible to define new generic
functions, but we will not need that here.
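The dispatch logic described above can be watched in action with a small sketch. It adds methods for the existing generic format() to two made-up classes, “euro” and “money”. All class and function names below are assumptions for illustration.

```r
# Constructor for a hypothetical class "euro" that also inherits from "money"
euro <- function(amount) {
  structure(list(amount = amount), class = c("euro", "money"))
}

# Method for the existing generic format(), defined for the parent class only
format.money <- function(x, ...) {
  sprintf("%.2f (money)", x$amount)
}

# No format.euro() exists yet, so dispatch falls through to format.money()
price <- euro(9.5)
first <- format(price)

# Adding a more specific method changes which function is selected
format.euro <- function(x, ...) {
  sprintf("EUR %.2f", x$amount)
}
second <- format(price)

c(first, second)
```

The class vector is searched from left to right, so the more specific method wins as soon as it exists.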
Let’s create an object of class “student” with its own print() method:
# Function that creates an object of class "student"
student <- function(given_name, family_name) {
  out <- list(
    given_name = given_name,
    family_name = family_name
  )
  class(out) <- "student"
  out
}

me <- student("Michael", "Mayer")
me # same as print(me)
## $given_name
## [1] "Michael"
## $family_name
## [1] "Mayer"
## attr(,"class")
## [1] "student"

# Nothing special so far. We first need a print() method for the "student" class
print.student <- function(x, ...) {
  cat("Hi, I'm", x$given_name, x$family_name)
  invisible(x)
}

print(me) # same as me
## Hi, I'm Michael Mayer
This construction is often used in R.
- Methods follow the naming scheme generic.class(). Thus, methods(plot) shows all loaded plot methods.
- UseMethod() searches for the first class that has a method for the given generic. If none is found, the default method is called, e.g., print.default().
- The ... argument is a placeholder for other function arguments. It is usually just passed to another function.
Load the data dataCar
from the package
“insuranceData”. It represents claim data on vehicle insurance policies
from 2004 to 2005. Some variables like “gender” describe the policy
holder, others like “veh_age” the vehicle, and some variables carry
information on claims, e.g. “numclaims”. Each row represents policy
information valid in a certain time window. Use the pipe, “dplyr”, and
“ggplot2” to solve the following tasks.
Create an R Markdown file that contains all answers to Exercise 1. Knit the report to HTML. Make sure that the resulting HTML looks neat and clean, so that you could hand it over to someone else.
The sieve of Eratosthenes is an ancient algorithm to get all
prime numbers up to any given limit n,
see Wikipedia.
Write a function sieve_of_eratosthenes(n)
that returns all
prime numbers up to n. Benchmark the
results for n = 10^5 with the package
“bench”. Mind your coding style!
In Exercise 1c, we have calculated and plotted the average number
of claims per level of “agecat” in the dataCar data. Write a function that provides such
a visualization for any discrete variable v. Add an argument
to control whether the resulting plot is interactive or not.
Extend the “student” class from Section “plot, print, summary” by
the optional information “semester”. It represents the number of
semesters the student is already registered. Add a
method that would neatly print the name and the
semester of the student.
In this first chapter, we have used R as a powerful tool for data preparation and descriptive analysis. Furthermore, we met important aspects of writing functions.