The focus of this chapter is on computer-intensive methods to calculate confidence intervals and to perform statistical tests. As a refresher, we will first give a short introduction to statistical inference.
This chapter also serves to apply the programming techniques learned in the last chapter.
Statistical inference is the art of making statements about an unknown population parameter \theta from data. Examples of \theta:
The three main tasks are the following:
Mathematical statistics provides the theory to perform these tasks for different parameters \theta, and under different circumstances.
To revisit the basics of classical statistical inference, we summarize some important results for the mean.
Let \mu = \mathbb E(F) denote the true mean of a distribution F with standard deviation \sigma = \sqrt{\text{Var}(F)} < \infty. Usually, F, \mu, and \sigma are all unknown. Furthermore, let \boldsymbol X = (X_1, \dots, X_n) be a sequence of independent random variables, each with distribution F.
What can we say about the sample mean \hat \mu = \frac{1}{n}\sum_{i = 1}^n X_i viewed as a random variable?
The standard deviation \sigma(\hat\mu) of the sample mean equals \sigma(\hat\mu) = \sqrt{\text{Var}(\hat\mu)} = \sqrt{\text{Var}\left(\frac{1}{n} \sum_{i = 1}^n X_i\right)} = \sqrt{\frac{1}{n^2} \sum_{i = 1}^n \text{Var}(X_i)} = \sqrt{\frac{1}{n} \text{Var}(X_i)} = \frac{\sigma}{\sqrt{n}}. Since \sigma = \sigma(F) = \sigma(X_i) is unknown in practice, it is usually replaced by the sample standard deviation \hat\sigma = \sqrt{\frac{1}{n-1}\sum_{i = 1}^{n}(X_i - \hat\mu)^2}.
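To make these formulas concrete, here is a minimal sketch (using simulated standard normal data purely for illustration) that computes \hat\mu, \hat\sigma, and the estimated standard error \hat\sigma / \sqrt{n}:
# Sketch: sample mean, sample sd, and estimated standard error of the mean
# (simulated standard normal data, purely for illustration)
set.seed(1)
n <- 100
x <- rnorm(n)
mu_hat <- mean(x)
sigma_hat <- sqrt(sum((x - mu_hat)^2) / (n - 1)) # identical to sd(x)
sigma_hat / sqrt(n) # estimated standard error of the sample mean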
Remarks and outlook:
We will now use computer simulations to illustrate the law of large numbers and the CLT. In both cases, we will do repeated sampling from a known(!) distribution F.
The LLN says that the sample mean approaches the true mean as the sample size grows. We illustrate this by simulation: For different sample sizes n, we draw 10^5 samples of size n each, and plot the histogram of their sample means. All values will be drawn from an exponential distribution with density f(x) = \lambda e^{-\lambda x}, \ x\ge 0, with \lambda = 1/\mu = 1.
par(mfrow = c(2, 2), mai = rep(0.4, 4)) # Four plot windows
nsims <- 1e5
set.seed(100)
for (n in c(1, 10, 50, 200)) {
  means <- numeric(nsims)
  for (i in 1:nsims) {
    means[i] <- mean(rexp(n))
  }
  hist(means, breaks = 99, main = paste("n =", n), col = "orange",
       border = "orange", xlim = c(0, 6), probability = TRUE)
}
Comment: The distribution of the sample mean contracts more and more around the true mean 1.
The CLT (roughly) says that the standardized sample mean of a sufficiently large sample is approximately normal, no matter how “non-normal” the individual values X_i are. We will illustrate this with the same code as above, now using standardized means, and replacing the inner loop by replicate() for convenience.
par(mfrow = c(2, 2), mai = rep(0.4, 4))
nsims <- 1e5
# Standardized mean: For exponential x, mu = sd
std_mean <- function(x, mu, sd) {
  (mean(x) - mu) / (sd / sqrt(length(x)))
}
set.seed(100)
for (n in c(1, 10, 50, 200)) {
  means <- replicate(nsims, std_mean(rexp(n), mu = 1, sd = 1))
  hist(means, breaks = 99, main = paste("n =", n), col = "orange",
       border = "orange", xlim = c(-4, 4), probability = TRUE)
}
Comments:
In practice, we would have access to only one sample, and the underlying distribution F of X_i would be unknown. Why is the CLT still relevant? It helps to construct confidence intervals (CI) for \mu.
By the CLT, if n is not too small, the probability of the event \mu - z_{1-\alpha/2} \sigma(\hat\mu) < \hat \mu < \mu + z_{1-\alpha/2}\sigma(\hat\mu) is about 1 - \alpha, where z_\beta denotes the \beta-quantile of the standard normal distribution.
Consequently, P\big(\hat\mu - z_{1-\alpha/2} \sigma(\hat\mu) < \mu < \hat \mu + z_{1-\alpha/2}\sigma(\hat\mu)\big) \approx 1 - \alpha. Thus, an approximate (1-\alpha)\cdot 100\%-confidence interval for \mu is given by
[\hat\mu \pm z_{1-\alpha/2} \sigma(\hat\mu)]. Again, since we do not know \sigma in practice, we further approximate this approximation by [\hat\mu \pm z_{1-\alpha/2} \hat\sigma(\hat\mu)], where \hat\sigma(\hat\mu) = \hat\sigma / \sqrt{n} is the estimated standard error of the mean.
Remarks:
Let’s use a sample of values drawn from an exponential distribution to calculate an approximate 95% confidence interval for the true mean \mu.
set.seed(1)
n <- 50
x <- rexp(n)
mu_hat <- mean(x)
se_hat <- sd(x) / sqrt(n)
z <- qnorm(1 - 0.025) # 1.96
# z-confidence interval for mu
c(mu_hat - z * se_hat, mu_hat + z * se_hat)
## [1] 0.7368612 1.2312221
# Same via function
ci <- function(x, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  se <- sd(x) / sqrt(length(x))
  out <- mean(x) + -1:1 * z * se
  names(out) <- c("lci", "estimate", "uci")
  out
}
ci(x)
## lci estimate uci
## 0.7368612 0.9840417 1.2312221
# Compare with more conservative Student CIs
c(t.test(x)$conf.int) # What does c() do here?
## [1] 0.7306045 1.2374788
Comments:
Accuracy: In theory, the accuracy of a confidence interval for a parameter \theta can be assessed as the difference between its nominal coverage probability 1 - \alpha and its real coverage probability. The latter can be estimated by repeated sampling as follows: Draw many samples, calculate their confidence intervals and then calculate the proportion of intervals that contain \theta. In practice, we only have access to one sample and \theta is unknown, so we cannot calculate the real coverage probability and have to trust the nominal value.
In our artificial situation, we can use simulation to generate many samples and their confidence intervals, and then estimate the real coverage probability.
# Note: we need the function ci() from above
library(ggplot2)
n <- 50
nsims <- 1e4
mu <- 1
set.seed(260)
results <- replicate(nsims, ci(rexp(n, rate = 1 / mu))) |>
t() |>
data.frame() |>
transform(ok = (lci <= mu) & (mu <= uci))
head(results)
# Proportion correct
mean(results$ok)
## [1] 0.9294
# Plot confidence intervals of 100 samples (lengthy ggplot...)
ggplot(results[1:100, ], aes(x = rank(estimate), y = estimate, color = ok)) +
  geom_point() +
  geom_errorbar(aes(ymin = lci, ymax = uci)) +
  geom_hline(yintercept = mu, color = "darkgray") +
  coord_flip() +
  scale_color_viridis_d(begin = 0.4, end = 0.8, option = "magma") +
  theme(
    legend.position = "none",
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  labs(
    x = "Sorted by estimate",
    y = "values",
    title = "Confidence intervals calculated from 100 samples"
  )
Comments: Based on 10’000 samples, we estimate the real coverage probability as 93%, which is lower than the stated one (95%). This undesirable property might be a consequence of the skewness of the data distribution. (In the plot, five of 100 confidence intervals do not cover \mu = 1, corresponding to a coverage of exactly 95% in this subset.)
So far, we have been mainly concerned with the sample mean. What about estimators of other parameters?
Many estimators \hat \theta are asymptotically normal. Then, following the same logic as with the sample mean, approximate (1-\alpha)\cdot 100\% confidence intervals for their parameter \theta are given by I_{1-\alpha} = [\hat \theta \pm z_{1-\alpha/2} \hat \sigma(\hat\theta)]. The problem of such standard normal (or Wald) confidence intervals: For most estimators, no formula of their estimated standard error \hat\sigma(\hat \theta) is available. This is where the Bootstrap enters the stage.
Efron (1979) found a fully generic and computer-based way to estimate the standard error of any estimator: The Bootstrap.
Let \boldsymbol x = (x_1, \dots, x_n) be a specific sample from some unknown population, and suppose we want to estimate the standard error of an estimator \hat\theta(\boldsymbol x) of a population parameter \theta.
The Bootstrap estimate of standard error is calculated as follows: Draw B samples of size n with replacement from \boldsymbol x (the Bootstrap samples), calculate the estimator on each of them to obtain the Bootstrap replications \hat\theta^*_1, \dots, \hat\theta^*_B, and use the sample standard deviation of these B replications as the estimate of \sigma(\hat\theta).
Remarks:
In R, different contributed packages (e.g., “boot”) exist to perform Bootstrapping. However, standard Bootstrapping is so simple that we only need the key function sample(). It has multiple important modes and applications:
- sample(x, replace = TRUE): Bootstrap sample of vector x.
- X[sample(nrow(X), replace = TRUE), ]: Bootstrap sample of data/matrix X.
- sample(x): Random permutation of vector x. We will use it in the section “Permutation Tests”.
- X[sample(nrow(X)), ]: Shuffle the rows of data/matrix X.
- sample(nrow(X), 0.8 * nrow(X)): 80% randomly selected row indices of data X.
- sample(x, n, replace = TRUE, prob = ...): Generate n random draws from any discrete distribution with support x.
In this section, we will need the first two.
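A tiny demonstration of the first two modes (the toy vector and data frame below are made up just for this illustration):
# Toy data, just to illustrate the first two modes of sample()
set.seed(7)
x <- c(2.1, 3.5, 0.4, 1.8, 2.9)
sample(x, replace = TRUE) # Bootstrap sample of a vector
X <- data.frame(a = 1:5, b = x)
X[sample(nrow(X), replace = TRUE), ] # Bootstrap sample of the rows of X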
Let’s calculate the Bootstrap estimate of standard error for the sample mean, one of the few estimators for which we can explicitly estimate the standard error from the data.
library(ggplot2)
set.seed(30)
n <- 49
x <- rexp(n)
# Estimated standard error of mean
sd(x) / sqrt(n)
## [1] 0.1235472
# Almost same value via the Bootstrap
B <- 2000
boot <- numeric(B)
for (i in 1:B) {
  x_boot <- sample(x, replace = TRUE)
  boot[i] <- mean(x_boot)
}
# Bootstrapped standard error of the mean
sd(boot)
## [1] 0.1237622
ggplot(data.frame(Mean = boot), aes(x = Mean)) +
geom_histogram(fill = "chartreuse4", bins = 29) +
ggtitle("Histogram of Bootstrap replications")
Comment: The Bootstrap standard error is very similar to the usual estimate. Here, both are far away from the true standard error of the mean of 49 standard exponentials (1 / 7 \approx 0.1429).
According to Efron and Tibshirani (1993) (p. 47), 25 to 200 Bootstrap samples are usually sufficient to get a stable Bootstrap estimate of the standard error. The following curve suggests otherwise: We calculate the Bootstrap estimate of standard error for the first b = 20 Bootstrap replications, then for the first b = 21 replications, etc., to see how fast the estimates stabilize.
plot_stability <- function(x) { # x is the vector of Bootstrap replications
  df <- data.frame(x = x, b = 1:length(x))
  df$se <- sapply(df$b, function(i) sd(x[1:i])) # "cumsd"
  ggplot(subset(df, b >= 20), aes(b, se)) +
    geom_line(color = "chartreuse4") +
    ggtitle("Stability of Bootstrap Estimate of Standard Error")
}
plot_stability(boot) +
  geom_hline(yintercept = sd(x) / sqrt(n)) # Estimated standard error of the mean
Comment: The curve does not stabilize around the standard estimate (horizontal black line) before roughly 1200 Bootstrap samples.
Let’s turn to the median. No simple formula for its standard error exists. (For large samples and normally distributed values, it is \sqrt{\pi/2} \approx 1.253 times as large as the standard error of the mean.) In the last example, we now simply replace the mean with the median to get the Bootstrap standard error of the median.
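As a quick plausibility check of the factor \sqrt{\pi/2} mentioned in parentheses, here is a small simulation sketch (assuming standard normal data, unlike the exponential data of our running example):
# Sketch: compare standard errors of median and mean for normal data
set.seed(4)
n <- 100
medians <- replicate(1e4, median(rnorm(n)))
means <- replicate(1e4, mean(rnorm(n)))
sd(medians) / sd(means) # roughly sqrt(pi / 2) = 1.253 for large n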
To simplify the code, we use replicate() instead of the for loop. This has the advantage of not needing to initialize the resulting vector.
library(ggplot2)
set.seed(30)
n <- 49
x <- rexp(n)
# Sample median (Population median is ln(2) / lambda = ln(2) \approx 0.693)
median(x)
## [1] 0.8549164
boot <- replicate(2000, median(sample(x, replace = TRUE)))
# Bootstrapped standard error of the median
sd(boot)
## [1] 0.0756406
ggplot(data.frame(Median = boot), aes(x = Median)) +
geom_histogram(fill = "chartreuse4", bins = 29) +
ggtitle("Histogram of Bootstrap replications")
# Function defined in last example
plot_stability(boot)
Comment: In this example, around 1000 resamples would have been enough to find a stable Bootstrap estimate of the standard error of the median. In contrast to the last example on the mean, we cannot compare with a “standard” estimate; we have to rely on the validity of the Bootstrap. Note: Such a stability curve can be plotted in any situation.
The Bootstrap gives us a convenient way to estimate the standard error \sigma(\hat \theta) of any estimator \hat \theta. Assuming asymptotic normality of \hat \theta, we can calculate standard normal (1-\alpha)\cdot 100\%-confidence intervals [\hat \theta \pm z_{1-\alpha/2} \hat \sigma(\hat\theta)] for \theta, using the Bootstrap estimate of the standard error as \hat \sigma(\hat\theta).
Let’s revisit again the last example. We have estimated the standard error of the median by 0.0756. This leads to an approximate normal 95%-Bootstrap CI of 0.8549 \pm 1.96 \cdot 0.0756 \approx [0.707, 1.003]. The code to calculate it:
set.seed(30)
n <- 49
x <- rexp(n) # Population median is ln(2) / lambda = ln(2)
boot <- replicate(2000, median(sample(x, replace = TRUE)))
median(x) + c(-1, 1) * qnorm(0.975) * sd(boot)
## [1] 0.7066635 1.0031692
In the case of the median (and other quantiles), we actually do not need to rely on the Bootstrap to calculate confidence intervals as there is a fully nonparametric method based on order statistics. It is exact in the sense that the intervals are never anti-conservative, see Hahn, Meeker, and Escobar (2014). Additionally, it does not require resampling.
ci_median_nonpar <- function(x, alpha = 0.05, q = 0.5) {
  K <- qbinom(c(alpha / 2, 1 - alpha / 2), length(x), q) + 0:1
  sort(x)[K]
}
ci_median_nonpar(x) # [0.699, 1.013]
## [1] 0.6989941 1.0127069
In some of the examples above, we have plotted a histogram of the Bootstrap replications. A simple alternative to the standard normal confidence interval is to use the empirical (\alpha/2) and (1-\alpha/2) quantiles of the Bootstrap replications to build a so-called percentile Bootstrap confidence interval.
Compared to the standard normal approach, the percentile method does not rely on an approximately normal distribution of the estimator, and the resulting interval is range-preserving, i.e., it cannot extend beyond the range of possible values of \hat\theta. On the other hand, it typically requires more Bootstrap replications to estimate the quantiles accurately.
Let us also calculate the percentile Bootstrap confidence interval in our example on the median. Only two lines of code are required:
set.seed(30)
n <- 49
x <- rexp(n) # Population median is ln(2) / lambda = ln(2)
# Percentile Bootstrap
boot <- replicate(9999, median(sample(x, replace = TRUE)))
quantile(boot, c(0.025, 0.975), names = FALSE)
## [1] 0.6989941 1.0127069
Comment: This result is similar to the one obtained from the non-parametric approach based on order statistics.
Let’s compare the accuracy of the three types of confidence intervals for the population median in our artificial example. We use repeated sampling from the exponential distribution, and further repeat the process for different sample sizes (9, 49 and 99).
Because of the nested loops, the code takes a long time to run. Thus, we store the result on disk.
set.seed(30)
nsim <- 1000
sample_sizes <- as.character(c(9, 49, 99))
result <- array(
  dim = c(length(sample_sizes), 3, nsim),
  dimnames = list(sample_sizes, c("Norm", "Perc", "NP"), NULL)
)
is_ok <- function(ci, theta = log(2)) {
  (ci[1] <= theta) & (ci[2] >= theta)
}
# Function copy pasted from above
ci_median_nonpar <- function(x, alpha = 0.05, q = 0.5) {
  K <- qbinom(c(alpha / 2, 1 - alpha / 2), length(x), q) + 0:1
  sort(x)[K]
}
if (FALSE) { # Trick to skip code block without commenting it out
  system.time( # 23 minutes
    for (n in sample_sizes) {
      print(n)
      pb <- txtProgressBar(max = nsim, style = 3)
      for (i in 1:nsim) {
        x <- rexp(as.integer(n))
        boot <- replicate(9999, median(sample(x, replace = TRUE)))
        result[n, "Norm", i] <- is_ok(median(x) + c(-1, 1) * qnorm(0.975) * sd(boot))
        result[n, "Perc", i] <- is_ok(quantile(boot, c(0.025, 0.975), names = FALSE))
        result[n, "NP", i] <- is_ok(ci_median_nonpar(x))
        setTxtProgressBar(pb, i)
      }
    }
  )
  # saveRDS(result, "simulation/ci_median.rds")
} else {
  result <- readRDS("simulation/ci_median.rds")
}
(coverage_probs <- apply(result, 1:2, FUN = mean))
## Norm Perc NP
## 9 0.939 0.965 0.965
## 49 0.937 0.956 0.957
## 99 0.940 0.949 0.952
# Some tidyverse magic for visualization
library(tidyverse)
coverage_probs |>
data.frame() |>
rownames_to_column("n") |>
mutate(n = as.integer(n)) |>
pivot_longer(cols = -n, names_to = "method") |>
ggplot(aes(y = value, x = n, group = method, color = method)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0.95, color = "darkgrey") +
coord_cartesian(xlim = c(1, 100))
Comment: The coverage probabilities are surprisingly close to the nominal coverage of 95%, even for the tiny sample size of 9. The non-parametric intervals are conservative by construction. In this case, the standard normal confidence intervals are slightly anti-conservative.
At the cost of more complexity, more accurate Bootstrap confidence intervals can be constructed. One is the bias-corrected and accelerated BC_a confidence interval, which is an improved version of the percentile method (using different order statistics as limits) and is available in the “boot” package.
library(boot)
set.seed(30)
n <- 49
x <- rexp(n) # Population median is ln(2) / lambda = ln(2)
boot_dist <- boot(x, statistic = function(x, id) median(x[id]), R = 9999)
boot.ci(boot_dist)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 9999 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_dist)
##
## Intervals :
## Level Normal Basic
## 95% ( 0.7209, 1.0030 ) ( 0.6971, 1.0108 )
##
## Level Percentile BCa
## 95% ( 0.6990, 1.0127 ) ( 0.6609, 0.9119 )
## Calculations and Intervals on Original Scale
Comment: Four types of confidence intervals are returned: standard normal, basic, percentile, and BC_a.
Remark on bias correction: Sometimes, the Bootstrap is also used for bias correction of the estimate \hat \theta. We do not consider it here.
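Although we do not pursue it further, the idea can be sketched in a few lines, reusing the median example from above: the Bootstrap bias estimate is the mean of the Bootstrap replications minus the original estimate, and subtracting it yields a bias-corrected estimate.
# Sketch of Bootstrap bias correction (reusing the median example from above)
set.seed(30)
x <- rexp(49)
boot <- replicate(2000, median(sample(x, replace = TRUE)))
bias_hat <- mean(boot) - median(x) # Bootstrap estimate of the bias
median(x) - bias_hat # bias-corrected estimate of the population median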
Many interesting estimators are calculated from two or more samples, e.g., from different treatment groups in a clinical study. Some examples:
The logic for applying the Bootstrap is as before: resample B times with replacement, calculate the B Bootstrap replications of the estimator, and then use these B values to calculate standard errors and/or confidence intervals.
There are two options to perform the resampling: (1) resample within each group separately, keeping the group sizes fixed, or (2) resample rows of the combined dataset, letting the group sizes vary. Option 1 is usually preferred.
Let us construct two samples from different exponential distributions: One with true mean 2, the other with true mean 1. Imagine the values are weeks of hospital stay, and the two groups receive different surgical treatments. The task is to create a 95% confidence interval for the true mean difference. The classic approach would be to calculate a two-sample Student confidence interval with unequal variances.
library(ggplot2)
set.seed(1)
m <- 15
n <- 25
y1 <- rexp(m, 1 / 2)
y2 <- rexp(n, 1)
# Plot data
df <- data.frame(value = c(y1, y2), group = rep(c("A", "B"), times = c(m, n)))
ggplot(df, aes(group, value, fill = group)) +
geom_boxplot(varwidth = TRUE, show.legend = FALSE) +
scale_fill_viridis_d(begin = 0.5, end = 0.9)
# Estimate for true mean(y1) - mean(y2)
estimator <- function(y1, y2) {
  mean(y1) - mean(y2)
}
(est <- estimator(y1, y2))
## [1] 1.442493
# Welch's Student confidence interval for unequal variances
c(t.test(y1, y2)$conf.int)
## [1] 0.1477659 2.7372195
# Bootstrap CIs by resampling within each group
boot <- replicate(
  9999,
  estimator(
    sample(y1, replace = TRUE),
    sample(y2, replace = TRUE)
  )
)
ggplot(data.frame(Estimate = boot), aes(x = Estimate)) +
geom_histogram(fill = "chartreuse4", bins = 29) +
ggtitle("Histogram of the Bootstrap replications")
# Standard normal Bootstrap CI
est + c(-1, 1) * qnorm(0.975) * sd(boot)
## [1] 0.2826721 2.6023133
# Percentile Bootstrap CI
quantile(boot, c(0.025, 0.975))
## 2.5% 97.5%
## 0.4111158 2.7096968
Comment: The two Bootstrap intervals differ considerably and are much narrower than the Student interval. We do not know which of the three intervals is the best. Since we know the true distributions of the two samples in this example, we could again perform a simulation study and assess the accuracy of the intervals based on the difference between the actual and nominal coverage probabilities. In practice, this is not possible because the true distributions are unknown. Otherwise we wouldn’t need statistics, after all!
By resampling rows of a dataset, we can also handle other multivariate situations such as the Pearson correlation, Kendall’s rank correlation, or the R-squared of a linear regression. This allows us to study and measure associations between two or more variables using the Bootstrap.
In the following, we will estimate Bootstrap confidence intervals for the true Pearson correlation coefficient \theta of two variables using both the standard normal and the percentile Bootstrap approach. The percentile Bootstrap CI has the advantage of being range-preserving, i.e., staying within [-1, 1] (why?). We will compare the intervals with the classic approach assuming bivariate normality.
library(ggplot2)
x <- seq(0, 1, length.out = 25)
df <- data.frame(x = x, y = x^2)
# Scatterplot of the bivariate situation
ggplot(df, aes(x, y)) +
geom_point(color = "chartreuse4")
# Classic 95%-CI assuming bivariate normality
c(cor.test(~ x + y, data = df)$conf.int)
## [1] 0.9230414 0.9850674
# Estimate
estimator <- function(X) {
  cor(X)[1, 2]
}
(est <- estimator(df))
## [1] 0.9658906
# Bootstrap
set.seed(3)
boot <- replicate(
  9999,
  estimator(df[sample(nrow(df), replace = TRUE), ])
)
ggplot(data.frame(Estimate = boot), aes(x = Estimate)) +
geom_histogram(fill = "chartreuse4", bins = 29) +
ggtitle("Histogram of the Bootstrap replications")
# Standard normal Bootstrap CI
est + c(-1, 1) * qnorm(0.975) * sd(boot)
## [1] 0.9510755 0.9807056
# Percentile Bootstrap CI
quantile(boot, probs = c(0.025, 0.975))
## 2.5% 97.5%
## 0.9516588 0.9813144
Comment: The two Bootstrap CIs are similar, but quite different from the normal confidence interval. In this situation, it is difficult to say which interval is the best.
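To illustrate the broader point about resampling rows, here is a minimal sketch of a percentile Bootstrap confidence interval for the R-squared of a simple linear regression (the data-generating process below is hypothetical and chosen only for illustration):
# Sketch: percentile Bootstrap CI for the R-squared of a linear regression
# (hypothetical data, for illustration only)
set.seed(5)
n <- 60
df <- data.frame(x = runif(n))
df$y <- 1 + 2 * df$x + rnorm(n, sd = 0.5)
r_squared <- function(X) {
  summary(lm(y ~ x, data = X))$r.squared
}
boot <- replicate(
  2000,
  r_squared(df[sample(nrow(df), replace = TRUE), ])
)
quantile(boot, c(0.025, 0.975), names = FALSE)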
The main application of the Bootstrap is to estimate standard errors, confidence intervals, and bias. To test hypotheses about a parameter \theta, another resampling technique shines: permutation tests, first described in Fisher (1935) - long before the computer age! But what is a statistical test again?
The motivation of a statistical test is to show that some interesting alternative hypothesis H_1 about \theta holds, e.g., \theta \ne 0. This is done by measuring evidence against the contrary null hypothesis H_0 (e.g., \theta=0), using a suitable test statistic T. If the evidence is strong enough, we reject H_0 in favor of H_1.
The decision is often made with the help of the p value calculated from the data: It is the probability of observing at least as much evidence against H_0 as in the specific sample when H_0 is true, formally P_{H_0}\{T \ge t\}, where t is the observed value of the test statistic T (viewed as a random variable). Finding the distribution of T under H_0 usually requires mathematical statistics (and several assumptions have to be met).
If the p value is lower or equal to some prespecified significance level \alpha (often 0.05), we reject H_0.
Let’s illustrate the above concepts with the two-sample t-test. Under the null hypothesis of equal population means, and assuming both normality and equal variances, its test statistic (the standardized mean difference) follows a t distribution with n + m - 2 degrees of freedom, where n and m are the sample sizes of the two groups.
set.seed(8)
m <- 20
n <- 40
# Values in groups A and B
y1 <- rexp(m, 1)
y2 <- rexp(n, 1 / 2)
# Long form
y <- c(y1, y2)
x <- rep(c("A", "B"), times = c(m, n))
data.frame(y, x)[18:22, ]
boxplot(y ~ x, col = "chartreuse4", varwidth = TRUE)
# We should actually use the default option var.equal = FALSE
t.test(y ~ x, var.equal = TRUE)
##
## Two Sample t-test
##
## data: y by x
## t = -2.3297, df = 58, p-value = 0.02332
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -1.8962003 -0.1435874
## sample estimates:
## mean in group A mean in group B
## 0.8205254 1.8404192
# Calculate p value "by hand" (why multiply by 2?)
2 * (1 - pt(abs(-2.3297), df = m + n - 2))
## [1] 0.02332464
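As a further cross-check, the t statistic itself can be reproduced from the group summaries via the pooled-variance formula (a sketch relying on y1, y2, m, and n from the code above):
# Reproduce the t statistic via the pooled standard deviation
sp <- sqrt(((m - 1) * var(y1) + (n - 1) * var(y2)) / (m + n - 2))
(mean(y1) - mean(y2)) / (sp * sqrt(1 / m + 1 / n)) # matches t = -2.3297 above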
Comment: The sample mean is clearly larger in group B than in group A. What can we say about the true difference \theta between corresponding population means? The t-test gives a p value of 0.023. At the 5% significance level, we can therefore reject the null hypothesis of equal population means.
The code in the previous example indicates something conceptually important: A two-sample t-test can be viewed as a test of association between two variables, namely the binary grouping variable \boldsymbol X (with values "A" and "B") and the numeric response \boldsymbol Y.
The association between \boldsymbol X and \boldsymbol Y is measured in terms of a location shift between the group means.
Not only the two-sample t-test, but many additional parametric and non-parametric tests can be seen as tests for association between two variables. Such tests (and actually, many more) can be tackled by a computer-intensive and fully automatic technique: Permutation tests.
Let T be a test statistic measuring the strength of association between two variables \boldsymbol X and \boldsymbol Y with observations \boldsymbol x and \boldsymbol y. A permutation test finds the null distribution of T by repeatedly shuffling the values of one of the two variables: Shuffle \boldsymbol x (keeping \boldsymbol y fixed), recompute T on the shuffled data, and repeat this many times. The resulting values form the approximate null distribution of T, and the p value is the proportion of these values that are at least as extreme as the observed statistic t.
Remarks
Let’s perform the above two-sample comparison via a permutation test:
set.seed(8)
m <- 20
n <- 40
y1 <- rexp(m, 1)
y2 <- rexp(n, 1 / 2)
# Long form
y <- c(y1, y2)
x <- rep(c("A", "B"), times = c(m, n))
statistic <- function(x, y) {
  abs(mean(y[x == "B"]) - mean(y[x == "A"]))
}
# Calculate approximate null distribution
B <- 1e4
perm_dist <- replicate(B, statistic(sample(x), y))
# p value
mean(perm_dist >= statistic(x, y))
## [1] 0.0202
# Via "coin" package
library(coin)
independence_test(y ~ factor(x), distribution = approximate(nresample = 10000))
##
## Approximative General Independence Test
##
## data: y by factor(x) (A, B)
## Z = -2.2469, p-value = 0.0184
## alternative hypothesis: two.sided
Comment: Our permutation test gives very similar results to the one in the professional package “coin”. They are also similar to the parametric version (see last example). In all cases, we can reject the null hypothesis of no true mean difference at the 5% level.
According to the following figure, B = 10000 Monte Carlo samples seem large enough: The p value stabilizes relatively quickly.
successive_pvalues <- cumsum(perm_dist >= statistic(x, y)) / (1:B)
plot(successive_pvalues, pch = ".")
grid()
In the last example, we have compared the permutation test with the classic t-test for equal variances. In the specific case, it is difficult to judge which of the two approaches is better. In our artificial example with simulated data, we can use repeated sampling to calculate Type 1 error and power of the two tests.
The Type 1 error should be close to the stated significance level, while the power should be as large as possible.
Let’s calculate Type 1 error and power for the above artificial situation. We consider the classic t-test for equal variances, the permutation test, and also Welch’s t-test for unequal variances. The latter is similar to the classic t-test, but with corrected degrees of freedom (and is usually the recommended one of the two parametric t-tests).
To calculate Type 1 error, we simulate the two samples from the same exponential distribution. To calculate power, we draw them from exponential distributions with a mean difference of 1. Other mean differences would lead to other powers.
what <- "type_1_error" # "Power" # Select one of the two
nsim <- 1000
m <- 20
n <- 40
shift <- (what == "Power") # 0 or 1
result <- matrix(
  nrow = nsim, ncol = 3, dimnames = list(NULL, c("Welch", "Student", "Perm"))
)
statistic <- function(x, y) {
  abs(mean(y[x == "B"]) - mean(y[x == "A"]))
}
# Simulation
set.seed(20)
pb <- txtProgressBar(max = nsim, style = 3)
## | | | 0%
if (FALSE) { # Set to TRUE to rerun
  for (i in 1:nsim) {
    y1 <- rexp(m, 1)
    y2 <- rexp(n, 1 / (1 + shift))
    y <- c(y1, y2)
    x <- rep(c("A", "B"), times = c(m, n))
    # t-tests: First for unequal variances
    result[i, "Welch"] <- t.test(y ~ x)$p.value <= 0.05
    result[i, "Student"] <- t.test(y ~ x, var.equal = TRUE)$p.value <= 0.05
    # Permutation test with B = 10000
    perm_dist <- replicate(1e4, statistic(sample(x), y))
    perm_p <- mean(perm_dist >= statistic(x, y))
    result[i, "Perm"] <- perm_p <= 0.05
    setTxtProgressBar(pb, i)
  }
  saveRDS(colMeans(result), file = paste0("simulation/", what, ".rds"))
}
# Type 1 error
(type_1_error <- readRDS("simulation/type_1_error.rds"))
## Welch Student Perm
## 0.061 0.054 0.058
# Power under a mean shift of 1
(Power <- readRDS("simulation/Power.rds"))
## Welch Student Perm
## 0.729 0.581 0.602
Comment: In this situation, all tests are somewhat anti-conservative regarding Type 1 error. At the cost of a slightly higher Type 1 error, Welch’s t-test clearly has the highest power. The permutation test performs somewhere in the middle.
We can use permutation tests to replace almost any parametric or non-parametric test. To end the chapter on statistical inference, we will go through the following three situations: a two-sample comparison based on ranks (mimicking Wilcoxon’s rank sum test), a test of no correlation, and a comparison of two paired samples.
Many additional tests are available in the “coin” package - some of which do not even have an official name.
We can use permutation tests to mimic classic rank tests like Wilcoxon’s rank sum test. To do so, we replace the values by their rank in the pooled sample and then proceed as before.
Let’s test the alternative hypothesis that the values in one sample tend to be larger than in the other sample. We will use withr::with_seed() to set the random seed only temporarily, without affecting later code.
library(withr)
y1 <- c(1, 1, 1, 2, 3, 4, 5)
y2 <- c(4, 5, 5, 5, 6, 7, 8, 10)
# Long form
y <- c(y1, y2)
x <- rep(c("A", "B"), times = c(length(y1), length(y2)))
# Wilcoxon's rank sum test
wilcox.test(y ~ x)$p.value
## [1] 0.004052096
# Permutation test
statistic <- function(x, y) {
  r <- rank(y)
  abs(mean(r[x == "B"]) - mean(r[x == "A"]))
  # Could also use Wilcoxon's rank-sum statistic
}
with_seed(
  38,
  perm_dist <- replicate(1e4, statistic(sample(x), y))
)
mean(perm_dist >= statistic(x, y))
## [1] 0.0018
# Via "coin", using the exact "shift" algorithm by Streitberg and Röhmel (1984)
library(coin)
wilcox_test(y ~ factor(x), distribution = "exact")
##
## Exact Wilcoxon-Mann-Whitney Test
##
## data: y by factor(x) (A, B)
## Z = -2.9327, p-value = 0.00202
## alternative hypothesis: true mu is not equal to 0
Comment: At the 5% level, all tests lead to the same conclusion. Note that the permutation approach has no problem with ties.
Next, we test the null hypothesis of no correlation between two variables X and Y.
n <- 40
x <- seq(0, pi * 9 / 10, length.out = n)
y <- sin(x)
plot(y ~ x)
# Classic test based on assuming bivariate normality
cor.test(~ x + y)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 2.4439, df = 38, p-value = 0.01928
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.06443092 0.61001955
## sample estimates:
## cor
## 0.3685433
# Permutation test
statistic <- function(x, y) {
  abs(cor(x, y))
}
set.seed(98)
perm_dist <- replicate(1e4, statistic(sample(x), y))
mean(perm_dist >= statistic(x, y))
## [1] 0.0186
## Via coin
library(coin)
independence_test(y ~ x, distribution = approximate(nresample = 10000))
##
## Approximative General Independence Test
##
## data: y by x
## Z = 2.3016, p-value = 0.0196
## alternative hypothesis: two.sided
Comment: For any of the tests considered, we can reject the null hypothesis of no correlation and claim that there is some association between the two variables.
Permutation tests can also be used for testing hypotheses on two or more paired samples. The package “coin” uses a very general scheme in this case: Data are represented in long form, with an additional column “block” identifying the subject. Only “admissible” permutations are performed, i.e., permutations within blocks.
library(coin)
pre <- c(10, 10, 11, 15, 15, 20, 33, 35, 35)
post <- c(11, 12, 10, 20, 21, 22, 30, 35, 37)
# One-sample t-test (= t-test for two paired samples)
t.test(post - pre)
##
## One Sample t-test
##
## data: post - pre
## t = 1.6733, df = 8, p-value = 0.1328
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.5881572 3.6992683
## sample estimates:
## mean of x
## 1.555556
# Coin: Turn data into long form with three columns
n <- length(pre)
y <- c(pre, post)
x <- rep(c("pre", "post"), each = n)
block <- rep(1:n, times = 2)
data.frame(y, x, block)
independence_test(y ~ factor(x) | factor(block), distribution = "exact")
##
## Exact General Independence Test
##
## data: y by
## factor(x) (post, pre)
## stratified by factor(block)
## Z = 1.5275, p-value = 0.1719
## alternative hypothesis: two.sided
Comment: The exact permutation test provides a different p value, but in both cases we cannot reject the null hypothesis of equal means.
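To make the notion of “admissible” permutations more concrete, here is a minimal by-hand sketch: within each block (subject), the only admissible permutation is to swap the “pre” and “post” labels, which amounts to flipping the sign of that subject’s difference. (This sketch uses the absolute mean difference as test statistic; the “coin” test above uses a standardized statistic, so the p values need not agree exactly.)
# By-hand within-block permutation test for the paired data above:
# randomly swap pre/post within each subject, i.e., flip the sign of each difference
set.seed(63)
d <- post - pre
statistic <- function(d) abs(mean(d))
perm_dist <- replicate(1e4, statistic(d * sample(c(-1, 1), length(d), replace = TRUE)))
mean(perm_dist >= statistic(d)) # permutation p value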
1. In this exercise, we consider 95%-confidence intervals for the true mean of a uniform distribution. Use the plot_stability() function of the lecture notes to figure out after how many Bootstrap samples the Bootstrap estimate of the standard error would stabilize.
2. Consider the two samples y_1 = 1, 2, \dots, 21 and y_2 = 1, 2, \dots, 51.
3. For the situation in Exercise 1, use simulation to estimate the real coverage probabilities of the Student CI and the two types of Bootstrap CIs. What do you observe?
4. Here, we study a test on Spearman’s rank correlation. Write a function spearman_test2(x, y, B = 10000) that calculates a one-sided permutation p value for the null hypothesis of no positive monotonic association, i.e., you want to show the alternative hypothesis that the true rank correlation is positive. Compare your result with stats::cor.test(x, y, method = "s", alternative = "greater").
5. In the situation of Exercise 4: Use simulation to compare your approach with stats::cor.test() regarding…
In this chapter, we have revisited some basic results on statistical inference. Then, we met two key players in modern inferential statistics: The Bootstrap to calculate approximate confidence intervals for general parameters, and permutation tests to produce p values for a wide range of hypothesis tests.