Chapter 12 Analysis of Variance and Comparison of Multiple Groups

In this chapter, we introduce one-way analysis of variance (ANOVA) through the analysis of a motivating example.

Viñals et. al.⁸⁷ studied the effects of marijuana on mice. One group of ordinary (“wild type”) mice were given a dose of THC, the active ingredient in marijuana. Another group of wild type mice were given a vehicle, which is a shot with the same inactive ingredients but no THC.

The investigators measured the locomotor activity of the mice by placing each mouse in a box lined with photocells and then observing the total distance traveled. The results were reported as the percentage movement relative to the untreated VEH group.

The data is in the fosdata package in the mice_pot data set.

mice_pot <- fosdata::mice_pot

The data consists of 42 observations of 2 variables. The variable group is the dosage of THC that each mouse received, and percent_of_act measures locomotion as the percent of baseline activity. Start by exploring the data with summary statistics and a boxplot.

mice_pot %>%
  group_by(group) %>%
  summarize(
    mean = mean(percent_of_act),
    sd = sd(percent_of_act),
    skew = e1071::skewness(percent_of_act),
    N = n()
  )

group	mean	sd	skew	N
VEH	100.000	25.324	-0.190	15
0.3	97.323	31.457	0.133	9
1	99.052	26.258	-0.106	12
3	70.668	20.723	0.034	10

ggplot(mice_pot, aes(x = group, y = percent_of_act)) +
  geom_boxplot(outlier.size = -1) +
  geom_jitter(height = 0, width = .2)

The plot and summary statistics show that there is little difference between the first three groups, but there may well be a difference once the dose gets up to 3 mg/kg.

This chapter introduces a number of different hypothesis tests to test for a difference in the means of multiple groups. The first we consider is analysis of variance, abbreviated ANOVA.

For the mice, ANOVA will test whether the mean percent of activity relative to baseline is the same in all four groups, under the assumption of normality of each group, independence and equal variance. From the boxplot, mouse locomotion appears to be normally distributed, and each group has approximately the same variance. We know from the experimental design that the measurements on each mouse are independent.

To run ANOVA on the mouse experiment, first build a linear model of percent_of_act as explained by group, then run the anova command on the linear model:

mice_model <- lm(percent_of_act ~ group, data = mice_pot)
anova(mice_model)

## Analysis of Variance Table
## 
## Response: percent_of_act
##           Df Sum Sq Mean Sq F value Pr(>F)  
## group      3   6329 2109.65  3.1261 0.0357 *
## Residuals 42  28344  674.85                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The next section will explain the complex output of the anova function fully. The $p$-value for the test is given by the Pr(>F) field, so $p = 0.0357$ for this data, and the * means significance at the $\alpha = 0.05$ level. We reject the hypothesis that the mean percent of activity is the same for all four groups. The association between TCH treatment and locomotion is unlikely to be due to chance.

12.1 ANOVA

ANOVA is used when a single quantitative variable $X$ is measured on multiple groups. Since there is one categorical variable (the grouping) to explain $X$, this technique is specifically called one-way ANOVA.

12.1.1 Notation

ANOVA involves multiple observations over multiple groups, so the notation required is fairly complex:

There are $N$ total observations.
There are $k$ groups.
There are $n_i$ observations in the $i$th group.
$x_{ij}$ denotes the $j$th observation from the $i$th group.
$\overline{x}_{i\cdot}$ denotes the mean of the observations in the $i$th group.
$\overline{x}_{\cdot}$ denotes the mean of all observations.

Example 12.1

In mice_pot there are 4 groups so $k=4$. The observed variable $X$ is percent_of_act, percent of activity relative to baseline. From the following table, we see that $n_1 = 15$, $n_2 = 9$, $n_3 = 12$, and $n_4 = 10$.

table(mice_pot$group)

## 
## VEH 0.3   1   3 
##  15   9  12  10

There are 46 observations, so $N = 46$.

In group 1, the VEH group, we have $x_{11} = 54.2$, $x_{12} = 65.9$, and so on. This means the first mouse in group 1 has activity 54.2% of baseline, and the second had 65.9%. In group 2, the first mouse has activity $x_{21} = 98.8$, and in group three the first mouse has activity $x_{31} = 95.3$.

To compute the group means:

mice_pot %>%
  group_by(group) %>%
  summarize(round(mean(percent_of_act), 1))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 4 x 2
##   group `round(mean(percent_of_act), 1)`
##   <fct>                            <dbl>
## 1 VEH                              100  
## 2 0.3                               97.3
## 3 1                                 99.1
## 4 3                                 70.7

This tells us that $\overline{x}_{1\cdot} = 100$ and $\overline{x}_{4\cdot} = 70.7$. Group 1, the VEH group, is exactly 100 by definition, since the percentage of activity was based on what the VEH group did.

To get the overall mean, we use mean(mice_pot$percent_of_act), which gives $\overline{x}_{\cdot} = 92.9$.

12.1.2 The ANOVA model

For ANOVA, the model for observations is

\[ x_{ij} = \mu + \alpha_i + \epsilon_{ij}, \]

All of $\mu$, $\alpha_i$, and $\epsilon_{ij}$ are parameters that will be estimated from data. The parameter $\mu$ is the grand mean, the mean of the entire population. Each $\alpha_i$ represents the difference between the grand mean and the $i^\text{th}$ group mean, and $\epsilon_{ij}$ accounts for the variation within the group.

Using estimates from the data each individual data point is described by \[ x_{ij} = \overline{x}_\cdot + (\overline{x}_{i\cdot} - \overline{x}_\cdot) + (x_{ij} - \overline{x}_{i\cdot}) \]

The null hypothesis for ANOVA is that the group means are all equal (or equivalently all of the $\alpha_i$ are zero). So, if $\mu_i = \mu + \alpha_i$ is the true group mean, we have:

\[ H_0 : \mu_1 = \mu_2 = \cdots =\mu_k\\ H_a : \text{Not all $\mu_i$ are equal} \]

ANOVA is “analysis of variance” because we will compare the variation of data within individual groups with the overall variation of the data. We make assumptions (below) to predict the behavior of this comparison when $H_0$ is true. If the observed behavior is unlikely, it gives evidence against the null hypothesis. The general idea of comparing variance within groups to the variance between groups is applicable in a wide variety of situations.

Definition 12.2

In ANOVA, the variation within groups is \[ SSD_W = \sum_i \sum_j (x_{ij} - \overline{x}_{i\cdot})^2. \]

The variation between groups is \[ SSD_B = \sum_i\sum_j (\overline{x}_{i\cdot} - \overline{x})^2. \]

The total variation is \[ SSD_{\text{tot}} = \sum_i \sum_j (\overline{x} - x_{ij})^2. \]

The key mathematical result for ANOVA is the following theorem.

Theorem 12.1

The total variation is the sum of the variation between groups and the variation within the groups. Using the notation in Definition 12.2,

\[ SSD_{\text{tot}} = SSD_W + SSD_B. \]

If the variation between groups is much larger than the variation within groups, then that is evidence that there is a difference in the means of the groups. The question then becomes: how large is large? In order to answer that, we will need to examine the $SSD_B$ and $SSD_W$ terms more carefully.

Assumptions 12.1 (for ANOVA)

The population is normally distributed within each group
The variances of the groups are equal.
The observations are independent.

Assumption 1 is typical of parametric statistics, because it provides way to identify the data with known distributions.

Assumption 2 is a big one, and is investigated thoroughly in Section 12.3. There is no theoretical reason why the groups would have equal variances, so this must be checked before using ANOVA.

Assumption 3 is dependent on the experimental design. A common experimental design that violates Assumption 3 is to use the same unit in each group. Suppose the THC mouse experiment used the mice repeatedly, giving each of the four treatments to each mouse. Then the independence assumption would be false and ANOVA would be inappropriate. This mirrors the distinction between paired and independent two-sample tests.

Theorem 12.2

The test statistic for ANOVA is \[ F = \frac{SSD_B/(k - 1)}{SSD_W/(N - k)}. \] If the Assumptions (12.1) for ANOVA are met, then $F$ has the $F$ distribution with $k-1$ and $N-k$ degrees of freedom.

Proof: (sketch)

We have $k$ groups. There are $n_i$ observations in group $i$, and $N = \sum n_i$ total observations.

Notice that $SSD_W$ is almost $\sum\sum (\overline{x}_{ij} - \mu_i)^2$, in which case $SSD_W/\sigma^2$ would be a $\chi^2$ random variable with $N$ degrees of freedom. However, we are making $k$ replacements of means by their estimates, so we subtract $k$ to get that $SSD_B/\sigma^2$ is $\chi^2$ with $N - k$ degrees of freedom. (This follows the heuristic we started in Section 10.4).

As for $SSD_B$, it is trickier to see, but $\sum_i \sum_j (x_{i\cdot} - \mu)^2/\sigma^2$ would be $\chi^2$ with $k$ degrees of freedom. Replacing $\mu$ by $\overline{x}$ makes $SSD_B/\sigma^2$ a $\chi^2$ rv with $k - 1$ degrees of freedom.

The test statistic for ANOVA is \[ F = \frac{SSD_B/(k - 1)}{SSD_W/(N - k)} = \frac{(SSD_B/\sigma^2)/(k - 1))}{(SSD_W/\sigma^2)/(N - k))} \]

This is the ratio of two $\chi^2$ rvs divided by their degrees of freedom; hence, it is $F$ with $k - 1$ and $N - k$ degrees of freedom.

Through this derivation lacks details, it explains why Assumptions 12.1 are necessary. The assumption of equal variances is needed to cancel the unknown $\sigma^2$. The normality assumption is needed so that we get a known distribution for the test statistic.

12.1.3 Simulations

In this section, we confirm through simulation that the test statistic under the null hypothesis does have an F distribution.

We must assume normality and equal variances, and wish to simulate the the test statistic under the null hypothesis that all means are equal. So, we assume three populations, all of which are normally distributed with mean 0 and standard deviation 1. We arbitrarily decide to use 10 samples from group 1, 20 from group 2 and 30 from group 3, giving $k = 3$ and $N = 60$.

three_groups <- data.frame(
  group = rep(1:3, times = c(10, 20, 30)),
  value = rnorm(60, 0, 1)
)

We could use built in R functions to compute the test statistic, but it is worthwhile to understand all of the formulas so we do it step by step. We first make sure that group is a factor! We can also check that (in this case, anyway) $SSE_W + SSE_B = SSE_\text{tot}$.

three_groups$group <- factor(three_groups$group)
sse <- three_groups %>%
  mutate(total_mean = mean(value)) %>%
  group_by(group) %>%
  mutate(
    mu = mean(value),
    diff_within = value - mu,
    diff_between = mu - total_mean,
    diff_tot = value - total_mean
  ) %>%
  ungroup() %>%
  summarize(
    sse_within = sum(diff_within^2),
    sse_between = sum(diff_between^2),
    sse_total = sum(diff_tot^2)
  )
sse

## # A tibble: 1 x 3
##   sse_within sse_between sse_total
##        <dbl>       <dbl>     <dbl>
## 1       45.2        1.46      46.7

Now that we have computed the three sums of squared errors, we verify that $SSE_W + SSE_B = SSE_\text{tot}$.

sse$sse_within + sse$sse_between - sse$sse_total

## [1] 0

Now we can compute the test statistic.

k <- 3
N <- 60
(sse$sse_between / (k - 1)) / (sse$sse_within / (N - k))

## [1] 0.9198313

For simulation, it is faster to use the anova command to compute $F$ directly:

aov.mod <- anova(lm(value ~ group, data = three_groups))
aov.mod$`F value`[1]

## [1] 0.9198313

Now we are ready for the simulation.

sim_data <- replicate(1000, {
  three_groups <- data.frame(
    group = rep(1:3, times = c(10, 20, 30)),
    value = rnorm(60, 0, 1)
  )
  three_groups$group <- factor(three_groups$group)
  aov.mod <- anova(lm(value ~ group, data = three_groups))
  aov.mod$`F value`[1]
})
plot(density(sim_data),
  main = "F test statistic"
)
curve(df(x, 2, 57), add = T, col = 2)

Indeed, the test statistic distribution matches the $F$ distribution with 2 and 57 degrees of freedom.

12.2 The ANOVA test

To perform analysis of variance, compute the test statistic $F$. Because $F = 1$ under $H_0$ and the failure of $H_0$ can only make $F$ larger, ANOVA is a one-sided test.

If $F$ is much bigger than 1, then we reject $H_0: \mu_1 = \mu_2 = \cdots =\mu_k$ in favor of the alternative hypothesis that not all of the $\mu_i$ are equal. More precisely, the $p$-value is the probability of observing a value as large as $F$ or larger from the $F(k-1,N-k)$ distribution.

To perform ANOVA in R, use lm to build a linear model of the numeric variable on the grouping variable. Then run anova on the model object.

Alert 12.1

When performing anova in R, it is essential that the grouping variable is a factor. If the grouping variable is numeric you are using linear regression and will get the wrong result. R will not warn you about this. For this reason, always check that the degrees of freedom matches the number of groups minus 1. If DF is 1 and you have more than two groups, then you likely miscoded the grouping variable.

12.2.1 Example: THC mice

At the start of the chapter, we used anova to analyze the differences between four groups of mice given different doses of THC. In this section, we step through that computation carefully.

The data is infosdata::mice_pot.. We have already seen that the four distributions are approximately normal with approximately equal variance, and the measurements are independent. The null hypothesis is $H_0: \mu_1 = \cdots = \mu_4$, where $\mu_i$ is the true mean percent activity relative to baseline. The alternative hypothesis is that at least one of the means is different. We will test at the $\alpha = .05$ level.

mice_mod <- lm(percent_of_act ~ group, data = mice_pot)
anova(mice_mod)

## Analysis of Variance Table
## 
## Response: percent_of_act
##           Df Sum Sq Mean Sq F value Pr(>F)  
## group      3   6329 2109.65  3.1261 0.0357 *
## Residuals 42  28344  674.85                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results are significant ($p = .0357$), so we reject the null hypothesis that all of the means are the same.

We now explain the rest of the output from anova. The table contains all of the intermediate values used to compute $F$ as described in Section 12.1.2.

The two rows group and Residuals correspond to the between group and within group variation. The first column, Df gives the degrees of freedom in each case. Since $k = 4$, the between group variation has $k - 1 = 3$ degrees of freedom, and since $N = 46$, the within group variation (Residuals) has $N - k = 42$ degrees of freedom.

The Sum Sq column gives $SSD_B$ and $SSD_W$. The Mean Sq variable is the Sum Sq value divided by the degrees of freedom. These two numbers are the numerator and denominator of the test statistic, $F$. So here, $F = 2109.65/674.85 = 3.1261$.

To compute the $p$-value, we need the area under the tail of the $F$ distribution above $F = 3.1261$. This is a one-tailed test, because any deviation from $H_0$ makes small values of $F$ less likely.

We may compute the $p$-value from $F$ using:

pf(3.1261, df1 = 3, df2 = 42, lower.tail = FALSE)

## [1] 0.03570167

This matches the Pr(>F) value used earlier to reject $H_0$.

12.2.2 Example: Humanization

Different social groups may hold strong opinions about each other. A 2018 study by Davies et. al. ⁸⁸ examined American opinions about Pakistanis, and whether those opinions changed when participants learned about humanitarian behavior by Pakistanis. American participants in the study were read a brief synopsis of the tragedy of Hurricane Katrina, and then:

Told nothing further (the control group).
Told about the aid that Pakistan sent, and told low numbers for the amount of that aid.
Told about the aid that Pakistan sent, and told high numbers for the amount of that aid.

Afterwards, the participants were asked how strongly they believed Pakistanis would have felt both secondary and primary emotions following the disaster, and the mean of their responses was taken.

We plan on analyzing whether there is a difference in the mean of the three groups’ belief of how strongly Pakistanis would have felt secondary emotions. One issue with the technique that we are using is that we are averaging the ordinal scale responses to the seven questions on emotions, which is a common, but not always valid, thing to do.

We can load the data via the following.

humanization <- fosdata::humanization

We will first examine the data to see whether it appears to be approximately normal with equal variances across the groups. We are only concerned in this problem with the people who were told about Hurricane Katrina.

kat_humanization <- humanization %>%
  filter(stringr::str_detect(group, "Kat"))
ggplot(kat_humanization, aes(x = group, y = pak_sec)) +
  geom_boxplot() +
  geom_jitter(height = 0, width = 0.2)

Those don’t look too bad, except that the control group and the low group both have more values exactly equal to 1 than would be expected in a normal distribution. Let’s look at the mean, standard deviation and sample size in each group.

kat_humanization %>%
  group_by(group) %>%
  summarize(
    mean = mean(pak_sec),
    sd = sd(pak_sec),
    n = n()
  )

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 4
##   group            mean    sd     n
##   <fct>           <dbl> <dbl> <int>
## 1 Katrina control  2.24 0.927    54
## 2 Katrina low      2.65 0.911    53
## 3 Katrina high     2.72 0.794    54

It’s also a good idea to do a histogram.

ggplot(kat_humanization, aes(x = pak_sec)) +
  geom_histogram(bins = 8) +
  facet_wrap(~group)

To do ANOVA in R, first build a linear model for pak_sec as explained by group, and then apply the anova function to the model.

human_mod <- lm(pak_sec ~ group, data = kat_humanization)
anova(human_mod)

## Analysis of Variance Table
## 
## Response: pak_sec
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## group       2   7.456  3.7279  4.8269 0.009231 **
## Residuals 158 122.025  0.7723                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table gives the $p$-value as Pr(>F), here $p = 0.009231$. According to the $p$-value, we reject $H_0$ and conclude that not all three of the groups had the same mean. The test suggests that the story told had an effect on the opinions of the American participants.

12.3 Unequal variance

The equal variances assumption of ANOVA is often difficult to verify. In this section, we introduce a variant of ANOVA that corrects for unequal variance. We then use simulation to explore the effect of unequal variance on the results of ANOVA.

12.3.1 The oneway.test

The function oneway.test operates similarly to ANOVA but performs an approximation procedure (called Welch’s correction) to correct for unequal variance. We introduce oneway.test with an example of data involving the graying of chimpanzees.

The fosdata::chimps data comes from a 2020 study by Tapanes et. al.⁸⁹ The investigators showed photographs of chimpanzees to human judges, who rated the grayness of chimpanzee facial hair on a scale of 1-6.

By Ikiwaner - Own work, GFDL 1.2, https://commons.wikimedia.org/w/index.php?curid=11343938

The goal of the study was to determine whether gray hair in chimpanzees increases with age, as it does with humans. The authors found that for chimpanzees aged 30 and older, there does not seem to be an age effect. We will use the data in a different way. The study included chimpanzees from three different locations: the Ngogo community in Uganda, the South community at Taï National Park, Ivory Coast, and the captive population of the New Iberia Research Center (NIRC). We wish to determine whether the mean gray hair is different for chimpanzees in the three groups.

Since the study found no age effect for older chimpanzees, we may ignore age in our analysis if we restrict to chimps aged 30 and older. One chimpanzee, Brownface, has two pictures in the data set. To maintain independence, we arbitrarily chose to remove his 2011 photograph. This choice did not affect the conclusions of our statistical analysis. We load and clean the data first:

old_chimps <- fosdata::chimps %>%
  filter(age >= 30) %>%
  filter(!(individual == "Brownface" & year == 2011))

Next, we make a boxplot to visually compare the three groups.

old_chimps %>%
  ggplot(aes(x = population, y = grey_score_avg)) +
  geom_boxplot()

Based on this plot, it appears there may be a difference in mean gray score by population. However, the variances of the three groups are visibly different. The box and whiskers have more spread for NGOGO and less for TAI. Using ANOVA on this data could give misleading results.

The one-way test has the same independence and normality assumptions as ANOVA, and the same hypotheses. For the chimps,

\[ H_0: \mu_\text{NGOGO} = \mu_\text{NIRC} = \mu_\text{TAI}, \quad H_a: \text{not all of the $\mu_i$ are equal}. \]

We test grey_score_avg as explained by the factor variable population:

oneway.test(grey_score_avg ~ population, data = old_chimps)

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  grey_score_avg and population
## F = 4.2285, num df = 2.000, denom df = 18.927, p-value =
## 0.03033

The test results in $p$-value of $.03033$, so we reject $H_0$ at the $\alpha = 0.05$ level. We conclude that the three groups do not have the same grayness.

Incorrectly using anova in spite of the unequal variance would result in the less reliable value $p = 0.14$, which fails to detect a difference among the three groups.

12.3.2 Error simulations

For the chimpanzee data, unequal variance led to a striking difference between the results of ANOVA and the results of oneway.test. In this section, simulations show how much unequal variances might affect the accuracy of ANOVA.

The simulations in this section assume the underlying populations are normal and independent, but have unequal variances. We will measure the effective type I error rates when $\alpha = .05$. The results will depend on whether or not our groups have equal or unequal sample sizes - generally the behavior of ANOVA worsens when sample sizes are unequal.

Equal sample sizes

We create three groups of data, each size $n_i = 30$, all normally distributed with mean 0. Data in the first two groups has standard deviation 1, and in the third group we use standard deviation $\sigma$. The code below creates a function to run the simulation for various values of $\sigma$. For each $\sigma$, it calculates the probability of type I error, where we have $p < .05$ and incorrectly reject $H_0$.

variances <- c(0.01, 0.05, 0.1, 0.5, 5, 10, 50, 100)

sapply(variances, function(sigma) {
  sim_data <- replicate(10000, {
    three_groups <- data.frame(
      group = factor(rep(1:3, each = 30)),
      values = rnorm(90, 0, rep(c(1, 1, sigma), each = 30))
    )
    anova(lm(values ~ group, data = three_groups))$`Pr(>F)`[1]
  })
  mean(sim_data < 0.05)
})

The simulation takes a considerable time to run for 10000 trials, and results in the following:

$\sigma$	0.01	0.05	0.1	0.5	5	10	50	100
$P(p < .05)$	0.0649	0.0639	0.0608	0.0572	0.0779	0.0866	0.0902	0.0911

The effective type I error is consistently too high. A correctly functioning test would have type I error of 0.05 in this setting. Things are worse when the third group has large variance, not quite as bad when it has small variance.

Unequal sample sizes

In this simulation, the three groups have sample sizes $n_1 = 30$, $n_2 = 30$, and $n_3$ which will vary from 10 to 500. For each value of $n_3$, we create data with three groups that are normal, mean 0, and have standard deviation 1, 1, and 5 respectively. In each replication, the $p$ value is checked against $\alpha = 0.05$.

sample_sizes <- c(10, 20, 50, 100, 500)
sapply(sample_sizes, function(n3) {
  sim_data <- replicate(10000, {
    three_groups <- data.frame(
      group = factor(rep(1:3, times = c(30, 30, n3))),
      values = rnorm(60 + n3, 0, rep(c(1, 1, 5), times = c(30, 30, n3)))
    )
    anova(lm(values ~ group, data = three_groups))$`Pr(>F)`[1]
  })
  mean(sim_data < .05)
})

$n_3$	10	20	50	100	500
$P(p < .05)$	0.3031	0.1483	0.0263	0.0028	0.0000

Unequal sample sizes exacerbate the problem! These type I error rates would be 5% if the test was working. When the different group’s sample size is small relative to the other two, the error rate is as high as 30%. This means that 30% of the time, the test detects a difference in means where none exists. This is very bad, because it can lead investigators to find significance when none exists.

When the sample size is large the rate falls to below 0.01%, indicating the test is failing to work as advertised.

The oneway.test does not depend on equal variances, so it should work properly under the same conditions:

sample_sizes <- c(10, 20, 50, 100, 500)
sapply(sample_sizes, function(n3) {
  sim_data <- replicate(10000, {
    three_groups <- data.frame(
      group = factor(rep(1:3, times = c(30, 30, n3))),
      values = rnorm(60 + n3, 0, rep(c(1, 1, 5), times = c(30, 30, n3)))
    )
    oneway.test(values ~ group, data = three_groups)$p.value
  })
  mean(sim_data < .05)
})

$n_3$	10	20	50	100	500
$P(p < .05)$	0.0482	0.0490	0.0502	0.0513	0.0489

As desired, the effective type I error rate when $\alpha = .05$ is approximately $.05$ in all cases.

If you suspect your populations may not follow the assumptions for ANOVA, simulations give a measure of how big a problem it will cause. For example, if you have two samples of size 30 and one of size 10, then you want to be pretty strict about the equal variance assumption. If your sample sizes are all 30, then you don’t need to be as strict.

12.4 Pairwise t-tests

Suppose we have $k$ groups of observations and wish to detect a difference in means. It is tempting to perform t-tests on some or all of the groups, especially if two groups seem quite different. This sort of analysis is problematic because the selection of two groups biases the results. Given enough groups, even if all come from the same population it is quite likely that two of them will appear to be different.

One approach that avoids the issue of selection is to test all possible pairs. For $k$ groups, this results in $\binom{k}{2}$ different t-tests. The resulting $p$-values must be adjusted upward to account for the multiple tests. This method is called pairwise t-tests.

Pairwise t-tests have the advantage of showing directly which pairs of groups have significant differences in means. They also apply in experiments where subjects are reused across all groups, when ANOVA is not appropriate due to dependence (see Exercise 12.18).

On the other hand, pairwise t-tests do not directly test the hypothesis of interest, $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ versus $H_a:$ at least one of the means is different. Generally ANOVA will have more power (see Exercise 12.14).

The R command for doing pairwise $t$ tests is pairwise.t.test. The function pairwise.t.test requires two values, x the measured value or response vector and g the grouping variable.

Example 12.3

Perform pairwise t-tests to determine differences between groups for the mice_pot data set.

pairwise.t.test(mice_pot$percent_of_act, mice_pot$group)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  mice_pot$percent_of_act and mice_pot$group 
## 
##     VEH   0.3   1    
## 0.3 1.000 -     -    
## 1   1.000 1.000 -    
## 3   0.050 0.124 0.072
## 
## P value adjustment method: holm

The only pair approaching significance is the test between the control (VEH) group and the treatment group that received the highest THC dose, 3 mg/kg. With this result, it would be hard to claim a relationship between THC and locomotive impairment. At the start of the chapter, we used ANOVA to calculate a single $p$-value of $0.0357$, somewhat stronger evidence of a difference between the four groups of mice.

Example 12.4

Perform pairwise t-tests to determine differences between groups for the humanization data set from Section 12.2.2.

pairwise.t.test(kat_humanization$pak_sec, kat_humanization$group)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  kat_humanization$pak_sec and kat_humanization$group 
## 
##              Katrina control Katrina low
## Katrina low  0.033           -          
## Katrina high 0.013           0.647      
## 
## P value adjustment method: holm

There is a significant difference at $\alpha = .05$ between the control and both the low ($p = 0.033$) and high ($p = 0.013$) levels of aid stories, but no significant difference between the high and low level of aid stories ($p = 0.647$). Compare these values with the ANOVA $p$-value of 0.009 from Section 12.2.2, which showed an overall difference among the groups.

The pairwise.t.test function has additional options to control its behavior.

p.adjust.method sets the method for adjusting the $p$ values, and can be “none”, “holm” and “bonferroni”. These options are explored in the next section.
pool.sd indicates whether we want to pool the standard deviation across all of the groups to get an estimate of the common standard deviation, or whether we estimate the standard deviation of each group separately. Set to TRUE if we know that the variances of all the groups are equal, or possibly if some of the groups are small. Otherwise, FALSE.
alternative can be used for one sided-tests. The alternate hypothesis for each pair will be that the lower numbered group has a mean strictly less than (or greater than) the higher numbered group. This means that the ordering of groups is important.

12.4.1 Corrections and FWER

This section investigates the methods for adjusting $p$-values when performing multiple tests. First, we define a measure for the failure of multiple tests.

Definition 12.5

Suppose that multiple hypothesis tests are conducted. The Family Wide Error Rate is the probability of making at least one type I error in any of the tests.

Example 12.6

Estimate the FWER for an uncorrected pairwise t-test on six iid groups, each normal with mean 0 and sd 1.

As a first, step, generate one set of data and observe the results. In the pairwise t-test, set p.adjust.method to “none” so no $p$-value correction is performed.

six_groups <- data.frame(
  values = rnorm(60),
  groups = factor(rep(1:6, each = 10))
)
pvals <- pairwise.t.test(six_groups$values,
  six_groups$groups,
  p.adjust.method = "none"
)$p.value
pvals

##           1         2         3         4         5
## 2 0.7694607        NA        NA        NA        NA
## 3 0.5048147 0.3383610        NA        NA        NA
## 4 0.9769938 0.7475434 0.5233001        NA        NA
## 5 0.9961214 0.7731745 0.5017338 0.9731171        NA
## 6 0.9774319 0.7911463 0.4870307 0.9544448 0.9813092

By looking at the results, we can tell that none of the 15 $p$-values are less than 0.05. To simulate, we use the any function to detect whether any of the $p$-values are less than 0.05, indicating a family-wide error.

## [1] 0.3459

sim_data <- replicate(10000, {
  six_groups <- data.frame(
    values = rnorm(60),
    groups = factor(rep(1:6, each = 10))
  )
  pvals <- pairwise.t.test(six_groups$values,
    six_groups$groups,
    p.adjust.method = "none"
  )$p.value
  any(pvals < .05, na.rm = TRUE)
})
mean(sim_data)

The simulation gives a FWER of 0.3511, so an uncorrected pairwise t-test would result in a type I error about 35% of the time. A correction to the $p$-values for multiple tests is certainly required!

The Bonferroni correction to the pairwise t-test, multiplies each $p$-value by the total number of tests, capping $p$ at 1 if necessary. The Bonferroni correction is a conservative approach, and will lead to a FWER of less than the specified $\alpha$ level.

Repeating the simulation in Example 12.6 using p.adjust.method = "bonferroni" results in a FWER of about 0.04, which is indeed less than $\alpha = .05$.

A more sophisticated technique is the Holm correction. The FWER of Holm is the same as that of Bonferroni, but in general Holm is more powerful. For this reason, Holm is the preferred method and the default used by pairwise.t.test.

In Exercise 12.17, you are asked to show that the power of “holm” is higher than the power of “bonferroni” in a specific instance.

Vignette: Reproducibility

Up to this point in the book, we have emphasized writing reproducible code. Reproducible code means that anyone can take the R script or R Markdown file from a statistical analysis and run it as is to get exactly the results that were reported. This is important because it can often be challenging to understand which exact details were used for a statistical test when reading the summary of an experiment. Providing reproducible code eliminates any confusion about what techniques were used.

A second notion of reproducibility is that of reproducible science. A crucial aspect of the scientific method is that experiments should be reproducible. There is currently a reproducibility crisis in several scientific areas, in the sense that it is believed that many published results would not be reproduced if the experiment was repeated. Here are some reasons for that.

Significance: If you say that a $p$-value less than .05 means that we can reject $H_0$, that means that we will reject $H_0$ five percent of the time even when it is true. Now, consider how many thousands of statistical tests are run each day. Many (in fact, 5%) of those tests will result in $p < .05$ simply due to chance. This leads to a problem reproducing significance.
Power and effect size: Many experiments do consider the sample size needed to attain a reasonable power for the test they will be running. When a test is underpowered and yet $p < .05$, frequently the effect size gets overestimated. That is, if we are interested not only in whether $H_0$ is false but also how big of a difference from $H_0$ there is, we will overestimate that difference in an underpowered test. This leads to a problem reproducing effect sizes. We recommend doing a power analysis whenever possible before collecting data.
Data dredging: Some experimenters will take measurements on hundreds or thousands of variables, and look for interesting patterns. If the experimenters do enough tests and check on enough things, then eventually they will find something that is statistically significant. This is fine, as long as the experimenters report all of the tests that they conducted or considered in addition to the ones that were significant. Unethical researchers might not report on the data that wasn’t signifcant, which is called $p$-hacking. Even if researchers report all of the tests that were run, people can be misled by the statistically significant results, and assign them more importance than deserved. Both $p$-hacking and misinterpretation lead to reproducibility issues. We recommend reporting all tests that were run (or considered, but not run based on the data). We also recommend making all data and scripts public whenever possible.
Researcher degrees of freedom: There are many decisions that go into collecting and analyzing data. Consider outliers. If we think that data is an outlier or may have been miscoded, it is common practice to check with the person who collected the data. How do we decide what appears to be an outlier? If there are too many, then maybe we use a Wilcoxon Rank Sum test or other method that is robust to outliers rather than a $t$-test. If we are using a $t$-test, how do we decide whether var.equal is TRUE or FALSE? Do you perform a log transformation of the data? Is the alternative hypothesis one-sided or two-sided? There are any number of things that we may or may not do when analyzing a data set, and each one has an effect on the outcome that is hard to quantify. As a solution, we recommend pre-registration. Pre-registration means the experimenters state the details of the data collection and analysis procedure before starting to collect data. Any deviance from the pre-registered procedure should be noted and justified.
Replication incentives: Replication studies are not seen as exciting and result in little social or fianacial reward. They are more difficult to publish, since they are not original results. Not surprisingly, replication studies are conducted far less often than original research. The dearth of replication studies has allowed research with statistical design flaws to linger and gain credence, leading to larger crises when the results cannot be reproduced.

Exercises

Exercises 12.1 - 12.4 require material through Section 12.1.

12.1

Consider the chimps data set from the fosdata package. This data set contains 169 observations of (among other things) the average grey hair score grey_score_avg of chimps together with a population that the chimps come from. Suppose that the order of the populations is NIRC, NGOGO, TAI. In the notation of Section 12.1, find $n_1, n_2, n_3$ as well as $\overline{x}_{i\cdot}$ for $i = 1, 2, 3$.

12.2

If you have 30 independent observations of 3 groups, each of which are normal with the same mean and standard deviation, what is the distribution of the ANOVA test statistic $F$? Include degrees of freedom.

12.3

Create a simulation that shows that the ANOVA test statistic follows an $F$ distribution under the null hypothesis. Assume that you have 4 groups, each with 25 subjects, and each with mean 1 and standard deviation 2. Show via simulation that the ANOVA test statistic $F$ is an $F$ random variable with $k - 1$ and $N - k$ degrees of freedom.

12.4

Consider the penguins data set in the palmerpenguins package that was discussed in Chapter 11.

Create a boxplot of the variable bill_depth_mm by species. Does it appear that the data satisfies the assumptions for ANOVA?
Test $H_0$ all of the mean bill depths of the species are the same versus $H_a$ not all of the mean bill depths are the same at the $\alpha = .01$ level.
If you reject $H_0$, perform a post-hoc analysis to determine which pairs of mean bill depths are the same, and which are different at the $\alpha = .01$ level.

Exercises 12.5 - 12.10 require material through Section 12.2.

12.5

Consider the weight_estimate data in the fosdata package. Children and adults were asked to estimate the weight of an object that an actor picks up. Conduct an ANOVA test at the $\alpha = .05$ level to test whether the means of the weight estimates for the 300 gram object are the same across the age groups in the study.

12.6

Consider the case0502 data from Sleuth3. Dr Benjamin Spock was tried in Boston for encouraging young men not to register for the draft. It was conjectured that the judge in Spock’s trial did not have appropriate representation of women. The jurors were supposed to be selected by taking a random sample of 30 people (called venires), from which the jurors would be chosen. In the data case0502, the percent of women in 7 judges’ venires are given.

Create a boxplot of the percent women for each of the 7 judges. Comment on whether you believe that Spock’s lawyers might have a point.
Determine whether there is a significant difference in the percent of women included in the 6 judges’ venires who aren’t Spock’s judge.
Determine whether there is a significant difference in the percent of women induced in Spock’s venires versus the percent included in the other judges’ venires combined. (Your answer to b. should justify doing this.)

12.7

Consider the data in ex0524 in Sleuth3. This gives 2584 data points, where the subjects’ IQ was measured in 1979 and their Income was measured in 2005. Create a boxplot of income for each of the IQ quartiles. Is there evidence to suggest that there is a difference in mean income among the 4 groups of IQ? Check the assumptions necessary to do ANOVA. Does this cause you to doubt your result?

12.8

The built in data set chickwts measures weights of chicks after six weeks eating various types of feed. Test if the the mean weight is the same for all feed types. Are the assumptions for ANOVA reasonable?

12.9

The built in data set morley gives speed of light measurements for five experiments done by Michelson in 1879.

Create a boxplot with one box for each of the five experiments.
Use ANOVA to test if the five experiments have the same mean. What do you conclude?

12.10

The data set MASS::immer from the MASS package has data on a trial of barley varieties performed by Immer, Hayes, and Powers in the early 1930’s. Test if the total yield (sum of 1931 and 1932) depends on the variety of barley. Are the assumptions for ANOVA reasonable?

Exercises 12.11 - 12.13 require material through Section 12.3.

12.11

The built in data set InsectSprays records the counts of living insects in agricultural experimental units treated with different insecticides.

Plot the data. Three of the sprays seem to be more effective (less insects collected) than the other three sprays. Which three are the more effective sprays?
Use ANOVA to test if the three effective sprays have the same mean. What do you conclude?
Use ANOVA to test if the three less effective sprays have the same mean. What do you conclude?
Would it be appropriate to use ANOVA on the entire data set? Why or why not?
Use oneway.test to test the null hypothesis that all of the mean insect counts are the same for the various sprays versus the alternative that they are not, at the $\alpha = .01$ level.

12.12

Estimate the effective type I error rate when performing ANOVA on 4 groups at the $\alpha = .05$ level. Assume groups 1-3 are size 20 and group 4 is size 50. Assume all groups have mean 0, groups 1-3 have standard deviation 1, and group 4 has standard deviation 5.

12.13

Repeat Exercise 12.12, but assume all groups have size 20. Is the effective type I error closer to the $\alpha = .05$ level, or further from it?

Exercises 12.14 - 12.21 require material through Section 12.4.

12.14

Suppose you have 6 groups of 15 observations each. Groups 1-3 are normal with mean 0 and standard deviation 3, while groups 4-6 are normal with mean 1 and standard deviation 3. Compare via simulation the power of the following tests of $H_0: \mu_1 = \dotsb = \mu_6$ versus the alternative that at least one pair is different at the $\alpha = .05$ level.

ANOVA. (Hint: anova(lm(formula, data))[1,5] pulls out the p-value of ANOVA.)
pairwise.t.test, where we reject the null hypothesis if any of the paired $p$-values are less than 0.05.

The next two exercises use the data set flint from the fosdata package. The data set fosdata::flint contains data on lead levels in water samples taken from Flint, Michigan, during the “Flint water crisis” of 2015. Three lead measurements were taken at each house, Pb1, Pb2, and Pb3, on first draw, after 45 seconds, and after 2 minutes of water running. As observed in 9.12, the lead levels are highly skew.

12.15

Make a box plot showing the distribution of the log levels of Pb1 for each ward. Notice that ward 0 contains only a single sample. In fact, there is no ward 0 in Flint. Remove that data point before continuing.
Do the log Pb1 levels for each ward appear normally distributed? Are the variances approximately equal across wards?
Use ANOVA to determine if there is a difference in log Pb1 levels across the wards of Flint. Report a p-value with your answer.

12.16

Make a box plot showing the distribution of the log levels of Pb1, Pb2, and Pb3. Hint: use pivot_longer to tidy the data.
Do the three log levels appear normally distributed? Are their variances approximately equal?
Explain why it is inappropriate to use ANOVA to test for a difference among the means of the three log lead levels.
Use a pairwise t-test to determine if there is a difference in log lead level between the first draw, 45 second, and 2 minute samples.

12.17

Suppose you have three populations, all of which are normal with standard deviation 1, and with means $0$, $0.5$ and $1$. You take samples of size 30 and perform a pairwise $t$ test.

Estimate the percentage of times that at least one of the null hypotheses being tested are rejected when using the “bonferroni” $p$-value adjustment.
Estimate the percentage of times that at least one of the null hypotheses being tested are rejected when using the “holm” $p$-value adjustment.

12.18

Consider the data set cows_small in the fosdata package. In this data set, cows in Davis, CA were sprayed with water on hot days to try to cool them down. Each cow was measured with no spray (control), nozzle TK-0.75, and nozzle TK-12.

Is one-way ANOVA appropriate to use on this data set? If so, do it. If not, explain why not.
Perform pairwise $t$ tests on the three groups with holm $p$-value adjustment and explain.

12.19

Suppose you perform pairwise t-tests on $6$ iid groups.

How many tests are required?
At the $\alpha = 0.05$ level, what is the probability of type I error for a single one of these tests?
Assume the pairwise tests are independent. Compute the exact probability that a type I error occurs on at least one of these tests.
Compare your answer from part (c) to results of the simulation in 12.6. Are the pairwise tests independent?

12.20

Consider the weight_estimate data in the fosdata package. Children and adults were asked to estimate the weight of an object that an actor picks up. In Exercise 12.5, you conducted an ANOVA test at the $\alpha = .05$ level to test whether the means of the weight estimates for the 300 gram object are the same across the age groups in the study. Use pairwise.t.test to test all pairs of means at the $\alpha = .05$ level.

12.21

Consider the frogs data set in the fosdata package. This data set contains measurements from 6 species of frogs, including one new species, dakha, that the authors of the paper found. When establishing that a group of animals is a new species, it can be important to find physiological differences between the group of animals and animals that are nearest genetically. Read more about the data set by typing ?fosdata::frogs.

Create a boxplot of the distance from front of eyes to the nostril versus species. Does it appear that the data satisfies the assumptions for ANOVA?
Test $H_0$: all of the mean distances of the species are the same versus $H_a$ not all of the mean distances of the species are the same at the $\alpha = .05$ level.
If you reject $H_0$, look up TukeyHSD in base R, and use it to perform a post-hoc analysis to determine which pairs of mean bill depths are the same, and which are different at the $\alpha = .05$ level. Write up your result in an organized fashion with explanations.

Viñals X, Moreno E, Lanfumey L, Cordomí A, Pastor A, de La Torre R, et al. (2015) Cognitive Impairment Induced by Delta9-tetrahydrocannabinol Occurs through Heteromers between Cannabinoid CB1 and Serotonin 5-HT2A Receptors. PLoS Biol 13(7): e1002194. https://doi.org/10.1371/journal.pbio.1002194 ↩︎
Davies T, Yogeeswaran K, Verkuyten M, Loughnan S (2018) From humanitarian aid to humanization: When outgroup, but not ingroup, helping increases humanization. PLOS ONE 13(11): e0207343. https://doi.org/10.1371/journal.pone.0207343 ↩︎
Tapanes E, Anestis S, Kamilar JM, Bradley BJ (2020) Does facial hair greying in chimpanzees provide a salient progressive cue of aging? PLoS ONE 15(7): e0235610. https://doi.org/10.1371/journal.pone.0235610 ↩︎