Chapter 11 Analysis of Variance
In this chapter, we introduce one-way analysis of variance (ANOVA) through the analysis of a motivating example.
Consider the red.cell.folate data in the ISwR package. We have data on folate levels of patients under three different treatments. Suppose we want to know whether treatment has a significant effect on the folate level of the patient. How would we proceed?
First, we would want to plot the data. A boxplot is a good idea.
library(ggplot2)
red.cell.folate <- ISwR::red.cell.folate
ggplot(red.cell.folate, aes(x = ventilation, y = folate)) +
  geom_boxplot()
Looking at this data, it is not clear whether there is a difference. It seems that the folate level under treatment 1 is perhaps higher. We could also try notch = TRUE, but since the notches go outside of the hinges, the plots look odd and are harder to read. We could compare the means of each pair of ventilations using t.test(). The problem with that approach is that we would be doing three tests, and we would need to adjust our \(p\)-values accordingly. We are going to take a different approach: ANOVA.
11.1 Setup
Notation
- \(x_{ij}\) denotes the \(j\)th observation from the \(i\)th group.
- \(\overline{x}_{i\cdot}\) denotes the mean of the observations in the \(i\)th group.
- \(\overline{x}_{\cdot}\) denotes the mean of all observations.
- There are \(k\) groups.
- There are \(n_i\) observations in the \(i\)th group.
- There are \(N\) total observations.
As an example, in red.cell.folate, there are 3 groups.
We have \(x_{11} = 243, x_{12} = 251, \ldots, x_{21} = 206,\ldots,x_{31} = 241,\ldots.\)
To compute the group means:
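One way to do this is with a dplyr pipeline (a sketch; the exact chunk is not shown, but a grouped summary like the following reproduces the table below):

```r
library(dplyr)

red.cell.folate <- ISwR::red.cell.folate
red.cell.folate %>%
  group_by(ventilation) %>%       # one row per treatment group
  summarize(mean(folate))         # group means of folate
```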
## # A tibble: 3 x 2
## ventilation `mean(folate)`
## <fct> <dbl>
## 1 N2O+O2,24h 317.
## 2 N2O+O2,op 256.
## 3 O2,24h 278
This tells us that \(\overline{x}_{1\cdot} = 316.625\), for example.
To get the overall mean, we use mean(red.cell.folate$folate), which is 283.2.
Our model for the folate level is \[ x_{ij} = \mu + \alpha_i + \epsilon_{ij}, \] where \(\mu\) is the grand mean, \(\alpha_i\) represents the difference between the \(i\)th group mean and the grand mean, and \(\epsilon_{ij}\) represents the variation within the group. So, each individual data point can be decomposed as \[ x_{ij} = \overline{x}_{\cdot} + (\overline{x}_{i\cdot} - \overline{x}_{\cdot}) + (x_{ij} - \overline{x}_{i\cdot}). \]
The null hypothesis for ANOVA is that the group means are all equal (or equivalently all of the \(\alpha_i\) are zero). So, if \(\mu_i = \mu + \alpha_i\) is the true group mean, we have:
\[ H_0 : \mu_1 = \mu_2 = \cdots =\mu_k\\ H_a : \text{Not all $\mu_i$ are equal} \]
11.2 ANOVA
ANOVA is “analysis of variance” because we will compare the variation of data within individual groups with the overall variation of the data. We make assumptions (below) to predict the behavior of this comparison when \(H_0\) is true. If the observed behavior is unlikely, it gives evidence against the null hypothesis. The general idea of comparing variance within groups to the variance between groups is applicable in a wide variety of situations.
Define the variation within groups to be \[ SSD_W = \sum_i \sum_j (x_{ij} - \overline{x}_{i\cdot})^2 \] and the variation between groups to be \[ SSD_B = \sum_i\sum_j (\overline{x}_{i\cdot} - \overline{x})^2 \] and the total variation to be \[ SSD_{tot} = \sum_i \sum_j (\overline{x} - x_{ij})^2. \]
A bit of algebra shows that \(SSD_{tot} = SSD_B + SSD_W\), so the total variation can be broken down into the variation between groups plus the variation within the groups. If the variation between groups is much larger than the variation within groups, then that is evidence that there is a difference in the means of the groups. The question then becomes: how large is large? In order to answer that, we will need to examine the \(SSD_B\) and \(SSD_W\) terms more carefully. At this point, we also will need to make
Assumptions
- The population is normally distributed within each group
- The variances of the groups are equal. (Yikes!)
Recall: We have \(k\) groups. There are \(n_i\) observations in group \(i\), and \(N = \sum n_i\) total observations.
Notice that \(SSD_W\) is almost \(\sum_i\sum_j (x_{ij} - \mu_i)^2\), in which case \(SSD_W/\sigma^2\) would be a \(\chi^2\) random variable with \(N\) degrees of freedom. However, we are making \(k\) replacements of means by their estimates, so we subtract \(k\) to get that \(SSD_W/\sigma^2\) is \(\chi^2\) with \(N - k\) degrees of freedom. (This follows the heuristic we started in Section 4.6.2.)
As for \(SSD_B\), it is trickier to see, but \(\sum_i \sum_j (\overline{x}_{i\cdot} - \mu)^2/\sigma^2\) would be \(\chi^2\) with \(k\) degrees of freedom. Replacing \(\mu\) by \(\overline{x}\) makes \(SSD_B/\sigma^2\) a \(\chi^2\) rv with \(k - 1\) degrees of freedom.
The test statistic for ANOVA is: \[ F = \frac{SSD_B/(\sigma^2(k - 1))}{SSD_W/(\sigma^2(N - k))} = \frac{SSD_B/(k - 1)}{SSD_W/(N - k)} \]
This is the ratio of two \(\chi^2\) rvs divided by their degrees of freedom; hence, it is \(F\) with \(k - 1\) and \(N - k\) degrees of freedom. Through this admittedly very heuristic derivation, you can see why we must assume that we have equal variances (otherwise we wouldn’t be able to cancel the unknown \(\sigma^2\)) and normality (so that we get a known distribution for the test statistic).
If \(F\) is much bigger than 1, then we reject \(H_0: \mu_1 = \mu_2 = \cdots =\mu_k\) in favor of the alternative hypothesis that not all of the \(\mu_i\) are equal.
We can also run simulations to see that the test statistic under the null hypothesis has an F distribution. Let’s assume that we have three populations, all of which are normally distributed with mean 0 and standard deviation 1 (recall that we must assume equal variances, and we are simulating the distribution of the test statistic under the null hypothesis, that all means are equal). We assume that we have 10 samples from group 1, 20 from group 2 and 30 from group 3. In the notation above, \(k = 3\) and \(N = 60\).
We could use built-in R functions (as below) to compute the test statistic, but it is worthwhile to make sure we understand all of the formulas, so we do it by hand. We first make sure that group is a factor! We can also check that (in this case, anyway) \(SSD_W + SSD_B = SSD_{tot}\).
library(dplyr)

# Simulated data: 10, 20, and 30 observations from three N(0, 1) populations
three_groups <- data.frame(group = rep(1:3, times = c(10, 20, 30)),
                           value = rnorm(60, 0, 1))
three_groups$group <- factor(three_groups$group)
sse <- three_groups %>%
  mutate(total_mean = mean(value)) %>%
  group_by(group) %>%
  mutate(mu = mean(value),
         diff_within = value - mu,
         diff_between = mu - total_mean,
         diff_tot = value - total_mean) %>%
  ungroup() %>%
  summarize(sse_within = sum(diff_within^2),
            sse_between = sum(diff_between^2),
            sse_total = sum(diff_tot^2))
sse
## # A tibble: 1 x 3
## sse_within sse_between sse_total
## <dbl> <dbl> <dbl>
## 1 64.3 0.315 64.6
Now that we have computed the three sums of squared errors, we verify that \(SSD_W + SSD_B = SSD_{tot}\).
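Assuming the sse data frame computed above, the difference should be zero (up to floating point error):

```r
# SSD_W + SSD_B - SSD_tot; should print 0 (up to rounding)
with(sse, sse_within + sse_between - sse_total)
```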
## [1] 0
Now we can compute the test statistic.
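By hand, with \(k = 3\) groups and \(N = 60\) observations, and using the sse values from above:

```r
k <- 3
N <- 60
# F = (SSD_B / (k - 1)) / (SSD_W / (N - k))
with(sse, (sse_between / (k - 1)) / (sse_within / (N - k)))
```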
## [1] 0.1394401
The R command for computing this directly, which we will use inside of our simulation since it is faster, is given by
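Matching the command used inside the simulation below (assuming the three_groups data frame from above):

```r
anova(lm(value ~ group, data = three_groups))$`F value`[1]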
## [1] 0.1394401
Now we are ready for our simulation.
sim_data <- replicate(1000, {
three_groups <- data.frame(group = rep(1:3, times = c(10, 20, 30)),
value = rnorm(60, 0, 1))
three_groups$group <- factor(three_groups$group)
aov.mod <- anova(lm(value ~ group, data = three_groups))
aov.mod$`F value`[1]
})
plot(density(sim_data),
     main = "F test statistic")
curve(df(x, 2, 57), add = TRUE, col = 2)
We see that, indeed, the test statistic looks approximately \(F\) with 2 and 57 degrees of freedom.
11.3 Examples
11.3.1 red.cell.folate
Returning to red.cell.folate, we use R to do all of the computations for us.
First, build a linear model for folate as explained by ventilation, and then apply the anova function to the model. anova produces a table which contains all of the computations from the discussion above.
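Concretely, assuming the red.cell.folate data frame from before, the chunk is presumably along these lines:

```r
red.cell.folate <- ISwR::red.cell.folate
folate.mod <- lm(folate ~ ventilation, data = red.cell.folate)  # linear model
anova(folate.mod)                                               # ANOVA table
```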
## Analysis of Variance Table
##
## Response: folate
## Df Sum Sq Mean Sq F value Pr(>F)
## ventilation 2 15516 7757.9 3.7113 0.04359 *
## Residuals 19 39716 2090.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The two rows ventilation and Residuals correspond to the between group and within group variation. The first column, Df, gives the degrees of freedom in each case. Since \(k = 3\), the between group variation has \(k - 1 = 2\) degrees of freedom, and since \(N = 22\), the within group variation (Residuals) has \(N - k = 19\) degrees of freedom.
The Sum Sq column gives \(SSD_B\) and \(SSD_W\). The Mean Sq value is the Sum Sq value divided by the degrees of freedom. These two numbers are the numerator and denominator of the test statistic \(F\). So here, \(F = 7757.9/2090.3 = 3.7113\).
To compute the \(p\)-value, we need the area under the tail of the \(F\) distribution above \(F = 3.7113\). Note that this is a one-tailed test, because small values of \(F\) would actually be evidence in favor of \(H_0\).
The ANOVA table gives the \(p\)-value as Pr(>F), here \(P = 0.04359\). We could compute this from \(F\) using:
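Here \(F = 3.7113\) with \(2\) and \(19\) degrees of freedom, so:

```r
# area under the upper tail of the F distribution above 3.7113
pf(3.7113, df1 = 2, df2 = 19, lower.tail = FALSE)
```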
## [1] 0.04359046
According to the \(p\)-value, we would reject at the \(\alpha = .05\) level, but some caution is in order.
Let’s continue by checking the assumptions in the model. As in regression, plotting the linear model object results in a series of residual plots.
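For example (a sketch; calling plot on an lm object produces the residual plots):

```r
folate.mod <- lm(folate ~ ventilation, data = ISwR::red.cell.folate)
plot(folate.mod)   # residual plots, including Normal Q-Q and Scale-Location
```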
The variances of each group did not look equal in the boxplot, and the Scale-Location plot suggests that the rightmost group does have somewhat higher variation than the others. The x-axis of these plots shows fitted values, which in ANOVA are the group means. So the rightmost cluster of points corresponds to the group with mean 316, which is group 1, the N2O+O2,24h group.
The Normal Q-Q plot suggests the data may be heavy-tailed. I would put a strongly worded caveat in any report claiming significance based on this data, but it is certainly evidence that a larger study might be useful.
11.3.2 InsectSprays
The built-in data set InsectSprays records the counts of insects in agricultural experimental units treated with different insecticides.
Plotting the data reveals an obvious difference in effectiveness among the spray types:
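A boxplot in the same style as before will do (the original chunk is a sketch along these lines):

```r
library(ggplot2)
ggplot(InsectSprays, aes(x = spray, y = count)) +
  geom_boxplot()
```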
Not surprisingly, ANOVA strongly rejects the null hypothesis that all sprays have the same mean, with a \(p\)-value less than \(2.2 \times 10^{-16}\):
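The table below presumably comes from the same lm/anova pattern as before:

```r
anova(lm(count ~ spray, data = InsectSprays))
```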
## Analysis of Variance Table
##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2668.8 533.77 34.702 < 2.2e-16 ***
## Residuals 66 1015.2 15.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, it is also clear from the boxplot that the assumption of equal variances is violated. The three effective sprays C, D, and E, have visibly less variance than the other three sprays. In this example, the sprays with smaller counts have smaller variance. This phenomenon is known as heteroskedasticity, and is fairly common. The residual plots show it well, especially the Scale-Location plot
Using ANOVA with this data is highly suspect. Taking the square root of the counts improves the situation:
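The square root transformation can be applied directly in the model formula; the table below presumably comes from:

```r
anova(lm(sqrt(count) ~ spray, data = InsectSprays))
```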
## Analysis of Variance Table
##
## Response: sqrt(count)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 88.438 17.6876 44.799 < 2.2e-16 ***
## Residuals 66 26.058 0.3948
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Or, if normality of the data within each group is not in question (only the equal variance assumption), then oneway.test can be used.
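The call is presumably (with the default var.equal = FALSE, i.e. not assuming equal variances):

```r
oneway.test(count ~ spray, data = InsectSprays)
```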
##
## One-way analysis of means (not assuming equal variances)
##
## data: count and spray
## F = 36.065, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
11.4 Examining the hypotheses
In this section, we examine the effective type I error rate when \(\alpha = .05\) and the underlying populations are normal but have unequal variances. First, let’s look at the equal sample size case. Suppose we have three groups, two of which have the same variance and one of which has a different variance. We fix the number of observations at 30 for each group for now. This piece of code can take a bit of time to run; you might try it with the number of replications equal to 100 or 1000 to start.
variances <- c(.01, .05, .1, .5, 5, 10, 50, 100)
sapply(variances, function(x) {
sim_data <- replicate(10000, {
three_groups <- data.frame(group = factor(rep(1:3, each = 30)),
values = rnorm(90, 0, rep(c(1, 1, x), each = 30)))
anova(lm(values ~ group, data = three_groups))$`Pr(>F)`[1]
})
mean(sim_data < .05)
})
## [1] 0.0649 0.0639 0.0608 0.0572 0.0779 0.0866 0.0902 0.0911
We see that when the sample sizes are all equal and only one group has a different variance, a larger variance in the odd group inflates the effective type I error rate more (and hence is more incorrect) than when the odd group has a smaller variance than the other two. Now, what happens if the variance in the third group is larger, and the sample size of the third group is different?
sample_sizes <- c(10, 20, 50, 100, 500)
sapply(sample_sizes, function(x) {
sim_data <- replicate(1000, {
three_groups <- data.frame(group = factor(rep(1:3, times = c(30, 30, x))),
values = rnorm(60 + x, 0, rep(c(1, 1, 5), times = c(30, 30, x))))
anova(lm(values ~ group, data = three_groups))$`Pr(>F)`[1]
})
mean(sim_data < .05)
})
## [1] 0.277 0.136 0.031 0.002 0.000
Note that the unequal sample size exacerbates the problem! When the third group’s sample size is small relative to the other two, the effective type I error rate when \(\alpha = .05\) is as high as nearly 30 percent, and when its sample size is large, the effective type I error rate can drop below 0.001. Now, let’s compare what happens when we use oneway.test.
sample_sizes <- c(10, 20, 50, 100, 500)
sapply(sample_sizes, function(x) {
sim_data <- replicate(1000, {
three_groups <- data.frame(group = factor(rep(1:3, times = c(30, 30, x))),
values = rnorm(60 + x, 0, rep(c(1, 1, 5), times = c(30, 30, x))))
oneway.test(values ~ group, data = three_groups)$p.value
})
mean(sim_data < .05)
})
## [1] 0.042 0.045 0.051 0.054 0.054
There doesn’t seem to be any significant issue with oneway.test; the effective type I error rate when \(\alpha = .05\) appears to be approximately \(.05\), as desired.
In the exercises, you are asked to examine what happens when there are outliers and when one of the groups is skew. If you suspect your populations may not follow the assumptions that we need in order to do ANOVA, you can run some simulations to get a feel for how big of a problem it might be. So, for example, if you had two samples of size 30 and one of size 10, then you would want to be pretty strict about the equal variance assumption. If your sample sizes are all 30, then you don’t need to be as strict.
11.5 Pairwise t-tests and FWER
Suppose we have \(k\) groups of observations. Another approach is to do \(\binom{k}{2}\) \(t\)-tests and adjust the \(p\)-values for the fact that we are doing multiple \(t\)-tests. This has the advantage of showing directly which pairs of groups have significant differences in means, but the disadvantage of not directly testing \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\) versus \(H_a:\) at least one of the means is different. See Exercise 6.
Let’s begin by seeing that we will definitely need to adjust the \(p\)-values. Suppose we are testing at the \(\alpha = .05\) level, and we have \(6\) groups of observations, all with the same mean. Then we are doing \(\binom{6}{2} = 15\) tests, each of which has a probability of a type I error of \(.05\). We cannot really assume that the tests are independent, because if we know one pair is significantly different, that indicates an increased likelihood that at least one of the groups is different than the others. However, as a first estimate, we can assume independence to get an idea of the probability of making an error in any of the 15 tests that we are doing. The probability of not making an error in a single test is 0.95, so the probability of not making an error in any of the tests is \(0.95^{15} \approx 0.46\). Hence, the probability of a type I error in at least one of the tests is \(1 - .95^{15} \approx 0.54\). The family-wise error rate (FWER) is the probability of making at least one type I error in any of the family of tests that are conducted. We have estimated it to be about \(0.54\) (assuming independence) in the computation above.
Let’s follow up with some simulations to check. We assume all 6 groups are normal with mean 0 and standard deviation 1. The R command for doing pairwise \(t\) tests is pairwise.t.test, which has the following arguments.

- x: the values of the experiment, or the response vector.
- g: the grouping of the experiment into populations.
- p.adjust.method: the method for adjusting the \(p\)-values. This includes the options “none”, “holm” and “bonferroni”. The default is “holm”.
- pool.sd: indicates whether we want to pool the standard deviation across all of the groups to get an estimate of the common standard deviation, or whether we estimate the standard deviation of each group separately. Set to TRUE if we know that the variances of all the groups are equal, or possibly if some of the groups are small. Otherwise, FALSE.
For the purposes of this simulation, we will set p.adjust.method to “none”, so we can see whether the FWER is about 0.537. Let’s first see what the output looks like.
six_groups <- data.frame(values = rnorm(60),
groups = factor(rep(1:6, each = 10)))
pairwise.t.test(six_groups$values,
six_groups$groups,
p.adjust.method = "none")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: six_groups$values and six_groups$groups
##
## 1 2 3 4 5
## 2 0.277 - - - -
## 3 0.072 0.463 - - -
## 4 0.431 0.762 0.301 - -
## 5 0.565 0.606 0.214 0.830 -
## 6 0.068 0.449 0.981 0.291 0.205
##
## P value adjustment method: none
OK, we need to see whether any of the \(p\)-values are less than 0.05, which would indicate that we would reject the null hypothesis of that pair of means being equal.
any(pairwise.t.test(six_groups$values,
six_groups$groups,
p.adjust.method = "none")$p.value < .05, na.rm = TRUE)
## [1] FALSE
Now, we put that inside of replicate.
sim_data <- replicate(10000, {
six_groups <- data.frame(values = rnorm(60),
groups = factor(rep(1:6, each = 10)))
any(pairwise.t.test(six_groups$values,
six_groups$groups,
p.adjust.method = "none")$p.value < .05, na.rm = TRUE)
})
mean(sim_data)
## [1] 0.3453
We see that our original estimate of \(0.54\) was significantly off; the true FWER of the experiment is closer to 0.35. This indicates that the \(p\)-values of the tests are not independent. Nonetheless, it is disconcerting that the FWER of the experiment is over 1/3 when we set \(\alpha = .05\)!
The simplest correction is the Bonferroni correction, which simply multiplies each \(p\)-value by the number of comparisons being made, capping at 1 if necessary. This is a conservative approach, and will lead to a FWER of at most the specified \(\alpha\) level. Let’s look at some simulations, where all of the \(p\)-values are multiplied by 15, the number of comparisons. We should end up with a FWER of less than 0.05.
sim_data <- replicate(10000, {
six_groups <- data.frame(values = rnorm(60),
groups = factor(rep(1:6, each = 10)))
any(pairwise.t.test(six_groups$values,
six_groups$groups,
p.adjust.method = "bonferroni")$p.value < .05, na.rm = TRUE)
})
mean(sim_data)
## [1] 0.0383
We see that our FWER is 0.0383, which is indeed less than \(\alpha = .05\). If we use “holm”, then the FWER will still be less than or equal to \(\alpha = .05\). In fact, the FWER of “holm” is the same as that of “bonferroni”!
sim_data <- replicate(10000, {
six_groups <- data.frame(values = rnorm(60),
groups = factor(rep(1:6, each = 10)))
any(pairwise.t.test(six_groups$values,
six_groups$groups,
p.adjust.method = "holm")$p.value < .05, na.rm = TRUE)
})
mean(sim_data)
## [1] 0.0376
Now we get an estimate of the FWER of 0.0376, which is indeed less than \(\alpha = .05\) and not significantly different from the Bonferroni FWER.
In Exercise @ref(bonf_holm), you are asked to show that the power of “holm” is higher than the power of “bonferroni” in a specific instance. In general, Holm is at least as powerful as Bonferroni, while having the same FWER, and for this reason is preferred.
Returning to the red.cell.folate data, we could do a pairwise \(t\) test via the following command:
##
## Pairwise comparisons using t tests with pooled SD
##
## data: red.cell.folate$folate and red.cell.folate$ventilation
##
## N2O+O2,24h N2O+O2,op
## N2O+O2,op 0.042 -
## O2,24h 0.310 0.408
##
## P value adjustment method: holm
We see that there is a significant difference between two of the groups, but not between any of the other pairs. There are several options available with pairwise.t.test, some of which were discussed above:

- pool.sd = TRUE can be used when we are assuming equal variances among the groups, and we want to use all of the groups together to estimate the standard deviation. This can be useful if one of the groups is small, for example, but it does require the assumption that the variance of each group is the same.
- p.adjust.method can be set to one of several options, which can be viewed by typing p.adjust.methods. The easiest to understand is “bonferroni”, which multiplies the \(p\)-values by the number of tests being performed. This is a very conservative adjustment method! The default “holm” is a good choice. If you wish to compare the results to t.test for diagnostic purposes, then set p.adjust.method = "none".
- alternative = "greater" or "less" can be used for one-sided tests. Care must be taken to get the order of the groupings correct in this case.
- var.equal = TRUE can be used if we know the variances are equal, but we do not want to pool the standard deviation. The default is var.equal = FALSE, just like in t.test. This option is not generally recommended.
11.6 Exercises
- Consider the case0502 data from Sleuth3. Dr. Benjamin Spock was tried in Boston for encouraging young men not to register for the draft. It was conjectured that the judge in Spock’s trial did not have appropriate representation of women. The jurors were supposed to be selected by taking a random sample of 30 people (called venires), from which the jurors would be chosen. In the data case0502, the percent of women in 7 judges’ venires are given.
  - Create a boxplot of the percent women for each of the 7 judges. Comment on whether you believe that Spock’s lawyers might have a point.
  - Determine whether there is a significant difference in the percent of women included in the venires of the 6 judges other than Spock’s judge.
  - Determine whether there is a significant difference in the percent of women included in Spock’s venires versus the percent included in the other judges’ venires combined. (Your answer to b. should justify doing this.)
- Consider the data in ex0524 in Sleuth3. This gives 2584 data points, where the subjects’ IQ was measured in 1979 and their income was measured in 2005. Create a boxplot of income for each of the IQ quartiles. Is there evidence to suggest that there is a difference in mean income among the 4 groups of IQ? Check the residuals in your model. Does this cause you to doubt your result?
- The built-in data set InsectSprays records the counts of living insects in agricultural experimental units treated with different insecticides.
  - Plot the data. Three of the sprays seem to be more effective (fewer insects collected) than the other three sprays. Which three are the more effective sprays?
  - Use ANOVA to test if the three effective sprays have the same mean. What do you conclude?
  - Use ANOVA to test if the three less effective sprays have the same mean. What do you conclude?
- The built-in data set chickwts measures weights of chicks after six weeks eating various types of feed. Test if the mean weight is the same for all feed types. Check the residuals: are the assumptions for ANOVA reasonable?
- The data set MASS::immer from the MASS library has data on a trial of barley varieties performed by Immer, Hayes, and Powers in the early 1930s. Test if the total yield (sum of 1931 and 1932) depends on the variety of barley. Check the residuals: are the assumptions for ANOVA reasonable?
- The built-in data set morley gives speed of light measurements for five experiments done by Michelson in 1879.
  - Create a boxplot with one box for each of the five experiments.
  - Use ANOVA to test if the five experiments have the same mean. What do you conclude?
- Suppose you have 6 groups of 15 observations each. Groups 1-3 are normal with mean 0 and standard deviation 3, while groups 4-6 are normal with mean 1 and standard deviation 3. Compare via simulation the power of the following tests of \(H_0: \mu_1 = \cdots = \mu_6\) versus the alternative that at least one pair is different at the \(\alpha = .05\) level.
  - ANOVA. (Hint: anova(lm(formula, data))[1, 5] pulls out the \(p\)-value of ANOVA.)
  - pairwise.t.test, where we reject the null hypothesis if any of the paired \(p\)-values are less than 0.05.
- Suppose you have three populations, all of which are normal with standard deviation 1 and with means \(0\), \(0.5\), and \(1\). You take samples of size 30 from each and perform a pairwise \(t\) test.
  - Estimate the percentage of times that the null hypotheses being tested are rejected when using the “bonferroni” \(p\)-value adjustment.
  - Estimate the percentage of times that the null hypotheses being tested are rejected when using the “holm” \(p\)-value adjustment.