In the last couple of classes, we have discussed the t test. Let’s recall that the test statistic \[ T = \frac{\overline{X} - \mu}{S/\sqrt{n}} \] has the \(t_{n-1}\) distribution, provided that \(X_1,\ldots, X_n\) are iid normal random variables with mean \(\mu\). (Here \(S\) is the sample standard deviation.)
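
We can check this fact by simulation. Here is a minimal sketch; the mean of 5 and sd of 2 are arbitrary choices, not anything special:

n <- 10
Tstats <- replicate(10000, {
  x <- rnorm(n, mean = 5, sd = 2)      # arbitrary normal data
  (mean(x) - 5) / (sd(x) / sqrt(n))    # the statistic T
})
hist(Tstats, probability = TRUE, breaks = 50)
curve(dt(x, df = n - 1), add = TRUE)   # overlay the t_9 density

The histogram should hug the overlaid \(t_9\) density.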

The Central Limit Theorem tells us that if we take a large sample size \(n\), then \(\overline{X}\) will be approximately normal, and the test statistic \(T\) described above will be approximately normal (and approximately \(t_{n-1}\), because for large \(n\) those are approximately the same). What the CLT doesn’t tell us is how big \(n\) needs to be in order for \(T\) to be reasonably well approximated by \(t_{n-1}\).
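
To make the question concrete, we can simulate \(T\) from a skewed distribution and compare it to the \(t_{n-1}\) quantiles. A sketch using exponential data with \(n = 10\) (an exponential with rate 1 has mean 1):

n <- 10
Tstats <- replicate(10000, {
  x <- rexp(n)                         # right-skewed data with true mean 1
  (mean(x) - 1) / (sd(x) / sqrt(n))
})
qqplot(qt(ppoints(10000), df = n - 1), Tstats,
       xlab = "t quantiles (df = 9)", ylab = "simulated T")
abline(0, 1)

The points pull away from the line in the tails, and the tails are exactly where hypothesis tests live.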

There are three properties of the underlying distribution that we will study with respect to this question:

  1. skewness
  2. heavy tails
  3. outliers

Skewness

Let’s start by giving an example. A \(\chi^2\) rv is skewed. Let’s look at its pdf:

x <- seq(0, 10, .01)
plot(x, dchisq(x, 3), type = "l")   # pdf of a chi-squared rv with 3 degrees of freedom

Notice that, wherever the median is, the distribution doesn’t look the same to the left and to the right of that point. Moderate skewness of this type can affect the accuracy of confidence intervals and of moderate \(p\)-values for sample sizes in the 30-100 range, and it can throw off \(p\)-values in the \(10^{-5}\) range even for quite large samples.
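
Before simulating, here is one quick numeric symptom of the asymmetry: for a right-skewed distribution, the mean sits above the median.

mean(rchisq(100000, 3))   # the sample mean, near the true mean of 3
qchisq(.5, 3)             # the true median, about 2.37

Let’s do some simulations.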

I know that the mean of a \(\chi^2\) random variable with 3 degrees of freedom is 3. Let’s consider a hypothesis test of \(H_0: \mu = 3\) vs \(H_a: \mu\not= 3\). If t.test is working correctly, the probability of getting a \(p\)-value less than \(p\) should be \(p\). That’s how \(p\)-values work.
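
As a sanity check, this really is what happens when the normality assumption holds. A sketch with actually-normal data (the mean of 3 just parallels the chi-squared setup below):

pvals <- replicate(10000, t.test(rnorm(20, 3), mu = 3)$p.value)
hist(pvals)       # should look roughly flat on [0, 1]
mean(pvals < .1)  # should be close to .1

Now for the skewed case. Let’s simulate for a sample size of 20 and \(p = .1\).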

pvals <- replicate(10000,t.test(rchisq(20,3),mu = 3)$p.value)
sum(pvals < .1)
## [1] 1202
#Better: sum(replicate(10000,t.test(rchisq(20,3),mu = 3)$p.value) < .1)

Not too bad. I see that I am rejecting \(H_0\) 12.02% of the time, rather than 10% of the time.
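
Is 1202 out of 10000 convincingly different from the nominal 1000, or could it just be simulation noise? One quick check is to treat the replicates as independent Bernoulli trials:

binom.test(1202, 10000, p = .1)

The confidence interval for the true rejection probability sits comfortably above .10, so the excess is not just Monte Carlo error.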

(Aside: I have switched to using implicit loops to do these simulations. If you prefer, you can still do all of the simulations on this page with for loops:

succ <- 0
trials <- 1000
for(i in 1:trials) {
  if(t.test(rchisq(20,3),mu = 3)$p.value < .1) succ <- succ + 1
}
(succ/trials) #Looking to see whether it is close to .1
## [1] 0.128

End of Aside.)

What about \(p = .05\)?

pvals <- replicate(10000,t.test(rchisq(20,3),mu = 3)$p.value)
sum(pvals < .05)
## [1] 699
#Better: sum(replicate(10000,t.test(rchisq(20,3),mu = 3)$p.value) < .05)

Here we see about 40% more rejections than the test is designed to give (699 instead of the expected 500). And with \(p = .001\):

pvals <- replicate(10000,t.test(rchisq(20,3),mu = 3)$p.value)
sum(pvals < .001)
## [1] 75

While I hesitate to put too much faith in this, we did obtain 75 false rejections of \(H_0\), when the test is designed to yield only 10! This type of behavior is typical for positively skewed rv’s such as chi-squared and exponential. For \(p = .05\), \(n = 30-50\) is often large enough for the test to be more or less accurate; for \(p = .01\), we often need more like \(n = 100\). And as \(p\) gets smaller, we need larger and larger samples to make the test work close to how it was designed.
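
To see how the accuracy improves with \(n\), here is a small sweep over sample sizes (it takes a little while to run, and the exact numbers will vary from run to run):

# rejection rate at the .05 level for chi-squared(3) data; nominal value is .05
sapply(c(20, 50, 100, 200), function(n) {
  mean(replicate(2000, t.test(rchisq(n, 3), mu = 3)$p.value) < .05)
})

The rates should drift down toward .05 as \(n\) grows.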

I note here that I have seen data sets that are so skewed that \(n = 3000\) is not a reasonable sample size even for \(p = .05\). So, the degree of skewness is also a factor in determining how big of a sample you need. In general, if you are working with sums of squares of something that should be normal-ish, or with sums of waiting times of Poisson-ish data, then the sample sizes above will probably be pretty close to what you need for your test to be valid. How many samples you need in order to get the power that you want is a completely separate question!!!

Heavy Tails

Heavy-tailed distributions are those that are more likely to produce values far away from the mean/median than is “typical”. What typical means is in the eye of the beholder, but if you look at a boxplot, then the values that fall outside of the whiskers are one way of thinking about points that are far away from the median. Let’s look at normal and uniform rv’s (not heavy tailed):

boxplot(rnorm(100))

boxplot(runif(100))

A typical normal sample of size 100 might have a few data points outside of the whiskers, while a uniform sample would have none. t random variables, on the other hand, are typically heavy tailed.

boxplot(rt(100,2))
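
If you would rather count than eyeball, boxplot.stats returns the points that fall beyond the whiskers (the counts will vary from sample to sample):

length(boxplot.stats(rnorm(100))$out)   # typically 0 to a few
length(boxplot.stats(runif(100))$out)   # essentially always 0
length(boxplot.stats(rt(100, 2))$out)   # typically quite a few more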

Both the plot and the counts show more data points outside of the whiskers, and they fall further away from them. That is an indication of a heavy-tailed distribution. As a general rule, symmetric heavy-tailed distributions make the t test too conservative in rejecting \(H_0\). That is, perhaps only 2% of the time would a p-value fall less than .05. Let’s look:

pvals <- replicate(10000, t.test(rt(30,2))$p.value)   # the true mean of t(2) data is 0, t.test's default mu
sum(pvals < .1)
## [1] 871
sum(pvals < .01)
## [1] 50

We do OK with p-values less than .1 (871 instead of the expected 1000), and worse with p-values less than .01 (50 instead of the expected 100).

When the sample size gets up to 100, then things look better, though we are still quite a bit more conservative than necessary at the \(\alpha = .01\) level.

pvals <- replicate(10000, t.test(rt(100,2))$p.value)
sum(pvals < .1)
## [1] 919
sum(pvals < .01)
## [1] 48
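
The degree of tail heaviness matters, too. Here is a small sweep over the degrees of freedom of the t data (smaller df means heavier tails; exact numbers vary from run to run):

# rejection rate at the .05 level for t-distributed data, n = 30; nominal value is .05
sapply(c(2, 3, 5, 30), function(df) {
  mean(replicate(2000, t.test(rt(30, df))$p.value) < .05)
})

The rates should climb back up toward .05 as the tails get lighter.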

Outliers

Outliers mess up t-tests like nobody’s business. You could have a sample size of 100000, and a single outlier of sufficient size could render your t-test completely invalid. The reason for that is that, if I hold \(x_1,\ldots,x_{n-1}\) constant, and let \(x_n\to\infty\), then the test statistic \(T \to 1\): the outlier dominates both the numerator and the denominator, with \(\overline{x} \approx x_n/n\) and \(S \approx x_n/\sqrt{n}\), so \(T \approx 1\). And, we never reject \(H_0\) when \(T = 1\). So, a single large outlier will force us to never reject \(H_0\). If you think your data has, or could have, very large outliers, a t-test will not be appropriate no matter how many samples you take.
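
We can watch this happen directly (a quick sketch; the exact values of the statistic will vary with the random sample, but the limit won’t):

x <- rnorm(99)                             # hold 99 data points fixed
for (outlier in c(10, 100, 1000, 10000)) {
  print(t.test(c(x, outlier))$statistic)   # watch T head toward 1
}

With 99 degrees of freedom, \(T = 1\) corresponds to a two-sided \(p\)-value of about .32, nowhere near rejection. Let’s look at an example: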

pvals <- replicate(10000,t.test(c(rnorm(100), 10000))$p.value)
sum(pvals < .1)                 
## [1] 0

Note that we never reject \(H_0\) in this scenario, while we were supposed to reject it 10% of the time! If we up the sample size to 10000,

pvals <- replicate(10000,t.test(c(rnorm(10000), 10000))$p.value)
sum(pvals < .1) 
## [1] 0

we still never reject. So, a single outlier crushes Mr. t.test.