Chapter 5 Simulation of Random Variables

In this chapter we discuss simulation related to random variables. After a review of probability simulation, we turn to the simulation of pdfs and pmfs of random variables. These simulations provide a foundation for understanding the fundamental concepts of statistical inference: sampling distributions, point estimators, and the Central Limit Theorem.

5.1 Estimating probabilities of rvs via simulation

In order to run simulations with random variables, we will use the R command r + distname, where distname is the name of the distribution, such as unif, geom, pois, norm, exp or binom. The first argument to any of these functions is the number of samples to create.

Most distname choices can take additional arguments that affect their behavior. For example, if we want to create a random sample of size 100 from a normal random variable with mean 5 and sd 2, we would use rnorm(100, mean = 5, sd = 2), or simply rnorm(100, 5, 2). Without the additional arguments, rnorm gives the standard normal random variable, with mean 0 and sd 1. For example, rnorm(10000) gives 10000 independent random values of the standard normal random variable \(Z\).

Our strategy for estimating probabilities using simulation is as follows. We first create a large (approximately 10,000) sample from the distribution(s) in question. Then, we use the mean function to compute the percentage of times that the event we are interested in occurs. If the sample size is reasonably large, then the true probability that the event occurs should be close to the percentage of times that the event occurs in our sample. The larger the sample size is, the closer we expect our estimate to be to the true value. Since there can be variation in these estimates, it is often useful to repeat this procedure a few times to see what kind of variation we obtain in our answers.

Example 5.1

Let \(Z\) be a standard normal random variable. Estimate \(P(Z > 1)\).

We begin by creating a large random sample from a normal random variable.

simdata <- rnorm(10000)

Next, we compute the percentage of times that the sample is greater than one. First, we create a vector that contains TRUE at each index of the sample that is greater than one, and contains FALSE at each index of the sample that is less than or equal to one.

log_vec <- simdata > 1
head(log_vec)
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

Now, we count the number of times we see a TRUE, and divide by the length of the vector.

sum(log_vec == TRUE) / 10000
## [1] 0.1588

There are a few improvements we can make. First, log_vec == TRUE is just the same as log_vec, so we can compute

sum(log_vec) / 10000
## [1] 0.1588

Lastly, we note that sum converts TRUE to 1 and FALSE to zero. We are adding up these values and dividing by the length, which is the same thing as taking the mean. So, we can just do

mean(log_vec)
## [1] 0.1588

For simple problems, it is often easier to omit the intermediate step(s) and write

mean(simdata > 1)
## [1] 0.1588
Example 5.2

Let \(Z\) be a standard normal rv. Estimate \(P(Z^2 > 1)\).

simdata <- rnorm(10000)
mean(simdata^2 > 1)
## [1] 0.322

We can also easily estimate means and standard deviations of random variables. To do so, we create a large random sample from the distribution in question, and we take the sample mean or sample standard deviation of the large sample. If the sample is large, then we expect the sample mean to be close to the true mean, and the sample standard deviation to be close to the true standard deviation. Let’s begin with an example where we know what the correct answer is, in order to check that the technique is working.

Example 5.3

Let \(Z\) be a normal random variable with mean 0 and standard deviation 1. Estimate the mean and standard deviation of \(Z\).

simdata <- rnorm(10000, 0, 1)
mean(simdata)
## [1] 0.001501601
sd(simdata)
## [1] 1.003491

We see that we are reasonably close to the correct answers of 0 and 1.

Example 5.4

Estimate the mean and standard deviation of \(Z^2\).

simdata <- rnorm(10000)
mean(simdata^2)
## [1] 0.9873908
sd(simdata^2)
## [1] 1.416691

We will see later in this chapter that \(Z^2\) is a \(\chi^2\) random variable with one degree of freedom, and it is known that a \(\chi^2\) random variable with \(\nu\) degrees of freedom has mean \(\mu\) and standard deviation \(\sqrt{2\nu}\). Thus, the answers above are pretty close to the correct answers.

Example 5.5

Let \(X\) and \(Y\) be independent standard normal random variables. Estimate \(P(XY > 1)\).

There are two ways to do this. The first is:

x <- rnorm(10000)
y <- rnorm(10000)
mean(x * y > 1)
## [1] 0.105

A technique that is closer to what we will be doing below is the following. We want to create a random sample from the random variable \(Z = XY\). To do so, we would use

simdata <- rnorm(10000) * rnorm(10000)

Note that R multiples the vectors componentwise. Then, we compute the percentage of times that the sample is greater than 1 as before.

mean(simdata > 1)
## [1] 0.0995

The two methods give different but similar answers because simulations are random and only estimate the true values.

Note that \(P(XY > 1)\) is not the same as the answer we got for \(P(Z^2 > 1)\) in Example 5.3.

5.2 Estimating discrete distributions

In this section, we show how to estimate via simulation the pmf of a discrete random variable. Let’s begin with an example.

Example 5.6

Suppose that two dice are rolled, and their sum is denoted as \(X\). Estimate the pmf of \(X\) via simulation.

To estimate \(P(X = 2)\), for example, we would use

simdata <- replicate(10000, {
  dieRoll <- sample(1:6, 2, TRUE)
  sum(dieRoll)
})
mean(simdata == 2)
## [1] 0.0282

It is possible to repeat this replication for each value \(2,\ldots, 12\), but that would take a long time. A more efficient method is to keep track of all of the observations of the random variable, and divide each by the number of times the rv was observed. We will use table for this. Recall, table gives a vector of counts of each unique element in a vector. That is,

table(c(1, 1, 1, 1, 1, 2, 2, 3, 5, 1))
## 
## 1 2 3 5 
## 6 2 1 1

indicates that there are 6 occurrences of “1,” 2 occurrences of “2,” and 1 occurrence each of “3” and “5.” To apply this to the die rolling, we create a vector of length 10,000 that has all of the observances of the rv \(X\), in the following way:

simdata <- replicate(10000, {
  dieRoll <- sample(1:6, 2, TRUE)
  sum(dieRoll)
})
table(simdata)
## simdata
##    2    3    4    5    6    7    8    9   10   11   12 
##  300  547  822 1080 1412 1655 1418 1082  846  562  276

We then divide each entry of the table by 10,000 to get the estimate of the pmf:

table(simdata) / 10000
## simdata
##      2      3      4      5      6      7      8      9     10     11 
## 0.0300 0.0547 0.0822 0.1080 0.1412 0.1655 0.1418 0.1082 0.0846 0.0562 
##     12 
## 0.0276

We don’t want to hard-code the 10000 value in the above command, because it can be a source of error if we change the number of replications and forget to change the denominator here. There is a base R function proportions24 which is useful for dealing with tabled data. In particular, if the tabled data is of the form above, then it computes the proportion of each value.

proportions(table(simdata))
## simdata
##      2      3      4      5      6      7      8      9     10     11 
## 0.0300 0.0547 0.0822 0.1080 0.1412 0.1655 0.1418 0.1082 0.0846 0.0562 
##     12 
## 0.0276

And, there is our estimate of the pmf of \(X\). For example, we estimate the probability of rolling an 11 to be 0.0562.

The function proportions is discussed further in Section 10.1.

Example 5.7

Suppose fifty randomly chosen people are in a room. Let \(X\) denote the number of people in the room who have the same birthday as someone else in the room. Estimate the pmf of \(X\).

Tryit ??.2

Before reading further, what do you think the most likely outcome of \(X\) is?

We will simulate birthdays by taking a random sample from 1:365 and storing it in a vector. The tricky part is counting the number of elements in the vector that are repeated somewhere else in the vector. We will create a table and add up all of the values that are bigger than 1. Like this:

birthdays <- sample(1:365, 50, replace = TRUE)
table(birthdays)
## birthdays
##   8   9  13  20  24  41  52  64  66  70  92  98  99 102 104 119 123 126 
##   2   1   1   2   2   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 151 175 179 182 185 222 231 237 240 241 249 258 259 276 279 285 287 313 
##   1   1   1   1   1   1   1   1   1   1   1   1   2   2   1   1   1   1 
## 317 323 324 327 333 344 346 364 365 
##   1   1   1   1   1   1   1   1   1

We look through the table and see that anywhere there is a number bigger than 1, there are that many people who share that birthday. Using table(birthdays) > 1 to select the multiple birthday entries, we sum to count the number of people who share a birthday with someone else:

sum(table(birthdays)[table(birthdays) > 1])
## [1] 10

Now, we put that inside replicate, just like in the previous example.

simdata <- replicate(10000, {
  birthdays <- sample(1:365, 50, replace = TRUE)
  sum(table(birthdays)[table(birthdays) > 1])
})
pmf <- proportions(table(simdata))
pmf
plot(pmf,
  main = "50 people in a room",
  xlab = "Number of people sharing a birthday",
  ylab = "Probability"
)
## simdata
##      0      2      3      4      5      6      7      8      9     10 
## 0.0309 0.1196 0.0059 0.1982 0.0219 0.2146 0.0309 0.1688 0.0293 0.0917 
##     11     12     13     14     15     16     17     18     19     21 
## 0.0205 0.0361 0.0097 0.0137 0.0034 0.0034 0.0006 0.0006 0.0001 0.0001

Looking at the pmf, our best estimate is that the most likely outcome is that 6 people in the room share a birthday with someone else, followed closely by 4 and then 8. Note that it is impossible for exactly one person to share a birthday with someone else in the room! It would be preferable for several reasons to have the observation \(P(X = 1) = 0\) listed in the table above, but we leave that for another day.

Example 5.8

You toss a coin 100 times. After each toss, either there have been more heads, more tails, or the same number of heads and tails. Let \(X\) be the number of times in the 100 tosses that there were more heads than tails. Estimate the pmf of \(X\).

Tryit ??.3

Before looking at the solution, guess whether the pmf of \(X\) will be centered around 50, or not.

We start by doing a single run of the experiment. Recall that the function cumsum accepts a numeric vector and returns the cumulative sum of the vector. So, cumsum(c(1, 3, -1)) would return c(1, 4, 3).

coin_flips <- sample(c("H", "T"), 100, replace = TRUE) # flip 100 coins
num_heads <- cumsum(coin_flips == "H") # cumulative number of heads so far
num_tails <- cumsum(coin_flips == "T") # cumulative number of tails so far
sum(num_heads > num_tails) # number of times there were more heads than tails
## [1] 0

Now, we put that inside of replicate.

simdata <- replicate(100000, {
  coin_flips <- sample(c("H", "T"), 100, replace = TRUE)
  num_heads <- cumsum(coin_flips == "H")
  num_tails <- cumsum(coin_flips == "T")
  sum(num_heads > num_tails)
})
pmf <- proportions(table(simdata))

When we have this many possible outcomes, it is easier to view a plot of the pmf than to look directly at the table of probabilities.

plot(pmf, ylab = "Probability", xlab = "More heads than tails")

The most likely outcome (by far) is that there are never more heads than tails in the 100 tosses of the coin. This result can be surprising, even to experienced mathematicians.

Example 5.9

Suppose you have a bag full of marbles; 50 are red and 50 are blue. You are standing on a number line, and you draw a marble out of the bag. If you get red, you go left one unit. If you get blue, you go right one unit. This is called a random walk. You draw marbles up to 100 times, each time moving left or right one unit. Let \(X\) be the number of marbles drawn from the bag until you return to 0 for the first time. The rv \(X\) is called the first return time since it is the number of steps it takes to return to your starting position.

Estimate the pmf of \(X\).

First, simulate the steps of the walk, with 1 and -1 representing steps right and left.

movements <- sample(rep(c(1, -1), times = 50), 100, replace = FALSE)
movements[1:10]
##  [1]  1  1 -1 -1 -1 -1  1  1  1 -1

Next, use cumsum to calculate the cumulative sum of the steps. This vector gives the position of the walk at each step.

cumsum(movements)[1:10]
##  [1]  1  2  1  0 -1 -2 -1  0  1  0

The values where the cumulative sum is zero represent a return to the origin. Using which, we learn which steps of the walk were zero, and then find the first of these with min.

which(cumsum(movements) == 0)
## [1]   4   8  10  12  56  78  82  94 100
min(which(cumsum(movements) == 0))
## [1] 4

This results in a single value of the rv \(X\), in this case 4. To finish, we replicate the code and table it to compute the pmf.

simdata <- replicate(10000, {
  movements <- sample(rep(c(1, -1), times = 50), 100, replace = FALSE)
  min(which(cumsum(movements) == 0))
})
pmf <- proportions(table(simdata))
plot(pmf,
  main = "First return time for a 100 step random walk",
  xlab = "Steps to return",
  ylab = "Probability"
)

Only even return times are possible (why?). Half the time the first return is after two draws, one of each color. There is a slight bump near 100, when the bag of marbles empties out and the number of red and blue marbles drawn are forced to equalize.

5.3 Estimating continuous distributions

In this section, we show how to estimate the pdf of a continuous rv \(X\) via simulation. For discrete rvs, we used table to produce a count of each value that occurred in our simulation. These counts approximated the pmf. However, continuous rvs essentially never take the same value twice, so table is not helpful.

Instead, we divide the range of \(X\) into segments and count the number of values in each segment that appear in the simulation. These counts can be visualized by drawing a vertical bar over each segment of the range, with height corresponding to the count or proportion of values that appeared in that segment. The resulting graph is called a histogram, and is easily produced with the R command hist.

For distributions where the pdf is continuous, we may also use density estimation. Rather than dividing the range into fixed segments, density estimation computes a moving average of the number of values. The R command for density estimation is density, and in this book we will always use the result of density estimation to produce a plot. The plot results in a smooth curve whose height approximates the pdf of the random variable.

Example 5.10

Estimate the pdf of \(2Z\) when \(Z\) is a standard normal random variable.

We simulate 10,000 values of a standard normal rv \(Z\) and then multiply by 2 to get 10,000 values sampled from \(2Z\).

Z <- rnorm(10000)
twoZ <- 2 * Z

We then use density to estimate and plot the pdf of the data:

plot(density(twoZ),
  main = "Density of 2Z",
  xlab = "2Z"
)

Notice that the most likely outcome of \(2Z\) seems to be 0, just as it is for \(Z\), but that the spread of the distribution of \(2Z\) is twice as wide as the spread of \(Z\).

Alternatively, we could create a histogram of the data, using hist. We set probability = TRUE to adjust the scale on the y-axis so that the area of each rectangle in the histogram is the probability that the random variable falls in the interval given by the base of the rectangle.

hist(twoZ,
  probability = TRUE,
  main = "Histogram of 2Z",
  xlab = "2Z"
)

Given experimental information about a probability density, we wish to compare it to a known theoretical pdf. A direct way to make this comparison is to plot the estimated pdf on the same graph as the theoretical pdf. To add a curve to a histogram, density plot, or any plot, use the R function curve.

Example 5.11

Compare the pdf of \(2Z\), where \(Z\sim \text{Norm}(0,1)\) to the pdf of a normal random variable with mean 0 and standard deviation 2.

We already saw how to estimate the pdf of \(2Z\), we just need to plot the pdf of \(\text{Norm}(0,2)\) on the same graph. We show how to do this using both the histogram and the density plot approach. The pdf \(f(x)\) of \(\text{Norm}(0,2)\) is given in R by the function \(f(x) = \text{dnorm}(x,0,2)\).

hist(twoZ,
  probability = TRUE,
  main = "Density and Histogram of 2Z",
  xlab = "2Z"
)
curve(dnorm(x, 0, 2), add = TRUE, col = "red")

Since the area of each rectangle in the histogram is approximately the same as the area under the curve over the same interval, this is evidence that \(2Z\) is normal with mean 0 and standard deviation 2. Next, let’s look at the density estimation together with the true pdf of a normal rv with mean 0 and \(\sigma = 2\).

plot(density(twoZ),
  xlab = "2Z",
  main = "Density 2Z and Norm(0, 2)"
)
curve(dnorm(x, 0, 2), add = TRUE, col = "red")

Wow! Those look really close to the same thing! This is again evidence that \(2Z \sim \text{Norm}(0, 2)\).

It would be a good idea to label the two curves in our plots, but that sort of finesse is easier with ggplot, which will be discussed in Chapter 7. Also in Chapter 7, we introduce qq plots, which are a more accurate (but less intuitive) way to compare continuous distributions.

Example 5.12

Estimate via simulation the pdf of \(\log\bigl(\left|Z\right|\bigr)\) when \(Z\) is standard normal.

expData <- log(abs(rnorm(10000)))
plot(density(expData),
  main = "Density of log|Z|",
  xlab = "log|Z|"
)

The simulations in Examples 5.10 and 5.12 worked well because the pdfs of \(2Z\) and of \(\log\bigl(\left|Z\right|\bigr)\) are continuous functions. For pdfs with discontinuities, density estimation of this type will misleadingly smooth out the jumps in the function.

Example 5.13

Estimate the pdf of \(X\) when \(X\) is uniform on the interval \([-1,1]\).

unifData <- runif(10000, -1, 1)
plot(density(unifData),
  main = "Density of Unif[-1,1]",
  xlab = "X"
)
curve(dunif(x, -1, 1), add = TRUE, col = "red")

This distribution should be exactly 0.5 between -1 and 1, and zero everywhere else, as shown by the theoretical distribution in red. The simulation is reasonable in the middle, but the discontinuities at the endpoints are not shown well. Increasing the number of data points helps, but does not fix the problem.

For this distribution, which has a jump discontinuity, using a histogram works better.

hist(unifData,
  probability = TRUE,
  main = "Histogram of Unif[-1,1]",
  xlab = "X"
)

We now use density estimation to give evidence of an important fact: the sum of normal random variables is still normal.

Example 5.14

Estimate the pdf of \(Z_1 + Z_2\) where \(Z_1\) and \(Z_2\) are independent standard normal random variables.

z1 <- rnorm(10000)
z2 <- rnorm(10000)
plot(density(z1 + z2),
  main = "Sum of two standard normal rvs",
  xlab = expression(Z[1] + Z[2])
)
curve(dnorm(x, 0, sqrt(2)), add = TRUE, col = "red")

In the plot, the pdf for a \(\text{Norm}(0,\sqrt{2})\) rv is included. It appears that the sum of two standard normal random variables is again a normal random variable. The standard deviation is \(\sqrt{2}\) since \(\text{Var}(Z_1 + Z_2) = \text{Var}(Z_1) + \text{Var}(Z_2) = 2\) by Theorem 3.10.

Tryit 5.1

For \(Z_1\),\(Z_2\) independent standard normal, estimate the pdf of \(Z_1 - Z_2\) and see that it is also \(\text{Norm}(0,\sqrt{2})\).

Exercise 5.16 asks you to investigate the sum of two normal rvs which are not standard mean 0, sd 1, and see that their sum is normal as well. We will not prove that this is always true, but state it as the following:

Theorem 5.1

If \(X \sim \text{Norm}(\mu_X,\sigma_X)\) and \(Y \sim \text{Norm}(\mu_Y,\sigma_Y)\) are independent, then \(X+Y\) is normal with mean \(\mu_X + \mu_Y\) and variance \(\sigma_X^2 + \sigma_Y^2\).

More generally,

Theorem 5.2

Let \(X_1, \ldots, X_n\) be mutually independent normal random variables with means \(\mu_1, \ldots, \mu_n\) and standard deviations \(\sigma_1, \ldots, \sigma_n\). The random variable \[ \sum_{i = 1}^n a_i X_i \] is a normal random variable with mean \(\sum_{i = 1}^n a_i \mu_i\) and standard deviation \(\sqrt{\sum_{i = 1}^n a_i^2 \sigma_i^2}\). ]

Let’s meet an important distribution, the \(\chi^2\) (chi-squared) distribution. In fact, \(\chi^2\) is a family of distributions controlled by a parameter called the degrees of freedom, usually abbreviated df. The distname for a \(\chi^2\) rv is chisq, and dchisq requires the degrees of freedom to be specified using the df parameter.

Example 5.15

Let \(Z\) be a standard normal rv. Find the pdf of \(Z^2\) and compare it to the pdf of the \(\chi^2\) rv with one df on the same plot.

zData <- rnorm(10000)^2
hist(zData,
  probability = T,
  main = "Density and Histogram of Z^2",
  xlab = expression(Z^2)
)
curve(dchisq(x, df = 1), add = TRUE, col = "red")

As you can see, the estimated density follows the true histogram quite well. This is evidence that \(Z^2\) is, in fact, chi-squared. Notice that \(\chi^2\) with one df has a vertical asymptote at \(x = 0\).

Example 5.16

Estimate the pdf of \(X + Y\) when \(X\) and \(Y\) are independent exponential random variables with rate 2.

simdata <- replicate(10000, {
  x <- rexp(1, 2)
  y <- rexp(1, 2)
  x + y
})
hist(simdata,
  probability = TRUE,
  main = "Histogram of Sum of Exponentials",
  xlab = "X + Y"
)

The sum of exponential random variables follows what is called a gamma distribution. The gamma distribution is represented in R via the base gamma together with the typical prefixes dpqr. A gamma random variable has two parameters, the shape and the rate. It turns out, see Exercise 5.26, that:

  1. The sum of independent gamma random variables with the same rate rate and shapes s1 and s2 is again gamma with rate rate and shape s1 + s2.
  2. A gamma random variable with shape 1 is exponential.

Combining those two facts means that the sum of the independent exponential random variables with rate 2 in this example should be gamma with shape 2 and rate 2.

hist(simdata,
  probability = TRUE,
  main = "Sum of Exponentials is Gamma",
  xlab = "X + Y",
  ylim = c(0, .72)
) # so full curve will be on graph
curve(dgamma(x, shape = 2, rate = 2), add = T, col = 2)

Example 5.17

Estimate the density of \(X_1 + X_2 + \cdots + X_{20}\) when all of the \(X_i\) are independent exponential random variables with rate 2.

This one is trickier and is our first example where it is easier to use replicate to create the data. Let’s build up the experiment that we are replicating from the ground up.

Here’s the experiment of summing 20 exponential rvs with rate 2:

sum(rexp(20, 2))
## [1] 8.325553

Now replicate it (10 times to test):

replicate(10, sum(rexp(20, 2)))
##  [1] 11.898357  6.407845  8.360866  9.297432 12.212909 11.200338
##  [7]  8.746138 10.663090 11.940880  8.289535

Of course, we don’t want to just replicate it 10 times, we need about 10,000 data points to get a good density estimate.

sumExpData <- replicate(10000, sum(rexp(20, 2)))
plot(density(sumExpData),
  main = "Density of Sum of 20 Exponentials",
  xlab = "X1 + ... + X20"
)
curve(dgamma(x, shape = 20, rate = 2), add = T, col = 2)

As explained in the previous example, this is exactly a gamma random variable with rate 2 and shape 20, and it is also starting to look like a normal random variable! Exercise 5.17 asks you to fit a normal curve to this distribution.

5.4 The Central Limit Theorem

Definition 5.18

Random variables \(X_1,\ldots,X_n\) are called independent identically distributed or iid if the variables are mutually independent, and each \(X_i\) has the same probability distribution.

All of the r + distname functions (rnorm, rexp, etc.) produce samples from iid random variables. In the real world, iid rvs are produced by taking a random sample of size \(n\) from a population. Ideally, this sample would be obtained by numbering all members of the population and then using a random number generator to select members for the sample. As long as the population size is much larger than \(n\), the sampled values will be independent, and all will have the same probability distribution as the population distribution. However, in practice it is quite difficult to produce a true random sample from a real population, and failure to sample randomly can introduce error into mathematical results that depend on the assumption that random variables are iid.

Definition 5.19

Let \(X_1, \dotsc, X_n\) be a random sample. A random variable \(Y = h(X_1, \dotsc, X_n)\) that is derived from the random sample is called a sample statistic or sometimes just a statistic. The probability distribution of \(Y\) is called a sampling distribution.

Three of the most important statistics are the sample mean, sample variance, and sample standard deviation.

Definition 5.20

Assume \(X_1,\ldots,X_n\) are independent and identically distributed with mean \(\mu\) and standard deviation \(\sigma\). Define:

  • The sample mean \[ \overline{X} = \frac{X_1 + \dotsb + X_n}{n}. \]
  • The sample variance \[ S^2 = \frac 1{n-1} \sum_{i=1}^n (X_i - \overline{X})^2 \]
  • The sample standard deviation is \(S\), the square root of the sample variance.

These three statistics are computed in R with the commands mean, var, and sd that we have been using all along.

A primary goal of statistics is to learn something about a population (the distribution of the \(X_i\)’s) by studying sample statistics. We can make precise statements about a population from a random sample because it is possible to describe the sampling distributions of statistics like \(\overline{X}\) and \(S\) even in the absence of information about the distribution of the \(X_i\). The most important example of this phenomenon is the Central Limit Theorem, which is fundamental to statistical reasoning.

We first establish basic properties of the distribution of the sample mean.

Proposition 5.1

The sample mean \(\overline{X}\) has mean and variance given by \[ E[\overline{X}] = \mu \] \[ \text{Var}(\overline{X}) = \sigma^2/n \]

Proof:

From linearity of expected value (Theorem 3.8), \[ E[\overline{X}] = E[\frac{X_1 + \dotsb X_n}{n}] = \frac{1}{n}\Bigl(E[X_1] + \dotsb + E[X_n]\Bigr) = \frac{1}{n}(n\mu) = \mu. \]

Since the \(X_i\) are mutually independent, Theorem 3.10 applies and: \[ \text{Var}[\overline{X}] = \text{Var}\left(\frac{1}{n} \sum_i X_i\right) = \frac{1}{n^2}\text{Var}\left(\sum_i X_i\right) = \frac{1}{n^2}\sum_i \text{Var}(X_i) = \frac{1}{n^2}\cdot n \sigma^2 = \frac{\sigma^2}{n}\]

From Proposition 5.1, the random variable \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) has mean 0 and standard deviation 1. When the population is normally distributed, Theorem 5.1 implies the following:

Proposition 5.2

If \(X_1, \dotsc, X_n\) are iid normal, then \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} = Z\), where \(Z\) is a standard normal rv.

Remarkably, for large sample sizes we can still describe the distribution of \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) even when we don’t know that the population is normal. This is the Central Limit Theorem.

Theorem 5.3 (Central Limit Theorem)

Let \(X_1,\ldots,X_n\) be iid rvs with finite mean \(\mu\) and standard deviation \(\sigma\). Then \[ \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \to Z \qquad \text{as}\ n \to \infty \] where \(Z\) is a standard normal rv.

We will not prove Theorem 5.3, but we will do simulations for several examples. They will all follow a similar format.

Example 5.21

Let \(X_1,\ldots,X_{30}\) be independent Poisson random variables with rate 2. From our knowledge of the Poisson distribution, each \(X_i\) has mean \(\mu = 2\) and standard deviation \(\sigma = \sqrt{2}\). The central limit theorem says that \[ \frac{\overline{X} - 2}{\sqrt{2}/\sqrt{30}} \] will be approximately normal with mean 0 and standard deviation 1 if \(n = 30\) is a large enough sample size. We will investigate later how to determine what a large enough sample size would be, but for now we assume that for this case, \(n = 30\) is big enough. Let us check this with a simulation.

This is a little but more complicated than our previous examples, but the idea is still the same. We create an experiment which computes \(\overline{X}\) and then transforms by subtracting 2 and dividing by \(\sqrt{2}/\sqrt{30}\).

Here is a single experiment:

xbar <- mean(rpois(30, 2))
(xbar - 2) / (sqrt(2) / sqrt(30))
## [1] 0.1290994

Now, we replicate and plot:

poisData <- replicate(10000, {
  xbar <- mean(rpois(30, 2))
  (xbar - 2) / (sqrt(2) / sqrt(30))
})
plot(density(poisData),
  main = "Standardized sum of Poisson Compared to Normal",
  xlab = "x"
)
curve(dnorm(x), add = T, col = 2)

Very close!

Example 5.22

Let \(X_1,\ldots,X_{50}\) be independent exponential random variables with rate 1/3. From our knowledge of the exponential distribution, each \(X_i\) has mean \(\mu = 3\) and standard deviation \(\sigma = 3\). The central limit theorem says that \[ \frac{\overline{X} - 3}{3/\sqrt{n}} \] is approximately normal with mean 0 and standard deviation 1 when \(n\) is large. We check this with a simulation in the case \(n = 50\). Again, we can either use histograms or density estimates for this, but this time we choose to use density estimates.

expData <- replicate(10000, {
  dat <- mean(rexp(50, 1 / 3))
  (dat - 3) / (3 / sqrt(50))
})
plot(density(expData),
  main = "Standardized sum of exponentials compared to normal",
  xlab = "x"
)
curve(dnorm(x), add = T, col = 2)

The Central Limit Theorem says that, if you take the mean of a large number of independent samples, then the distribution of that mean will be approximately normal. How large does \(n\) need to be in practice?

  • If the population you are sampling from is symmetric with no outliers, a good approximation to normality appears after as few as 15-20 samples.
  • If the population is moderately skew, such as exponential or \(\chi^2\), then it can take between 30-50 samples before getting a good approximation.
  • Data with extreme skewness, such as some financial data where most entries are 0, a few are small, and even fewer are extremely large, may not be appropriate for the central limit theorem even with 500 samples (see Example 5.23).

There are versions of the Central Limit Theorem available when the \(X_i\) are not iid, but outliers of sufficient size will cause the distribution of \(\overline{X}\) to not be normal, see Exercise 5.33.

Example 5.23

A distribution for which sample size of \(n = 500\) is not sufficient for good approximation via normal distributions.

We create a distribution that consists primarily of zeros, but has a few modest sized values and a few large values:

# Start with lots of zeros
skewdata <- replicate(2000, 0)
# Add a few moderately sized values
skewdata <- c(skewdata, rexp(200, 1 / 10))
# Add a few large values
skewdata <- c(skewdata, seq(100000, 500000, 50000))

mu <- mean(skewdata)
sig <- sd(skewdata)

We use sample to take a random sample of size \(n\) from this distribution. We take the mean, subtract the true mean of the distribution, and divide by \(\sigma/\sqrt{n}\). We replicate that 10000 times to estimate the sampling distribution. Here are the plots when \(\overline{X}\) is the mean of a sample of size 100:

n <- 100

skewSample <- replicate(10000, {
  dat <- mean(sample(skewdata, n, TRUE))
  (dat - mu) / (sig / sqrt(n))
})
hist(skewSample,
  probability = T,
  main = "Histogram of Standardized Mean of 100 Samples",
  xlab = "X"
)
curve(dnorm(x), add = T, col = 2)

When \(\overline{X}\) is the mean of a sample of size 500:

n <- 500

skewSample <- replicate(10000, {
  dat <- mean(sample(skewdata, n, TRUE))
  (dat - mu) / (sig / sqrt(n))
})

hist(skewSample,
  probability = T,
  main = "Histogram of Standardized Mean of 500 Samples",
  xlab = "x"
)
curve(dnorm(x), add = T, col = 2)

Even with a sample size of 500, the density is still far from the normal distribution, especially the lack of left tail. Of course, the Central Limit Theorem is still true, so \(\overline{X}\) must become approximately normal if we choose \(n\) large enough. Here is the distribution when \(n = 5000\):

n <- 5000

skewSample <- replicate(10000, {
  dat <- mean(sample(skewdata, n, TRUE))
  (dat - mu) / (sig / sqrt(n))
})

hist(skewSample,
  probability = T,
  main = "Histogram of Standardized Mean of 5000 Samples",
  xlab = "x"
)
curve(dnorm(x), add = T, col = 2)

There may still be a slight skewness in this picture, but it is finally close to normal.

5.5 Sampling distributions

We describe the \(\chi^2\), \(t\), and \(F\) distributions, which are examples of distributions that are derived from random samples from a normal distribution. The emphasis is on understanding the relationships between the random variables and how they can be used to describe distributions related to the sample statistics \(\overline{X}\) and \(S\). Your goal should be to get comfortable with the idea that sample statistics have known distributions. You may not be able to prove relationships that you see later in this book, but with careful study of this chapter you won’t be surprised, either. We will use density estimation extensively to illustrate the relationships.

5.5.1 The \(\chi^2\) distribution

Definition 5.24

Let \(Z\) be a standard normal random variable. An rv with the same distribution as \(Z^2\) is called a Chi-squared random variable with one degree of freedom.

Let \(Z_1, \dotsc, Z_n\) be \(n\) iid standard normal random variables. Then an rv with the same distribution as \(Z_1^2 + \dotsb + Z_n^2\) is called a Chi-squared random variable with \(n\) degrees of freedom.

Remark

The sum of \(n\) independent \(\chi^2\) rvs with 1 degree of freedom is a \(\chi^2\) rv with \(n\) degrees of freedom. More generally, the sum of a \(\chi^2\) with \(\nu_1\) degrees of freedom and a \(\chi^2\) rv with \(\nu_2\) degrees of freedom is \(\chi^2\) rv with \(\nu_1 + \nu_2\) degrees of freedom.

Let’s check by estimating pdfs that the sum of a \(\chi^2\) rv with 2 df and a \(\chi^2\) rv with 3 df is a \(\chi^2\) rv with 5 df.

chi2Data <- replicate(10000, rchisq(1, 2) + rchisq(1, 3))
hist(chi2Data,
  probability = TRUE,
  main = "Histogram of sum of Chi^2 Random Variables",
  xlab = "x + y"
)
curve(dchisq(x, df = 5), add = T, col = 2)

The \(\chi^2\) distribution is important for understanding the sample variance \(S^2\).

Theorem 5.4

If \(X_1,\ldots,X_n\) are iid normal rvs with mean \(\mu\) and standard deviation \(\sigma\), then \[ \frac{n-1}{\sigma^2} S^2 \] has a \(\chi^2\) distribution with \(n-1\) degrees of freedom.

We provide a heuristic and a simulation as evidence that Theorem 5.4 is true. Additionally, in Exercise 5.39 you are asked to prove Theorem 5.4 in the case that \(n = 2\).

Heuristic argument: Note that \[ \frac{n-1}{\sigma^2} S^2 = \frac 1{\sigma^2}\sum_{i=1}^n \left(X_i - \overline{X}\right)^2 \\ = \sum_{i=1}^n \left(\frac{X_i - \overline{X}}{\sigma}\right)^2 \] Now, if we had \(\mu\) in place of \(\overline{X}\) in the above equation, we would have exactly a \(\chi^2\) with \(n\) degrees of freedom. Replacing \(\mu\) by its estimate \(\overline{X}\) reduces the degrees of freedom by one, but it remains \(\chi^2\).

For the simulation, we estimate the density of \(\frac{n-1}{\sigma^2} S^2\) and compare it to that of a \(\chi^2\) with \(n-1\) degrees of freedom, in the case that \(n = 4\), \(\mu = 3\) and \(\sigma = 9\).

sdData <- replicate(10000, 3 / 81 * sd(rnorm(4, 3, 9))^2)
hist(sdData,
  probability = TRUE,
  main = "Histogram of Sample Variance Compared to Chi^2",
  xlab = "S",
  ylim = c(0, .25)
)
curve(dchisq(x, df = 3), add = T, col = 2)

5.5.2 The \(t\) distribution

This section introduces Student’s \(t\) distribution. The distribution is named after William Gosset, who published under the pseudonym “Student” while working at the Guinness Brewery in the early 1900’s.

Definition 5.25

If \(Z\) is a standard normal(0,1) rv, \(\chi^2_\nu\) is a Chi-squared rv with \(\nu\) degrees of freedom, and \(Z\) and \(\chi^2_\nu\) are independent, then

\[ t_\nu = \frac{Z}{\sqrt{\chi^2_\nu/\nu}} \]

is distributed as a \(t\) random variable with \(\nu\) degrees of freedom .

The \(t\) distributions are symmetric around zero and bump-shaped, like the normal distribution, but the \(t\) distributions are heavy tailed. The heavy tails mean that \(t\) random variables have higher probability of being far from 0 than normal random variables do, in the sense that \[ \lim_{M \to \infty} \frac{P(|T| > M)}{P(|Z| > M)} = \infty \] for any normal rv \(Z\) and \(T\) having any \(t\) distribution.

The \(t\) distribution will play an important role in Chapter 8, where it appears as the result of random sampling. Theorem 5.5 establishes the connection between a random sample \(X_1,\dotsc,X_n\) and the \(t\) distribution.

Theorem 5.5

If \(X_1,\ldots,X_n\) are iid normal rvs with mean \(\mu\) and sd \(\sigma\), then \[ \frac{\overline{X} - \mu}{S/\sqrt{n}} \] is \(t\) with \(n-1\) degrees of freedom.

Proof:

\[ \frac{Z}{\sqrt{\chi^2_{n-1}/n}} = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}*\sqrt{\frac{\sigma^2 (n-1)}{S^2(n-1)}}\\ = \frac{\overline{X} - \mu}{S/\sqrt{n}} \] Where we have used that \((n - 1) S^2/\sigma^2\) is \(\chi^2\) with \(n - 1\) degrees of freedom.

Since the mean and standard deviation of \(\overline{X}\) are \(\mu\) and \(\sigma/\sqrt{n}\), the random variable \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is \(\text{Norm}(0,1)\). Replacing the denominator \(\sigma/\sqrt{n}\) with the random variable \(S/\sqrt{n}\) changes the distribution from normal to \(t\).

Definition 5.26

The random variable \(S/\sqrt{n}\) is the standard error of \(\overline{X}\).

Example 5.27

Estimate the pdf of \(\frac{\overline{X} - \mu}{S/\sqrt{n}}\) in the case that \(X_1,\ldots,X_6\) are iid normal with mean 3 and standard deviation 4, and compare it to the pdf of the appropriate \(t\) random variable.

tData <- replicate(10000, {
  normData <- rnorm(6, 3, 4)
  (mean(normData) - 3) / (sd(normData) / sqrt(6))
})
hist(tData,
  probability = TRUE,
  ylim = c(0, .37),
  main = "Histogram of t",
  xlab = "t"
)
curve(dt(x, df = 5), add = TRUE, col = 2)

We see a very good agreement.

5.5.3 The \(F\) distribution

An \(F\) distribution has the same density function as \[ F_{\nu_1, \nu_2} = \frac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2} \] when \(\chi^2_{\nu_1}\) and \(\chi^2_{\nu_2}\) are independent. We say \(F\) has \(\nu_1\) numerator degrees of freedom and \(\nu_2\) denominator degrees of freedom.

One example of this type is: \[ \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \] where \(X_1,\ldots,X_{n_1}\) are iid normal with standard deviation \(\sigma_1\) and \(Y_1,\ldots,Y_{n_2}\) are iid normal with standard deviation \(\sigma_2.\)

5.5.4 Summary

  1. \(aX+ b\) is normal with mean \(a\mu + b\) and standard deviation \(a\sigma\) if \(X\) is normal with mean \(\mu\) and sd \(\sigma\).
  2. \(\frac{X - \mu}{\sigma}\sim Z\) if \(X \sim \text{Norm}(\mu, \sigma)\).
  3. \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim Z\) if \(X_1,\ldots,X_n\) are iid \(\text{Norm}(\mu, \sigma)\)
  4. \(\lim_{n\to \infty} \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim Z\) if \(X_1,\ldots,X_n\) iid with finite variance.
  5. \(Z^2 \sim \chi^2_1\)
  6. \(\chi^2_\nu + \chi^2_\eta \sim \chi^2_{\nu + \eta}\) if independent
  7. \(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\) when \(X_1,\ldots,X_n\) iid normal.
  8. \(\frac Z{\sqrt{\chi^2_\nu/\nu}} \sim t_\nu\) if independent
  9. \(\frac{\overline{X} - \mu}{S/\sqrt{n}}\sim t_{n-1}\) when \(X_1,\ldots,X_n\) iid normal.
  10. \(\frac{\chi^2_\nu/\nu}{\chi^2_\mu/\mu} \sim F_{\nu, \mu}\) if independent
  11. \(\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F_{n_1 - 1, n_2 - 1}\) if \(X_1,\ldots,X_{n_1}\) iid normal with sd \(\sigma_1\) and \(Y_1,\ldots,Y_{n_2}\) iid normal with sd \(\sigma_2\).

5.6 Point estimators

This section explores the definition of sample mean and (especially) sample variance, in the more general context of point estimators and their properties.

Suppose that a random variable \(X\) depends on some parameter \(\beta\). In many instances we do not know \(\beta,\) but are interested in estimating it from observations of \(X\).

Definition 5.28

If \(X_1, \ldots, X_n\) are iid with the same distribution as \(X\), then a point estimator for \(\beta\) is a function \(\hat\beta = h(X_1, \ldots, X_n)\).

Typically the value of \(\hat \beta\) represents our best estimate of the parameter \(\beta\), based on the data. Note that \(\hat \beta\) is a random variable, and since \(\hat \beta\) is derived from \(X_1, \dotsc, X_n\), \(\hat\beta\) is a sample statistic and has a sampling distribution.

The two point estimators that we examine in this section are the sample mean and sample variance. The sample mean \(\overline{X}\) is a point estimate for the mean of a random variable, and the sample variance \(S^2\) is a point estimator for the variance of an rv.

Recall that the definition of the mean of a discrete rv with possible values \(x_i\) is \[ E[X] = \mu = \sum_{i=1}^n x_i p(x_i) \] If we are given a random sample of data \(x_1,\ldots,x_n\), it is natural to assign \(p(x_i) = 1/n\) for each of those data points to create a new random variable \(X_0\), and we obtain \[ E[X_0] = \frac 1n \sum_{i=1}^n x_i = \overline{x} \] This is a natural choice for our point estimator for the mean of the rv \(X\), as well. \[ \hat \mu = \overline{x} = \frac 1n \sum_{i=1}^n x_i \]

Continuing in the same way, if we are given data \(x_1,\ldots,x_n\) and we wish to estimate the variance, then we could create a new random variable \(X_0\) whose possible outcomes are \(x_1,\ldots,x_n\) with probabilities \(1/n\) and compute

\[ \sigma^2 = E[(X- \mu)^2] = \frac 1n \sum_{i=1}^n (x_i - \mu)^2 \]

This works just fine as long as \(\mu\) is known. However, most of the time, we do not know \(\mu\) and we must replace the true value of \(\mu\) with our estimate from the data. There is a heuristic “degrees of freedom” that each time you replace a parameter with an estimate of that parameter, you divide by one less. Following that heuristic, we obtain \[ \hat \sigma^2 = S^2 = \frac 1{n-1} \sum_{i=1}^n (x_i - \overline{x})^2 \]

5.6.1 Properties of point estimators

The point estimators for \(\mu\) and \(\sigma^2\) are random variables themselves, since they are computed using a random sample from a distribution. As such, they also have distributions, means and variances. One desirable property of point estimators is that they be unbiased.

Definition 5.29

A point estimator \(\hat \beta\) for \(\beta\) is an unbiased estimator if \(E[\hat \beta] = \beta\).

Intuitively, “unbiased” means that the estimator does not consistently underestimate or overestimate the parameter it is estimating. If we were to estimate the parameter over and over again, the average value of the estimates would converge to the correct value of the parameter.

Example 5.30

Let \(X_1,\ldots, X_{5}\) be a random sample from a normal distribution with mean 1 and variance 1. Compute \(\overline{x}\) based on the random sample.

mean(rnorm(5, 1, 1))
## [1] 0.9330305

You can see the estimate is pretty close to the correct value of 1, but not exactly 1. However, if we repeat this experiment 10,000 times and take the average value of \(\overline{x}\), that should be very close to 1.

mean(replicate(10000, mean(rnorm(5, 1, 1))))
## [1] 1.006326

It can be useful to repeat this a few times to see whether it seems to be consistently overestimating or underestimating the mean. In this case, it is not.

Proposition 5.3

For a random sample from a population with mean \(\mu\), the sample mean \(\overline{X}\) is an unbiased estimator of \(\mu\).

Proof:

Let \(X_1, \ldots, X_n\) be a random sample from the population, so \(E[X_i] = \mu\). Then \[ E[\overline{X}] = E\left[\frac{X_1 + \dotsb + X_n}{n}\right] = \frac{1}{n}\left(E[X_1] + \dotsb + E[X_n]\right) = \mu.\]

Now consider the sample variance \(S^2\). The formula \[ S^2 = \frac {1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2 \] contains a somewhat surprising division by \(n-1\), since the sum has \(n\) terms. We explore this with simulation.

Example 5.31

Let \(X_1,\ldots,X_5\) be a random sample from a normal distribution with mean 1 and variance 4. Estimate the variance using \(S^2\).

sd(rnorm(5, 1, 2))^2
## [1] 3.393033

We see that our estimate is not ridiculous, but not the best either. Let’s repeat this 10,000 times and take the average. (We are estimating \(E[S^2]\).)

mean(replicate(10000, sd(rnorm(5, 1, 2))^2))
## [1] 3.992801

Again, repeating this a few times, it seems that we are neither overestimating nor underestimating the value of the variance.

Proposition 5.4

For a random sample from a population with mean \(\mu\) and variance \(\sigma^2\), the sample variance \(S^2\) is an unbiased estimator of \(\sigma^2\).

The proof of Proposition 5.4 is left as Exercise 5.45.

Example 5.32

In this example, we see that dividing by \(n\) would lead to a biased estimator. That is, \(\frac 1n \sum_{i=1}^n (X_i - \overline{X})^2\) is not an unbiased estimator of \(\sigma^2\).

We use replicate twice. On the inside, it is running 10000 simulations. On the outside, it is repeating the overall simulation 7 times.

replicate(
  7,
  mean(replicate(10000, {
    data <- rnorm(5, 1, 2)
    1 / 5 * sum((data - mean(data))^2)
  }))
)
## [1] 3.242466 3.212477 3.231688 3.187677 3.157526 3.218117 3.227146

In each of the 7 trials, \(\frac 1n \sum_{i=1}^n (X_i - \overline{X})^2\) underestimates the correct value \(\sigma^2 = 4\) by quite a bit.

In summary, \(S^2\) is an unbiased estimator of the population variance \(\sigma^2\).

5.6.2 Variance of unbiased estimators

From the above, we can see that \(\overline{x}\) and \(S^2\) are unbiased estimators for \(\mu\) and \(\sigma^2\), respectively. There are other unbiased estimator for the mean and variance, however.

The median of a collection of numbers is the middle number, after sorting (or the average of the two middle numbers). When a population is symmetric about its mean \(\mu\), the median is an unbiased estimator of \(\mu\). If the population is not symmetric, the median is not typically an unbiased estimator for \(\mu\), see Exercise 5.40 and Exercise 5.41.

If \(X_1,\ldots,X_n\) is a random sample from a normal rv, then the median of the \(X_i\) is an unbiased estimator for the mean. Moreover, the median seems like a perfectly reasonable thing to use to estimate \(\mu\), and in many cases is actually preferred to the mean.

There is one way, however, in which the sample mean \(\overline{x}\) is definitely better than the median, and that is that the mean has a lower variance than the median. This means that \(\overline{x}\) does not deviate from the true mean as much as the median will deviate from the true mean, as measured by variance.

Let’s do a simulation to illustrate. Suppose you have a random sample of size 11 from a normal rv with mean 0 and standard deviation 1. We estimate the variance of the mean \(\overline{x}\).

var(replicate(10000, mean(rnorm(11, 0, 1))))
## [1] 0.08961579

Now we repeat for the median.

var(replicate(10000, median(rnorm(11, 0, 1))))
## [1] 0.1400155

We see that the variance of the mean is lower than the variance of the median. This is a good reason to use the sample mean to estimate the mean of a normal rv rather than using the median, absent other considerations such as outliers.

5.6.3 MSE and bias-variance decomposition

Let \(\hat \beta\) be a point estimator for \(\beta\). One way to quantify how well \(\hat \beta\) estimates \(\beta\) is to compute the mean squared error given by the formula \[ \mathrm{MSE}(\hat \beta) = E[(\hat \beta - \beta)^2]. \] The bias of the point estimator \(\hat \beta\) is defined to be the difference between the expected value of \(\hat \beta\) and $, \[ \mathrm{Bias}(\hat \beta) = E[\hat \beta] - \beta. \] In particular, \(\hat \beta\) is unbiased exactly when \(\mathrm{Bias}(\hat \beta) = 0\).

Theorem 5.6

Let \(\hat \beta\) be a point estimator for \(\beta\). \[ \mathrm{MSE}(\hat \beta) = \left(\mathrm{Bias}(\hat \beta)\right)^2 + \mathrm{Var}(\hat \beta) \]

Proof:

\[\begin{align*} E[(\hat \beta - \beta)^2] &= E[(\hat \beta - E[\hat \beta] + E[\hat \beta] - \beta)^2]\\ &= E[(\hat \beta - E[\hat \beta])^2] + E[(E[\hat \beta] - \beta)^2] + 2E[(\hat \beta - E[\hat \beta])(E[\hat \beta] - \beta)] \end{align*}\] Since \(E\left[\hat \beta - E[\hat \beta]\right] = 0\) and \(E[\hat \beta] - \beta\) is a constant, the last line above simplifies to \(\left(\mathrm{Bias}(\hat \beta)\right)^2 + \mathrm{Var}(\hat \beta)\).

Example 5.33

Suppose you wish to estimate the MSE of the sample standard deviation \(S\) when estimating \(\sigma\), assuming that \(S\) is computed from a random sample of size 10 from a normal distribution with mean 1 and standard deviation 2. We could either compute the MSE directly, or we could compute the variance and bias of \(S\) and apply Theorem 5.6. Let’s see how those two approaches differ.

We start by estimating the MSE directly.

sim_data <- replicate(10000, {
  sam <- rnorm(10, 1, 2)
  s <- sd(sam)
  (s - 2)^2
})
mean(sim_data)
## [1] 0.2173475

Now, we estimate the variance and bias separately, and add the bias squared to the variance.

bias_estimate <- mean(replicate(10000, {
  sam <- rnorm(10, 1, 2)
  s <- sd(sam)
  (s - 2)
}))
variance_estimate <- var(replicate(10000, {
  sam <- rnorm(10, 1, 2)
  s <- sd(sam)
  s
}))
variance_estimate + bias_estimate^2
## [1] 0.2169428

We see that we get approximately the same answer for the estimate of the MSE, which provides evidence for Theorem 5.6.

Many times, it is possible to replace an unbiased estimator with a biased estimator that has lower mean squared error. For example, let’s consider the sample variance \(S^2\) as an estimator for \(\sigma^2\) based on independent and identically distributed \(X_1, \ldots, X_{20}\) which are normal with mean 0 and standard deviation 2. As a reminder, the formula for the sample variance is \[ S^2 = \frac 1{n - 1}\sum_{ i = 1}^n\left(X_i - \overline{X}\right)^2, \] and \(S^2\) is an unbiased estimator for \(\sigma^2\). Therefore, the MSE of \(S^2\) is its variance, and we can estimate it as follows.

sim_data <- replicate(10000, {
  dat <- rnorm(20, 0, 2)
  var(dat)
})
var(sim_data)
## [1] 1.713379
mean((sim_data - 4)^2)
## [1] 1.713287

We get slightly different estimates when we estimate the variance of \(S^2\) or when we estimate the MSE directly, but either way we get an estimate for the MSE of about 1.68.

Now, if we want to reduce the variance of our estimator (while increasing the bias), we can use the esimator \[ \hat\sigma^2 = \frac 1{n}\sum_{ i = 1}^n\left(X_i - \overline{X}\right)^2. \] In order to estimate the MSE of this estimator, we can either estimate the variance and the bias separately or the MSE directly. We show both ways below.

sim_data <- replicate(10000, {
  dat <- rnorm(20, 0, 2)
  mu <- mean(dat)
  1 / 20 * sum((dat - mu)^2)
})
var(sim_data)
## [1] 1.576474

Compare the estimated variance of \(\hat \sigma^2\), 1.58 to the estimated variance of \(S^2\), which we estimated above to be 1.71.

mean((sim_data - 4))^2 # estimated bias squared
## [1] 0.02684131
mean((sim_data - 4)^2) # estimated MSE
## [1] 1.603157

In this case, we can see that the variance of \(\hat \sigma^2\) is considerably less than the variance of \(S^2\), and while the bias squared of \(\hat \sigma^2\) is positive, it is less than the reduction in variance. Therefore, the MSE of \(\hat \sigma^2\) is lower than the MSE of \(S^2\).

Vignette: Loops in R

Most tasks in R can be done using the implicit loops described in this section. There are also powerful tools in the purrr package. However, sometimes it is easiest just to write a conventional loop. This vignette explains how to use while loops in the context of simulations.

A while loop repeats one or more statements until a certain condition is no longer met. It is crucial when writing a while loop that you ensure that the condition to exit the loop will eventually be satisfied! The format looks like this.

i <- 0 # initialize
while (i < 2) { # check condition
  print(i)
  i <- i + 1 # increment
}
## [1] 0
## [1] 1

R sets i to 0 and then checks whether i < 2 (it is). It then prints i and adds one to i. It checks again - is i < 2? Yes, so it prints i and adds one to it. At this point i is not less than 2, so R exits the loop.

As a more realistic example of while loops in action, suppose we want to estimate the expected value of a geometric random variable \(X\).

Example 3.13 simulated tossing a fair coin until heads occurs, and counting the number of tails. In that case, \(X\) is geometric with \(p=0.5\) and so has \(E[X] = 1\). The approach was to simulate 100 coin flips and assume that heads would appear before we ran out of flips. The probability of flipping 100 tails in a row is negligible, so 100 flips was enough.

To simulate a geometric rv where \(p\) is small and unknown it may take a large number of trials before observing the first success. If each Bernoulli trial is actually a complicated calculation, we want to stop as soon as the first success occurs. Under these circumstances, it is better to use a while loop.

Here is a simulation of flipping a coin until heads, using a while loop. In each iteration of the loop, we flip the coin with sample, print the result, and then add one to the count num_tails.

num_tails <- 0
coin_toss <- ""
while (coin_toss != "Head") {
  coin_toss <- sample(c("Head", "Tail"), 1)
  print(coin_toss)
  if (coin_toss == "Tail") {
    num_tails <- num_tails + 1
  }
}
## [1] "Tail"
## [1] "Tail"
## [1] "Head"
num_tails
## [1] 2

This gives one simulated value of \(X\). In this case we found \(X = 2\), because two tails appeared before the first head. To estimate \(E[X]\), replicate the experiment.

sim_data <- replicate(10000, {
  num_tails <- 0
  coin_toss <- ""
  while (coin_toss != "Head") {
    coin_toss <- sample(c("Head", "Tail"), 1)
    if (coin_toss == "Tail") {
      num_tails <- num_tails + 1
    }
  }
  num_tails
})
mean(sim_data)
## [1] 0.9847

The mean of our simulated data is 0.9847, which agrees with the theoretical value \(E[X] = 1\).

Exercises


Exercises 5.15.2 require material through Section 5.1.

5.1

Let \(Z\) be a standard normal random variable. Estimate via simulation \(P(Z^2 < 2)\).

5.2

Let \(X\) and \(Y\) be independent exponential random variables with rate 3. Let \(Z = \max(X, Y)\) be the maximum of \(X\) and \(Y\). a. Estimate via simulation \(P(Z < 1/2)\). b. Estimate the mean and standard deviation of \(Z\).


Exercises 5.35.13 require material through Section 5.2.

5.3

Five coins are tossed and the number of heads \(X\) is counted. Estimate via simulation the pmf of \(X\).

5.4

Three dice are tossed and their sum \(X\) is observed. Use simulation to estimate and plot the pmf of \(X\).

5.5

Seven balls numbered 1-7 are in an urn. Two balls are drawn from the urn without replacement and the sum \(X\) of the numbers is computed.

  1. Estimate via simulation the pmf of \(X\).
  2. What are the least likely outcomes of \(X\)?
5.6

Five six-sided dice are tossed and their product is observed. Use the estimated pmf to find the most likely outcome. (The R function prod computes the product of a vector.)

5.7

Fifty people put their names in a hat. They then all randomly choose one name from the hat. Let \(X\) be the number of people who get their own name. Estimate and plot the pmf of \(X\).

5.8

Consider an experiment where 20 balls are placed randomly into 10 urns. Let \(X\) denote the number of urns that are empty.

  1. Estimate via simulation the pmf of \(X\).
  2. What is the most likely outcome?
  3. What is the least likely outcome that has positive probability?
5.9 (Hard)

Suppose 6 people, numbered 1-6, are sitting at a round dinner table with a big plate of spaghetti, and a bag containing 5005 red marbles and 5000 blue marbles. Person 1 takes the spaghetti, serves themselves, and chooses a marble. If the marble is red, they pass the spaghetti to the left (person number 2). If it is blue, they pass the spaghetti to the right (person number 6). The guests continue doing this until the last person receives the plate of spaghetti. Let \(X\) denote the number of the person holding the spaghetti at the end of the experiment. Estimate the pmf of \(X\).

5.10

Suppose you roll a die until the first time you obtain an even number. Let \(X_1\) be the total number of times you roll a 1, and let \(X_2\) be the total number of times that you roll a 2.

  1. Is \(E[X_1] = E[X_2]\)? (Hint: Use simulation. It is extremely unlikely that you will roll 30 times before getting the first even number, so you can safely assume that the first even occurs inside of 30 rolls.)
  2. Is the pmf of \(X_1\) the same as the pmf of \(X_2\)?
  3. Estimate the variances of \(X_1\) and \(X_2\).
5.11

Simulate creating independent uniform random numbers in [0,1] and summing them until your cumulative sum is larger than 1. Let \(N\) be the random variable which is how many numbers you needed to sample. For example, if your numbers were 0.35, 0.58, 0.22 you would have \(N = 3\) since the sum exceeds 1 after the third number. What is the expected value of \(N\)?

5.12

Recall the Example 5.8 in the text, where we show that the most likely number of times you have more Heads than Tails when a coin is tossed 100 times is zero. Suppose you toss a coin 100 times.

  1. Let \(X\) be the number of times in the 100 tosses that you have exactly the same number of Heads as Tails. Estimate the expected value of \(X\).
  2. Let \(Y\) be the number of tosses for which you have more Heads than Tails. Estimate the expected value of \(Y\).
5.13

Suppose there are two candidates in an election. Candidate A receives 52 votes and Candidate B receives 48 votes. You count the votes one at a time, keeping a running tally of who is ahead. At each point, either A is ahead, B is ahead or they are tied. Let \(X\) be the number of times that Candidate B is ahead in the 100 tallies.

  1. Estimate the pmf of \(X\) and plot it.
  2. Estimate \(P(X > 50)\).

Exercises 5.145.25 require material through Section 5.3.

5.14

Let \(X\) and \(Y\) be independent uniform random variables on the interval \([0,1]\). Estimate the pdf of \(X+Y\) and plot it.

5.15

Let \(X\) and \(Y\) be independent uniform random variables on the interval \([0,1]\). Let \(Z\) be the maximum of \(X\) and \(Y\).

  1. Plot the pdf of \(Z\).
  2. From your answer to (a), decide whether \(P(0 \le Z \le 1/3)\) or \(P(1/3\le Z \le 2/3)\) is larger.
5.16

Let \(X\) and \(Y\) be independent normal random variables with means \(\mu_X = 0, \mu_Y = 8\) and standard deviations \(\sigma_X = 3\) and \(\sigma_Y = 4\).

  1. What are the mean and variance of \(X+Y\)?
  2. Simulate the distribution of \(X + Y\) and plot it. Add a normal pdf to your plot with mean and standard deviation to match the density of \(X + Y\).
  3. What are the mean and standard deviation of \(5X - Y/2\)?
  4. Simulate the distribution of \(5X - Y/2\) and plot it. Add a normal pdf to your plot with mean and standard deviation to match the density of \(5X-2Y\).
5.17

Example 5.17 plotted the probability density function for the sum of 20 independent exponential random variables with rate 2. The density appeared to be approximately normal.

  1. What are \(\mu\) and \(\sigma\) for the exponential distribution with rate 2?
  2. What should \(\mu\) and \(\sigma\) be for the sum of 20 independent rvs which are exponential with rate 2?
  3. Plot the density for the sum of exponentials and add the pdf of \(\text{Normal}(\mu,\sigma)\) to your plot. How well does it seem to fit?
5.18

Estimate the value of \(a\) such that \(P(a \le Y \le a + 1)\) is maximized when \(Y\) is the maximum value of two independent exponential random variables with mean 2.

5.19

Is the product of two normal rvs normal? Estimate via simulation the pdf of the product of \(X\) and \(Y\), when \(X\) and \(Y\) are independent normal random variables with mean 0 and standard deviation 1. How can you determine whether \(XY\) is normal? (Hint: you will need to estimate the mean and standard deviation.)

5.20

The minimum of two independent exponential rvs with mean 2 is an exponential rv. Use simulation to determine what the rate is.

5.21

The sum of two independent chi-squared rvs with 2 degrees of freedom is either exponential or chi-squared. Which one is it? What is the parameter associated with your answer?

5.22

Consider the Log Normal distribution, whose root in R is lnorm. The log normal distribution takes two parameters, meanlog and sdlog.

  1. Graph the pdf of a Log Normal random variable with meanlog = 0 and sdlog = 1. The pdf of a Log Normal rv is given by dlnorm(x, meanlog, sdlog).
  2. Let \(X\) be a Log Normal rv with meanlog = 0 and sdlog = 1. Estimate via simulation the density of \(\log(X)\), and compare it to a standard normal random variable.
5.23

The beta distribution plays an important role in Bayesian statistics. It has two parameters, called shape parameters.

Let \(X\) and \(Y\) be independent uniform rvs on the interval \([0,1]\). Estimate via simulation the pdf of the maximum of \(X\) and \(Y\), and compare it to the pdf of a beta distribution with parameters shape1 = 2 and shape2 = 1. (Use dbeta().)

5.24

Estimate the density of the maximum of 7 independent uniform random variables on the interval \([0,1]\). This density is also beta. What shape parameters are required for the beta distribution to match your estimated density?

5.25

If \(X_1, \ldots, X_7\) are iid uniform(0, 1) random variables, and let \(Y\) be the second largest of the \(X_1, \ldots, X_7\).

  1. Estimate the pdf of \(Y\)
  2. This pdf is also beta. What are the shape parameters?
5.26

Let \(X_1\) and \(X_2\) be independent gamma random variables with rate 2 and shapes 3 and 4. Confirm via simulation that \(X_1 + X_2\) is gamma with rate 2 and shape 7.

5.27

Plot the pdf of a gamma random variable with shape 1 and rate 3, and confirm that is the same pdf as an exponential random variable with rate 3.

5.28

If \(X\) is a gamma random variable with rate \(\beta\) and shape \(\alpha\), use simulation to show that \(\frac 1cX\) is a gamma random variable with rate \(c\beta\) and shape \(\alpha\) for two different choices of positive \(\alpha\), \(\beta\) and \(c\).


Exercises 5.295.33 require material through Section 5.4.

5.29

Let \(X_1,\ldots,X_{n}\) be independent uniform rvs on the interval \((0,1)\).

  1. What are the mean \(\mu\) and the sd \(\sigma\) of a uniform rv on the interval \((0,1)\)?
  2. How large does \(n\) need to be before the pdf of \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is approximately that of a standard normal rv?

(Note: there is no “right” answer, as it depends on what is meant by “approximately.” You should try various values of \(n\) until you find the lowest one where the estimate of the \(\overline{X}\) distribution is still close to standard normal.)

5.30

Let \(X_1,\ldots,X_{n}\) be independent exponential rvs with rate \(\lambda = 10\)?

  1. What are the mean \(\mu\) and the sd \(\sigma\) of an exponential rv with rate 10?
  2. How large does \(n\) need to be before the pdf of \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is approximately that of a standard normal rv?
5.31

Let \(X_1,\ldots,X_{n}\) be independent chi-squared rvs with 2 degrees of freedom.

  1. What are the mean \(\mu\) and the sd \(\sigma\) of a chi-squared rv with 2 degrees of freedom?
  2. How large does \(n\) need to be before the pdf of \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is approximately that of a standard normal rv?
5.32

Let \(X_1,\ldots,X_{n}\) be independent binomial rvs with \(n = 10\) and \(p = 0.8\).

  1. What are the mean \(\mu\) and the sd \(\sigma\) of the \(\text{Binom}(10,0.8)\) distribution?
  2. How large does \(n\) need to be before the pdf of \(\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}\) is approximately that of a standard normal rv?
5.33

Let \(X_1, \ldots, X_{1000}\) be a random sample from a uniform random variable on the interval \([-1,1]\). Suppose this sample is contaminated by a single outlier that is chosen uniformly in the interval \([200, 300]\), so that there are in total 1001 samples. Plot the estimated pdf of \(\overline{X}\) and give convincing evidence that it is not normal.

5.34

The Central Limit Theorem describes the distribution for the mean of iid random variables. This exercise explores the distribution of the median of iid random variables. Suppose \(X_1, \dotsc, X_n\) are iid continuous random variables with pdf \(f(x)\), median \(\theta\), and with the additional condition that the cdf \(F\) has derivative greater than 0 at the median, i.e. \(F'(\theta) > 0\). Then for \(n\) large, the median is approximately normal with mean \(\theta\) and standard deviation \(1/(2\sqrt{n}f(\theta))\). You will check this for some examples using simulation. The R function to compute the median of a sample is median.

  1. Let \(X_1, \ldots, X_n\) be independent uniform random variables on \([-1, 1]\). The median is 0 and \(f(0) = 1/2\). Show that for large \(n\), the median of \(X_1, \ldots, X_n\) is approximately normal with mean 0 and standard deviation \(1/\sqrt{n}\).
  2. Let \(X_1, \ldots, X_n\) be iid exponential random variables with rate 1. Find an \(n\) so that the pdf of the median of \(X_1, \ldots, X_n\) is approximately normally distributed. The true median of an exponential random variable with rate 1 is \(\ln 2\).
  3. Let \(X_1, \ldots, X_n\) be iid binomial random variables with size 3 and \(p = 1/2\). Note this does not satisfy the hypotheses that would guarantee that the median is approximately normal. Choose \(n = 10, 100\) and \(n = 1000\), and see whether the median appears to be approximately normally distributed as \(n\) gets bigger. The true median of a binomial with parameters 3 and \(1/2\) is 1.5.
5.35

In this exercise, we investigate the importance of the assumption of finite mean and variance in the statement of the Central Limit Theorem. Let \(X_1, \ldots, X_n\) be iid random variables with a \(t\) distribution with one degree of freedom, also called the Cauchy distribution. You can sample from such a \(t\) random variable using rt(N, df = 1).

  1. Use dt(x,1) to plot the pdf of a \(t\) random variable with one degree of freedom.
  2. Confirm for \(N = 100, 1000\) or \(10000\) that mean(rt(N, 1)) does not give consistent results. This is because \(\int_{-\infty}^\infty |x| \mathrm{dt(x, 1)}\, dx = \infty\), so the mean of a \(t\) random variable with 1 degree of freedom does not exist.
  3. Estimate by simulation the pdf of \(\overline{X}\) for \(N = 100, 1000, 10000\). To visualize this distribution, use a histogram with breaks = c(-Inf, -20:20, Inf) and xlim = c(-20,20). Check by adding a curve that \(\overline{X}\) has the \(t\) distribution with 1 df no matter what \(N\) you choose.
  4. Does the Central Limit Theorem hold for this distribution?

Exercises 5.365.39 require material through Section 5.5.

5.36

Let \(X_1, \ldots, X_9\) be independent identically distributed normal random variables with mean 2 and standard deviation 3. For what constants \(a\) and \(b\) is \[ \frac{\overline{X} - a}{b} \] a standard normal random variable?

5.37

Let \(X_1, \ldots, X_{8}\) be independent normal random variables with mean 2 and standard deviation 3. Show using simulation that \[ \frac{\overline{X} - 2}{S/\sqrt{8}} \] is a \(t\) random variable with 7 degrees of freedom.

5.38

Let \(X_1, \ldots, X_{10}\) and \(Y_1, \ldots, Y_{20}\) be independent normal random variables with means \(\mu_{X_i} = 0, \mu_{Y_i} = 1\) and standard deviations \(\sigma_X = 1\) and \(\sigma_Y = 2\). Show using simulation that \[ \frac{S_X^2/1^2}{S_Y^2/2^2} \] is an \(F\) random variable with 9 and 19 degrees of freedom.

5.39

Prove Theorem 5.4 in the case \(n=2\). That is, suppose that \(X_1\) and \(X_2\) are independent, identically distributed normal random variables. Prove that \[ \frac{(n-1)S^2}{\sigma^2} = \frac 1{\sigma^2} \sum_{i = 1}^2 \bigl(X_i - \overline{X}\bigr)^2 \] is a \(\chi^2\) random variable with one degree of freedom.


Exercises 5.405.44 require material through Section 5.6.

5.40

In this exercise, we show through simulation that the median is an unbiased estimator for the population mean when the population is symmetric.

  1. Use 20 replicates to estimate the expected value of the median of five independent normal random variables with mean 2 and standard deviation 4. Repeat 5 times and write down the 5 values.
  2. Use 1000 replicates to estimate the expected value of the median of five independent normal random variables with mean 2 and standard deviation 4. Repeat 5 times and write down the 5 values.
  3. Use 20000 replicates to estimate the expected value of the median of five independent normal random variables with mean 2 and standard deviation 4. Repeat 5 times and write down the 5 values.
  4. Does it appear that the estimate for the expected value of the median is getting closer to 2 as the number of replicates increases?
5.41

Show through simulation that the median is a biased estimator for the mean of an exponential rv with \(\lambda = 1\). Assume a random sample of size 8.

5.42

Determine via simulation whether \(\frac 1n \sum_{i=1}^n (x_i - \mu)^2\) is a biased estimator for \(\sigma^2\) when \(n = 10\) and \(x_1,\ldots,x_{10}\) is a random sample from a normal rv with mean 1 and variance 9.

5.43

Show through simulation that \(S^2 = \frac {1}{n-1} \sum_{i=1}^n \bigl(x_i - \overline{x}\bigr)^2\) is an unbiased estimator for \(\sigma^2\) when \(x_1,\ldots, x_{10}\) are iid normal rvs with mean 0 and variance 4.

5.44

Show through simulation that \(S = \sqrt{S^2}\), where \(S\) is defined in the previous problem, is a biased estimator for \(\sigma\) when \(x_1,\ldots,x_{10}\) are iid normal random variables with mean 0 and variance 4, i.e. standard deviation 2.

5.45

This exercise walks step by step through the mathematical proof of Proposition 5.4, that \(S^2\) is an unbiased estimator of the population variance \(\sigma^2\). Suppose \(X_1, \dotsc, X_n\) are an independent random sample from a population with mean \(\mu\) and variance \(\sigma^2\).

  1. Use Theorem 3.10 to show that \(\text{Var}(\overline{X}) = \sigma^2/n\).
  2. Use the computational formula for variance, Theorem 3.9, to show that \(E[X_i^2] = \mu^2 + \sigma^2\) and that \(E[\overline{X}^2] = \mu^2 + \sigma^2/n\).
  3. Show that \(E[\sum_i X_i \overline{X}] = n\mu^2 +\sigma^2\)
  4. Show that \(E[\sum_i (X_i - \overline{X})^2] = E[\sum_i X_i^2] -2E[\sum_i X_i \overline{X}] + E[\sum_i \overline{X}^2] = (n-1)\sigma^2\).
  5. Conclude that \(E[S^2] = \sigma^2\).
5.46

Determine whether \(\hat \sigma_1^2 = \frac {1}{n} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2\) or \(\hat \sigma_2^2 = \frac {1}{n + 1/2} \sum_{i=1}^n \left(x_i - \overline{x}\right)^2\) has lower mean squared error when estimating \(\sigma^2\) when \(x_1, \ldots, x_{10}\) are iid normal random variables with mean 0 and variance 4.

5.47 (Requires material covered in vignette or similar)

Deathrolling in World of Warcraft works as follows. Player 1 tosses a 1000 sided die. Say they get \(x_1\). Then player 2 tosses a die with \(x_1\) sides on it. Say they get \(x_2\). Player 1 tosses a die with \(x_2\) sides on it. The player who loses is the player who first rolls a 1.

  1. Estimate the expected total number of rolls before a player loses.
  2. Estimate the probability mass function of the total number of rolls.
  3. Estimate the probability that player 1 wins.

  1. The R function proportions is new to R 4.0.1 and is recommended as a drop-in replacement for the unfortunately named prop.table. Similarly, marginSums is recommended as a drop-in replacement for margin.table.↩︎