Chapter 2 Probability

A primary goal of statistics is to describe the real world based on limited observations. These observations may be influenced by random factors, such as measurement error or environmental conditions. This chapter introduces probability, which is designed to describe random events. Later, we will see that the theory of probability is so powerful that we intentionally introduce randomness into experiments and studies so we can make precise statements from data.

2.1 Probability Basics

In order to learn about probability, we must first develop a vocabulary that we can use to discuss various aspects of it.

Definitions

An experiment is a process that produces an observation.
An outcome is a possible observation.
The set of all possible outcomes is called the sample space.
An event is a subset of the sample space.

Example Roll a die and observe the number of dots on the top face. This is an experiment, with six possible outcomes. The sample space is the set \(S = \{1,2,3,4,5,6\}\). The event “roll higher than 3” is the set \(\{4,5,6\}\).

Example Stop a random person on the street and ask them in what month they were born. This experiment has the twelve months of the year as possible outcomes. An example of an event \(E\) might be that they were born in a summer month, \(E = \{\text{June}, \text{July}, \text{August}\}\).

Example Suppose a traffic light stays red for 90 seconds each cycle. While driving you arrive at this light, and observe the amount of time that you are stopped until the light turns green. The sample space is the interval of real numbers \([0,90]\). The event “you didn’t have to stop” is the set \(\{0\}\).

Since events are, by their very definition, sets, it will be useful for us to review some basic set theory.

Definition 2.1 Let \(A\) and \(B\) be events in a sample space \(S\).

\(A \cap B\) is the set of outcomes that are in both \(A\) and \(B\).
\(A \cup B\) is the set of outcomes that are in either \(A\) or \(B\) (or both).
\(A \setminus B\) is the set of outcomes that are in \(A\) and not in \(B\).
The complement of \(A\) is \(\overline{A} = S \setminus A\). So, \(\overline{A}\) is the set of outcomes that are not in \(A\).
\(A\) and \(B\) are disjoint if \(A \cap B = \emptyset\).
We say that \(A\) is a subset of \(B\), written \(A \subset B\), if every element of \(A\) is also an element of \(B\).

Example Suppose that the sample space \(S\) consists of the positive integers. Let \(A\) be the set of all positive even numbers, and let \(B\) be the set of all prime numbers. Then, \(A = \{2, 4, 6, \ldots\}\) and \(B = \{2, 3, 5, 7, 11, \ldots\}\). Then,

\(A \cap B = \{2\}\)
\(A \cup B = \{2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22, \ldots\}\)
\(B \setminus A\) is the set of odd prime numbers.
\(\overline{A}\) is the set of all positive odd integers.
\(A\) and \(B\) are not disjoint, since both contain the number 2.
\(A\) is not a subset of \(B\) since \(4\) is an element of \(A\), but 4 is not an element of \(B\).
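For finite sets, base R's intersect, union, setdiff, and %in% make it possible to check set computations like these directly. The sketch below truncates the sample space to the integers 1 through 20, an assumption made only so that every set is finite, and the vector of primes is typed in by hand.

```r
# Truncate the sample space to 1..20 so every set is finite (illustrative assumption)
S <- 1:20
A <- seq(2, 20, by = 2)               # even numbers in S
B <- c(2, 3, 5, 7, 11, 13, 17, 19)    # primes in S, entered by hand

intersect(A, B)    # A intersect B: only the number 2
union(A, B)        # A union B (elements of A first, then the new elements of B)
setdiff(B, A)      # B \ A: the odd primes in S
setdiff(S, A)      # complement of A within S: the odd numbers
all(A %in% B)      # is A a subset of B? FALSE, since 4 is not prime
```

The truncation changes nothing about the logic; it only limits how much of each infinite set can be displayed.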

Definition 2.2 Let \(S\) be a sample space. A probability on \(S\) assigns to each event \(E\) a number \(P(E)\) between 0 and 1 (inclusive), so \(0 \leq P(E) \leq 1\), in a way that satisfies the following probability axioms:

The probability of the sample space is 1, \(P(S) = 1\).
The probability of the empty set is 0, \(P(\emptyset) = 0\).
Probabilities are monotonic: if \(A \subset B\), then \(P(A) \le P(B)\).
Probabilities are countably additive: If \(A_1, A_2, \ldots\) are pairwise disjoint, then \[ P\bigl(\cup_{n=1}^\infty A_n\bigr) = \sum_{n= 1}^\infty P(A_n) \]

The four axioms above are not a minimal set of axioms. For example, countable additivity implies both \(P(\emptyset) = 0\) and monotonicity. We will not be concerned in this book about carefully describing all of the subsets of \(S\) which have an associated probability. We will assume that any event of interest will be an event associated with a probability.

There are multiple possible interpretations of a probability. One interpretation, which is the one used in this book, is that if the probability of an event \(E\) is \(p\) and you repeat the experiment many times, then the proportion of times that the event occurs will eventually be close to \(p\). This is the frequentist interpretation. Another interpretation is that the probability of an event measures the degree of certainty that a person has about whether the event occurs or not.

To understand the difference, consider the following thought experiment. Suppose that I am about to toss a coin, and I ask you to estimate the probability that the coin will land on Heads. Not knowing any reason to think otherwise, you might say that you estimate it to be \(p = 0.5\). From the frequentist point of view, that would mean that you believe that if I repeat the experiment infinitely many times, then the proportion of times that it is Heads will converge to 0.5. From the certainty of belief point of view, you believe that each outcome (Heads/Tails) is equally likely.

Now, suppose that I flip the coin and look at it, but don’t tell you whether it is Heads or Tails. At this point, there is nothing random that I can repeat, so it wouldn’t make sense from the first interpretation for you to say that the probability is \(p = 0.5\). However, your degree of certainty about the outcome hasn’t changed; it is still \(p = 0.5\). We bring up this example to illustrate the importance of a random nature to the experiments that we are studying in this section.

Probabilities obey some important rules, which are consequences of the axioms.

Theorem Let \(A\) and \(B\) be events in the sample space \(S\).

If \(A\) and \(B\) are disjoint, then \(P(A \cup B) = P(A) + P(B)\).
\(P(A) = 1 - P(\overline{A})\).
\(P(A \setminus B) = P(A) - P(A \cap B)\).
\(P(A \cup B) = P(A) + P(B) - P(A \cap B)\).

Proof. We sketch the proof of these results. Part 1 is just a special case of probability Axiom 4 with \(A_1 = A\), \(A_2 = B\), and \(A_3,A_4,\dotsc\) all equal to the empty set.

For part 3, we have that \(A = (A \cap B) \cup (A \setminus B)\), where \(A \cap B\) and \(A \setminus B\) are disjoint. We have \(P(A) = P(A \cap B) + P(A \setminus B)\), which gives the result. Part 2 is a special case of part 3.

To prove part 4, we note that \(A \cup B = A \cup (B \setminus A)\), where \(A\) and \(B \setminus A\) are disjoint. Therefore, \(P(A \cup B) = P(A) + P(B \setminus A) = P(A) + P(B) - P(A \cap B)\) by 1 and 3.

One way to assign probabilities to events is empirically, by repeating an experiment many times and observing the proportion of times the event occurs. While this can only approximate the true probability, it is sometimes the only approach possible. For example, in the United States the probability of being born in October is noticeably higher than the probability of being born in January, and these values can only be estimated by observing actual patterns of human births.

Another method is to make an assumption that all outcomes are equally likely, usually because of some physical property of the experiment. For example, because (high quality) dice are close to perfect cubes, one believes that all six sides of a die are equally likely to occur. Using the additivity of disjoint events (rule 4 in the definition of probability),

\[ P(\{1\}) + P(\{2\}) + P(\{3\}) + P(\{4\}) + P(\{5\}) + P(\{6\}) = P(\{1,2,3,4,5,6\}) = 1 \]

Since all six probabilities are equal and sum to 1, the probability of each face occurring is \(1/6\). In this case, the probability of an event \(E\) can be computed by counting the number of elements in \(E\) and dividing by the number of elements in \(S\).

Example 2.1 Suppose that two six-sided dice are rolled and the numbers appearing on the dice are observed.

The sample space \(S\) is given by

\[ \begin{pmatrix} (1,1), (1,2), (1,3), (1,4), (1,5), (1,6) \\ (2,1), (2,2), (2,3), (2,4), (2,5), (2,6) \\ (3,1), (3,2), (3,3), (3,4), (3,5), (3,6) \\ (4,1), (4,2), (4,3), (4,4), (4,5), (4,6) \\ (5,1), (5,2), (5,3), (5,4), (5,5), (5,6) \\ (6,1), (6,2), (6,3), (6,4), (6,5), (6,6) \end{pmatrix} \]

By the symmetry of the dice, we expect all 36 possible outcomes to be equally likely. So the probability of each outcome is \(1/36\).
The event “The sum of the dice is 6” is represented by

\[ E = \{(1,5), (2,4), (3,3), (4,2), (5,1)\} \]

The probability that the sum of two dice is 6 is given by \[ P(E) = \frac{|E|}{|S|} = \frac{5}{36}, \] which can be obtained by simply counting the number of elements in each set above.
Let \(F\) be the event “At least one of the dice is a 2.” This event is represented by

\[ F = \{(2,1), (2,2), (2,3), (2,4), (2,5), (2,6), (1,2), (3,2), (4,2), (5,2), (6,2)\} \]

and the probability of \(F\) is \(P(F) = \frac{11}{36}\).

\(E \cap F = \{(2,4), (4,2)\}\) and \(P(E \cap F) = \frac{2}{36}\).
\(P(E \cup F) = P(E) + P(F) - P(E \cap F) = \frac{5}{36} + \frac{11}{36} - \frac{2}{36} = \frac{14}{36}\).
\(P(\overline{E}) = 1 - P(E) = \frac{31}{36}\).
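The counts in this example can be double-checked by listing the sample space in R. The following is a minimal sketch that uses base R's expand.grid to enumerate all 36 outcomes; the names S, inE, and inF are illustrative choices, not notation used elsewhere in the book.

```r
# Enumerate the 36 equally likely outcomes of rolling two dice
S <- expand.grid(die1 = 1:6, die2 = 1:6)

inE <- S$die1 + S$die2 == 6          # TRUE for outcomes where the sum is 6
inF <- S$die1 == 2 | S$die2 == 2     # TRUE for outcomes with at least one 2

# With equally likely outcomes, P(event) = |event| / |S|
sum(inE) / nrow(S)          # P(E)             = 5/36
sum(inF) / nrow(S)          # P(F)             = 11/36
sum(inE & inF) / nrow(S)    # P(E intersect F) = 2/36
sum(inE | inF) / nrow(S)    # P(E union F)     = 14/36
sum(!inE) / nrow(S)         # P(complement E)  = 31/36
```

Each line divides the number of outcomes in the event by the size of the sample space, which is the counting rule \(P(E) = |E|/|S|\) used above.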

2.2 Simulations

The goal of probability and statistics is to model the real world. Statistical experiments in the real world are usually slow and often expensive. Instead of running real world experiments, it is easier to model these experiments and then use a computer to imitate the results. This process is called simulation.

This book places simulation at the center of the study of probability. With a good understanding of how to simulate experiments, you can answer a wide range of questions involving probability. Later in the book, we will use simulation to explore how statistical methods behave under different assumptions about data. Simulation also plays a fundamental role in modern statistical methods such as resampling, bootstrapping, non-parametric statistics, and genetic algorithms.

2.2.1 Simulation with `sample`

For an experiment with a finite sample space \(S = \{x_1, x_2, \dotsc, x_n\}\), the R command sample() can simulate one or many trials of the experiment. Essentially, sample treats \(S\) as a bag of outcomes, reaches into the bag, and picks one.

The syntax of sample is

sample(x, size, replace = FALSE, prob = NULL)

where:

the parameter x is the vector of elements from which you are sampling.
size is the number of samples you wish to take.
replace determines whether you are sampling with replacement or not. Sampling without replacement means that sample will not pick the same value twice, and this is the default behavior. Pass replace = TRUE to sample if you wish to sample with replacement.
prob is a vector of probabilities or weights associated with x. It should be a vector of nonnegative numbers of the same length as x. If the sum of prob is not 1, it will be normalized. If this value is not provided, then each element of x is considered to be equally likely.

The most straightforward use of sample is to choose one element of a vector “at random”. When people say “at random”, they usually mean that all outcomes are equally likely to be chosen, and that is how sample operates. To get a random number from 1 to 10:

sample(x = 1:10, size = 1)

## [1] 2

The size argument tells sample how many random numbers you want:

sample(x = 1:10, size = 8)

## [1]  9 10  1  5  6  3  7  2

Observe that the eight numbers chosen are all different. By default, sample chooses different numbers every time. If you ask it for more than ten different numbers from 1 to 10, you get an error:

sample(x = 1:10, size = 30)

## Error in sample.int(length(x), size, replace, prob): cannot take a sample larger than the population when 'replace = FALSE'

The replace argument of sample determines whether sample is allowed to repeat values. The name “replace” comes from the model that sample has a bag of outcomes and is reaching into the bag to draw one. When replace = FALSE, the default, once sample chooses a value from the bag of outcomes it won’t replace it into the bag. With replace = TRUE, sample draws an outcome from the bag, records it, and then puts it back into the bag.

Here we set replace = TRUE to get 20 random numbers from 1 to 10:

sample(x = 1:10, size = 20, replace = TRUE)

##  [1] 10  5 10  5  9  6 10  5 10 10  3  1 10  6  8  7  6  2  9 10

For sample and many other functions in R, you are not required to name the arguments with x = ... or size = ... as long as these come first and second in the function. For example:

sample(1:10, 20, replace = TRUE)

##  [1] 10  4  8  8  4  4  8  6  5  3  8  3  2  4  8  1  2 10  1  6

However, it is often clearer to explicitly name the arguments to complicated functions like sample. Use your best judgment, and include the parameter name if there is any doubt.

The prob argument of sample allows for sampling when outcomes are not all equally likely. In the United States, human blood comes in four types: O, A, B, and AB. These types occur with the following probability distribution: \[ \begin{array}{lcccc} \text{Type} & O & A & B & AB\\ \text{Probability} & 0.45 & 0.40 & 0.11 & 0.04 \end{array} \] We can sample thirty blood types from this distribution by defining a vector of blood types and a vector of their probabilities:

bloodtypes <- c("O", "A", "B", "AB")
bloodprobs <- c(0.45, 0.40, 0.11, 0.04)
sample(x = bloodtypes, size = 30, prob = bloodprobs, replace = TRUE)

##  [1] "A"  "O"  "AB" "A"  "O"  "A"  "A"  "O"  "O"  "O"  "O"  "A"  "O"  "A"  "A" 
## [16] "A"  "B"  "A"  "A"  "B"  "A"  "O"  "O"  "A"  "A"  "O"  "O"  "A"  "A"  "O"

Observe that a large sample reproduces the original probabilities with reasonable accuracy:

sim_data <- sample(x = bloodtypes, size = 10000, prob = bloodprobs, replace = TRUE)
table(sim_data)

## sim_data
##    A   AB    B    O 
## 3998  425 1076 4501

table(sim_data)/10000

## sim_data
##      A     AB      B      O 
## 0.3998 0.0425 0.1076 0.4501
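The outputs shown in this section come from random draws, so running the same command again will generally give different values. One common way to make a simulation repeatable, sketched here as an aside, is base R's set.seed, which fixes the state of the random number generator. The seed value 2024 and the variable names below are arbitrary choices for illustration.

```r
set.seed(2024)                        # the seed value 2024 is arbitrary
first <- sample(x = 1:10, size = 5)

set.seed(2024)                        # reset to the same seed
second <- sample(x = 1:10, size = 5)

identical(first, second)              # TRUE: same seed, same "random" draws
```

Without the calls to set.seed, the two samples would almost always differ.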

2.2.2 Using simulation to compute probabilities

The goal of simulation is usually to compute the probability of an event. This is a three step process:

Simulate the experiment many times to produce a vector of outcomes.
Test if the outcomes are in the event to produce a vector of TRUE/FALSE.
Compute the mean of the TRUE/FALSE vector to compute the probability estimate.

Steps 1 and 2 can often be interesting problems. Step 3 relies on the fact that R converts TRUE to 1 and FALSE to 0 when taking the mean of a vector. If we take the average of a vector of TRUE/FALSE values, we get the number of TRUE divided by the size of the vector, which is exactly the proportion of times that the event occurred. Alternately, one could use table or sum to count the TRUE values and then divide by the size of the vector by hand.
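As a small illustration of step 3, consider a hand-made TRUE/FALSE vector (the values are invented for this sketch and do not come from a simulation):

```r
# Stand-in for "did the event occur?" on each of 8 trials (made-up values)
occurred <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

mean(occurred)                      # TRUE counts as 1, FALSE as 0: 3/8 = 0.375
sum(occurred) / length(occurred)    # the same proportion, computed by hand
table(occurred)                     # counts of FALSE and TRUE
```

All three lines report the same information; mean is simply the most compact.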

We illustrate this process by reworking Example 2.1 using simulation.

Example 2.2 Suppose that two six-sided dice are rolled and the numbers appearing on the dice are added.

Simulate this experiment by performing 10000 rolls of each die with sample and then adding the two dice:

die1 <- sample(x = 1:6, size = 10000, replace = TRUE)
die2 <- sample(x = 1:6, size = 10000, replace = TRUE)
sumDice <- die1 + die2

Let’s take a look at the simulated data:

head(die1)

## [1] 1 4 1 2 5 3

head(die2)

## [1] 1 6 1 4 1 3

head(sumDice)

## [1]  2 10  2  6  6  6

Let \(E\) be the event “the sum of the dice is 6”, and \(F\) be the event “At least one of the dice is a 2”. We define these events from our simulated data:

eventE <- sumDice == 6
head(eventE)

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE

eventF <- die1 == 2 | die2 == 2
head(eventF)

## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

Here, \(F\) is interpreted as “die 1 is a two or die 2 is a two” and uses R’s “or” operator |.

From theory, \(P(E) = \frac{5}{36} \approx 0.139\), and \(P(F) = \frac{11}{36} \approx 0.306\). Using mean we find the proportion of the time our events occurred in the simulation, which estimates the true probabilities:

mean(eventE)   # P(E)

## [1] 0.1409

mean(eventF)  # P(F)

## [1] 0.2998

To estimate \(P(E \cap F) = \frac{2}{36} \approx 0.056\) we use R’s “and” operator &:

mean(eventE & eventF)

## [1] 0.0587

It is not necessary to store the TRUE/FALSE vectors in event variables. Here is an estimate of \(P(E \cup F) = \frac{14}{36} \approx 0.389\):

mean((sumDice == 6) | (die1 == 2 | die2 == 2))

## [1] 0.382

2.2.3 Using `replicate` to repeat experiments

The size argument to sample allowed us to perform many repetitions of an experiment. For more complicated statistical experiments we use the R function replicate, which can take a single R expression and repeat it many times.

The function replicate is an example of an implicit loop in R. Suppose that expr is one or more R commands, the last of which returns a single value. The call

replicate(n, expr)

repeats the expression stored in expr n times, and stores the resulting values as a vector.

Example

Estimate the probability that the sum of seven dice is larger than 30.

To simulate this event once, we can use sample to roll seven dice, sum to add them, and then test for the event “the sum is larger than 30”:

dice <- sample(x = 1:6, size = 7, replace = TRUE) # roll seven dice
sum(dice) > 30 # test if the event occurred

## [1] FALSE

The result of this single simulation was FALSE. Using replicate repeats the experiment many times:

replicate(20, {
  dice <- sample(x = 1:6, size = 7, replace = TRUE) # roll seven dice
  sum(dice) > 30 # test if the event occurred
})

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

The curly braces { and } are required to replicate more than one R command. When you put multiple commands inside {} you are creating a code block. A code block acts like a single statement, and only the result of the last command is saved in the vector via replicate.

Using multiple lines for a code block is not required, but highly recommended for readability. It is also legal to put the entire code block on a single line and separate each command in the block with a semicolon.
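For comparison, here is the same seven dice experiment written as a one-line code block with a semicolon between the two commands. The variable name one_line is only for illustration.

```r
# Same experiment as above, with both commands on one line separated by ;
one_line <- replicate(20, { dice <- sample(x = 1:6, size = 7, replace = TRUE); sum(dice) > 30 })
one_line
```

This runs, but the multi-line version is easier to read and to debug.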

Finally, we want to compute the probability of the event. We replicate 10000 times for a reasonably accurate estimate:

event <- replicate(10000, {
  dice <- sample(x = 1:6, size = 7, replace = TRUE) # roll seven dice
  sum(dice) > 30 # test if the event occurred
})
mean(event)

## [1] 0.0963

In this simulation, the estimated probability that the sum of seven dice is larger than 30 is about 9.63%. How accurate is our estimate? It is often a good idea to repeat a simulation a couple of times to get an idea about how much the results vary. Running the code a few more times gave answers 0.0947, 0.091, 0.0965, and 0.0867. It seems safe to report the answer as roughly 9%.

The more replications you perform with replicate(), the more accurate you can expect your simulation to be. On the other hand, replications can be slow. For events which are not rare, 10000 trials runs quickly and gives an answer accurate to about two decimal places.
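One rough way to quantify this accuracy, offered as a sketch rather than something derived in this chapter, is the standard error of a proportion: an estimate of a probability \(p\) from \(n\) independent trials typically differs from \(p\) by about \(\sqrt{p(1-p)/n}\). The helper name sim_se is made up for illustration.

```r
# Approximate standard error of a simulated probability estimate.
# Assumes independent trials; sim_se is a hypothetical helper name.
sim_se <- function(p, n) sqrt(p * (1 - p) / n)

sim_se(p = 0.0963, n = 10000)   # about 0.003 for the seven dice estimate
sim_se(p = 0.0963, n = 100)     # about 0.03: 100 trials is much rougher
sim_se(p = 0.5, n = 10000)      # 0.005, the largest possible value when n = 10000
```

Under these assumptions, 10000 trials gives a typical error of well under 0.01, and 100 times as many trials are needed for each additional decimal place.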

For complicated simulations, we strongly recommend that you follow the workflow as presented above; namely,

Write code that performs the experiment a single time.
Replicate the experiment a small number of times and check the results:

replicate(100, { EXPERIMENT GOES HERE })
Replicate the experiment a large number of times and store the result:

event <- replicate(10000, { EXPERIMENT GOES HERE })
Compute probability using mean(event).

It is much easier to trouble-shoot your code this way, as you can test each line of your simulation separately.

Example Three dice are thrown. Estimate the probability that the largest value is a 4.

Here is one trial:

die_roll <- sample(1:6, 3, TRUE)
max(die_roll) == 4

## [1] TRUE

Here are a few trials, and we observe that sometimes the event occurs and sometimes it does not:

replicate(20, {
  die_roll <- sample(1:6, 3, TRUE)
  max(die_roll) == 4
})

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE

Finally, perform many trials and compute the probability of the event:

event <- replicate(10000, {
  die_roll <- sample(1:6, 3, TRUE)
  max(die_roll) == 4
})
mean(event)

## [1] 0.1737

When three dice are thrown, the probability that the largest value is 4 is approximately 17%.
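Because three dice have only \(6^3 = 216\) equally likely outcomes, the simulated value can be compared with an exact count. The sketch below enumerates the outcomes with expand.grid and uses pmax for the row-wise maximum; the variable names are illustrative.

```r
# Enumerate all 6^3 = 216 equally likely outcomes for three dice
rolls <- expand.grid(d1 = 1:6, d2 = 1:6, d3 = 1:6)

# Largest value in each outcome (pmax takes the elementwise maximum)
largest <- pmax(rolls$d1, rolls$d2, rolls$d3)

sum(largest == 4) / nrow(rolls)   # exact probability, 37/216
(4^3 - 3^3) / 6^3                 # same count: all dice <= 4, minus all dice <= 3
```

The exact value \(37/216 \approx 0.171\) agrees with the simulation estimate to about two decimal places.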

Example

A fair coin is repeatedly tossed. Estimate the probability that you observe Heads for the third time on the 10th toss.

In this example, an outcome of the experiment is ten tosses of the coin. The event “you observe Heads for the third time on the 10th toss” is complicated, and most of the work involves testing whether that event occurred.

As before, we build this up in stages. Begin by simulating an outcome, a sample of ten tosses of a coin:

coinToss <- sample(c("H", "T"), 10, replace = TRUE)
coinToss

##  [1] "H" "T" "T" "H" "H" "H" "H" "T" "H" "H"

In order for the event to occur, we need for there to be exactly 3 heads, so we count the number of heads and check whether it is equal to 3:

sum(coinToss == "H")

## [1] 7

sum(coinToss == "H") == 3

## [1] FALSE

Next, we also need to make sure that we had only two heads in the first nine tosses. So, we look only at the first nine tosses:

coinToss[1:9]

## [1] "H" "T" "T" "H" "H" "H" "H" "T" "H"

and add up the heads observed in the first nine tosses:

sum(coinToss[1:9] == "H") == 2

## [1] FALSE

Note that both of those have to be true in order for the event to occur:

sum(coinToss == "H") == 3 & sum(coinToss[1:9] == "H") == 2

## [1] FALSE

We put this inside replicate and compute the probability:

event <- replicate(10000, {
  coinToss <- sample(c("H", "T"), 10, replace = TRUE) 
  (sum(coinToss == "H") == 3) & (sum(coinToss[1:9] == "H") == 2)
})

mean(event)

## [1] 0.033

The probability of observing Heads for the third time on the 10th toss is about 3%.

The test that there were 3 total heads but only 2 on the first nine tosses is not the only way to approach this problem. Here are two other tests that give the same result by looking at the tenth coin toss:

sum(coinToss == "H") == 3 & coinToss[10] == "H"

## [1] FALSE

sum(coinToss[1:9] == "H") == 2  & coinToss[10] == "H"

## [1] FALSE
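The last test suggests an exact calculation to compare against: the event occurs exactly when the first nine tosses contain two heads and the tenth toss is a head. The sketch below counts those sequences with choose. It also uses dbinom, a base R function for binomial probabilities that has not been introduced at this point in the book, as a cross-check.

```r
# Number of ways to place 2 heads among the first 9 tosses
ways <- choose(9, 2)      # 36

# Each specific sequence of 10 fair tosses has probability (1/2)^10
ways * (1/2)^10           # 36/1024, about 0.035

# Equivalent: P(exactly 2 heads in 9 tosses) times P(heads on toss 10)
dbinom(2, size = 9, prob = 0.5) * 0.5
```

Both lines give \(36/1024 \approx 0.035\), consistent with the simulated estimate.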

Example (The Birthday Problem)

Estimate the probability that out of 25 randomly selected people, at least two will have the same birthday. Assume that all birthdays are equally likely, except that none are leap day babies.

An outcome of this experiment is 25 randomly selected birthdays, and the event \(B\) is that at least two birthdays are the same. Simulating the experiment is straightforward, with sample:

birthdays <- sample(x = 1:365, size = 25, replace = TRUE)
birthdays

##  [1] 324 167 129 299 270 187 307  85 277 362 330 263 329  79 213  37 105 217 165
## [20] 290 362  89 289 340 326

Next, we test for the event \(B\). In order to do this, we need to be able to find any duplicates in a vector. R has many, many functions that can be used with vectors. For most things that you want to do, there will be an R function that does it. In this case it is anyDuplicated(), which returns the location of the first duplicate if there are any, and zero otherwise. The important thing to learn here isn’t necessarily this particular function, but rather the fact that most tasks are possible via some built in functionality.

anyDuplicated(birthdays)

## [1] 21

It happens that our sample did have two of the same birthday. At location 21 of the birthday vector, the number 362 appears. That same number showed up earlier in location 10. The event \(B\) occurred in this example, and we can test for it by checking that the result of anyDuplicated is larger than 0. Putting it all together:

eventB <- replicate(n = 10000, {
  birthdays <- sample(x = 1:365, size = 25, replace = TRUE)
  anyDuplicated(birthdays) > 0
})
mean(eventB)

## [1] 0.5631

The probability of this event is approximately 0.56. Interestingly, we see that it is actually quite likely that a group of 25 people will contain two with the same birthday.
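Under the same assumption of 365 equally likely birthdays, the estimate can be checked against an exact value by computing the probability of the complement, that all 25 birthdays are different. This is a sketch; the name p_all_different is illustrative.

```r
# P(no shared birthday among 25 people), assuming 365 equally likely days:
# the k-th person must avoid the k - 1 birthdays already taken
p_all_different <- prod((365 - 0:24) / 365)

1 - p_all_different   # P(at least two share a birthday), about 0.569
```

The complement rule \(P(B) = 1 - P(\overline{B})\) is what makes this calculation short.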

Challenge Modify the above code to take into account leap years. Is it reasonable to believe that birthdays throughout the year are equally likely? Later in this book, we will use birthday data to get a better approximation of the probability that out of 25 randomly selected people, no two will have the same birthday.

Example

Three numbers are picked uniformly at random from the interval \((0,1)\). What is the probability that a triangle can be formed whose side-lengths are the three numbers that you chose?

Solution: We need to be able to simulate picking three numbers at random from the interval \((0, 1)\). We will see later in the book that the way to do this is via runif(3, 0, 1), which returns 3 numbers randomly chosen between 0 and 1. We then need to check whether the sum of the two smaller numbers is larger than the largest number. We use the sort command to sort the three numbers into increasing order, as follows:

event <- replicate(10000, {
  x <- sort(runif(3, 0, 1))
  sum(x[1:2]) > x[3]
})
mean(event)

## [1] 0.4979

Example According to Rick Wicklin, the proportion of M&M’s of various colors produced in the New Jersey M&M factory are as follows.

Color	Proportion (%)
Blue	25.0
Orange	25.0
Green	12.5
Yellow	12.5
Red	12.5
Brown	12.5

If you buy a bag from the New Jersey factory that contains 35 M&M’s, what is the probability that it will contain exactly 9 Blue and 5 Red M&M’s?

To do this, we can use the prob argument in the sample function, as follows.

mm.colors <- c("Blue", "Orange" ,"Green", "Yellow", "Red", "Brown")
mm.probs <- c(25, 25, 12.5, 12.5, 12.5, 12.5) # percentages; sample normalizes the weights
bag <- sample(x = mm.colors,
              size = 35, 
              replace = TRUE, 
              prob = mm.probs)
sum(bag == "Blue") #counts the number of Blue M&M's

## [1] 11

event <- replicate(10000, {
  bag <- sample(x = mm.colors, 
              size = 35, 
              replace = TRUE, 
              prob = mm.probs)
  sum(bag == "Blue") == 9 & sum(bag == "Red") == 5
})
mean(event)

## [1] 0.0291
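If each M&M in the bag is treated as an independent draw, which is the same assumption the simulation makes, the exact probability follows from the multinomial counting formula. The sketch below groups the colors into Blue, Red, and everything else, and uses base R's dmultinom, a function not otherwise introduced in this chapter.

```r
# Group the colors as Blue, Red, and "everything else"
counts <- c(Blue = 9, Red = 5, Other = 35 - 9 - 5)
probs  <- c(Blue = 0.25, Red = 0.125, Other = 1 - 0.25 - 0.125)

# Multinomial probability of exactly these counts in 35 independent draws
dmultinom(x = counts, prob = probs)

# The same value from the counting formula: choose positions for the 9 Blue,
# then positions for the 5 Red among the remaining 26
choose(35, 9) * choose(26, 5) * 0.25^9 * 0.125^5 * 0.625^21
```

The two lines should agree with each other and land close to the simulated estimate.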

2.3 Conditional Probability and Independence

Sometimes when considering multiple events, we have information that one of the events has occurred. This new information requires us to reconsider the probability that the other event occurs. For example, suppose that you roll two dice and one of them falls off of the table where you cannot see it, while the other one shows a 4. We would want to update the probabilities associated with the sum of the two dice based on this information. The new probability that the sum of the dice is 2 would be 0, the new probability that the sum of the dice is 5 would be 1/6 because that is just the probability that the die that we cannot see is a “1,” and the new probability that the sum of the dice is 7 would also be 1/6 (which is the same as its original probability).

Formally, we have the following definition.

Definition 2.3 Let \(A\) and \(B\) be events in the sample space \(S\), with \(P(B) \not= 0\). The conditional probability of \(A\) given \(B\) is \[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]

We read \(P(A|B)\) as “the probability of \(A\) given \(B\).” In R, the \(|\) symbol denotes the or operator, whereas in mathematical notation the probability of \(A\) or \(B\) is written \(P(A \cup B)\), and the vertical bar in \(P(A|B)\) means “given.” It is important to keep these distinct. We also note that \(P(A|B)\) does not mean the probability of the event \(A|B\), as there is no event in the sample space \(S\) that corresponds to the “event” \(A|B\). Keep in mind that the fixed idiom \(P(A|B)\) means the probability of \(A\) given \(B\), that is, the probability that \(A\) occurs given that \(B\) occurs.

The general process of assuming that \(B\) occurs and making computations under that assumption is called conditioning on \(B\). Note that in order to condition on \(B\) in the definition of \(P(A|B)\), we must assume that \(P(B)\not= 0\), since otherwise we would get \(\frac {0}{0}\), which is undefined. This also makes some intuitive sense. If we assume that a probability zero event occurs, then the probability of further events conditioned on that would need to be undefined.

Example Two dice are rolled. What is the probability that both dice are 4, given that the sum of two dice is 8?

Solution: Let \(A\) be the event “both dice are 4” and \(B\) be the event “the sum is 8”. Then \[ P(A|B) = P(A \cap B)/P(B) = \frac{1/36}{5/36} = 1/5. \] Rolling two 4’s is the hardest way to get an 8. Check that the probability of rolling one three and one five is 2/5, and also for one two and one six.

With conditional probability the order of the two events is important. Suppose we reverse the order, and ask: “What is the probability that the sum of two dice is 8, given that both dice are 4?”. Now the answer is 1, because if both dice are 4 then the sum is certainly 8. Formally, with the events defined as in the previous paragraph: \[ P(B|A) = P(B \cap A)/P(A) = \frac{1/36}{1/36} = 1. \]
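Conditional probabilities can also be estimated by simulation, using the techniques of Section 2.2. The sketch below estimates \(P(A|B)\) for the two dice example in two equivalent ways; the variable names are illustrative.

```r
die1 <- sample(x = 1:6, size = 10000, replace = TRUE)
die2 <- sample(x = 1:6, size = 10000, replace = TRUE)

eventA <- die1 == 4 & die2 == 4     # both dice are 4
eventB <- die1 + die2 == 8          # the sum is 8

# Definition: P(A|B) = P(A and B) / P(B)
mean(eventA & eventB) / mean(eventB)

# Equivalent: keep only the trials where B occurred, then see how often A occurred
mean(eventA[eventB])
```

Both estimates should be near the exact value \(1/5\). Because only about 14% of the trials satisfy \(B\), the estimate is based on far fewer than 10000 trials and is correspondingly less accurate.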

Example Show that:

\(P((A \cap B)|B) = P(A|B)\).
\(P(A \cup B|B) = 1\).

This example will prove statement 1. Statement 2 is left as exercise 26.

In words, statement 1 says that the probability of \(A\) and \(B\) given \(B\) is the probability of \(A\) given \(B\). Arguing informally, assume that we know that \(B\) occurs. Then the probability that both \(A\) and \(B\) occur is just the probability that \(A\) occurs. Using set notation:

\[\begin{align*} P((A \cap B)|B) &= P((A \cap B) \cap B)/P(B)\\ &= P(A\cap (B \cap B))/P(B)\\ &= P(A \cap B)/P(B)\\ &= P(A|B). \end{align*}\]

We included parentheses around \((A \cap B)\) above, but we did not need to. Remember, there is no event that is associated with \(B|B\), so the only possible interpretation of \(P(A \cap B | B)\) is \(P((A \cap B)|B)\).

2.3.1 Independent Events

We have seen examples where the probability of \(A\) given \(B\) can be larger than \(P(A)\), smaller than \(P(A)\) or equal to \(P(A)\). Of particular interest are pairs of events \(A\) and \(B\) such that knowledge that one of the events occurs does not impact the probability that the other event occurs.

Definition 2.4 Two events are said to be independent if knowledge that one event occurs does not give any probabilistic information as to whether the other event occurs. Formally, we say that \(A\) and \(B\) are independent if \(P(A \cap B) = P(A) P(B)\).

It is not immediately clear why the formal statement in the definition of independence implies the intuitive statement that “the knowledge that one event occurs does not give any probabilistic information as to whether the other event occurs.” To see that, we assume \(P(B) \not= 0\) and compute:

\[ P(A|B) = P(A \cap B)/P(B) = \frac{P(A)P(B)}{P(B)} = P(A) \]

A similar computation shows that if \(P(A) \not= 0\), then \(P(B|A) = P(B)\), which proves the following theorem.

Theorem 2.1 Let \(A\) and \(B\) be events with non-zero probability in the sample space \(S\). The following are equivalent:

\(A\) and \(B\) are independent.
\(P(A \cap B) = P(A)P(B)\).
\(P(A|B) = P(A)\).
\(P(B|A) = P(B)\).

Events \(A\) and \(B\) are said to be dependent if they are not independent.

Example 2.3 Two dice are rolled. Let \(A\) be the event “The first die is a 5”, let \(B\) be the event “The sum of the dice is 7”, and let \(C\) be the event “The sum of the dice is 8.” Show that:

\(A\) and \(B\) are independent.
\(A\) and \(C\) are dependent.
\(B\) and \(C\) are dependent.

For part 1, compute \(P(B) = 6/36 = 1/6\). Now, \(P(B|A) = P(A \cap B)/P(A) = \frac{1/36}{1/6} = 1/6\). Therefore, \(A\) and \(B\) are independent.

For part 2, \(P(C) = 5/36\), which is not the same as \(P(C|A) = \frac{1/36}{1/6} = 1/6\). Therefore, \(A\) and \(C\) are not independent.

Finally, \(P(B \cap C) = 0 \not= P(B) P(C)\) so \(B\) and \(C\) are not independent.
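These three conclusions can be verified by enumerating the 36 outcomes and testing the product rule \(P(A \cap B) = P(A)P(B)\) directly. In the sketch below, the helper p and the names evA, evB, and evC are hypothetical, and all.equal is used so that the comparison tolerates floating point rounding.

```r
# All 36 equally likely outcomes for two dice
S <- expand.grid(die1 = 1:6, die2 = 1:6)
p <- function(event) sum(event) / nrow(S)   # hypothetical helper: exact probability

evA <- S$die1 == 5               # the first die is a 5
evB <- S$die1 + S$die2 == 7      # the sum is 7
evC <- S$die1 + S$die2 == 8      # the sum is 8

# Independent exactly when P(X and Y) equals P(X) * P(Y)
isTRUE(all.equal(p(evA & evB), p(evA) * p(evB)))   # TRUE:  A and B are independent
isTRUE(all.equal(p(evA & evC), p(evA) * p(evC)))   # FALSE: A and C are dependent
isTRUE(all.equal(p(evB & evC), p(evB) * p(evC)))   # FALSE: B and C are dependent
```

Events \(B\) and \(C\) are disjoint with positive probabilities, which is why they cannot be independent.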

Usually independence will be an assumption that we make about events, rather than a property of events that we prove. It is important to develop an intuition about independence in order to determine whether such assumptions are reasonable. Consider the following scenarios, and determine whether the events indicated are most likely dependent or independent.

1. A day in the last 365 days is randomly selected, and \(A\) is the event that the high temperature in St. Louis, Missouri on that day was greater than 90 degrees, while \(B\) is the event that the high temperature on that same day in Cape Town, South Africa was greater than 90 degrees.
2. Two coins are flipped, and \(A\) is the event that the first coin lands on Heads, while \(B\) is the event that the second coin lands on Heads.
3. Six patients are given a tuberculosis skin test, which requires a professional to estimate the size of the reaction to the tuberculin agent. Two professionals, Alice and Bob, are randomly chosen. Let \(A\) be the event that Alice estimates the size of the reaction in each of patients 1-5 to be larger than Bob does. Let \(B\) be the event that Alice estimates the size of the reaction in patient 6 to be larger than Bob does.

Discussion: in scenario 1, the events are dependent. If we know that the high temperature in St. Louis was greater than 90 degrees, then the day was most likely a day in June, July, August or September, which gives us probabilistic information about whether the high temperature in Cape Town was greater than 90 degrees on that day. (In this case, it means that it was very unlikely, since that is winter in the southern hemisphere.)

In scenario 2, the events are independent, or at least approximately so. One could argue that knowing that one of the coins is Heads means the person tossing the coins might be more likely to obtain Heads when tossing coins. However, experience suggests that any such effect is so weak that independence is a reasonable assumption.

In scenario 3, it may be inadvisable to assume that the events are independent. Of course, they may be. It could be that Alice and Bob are well-trained, and there is no bias in their measurements. However, it is also possible that there is something systematic about how they measure the reactions so that one of them usually measures it as larger than the other one does. Knowing \(A\) may be an indication that Alice does systematically measure reactions as larger than Bob does. (Of course, it would also be interesting to know which one was closer to the true value, but that is not what we are worried about at this point.) Later, we will develop tools that will allow us to make a more quantitative statement about this type of problem.

Example In a certain hotel near the US/Canada border, 70% of hotel guests are American and 30% are Canadian. It is known that 40% of Americans wear white socks, while 20% of Canadians wear white socks. Suppose you randomly select a person and observe that they are wearing white socks. What is the probability that the person is Canadian?

To solve this, let’s first set notation. Let \(A\) be the event that a randomly selected person is Canadian. Let \(B\) denote the event that a randomly selected person is wearing white socks. We are asked to find \(P(A|B)\). The problem gives us that \(P(A) = 0.3\), \(P(B|A) = 0.2\), and \(P(B|\overline{A}) = 0.4\), but is silent on the crucial issue of \(P(A|B)\). In order to do this problem, we need a new technique given in the following theorem.

Theorem 2.2 (Bayes’ Rule) Let \(A\) and \(B\) be events in the sample space \(S\). \[ P(B|A) = \frac{P(A|B)P(B)}{P(A)} = \frac{P(A|B)P(B)}{P(A|B)P(B) + P(A|\overline{B})P(\overline{B})} \]

To finish this problem, then, we apply Bayes’ rule and see that \[\begin{align*} P(A|B) &= \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\overline{A})P(\overline{A})}\\ &= \frac{0.2 \times 0.3}{0.2 \times 0.3 + 0.4 \times 0.7} \approx 0.176 \end{align*}\]
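The arithmetic in Bayes’ rule is easy to mistype, so it can help to carry it out in R with named quantities. This is only a sketch of the computation above; the variable names are illustrative and do not come from the text.

```r
# Quantities given in the problem statement.
p_canadian <- 0.3               # P(A)
p_american <- 0.7               # P(not A)
p_white_given_canadian <- 0.2   # P(B | A)
p_white_given_american <- 0.4   # P(B | not A)

# Denominator of Bayes' rule: the overall probability of white socks, P(B).
p_white <- p_white_given_canadian * p_canadian +
  p_white_given_american * p_american

# Bayes' rule: P(A | B) = P(B | A) P(A) / P(B).
p_canadian_given_white <- p_white_given_canadian * p_canadian / p_white
round(p_canadian_given_white, 3)   # 0.176
```

Note that observing white socks lowers the probability that the guest is Canadian from 0.3 to about 0.176, because white socks are more common among the American guests.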

Example The name “Mary” was given to 7065 girls in 1880, and to 11475 girls in 1980. There were 97583 girls born in 1880, and 177907 girls born in 1980. Suppose that a randomly selected girl born in 1880 or 1980 is chosen. What is the probability that the girl’s name is “Mary”?

To solve this, let’s let \(A\) be the event that the randomly selected girl’s name is Mary. If we knew what year the girl was born in, then we would have a good idea what to do. We don’t, so we condition on the birth year. Let \(B\) be the event that the randomly selected girl was born in 1880. We need a new fact.

Theorem 2.3 (Law of Total Probability) Let \(A\) and \(B\) be events in the sample space \(S\). \[ P(A) = P(A \cap B) + P(A \cap \overline{B}) = P(A|B)P(B) + P(A|\overline{B})P(\overline{B}) \]

Continuing with the example, we have \[\begin{align*} P(A) &= P(A|B)P(B) + P(A|\overline{B})P(\overline{B}) \\ &= \frac{7065}{97583} \frac{97583}{97583 + 177907} + \frac{11475}{177907} \frac{177907}{97583 + 177907}\\ &= \frac{7065 + 11475}{97583 + 177907} \approx 0.0673 \end{align*}\]
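The same computation can be written in R by storing the counts for the two birth years in named vectors. This is a sketch with illustrative variable names; it also checks that conditioning on the birth year gives the same result as counting all girls named Mary directly.

```r
# Counts from the example, indexed by birth year.
mary  <- c("1880" = 7065,  "1980" = 11475)    # girls named Mary
girls <- c("1880" = 97583, "1980" = 177907)   # all girls born that year

p_year <- girls / sum(girls)         # P(B) and P(not B)
p_mary_given_year <- mary / girls    # P(A | B) and P(A | not B)

# Law of total probability: sum of P(A | year) * P(year) over both years.
p_mary <- sum(p_mary_given_year * p_year)
p_mary

# Direct count: all girls named Mary divided by all girls.
p_mary_direct <- sum(mary) / sum(girls)
p_mary_direct
```

The two results agree because the factors of 97583 and 177907 cancel in each term of the sum.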

2.3.2 Simulating Conditional Probability

Simulating conditional probabilities is challenging. In order to estimate \(P(A|B)\), we will estimate \(P(A \cap B)\) and \(P(B)\), and then divide the two answers. This is not the most efficient or best way to estimate \(P(A|B)\), but it is easy to do with the tools that we already have developed.

Example

Two dice are rolled. Estimate the conditional probability that the sum of the dice is at least 10, given that at least one of the dice is a 6.

First, we estimate the probability that the sum of the dice is at least 10 and at least one of the dice is a 6.

eventAB <- replicate(10000, {
  die_roll <- sample(1:6, 2, replace = TRUE)
  (sum(die_roll) >= 10) && (6 %in% die_roll)
})
probAB <- mean(eventAB)

Next, we estimate the probability that at least one of the dice is a 6.

eventB <- replicate(10000, {
  die_roll <- sample(1:6, 2, replace = TRUE)
  6 %in% die_roll
})
probB <- mean(eventB)

Finally, we take the quotient.

probAB/probB

## [1] 0.4752148

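The exact answer is \(P(A\cap B)/P(B) = \frac{5/36}{11/36} = 5/11 \approx 0.4545\). The estimate above differs from it by about 0.02, a reminder that a quotient of two simulated estimates carries the random error of both.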
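One alternative, sketched here under the assumption that logical indexing of vectors is available to the reader, is to simulate the rolls once and then restrict attention to the rolls where \(B\) occurred. The variable names are illustrative, and this is not necessarily the approach the text has in mind as the best one.

```r
# Simulate 10000 rolls of two dice; each column of `rolls` is one roll.
rolls <- replicate(10000, sample(1:6, 2, replace = TRUE))

sum_at_least_10 <- colSums(rolls) >= 10        # event A, one value per roll
has_six <- rolls[1, ] == 6 | rolls[2, ] == 6   # event B, one value per roll

# Keep only the rolls where B occurred, then find the proportion
# of those rolls in which A also occurred.
prob_A_given_B <- mean(sum_at_least_10[has_six])
prob_A_given_B
```

This is the same as computing mean(sum_at_least_10 & has_six) / mean(has_six) on a single set of simulated rolls, so the numerator and denominator share the same randomness instead of coming from two separate simulations.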

2.4 Counting Arguments

Given a sample space \(S\) consisting of equally likely simple events, and an event \(E\), recall that \(P(E) = \frac{|E|}{|S|}\). For this reason, it can be useful to be able to carefully enumerate the elements in a set. While an interesting topic, this is not a point of emphasis of this book, as (1) we assume that students have seen some basic counting arguments in the past and (2) we emphasize simulation techniques.

This text will only work with two counting rules:

Proposition 2.1 (Rule of product) If there are \(m\) ways to do something, and for each of those \(m\) ways there are \(n\) ways to do another thing, then there are \(m \times n\) ways to do both things.

Proposition 2.2 (Combinations) The number of ways of choosing \(k\) distinct objects from a set of \(n\) is given by \[ {\binom{n}{k}} = \frac{n!}{k!(n - k)!} \]

The R command for computing \({\binom{10}{3}}\) is choose(10,3).
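As a quick check of the formula in Proposition 2.2, the sketch below compares choose with the factorial expression and with an explicit listing of the subsets. It assumes the base R functions factorial and combn, which are used here only for illustration.

```r
n <- 10
k <- 3

# Built-in binomial coefficient.
choose(n, k)                                        # 120

# The formula n! / (k! (n - k)!) from Proposition 2.2.
factorial(n) / (factorial(k) * factorial(n - k))    # 120

# combn(n, k) lists every k-element subset of 1, ..., n as a column,
# so the number of columns is the number of ways to choose.
subsets <- combn(n, k)
ncol(subsets)                                       # 120
```

Listing subsets with combn is only practical for small \(n\); choose computes the count without building the subsets.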

Example

A coin is tossed 10 times. Some possible outcomes are HHHHHHHHHH, HTHTHTHTHT, and HHTHTTHTTT. Since each toss has two possibilities, the rule of product says that there are \(2\cdot 2\cdot 2\cdot 2\cdot 2\cdot 2\cdot 2\cdot 2\cdot 2\cdot 2 = 2^{10} = 1024\) possible outcomes for the experiment. We expect each possible outcome to be equally likely, so the probability of any single outcome is 1/1024.

Let \(E\) be the event “We flipped exactly three heads”. This might happen as the sequence HHHTTTTTTT, or TTTHTHTTHT, or many other ways. What is \(P(E)\)? To compute the probability, we need to count the number of possible ways that three heads may appear. Since the three heads may appear in any of the ten slots, the answer is

\[ |E| = {\binom{10}{3}} = \frac{10 \times 9 \times 8}{3 \times 2 \times 1} = 120. \]

Then \(P(E) = 120/1024 \approx 0.117\). We can also estimate \(P(E)\) with simulation:

event <- replicate(10000,{
  flips <- sample(c('H','T'),10,replace=TRUE)
  heads <- sum(flips == 'H')
  heads == 3
})
mean(event)

## [1] 0.1173
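Since there are only 1024 outcomes, the counting argument can also be confirmed by listing every outcome. The sketch below assumes the base R function expand.grid and codes Heads as 1 and Tails as 0; both choices are for illustration only.

```r
# Build all sequences of 10 tosses; each row is one outcome,
# with 1 for Heads and 0 for Tails.
outcomes <- expand.grid(rep(list(0:1), 10))
nrow(outcomes)      # 1024, matching the rule of product

# Number of heads in each outcome.
heads <- rowSums(outcomes)

sum(heads == 3)     # 120 outcomes have exactly three heads
mean(heads == 3)    # exact probability 120/1024
```

Unlike the simulation, this enumeration gives the exact value, but it becomes impractical quickly because the number of rows doubles with every additional toss.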

Example

Suppose that in a class of 10 boys and 10 girls, 5 students are randomly chosen to present work at the board. a. What is the probability that all 5 students are boys? b. What is the probability that exactly 4 of the students are girls?

Let \(E\) be the event “all 5 students are boys.” The sample space consists of all ways of choosing 5 students from a class of 20, so choose(20,5) = 15504. The event \(E\) consists of all ways of choosing 5 boys from a group of 10, so choose(10, 5) = 252. Therefore, the probability is 252/15504 = .016.

Next, let \(A\) be the event “exactly 4 of the students are girls.” The sample space is still the same. The event \(A\) can be broken down into two tasks: choose the 4 girls and choose the 1 boy. By the rule of product, there are choose(10, 4) * choose(10,1) = 2100 ways of doing that. Therefore, the probability is 2100/15504 = .135.
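Both answers can be computed in R and compared with a simulation that draws 5 students without replacement. This is a sketch under the assumption that the class can be represented as a character vector of labels; the names students and girls_chosen are illustrative.

```r
# Exact probabilities from the counting argument.
choose(10, 5) / choose(20, 5)                    # all 5 are boys
choose(10, 4) * choose(10, 1) / choose(20, 5)    # exactly 4 are girls

# Simulation: a class of 10 boys and 10 girls.
students <- rep(c("boy", "girl"), each = 10)
girls_chosen <- replicate(10000, {
  presenters <- sample(students, 5)   # 5 distinct students, no replacement
  sum(presenters == "girl")           # number of girls among them
})

mean(girls_chosen == 0)   # estimate of P(all 5 are boys)
mean(girls_chosen == 4)   # estimate of P(exactly 4 are girls)
```

Because sample draws without replacement by default, no student can be chosen twice, which matches the counting argument.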

Example

A deck of 52 cards has four suits and thirteen ranks in each suit: 2,3,4,5,6,7,8,9,10,J,Q,K,A. If you are dealt two cards, what is the probability they have the same rank?

The sample space is all possible two card hands. Since there are 52 cards in the deck, there are \(\binom{52}{2} = 1326\) possible hands.

The event “both cards have the same rank” can be broken down into two choices. First, choose the rank those cards will have. There are 13 choices. Next choose two of the four cards with that rank for your hand, for which there are \(\binom{4}{2} = 6\) choices. Then there are \(13 \times 6 = 78\) ways for your hand to have a pair of the same rank.

The probability of getting a pair is \(78/1326 \approx 0.059\).

To estimate with simulation, we build a deck of cards by thinking of the ranks as the numbers 2 through 14 and then using ‘rep’ to produce four copies:

deck <- rep(2:14,4)
pair <- replicate(10000,{
    hand <- sample(deck,2)
    hand[1] == hand[2]
})
mean(pair)

## [1] 0.0602
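The count of 78 pairs can also be verified by listing every two card hand. The sketch below assumes the base R function combn, which forms combinations by position, so the four cards of each rank are treated as distinct cards even though they share a value.

```r
# Ranks 2 through 14, four copies of each, as in the simulation.
deck <- rep(2:14, 4)

# Every two card hand as a column of a 2-row matrix.
hands <- combn(deck, 2)
ncol(hands)                   # 1326 possible hands

# A hand is a pair when both cards have the same rank.
is_pair <- hands[1, ] == hands[2, ]
sum(is_pair)                  # 78 hands are pairs
mean(is_pair)                 # exact probability 78/1326
```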

Exercises

Two dice are rolled.
1. What is the probability that the sum of the numbers is exactly 10?
2. What is the probability that the sum of the numbers is at least 10?
3. What is the probability that the sum of the numbers is exactly 10, given that it is at least 10?
A hat contains six slips of paper with the numbers 1 through 6 written on them. Two slips of paper are drawn from the hat (without replacing), and the sum of the numbers is computed.
1. What is the probability that the sum of the numbers is exactly 10?
2. What is the probability that the sum of the numbers is at least 10?
3. What is the probability that the sum of the numbers is exactly 10, given that it is at least 10?
When rolling two dice, what is the probability that one die is twice the other?
Consider an experiment where you roll two dice, and subtract the smaller value from the larger value (getting 0 in case of a tie).
1. What is the probability of getting 0?
2. What is the probability of getting 4?
Roll two dice, one white and one red. Consider these events:
- \(A\): The sum is 7
- \(B\): The white die is odd
- \(C\): The red die has a larger number showing than the white
- \(D\): The dice match (doubles)
1. Which pair(s) of events are disjoint (events \(A\) and \(B\) are disjoint if \(A \cap B = \emptyset\))?
2. Which pair(s) are independent?
3. Which pair(s) are neither disjoint nor independent?
Suppose you do an experiment where you select ten people at random and ask their birthdays. Here are three events:
- \(A\) : all ten people were born in February
- \(B\) : the first person was born in February
- \(C\) : the second person was born in January
1. Which pair(s) of these events are disjoint, if any?
2. Which pair(s) of these events are independent, if any?
3. What is \(P(B | A)\)?
Suppose a die is tossed three times. Let \(A\) be the event “The first toss is a 5”. Let \(B\) be the event “The first toss is the largest number rolled” (the “largest” can be a tie). Determine, via simulation or otherwise, whether \(A\) and \(B\) are independent.
Bob Ross was a painter with a PBS television show “The Joy of Painting” that ran for 11 years.
1. 91% of Bob’s paintings contain a tree. 85% contain two or more trees.
  
  What is the probability that he painted a second tree, given that he painted a tree?
2. 18% of Bob’s paintings contain a cabin. Given that he painted a cabin, there is a 35% chance the cabin is on a lake.
  
  What is the probability that a Bob Ross painting contains both a cabin and a lake?
(Source: https://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/)
Rick Wilkin determined that in the Tennessee M&M’s factory, the proportion of M&M’s by color is: \[ \begin{array}{cccccc} Yellow & Red & Orange & Brown & Green & Blue \\ 0.14 & 0.13 & 0.20 & 0.12 & 0.20 & 0.21 \end{array} \]
1. What is the probability that a randomly selected M&M is not green?
2. What is the probability that a randomly selected M&M is red, orange, or yellow?
3. Estimate the probability that a random selection of four M&M’s will contain a blue one.
4. Estimate the probability that a random selection of six M&M’s will contain all six colors.
With the M&M color distribution from the previous exercise, suppose you buy a bag of M&M’s with 30 pieces in it. Estimate the probability of obtaining at least 9 Blue M&M’s and at least 6 Orange M&M’s in the bag.
Blood types O, A, B, and AB have the following distribution in the US: \[ \begin{array}{lcccc} \text{Type} & O & A & B & AB\\ \text{Probability} & 0.45 & 0.40 & 0.11 & 0.04 \end{array} \] What is the probability that two randomly selected people have the same blood type?
Use simulation to estimate the probability that a 10 is obtained when two dice are rolled.
Estimate the probability that exactly 3 Heads are obtained when 7 coins are tossed.
Estimate the probability that the sum of five dice is between 15 and 20, inclusive.
Suppose a die is tossed repeatedly, and the cumulative sum of all tosses seen is maintained. Estimate the probability that the cumulative sum ever is exactly 20. (Hint: the function cumsum computes the cumulative sums of a vector.)
Rolling two dice
1. Simulate rolling two dice and adding their values. Do 10000 simulations and make a bar chart showing how many of each outcome occurred.
2. You can buy trick dice, which look (sort of) like normal dice. One die has numbers 5, 5, 5, 5, 5, 5. The other has numbers 2, 2, 2, 6, 6, 6.
  
  Simulate rolling the two trick dice and adding their values. Do 10000 simulations and make a bar chart showing how many of each outcome occurred.
3. Sicherman dice also look like normal dice, but have unusual numbers.
  
  One die has numbers 1, 2, 2, 3, 3, 4. The other has numbers 1, 3, 4, 5, 6, 8.
  
  Simulate rolling the two Sicherman dice and adding their values. Do 10000 simulations and make a bar chart showing how many of each outcome occurred.
  
  How does your answer compare to part 1?
In a room of 200 people (including you), estimate the probability that at least one other person will be born on the same day as you.
In a room of 100 people, estimate the probability that at least two people were not only born on the same day, but also during the same hour of the same day. (For example, both were born between 2 and 3.)
Assuming that there are no leap day babies and that all birthdays are equally likely, estimate the probability that at least three people have the same birthday in a group of 50 people. (Hint: try using table.)
If 100 balls are randomly placed into 20 urns, estimate the probability that at least one of the urns is empty.
A standard deck of cards has 52 cards, four each of 2,3,4,5,6,7,8,9,10,J,Q,K,A. In blackjack, a player gets two cards and adds their values. Cards count as their usual numbers, except Aces are 11 (or 1), while K, Q, J are all 10.
1. “Blackjack” means getting an Ace and a value ten card. What is the probability of getting a blackjack?
2. What is the probability of getting 19? (The probability that the sum of your cards is 19, using Ace as 11)
Use R to simulate dealing two cards, and compute these probabilities experimentally.
Ultimate frisbee players are so poor they don’t own coins. So, team captains decide which team will play offense first by flipping frisbees before the start of the game. Rather than flip one frisbee and call a side, each team captain flips a frisbee and one captain calls whether the two frisbees will land on the same side, or on different sides. Presumably, they do this instead of just flipping one frisbee because a frisbee is not obviously a fair coin - the probability of one side seems likely to be different from the probability of the other side.
1. Suppose you flip two fair coins. What is the probability they show different sides?
2. Suppose two captains flip frisbees. Assume the probability that a frisbee lands convex side up is \(p\). Compute the probability (in terms of \(p\)) that the two frisbees match.
3. Make a graph of the probability of a match in terms of \(p\).
4. One Reddit user flipped a frisbee 800 times and found that in practice, the convex side lands up 45% of the time. When captains flip, what is the probability of “same”? What is the probability of “different”?
5. What advice would you give to an ultimate frisbee team captain?
6. Is the two frisbee flip better than a single frisbee flip for deciding the offense?
Suppose you have two coins (or frisbees) that land with Heads facing up with probability \(p\), where \(0 < p < 1\). One coin is red and the other is white. You toss both coins. Find the probability that the red coin is Heads, given that the red coin and the white coin are different.
Suppose that you have 10 boxes, numbered 0-9. Box \(i\) contains \(i\) red marbles and \(9 - i\) blue marbles. You perform the following experiment. You pick a box at random, draw a marble and record its color. You then replace the marble back in the box, and draw another marble from the same box and record its color. You replace the marble back in the box, and draw another marble from the same box and record its color. So, all three marbles are drawn from the same box.
1. If you draw three consecutive red marbles, what is the probability that the 4th marble will also be red?
2. If you draw three consecutive red marbles, what is the probability that you chose the 9th box?
Deathrolling in World of Warcraft works as follows. Player 1 tosses a 1000 sided die. Say they get \(x_1\). Then player 2 tosses a die with \(x_1\) sides on it. Say they get \(x_2\). Player 1 tosses a die with \(x_2\) sides on it. This pattern continues until a player rolls a 1. The player who loses is the player who rolls a 1. Estimate via simulation the probability that a 1 will be rolled on the 4th roll in deathroll.
Let \(A\) and \(B\) be events. Show that \(P(A \cup B|B) = 1\).
