Chapter 4 Intro to Predictive Modeling

In Chapter 3 we saw that it can be hard to decide how to predict future values from a regression model. The reason is that our model isn’t constructed so as to efficiently predict future values, but rather for other goals. In this chapter, we focus on doing regression for predictions.

4.1 Comparing Two Models

Let’s return to the ISwR::cystfibr data set. We imagine that we want to determine whether height or weight is a better predictor for pemax.

cystfibr <- ISwR::cystfibr

Our idea is that we want to split the data set into two parts. One of the parts is the training data, and the other part is the testing data. We are required to build our model 100% solely on the training data, and we will determine how will it can predict future values based on the testing data. As a first step, let’s imagine that our training data consists of all of the data except for the 6th observation. The test data is then the data that is not in the training data, that is, the 6th observation.

train_data <- cystfibr[-6,]
test_data <- cystfibr[6,]

We build our two models (based on height or weight separately), and we compute the error when predicting the value in the test data.

mod1 <- lm(pemax ~ height, data = train_data)
mod2 <- lm(pemax ~ weight, data = train_data)
predict(mod1, newdata = test_data) - test_data$pemax  #this is the error using model 1

##        6 
## 8.621298

predict(mod2, newdata = test_data) - test_data$pemax #this is the error using model 2

##        6 
## 4.775489

We see that model 1 did a better job predicting the value of pemax in this case than model 2 did. Now, we repeat this for every observation. That is, for each observation, we remove it from the data set, train our model, and compute the error when predicting the new value. This is called Leave one out cross validation (LOOCV).

Here are the details on how to do it in R.

height_errors <- sapply(1:nrow(cystfibr), function(x){
  train_data <- cystfibr[-x,]
  test_data <- cystfibr[x,]
  mod <- lm(pemax ~ height, data = train_data)
  predict(mod, newdata = test_data) - test_data$pemax
})
height_errors

##          1          2          3          4          5          6          7 
## -33.920085 -17.163862 -20.016129  -2.007188 -11.025840   8.621298  33.154993 
##          8          9         10         11         12         13         14 
##  -3.637413  34.297888  16.851617   2.191805  20.110890  16.569973  35.471233 
##         15         16         17         18         19         20         21 
## -19.019928 -25.722616 -39.286566  11.783307  -4.213606  28.258534  47.718375 
##         22         23         24         25 
## -30.346873 -34.178520  38.018016 -68.435878

weight_errors <- sapply(1:nrow(cystfibr), function(x){
  train_data <- cystfibr[-x,]
  test_data <- cystfibr[x,]
  mod <- lm(pemax ~ weight, data = train_data)
  predict(mod, newdata = test_data) - test_data$pemax
})
weight_errors

##           1           2           3           4           5           6 
## -18.1454890  -7.0206108 -22.3306483  -2.4888298  -6.4368299   4.7754887 
##           7           8           9          10          11          12 
##  36.7303628 -13.4657504  24.9013559   6.2140517   0.9329037  24.5308972 
##          13          14          15          16          17          18 
##  18.5253420  47.1982421 -29.0366653 -30.4253120 -50.6977861  16.5412245 
##          19          20          21          22          23          24 
## -16.6026998  23.6413130  46.8093696 -23.4400631 -17.4084988  31.0813452 
##          25 
## -57.0067353

Now we need to decide which errors are “smaller”. A common measurement is the mean-sqaured error (MSE), or equivalently the root mean-squared error (RMSE). Let’s use it.

sqrt(mean(height_errors^2))

## [1] 28.68975

sqrt(mean(weight_errors^2))

## [1] 27.5804

And we see that the MSE for using just weight is slighlty better than the MSE for using just height.

Next, let’s get a common misconception out of the way. Many people who haven’t worked with things like this before would think that it would be best to build the model on all of the variables. Let’s do that and compute the RMSE.

full_errors <- sapply(1:nrow(cystfibr), function(x){
  train_data <- cystfibr[-x,]
  test_data <- cystfibr[x,]
  mod <- lm(pemax ~ ., data = train_data)
  predict(mod, newdata = test_data) - test_data$pemax
})
sqrt(mean(full_errors^2))

## [1] 32.39895

We see that the RMSE is about 17% higher with the full model than it is with the model using just weight!

Exercise 4.1 The aim of this exercise is to illustrate the difference between using $p$-values for determining whether to select variables, and using estimates of MSE.

Find the LOOCV estimate for the MSE when estimating pemax via weight + bmp + fev1 + rv, and compare to just using weight.
Use lm to model pemax ~ weight + bmp + fev1 + rv and observe that rv is the only non-significant variable. Remove it, and estimate the MSE via LOOCV again. Did removing rv improve the estimate of MSE?

LOOCV works well for small data sets such as cystfibr, but can become unwieldy when the data set is large and the method of estimation is involved. For each model that we want to test, we would need to build it $n-1$ times, where $n$ is the number of observations. Let’s examine a couple of alternatives for when LOOCV takes too long.

We can also use LGOCV, which stands for leave group out cross validation. As you might guess, this means that instead of leaving one observation out and building the model, we leave a group of observations out. Within this paradigm, there are several ways to proceed.

We could randomly select 70% to leave in, and leave 30% out, and estimate the MSE based on this split.
We could repeatedly select 70% to leave in, estimate the MSE based on the 30% left out each time, and take the average.
We could split the data into $k$ groups, and estimate the MSE based on leaving out each of the groups separately, and take the average.
We could repeatedly split the data into $k$ groups, and estimate the MSE based on leaving out each of the groups separately, and take the average. Then take the average of the averages.
We could bootstrap resample from the data set. A bootstrap resample is one that resamples observations from a data set with replacement. So, the same observation can appear multiple times in the resampled version. In this case, generally about 38% of the observations do not appear in the resampled version (so-called out of bag), and they can be used to estimate the error of estimation.

Let’s look at how to implement these in R. We will use the insurance data set that is available on the course web page.

dd <- read.csv("data/insurance.csv")
summary(dd)

##       age            sex           bmi           children     smoker    
##  Min.   :18.00   female:662   Min.   :16.00   Min.   :0.000   no :1064  
##  1st Qu.:27.00   male  :676   1st Qu.:26.30   1st Qu.:0.000   yes: 274  
##  Median :39.00                Median :30.40   Median :1.000             
##  Mean   :39.21                Mean   :30.67   Mean   :1.095             
##  3rd Qu.:51.00                3rd Qu.:34.70   3rd Qu.:2.000             
##  Max.   :64.00                Max.   :53.10   Max.   :5.000             
##        region       expenses    
##  northeast:324   Min.   : 1122  
##  northwest:325   1st Qu.: 4740  
##  southeast:364   Median : 9382  
##  southwest:325   Mean   :13270  
##                  3rd Qu.:16640  
##                  Max.   :63770

The first think that I notice is that the expenses look right skew. Let’s plot a histogram.

hist(dd$expenses)

And we can compute the skewness of the data via

e1071::skewness(dd$expenses)

## [1] 1.512483

That is moderately skew; about like an exponential rv.

e1071::skewness(log(dd$expenses))

## [1] -0.08989552

Yep. It might be worth taking the log of expenses and modeling that instead, but I will leave that as an exercise.

If we want to do repeated 70/30 splits, then that is perhaps easiest to do by hand.

N <- nrow(dd)
train_indices <- sample(1:N,ceiling(N * .7))
test_indices <- setdiff((1:N), train_indices)
mod <- lm(expenses ~ ., data = dd)
errors <- predict(mod, newdata = dd[test_indices,]) - dd$expenses[test_indices]
sqrt(mean(errors^2))

## [1] 6102.799

We see that with the first train/test split, we have an estimated RMSE of 6103. (For comparison purposes, if we simply took the mean value, then the RMSE would be about $12K).

Now, we can repeat the train/test split a bunch of times… say 50. If our model building procedure were slower than this, we might not be able to do 50.

rmse <- replicate(50, {
  train_indices <- sample(1:N,ceiling(N * .7))
  test_indices <- setdiff((1:N), train_indices)
  mod <- lm(expenses ~ ., data = dd)
  errors <- predict(mod, newdata = dd[test_indices,]) - dd$expenses[test_indices]
  sqrt(mean(errors^2))
})

Once we have these values, we can compute the mean and standard deviation to get a feeling for the range of MSE that we would expect to get with this model.

mean(rmse)

## [1] 6089.312

quantile(rmse, c(.025, .975))

##     2.5%    97.5% 
## 5394.731 6580.315

c(mean(rmse) - 2 * sd(rmse), mean(rmse) + 2 * sd(rmse))

## [1] 5526.369 6652.256

We can either compute the 95% confidence interval of the RMSE from the quantiles or by taking two times the standard deviation. This is saying that we would be surprised if the true RMSE of this model on unseen, future data is less than $5500 or more than $6500

Exercise 4.2 Consider the insurance data set. Estimate the RMSE with error bounds when modeling expenses on the other variables after taking the log of expenses, using repeated LGOCV. Note that we will still want to get the RMSE of the estimate for expenses, not the log of expenses. Compare your answer to what we achieved in the previous example.

We could also do this for repeated $k$-fold CV, but let’s instead start using the caret function train. The train function is a powerful tool that allows us to train many types of models with different types of cross validation. This example is about as simple of an example as we can do.

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

repeated_cv <- train(x = select(dd, -expenses), 
      y = dd$expenses,
      method = "lm",
      trControl = trainControl(method = "repeatedcv",
                               repeats = 5))
repeated_cv$results

##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      TRUE 6069.047 0.7496436 4201.888 443.7151 0.04441889 245.8752

From this table, we see that the repeated 10-fold CV estimated RMSE is 6073 with standard deviation 368. Recall that using repeated LGOCV, we got an estimated RMSE of 6089 and a standard deviation of 281.

Finally, let’s see what happens with the bootstrap, which is the default in caret for lm.

boot_rmse <- train(x = select(dd, -expenses), 
      y = dd$expenses,
      method = "lm")
boot_rmse$results

##   intercept     RMSE  Rsquared      MAE   RMSESD RsquaredSD  MAESD
## 1      TRUE 6197.658 0.7398227 4278.145 275.6327 0.01830887 162.05

All three methods give relatively similar results in this case. We can guess that our RMSE on future, unseen data will be about $\$6100 \pm \$600$.

4.2 Simulations of Expected MSE

In the previous section, we saw several methods for estimating the MSE or RMSE for models. One question that you may have is: which method should I be using? If we want to get error bounds for our estimate of RMSE, then we will need to use one that is repeated, but that still leaves repeated LGOCV, repeated $k$-vold CV, and repeated bootstrap. The type of model we are building can have an impact on our choice: if the model doesn’t handle data that is repeated well, then bootstrapping would not be a good choice, for example. If model accuracy is highly sensitive to sample size, then LGOCV might not be a good choice, as the model is built on data roughly 70-80% of the size of the original data set.

In this section, though, we will thouroughly examine the estimates for MSE in the case of a simple linear model $y = 1 + 2x + \epsilon$, where $\epsilon \sim N(0, \sigma = 3)$. We suppose that we have data $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, and we do computations in two special cases.

We assume that future $x$ values are a fixed constant $x_0$. In this case, it can be shown that the expected MSE of the model is given by \[ \sigma^2 + \sigma^2/N + \sigma^2 \frac{(x_0 - \overline{x})^2}{S_{xx}}, \] where $S_{xx} = \sum_{i = 1}^N (x_i - \overline{x})^2$.
We assume that future $x$ values are generated through a random process with pdf $f$. It can be shown that the expected MSE of the model is given by \[ \sigma^2 + \sigma^2/N + \sigma^2 \int \frac{(x - \overline{x})^2}{S_{xx}} f(x)\, dx \] where $S_{xx} = \sum_{i = 1}^N (x_i - \overline{x})^2$. A consequence of this is that, in order to minimize expected MSE, the mean of the predictors should equal the mean of the future values!

4.2.1 Estimating MSE and Bias-Variance Tradeoff

Let’s start with case 1. We assume the $x_i$ are uniformly sampled on $[0, 1]$ and we have 100 samples. We make predictions at $x_0 = 0.75$ and compute the MSE.

N <- 100
set.seed(2052020)
xs <- seq(0, 1, length.out = N)
ys <- 1 + 2 * xs + rnorm(N, 0, 3)
sxx <- sum((xs - mean(xs))^2)
true_mse <- 9 + 9/N + 9 * (.75 - mean(xs))^2/sxx

We see from above that the true MSE is 9.156 in this case. Let’s estimate it directly.

sim_data <- replicate(10000, {
  ys <- 1 + 2 * xs + rnorm(N, 0, 3)
  mod <- lm(ys ~ xs)
  predict(mod, newdata = data.frame(xs = 0.75)) - (1 + 2 * .75 + rnorm(1, 0, 3))
})
mean(sim_data^2)

## [1] 9.317569

Pretty close.

One reason that the full model of pemax ~ . in cystfibr does worse than just pemax ~ weight or pemax ~ height is the so-called bias-variance trade-off. The easiest version of this principle was observed in STAT 3850. Assume that $\hat \theta$ is an estimator for $\theta$ and $E[\hat \theta] = \theta_0$.

\[ E[(\hat \theta - \theta)^2] = MSE(\hat \theta) = V(\hat \theta) + Bias(\hat \theta)^2 = V(\hat \theta) + (E[\hat \theta] - \theta)^2 \]

It can be shown, for example, that $S^2 = \frac 1{n - 1} \sum_{i = 1}^n (y_i - \overline{y})^2$ is an unbiased estimator for $\sigma^2$, and that among all unbiased estimators for $\sigma^2$, $S^2$ as the smallest variance. However, we can increase the bias and the corresponding decrease in variance can lead to an estimator for $\sigma^2$ that has a lower MSE. In this case, we can show that $\frac 1n \sum_{i = 1}^n \bigl(y_i - \overline{y}\bigr)^2 = \frac {n-1}{n} S^2$ has lower MSE than $S^2$ does.

In the context of modeling, we have a different formulation of the bias-variance trade-off. Suppose that we are modeling an outcome $y$ that consists of independent noise about a curve with constant variance $\sigma^2$. (We can assume the curve is $y = 0$.) Suppose also that we use a model to estimate the outcome. We have that

\[ E[MSE] = \sigma^2 + (\text{Model Bias})^2 + \text{Model Variance} \]

We are never going to be able to get rid of the $\sigma^2$ part, because each time we draw new samples, the $\sigma^2$ is going to be there. However, we can trade-off between the model bias and the model variance. Often, we can reduce the bias by increasing the variance, or reduce the variance by increasing the bias in such a way that the overall expected value of the MSE decreases. Let’s look at an example with simulated data to further understand what each component of the above means!

We again consider data generated from $y = 1 + 2x + \epsilon$, and we look at the MSE when predicting values at $x_0 = 0.75$, which we computed above to be 9.156. Let’s estimate the bias of the prediction.

sim_data <- replicate(10000, {
  ys <- 1 + 2 * xs + rnorm(N, 0, 3)
  mod <- lm(ys ~ xs)
  predict(mod, newdata = data.frame(xs = .75))
})
mean(sim_data)

## [1] 2.498589

1 + 2 * .75

## [1] 2.5

We see that this is, in fact, an unbiased estimator for $y$ when $x_0 = 0.75$. We could have recalled that $E[\hat \beta_0] = \beta_0$ and $E[\hat \beta_1] = \beta_1$, and seen $E[\hat \beta_0 + \hat \beta_1 \times 0.75] = \beta_0 + \beta_1 \times .75 = 2.5$, as well. So, the bias squared is 0. Now, we need to estimate the variance of the model when $x_0 = 0.75$. To do so, we take the variance of the simulated data.

var(sim_data) + 3^2

## [1] 9.156849

true_mse

## [1] 9.156163

Adding this to $\sigma^2$ gives us the MSE. If we repeat this a couple of times, we see this is correct.

Example Suppose we have data generated by $y = sin(2 \pi x) + \epsilon$, where $\epsilon \sim N(0, \sigma = .4)$. We choose to predict new values by splitting the data into two groups and approximating with a horizontal line, see the textbook for details. Let’s generate some data and plot our predictions.

N <- 20
xs <- seq(0, 1, length.out = N)
ys <- sin(2 * pi * xs) + rnorm(N, 0, .4)
plot(xs, ys)
y1 <- mean(ys[xs < .5])
y2 <- mean(ys[xs >= .5])
curve(ifelse(x < .5, y1, y2), add = TRUE, col = 2)

The red line is our prediction at the various $x$ values. In particular, when $x_0 = 0.75$, our prediction is y2, -0.672. In this case, we will

estimate the MSE.

for_mse <- replicate(10000, {
  ys <- sin(2 * pi * xs) + rnorm(N, 0, .4)
  y2 <- mean(ys[xs >= .5])
  new_y = sin(2 * pi * 0.75) + rnorm(1, 0, .4)
  y2 - new_y
})
mse_estimate <- mean(for_mse^2)
mse_estimate

## [1] 0.336882

&nbsp;

We know $\sigma^2$

sigma_squared <- .4^2

&nbsp;

Estimate the Bias at $x_0 = 0.75$

for_bias <- replicate(10000, {
  ys <- sin(2 * pi * xs) + rnorm(N, 0, .4)
  y2 <- mean(ys[xs >= .5])
  new_y = sin(2 * pi * 0.75) + rnorm(1, 0, .4)
  y2 - new_y
})
bias_estimate <- mean(for_bias)
bias_estimate

## [1] 0.3930525

&nbsp;

Estimate the model variance at $x_0 = 0.75$.

for_model_variance <- replicate(10000, {
  ys <- sin(2 * pi * xs) + rnorm(N, 0, .4)
  y2 <- mean(ys[xs >= .5])
  new_y = sin(2 * pi * 0.75) + rnorm(1, 0, .4)
  y2
})
model_variance_estimate <- var(for_model_variance)
model_variance_estimate

## [1] 0.01600941

&nbsp;

Check the bias-variance trade-off.

sigma_squared + bias_estimate^2 + model_variance_estimate

## [1] 0.3304996

mse_estimate

## [1] 0.336882

4.2.2 Estimating MSE for new predictors uniformly sampled

In this section, we imagine that our predictors that we are using for model building are uniformly sampled between 0 and 1, and future predictors will be drawn from a uniform random variable on the interval $(0, 1)$. In the special case that the original $x_i$ are uniformly sampled from 0 to $M$ (ie seq(0, M, length.out = N) and new $x$ values are chosen uniformly over the interval $[0, M]$ we get an expected MSE of \[ \sigma^2\bigl(1 + 1/N + \frac{M^2}{12 S_{xx}}\bigr) \] Let’s make a computation when $N = 100$ and $M = 10$ in our model where $\sigma^2 = 9$.

sigma <- 3
N <- 100
minx <- 0
maxx <- 10
xs <- seq(minx, maxx, length.out = N)
sxx <- sum((xs - mean(xs))^2)
mse_real <- sigma^2 * (1  + 1/N + maxx^2/12/sxx)
mse_real

## [1] 9.178218

We see that the true expected MSE from building a model like this is 9.178. Since we are going to be doing simulation in this section, let’s convince ourselves that we can simulate this value correctly.

mean_x <- mean(xs)
sxx <- sum((xs - mean_x)^2)
mse_sim <- replicate(40000, {
  ys <- 1 + 2 * xs + rnorm(N, 0, sigma)
  dd <- data.frame(xs, ys)
  sxy <- sum((dd$xs - mean_x) * dd$ys)
  coeffs_2 <- sxy/sxx
  coeffs_1 <- mean(dd$ys) - coeffs_2 * mean_x
  x_star <- runif(1, minx, maxx)
  y_star <- 1 + 2 * x_star + rnorm(1, 0, 3)
  (coeffs_1 + coeffs_2 * x_star - y_star)^2
})
mean(mse_sim)

## [1] 9.268973

mse_real

## [1] 9.178218

If you run the above code a few times, you should see that the estimate value through simulation is close to the true value we computed above. To be clear, we could have done the following to get coeffs_1 and coeffs_2, but we want to resample a bunch and the way done above is much faster, and gets the same answer as illustrated below:

ys <- 1 + 2 * xs + rnorm(N, 0, sigma)
dd <- data.frame(xs, ys)
sxy <- sum((dd$xs - mean_x) * dd$ys)
coeffs_2 <- sxy/sxx
coeffs_1 <- mean(dd$ys) - coeffs_2 * mean_x
lm(ys ~ xs, data = dd)

## 
## Call:
## lm(formula = ys ~ xs, data = dd)
## 
## Coefficients:
## (Intercept)           xs  
##      0.7865       2.0341

coeffs_1

## [1] 0.7865495

coeffs_2

## [1] 2.034139

Now let’s look at estimating the MSE using bootstrapping. We are assuming that the xs are fixed here, and the random part is generating the ys.

sim_data <- replicate(100, {
  dd$ys <- 1 + 2 * dd$xs + rnorm(N, 0, sigma) 
  boot_error <- train(ys ~ xs, data = dd, method = "lm")
  boot_error$results$RMSE
})
mean(sim_data)

## [1] 3.05326

min(sim_data)

## [1] 2.480849

sd(sim_data)

## [1] 0.2291119

t.test(sim_data^2, mu = true_mse)

## 
##  One Sample t-test
## 
## data:  sim_data^2
## t = 1.5417, df = 99, p-value = 0.1263
## alternative hypothesis: true mean is not equal to 9.156163
## 95 percent confidence interval:
##  9.093537 9.655188
## sample estimates:
## mean of x 
##  9.374362

This runs too slow on my computer, and the following code speeds things up.

sim_data <- replicate(1000, {replicate(25, {
  dd$ys <- 1 + 2 * dd$xs + rnorm(N, 0, sigma)
  train_indices <- sample(1:nrow(dd), replace = T)
  train <- dd[train_indices,]
  test <- dd[-train_indices,]
  sxx <- sum((train$xs - mean(train$xs))^2)
  sxy <- sum((train$xs - mean(train$xs)) * train$ys)
  coeffs_2 <- sxy/sxx
  coeffs_1 <- mean(train$ys) - coeffs_2 * mean(train$xs)
  errors <- coeffs_1 + test$xs * coeffs_2 - test$ys
  mean(errors^2)
}) %>% mean()
})
mean(sim_data)

## [1] 9.35093

sd(sim_data)

## [1] 0.4636004

t.test(sim_data, mu = true_mse)

## 
##  One Sample t-test
## 
## data:  sim_data
## t = 13.285, df = 999, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 9.156163
## 95 percent confidence interval:
##  9.322161 9.379698
## sample estimates:
## mean of x 
##   9.35093

We see that bootstrapping significantly overestimates the MSE in this scenario.

Now, let’s look at repeated $10$-fold CV using the caret package. As a reminder, here is how we do it.

library(caret)
tt <- caret::train(ys ~ xs, 
                   data = dd, 
                   method = "lm", 
                   trControl = trainControl(method = "repeatedcv",
                                            repeats = 5))
tt$results$RMSE^2

## [1] 10.25573

Let’s replicate this and see what the mean is.

cv_data <- replicate(50, {
  ys <- 1 + 2 * xs + rnorm(N, 0, sigma)
  dd$ys <- ys
  tt <- caret::train(ys ~ xs, 
                   data = dd, 
                   method = "lm", 
                   trControl = trainControl(method = "repeatedcv",
                                            repeats = 5))
  tt$results$RMSE^2
})
mean(cv_data)

## [1] 8.726005

sd(cv_data)

## [1] 1.356354

mse_real

## [1] 9.178218

t.test(cv_data, mu = mse_real)

## 
##  One Sample t-test
## 
## data:  cv_data
## t = -2.3575, df = 49, p-value = 0.02244
## alternative hypothesis: true mean is not equal to 9.178218
## 95 percent confidence interval:
##  8.340534 9.111477
## sample estimates:
## mean of x 
##  8.726005

Here, we see that 10-fold repeated CV tends to underestimate the true MSE, and the standard deviation of the estimator is quite high to boot. This indicates that, for this particular problem, bootstrapping seems to be the better way to go. Again, the above code is really slow, and only does 50 different data sets. Here is a speed-up. The other advantage of doing it this way is that we can estimate MSE directly, rather than estimating RMSE and squaring, which introduces bias (as seen above).

sim_data_2 <- replicate(200, {
  dd$ys <- 1 + 2 * dd$xs + rnorm(N, 0, 3)
  replicate(5, {
    folds <- caret::createFolds(dd$ys)
    sapply(1:10, function(x) {
      test_indices <- folds[[x]]
      test <- dd[test_indices,]
      train <- dd[-test_indices,]
      sxx <- sum((train$xs - mean(train$xs))^2)
      sxy <- sum((train$xs - mean(train$xs)) * train$ys)
      coeffs_2 <- sxy/sxx
      coeffs_1 <- mean(train$ys) - coeffs_2 * mean(train$xs)
      errors <- coeffs_1 + test$xs * coeffs_2 - test$ys
      mean(errors^2)
    })
  }) %>% mean() #estimate of MSE via 5 repeated 10-fold CV
})
mean(sim_data_2)

## [1] 9.17039

sd(sim_data_2)

## [1] 1.260917

t.test(sim_data_2, mu = mse_real)

## 
##  One Sample t-test
## 
## data:  sim_data_2
## t = -0.087796, df = 199, p-value = 0.9301
## alternative hypothesis: true mean is not equal to 9.178218
## 95 percent confidence interval:
##  8.99457 9.34621
## sample estimates:
## mean of x 
##   9.17039

For this example, repeated CV seems considerably less biased than bootstrapping. Bootstrapping has a lower standard deviation.

Exercise 4.3 Use simulation to compute the expected value and standard deviation of the estimator for MSE when using LGOCV with 70% in group and 30% out of group in the above scenarios. (That is, assume that we have a sample size of 100 from data that follows the pattern $y = 1 + 2x + \epsilon$, where $\epsilon \sim N(0, \sigma = 3)$. Assume the $x$ values in the data are uniformly spaced between 0 and 1, and the additional $x$ value is chosen randomly between 0 and 1, so we know the true MSE is 9.178218.) How does the estimate compare to bootstrapping and repeated 10-fold CV in terms of bias and standard deviation?

4.3 Bias Variance Trade-off

Let’s imagine that we take a sample of size 10 and we connect those points with line segments as our model. We imagine that the 10 $x$ values are fixed and do not change with subsequent data collection (which would change things up).

xs <- 1:10
ys <- rnorm(10, 0, 2)
dd <- data.frame(xs, ys)
ggplot(dd, aes(xs, ys)) + 
  geom_line()

This is our model. In order to compute the expected mean-squared error, we will need to sample data, build our model, create a new data point and compute the error squared, then find the expected value of that. For simplicity sake, we assume that the new data point falls on the $x$ value of 5. In that case, we don’t need to compute anything else, the error is simply the difference between consecutive draws of normal random variables, which has variance $2^2 + 2^2 = 8$. However, we include the full code of the full simulation below.

xs <- 1:10
errs <- replicate(10000, {
  ys <- rnorm(10, 0, 2)
  new_x <- sample(1:10, 1)
  new_y <- rnorm(1, 0, 2)
  ys[new_x] - new_y
})
mean(errs^2) #this is our estimate of E[MSE]

## [1] 8.293638

Next, we estimate the bias.

mean(errs) #our estimate of the bias

## [1] -0.002079004

Next, we estimate the variance of the model. Here, we only look at the variance of the estimated values.

est_values <- replicate(10000, {
    ys <- rnorm(10, 0, 2)
    new_x <- sample(1:10, 1)
    ys[new_x]
})
var(est_values)

## [1] 3.952116

And we see that the MSE, which we estimate to be about 8, is the sum of the variance $\sigma^2 = 4$, the bias squared = 0, and the variance of the model itself, which is about 4. Hmmm, that one was maybe too easy in that it didn’t illustrate all of the ideas. Let’s try the one from the textbook, where they model $y = \sin(2\pi x) + \epsilon$ by splitting the x-values into two groups and approximating with a line. That’ll be fun!

This time, we also assume that our $x$’s are chosen randomly. We also plot our estimate.

xs <- runif(40, 0, 1)
ys <- sin(2 * pi * xs) + rnorm(40, 0, .4)
plot(xs, ys)
new_f <- function(x) {
  ifelse(x < 1/2, mean(ys[xs < 1/2]), mean(ys[xs >= 1/2]))
}
curve(new_f, add = T, col = 2)

OK, the variance that we can’t remove from this model is $.4^2 = 0.16$. Now, we estimate the expected value of the MSE by picking a bunch of $x$ values, estimating the expected MSE for that $x$ value, and integrating over all of the $x$ values. This yields the same result as the method we used in the previous section. The reason we are doing this more complicated thing is that when we have to estimate the Variance of the model, we will need to do it this way, so we may as well get used to doing it like that now…

x_vals <- seq(0, 1, length.out = 300)
for_mse <- sapply(x_vals, function(x1) {
  errors <- replicate(1000, {
    xs <- runif(40, 0, 1)
    ys <- sin(2 * pi * xs) + rnorm(40, 0, .4)
    y1 <- mean(ys[xs < 1/2])
    y2 <- mean(ys[xs >= 1/2])
    new_x <-  x1
    new_y <- sin(2*pi*new_x) + rnorm(40, 0, .4)
    ifelse(new_x < 1/2, new_y - y1, new_y - y2)
  })
  mean(errors^2)
})
head(for_mse)

## [1] 0.5810354 0.5781042 0.5196992 0.4853559 0.4650616 0.4196338

We see that the expected MSE is larger when the value of $sin(2 \pi x)$ is close to 0; that is, when $x$ is close to 0. Now, we just take the average MSE, and that is our estimate for the overall expected MSE of the model.

mean(for_mse)

## [1] 0.2689246

Cool. That is our estimate of the MSE associated with the model. Now we estimate the variance using the same technique.

for_variance <- sapply(x_vals, function(x1) {
  errors <- replicate(1000, {
    xs <- runif(40, 0, 1)
    ys <- sin(2 * pi * xs) + rnorm(40, 0, .4)
    y1 <- mean(ys[xs < 1/2])
    y2 <- mean(ys[xs >= 1/2])
    new_x <-  x1
    ifelse(new_x < 1/2, y1, y2)
  })
  var(errors)
})
head(for_variance)

## [1] 0.01247829 0.01323231 0.01343762 0.01326937 0.01323904 0.01245012

These seem not to depend nearly as much on $x$, which is what should be expected! Our integral of this is about

mean(for_variance)

## [1] 0.01306395

Let’s look at the bias squared.

biases <- sapply(seq(0, 1, length.out = 300), function(x1) {
  errors <- replicate(1000, {
    xs <- runif(40, 0, 1)
    ys <- sin(2 * pi * xs) + rnorm(40, 0, .4)
    y1 <- mean(ys[xs < 1/2])
    y2 <- mean(ys[xs >= 1/2])
    new_x <-  x1
    new_y <- sin(2 * pi * new_x) + rnorm(1, 0, .4)
    ifelse(new_x < 1/2, y1 - new_y, y2 - new_y)
  })
  mean(errors)
})
head(biases)

## [1] 0.6278339 0.5922848 0.6125268 0.5666743 0.5615541 0.5299953

OK, and the integral of the biases squared is about

sum(biases^2/100)

## [1] 0.2866719

Here is the moment of truth!

mean(for_mse)

## [1] 0.2689246

.4^2 + mean(for_variance) + mean(biases^2)

## [1] 0.2686213

Woo-hoo! Those seem to match! I wasn’t sure when I was doing the simulations with smaller samples, so I kept making them bigger and now I am convinced. This also helps me understand what the model variance and model bias refer to, though in this case the model variance is constant across all of the values of $x$.

Exercise 4.4 Verify the bias-variance trade-off formula through simulations when the signal is $y = \sin(2 \pi x) + \epsilon$, where $\epsilon \sim N(0, \sigma = .4)$, and the model is obtained by the following procedure. (Note that this is the same procedure as above, but we have replaced 1/2 by 3/4).

If $x < 3/4$, then $y$ is the mean of all of the $y$ values associated with $x$ values that are less than 3/4.
If $x \ge 3/4$, then $y$ is the mean of all of the $y$ values associated with $x$ values that are greater than or equal to 3/4.

4.4 Exercises

Exercise 4.1
Exercise 4.2
Exercise 4.3
Exercise 4.4