This document walks you through some of the fundamental R commands you will need for exploring cross-validation, and gives you an alternative way to compute the estimated cross-validation error.
It is assumed that you have already attended the week 15 lecture.
library(ggplot2) # graphics library
library(MASS) # contains data sets
library(ISLR) # contains code and data from the textbook
library(knitr) # contains kable() function
library(boot) # contains cross-validation functions
library(gam) # needed for additive models
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
options(scipen = 4) # Suppresses scientific notation
In the lecture we studied how to compute the test MSE for each fold.
To combine the folds into a single estimate: sum the squared prediction errors across all of the folds, divide by the total number of observations, and take the square root to get the cross-validated standard error of estimate (the cross-validated RMSE).
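To make that recipe concrete, here is a minimal by-hand sketch (my own illustration, not code from the textbook Lab) using the Auto data from the ISLR package, which we work with below. The choice of K = 5 and the linear model are just placeholders.
# By-hand K-fold cross-validation sketch: estimate the test error of a
# linear fit of mpg on horsepower using K = 5 folds of the Auto data.
set.seed(1)
K <- 5
n <- nrow(Auto)
folds <- sample(rep(1:K, length.out = n))   # random fold assignment for each row
sq.err <- numeric(n)                        # held-out squared error for every observation
for (k in 1:K) {
  fit <- lm(mpg ~ horsepower, data = Auto[folds != k, ])
  pred <- predict(fit, newdata = Auto[folds == k, ])
  sq.err[folds == k] <- (Auto$mpg[folds == k] - pred)^2
}
mean(sq.err)        # cross-validated MSE (average over all n observations)
sqrt(mean(sq.err))  # cross-validated standard error of estimate (RMSE)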
If you are following along with the ISLR book, take a look at §5.3 (pages 191-193). You will want to have the textbook Lab open in front of you as you go through these exercises. The ISLR Lab provides much more context and explanation for what you’re doing.
You will need the Auto data set from the ISLR library in order to complete this exercise.
Please run all of the code indicated in §5.3.1 of ISLR, even if I don’t explicitly ask you to do so in this document.
Use the View() command on the Auto data to see what the data set looks like.
# View(Auto)
Use qplot to construct a scatterplot of mpg vs horsepower. Use stat_smooth() to overlay a linear, quadratic, and cubic polynomial fit to the data.
qplot(data = Auto, x = horsepower, y = mpg) +
  stat_smooth(method = "lm", aes(colour = "Linear regression")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 2), aes(colour = "Quadratic fit")) +
  stat_smooth(method = "lm", formula = y ~ poly(x, 3), aes(colour = "Cubic fit")) +
  scale_colour_discrete("Model") +
  theme_bw()
## `geom_smooth()` using formula 'y ~ x'
Use set.seed(1) to set the seed of the random number generator. This will ensure that your answers match those in the text.
# Edit me
set.seed(1)
Use the sample() command to construct train, a vector of observation indexes to be used for the purpose of training your model.
# Edit me
train <- sample(392, 196)
In the sample(n, size) syntax, the sample function produces a random sample of size size from 1:n. Sampling is done without replacement.
Use the lm() command to fit a linear regression of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit.
# Edit me
lm.fit <- lm(mpg ~ horsepower, data = Auto, subset = train)
Why do we specify subset = train?
The train indexes form the training set. We want to train on just the train data, not the full data.
Calculate the test MSE of lm.fit on the test set (i.e., all points that are not in train).
mse1 <- with(Auto, mean((mpg - predict(lm.fit, Auto))[-train]^2))
mse1
## [1] 23.26601
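If the [-train] indexing feels opaque, here is an equivalent way to get the same number (assuming lm.fit and train from above are still in your workspace), predicting only on an explicit test set. This is just a readability sketch, not part of the Lab.
# Same test MSE, written with an explicit held-out data frame instead of [-train]
Auto.test <- Auto[-train, ]
mean((Auto.test$mpg - predict(lm.fit, newdata = Auto.test))^2)
This should reproduce mse1 exactly.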
Use the poly() command to fit a quadratic regression model of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit2.
lm.fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto, subset = train)
# Pull the p-value
coef(summary(lm.fit2))["poly(horsepower, 2)2", "Pr(>|t|)"]
## [1] 2.247784e-12
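As a quick cross-check (not something the Lab asks for), an F-test comparing the two nested training-set fits points to the same conclusion as the coefficient p-value above.
# F-test for adding the quadratic term, comparing the nested fits on the training data
anova(lm.fit, lm.fit2)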
Calculate the test MSE of lm.fit2. How does it compare to the linear regression fit?
mse2 <- with(Auto, mean((mpg - predict(lm.fit2, Auto))[-train]^2))
mse2
## [1] 18.71646
Use the poly() command to fit a cubic regression model of mpg on horsepower, specifying subset = train. Save this in a variable called lm.fit3.
lm.fit3 <- lm(mpg ~ poly(horsepower, 3), data = Auto, subset = train)
# Pull the p-value
coef(summary(lm.fit3))["poly(horsepower, 3)3", "Pr(>|t|)"]
## [1] 0.812198
mse3 <- with(Auto, mean((mpg - predict(lm.fit3, Auto))[-train]^2))
mse3
## [1] 18.79401
mse1 # Linear test error
## [1] 23.26601
mse2 # Quadratic test error
## [1] 18.71646
mse3 # Cubic test error
## [1] 18.79401
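Since the next part asks you to redo everything with a different seed, one convenient pattern is to wrap the validation-set steps above in a function of the seed. This is purely a sketch of mine (the name validation.errors is not from the Lab):
# Validation-set test MSE for polynomial degrees 1-3, as a function of the seed
validation.errors <- function(seed) {
  set.seed(seed)
  train <- sample(392, 196)
  sapply(1:3, function(d) {
    fit <- lm(mpg ~ poly(horsepower, d), data = Auto, subset = train)
    with(Auto, mean((mpg - predict(fit, Auto))[-train]^2))
  })
}
validation.errors(1)  # should reproduce mse1, mse2, mse3 above
validation.errors(5)  # the split used in the next part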
Repeat the exercise above, this time using set.seed(5). You do not have to retype all of the code. You can just change the initial set.seed() command and see what happens.
With set.seed(5), you are getting a different random split of the data between training and validation sets. The estimates of test error also change, because we now have slightly different model estimates and a different random set to test the models on. We now get an error of 22.2 for the linear model, 15.2 for the quadratic, and 15.2 for the cubic. These numbers are quite different from our estimates based on the first split we tried. However, the conclusion is essentially the same: the cubic model doesn’t seem to add much, but the quadratic is much better than the linear. We should use the quadratic model.
This exercise introduces you to the cv.glm() command from the boot library, which automates K-fold cross-validation for Generalized Linear Models (GLMs). Linear regression is one example of a GLM. Logistic regression is another. GLMs are not the same as Generalized Additive Models (GAMs).
Please run all of the code indicated in §5.3.2 of ISLR, even if I don’t explicitly ask you to do so in this document.
Use the glm command to fit a linear regression of mpg on horsepower. Call the resulting model glm.fit. Confirm that this gives the same coefficient estimates as a linear model fit with the lm command.
Note: You should fit the model to the entire data, not just to the training data.
glm.fit <- glm(mpg ~ horsepower, data = Auto)
coef(glm.fit)
## (Intercept) horsepower
## 39.9358610 -0.1578447
coef(lm(mpg ~ horsepower, data = Auto))
## (Intercept) horsepower
## 39.9358610 -0.1578447
# Are they the same? Yes.
identical(coef(glm.fit), coef(lm(mpg ~ horsepower, data = Auto)))
## [1] TRUE
Use cv.glm() to construct cv.error, the vector of LOOCV error estimates for polynomials of degree 1-5.
Note: The computations take some time to run, so I’ve set cache = TRUE in the code chunk header to make sure that the code below doesn’t re-execute at knit time unless this particular chunk has changed.
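For reference, a cached chunk in the underlying R Markdown source looks something like this (the chunk label loocv-errors is just a placeholder of mine):
```{r loocv-errors, cache = TRUE}
# ... the LOOCV loop below goes here ...
```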
cv.err <- cv.glm(Auto, glm.fit)
cv.err$delta
## [1] 24.23151 24.23114
cv.error <- rep(0, 5)
for (i in 1:5) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error[i] <- cv.glm(Auto, glm.fit)$delta[1]
}
cv.error
## [1] 24.23151 19.24821 19.33498 19.42443 19.03321
# Form a little data frame for plotting
cv.df <- data.frame(degree = 1:5,
                    cv.error = cv.error)
qplot(data = cv.df, x = degree, y = cv.error, geom = "line",
      ylab = "LOOCV error estimate") + geom_point()
which.min(cv.error)
## [1] 5
The degree-5 model (the degree given by which.min(cv.error)) has the lowest LOOCV estimate of test error.
Please run all of the code indicated in §5.3.3 of ISLR.
set.seed(17)
cv.error.10 <- rep(0, 10)
for (i in 1:10) {
  glm.fit <- glm(mpg ~ poly(horsepower, i), data = Auto)
  cv.error.10[i] <- cv.glm(Auto, glm.fit, K = 10)$delta[1]
}
cv.error.10
## [1] 24.27207 19.26909 19.34805 19.29496 19.03198 18.89781 19.12061 19.14666
## [9] 18.87013 20.95520
# Form a little data frame for plotting
cv.df <- data.frame(degree = 1:10,
                    cv.error = cv.error.10)
qplot(data = cv.df, x = degree, y = cv.error, geom = "line",
      ylab = "10-fold CV error estimate") + geom_point()
which.min(cv.error.10)
## [1] 9
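Finally, as a quick side-by-side (assuming cv.error and cv.error.10 from above are still in your workspace), you can line up the LOOCV and 10-fold estimates for the five polynomial degrees they share:
# Compare LOOCV and 10-fold CV error estimates for polynomial degrees 1 through 5
kable(data.frame(degree = 1:5,
                 LOOCV = cv.error,
                 ten.fold = cv.error.10[1:5]),
      digits = 2)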
Lab exercises from Prof. Alexandra Chouldechova, released under an Attribution-NonCommercial-ShareAlike 3.0 United States license.
This document is used for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.