Topic 4 - Linear regression in R
Setup
Let’s begin by loading the packages we’ll need to get started
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.4 ✓ dplyr 1.0.3
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Linear regression
Linear regression is just a more general form of ANOVA, which itself is a generalized t-test. In each case, we’re assessing if and how the mean of our outcome \(y\) varies with other variables. Unlike t-tests and ANOVA, which are restricted to the case where the factors of interest are all categorical, regression allows you to also model the effects of continuous variables.
linear regression is used to model linear relationship between an outcome variable, \(y\), and a set of covariates or predictor variables \(x_1, x_2, \ldots, x_p\).
Loading the data
For our first example we’ll look at a small data set in which we’re interested in predicting the crime rate per million population based on socio-economic and demographic information at the state level.
Let’s first import the data set and see what we’re working with.
# Import data set
# "https://github.com/Sondzus/StatsAnalytics/blob/master/data/crime_simple.txt"
crime <- read_delim("http://www.andrew.cmu.edu/user/achoulde/94842/data/crime_simple.txt", delim = "\t")##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## R = col_double(),
## Age = col_double(),
## S = col_double(),
## Ed = col_double(),
## Ex0 = col_double(),
## Ex1 = col_double(),
## LF = col_double(),
## M = col_double(),
## N = col_double(),
## NW = col_double(),
## U1 = col_double(),
## U2 = col_double(),
## W = col_double(),
## X = col_double()
## )
The variable names that this data set comes with are very confusing, and even misleading.
R: Crime rate: # of offenses reported to police per million population
Age: The number of males of age 14-24 per 1000 population
S: Indicator variable for Southern states (0 = No, 1 = Yes)
Ed: Mean # of years of schooling x 10 for persons of age 25 or older
Ex0: 1960 per capita expenditure on police by state and local government
Ex1: 1959 per capita expenditure on police by state and local government
LF: Labor force participation rate per 1000 civilian urban males age 14-24
M: The number of males per 1000 females
N: State population size in hundred thousands
NW: The number of non-whites per 1000 population
U1: Unemployment rate of urban males per 1000 of age 14-24
U2: Unemployment rate of urban males per 1000 of age 35-39
W: Median value of transferable goods and assets or family income in tens of $
X: The number of families per 1000 earning below 1/2 the median income
We really need to give these variables better names
# Assign more meaningful variable names, also
# Convert is.south to a factor
# Divide average.ed by 10 so that the variable is actually average education
# Convert median assets to 1000's of dollars instead of 10's
crime <- crime %>%
rename(crime.per.million = R,
young.males = Age,
is.south = S,
average.ed = Ed,
exp.per.cap.1960 = Ex0,
exp.per.cap.1959 = Ex1,
labour.part = LF,
male.per.fem = M,
population = N,
nonwhite = NW,
unemp.youth = U1,
unemp.adult = U2,
median.assets = W,
num.low.salary = X) %>%
mutate(is.south = as.factor(is.south),
average.ed = average.ed / 10,
median.assets = median.assets / 100)
# print summary of the data
summary(crime)## crime.per.million young.males is.south average.ed exp.per.cap.1960
## Min. : 34.20 Min. :119.0 0:31 Min. : 8.70 Min. : 45.0
## 1st Qu.: 65.85 1st Qu.:130.0 1:16 1st Qu.: 9.75 1st Qu.: 62.5
## Median : 83.10 Median :136.0 Median :10.80 Median : 78.0
## Mean : 90.51 Mean :138.6 Mean :10.56 Mean : 85.0
## 3rd Qu.:105.75 3rd Qu.:146.0 3rd Qu.:11.45 3rd Qu.:104.5
## Max. :199.30 Max. :177.0 Max. :12.20 Max. :166.0
## exp.per.cap.1959 labour.part male.per.fem population
## Min. : 41.00 Min. :480.0 Min. : 934.0 Min. : 3.00
## 1st Qu.: 58.50 1st Qu.:530.5 1st Qu.: 964.5 1st Qu.: 10.00
## Median : 73.00 Median :560.0 Median : 977.0 Median : 25.00
## Mean : 80.23 Mean :561.2 Mean : 983.0 Mean : 36.62
## 3rd Qu.: 97.00 3rd Qu.:593.0 3rd Qu.: 992.0 3rd Qu.: 41.50
## Max. :157.00 Max. :641.0 Max. :1071.0 Max. :168.00
## nonwhite unemp.youth unemp.adult median.assets
## Min. : 2.0 Min. : 70.00 Min. :20.00 Min. :2.880
## 1st Qu.: 24.0 1st Qu.: 80.50 1st Qu.:27.50 1st Qu.:4.595
## Median : 76.0 Median : 92.00 Median :34.00 Median :5.370
## Mean :101.1 Mean : 95.47 Mean :33.98 Mean :5.254
## 3rd Qu.:132.5 3rd Qu.:104.00 3rd Qu.:38.50 3rd Qu.:5.915
## Max. :423.0 Max. :142.00 Max. :58.00 Max. :6.890
## num.low.salary
## Min. :126.0
## 1st Qu.:165.5
## Median :176.0
## Mean :194.0
## 3rd Qu.:227.5
## Max. :276.0
EDA
First step: some plotting and summary statistics
You can start by feeding everything into a regression, but it’s often a better idea to construct some simple plots (e.g., scatterplots and boxplots) and summary statistics to get some sense of how the data behaves.
# Scatter plot of outcome (crime.per.million) against average.ed
qplot(average.ed, crime.per.million, data = crime)## [1] 0.3228349
This seems to suggest that higher levels of average education are associated with higher crime rates. Can you come up with an explanation for this phenomenon?
# Scatter plot of outcome (crime.per.million) against median.assets
qplot(median.assets, crime.per.million, data = crime)## [1] 0.4413199
There also appears to be a positive association between median assets and crime rates.
# Boxplots showing crime rate broken down by southern vs non-southern state
qplot(is.south, crime.per.million, geom = "boxplot", data = crime)Constructing a regression model
To construct a linear regression model in R, we use the lm() function. You can specify the regression model in various ways. The simplest is often to use the formula specification.
The first model we fit is a regression of the outcome (crimes.per.million) against all the other variables in the data set. You can either write out all the variable names. or use the shorthand y ~ . to specify that you want to include all the variables in your regression.
crime.lm <- lm(crime.per.million ~ ., data = crime)
# Summary of the linear regression model
crime.lm##
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
##
## Coefficients:
## (Intercept) young.males is.south1 average.ed
## -6.918e+02 1.040e+00 -8.308e+00 1.802e+01
## exp.per.cap.1960 exp.per.cap.1959 labour.part male.per.fem
## 1.608e+00 -6.673e-01 -4.103e-02 1.648e-01
## population nonwhite unemp.youth unemp.adult
## -4.128e-02 7.175e-03 -6.017e-01 1.792e+00
## median.assets num.low.salary
## 1.374e+01 7.929e-01
##
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.884 -11.923 -1.135 13.495 50.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.918e+02 1.559e+02 -4.438 9.56e-05 ***
## young.males 1.040e+00 4.227e-01 2.460 0.01931 *
## is.south1 -8.308e+00 1.491e+01 -0.557 0.58117
## average.ed 1.802e+01 6.497e+00 2.773 0.00906 **
## exp.per.cap.1960 1.608e+00 1.059e+00 1.519 0.13836
## exp.per.cap.1959 -6.673e-01 1.149e+00 -0.581 0.56529
## labour.part -4.103e-02 1.535e-01 -0.267 0.79087
## male.per.fem 1.648e-01 2.099e-01 0.785 0.43806
## population -4.128e-02 1.295e-01 -0.319 0.75196
## nonwhite 7.175e-03 6.387e-02 0.112 0.91124
## unemp.youth -6.017e-01 4.372e-01 -1.376 0.17798
## unemp.adult 1.792e+00 8.561e-01 2.093 0.04407 *
## median.assets 1.374e+01 1.058e+01 1.298 0.20332
## num.low.salary 7.929e-01 2.351e-01 3.373 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.94 on 33 degrees of freedom
## Multiple R-squared: 0.7692, Adjusted R-squared: 0.6783
## F-statistic: 8.462 on 13 and 33 DF, p-value: 3.686e-07
R’s default is to output values in scientific notation. This can make it hard to interpret the numbers. Here’s some code that can be used to force full printout of numbers.
##
## Call:
## lm(formula = crime.per.million ~ ., data = crime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.884 -11.923 -1.135 13.495 50.560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -691.837588 155.887918 -4.438 0.0000956 ***
## young.males 1.039810 0.422708 2.460 0.01931 *
## is.south1 -8.308313 14.911588 -0.557 0.58117
## average.ed 18.016011 6.496504 2.773 0.00906 **
## exp.per.cap.1960 1.607818 1.058667 1.519 0.13836
## exp.per.cap.1959 -0.667258 1.148773 -0.581 0.56529
## labour.part -0.041031 0.153477 -0.267 0.79087
## male.per.fem 0.164795 0.209932 0.785 0.43806
## population -0.041277 0.129516 -0.319 0.75196
## nonwhite 0.007175 0.063867 0.112 0.91124
## unemp.youth -0.601675 0.437154 -1.376 0.17798
## unemp.adult 1.792263 0.856111 2.093 0.04407 *
## median.assets 13.735847 10.583028 1.298 0.20332
## num.low.salary 0.792933 0.235085 3.373 0.00191 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.94 on 33 degrees of freedom
## Multiple R-squared: 0.7692, Adjusted R-squared: 0.6783
## F-statistic: 8.462 on 13 and 33 DF, p-value: 0.0000003686
Looking at the p-values, it looks like num.low.salary (number of families per 1000 earning below 1/2 the median income), unemp.adult (Unemployment rate of urban males per 1000 of age 35-39), average.ed (Mean # of years of schooling 25 or older), and young.males (number of males of age 14-24 per 1000 population) are all statistically significant predictors of crime rate.
The coefficients for these predictors are all positive, so crime rates are positively associated with wealth inequality, adult unemployment rates, average education levels, and high rates of young males in the population.
Exploring the lm object
What kind of output do we get when we run a linear model (lm) in R?
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
##
## $class
## [1] "lm"
## (Intercept) young.males is.south1 average.ed
## -691.837587905 1.039809653 -8.308312889 18.016010601
## exp.per.cap.1960 exp.per.cap.1959 labour.part male.per.fem
## 1.607818377 -0.667258285 -0.041031047 0.164794968
## population nonwhite unemp.youth unemp.adult
## -0.041276887 0.007174688 -0.601675298 1.792262901
## median.assets num.low.salary
## 13.735847285 0.792932786
None of the attributes seem to give you p-values. Here’s what you can do to get a table that allows you to extract p-values.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -691.838 155.888 -4.438 0.000
## young.males 1.040 0.423 2.460 0.019
## is.south1 -8.308 14.912 -0.557 0.581
## average.ed 18.016 6.497 2.773 0.009
## exp.per.cap.1960 1.608 1.059 1.519 0.138
## exp.per.cap.1959 -0.667 1.149 -0.581 0.565
## labour.part -0.041 0.153 -0.267 0.791
## male.per.fem 0.165 0.210 0.785 0.438
## population -0.041 0.130 -0.319 0.752
## nonwhite 0.007 0.064 0.112 0.911
## unemp.youth -0.602 0.437 -1.376 0.178
## unemp.adult 1.792 0.856 2.093 0.044
## median.assets 13.736 10.583 1.298 0.203
## num.low.salary 0.793 0.235 3.373 0.002
If you want a particular p-value, you can get it by doing the following
# Pull the coefficients table from summary(lm)
crime.lm.coef <- round(summary(crime.lm)$coef, 3)
# See what this gives
class(crime.lm.coef)## [1] "matrix" "array"
## $dim
## [1] 14 4
##
## $dimnames
## $dimnames[[1]]
## [1] "(Intercept)" "young.males" "is.south1" "average.ed"
## [5] "exp.per.cap.1960" "exp.per.cap.1959" "labour.part" "male.per.fem"
## [9] "population" "nonwhite" "unemp.youth" "unemp.adult"
## [13] "median.assets" "num.low.salary"
##
## $dimnames[[2]]
## [1] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
## [1] 0.009
The coefficients table is a matrix with named rows and columns. You can therefore access particular cells either by numeric index, or by name (as in the example above).
Plotting the lm object
These four plots are important diagnostic tools in assessing whether the linear model is appropriate. The first two plots are the most important, but the last two can also help with identifying outliers and non-linearities.
Residuals vs. Fitted When a linear model is appropriate, we expect
the residuals will have constant variance when plotted against fitted values; and
the residuals and fitted values will be uncorrelated.
If there are clear trends in the residual plot, or the plot looks like a funnel, these are clear indicators that the given linear model is inappropriate.
Normal QQ plot You can use a linear model for prediction even if the underlying normality assumptions don’t hold. However, in order for the p-values to be believable, the residuals from the regression must look approximately normally distributed.
Scale-location plot This is another version of the residuals vs fitted plot. There should be no discernible trends in this plot.
Residuals vs Leverage. Leverage is a measure of how much an observation influenced the model fit. It’s a one-number summary of how different the model fit would be if the given observation was excluded, compared to the model fit where the observation is included. Points with high residual (poorly described by the model) and high leverage (high influence on model fit) are outliers. They’re skewing the model fit away from the rest of the data, and don’t really seem to fit with the rest of the data.
A detour: Diagnostic plots for diamonds data.
The residual vs fitted and scale-location diagnostic plots for the crime data aren’t especially insightful, largely due to the very small sample size. Below we look at the
diamondsdata to see what a more typical anaylsis of linear model diagnostic plots might reveal.
Residuals vs. Fitted
There is a clear indication of non-linearity present in this plot. Furthermore, we see that the variance appears to be increasing in fitted value.
Normal QQ plot The residuals appear highly non-normal. Both the lower tail and upper tail are heavier than we would expect under normality. This may be due to the non-constant variance issue we observed in the Residuals vs. Fitted plot.
Scale-location plot We see a clear increasing trend in residual variance that runs through most of the plot. This is indicated by the upward slope of the red line, which we can interpret as the standard deviation of the residuals at the given level of fitted value.
Residuals vs Leverage. None of the points appear to be outliers.
Here’s what happens if we log-transform both the price and carat variables.
diamonds.lm2 <- lm(log(price) ~ I(log(carat)) + cut + clarity + color, data = diamonds)
plot(diamonds.lm2)While there remains a very slight indication of non-linearity in the Residual vs Fitted plot, the non-constant variance issue appears to have been addressed by the variable transformations. The Normal QQ plot indicates that the residuals have a heavier tailed distribution, but since we have a very large sample size this should not cause problems for inference. There do not appear to be any clear outliers in the data.
Collinearity and pairs plots
In your regression class you probably learned that collinearity can throw off the coefficient estimates. To diagnose collinearity, we can do a plot matrix. In base graphics, this can be accomplished via the pairs function.
As a demo, let’s look at some of the economic indicators in our data set.
economic.var.names <- c("exp.per.cap.1959", "exp.per.cap.1960", "unemp.adult", "unemp.youth", "labour.part", "median.assets")
pairs(crime[,economic.var.names])## exp.per.cap.1959 exp.per.cap.1960 unemp.adult unemp.youth
## exp.per.cap.1959 1.000 0.994 0.169 -0.052
## exp.per.cap.1960 0.994 1.000 0.185 -0.044
## unemp.adult 0.169 0.185 1.000 0.746
## unemp.youth -0.052 -0.044 0.746 1.000
## labour.part 0.106 0.121 -0.421 -0.229
## median.assets 0.794 0.787 0.092 0.045
## labour.part median.assets
## exp.per.cap.1959 0.106 0.794
## exp.per.cap.1960 0.121 0.787
## unemp.adult -0.421 0.092
## unemp.youth -0.229 0.045
## labour.part 1.000 0.295
## median.assets 0.295 1.000
Since the above-diagonal and below-diagonal plots contain essentially the same information, it’s often more useful to display some other values in one of the spaces. In the example below, we use the panel.cor function from the pairs() documentation to add text below the diagonal.
# Function taken from ?pairs Example section.
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = pmax(1, cex.cor * r))
}
# Use panel.cor to display correlations in lower panel.
pairs(crime[,economic.var.names], lower.panel = panel.cor)# ggpairs from GGally library
# Unlike pairs(), ggpairs() works with non-numeric
# predictors in addition to numeric ones.
# Consider ggpairs() for your final project
ggpairs(crime[,c(economic.var.names, "is.south")], axisLabels = "internal")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.