Linear regression in R & Python
Estimating the coefficients in R;
Model analytics and diagnostics
Background
What is the purpose of these notes?
- Provide an example of simple and multiple linear regression;
- Learn the linear regression function in R;
- Learn the linear regression function in Python.
What is the format of this document?
This document was created using R Markdown. You can read more about it here and check out a cheat sheet here, which will guide you through installing RStudio; from there, the moment you create a new `.Rmd` document, it will be a working template to start from. If you are used to using LaTeX, no worries: it can be embedded into Markdown, with overall simpler formatting. I hope you find this useful!
Installing and loading packages
Just as with every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the `install.packages()` command. You’ll want to run the following commands to get the necessary packages for today’s lab:
install.packages("rmdformats")
install.packages("ggplot2")
install.packages("knitr")
install.packages("ISLR")
You only need to install packages once. Once they’re installed, you may use them by loading the libraries with the `library()` command. For today’s lab, you’ll want to run the following code.
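A minimal sketch, assuming we only need the four packages installed above:

```r
# load the packages installed above
library(rmdformats)
library(ggplot2)
library(knitr)
library(ISLR)
```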
Context
Recall some important questions about the linear regression model:
- Is at least one of the predictors \(X_1, \dots, X_p\) useful in predicting the response?
- Do all the predictors help to explain Y, or is only a subset of the predictors useful?
- How well does the model fit the data?
- Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
Review:
These concepts will be important to keep in mind:

- What is the population regression line?
- What is the estimated regression line?
- What is the meaning of the estimated intercept when you estimate the regression line?
- What is the meaning of the estimated coefficients?
Simple linear regression example in R
Throughout this handout, we will be working on the `Auto` data set that comes with the `ISLR` package in R.
`lm()` and output
Here is how to do simple linear regression in R.
- Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor, using the code below.
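Reconstructing the call from the `summary()` output that follows (the object name `fit.lm` is the one referenced below):

```r
# simple linear regression of mpg on horsepower
fit.lm <- lm(mpg ~ horsepower, data = Auto)
```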
Wait a minute. Where is the output??
As usual, it’s stored in the object to the left of `<-`, and so it is not printed.
Let’s take a look at the `fit.lm` object. Use the `summary()` function to print the results.
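That is:

```r
summary(fit.lm)
```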
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
Interpreting the output
- Is there a relationship between the predictor and the response?
Yes.
- How strong is the relationship between the predictor and the response?
The \(p\)-value is close to 0: the relationship is strong. (The \(R^2\) of 0.606 means `horsepower` explains about 61% of the variance in `mpg`.)
- Is the relationship between the predictor and the response positive or negative?
Coefficient is negative: relationship is negative
- What is the predicted `mpg` associated with a `horsepower` of \(98\)? What are the associated 95% confidence and prediction intervals?
First, make a new data frame object containing the new point:
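A sketch of the calls behind the three outputs below (the data frame name `new.df` is our choice):

```r
new.df <- data.frame(horsepower = 98)
predict(fit.lm, new.df)                           # point prediction
predict(fit.lm, new.df, interval = "confidence")  # 95% confidence interval
predict(fit.lm, new.df, interval = "prediction")  # 95% prediction interval
```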
1
24.46708
fit lwr upr
1 24.46708 23.97308 24.96108
fit lwr upr
1 24.46708 14.8094 34.12476
- \(\rightarrow\) confidence interval vs. prediction interval \(\leftarrow\)
What is the difference between confidence and prediction intervals? \(\rightarrow\) We will learn in the next lecture!
Simple plots
Plot the response and the predictor. Use the `abline()` function to display the least squares regression line.
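A minimal sketch in base R (the axis labels and line color are our choices):

```r
plot(Auto$horsepower, Auto$mpg,
     xlab = "horsepower", ylab = "mpg")
abline(fit.lm, col = "red", lwd = 2)  # least squares regression line
```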
Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
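Calling `plot()` on an `lm` object gives four diagnostic plots; one common idiom to view them at once:

```r
par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(fit.lm)
```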
- The residuals vs. fitted plot shows a clear pattern, indicating that the relationship is non-linear.
Simple linear regression example in Python
Please note: in this section, all code chunks are Python code chunks. This means we have written `{python}` rather than `{r}` at the top.
import pandas as pd

path_to_file = "Desktop/auto.csv"
df = pd.read_csv(path_to_file)  # remember to change this to the path of the file you downloaded to your computer
Now that we have imported the data, let’s take a peek:
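Presumably with `head()`, which matches the five rows printed below:

```python
print(df.head())  # first five rows of the Auto data
```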
mpg cylinders displacement ... year origin name
0 18.0 8 307.0 ... 70 1 chevrolet chevelle malibu
1 15.0 8 350.0 ... 70 1 buick skylark 320
2 18.0 8 318.0 ... 70 1 plymouth satellite
3 16.0 8 304.0 ... 70 1 amc rebel sst
4 17.0 8 302.0 ... 70 1 ford torino
[5 rows x 9 columns]
`sm.OLS` and output
The target variable we are interested in studying is `mpg`, and we want to see its relationship with the other variables. First, a simple linear regression of `mpg` on `horsepower` is done like this:
For this purpose we are using `statsmodels.api`. One thing to note before fitting the model: to include an intercept, we must add a new "constant variable" to our predictor(s), because `sm.OLS` performs the regression without an intercept by default!
import statsmodels.api as sm

X = df['horsepower']    # predictor
y = df['mpg']           # target
X = sm.add_constant(X)  # adding a constant because we want the intercept
model = sm.OLS(y, X).fit()  # fit the model; remember to put y (the target) first!
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.606
Model: OLS Adj. R-squared: 0.605
Method: Least Squares F-statistic: 599.7
Date: Mon, 12 Apr 2021 Prob (F-statistic): 7.03e-81
Time: 14:51:09 Log-Likelihood: -1178.7
No. Observations: 392 AIC: 2361.
Df Residuals: 390 BIC: 2369.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 39.9359 0.717 55.660 0.000 38.525 41.347
horsepower -0.1578 0.006 -24.489 0.000 -0.171 -0.145
==============================================================================
Omnibus: 16.432 Durbin-Watson: 0.920
Prob(Omnibus): 0.000 Jarque-Bera (JB): 17.305
Skew: 0.492 Prob(JB): 0.000175
Kurtosis: 3.299 Cond. No. 322.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Simple plots
Now let’s plot the fitted model:
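A minimal sketch using `matplotlib` (the original chunk is not shown, so the plotting choices here are ours):

```python
import matplotlib.pyplot as plt

plt.scatter(df['horsepower'], df['mpg'], s=10, label='data')
plt.plot(df['horsepower'], model.predict(X), color='red', label='fitted line')
plt.xlabel('horsepower')
plt.ylabel('mpg')
plt.legend()
plt.show()
```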
Prediction
If you want to predict the response at a new value such as `98`, you have to give it as an input like `[1, 98]`, because of the intercept (the leading `1` multiplies the constant term). If you want to predict many values collected in a set called `New_values`, each row passed to the `exog=` argument must likewise start with a `1`, i.e. rows of the form `[1, value]`.
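For a single value, the call that produces the array below:

```python
model.predict(exog=[1, 98])  # predicted mpg at horsepower = 98
```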
array([24.46707715])
Quick reference to multiple linear regression
Q: Is there a relationship between the response and the predictors?
Multiple case: \(p\) predictors; we need to ask whether all of the regression coefficients are zero: \(\beta_1=\dots=\beta_p=0\)?
- Hypothesis test: \(H_0:\beta_1=\dots=\beta_p=0\) vs. \(H_1:\) at least one \(\beta_i\neq 0\).
- Which statistic? \[F= \frac{(TSS-RSS)/p}{RSS/(n-p-1)}.\]
- TSS and RSS are defined as in the simple case: \(TSS=\sum_i (y_i-\bar{y})^2\) and \(RSS=\sum_i (y_i-\hat{y}_i)^2\).
When there is no relationship between the response and the predictors, one would expect the \(F\)-statistic to take on a value close to 1, since under \(H_0\) both the numerator and the denominator have expected value \(\sigma^2\) [this can be proved via expected values]; otherwise we expect \(F>1\).
\(\rightarrow\) in case of large \(p\), may want to measure partial effects, and do some variable selection
Multiple linear regression example in R
- Produce a scatterplot matrix which includes all of the variables in the data set. What are we looking at? (Read the help page on `pairs`.)
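For the `Auto` data this is a one-liner:

```r
pairs(Auto)  # scatterplot matrix of all variables
```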
- Compute the matrix of correlations between the variables using the function `cor()`. You will need to exclude the `name` variable, which is qualitative.
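One way to exclude the qualitative column:

```r
cor(subset(Auto, select = -name))  # correlations among the eight numeric variables
```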
mpg cylinders displacement horsepower weight
mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
acceleration year origin
mpg 0.4233285 0.5805410 0.5652088
cylinders -0.5046834 -0.3456474 -0.5689316
displacement -0.5438005 -0.3698552 -0.6145351
horsepower -0.6891955 -0.4163615 -0.4551715
weight -0.4168392 -0.3091199 -0.5850054
acceleration 1.0000000 0.2903161 0.2127458
year 0.2903161 1.0000000 0.1815277
origin 0.2127458 0.1815277 1.0000000
- Use the `lm()` function to perform a multiple linear regression with `mpg` as the response and all other variables except `name` as the predictors. Print the results.
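Reconstructing the call from the output below (the object name `fit.lm2` is our choice):

```r
fit.lm2 <- lm(mpg ~ . - name, data = Auto)  # all predictors except name
summary(fit.lm2)
```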
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
Min 1Q Median 3Q Max
-9.5903 -2.1565 -0.1169 1.8690 13.0604
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.218435 4.644294 -3.707 0.00024 ***
cylinders -0.493376 0.323282 -1.526 0.12780
displacement 0.019896 0.007515 2.647 0.00844 **
horsepower -0.016951 0.013787 -1.230 0.21963
weight -0.006474 0.000652 -9.929 < 2e-16 ***
acceleration 0.080576 0.098845 0.815 0.41548
year 0.750773 0.050973 14.729 < 2e-16 ***
origin 1.426141 0.278136 5.127 0.000000467 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Let’s refer to the output in this last code chunk:
Is there a relationship between the predictors and the response?
- Yes: the \(F\)-statistic is 252.4 with a \(p\)-value below \(2.2\times 10^{-16}\), so at least one coefficient is nonzero.
Which predictors appear to have a statistically significant relationship to the response?
- `weight`, `year`, `origin`, and `displacement` have statistically significant relationships (\(p\)-values below 0.05); the other predictors do not.
What does the coefficient for the `year` variable suggest?
- The \(0.75\) coefficient for `year` suggests that later model-year cars have better (higher) `mpg`: holding the other predictors fixed, `mpg` increases by about 0.75 per model year.
Use the `plot()` function to produce diagnostic plots of the linear regression fit.
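The same idiom as in the simple case, applied to the multiple-regression fit `fit.lm2` defined above:

```r
par(mfrow = c(2, 2))
plot(fit.lm2)  # residuals vs. fitted, Q-Q, scale-location, leverage
```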
- Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers?
- There is evidence of non-linearity.
Multiple linear regression example in Python
# use all the features except 'name', and of course drop the target 'mpg' too
X = df.drop(labels=['name', 'mpg'], axis=1)
y = df['mpg']
X = sm.add_constant(X)      # add the intercept column as before
model = sm.OLS(y, X).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.821
Model: OLS Adj. R-squared: 0.818
Method: Least Squares F-statistic: 252.4
Date: Mon, 12 Apr 2021 Prob (F-statistic): 2.04e-139
Time: 14:51:12 Log-Likelihood: -1023.5
No. Observations: 392 AIC: 2063.
Df Residuals: 384 BIC: 2095.
Df Model: 7
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const -17.2184 4.644 -3.707 0.000 -26.350 -8.087
cylinders -0.4934 0.323 -1.526 0.128 -1.129 0.142
displacement 0.0199 0.008 2.647 0.008 0.005 0.035
horsepower -0.0170 0.014 -1.230 0.220 -0.044 0.010
weight -0.0065 0.001 -9.929 0.000 -0.008 -0.005
acceleration 0.0806 0.099 0.815 0.415 -0.114 0.275
year 0.7508 0.051 14.729 0.000 0.651 0.851
origin 1.4261 0.278 5.127 0.000 0.879 1.973
==============================================================================
Omnibus: 31.906 Durbin-Watson: 1.309
Prob(Omnibus): 0.000 Jarque-Bera (JB): 53.100
Skew: 0.529 Prob(JB): 2.95e-12
Kurtosis: 4.460 Cond. No. 8.59e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
Appendix
Some resources and further reading:
- Read about model diagnostics in the ISLR book.
- Read about regression (more in depth) in the ISLR book.
To view the `lm {stats}` R documentation, type `help(lm)` and/or visit: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The contents of this lecture are based on Chapter 3 of the textbook: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, ‘An Introduction to Statistical Learning: with Applications in R’.
The Python code for this lecture was developed by Amirreza Eshraghi, our TA in 2020/21.