Linear regression in R & Python

Estimating the coefficients in R;
Model analytics and diagnostics


Background

What is the purpose of these notes?

  1. Provide an example of simple and multiple linear regression;
  2. Learn the linear regression function in R;
  3. Learn the linear regression function in Python.

What is the format of this document?

This document was created using R Markdown. You can read more about it here and check out a cheat sheet here. These resources will guide you through installing RStudio; from there, the moment you create a new .Rmd document, you will have a working template to start from. If you are used to LaTeX, no worries: it can be embedded into Markdown, and the overall formatting is simpler. I hope you find this useful!

Installing and loading packages

Like many other programming languages you may be familiar with, R can be greatly extended by installing additional “packages” and “libraries”.

To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:

install.packages("rmdformats")
install.packages("ggplot2")
install.packages("knitr")
install.packages("ISLR")

You only need to install packages once. Once they’re installed, you may use them by loading the libraries using the library() command. For today’s lab, you’ll want to run the following code:

library(ggplot2) # graphics library
library(knitr)   # contains kable() function
options(scipen = 4)  # Suppresses scientific notation

library("ISLR") # to obtain the data sets used in the book 
library("reticulate") # so we can use Python
# This is another R code chunk, to install the packages we need for Python:
py_install("pandas")
py_install("statsmodels")
py_install("matplotlib")

Context

Recall some important questions about the linear regression model:

  1. Is at least one of the predictors \(X_1, \dots, X_p\) useful in predicting the response?
  2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
  3. How well does the model fit the data?
  4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

Review:

These concepts will be important to keep in mind:

  • What is the population regression line?
  • What is the estimated regression line?
  • What is the meaning of the estimated intercept when you estimate the regression line?
  • What is the meaning of the estimated coefficients?
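For reference, here is a sketch of the notation behind these questions, following the book’s setup for simple regression:

\[Y = \beta_0 + \beta_1 X + \varepsilon \quad \text{(population regression line, plus error)}\]

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \quad \text{(estimated regression line)}\]

The estimated intercept \(\hat{\beta}_0\) is the predicted response at \(x = 0\), and each estimated coefficient is the predicted change in the response per one-unit increase in the corresponding predictor (holding any others fixed).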

Simple linear regression example in R

Throughout this handout, we will be working on the Auto data set that comes with the ISLR package in R.

require(ISLR)
data(Auto)

lm and output

Here is how to do simple linear regression in R.

  • Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor using the following code.
fit.lm <- lm(mpg ~ horsepower, data=Auto)

Wait a minute. Where is the output??

As usual, it’s stored in the object to the left of <- and so it is not printed.

Let’s take a look at the fit.lm object. Use the summary() function to print the results.

summary(fit.lm)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Interpreting the output

  • Is there a relationship between the predictor and the response?

Yes: the \(p\)-value for the horsepower coefficient is below \(2 \times 10^{-16}\), so we reject \(H_0: \beta_1 = 0\).

  • How strong is the relationship between the predictor and the response?

The \(p\)-value is close to 0, so the evidence for a relationship is strong; moreover, \(R^2 \approx 0.61\) means horsepower explains about 61% of the variance in mpg.

  • Is the relationship between the predictor and the response positive or negative?

The coefficient estimate is negative (\(-0.158\)), so the relationship is negative.
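You can also pull these numbers out of the fitted object programmatically instead of reading them off the printed summary; a minimal sketch:

coef(fit.lm)      # estimated intercept and slope
confint(fit.lm)   # 95% confidence intervals for the coefficients
summary(fit.lm)$coefficients["horsepower", "Pr(>|t|)"]  # p-value for the slope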

  • What is the predicted mpg associated with a horsepower of \(98\)? What are the associated 95% confidence and prediction intervals?

First you have to make a new data frame object which will contain the new point:

new <- data.frame(horsepower = 98)
predict(fit.lm, new) # predicted mpg
       1 
24.46708 
predict(fit.lm, new, interval="confidence") # conf interval
       fit      lwr      upr
1 24.46708 23.97308 24.96108
predict(fit.lm, new, interval="prediction") # pred interval
       fit     lwr      upr
1 24.46708 14.8094 34.12476
  • \(\rightarrow\) confidence interval vs. prediction interval \(\leftarrow\)

What is the difference between confidence and prediction intervals? \(\rightarrow\) We will learn this in the next lecture!

Simple plots

Plot the response and the predictor. Use the abline() function to display the least squares regression line.

plot(Auto$horsepower, Auto$mpg)
abline(fit.lm, col="red")

Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

par(mfrow=c(2,2))
plot(fit.lm)

  • The residuals-vs-fitted plot shows a clear U-shaped pattern, which suggests the relationship is non-linear.

Simple linear regression example in Python

Please note: In this section, all code chunks are python code chunks. This means we have written {python} rather than {r} at the top.

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
path_to_file="Desktop/auto.csv"
df= pd.read_csv(path_to_file)  # Remember to change the path of the file when you are running the code and use thep ath to file you downloaded to your computer 
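If you do not already have an auto.csv file, one way to create it is to export the data from an R chunk, assuming you want the same Auto data the R section used (note this one chunk is R, not Python):

write.csv(Auto, "Desktop/auto.csv", row.names = FALSE)  # export ISLR's Auto data for the Python chunks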

Now that we have imported the data, let’s take a peek:

df.head()
    mpg  cylinders  displacement  ...  year  origin                       name
0  18.0          8         307.0  ...    70       1  chevrolet chevelle malibu
1  15.0          8         350.0  ...    70       1          buick skylark 320
2  18.0          8         318.0  ...    70       1         plymouth satellite
3  16.0          8         304.0  ...    70       1              amc rebel sst
4  17.0          8         302.0  ...    70       1                ford torino

[5 rows x 9 columns]

sm.OLS and output

The target variable we are interested in studying is “mpg”, and we want to see how it relates to the other variables.

First, a simple linear regression of “mpg” on “horsepower” is set up like this:

X,y= df["horsepower"] , df["mpg"]

For this purpose we are using statsmodels.api. One thing to keep in mind before building the model: for our model to contain an intercept, we must add a “constant” column to our predictor(s), because sm.OLS performs the regression without an intercept by default!

# Adding a constant column because we want an intercept
X = sm.add_constant(X)

# Fit the model; remember to put y (the target) first!
model = sm.OLS(y, X).fit()
# The following command prints a summary of the fitted model:
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.605
Method:                 Least Squares   F-statistic:                     599.7
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           7.03e-81
Time:                        14:51:09   Log-Likelihood:                -1178.7
No. Observations:                 392   AIC:                             2361.
Df Residuals:                     390   BIC:                             2369.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         39.9359      0.717     55.660      0.000      38.525      41.347
horsepower    -0.1578      0.006    -24.489      0.000      -0.171      -0.145
==============================================================================
Omnibus:                       16.432   Durbin-Watson:                   0.920
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.305
Skew:                           0.492   Prob(JB):                     0.000175
Kurtosis:                       3.299   Cond. No.                         322.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""

Simple plots

Now let’s plot the fitted model:

fig, ax = plt.subplots()

ax.scatter(df["horsepower"], model.predict(), label='fitted')
ax.scatter(df["horsepower"], df["mpg"], label='original')

ax.legend()
ax.set_title("Linear model fitted values vs the original dataset")
ax.set_xlabel("horsepower")
ax.set_ylabel("mpg")
plt.show()

Prediction

If you want to predict the response at a new value such as horsepower = 98, you have to include the constant term, so the input is [1, 98]. To predict at several new values at once, pass one such row per point as a two-dimensional array, for example exog=[[1, 98], [1, 120]].

model.predict(exog=[1,98])
array([24.46707715])

Quick reference to multiple linear regression

Q: Is there a relationship between the response and the predictors?

  • Multiple case: \(p\) predictors; we need to ask whether all of the regression coefficients are zero: \(\beta_1=\dots=\beta_p=0\)?

    • Hypothesis test: \(H_0:\beta_1=\dots=\beta_p=0\) vs. \(H_1:\) at least one \(\beta_i\neq 0\).
    • Which statistic? \[F= \frac{(TSS-RSS)/p}{RSS/(n-p-1)}.\]
      • TSS and RSS defined as in simple case.
  • When there is no relationship between the response and the predictors, one would expect the F-statistic to take on a value close to 1. [This can be proved via expected values: under \(H_0\), both the numerator \((TSS-RSS)/p\) and the denominator \(RSS/(n-p-1)\) have expected value \(\sigma^2\).] See the sketch after this list.

  • else \(>1\).

  • \(\rightarrow\) when \(p\) is large, we may want to measure partial effects and do some variable selection.
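As a quick sanity check, here is a minimal sketch of computing \(F\) by hand from a fitted model; with the simple fit.lm from above (so \(p = 1\)), it should reproduce the 599.7 reported by summary(fit.lm), and the same code works for the multiple fit below:

rss <- sum(residuals(fit.lm)^2)              # RSS
tss <- sum((Auto$mpg - mean(Auto$mpg))^2)    # TSS
n <- nrow(model.frame(fit.lm))               # number of observations used in the fit
p <- length(coef(fit.lm)) - 1                # number of predictors
((tss - rss) / p) / (rss / (n - p - 1))      # the F-statistic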

Multiple linear regression example in R

  • Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto, lower.panel = NULL)

What are we looking at? (Read the help page on pairs.)

  • Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
cor(subset(Auto, select=-name))
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000
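The very high correlations among some predictors (e.g. 0.95 between cylinders and displacement) hint at multicollinearity, which statsmodels will flag later via a large condition number. A minimal sketch of one common diagnostic, the variance inflation factor, using the identity that the VIFs are the diagonal entries of the inverse of the predictors’ correlation matrix:

R <- cor(subset(Auto, select = -c(mpg, name)))  # correlations among the predictors only
diag(solve(R))                                  # variance inflation factors: VIF_j is the j-th diagonal of the inverse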
  • Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Print results.
fit.lm <- lm(mpg~.-name, data=Auto)
summary(fit.lm)

Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707     0.00024 ***
cylinders     -0.493376   0.323282  -1.526     0.12780    
displacement   0.019896   0.007515   2.647     0.00844 ** 
horsepower    -0.016951   0.013787  -1.230     0.21963    
weight        -0.006474   0.000652  -9.929     < 2e-16 ***
acceleration   0.080576   0.098845   0.815     0.41548    
year           0.750773   0.050973  14.729     < 2e-16 ***
origin         1.426141   0.278136   5.127 0.000000467 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Let’s refer to the output in this last code chunk:

  • Is there a relationship between the predictors and the response?

    • There is a relationship: the F-statistic is 252.4 with a \(p\)-value below \(2.2 \times 10^{-16}\), so at least one coefficient is nonzero.
  • Which predictors appear to have a statistically significant relationship to the response?

    • weight, year, origin and displacement have statistically significant relationships (their \(p\)-values are below 0.05); see the sketch after this list for how to extract them programmatically.
  • What does the coefficient for the year variable suggest?

    • The \(0.75\) coefficient for year suggests that, holding the other predictors fixed, mpg increases by about \(0.75\) for each additional model year; later-model cars get better (higher) mpg.
  • Use the plot() function to produce diagnostic plots of the linear regression fit.

par(mfrow=c(2,2))
plot(fit.lm)

  • Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers?

    • There is evidence of non-linearity.
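As promised above, a minimal sketch of extracting the statistically significant predictors from the fitted model, using the coefficient table stored in the summary:

coefs <- summary(fit.lm)$coefficients   # estimates, std. errors, t values, p-values
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]     # keep only the rows with p-value below 0.05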

Multiple linear regression example in Python

# Use all the features except 'name', and of course drop 'mpg' (the response) too
X = df.drop(labels=['name', 'mpg'], axis=1)
y = df['mpg']
X = sm.add_constant(X)  # adding a constant for the intercept
model = sm.OLS(y, X).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.818
Method:                 Least Squares   F-statistic:                     252.4
Date:                Mon, 12 Apr 2021   Prob (F-statistic):          2.04e-139
Time:                        14:51:12   Log-Likelihood:                -1023.5
No. Observations:                 392   AIC:                             2063.
Df Residuals:                     384   BIC:                             2095.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          -17.2184      4.644     -3.707      0.000     -26.350      -8.087
cylinders       -0.4934      0.323     -1.526      0.128      -1.129       0.142
displacement     0.0199      0.008      2.647      0.008       0.005       0.035
horsepower      -0.0170      0.014     -1.230      0.220      -0.044       0.010
weight          -0.0065      0.001     -9.929      0.000      -0.008      -0.005
acceleration     0.0806      0.099      0.815      0.415      -0.114       0.275
year             0.7508      0.051     14.729      0.000       0.651       0.851
origin           1.4261      0.278      5.127      0.000       0.879       1.973
==============================================================================
Omnibus:                       31.906   Durbin-Watson:                   1.309
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.100
Skew:                           0.529   Prob(JB):                     2.95e-12
Kurtosis:                       4.460   Cond. No.                     8.59e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Appendix

Some resources and further reading:

  • Read about model diagnostics in the ISLR book.

  • Read about regression (more in depth) in the ISLR book.

  • To view lm {stats} R Documentation, type:

help(lm)

and/or visit:
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm

License

This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

The contents of this lecture are based on Chapter 3 of the textbook: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, ‘An Introduction to Statistical Learning: with Applications in R’.

The Python code for this lecture was developed by Amirreza Eshraghi, our TA in 2020/21.



  1. Sonja Petrović, Associate Professor of Applied Mathematics, College of Computing, Illinois Tech. Homepage, Email.