Linear regression in R & Python
Estimating the coefficients in R;
Model analytics and diagnostics
Background
What is the purpose of these notes?
- Provide an example of simple and multiple linear regression;
- Learn the linear regression function in R;
- Learn the linear regression function in Python.
What is the format of this document?
This document was created using R Markdown. You can read more about it here and check out a cheat sheet here; these will guide you through installing RStudio, and from there, the moment you create a new .Rmd document, it serves as a working template to start from. If you are used to LaTeX, no worries: it can be embedded into Markdown, which has overall simpler formatting. I hope you find this useful!
Installing and loading packages
Just like every other programming language you may be familiar with, R’s capabilities can be greatly extended by installing additional “packages” and “libraries”.
To install a package, use the install.packages() command. You’ll want to run the following commands to get the necessary packages for today’s lab:
install.packages("rmdformats")
install.packages("ggplot2")
install.packages("knitr")
install.packages("ISLR")
You only need to install packages once. Once they’re installed, you may use them by loading the libraries using the library() command. For today’s lab, you’ll want to run the following code
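A sketch, with one library() call per package installed above:

```r
library(rmdformats)
library(ggplot2)
library(knitr)
library(ISLR)
```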
Context
Recall some important questions about the linear regression model:
- Is at least one of the predictors \(X_1, \dots, X_p\) useful in predicting the response?
 - Do all the predictors help to explain Y, or is only a subset of the predictors useful?
 - How well does the model fit the data?
 - Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
 
Review:
These concepts will be important to keep in mind:
- What is the population regression line?
- What is the estimated regression line?
- What is the meaning of the estimated intercept when you estimate the regression line?
- What is the meaning of the estimated coefficients?
Simple linear regression example in R
Throughout this handout, we will be working on the Auto data set that comes with the ISLR package in R.
lm and output
Here is how to do simple linear regression in R.
- Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor, using the following code.
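The fitting chunk itself is not reproduced here; a minimal sketch consistent with the summary() output below (the object name fit.lm comes from the text):

```r
library(ISLR)  # provides the Auto data set
fit.lm <- lm(mpg ~ horsepower, data = Auto)
```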
Wait a minute. Where is the output??
As usual, it's stored in the object to the left of <-, and so it is not printed.
Let’s take a look at the fit.lm object. Use the summary() function to print the results.
Call:
lm(formula = mpg ~ horsepower, data = Auto)
Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
Interpreting the output
- Is there a relationship between the predictor and the response?
 
Yes.
- How strong is the relationship between the predictor and the response?
 
\(p\)-value is close to 0: relationship is strong
- Is the relationship between the predictor and the response positive or negative?
 
Coefficient is negative: relationship is negative
- What is the predicted mpg associated with a horsepower of \(98\)? What are the associated 95% confidence and prediction intervals?
First, you have to make a new data frame object containing the new point:
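A sketch of the calls that produce the three outputs below; the data frame name new.dat is an assumption, and fit.lm is the simple-regression fit from above:

```r
new.dat <- data.frame(horsepower = 98)
predict(fit.lm, newdata = new.dat)                           # point prediction
predict(fit.lm, newdata = new.dat, interval = "confidence")  # 95% confidence interval
predict(fit.lm, newdata = new.dat, interval = "prediction")  # 95% prediction interval
```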
       1 
24.46708 
       fit      lwr      upr
1 24.46708 23.97308 24.96108
       fit     lwr      upr
1 24.46708 14.8094 34.12476
- \(\rightarrow\) confidence interval vs. prediction interval \(\leftarrow\)
 
What is the difference between confidence and prediction intervals!? \(\rightarrow\) we will learn this in the next lecture!!
Simple plots
Plot the response and the predictor. Use the abline() function to display the least squares regression line.
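For example (a sketch, assuming the fitted object is stored in fit.lm as above):

```r
plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
abline(fit.lm, col = "red")  # least squares regression line
```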
Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
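A sketch, again assuming the fit is stored in fit.lm:

```r
par(mfrow = c(2, 2))  # 2x2 grid for the four diagnostic plots
plot(fit.lm)
```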
- residuals vs fitted plot shows that the relationship is non-linear
 
Simple linear regression example in Python
Please note: in this section, all code chunks are python code chunks. This means we have written {python} rather than {r} at the top.
import pandas as pd

path_to_file = "Desktop/auto.csv"
df = pd.read_csv(path_to_file)  # remember to change this to the path of the file you downloaded to your computer
Now that we have imported the data, let's take a peek:
    mpg  cylinders  displacement  ...  year  origin                       name
0  18.0          8         307.0  ...    70       1  chevrolet chevelle malibu
1  15.0          8         350.0  ...    70       1          buick skylark 320
2  18.0          8         318.0  ...    70       1         plymouth satellite
3  16.0          8         304.0  ...    70       1              amc rebel sst
4  17.0          8         302.0  ...    70       1                ford torino
[5 rows x 9 columns]
sm.OLS and output
The target variable that we are interested in studying is "mpg", and we want to see its relationship with the other variables.
First, a simple linear regression of "mpg" on "horsepower" is done like this:
For this purpose we are using statsmodels.api. One thing to note before fitting the model: to include an intercept, we must add a new "constant variable" to our predictor(s), because statsmodels.api otherwise performs the regression without an intercept!
import statsmodels.api as sm

# Select the predictor and the response
X = df["horsepower"]
y = df["mpg"]
# Adding constant because we want the intercept
X = sm.add_constant(X)
# With the following command you can fit the model; remember to put the y (target) first!
model = sm.OLS(y, X).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.606
Model:                            OLS   Adj. R-squared:                  0.605
Method:                 Least Squares   F-statistic:                     599.7
Date:                Mon, 12 Apr 2021   Prob (F-statistic):           7.03e-81
Time:                        14:51:09   Log-Likelihood:                -1178.7
No. Observations:                 392   AIC:                             2361.
Df Residuals:                     390   BIC:                             2369.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         39.9359      0.717     55.660      0.000      38.525      41.347
horsepower    -0.1578      0.006    -24.489      0.000      -0.171      -0.145
==============================================================================
Omnibus:                       16.432   Durbin-Watson:                   0.920
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               17.305
Skew:                           0.492   Prob(JB):                     0.000175
Kurtosis:                       3.299   Cond. No.                         322.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
"""
Simple plots
Now let's plot the fitted model:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(df["horsepower"], model.predict(), label='fitted')
ax.scatter(df["horsepower"], df["mpg"], label='original')
ax.legend()
ax.set_title("Linear model fitted values vs the original dataset")
ax.set_xlabel("horsepower")
ax.set_ylabel("mpg")
plt.show()
Prediction
If you want to predict a new value such as 98, you have to pass it as [1, 98], because of the intercept. If you want to predict many values in a set called New_values, all you need to put in the argument is exog=[1, New_values].
array([24.46707715])
Quick reference to multiple linear regression
Q: Is there a relationship between the response and the predictors?
Multiple case: \(p\) predictors; we need to ask whether all of the regression coefficients are zero: \(\beta_1=\dots=\beta_p=0\)?
- Hypothesis test: \(H_0:\beta_1=\dots=\beta_p=0\) vs. \(H_1:\) at least one \(\beta_i\neq 0\).
 - Which statistic? \[F= \frac{(TSS-RSS)/p}{RSS/(n-p-1)}.\]
- TSS and RSS defined as in simple case.
 
 
When there is no relationship between the response and the predictors, one would expect the F-statistic to take a value close to 1 (this can be shown via expected values); otherwise, \(F>1\).
\(\rightarrow\) in case of large \(p\), may want to measure partial effects, and do some variable selection
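As a sketch, the F-statistic from the simple regression earlier can be recomputed by hand from this formula (here \(p = 1\) and \(n = 392\)):

```r
library(ISLR)
fit.lm <- lm(mpg ~ horsepower, data = Auto)
rss <- sum(residuals(fit.lm)^2)                 # residual sum of squares
tss <- sum((Auto$mpg - mean(Auto$mpg))^2)       # total sum of squares
n <- nrow(Auto); p <- 1
F <- ((tss - rss) / p) / (rss / (n - p - 1))
F  # matches the F-statistic (599.7) reported by summary()
```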
Multiple linear regression example in R
- Produce a scatterplot matrix which includes all of the variables in the data set. What are we looking at? (Read the help page on pairs.)
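For example:

```r
pairs(Auto)  # scatterplot matrix of all variables
```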
- Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
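One way to drop the qualitative column (a sketch producing the matrix below):

```r
cor(subset(Auto, select = -name))
```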
                    mpg  cylinders displacement horsepower     weight
mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
             acceleration       year     origin
mpg             0.4233285  0.5805410  0.5652088
cylinders      -0.5046834 -0.3456474 -0.5689316
displacement   -0.5438005 -0.3698552 -0.6145351
horsepower     -0.6891955 -0.4163615 -0.4551715
weight         -0.4168392 -0.3091199 -0.5850054
acceleration    1.0000000  0.2903161  0.2127458
year            0.2903161  1.0000000  0.1815277
origin          0.2127458  0.1815277  1.0000000
- Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Print the results.
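A sketch matching the Call shown below (the object name fit.all is an assumption):

```r
fit.all <- lm(mpg ~ . - name, data = Auto)
summary(fit.all)
```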
Call:
lm(formula = mpg ~ . - name, data = Auto)
Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 
Coefficients:
               Estimate Std. Error t value    Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707     0.00024 ***
cylinders     -0.493376   0.323282  -1.526     0.12780    
displacement   0.019896   0.007515   2.647     0.00844 ** 
horsepower    -0.016951   0.013787  -1.230     0.21963    
weight        -0.006474   0.000652  -9.929     < 2e-16 ***
acceleration   0.080576   0.098845   0.815     0.41548    
year           0.750773   0.050973  14.729     < 2e-16 ***
origin         1.426141   0.278136   5.127 0.000000467 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,    Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
Let’s refer to the output in this last code chunk:
Is there a relationship between the predictors and the response?
- There is a relationship between predictors and response.
 
Which predictors appear to have a statistically significant relationship to the response?
weight, year, origin and displacement have statistically significant relationships.
What does the coefficient for the year variable suggest?
- The \(0.75\) coefficient for year suggests that later model year cars have better (higher) mpg.
Use the plot() function to produce diagnostic plots of the linear regression fit.
- Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers?
- There is evidence of non-linearity.
Multiple linear regression example in Python
# We are going to use all the features except 'name', and of course drop 'mpg' too
X = df.drop(labels=['name', 'mpg'], axis=1)
X = sm.add_constant(X)  # add the intercept column, as before
y = df['mpg']
model = sm.OLS(y, X).fit()
model.summary()
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.821
Model:                            OLS   Adj. R-squared:                  0.818
Method:                 Least Squares   F-statistic:                     252.4
Date:                Mon, 12 Apr 2021   Prob (F-statistic):          2.04e-139
Time:                        14:51:12   Log-Likelihood:                -1023.5
No. Observations:                 392   AIC:                             2063.
Df Residuals:                     384   BIC:                             2095.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const          -17.2184      4.644     -3.707      0.000     -26.350      -8.087
cylinders       -0.4934      0.323     -1.526      0.128      -1.129       0.142
displacement     0.0199      0.008      2.647      0.008       0.005       0.035
horsepower      -0.0170      0.014     -1.230      0.220      -0.044       0.010
weight          -0.0065      0.001     -9.929      0.000      -0.008      -0.005
acceleration     0.0806      0.099      0.815      0.415      -0.114       0.275
year             0.7508      0.051     14.729      0.000       0.651       0.851
origin           1.4261      0.278      5.127      0.000       0.879       1.973
==============================================================================
Omnibus:                       31.906   Durbin-Watson:                   1.309
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.100
Skew:                           0.529   Prob(JB):                     2.95e-12
Kurtosis:                       4.460   Cond. No.                     8.59e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.59e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
Appendix
Some resources and further reading:
Read about model diagnostics in the ISLR book.
Read about regression (more in depth) in the ISLR book.
To view the lm {stats} R documentation, type:
help(lm)
and/or visit:
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm
License
This document is created for Math 514, Spring 2021, at Illinois Tech. While the course materials are generally not to be distributed outside the course without permission of the instructor, this particular set of notes is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
The contents of this lecture are based on Chapter 3 of the textbook: Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, 'An Introduction to Statistical Learning: with Applications in R'.
The Python code for this lecture was developed by Amirreza Eshraghi, our TA in 2020/21.