Explaining the lm() Summary in R


Summary:

  • Residual Standard Error: Essentially the standard deviation of the residuals (the errors) of your regression model.
  • Multiple R-Squared: The percent of the variance of Y left over after subtracting the error of the model, i.e., the share of variance the model explains.
  • Adjusted R-Squared: The same as Multiple R-Squared, but it takes into account the number of samples and variables you’re using.
  • F-Statistic: A global test that checks whether your model has at least one significant variable.  Takes into account the number of variables and observations used.

R’s lm() function is fast, easy, and succinct.  However, when you’re getting started, that brevity can be a bit of a curse.  I’m going to explain some of the key components of the summary() output in R for linear regression models.  In addition, I’ll show you how to calculate these figures for yourself so you have a better intuition of what they mean.

Getting Started: Build a Model

Before we can examine a model summary, we need to build a model.  To follow along with this example, create these three variables.

#Anscombe's Quartet Q1 Data
y=c(8.04,6.95,7.58,8.81,8.33,9.96,7.24,4.26,10.84,4.82,5.68)
x1=c(10,8,13,9,11,14,6,4,12,7,5)
#Some fake data, set the seed to be reproducible.
set.seed(15)
x2=sqrt(y)+rnorm(length(y))

Just for fun, I’m using data from Anscombe’s quartet (Q1) and then creating a second variable with a defined pattern and some random error.

Now, we’ll create a linear regression model using R’s lm() function and we’ll get the summary output using the summary() function.

model=lm(y~x1+x2)
summary(model)

This is the output you should receive.

> summary(model)

Call:
lm(formula = y ~ x1 + x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.69194 -0.61053 -0.08073  0.60553  1.61689 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   0.8278     1.7063   0.485  0.64058   
x1            0.5299     0.1104   4.802  0.00135 **
x2            0.6443     0.4017   1.604  0.14744   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.141 on 8 degrees of freedom
Multiple R-squared:  0.7477,	Adjusted R-squared:  0.6846 
F-statistic: 11.85 on 2 and 8 DF,  p-value: 0.004054

Meaning Behind Each Section of Summary()

I’m not going to focus on the Call, Residuals, or Coefficients section.  If you’re doing regression analysis, you should understand residuals and the coefficient section.  Here’s a brief description of each as a refresher.

  • Call: This is an R feature that shows what function and parameters were used to create the model.
  • Residuals: Difference between what the model predicted and the actual value of y.  You can calculate the Residuals section like so:
    summary(y-model$fitted.values)
  • Coefficients: These are the weights that minimize the sum of the square of the errors.  To learn how to calculate these weights by hand, see this page.
    • Std. Error: how precisely the Estimate is measured.  It is based on the Residual Standard Error (see below) and on how much that particular x variable varies; in a simple one-variable regression it works out to the Residual Standard Error divided by the square root of the sum of squared deviations of x from its mean.  The sketch after this list shows the general calculation.
    • t value: Estimate divided by Std. Error
    • Pr(>|t|): Look up your t value in a T distribution table with the given degrees of freedom.
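
That said, if you want to rebuild the whole Coefficients table by hand, here’s a rough sketch using nothing beyond base R matrix algebra (it only relies on the y, x1, and x2 variables defined above):

#Recreating the Coefficients table by hand
X=cbind(Intercept=1,x1,x2) #Design matrix: a column of ones plus the predictors
betas=solve(t(X)%*%X,t(X)%*%y) #Estimate: the least squares weights
rse=sqrt(sum((y-X%*%betas)**2)/(length(y)-ncol(X))) #Residual Standard Error (see below)
se=rse*sqrt(diag(solve(t(X)%*%X))) #Std. Error of each coefficient
tvals=as.vector(betas)/se #t value = Estimate / Std. Error
pvals=2*pt(abs(tvals),df=length(y)-ncol(X),lower.tail=FALSE) #Pr(>|t|)
cbind(Estimate=as.vector(betas),Std.Error=se,t.value=tvals,p.value=pvals) #Should match the table above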

With those sections out of the way, we’ll focus on the bottom of the summary output.

Residual Standard Error

In R, the lm() summary produces the standard deviation of the error, with a slight twist.  Standard deviation is the square root of variance.  The Residual Standard Error is very similar; the only difference is that instead of dividing by n-1, you divide by n minus (1 + the number of variables involved).

#Residual Standard error (Like Standard Deviation)
k=length(model$coefficients)-1 #Subtract one to ignore intercept
SSE=sum(model$residuals**2) 
n=length(model$residuals)
sqrt(SSE/(n-(1+k))) #Residual Standard Error
> sqrt(SSE/(n-(1+k))) #Residual Standard Error
[1] 1.140965
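
As a quick sanity check, the same figure is stored on the summary object, so you can compare your hand calculation against it:

summary(model)$sigma #Should also be 1.140965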

Multiple R-Squared

Also called the coefficient of determination, this is an oft-cited measurement of how well your model fits the data.  While there are many issues with using it alone (see Anscombe’s quartet), it’s a quick and pre-computed check for your model.

R-Squared subtracts the residual error from the variance in Y and reports what fraction of that variance is left.  The bigger the error, the smaller that remaining fraction, and the worse the fit appears.

#Multiple R-Squared (Coefficient of Determination)
SSyy=sum((y-mean(y))**2)
SSE=sum(model$residuals**2)
(SSyy-SSE)/SSyy
#Alternatively
1-SSE/SSyy
> (SSyy-SSE)/SSyy
[1] 0.7476681

Notice that the numerator doesn’t have to be positive.  If the predictions fit worse than simply using the mean of y (so that SSE is bigger than SSyy), you can actually end up with a negative R-Squared.
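
As a toy illustration of the formula (not something an ordinary lm() fit with an intercept will produce), plug in a deliberately bad set of predictions and the result goes negative:

#Toy example: predictions far from y make SSE bigger than SSyy
bad_pred=rep(20,length(y)) #Pretend a "model" that always predicts 20
bad_SSE=sum((y-bad_pred)**2)
1-bad_SSE/SSyy #Negative: worse than just predicting mean(y)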


Adjusted R-Squared

Multiple R-Squared works great for simple linear (one variable) regression.  However, in most cases, the model has multiple variables.  The more variables you add, the more variance you’re going to explain.  So you have to control for the extra variables.

Adjusted R-Squared normalizes Multiple R-Squared by taking into account how many samples you have and how many variables you’re using.

#Adjusted R-Squared
n=length(y)
k=length(model$coefficients)-1 #Subtract one to ignore intercept
SSE=sum(model$residuals**2)
SSyy=sum((y-mean(y))**2)
1-(SSE/SSyy)*(n-1)/(n-(k+1))
> 1-(SSE/SSyy)*(n-1)/(n-(k+1))
[1] 0.6845852

Notice how k is in the denominator.  If you have 100 observations (n) and 5 variables, you’ll be dividing by 100-5-1 = 94.  If you have 20 variables instead, you’re dividing by 100-20-1 = 79.  As the denominator gets smaller, the normalizing factor gets larger: 99/94 ≈ 1.05 versus 99/79 ≈ 1.25.

A larger normalizing factor makes the Adjusted R-Squared worse, since we’re subtracting its product with SSE/SSyy from one.
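
Both figures are also stored on the summary object if you want to double-check the hand calculations:

summary(model)$r.squared #Multiple R-Squared, should be 0.7476681
summary(model)$adj.r.squared #Adjusted R-Squared, should be 0.6845852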

F-Statistic

Finally, the F-Statistic.  Counting the t-tests on the coefficients, this is the second “test” that the summary function produces for lm models.  The F-Statistic is a “global” test that checks whether at least one of your coefficients is nonzero.

#F-Statistic
#Ho: All coefficients are zero
#Ha: At least one coefficient is nonzero
#Compare test statistic to F Distribution table
n<-length(y)
SSE<-sum(model$residuals**2)
SSyy<-sum((y-mean(y))**2)
k<-length(model$coefficients)-1
((SSyy-SSE)/k) / (SSE/(n-(k+1)))
> ((SSyy-SSE)/k) / (SSE/(n-(k+1)))
[1] 11.85214
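
The summary output also reports a p-value (0.004054) next to the F-Statistic.  It comes from comparing the statistic to an F distribution with k and n-(k+1) degrees of freedom:

#p-value for the F-Statistic
Fstat<-((SSyy-SSE)/k) / (SSE/(n-(k+1)))
pf(Fstat,k,n-(k+1),lower.tail=FALSE) #Should match 0.004054 from the summary output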

The reason for this test is that if you run multiple hypothesis tests (namely, one on each of your coefficients), you’re likely to get a significant-looking result for a variable that isn’t actually meaningful.  See this for an example (and an explanation).

You can now replicate the summary statistics produced by R’s summary function on linear regression (lm) models!

If you’re interested in more R tutorials on linear regression and beyond, take a look at the Linear Regression page.