Simple Linear Regression Explained


Regression, in all its forms, is the workhorse of modern economics and marketing analytics.  It’s taught in introductory statistics classes and is used for predicting some “Y” given an “X”.

What is Simple Linear Regression?

Linear regression finds the best fitting straight line through a set of data.

The formula for a line is Y = mX + b.

  • Y is the output, or the prediction.
  • m is the slope, or the “weight” given to the variable X.
  • X is the input you provide based on what you know.
  • b is the intercept.  Essentially, it’s how much of Y you start off with when your input is 0.
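
To make that concrete, here is a minimal Python sketch of the prediction step.  The predict function and the example slope and intercept are hypothetical, chosen just for illustration:

    def predict(x, m, b):
        # Y = mX + b: the slope m weights the input, the intercept b is the baseline
        return m * x + b

    # Hypothetical example: each marketing dollar adds $4.20 in sales,
    # on top of a $1,000 baseline when nothing is spent.
    print(predict(500, m=4.2, b=1000))  # 3100.0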

Technically, regression “minimizes the sum of the squared errors.”  By that, I mean it uses a formula that directly calculates the best fitting line: take the derivative of the sum-of-squared-errors function, set it to zero, and solve for the slope and intercept.
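
If you want to see the calculus, this is the standard least-squares derivation in brief (nothing here is specific to this page):

    \mathrm{SSE}(m, b) = \sum_i \bigl(y_i - (m x_i + b)\bigr)^2

    \frac{\partial \mathrm{SSE}}{\partial b} = -2 \sum_i (y_i - m x_i - b) = 0
      \;\Rightarrow\; b = \bar{y} - m\bar{x}

    \frac{\partial \mathrm{SSE}}{\partial m} = -2 \sum_i x_i (y_i - m x_i - b) = 0
      \;\Rightarrow\; m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}

That last fraction is exactly SSxy / SSxx, which is where the formulas below come from.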

Regression is a form of supervised learning, i.e. you train on existing data where the outcome / result is already known.

  • Input(X) = Monthly marketing budget; Output(Y) = Total sales in given month.
  • Input(X) = Area of house in Sq ft; Output(Y) = Sale price of house.

Calculating Simple Linear Regression Weights (Coefficients)

There are two calculations you need to know.

SSxx = SUM( (x – avg(x))^2 )

  • You subtract the average x from each input and then square that difference.

SSxy = SUM( (x – avg(x)) * (y – avg(y)) )

  • Similar to the one above, except this time you subtract the average x and average y from each x and y (respectively) and then multiply those differences together.
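
As a sketch, here is what those two sums look like in Python.  The data is made up for illustration (monthly marketing budget vs. total sales):

    # Hypothetical data: monthly marketing budget (x) vs. total sales (y)
    xs = [10, 20, 30, 40, 50]
    ys = [25, 45, 70, 85, 105]

    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)

    # SSxx: squared deviations of each x from the average x
    ss_xx = sum((x - x_bar) ** 2 for x in xs)

    # SSxy: products of the x and y deviations from their averages
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))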

Using those two calculations, you can plug them into the regression formulas.

Y = mX + b

m = SSxy / SSxx = SUM( (x – avg(x)) * (y – avg(y)) ) / SUM( (x – avg(x))^2 )

b = avg(y) – m * avg(x) = avg(y) – (SSxy / SSxx) * avg(x)
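
Continuing the Python sketch from above, the coefficients fall right out (the numbers in the comment come from the made-up data):

    # Slope and intercept from the sums computed earlier
    m = ss_xy / ss_xx
    b = y_bar - m * x_bar

    print(f"Y = {m:.2f}X + {b:.2f}")  # Y = 2.00X + 6.00 for the data above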

Measuring Performance with R-Squared and RMSE

You now have the line of best fit, the one that minimizes the sum of the squared errors, but that doesn’t mean you have a good model.  You’ll need to calculate how much of the variance in the response variable (y) is explained by the one predictor variable (x) in your model.

The most common measurement is R-squared, which measures how much of the variation in the response variable (y) the model explains, relative to how much y varies on its own around its average.

Another measure, one that applies to many more kinds of models, is the Root Mean Square Error (RMSE).  It produces a measure akin to standard deviation, and you can use the RMSE to construct a confidence interval around your predictions.  A smaller RMSE is better, since it indicates less deviation between predictions and actuals.
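
Here is how both measures might be computed in Python, continuing with the fitted m and b from the earlier sketch:

    # Predictions from the fitted line
    preds = [m * x + b for x in xs]

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot

    # RMSE: the square root of the mean squared error
    rmse = (ss_res / len(ys)) ** 0.5

    print(f"R-squared: {r_squared:.3f}, RMSE: {rmse:.2f}")  # R-squared: 0.995, RMSE: 2.00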

Residual Analysis

A residual is the error, i.e. the difference between the prediction and the actual result.  This is what separates the regression “boys from the men”.  Analyzing the residuals gives you insight into what is missing from your model, or what areas your model is biased toward.
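
Computing residuals is simple; spotting patterns in them is the hard part.  A quick sketch using the same made-up data:

    # Residuals: actual minus predicted
    residuals = [y - p for y, p in zip(ys, preds)]
    print(residuals)  # [-1.0, -1.0, 4.0, -1.0, -1.0]

A healthy model’s residuals look like random noise centered on zero; any visible trend or curve suggests something is missing from the model.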

There is a lot more to say about residuals on another page.

Recommended Reading

  • Naked Statistics – Has some great explanations of how regression works.
  • Freakonomics – Lots of example applications of regression.
    • Whenever someone says “after controlling for variables a, b, …, and n”, they usually ran their data through a regression model.