Linear regression is a well-studied and frequently applied tool. In this checklist, I walk you through the steps you need to take to build a valid, useful model. If you’re interested in an “in-the-works” R package for this tool, check out lmade.
Understanding the problem is the one step no automation software can handle. You, as the analyst, need to know the background of the model and why it matters.
- Do you know what you’re trying to model?
- Do you know why you’re trying to model?
- What are the potential outcomes of your model?
Examining your data, you’ll need to find patterns that you can take advantage of in your model. The majority of insight usually comes from plotting your data and toying with the results.
- Do you know what all of your data means?
- Plot your X and Y variables.
- Examine correlation among your X and Y variables.
- Look for outliers in your data.
- Look for missing data.
- Does that missing data have a pattern or meaning?
- Try simple models based on business understanding first.
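A few of the checks above can be sketched in plain Python. This is only an illustration under stated assumptions: the column names (`sqft`, `price`), the sample rows, and the helper functions are hypothetical, missing values are assumed to be stored as `None`, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import math
import statistics

def count_missing(rows, column):
    """Count rows where `column` is missing (assumed to be stored as None)."""
    return sum(1 for row in rows if row.get(column) is None)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR beyond the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

# Hypothetical data: square footage (X) and sale price in thousands (Y).
rows = [
    {"sqft": 1000, "price": 200.0},
    {"sqft": 1500, "price": 290.0},
    {"sqft": None, "price": 310.0},
    {"sqft": 2000, "price": 405.0},
    {"sqft": 2500, "price": None},
]

# Correlation is computed only on rows where both values are present.
complete = [r for r in rows if r["sqft"] is not None and r["price"] is not None]
print("missing sqft:", count_missing(rows, "sqft"))
print("correlation:", pearson_r([r["sqft"] for r in complete],
                                [r["price"] for r in complete]))
print("outliers:", iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Numbers like these are a complement to plotting, not a replacement: a correlation coefficient will not reveal a curved relationship that a scatter plot shows immediately.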
With a firm understanding of the attributes of your data, you’ll start modifying your data and potentially creating new variables.
- Create dummy variables (if necessary) and identify your base case.
- Fill in missing data.
- Possibly drop outliers.
- Transform your X variables.
- Transform your Y variable (e.g. natural log).
- Extract features (PCA, Clustering).
- Potentially find more data.
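Three of the preparation steps above (dummy variables with a base case, filling missing data, and a natural-log transform of Y) can be sketched as follows. The function names and the sample values are hypothetical, and filling with the median is only one of several reasonable imputation choices.

```python
import math
import statistics

def make_dummies(values, base_case):
    """One-hot encode a categorical column, omitting the base case.

    The base case is represented by all dummy columns being 0.
    """
    levels = sorted(set(values) - {base_case})
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

def fill_missing_with_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def log_transform(values):
    """Natural-log transform; every value must be strictly positive."""
    return [math.log(v) for v in values]

# Hypothetical example columns.
region = ["north", "south", "east", "north"]
income = [40.0, None, 55.0, 120.0]

print(make_dummies(region, base_case="north"))
print(fill_missing_with_median(income))
print(log_transform(fill_missing_with_median(income)))
```

Dropping the base case avoids perfect collinearity between the dummy columns and the intercept; each remaining coefficient is then read as a difference from the base case.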
Linear Regression Assumptions
Traditionally, you’ll want to check these assumptions after building your model. If any of these assumptions aren’t met, you cannot be confident that your model’s coefficient estimates, or the standard errors and p-values reported with them, are reliable.
- Sum and mean of errors are zero.
- Errors are normally distributed.
- Errors are independent of each other.
- Variance of errors is constant.
Checking Model Assumptions
- Plot residuals on vertical axis and each X variable on horizontal axis.
- Plot residuals on vertical axis and actual Y variable on horizontal axis.
- Check whether noticeably more than 5% of residuals fall outside two standard deviations from zero (about 5% is expected if the errors are normal).
- If some coefficients have unexpected signs, check for multi-collinearity.
- Plot residuals and look for an expanding (fan-shaped) pattern. If you see one, try a log transformation of Y.
- QQ plot and histogram of the residuals.
- Jarque-Bera test for normality.
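Two of the numeric checks above, the two-standard-deviation rule and the Jarque-Bera test, can be sketched in plain Python. This is a minimal illustration assuming a single predictor fit by ordinary least squares; the helper names and sample data are hypothetical. The Jarque-Bera statistic is compared against a chi-square distribution with 2 degrees of freedom, whose upper tail is exp(-JB / 2); that approximation is asymptotic and rough for small samples.

```python
import math

def fit_simple_ols(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(xs, ys, intercept, slope):
    """Actual minus fitted values."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

def share_outside_two_sd(resid):
    """Fraction of residuals more than two standard deviations from zero."""
    n = len(resid)
    sd = math.sqrt(sum(e ** 2 for e in resid) / n)  # spread measured around zero
    return sum(1 for e in resid if abs(e) > 2 * sd) / n

def jarque_bera(resid):
    """Return (JB statistic, asymptotic p-value) from skewness and kurtosis."""
    n = len(resid)
    mean = sum(resid) / n
    m2 = sum((e - mean) ** 2 for e in resid) / n
    m3 = sum((e - mean) ** 3 for e in resid) / n
    m4 = sum((e - mean) ** 4 for e in resid) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    return jb, math.exp(-jb / 2)  # chi-square(2) upper tail

# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_simple_ols(xs, ys)
resid = residuals(xs, ys, intercept, slope)
print("sum of residuals:", sum(resid))
print("share outside 2 SD:", share_outside_two_sd(resid))
print("Jarque-Bera (stat, p):", jarque_bera(resid))
```

A small p-value suggests the residuals are not normal. With an intercept in the model, the residuals sum to zero by construction, so that first assumption is a check on the fitting code more than on the data.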
Influential / Outlier Points
- Standardized Residuals
- Studentized Residuals
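Terminology for these two measures varies between textbooks and packages. The sketch below assumes that "standardized" means internally studentized (each residual divided by its estimated standard deviation, using leverage) and "studentized" means externally studentized (the same idea with the point itself left out of the variance estimate). It covers a single predictor only, and the function name and data are hypothetical.

```python
import math

def residual_diagnostics(xs, ys):
    """Return (standardized, studentized) residuals for y = a + b * x fit by OLS."""
    n = len(xs)
    p = 2  # parameters estimated: intercept and slope
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

    # Residual standard error and leverage (hat value) of each point.
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - p))
    leverage = [1 / n + (x - mean_x) ** 2 / sxx for x in xs]

    # Internally studentized: scale each residual by s * sqrt(1 - h_i).
    standardized = [e / (s * math.sqrt(1 - h)) for e, h in zip(resid, leverage)]
    # Externally studentized: closed-form conversion from the internal version.
    studentized = [r * math.sqrt((n - p - 1) / (n - p - r ** 2))
                   for r in standardized]
    return standardized, studentized

# Hypothetical data with one suspicious point at x = 4.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.2, 1.8, 3.3, 7.0, 4.8, 6.3, 6.7, 8.2]
standardized, studentized = residual_diagnostics(xs, ys)
for x, r, t in zip(xs, standardized, studentized):
    print(x, round(r, 2), round(t, 2))
```

Points whose values exceed roughly 2 to 3 in absolute size are worth a closer look; the externally studentized version reacts more strongly because the suspect point does not inflate its own variance estimate.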
Detecting Residual Correlation
- Durbin-Watson test
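The Durbin-Watson statistic is simple enough to compute by hand from residuals in their original (for example, time) order. The sketch below only computes the statistic; the function name and sample residuals are hypothetical, and a formal test still requires comparing the value against tabulated critical bounds.

```python
def durbin_watson(resid):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    denominator = sum(e ** 2 for e in resid)
    return numerator / denominator

# Hypothetical residuals, listed in the order the observations were collected.
print(durbin_watson([0.5, 0.4, 0.6, -0.3, -0.5, -0.4, 0.2, 0.3]))
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.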
What’s the point of building a model if it isn’t accurate on new data? One of the most important tasks is to create a validation set: a set of data that the model has never seen. A validation set lets you report unbiased performance measures.
Many statistical tools will report the following measures on your training data automatically.
- Multiple R-Squared and Adjusted R-Squared
- Root Mean Squared Error
- Mean Absolute Error
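Holding out a validation set and computing these measures on it can be sketched as follows. The function names, the 20% validation share, and the sample numbers are assumptions for illustration; `n_predictors` is the number of X variables in the model, which the adjusted R-squared formula needs.

```python
import math
import random

def train_validation_split(rows, validation_share=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a share the model never sees."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_share))
    return shuffled[n_val:], shuffled[:n_val]

def regression_metrics(actual, predicted, n_predictors):
    """RMSE, MAE, R-squared, and adjusted R-squared for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e ** 2 for e in errors)
    mean_actual = sum(actual) / n
    sst = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    return {
        "rmse": math.sqrt(sse / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Hypothetical validation-set values and model predictions.
train, validation = train_validation_split(list(range(10)))
print(len(train), len(validation))
print(regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.0], n_predictors=1))
```

On a validation set, R-squared computed this way can be negative, which means the model predicts worse than simply using the mean of the validation Y values.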