Summary: The caret package was developed by Max Kuhn and contains a handful of great functions that help with parameter tuning.
Purpose of the caret Package
The caret package lets you quickly automate model tuning. Using a training and holdout sample, the caret package trains a model you provide and returns the optimal model based on an optimization metric.
The oldest archive on CRAN is from October 2007, so it has been around for a while. Max Kuhn, the principal author of the package, goes around the country teaching courses in R and using this tool to aid model development.
The more I use caret, the more I like it. Users of this package are likely going to get more done as they spend less time tweaking their models manually.
Working with the caret Package
The best part of the caret package is its uniformity.
Despite there being (at the time of writing) 8,489 packages available on CRAN, the authors of caret have taken the time to incorporate over 210 models into the parameter tuning capabilities of the package.
As a result, it’s very likely that your favorite R model can be used inside the caret package and can be automatically tuned for you. Trying multiple models and using the same steps has never been easier.
Here’s how a caret training session breaks down:
- Split your data into a training and testing set (perhaps using createDataPartition)
- Set up your pre-processing steps: centering / scaling? PCA? Imputing missing values?
- Determine your parameter tuning strategy: Cross-validation? Bootstrapping?
- Are there particular parameter values you want to check?
- Train your model(s).
- Evaluate your model(s) on a holdout set.
We’ll be working with the bank marketing data set from the UCI machine learning repository.
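    library(caret)

    # Read in the Bank Marketing dataset
    # https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
    # stringsAsFactors = TRUE keeps y a factor (R >= 4.0 reads strings as character by default)
    data <- read.table("~/in/bank/bank.csv", header = TRUE, sep = ";",
                       stringsAsFactors = TRUE)

    # Remove any observation with blanks
    data <- data[complete.cases(data), ]

    # Create a training set (row indices of a stratified 75% sample)
    train_log <- createDataPartition(data$y, times = 1, p = 0.75, list = FALSE)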
createDataPartition allows you to create stratified samples based on a single variable. The difference from the strata function (from the sampling package) is that you can also stratify on numeric values.
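As a minimal sketch of both kinds of stratification, the example below uses the built-in iris data rather than the bank data, so the column names are assumptions for illustration only. For a numeric variable, createDataPartition bins the values into groups (the groups argument) before sampling within each bin.

```r
library(caret)

set.seed(42)

# Stratify on a factor: every species contributes the same share of rows
idx_factor <- createDataPartition(iris$Species, times = 1, p = 0.75, list = FALSE)
table(iris$Species[idx_factor])

# Stratify on a numeric variable: the values are binned into quantile-based
# groups and rows are sampled within each bin
idx_numeric <- createDataPartition(iris$Sepal.Length, times = 1, p = 0.75,
                                   groups = 5, list = FALSE)
summary(iris$Sepal.Length[idx_numeric])
```

With list = FALSE the result is a one-column matrix of row indices, which is why it can be used directly for subsetting.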
caret provides a handful of standardized pre-processing steps which automatically ignore factor / non-numeric variables.
With your set of data partitioned, you can pass a character vector with the pre-processing steps you want done. This vector gets passed into the preProcess option later on.
    # We'll use this later in the train function
    preProc <- c("BoxCox", "center", "scale")
- The order of the pre-processing methods is not controlled by the user.
- They run in this order:
- zero-variance filter, near-zero variance filter,
- Box-Cox/Yeo-Johnson/exponential transformation,
- centering, scaling, range, imputation,
- PCA, ICA then spatial sign.
- Imputation is built-in: k-nearest neighbors (knnImpute), bagged trees (bagImpute).
- Personally, I did not have much success with the imputation and found a few people on the Kaggle forums who also had poor experiences with imputation.
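The pre-processing steps can also be tried on their own with caret's preProcess function, which is a handy way to check what they do before handing them to train. The sketch below assumes a recent version of caret and uses the built-in iris data for illustration; it shows that the factor column is left alone while the numeric columns are centered and scaled.

```r
library(caret)

# Estimate centering and scaling parameters from the data; the factor
# column Species is skipped automatically
pp <- preProcess(iris, method = c("center", "scale"))

# Apply the stored transformation with predict()
iris_scaled <- predict(pp, iris)

round(colMeans(iris_scaled[, 1:4]), 10)       # numeric columns now have mean 0
sapply(iris_scaled[, 1:4], sd)                # ... and standard deviation 1
identical(iris_scaled$Species, iris$Species)  # factor column is untouched
```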
Using the train function, the general process you should follow is:
- Pick a metric to optimize for.
- Pick a sampling strategy.
- Select the model you want to try.
- Select range of parameters to try.
    # Set up the sampling strategy: 10-fold cross-validation, repeated 10 times
    # ("repeatedcv" is needed for repeats to take effect; plain "cv" ignores it)
    tctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

    # Formula to be used
    f <- y ~ .

    # Use the train function on multiple models
    rpart_model <- train(f, data, method = "rpart", preProcess = preProc,
                         metric = "Kappa", trControl = tctrl, subset = train_log)
    rf_model <- train(f, data, method = "rf", preProcess = preProc,
                      metric = "Kappa", trControl = tctrl, subset = train_log)

    # GLM has no tuning parameters to vary
    logit_model <- train(f, data, method = "glm", family = "binomial",
                         preProcess = preProc, metric = "Kappa",
                         trControl = tctrl, subset = train_log)
Like in any model function, train expects a formula and a dataset. Following that, you need to specify which model you are planning on using. Again, there are over 210 models that you can use. However, you need to be aware of some nuances. For example, the randomForest package is referenced with method = "rf".
Inside the train function, we pass the pre-processing options we defined earlier.
The metric you choose depends on whether you’re running classification or regression.
- Classification metrics include: Accuracy or Kappa.
- Regression metrics include: RMSE or Rsquared.
There’s also a trControl option, to which we pass a trainControl list that identifies how we want to sample the data and how many repetitions we want to use.
- number = 10 with a cross-validation method means we’ll do 10-fold cross-validation.
- repeats = 10 means we’ll repeat the cross-validation 10 times. This only takes effect with method = "repeatedcv"; plain method = "cv" runs a single round of cross-validation and ignores repeats.
- We could have chosen method = "boot", number = 10 to generate 10 bootstrap samples.
- caret uses this to train a model with a given set of parameters and then evaluate its performance on the hold-out group.
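To make the difference between the resampling settings concrete, here is a small sketch that builds two control objects and inspects them. The object names are made up for illustration; trainControl simply returns a list, so the stored settings can be read back directly.

```r
library(caret)

# 10-fold cross-validation, repeated 10 times (100 resamples in total)
ctrl_rcv <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# 10 bootstrap resamples; here number sets the resample count
ctrl_boot <- trainControl(method = "boot", number = 10)

# Inspect the stored settings
ctrl_rcv$method
ctrl_rcv$number
ctrl_rcv$repeats
ctrl_boot$number
```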
Finally, you can use the subset option and provide a vector that selects which observations to include in the training of the model. It can be a logical vector or, as with the output of createDataPartition here, a vector of row indices.
Another option is to include an explicit tuneGrid. Using expand.grid, you can create a data.frame of parameters you want to try. Look for your model in caret’s list of available models and you’ll find the options you can manipulate. For example, if I wanted to have more control over the possible cp values in rpart, I can use the following.
    # Candidate complexity parameters for rpart
    rpart_opts <- expand.grid(cp = seq(0.0, 0.01, by = 0.001))

    rpart_model <- train(f, data, method = "rpart", preProcess = preProc,
                         metric = "Kappa", trControl = tctrl,
                         tuneGrid = rpart_opts, subset = train_log)
This will test every complexity parameter value from 0 to 0.01 in increments of 0.001. I’m not sure how caret determines which values it tests automatically, but it typically does find significant parameters.
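One way to see which parameters a method exposes is caret's modelLookup function. The sketch below is an illustration on the built-in iris data, not the bank data, and the names cp_grid and fit are made up for the example. When no grid is supplied, train's tuneLength argument (3 by default) sets how many candidate values caret generates per parameter.

```r
library(caret)

# List the tuning parameters caret exposes for a given method
modelLookup("rpart")

# An explicit grid: 11 candidate cp values from 0 to 0.01
cp_grid <- expand.grid(cp = seq(0, 0.01, by = 0.001))

set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rpart",
             tuneGrid = cp_grid,
             trControl = trainControl(method = "cv", number = 5))

fit$results   # one row of resampled metrics per cp value
fit$bestTune  # the cp value caret selected
```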
Setting up Parallel Processing
Since caret does a search across many different models and does repeated modeling with cross-validation or repeated sampling, it’s clear that running it in parallel will be beneficial.
The doMC package lets you register the number of cores to use on your machine.
    library(doMC)

    # Register multiple cores
    registerDoMC(2)
That’s it! Now just run your train function(s) and caret does the rest!
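doMC relies on forking, so it is not an option on Windows. A commonly used alternative is the doParallel package, sketched below on the assumption that it is installed. This is a swapped-in backend rather than the one used in this post, and the variable names are illustrative.

```r
library(doParallel)

# Start a cluster with 2 worker processes and register it with foreach
cl <- makePSOCKcluster(2)
registerDoParallel(cl)

# Number of workers foreach (and therefore caret) will use
n_workers <- getDoParWorkers()
n_workers

# ... run your train() calls here ...

# Shut the workers down and fall back to sequential execution
stopCluster(cl)
registerDoSEQ()
```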
Using and Comparing caret Models
You can use a trained model just by calling predict(trained_model, newdata).
The underlying model (rpart, gbm, randomForest, or whatever) will be used to make the prediction. If you have multiple models you’ve developed and want to try out on the test set, you can actually put the models in a list and then call the same predict function.
    # Put all models into a list
    multi_models <- list(rpart = rpart_model, rf = rf_model, glm = logit_model)

    # Run prediction across models on the held-out rows
    multi_predict <- predict(multi_models, data[-train_log, ])
This results in a list of predictions on your test set. You’ll have to use lapply to run across each element of the list and apply caret’s confusionMatrix function.
    # Create a confusion matrix for each model
    lapply(multi_predict, FUN = confusionMatrix,
           reference = data[-train_log, ]$y, positive = "yes")
- confusionMatrix requires a prediction and a reference set.
- Declare what your “positive” class looks like: “1”, “yes”, “y”, etc.
- If you don’t, caret assumes the first factor level is the positive class.
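A small worked example makes the role of the positive argument easier to see. The labels below are hypothetical toy data, chosen so the counts can be checked by hand: 2 true positives, 1 false negative, 1 false positive and 4 true negatives.

```r
library(caret)

# Hypothetical toy labels; both factors must share the same levels
lv <- c("no", "yes")
truth <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"), levels = lv)
pred  <- factor(c("yes", "yes", "no",  "no", "no", "no", "no", "yes"), levels = lv)

# Without positive = "yes", caret would treat "no" (the first level) as positive
cm <- confusionMatrix(data = pred, reference = truth, positive = "yes")

cm$table                   # counts: predictions in rows, reference in columns
cm$overall["Accuracy"]     # 6 of 8 correct
cm$byClass["Sensitivity"]  # 2 of 3 actual "yes" cases found
cm$byClass["Specificity"]  # 4 of 5 actual "no" cases found
```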
If you’re using caret for regression models, the RMSE function is available to calculate the root-mean-square error.
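As a quick illustration with made-up numbers, RMSE takes the predictions first and the observed values second. caret's postResample function reports several regression metrics in one call (RMSE, Rsquared and, in recent versions, MAE).

```r
library(caret)

# Hypothetical observed and predicted values
obs  <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 6)

# Root-mean-square error: sqrt(mean((pred - obs)^2))
RMSE(pred, obs)

# Several regression metrics at once
postResample(pred = pred, obs = obs)
```

Here only the last prediction is off (by 2), so the squared errors average to 1 and the RMSE is 1.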
Bottom Line: The caret package provides a set of easy to use, consistent functions. If you’re looking to automatically explore a set of models and set of parameters, caret will automate a lot of the busywork.