Bootstrapping


In data mining, bootstrapping is a resampling technique that lets you generate many sample datasets by repeatedly sampling from your existing data.

Why Use Bootstrapping: Sometimes you just don’t have enough data!  Classical statistics needs large samples, and ideally repeated samples, before you can be confident in the results.  There are two applications of bootstrapping as far as we’re concerned:

  1. Repeated sampling to build a more confident measurement (a distribution, an average, a parameter of a model).
  2. Repeated sampling to build coordinated ensembles (bagging ensembles).

The “build a more confident measurement” application relates to the central limit theorem.  If we draw many bootstrap samples and recompute the measurement on each one, the spread of those recomputed values tells us how much to trust it – whether it’s a mean, a distribution, a model weight, or whatever.
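
As a tiny illustration, here’s the idea in R: resample a made-up dataset many times, recompute the mean on each resample, and look at the spread of the results.

set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)  # a made-up sample for illustration

# Recompute the mean on 1,000 bootstrap resamples of x.
boot.means <- replicate(1000, mean(sample(x, length(x), replace = TRUE)))

mean(boot.means)  # very close to mean(x)
sd(boot.means)    # bootstrap estimate of the standard error of the mean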

The “build coordinated ensembles” is a whole other school of data mining that combines multiple models to generate a more accurate prediction.

The Math Behind Bootstrapping: The entire point of bootstrapping is to sample with replacement from the entire dataset.  Drawing N observations from a dataset of size N this way has the neat property of pulling in about 63.2% of the unique observations and leaving out the remaining 36.8%.  Bootstrapping is also called 0.632 bootstrapping for this very reason.  Here’s an example in R.

mydata <- 1:10000  # Generate the integers 1-10,000
# Sample 10,000 instances of "mydata" with replacement.
# With replacement = every instance is equally likely to be selected EACH time.
newsample <- sample(mydata, 10000, replace = TRUE)
# newsample now holds 10,000 observations,
# but only roughly 63.2% of the original values appear in it.
length(unique(newsample))
# Select the observations that were NOT picked (values double as indices here).
left.out <- mydata[-newsample]
# That leaves roughly 3,680 values, about 36.8%.
length(left.out)

The actual math is interesting too.

  • The probability of picking a given instance on any single draw is 1/N, where N is the number of observations you have.
  • The draws are independent, so probabilities across draws multiply.  The odds of NOT picking a given instance on one draw are (1 - 1/N).
  • Over all N draws, the odds of NEVER picking that instance are (1 - 1/N)^N.
  • If we take the limit of that expression as N grows, we get e^-1, which is approximately 0.368.  COOL!
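
A quick numeric check of that limit in R:

# Probability that a given instance is never picked, as N grows.
N <- c(10, 100, 1000, 10000, 100000)
round(cbind(N, prob.left.out = (1 - 1/N)^N), 5)
exp(-1)  # the limit: 0.3678794...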

How Do You Use Bootstrapping: For classic statistics, you’ll use bootstrapping to be more confident in your measurements.  An example of this is measuring the R-squared on bootstrapped samples of the same data.  Quick-R has the best example of using the boot library from R.
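
Here’s a minimal sketch in that spirit, bootstrapping the R-squared of a linear model on R’s built-in mtcars data (the formula mpg ~ wt + disp is just an arbitrary illustration):

library(boot)  # provides boot() and boot.ci()

# Statistic to bootstrap: the R-squared of a linear model.
# boot() calls this with the data and the resampled row indices.
rsq <- function(data, indices) {
  d <- data[indices, ]  # the bootstrap sample
  summary(lm(mpg ~ wt + disp, data = d))$r.squared
}

results <- boot(data = mtcars, statistic = rsq, R = 1000)
results                          # original estimate, bias, std. error
boot.ci(results, type = "perc")  # percentile confidence interval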

For data mining, bootstrapping is used in bagging, or bootstrap aggregating.  It essentially works like this (a small sketch follows the list):

  • Decide how many models you want to build (M).
  • For each model, take a bootstrap sample and train the model on it.
  • For new data, run every trained model and collect its predicted class.
  • Each model gets a “vote” and the majority vote wins.  For example…
    • Train 100 models to classify a state as Republican or Democrat
    • On new data, 60 of the models predict Democrat and 40 predict Republican.
    • Majority vote says Democrat.
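
Here’s a minimal sketch of that procedure, using rpart decision trees on R’s built-in iris data.  The helpers bag_trees and bag_predict are made-up names for this illustration, and any base learner could stand in for the trees.

library(rpart)  # CART decision trees as the base learner

# Train M trees, each on its own bootstrap sample.
bag_trees <- function(data, formula, M = 100) {
  lapply(seq_len(M), function(i) {
    idx <- sample(nrow(data), nrow(data), replace = TRUE)  # bootstrap sample
    rpart(formula, data = data[idx, ], method = "class")
  })
}

# Majority vote: every model votes and the most common class wins.
bag_predict <- function(models, newdata) {
  votes <- sapply(models, function(m) {
    as.character(predict(m, newdata, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))
}

models <- bag_trees(iris, Species ~ ., M = 25)
preds <- bag_predict(models, iris)
mean(preds == iris$Species)  # accuracy of the bagged ensemble on its training data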

Summary: Bootstrapping takes repeated samples with replacement to generate new sample datasets.  These are used either to build a more confident measurement (statistics) or as part of bagging to generate an ensemble of classifiers.  It’s sometimes called 0.632 bootstrapping due to the mathematical property that each bootstrap sample contains approximately 63.2% of the original data while the remaining 36.8% is left out.