The Dangers of Automated Model Selection

Summary: R offers a handful of packages that automate model building. The rpart, randomForest, MASS, and forecast packages help you search through a hypothesis space, and the caret package helps you crawl through the hyperparameter space.
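To give a feel for what that automation looks like, here is a minimal sketch using stepAIC() from MASS to search over linear models. The built-in mtcars data set and the choice of mpg as the response are assumptions made purely for illustration; they are not taken from the post itself.

```r
library(MASS)  # a recommended package that ships with standard R installs

# Assumption for illustration: predict mpg from every other column in mtcars.
full_model <- lm(mpg ~ ., data = mtcars)

# stepAIC() repeatedly drops or re-adds single terms, keeping the change
# that lowers AIC the most, and stops when no single change helps.
selected <- stepAIC(full_model, direction = "both", trace = FALSE)

# Show which terms survived the search and compare AIC before and after.
print(formula(selected))
print(c(full = AIC(full_model), selected = AIC(selected)))
```

The search only ever optimizes the criterion it is given, so a lower AIC here says nothing about whether the surviving terms make sense for the problem.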

Model Automation using R

Passcodes 1 thru 1,000 and Factors

The Passcode Riddle: A Parallel Example in R

Summary: The passcode riddle asks for three whole positive numbers with each one being equal to or larger than the next. Turns out there are only a handful of numbers this could possibly work for. Browsing YouTube one morning, I came across the video from TED-Ed and I was intrigued! I’ll be honest, I […]


How Does Parallel Processing Work?

Summary: Data can be processed in parallel by using multiple threads on a single CPU or by passing code to the data in systems like the Hadoop Distributed File System. Imagine you’re driving your car on the way to work. You keep an eye on the road ahead, the side mirrors, the rearview mirror, and […]

A Basic Flow of Parallel Processing on a Single Core

New Analysts need the right tools and mentors to show the way.

On-Boarding New Analysts

Summary: Before your new analyst arrives, make sure they have access to the data, hardware, and software they’ll need. Set and discuss expectations on what the analyst should know and learn. Get others involved in the analyst’s development through mentors and formalized “get to know” meetings.


Making Analysts More Productive: Tools and Ideas

Summary: Every organization should provide its analysts and data scientists with a few key tools: a Data Dictionary, a Metric Dictionary, a Research Repository, and a Code Repository. All of these tools need to be searchable to make it easy for analysts to find and use previous work.

A possible process flow for a research repo, data dictionary, and metric dictionary