Category Archives: Programming


Pyspark ALS and Recommendation Outputs

Lately, I’ve written a few iterations of pyspark code to develop a recommender system (I’ve had some practice creating recommender systems in pyspark before). I ran into a situation where I needed to generate recommendations on several different datasets. My problem was that I first had to decipher some of the prediction documentation. Because of my struggles, […]


Writing Quality Data Mining Code

Summary: Writing better quality data mining code requires you to write code that is self-explanatory and does one thing at a time, and does it well. In terms of analysis, you should be cross-validating and watching for slowly changing relationships in the data.
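
To make the cross-validation advice concrete, here is a minimal sketch of a k-fold splitter in plain Python. It is not taken from the post; the function name `k_fold_indices` and its parameters are hypothetical, and it only produces row indices, leaving model fitting and scoring to whatever tool you use.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; not from the original post.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")

    # Shuffle once with a fixed seed so the splits are reproducible.
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)

    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]

    for i, test_idx in enumerate(folds):
        # Training rows are all rows that are not in the held-out fold.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_rows=10, k=5))
```

Each row lands in exactly one test fold, so averaging a metric over the five splits uses every observation for testing once. Keeping the splitter separate from the modeling code also fits the "one thing at a time" advice in the summary.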

Split Your Code Apart!

Test accuracy from using rpart in parallel foreach

Overview of Parallel Processing in R

Summary: The foreach package provides parallel operations for many packages (including randomForest). Packages like gbm and caret have parallelization built into their functions. Other tools like bigmemory and ff handle large datasets through memory management.


Get US Census Data with R

Summary: The US Census provides an API that lets you query any of their datasets. The data include population by race, gender, age, and more, broken down by zip code, state, congressional district, and a few other geographies.
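
As a rough illustration of what a query against the API looks like, the sketch below builds a request URL in Python rather than R and makes no network call. The base URL, the `dec/sf1` dataset path, and the `P001001` variable are assumptions chosen for illustration, and `build_census_url` is a hypothetical helper; check the Census API documentation for the datasets and variables you actually need.

```python
from urllib.parse import urlencode

# Assumed base URL for the Census data API; confirm against the official docs.
BASE_URL = "https://api.census.gov/data"


def build_census_url(year, dataset, variables, geography, api_key=None):
    """Build a query URL (hypothetical helper; does not send a request)."""
    params = {"get": ",".join(variables), "for": geography}
    if api_key:
        # An API key is optional here; pass your own if you have one.
        params["key"] = api_key
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return f"{BASE_URL}/{year}/{dataset}?{urlencode(params, safe=',:*')}"


# Illustrative values only: total population for every state.
url = build_census_url(2010, "dec/sf1", ["NAME", "P001001"], "state:*")
```

Separating URL construction from the request itself makes the query easy to inspect or log before any data is downloaded.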

Select Census Geographies include State, Zip Code, MSPA, Congressional District and More.