Yearly Archives: 2016

Pyspark ALS and Recommendation Outputs

This entry was posted in Python Spark on December 26, 2016 by Will

Lately, I’ve written a few iterations of pyspark code to develop a recommender system, so I’ve had some practice building recommenders in pyspark. I then ran into a situation where I needed to generate recommendations on several different datasets, and to do that I had to decipher some of the prediction documentation. Because of my struggles, […]

Writing Quality Data Mining Code

This entry was posted in Programming on November 14, 2016 by Will

Summary: Writing better-quality data mining code means writing code that is self-explanatory and does one thing at a time, and does it well. On the analysis side, you should be cross-validating and watching for slowly changing relationships in the data.
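
As a rough illustration of both points (small single-purpose functions, plus cross-validation), here is a minimal Python sketch of k-fold cross-validation using only the standard library. The function names (`k_fold_indices`, `cross_validate`, `fit_mean`, `mean_abs_error`) and the toy "predict the mean" model are hypothetical and are not taken from the post. The sketch also assumes the rows are already shuffled, since it splits them into contiguous folds.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Assumes the rows were shuffled beforehand; no shuffling happens here.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and n_rows")
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_rows))
        yield train_idx, test_idx
        start = stop


def cross_validate(xs, ys, k, fit, score):
    """Fit on each training split, score on the held-out split.

    Returns one score per fold.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        held_out_x = [xs[i] for i in test_idx]
        held_out_y = [ys[i] for i in test_idx]
        scores.append(score(model, held_out_x, held_out_y))
    return scores


def fit_mean(xs, ys):
    """Toy model (hypothetical): always predict the mean training target."""
    return sum(ys) / len(ys)


def mean_abs_error(model, xs, ys):
    """Mean absolute error of the constant prediction on held-out targets."""
    return sum(abs(y - model) for y in ys) / len(ys)
```

Calling `cross_validate(xs, ys, 5, fit_mean, mean_abs_error)` returns five held-out error values. A large spread between folds is one hint that the relationship in the data is not stable.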

Winning a Kaggle Competition Analysis

This entry was posted in Analytical Examples on November 7, 2016 by Will

Summary: XGBoost and ensembles take the Kaggle cake, but they’re mainly used for classification tasks. Some tools, like factorization machines and Vowpal Wabbit, make occasional appearances.

Kaggle Winners and Algorithm Associations

Keeping a Sharp Analytical Mind

This entry was posted in Analyst Secrets on October 31, 2016 by Will

Summary: To stay on top of your personal development, try learning new things, like a programming language or an instrument, or exposing yourself to a new field (e.g. biology or accounting). Exposure to new ideas helps you avoid confirmation bias and increases your willingness to explore your analysis further.

Overview of Parallel Processing in R

This entry was posted in Code in R on October 24, 2016 by Will

Summary: The foreach package provides parallel operations for many packages (including randomForest). Packages like gbm and caret have parallelization built into their functions. Other tools, like bigmemory and ff, handle large datasets through memory management.

Learn by Marketing

Data Mining + Marketing in Plain English