Category Archives : Programming

Building a Recommender System in Spark with ALS

This entry was posted in Python Spark and tagged RecSys on May 1, 2016 by Will

Summary: Spark has an implementation of Alternating Least Squares (ALS) along with a set of very simple functions to create recommendations based on past data.

Working in Pyspark: Basics of Working with Data and RDDs

This entry was posted in Python Spark on April 23, 2016 by Will

Summary: Spark (and Pyspark) use map, mapValues, reduce, reduceByKey, aggregateByKey, and join to transform, aggregate, and connect datasets. These functions can be chained together to perform more complex tasks.

Text Mining Packages and Options in R

This entry was posted in Code in R on March 22, 2016 by Will

Summary: The tm and lsa packages give you a way to turn your text data into a term-document matrix and create new numeric features. The ngram package lets you find frequent word patterns (e.g. “The cow” is a bi-gram or 2-gram; “The cow said” is a tri-gram or 3-gram). Lastly, for a quick visualization (though […]
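The bi-gram and tri-gram idea from the summary above can be sketched in a few lines. This is a plain Python illustration, not the R ngram package the post covers; the `ngrams` helper and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter


def ngrams(text, n):
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The cow said"
print(ngrams(sentence, 2))  # [('The', 'cow'), ('cow', 'said')]
print(ngrams(sentence, 3))  # [('The', 'cow', 'said')]

# Counting n-grams across a longer text surfaces the frequent word patterns.
counts = Counter(ngrams("the cow said the cow heard", 2))
print(counts.most_common(1))  # [(('the', 'cow'), 2)]
```

Real text usually needs lowercasing and punctuation stripping before counting, which this sketch leaves out.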

Wordcloud generated in R for Brothers Grimm Stories

Information Gain would select the Number of Images variable, while Gini Index would select the more compact Average Token Length.

Decision Tree Flavors: Gini Index and Information Gain

This entry was posted in Code in R and tagged decision tree on February 27, 2016 by Will

Summary: The Gini Index is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions. Information Gain multiplies the probability of each class by the log (base 2) of that class probability. Information Gain favors smaller partitions with many distinct values. Ultimately, you have to experiment with your data […]
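The two calculations in the summary above can be written out as a minimal Python sketch. The function names are hypothetical, and the code is not taken from the post, which works in R. One assumption goes beyond the summary: the probability-times-log term is treated as entropy, and information gain is taken as the drop in entropy after a split, which is the usual definition.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Share of each class among the labels."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini_index(labels):
    """One minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Negative sum of p * log2(p) over the classes."""
    return -sum(p * log2(p) for p in class_probabilities(labels))


def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    total = len(parent)
    weighted = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - weighted


labels = ["yes", "yes", "no", "no"]
print(gini_index(labels))  # 0.5
print(entropy(labels))     # 1.0
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

A split that separates the classes perfectly, as in the last line, recovers all of the parent's entropy as gain.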

GitHub: Merge Local Branches then Push to GitHub

This entry was posted in Programming on February 12, 2016 by Will

As I work on my Python script to parse SAS Enterprise Guide projects, I’ve been using git and GitHub to keep track of my changes and keep a stable project while I break things and try to make it better. However, as per my previous post on GitHub, it can be frustrating to work with […]

Learn by Marketing

Data Mining + Marketing in Plain English