Category Archives : Programming

Pyspark Joins by Example

This entry was posted in Python Spark on January 27, 2018 by Will

Summary: Pyspark DataFrames have a join method that takes three parameters: the DataFrame on the right side of the join, the fields being joined on, and the type of join (inner, outer, left_outer, right_outer, leftsemi). You call the join method from the left-side DataFrame object, such as df1.join(df2, df1.col1 == df2.col1, 'inner').

Pyspark Join Data with Two Tables (A and B)

Logistic regression probabilities follow a logistic curve, and the differences form what looks like a t distribution

Why Saying a ‘One Unit Increase’ Doesn’t Work in Logistic Regression

This entry was posted in Code in R on September 9, 2017 by Will

Summary: Logistic regression produces coefficients that are log odds. Raise e to the log odds to get the coefficients as odds. Odds grow exponentially rather than linearly with every one-unit increase. A two-unit increase in x multiplies the odds by the square of the odds coefficient. To get […]
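As a rough illustration of the summary above, here is a minimal Python sketch. The intercept and coefficient values are made up for the example and do not come from the post; the helper names odds and probability are hypothetical as well. It shows that a one-unit increase multiplies the odds by e raised to the coefficient, that a two-unit increase multiplies them by the square of that factor, and that the change in probability per unit is not constant.

```python
import math

# Hypothetical fitted values, chosen only for illustration.
intercept = -2.0   # log odds when x = 0
beta = 0.7         # change in log odds per one-unit increase in x

# The coefficient expressed in odds: e raised to the log odds coefficient.
odds_coefficient = math.exp(beta)

def odds(x):
    """Odds of the outcome at a given x under the assumed model."""
    return math.exp(intercept + beta * x)

def probability(x):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    o = odds(x)
    return o / (1 + o)

# A one-unit increase multiplies the odds by odds_coefficient;
# a two-unit increase multiplies them by odds_coefficient squared.
one_unit_factor = odds(1) / odds(0)
two_unit_factor = odds(2) / odds(0)

# The probability, by contrast, changes by a different amount at each step.
probability_steps = [probability(x + 1) - probability(x) for x in range(6)]

print("odds coefficient:", odds_coefficient)
print("one-unit factor:", one_unit_factor)
print("two-unit factor:", two_unit_factor)
print("probability change per step:", probability_steps)
```

Because the probability steps differ from one another, a single "one unit increase" statement only holds on the odds scale, not on the probability scale.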

Working with arules transactions and read.transactions

This entry was posted in Code in R on August 12, 2017 by Will

Summary: The simplest way of getting a data.frame into a transactions object is by reading it from a csv into R. An alternative is to convert it to a logical matrix and coerce it into a transactions object.

arules transaction creation from data.frames

Pyspark ALS and Recommendation Outputs

This entry was posted in Python Spark on December 26, 2016 by Will

Lately, I’ve written a few iterations of pyspark code to develop a recommender system (I’ve had some practice creating recommender systems in pyspark). I ran into a situation where I needed to generate recommendations on several different datasets. My problem was that I had to decipher some of the prediction documentation. Because of my struggles, […]

Writing Quality Data Mining Code

This entry was posted in Programming on November 14, 2016 by Will

Summary: Writing better quality data mining code requires you to write code that is self-explanatory and does one thing at a time well. In terms of analysis, you should be cross-validating and watching for slowly changing relationships in the data.

Learn by Marketing

Data Mining + Marketing in Plain English