Summary: Logistic regression produces coefficients on the log-odds scale. Raise e to a coefficient to get the corresponding odds ratio. Odds grow exponentially rather than linearly with each one-unit increase: a two-unit increase in x multiplies the odds by the square of the odds ratio. To get […]
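To make the squared relationship concrete, here is a minimal sketch in Python. The coefficient value `beta` and the helper `odds_multiplier` are made up for illustration and do not come from the post; the only thing being shown is that exponentiating a log-odds coefficient gives a multiplier on the odds, and that two units of x apply that multiplier twice.

```python
import math

# Hypothetical fitted coefficient for x on the log-odds scale
# (an illustrative value, not taken from any real model).
beta = 0.5


def odds_multiplier(coef, delta_x):
    """Factor by which the odds change when x increases by delta_x."""
    return math.exp(coef * delta_x)


# A one-unit increase multiplies the odds by exp(beta): the odds ratio.
odds_ratio = odds_multiplier(beta, 1)

# A two-unit increase multiplies the odds by exp(2 * beta),
# which equals the odds ratio squared.
two_unit = odds_multiplier(beta, 2)

print(f"odds ratio for +1 unit: {odds_ratio:.4f}")
print(f"odds multiplier for +2 units: {two_unit:.4f}")
print(f"odds ratio squared: {odds_ratio ** 2:.4f}")
```

The last two printed values agree, which is the exponential (multiplicative) growth described in the summary, as opposed to adding a fixed amount per unit.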
Summary: The simplest way of converting a data.frame to a transaction object is to read it from a CSV into R. An alternative is to convert it to a logical matrix and coerce that into a transaction object.
Lately, I’ve written a few iterations of pyspark code to develop a recommender system, so I’ve had some practice building them. I ran into a situation where I needed to generate recommendations on several different datasets. My problem was that I had to decipher some of the prediction documentation. Because of my struggles, […]
Summary: Better-quality data mining code is self-explanatory and does one thing at a time, and does it well. In your analysis, you should cross-validate and watch for slowly changing relationships in the data.
Summary: The foreach package provides parallel operations that work with many packages (including randomForest). Packages like gbm and caret have parallelization built into their functions. Other tools like bigmemory and ff tackle large datasets through memory management.