Category Archives: Programming

Text Mining Packages and Options in R

Summary: The tm and lsa packages let you turn your text data into a term-document matrix and create new numeric features.  The ngram package lets you find frequent word patterns (e.g., “The cow” is a bi-gram or 2-gram; “The cow said” is a tri-gram or 3-gram).  Lastly, for a quick visualization (though […]

Wordcloud generated in R for Brothers Grimm stories

Information Gain would select the Number of Images variable, while the Gini Index would select the more compact Average Token Length.

Decision Tree Flavors: Gini Index and Information Gain

Summary: The Gini Index is calculated by subtracting the sum of the squared probabilities of each class from one.  It favors larger partitions.  Information Gain is built on entropy, which multiplies each class probability by the log (base 2) of that probability, sums the results, and negates the total.  Information Gain favors smaller partitions with many distinct values.  Ultimately, you have to experiment with your data […]
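As a rough illustration of the two measures described in this summary, here is a minimal Python sketch. The function names (`gini_index`, `entropy`, `information_gain`) and the toy labels are assumptions made for this example, not code from the original post. It treats Information Gain as the parent node's entropy minus the size-weighted entropy of the child nodes, which is the usual definition.

```python
from collections import Counter
from math import log2


def class_probabilities(labels):
    """Return the proportion of each class in a non-empty list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini_index(labels):
    """One minus the sum of the squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Negated sum of p * log2(p) over the classes present."""
    return -sum(p * log2(p) for p in class_probabilities(labels))


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child partitions."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy data (hypothetical): a balanced parent node split into two purer children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("Gini of parent:", gini_index(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", information_gain(parent, [left, right]))
```

A perfectly balanced two-class node gives the highest impurity under both measures, and a pure node gives zero, so either one can be used to rank candidate splits.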