Free Data Mining and Data Science Books

I’m on a bit of a reading kick lately, so I wanted to compile a short list of useful, free data mining / data science books.  Most are technical in nature and come from academia.

Free Academic Texts on Data Mining

An Introduction to Statistical Learning with Applications in R: Covers a wide range of topics, all in R.  Its (older) sister book is The Elements of Statistical Learning (which is also free and discussed below).

  • Focused on a selection of machine learning concepts.
  • Still technical, but less so than The Elements of Statistical Learning.
  • Focused on teaching concepts, using R as a way to explore them.
    • Not an R tutorial book (only about 17 pages are dedicated to teaching R basics).
    • Does not explore all the options of the various R functions used.

Elements of Statistical Learning: THE book on machine learning concepts.  This book is referenced often by academics and practitioners alike.  It’s used in graduate school programs across the world.

  • Very detailed.  Not for the casual reader.
  • If you want deep insights into how many different machine learning models work, this is your go-to source.
  • You could teach an entire graduate program using just this book.
  • If you’re looking for a softer introduction, see An Introduction to Statistical Learning (above).

Introduction to Information Retrieval: I’m really interested in search engines and I’ll bet you are too.  This book serves as a detailed reference but also provides a smooth transition from novice to practitioner.

  • Quickly ramps you up from basic searching to more advanced methods (like finding the “latent topics” of documents).
  • No code samples to follow along or experiment with.
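
Since the book itself has no code to experiment with, here is a minimal sketch of what "basic searching" can look like: an inverted index with a boolean AND query.  This is my own illustration in Python, not code from the book; the function names and the tiny document set are made up for the example, and tokenization is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # .get avoids adding empty entries to the index for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))

# Hypothetical toy collection, only for illustration.
docs = {
    1: "free books on data mining",
    2: "data science with python",
    3: "mining massive datasets",
}
index = build_inverted_index(docs)
print(boolean_and(index, "data mining"))  # [1]
```

A real search engine adds ranking, stemming, and compressed postings lists on top of this, but the term-to-documents lookup is the starting point.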

Forecasting: Principles and Practice: If you’re interested in time series analysis and you like to use R, this is the book for you.  Personally, this free textbook got me through my time series class in grad school.

  • Makes use of the {forecast} package in R (designed by the author of the free textbook).
  • Probably the best integration of code and concepts I’ve seen (talks about the concepts, shows how you’d do it in R with the code right there).

Mining of Massive Datasets: If you want to understand how you might turn an algorithm into one that runs across machines to process data in parallel, this is the book for you.

  • Examines a handful of mining algorithms and how they are applied across machines.
  • Not for the uninitiated.  You should understand the basic algorithms before trying to understand how they can be modified for multiple machines.
  • Not a Hadoop / Spark walkthrough!  This is more of an academic reference and helps jumpstart your brain on distributed computing.
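
To give a feel for the idea of splitting an algorithm across machines, here is a rough single-process sketch of the map-and-reduce pattern applied to word counting.  It is my own toy illustration, not an example taken from the book: the function names are hypothetical, and the "workers" are simulated by looping over chunks rather than running on separate machines.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: count words within one chunk, as a single worker would."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

def word_count(lines, n_chunks=2):
    """Split the input, count each chunk independently, then merge."""
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    # Each chunk is independent, so this step could run in parallel.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce(merge_counts, partials, Counter())

# Hypothetical input, only for illustration.
lines = ["data mining", "mining massive data", "massive datasets"]
totals = word_count(lines)
print(totals["mining"])  # 2
```

The key property is that the merged result does not depend on how the input was split, which is what makes the computation safe to distribute.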

Free Practitioner Data Mining Books

R for Data Science: This is predominantly a book about working with data in R.  Written by a prolific R programmer, Hadley Wickham.

  • This book does not cover a wide range of statistical models.
  • The book does cover some advanced R programming topics.
  • Not all of the chapters are available as of right now (3/2016).
  • It will be available for purchase from O’Reilly.

Think Bayes: A programmer’s introduction to Bayesian statistics.

  • Struggling to grasp the Bayesian way of thinking?  Might as well give this free book a try before you buy it.
  • It’s all in Python, so it’s fairly easy to jump into even with no programming background.
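
For a taste of the Bayesian way of thinking in code, here is a minimal sketch of a single Bayesian update using plain dictionaries.  It is my own illustration and does not use the book's code or classes; the `bayes_update` helper is hypothetical.  The setup is an assumed two-bowl cookie example: bowl 1 holds 30 vanilla and 10 chocolate cookies, bowl 2 holds 20 of each, and you draw a vanilla cookie from a bowl picked at random.

```python
def bayes_update(prior, likelihood):
    """Multiply prior by likelihood for each hypothesis, then normalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

# Each bowl is equally likely before we see any data.
prior = {"bowl_1": 0.5, "bowl_2": 0.5}

# Probability of drawing a vanilla cookie from each bowl (assumed counts).
likelihood_vanilla = {"bowl_1": 30 / 40, "bowl_2": 20 / 40}

posterior = bayes_update(prior, likelihood_vanilla)
print(round(posterior["bowl_1"], 3))  # 0.6
```

Seeing a vanilla cookie shifts the belief toward bowl 1 because vanilla is more common there.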

Learn Python the Hard Way: It takes some dedication and self-motivation to make your way through this book, but it’s worth it in the end.  Python is an easy but powerful language that any analyst, data miner, or data scientist should have under their belt.

  • Focuses on showing you the minimum amount you need to get started.
  • Asks you to play around with the code and try different things.

A lot of great information is available for free (which is a part of the Learn By Marketing philosophy).  If you’ve already made your way through all of these books, take a look at my data science reading list for more learning resources.