Reading through the Data Mining: Practical Machine Learning Tools and Techniques (WEKA) book (also part of the Data Science Reading list), I started wondering who WEKA was really for. There are a lot of analytics tools and plenty of open-source ones. So WEKA has to fill a niche somewhere.
The WEKA website describes its use as “…a specialist in a particular field is able to use ML to derive useful knowledge from databases that are far too large to be analysed by hand. WEKA’s users are ML researchers and industrial scientists, but it is also widely used for teaching.”
Therein lies the main purpose of WEKA – to teach machine learning techniques. That’s why the book was published, that’s why academics built the software. However, I discovered others were using WEKA for real-world analyses. So putting aside the fact that WEKA was built to make ML widely available and to help teach ML to students, here are my top five reasons to use WEKA.
1.) You’re Using Pentaho Business Intelligence
While researching WEKA, I found out that Pentaho purchased an exclusive license to use WEKA in their software. I like the idea of Pentaho: inserting a layer of analytics between IT and marketing while employing open source software where it fits.
2.) You’re Using Data Mining Algorithms in Other Software
I’m an analyst, not a software developer. However, if you’re building software that requires a machine learning algorithm (maybe using kNN to classify a new customer) and you don’t want to write the implementation yourself, WEKA has a full library available for developers.
This reminded me of Apache Mahout. Maybe WEKA just needs better PR, a WIRED article like the one Mahout got, and a big company to make use of it.
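To make the kNN example concrete, here is a minimal sketch of what such a classifier does when it labels a new customer. It is plain Python written for illustration, not WEKA's API (WEKA is a Java library, and its kNN classifier is called IBk). The customer features, segment names, and the `knn_classify` function are all made up for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_classify(train, new_point, k=3):
    """Return the majority label among the k training points nearest to new_point."""
    # Sort the labelled rows by distance to the new point and keep the k closest.
    nearest = sorted(train, key=lambda row: dist(row[0], new_point))[:k]
    # Each neighbour casts one vote for its own label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical customers: (age, purchases per month) -> segment
customers = [
    ((25, 1), "occasional"),
    ((30, 2), "occasional"),
    ((28, 1), "occasional"),
    ((45, 9), "loyal"),
    ((50, 8), "loyal"),
    ((48, 10), "loyal"),
]

# Classify a new customer from the three nearest existing customers.
segment = knn_classify(customers, (47, 7), k=3)
print(segment)
```

A library implementation would typically also handle things this sketch ignores, such as scaling the features so that one variable does not dominate the distance, and breaking ties between labels.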
3.) You’re Not Comfortable with Command Lines
Programming in R or SAS requires knowing which functions and libraries to call before you can run anything. WEKA has lots of buttons and is smart enough to gray out options that aren’t applicable for a particular analysis.
However, using tools like SAS Enterprise Miner or the workflow tools from Alteryx and Revolution Analytics will allow you to make pretty diagrams showing each step of your analysis without having to read or write a line of code.
4.) You Want to Explore Your Data or Build a Model Quickly
Barring the use of SAS Enterprise Miner, WEKA seems like a very fast tool that lets you build a model with a few button clicks. I see WEKA being used by people first learning about data mining. It’s quick and easy and doesn’t require a lot of input from the user, aside from what type of model they want to build and which variable is the class label.
5.) You Want One Way to Build the Model, Not a Dozen
I love R, but because it’s open-source and because anyone can contribute a new library of functions, you see redundancy. There are a dozen ways to create a decision tree in R. WEKA has J48, its implementation of the C4.5 algorithm (revision 8). Instead of monkeying around with rpart, tree, party, or maptree, you just click a couple of buttons in WEKA. The researchers developing WEKA control what goes in and what comes out, making for a consistent experience across tabs (algorithms).
I’m not a big fan of WEKA or the Data Mining book by Witten, Frank, and Hall. I think it’s an interesting tool and has a lot of promise as a teaching aid. However, if you’re teaching this to a business student or statistician, their goal is to be able to use whatever tool is out on the market today rather than a niche, toy software. I would rather have been challenged by R and say that I’ve used a leading open-source tool than ace an intro-to-data-mining class and walk away not knowing how to interpret model output from commonly used tools.