wekaMine

Retooling the Weka machine learning library for research and production.

What is Weka?

Weka is a collection of machine learning algorithms for data mining tasks, developed at the University of Waikato. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization, as well as a GUI suitable for teaching.

What is wekaMine?

Weka is a well-known machine learning library thanks to its long history, large collection of algorithms, and a GUI that makes it good for teaching. It is less well loved, however, among people who try to use it for larger research projects or in production. Several aspects of Weka make it clunky to use, from its reliance on ARFF files for data storage to its treatment of sample IDs as just another feature, one that most Weka algorithms cannot accept. Each of these issues has a workaround within Weka, such as wrapping a classifier in a FilteredClassifier that removes the ID before the data is handed to the classifier, but such workarounds become part of the pain of using Weka out of the box.
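To make the ARFF friction concrete, here is a minimal sketch that converts a small tab-delimited table into ARFF text using only the Java standard library. The class name TabToArff, the column layout (first column is the sample ID, last column is the class label, everything in between is numeric), and the example data are assumptions made for illustration; this is not code from Weka or wekaMine. It also assumes no value contains commas, spaces, or quotes, which ARFF would require quoting.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Weka or wekaMine.
class TabToArff {

    // Convert a tab-delimited table into ARFF text.
    // Assumed layout: header row, first column = sample ID,
    // last column = class label, columns in between = numeric features.
    static String convert(String relationName, String tabText) {
        String[] lines = tabText.trim().split("\\R");
        String[] header = lines[0].split("\t");
        int classCol = header.length - 1;

        // A nominal ARFF attribute must list all of its allowed values up front,
        // so the class labels are collected before the header is written.
        Set<String> classValues = new LinkedHashSet<>();
        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String[] fields = lines[i].split("\t");
            rows.add(fields);
            classValues.add(fields[classCol]);
        }

        StringBuilder arff = new StringBuilder();
        arff.append("@relation ").append(relationName).append("\n\n");
        // The sample ID has no special status: it is declared like any other attribute.
        arff.append("@attribute ").append(header[0]).append(" string\n");
        for (int c = 1; c < classCol; c++) {
            arff.append("@attribute ").append(header[c]).append(" numeric\n");
        }
        arff.append("@attribute ").append(header[classCol])
            .append(" {").append(String.join(",", classValues)).append("}\n\n");
        arff.append("@data\n");
        for (String[] fields : rows) {
            arff.append(String.join(",", fields)).append("\n");
        }
        return arff.toString();
    }

    public static void main(String[] args) {
        String tab = "sampleID\tgeneA\tgeneB\tresponse\n"
                   + "s1\t0.5\t1.2\tyes\n"
                   + "s2\t0.1\t3.4\tno\n";
        System.out.print(convert("example", tab));
    }
}
```

In the resulting ARFF text the sample ID appears as an ordinary string attribute alongside the features, which is why it has to be filtered out again before most classifiers will accept the data.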

wekaMine is a set of libraries and wrapper code that hides the rough edges of Weka from the user. It provides straightforward command-line scripts for working with data in simple tab-delimited file formats, and a set of functions that make writing machine learning pipelines in Java or Groovy much easier. wekaMine provides an automated model selection pipeline that makes searching a space of different filters, attribute selectors, classifier algorithms, and their parameters as easy as training with a single algorithm. wekaMine also adds a range of algorithms, from BalancedRandomForest to MixtureModel filtering.
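The idea of searching a space of filters, attribute selectors, and classifiers can be pictured as enumerating every combination of a few option lists. The sketch below is a generic Cartesian-product expansion in plain Java. The class name SearchSpace, the option keys, and the option values are placeholder assumptions, and nothing here reflects wekaMine's actual API or configuration format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration; none of these names come from wekaMine.
class SearchSpace {

    // Expand named option lists into every combination (Cartesian product).
    static List<Map<String, String>> expand(Map<String, List<String>> options) {
        List<Map<String, String>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>()); // start from a single empty combination
        for (Map.Entry<String, List<String>> option : options.entrySet()) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : combos) {
                for (String value : option.getValue()) {
                    // Copy the partial combination and add one more choice to it.
                    Map<String, String> extended = new LinkedHashMap<>(partial);
                    extended.put(option.getKey(), value);
                    next.add(extended);
                }
            }
            combos = next;
        }
        return combos;
    }

    public static void main(String[] args) {
        // Placeholder option values, used here only as labels.
        Map<String, List<String>> options = new LinkedHashMap<>();
        options.put("filter", Arrays.asList("none", "normalize"));
        options.put("attributeSelection", Arrays.asList("none", "infoGain"));
        options.put("classifier", Arrays.asList("J48", "RandomForest", "SMO"));

        // A real pipeline would train and evaluate a model for each combination;
        // this sketch only prints them.
        for (Map<String, String> combo : expand(options)) {
            System.out.println(combo);
        }
    }
}
```

With two filters, two attribute selectors, and three classifiers, the expansion yields twelve candidate configurations, which shows how quickly such a space grows and why automating the search is useful.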