14 open source tools to make the most of machine learning
- 24 September, 2020 08:13
Spam filtering, face recognition, recommendation engines — when you have a large data set on which you’d like to perform predictive analysis or pattern recognition, machine learning is the way to go.
Apache Mahout provides a way to build environments for hosting machine learning applications that can be scaled quickly and efficiently to meet demand.
Mahout works mainly with another well-known Apache project, Spark. It was originally devised to work with Hadoop for the sake of running distributed applications, but has since been extended to work with other distributed back ends like Flink and H2O.
Mahout uses a domain-specific language in Scala. Version 0.14 is a major internal refactor of the project, with Apache Spark 2.4.3 as its default back end.
Compose, by Innovation Labs, targets a common issue with machine learning models: labelling raw data, which can be a slow and tedious process, but without which a machine learning model can’t deliver useful results.
Compose lets you write in Python a set of labelling functions for your data, so labelling can be done as programmatically as possible. Various transformations and thresholds can be set on your data to make the labelling process easier, such as placing data in bins based on discrete values or quantiles.
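To make the idea concrete, here is a minimal sketch of a labelling function followed by a threshold and a quantile-based binning step. It uses only the Python standard library and does not use Compose's own API; the function names (`total_spent`, `bin_by_quantiles`) and the transaction data are made up for illustration.

```python
from statistics import quantiles


def total_spent(transactions):
    """Labelling function: reduce one customer's raw rows to a single number."""
    return sum(t["amount"] for t in transactions)


def bin_by_quantiles(values, n_bins):
    """Map each value to a bin index (0 .. n_bins - 1) using quantile cut points."""
    cuts = quantiles(values, n=n_bins)  # yields n_bins - 1 cut points
    return [sum(v > c for c in cuts) for v in values]


# Hypothetical raw data: customer id -> list of transactions.
raw = {
    "a": [{"amount": 5.0}, {"amount": 7.5}],
    "b": [{"amount": 120.0}],
    "c": [{"amount": 40.0}, {"amount": 2.0}],
    "d": [{"amount": 300.0}, {"amount": 15.0}],
}

customers = sorted(raw)
totals = [total_spent(raw[c]) for c in customers]

# Transformation 1: a fixed threshold turns the totals into a yes/no label.
thresholded = {c: t > 100 for c, t in zip(customers, totals)}

# Transformation 2: quantile bins turn the totals into "low" (0) and "high" (1).
labels = dict(zip(customers, bin_by_quantiles(totals, n_bins=2)))
```

The labelling logic stays in one small function, and the choice of threshold or bin count can change without touching it.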
Core ML Tools
Apple’s Core ML framework lets you integrate machine learning models into apps, but uses its own distinct learning model format. The good news is you don’t have to pre-train models in the Core ML format to use them; you can convert models from just about every commonly used machine learning framework into Core ML with Core ML Tools.
Core ML Tools runs as a Python package, so it integrates with the wealth of Python machine learning libraries and tools. Models from TensorFlow, PyTorch, Keras, Caffe, ONNX, Scikit-learn, LibSVM, and XGBoost can all be converted. Neural network models can also be optimised for size by using post-training quantisation (e.g., to a small bit depth that’s still accurate).
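The sketch below illustrates the general idea behind post-training quantisation: already-trained floating-point weights are mapped onto a small set of integer levels, then mapped back at inference time with a bounded loss of precision. It is plain Python for illustration, not the Core ML Tools API; the function names and weight values are assumptions, and a simple linear scheme stands in for whatever scheme a real converter applies.

```python
def quantise(weights, bits=8):
    """Linearly map floats onto integer codes 0 .. 2**bits - 1."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits - 1
    # Fall back to a scale of 1.0 if all weights are identical.
    scale = (hi - lo) / levels or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantise(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical trained weights.
weights = [-0.82, -0.1, 0.0, 0.33, 0.97]

codes, scale, lo = quantise(weights, bits=8)
restored = dequantise(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now needs 8 bits instead of 32, and the worst-case error per weight is half of `scale`.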
Cortex provides a convenient way to serve predictions from machine learning models using Python and TensorFlow, PyTorch, Scikit-learn, and other models. Most Cortex packages consist of only a few files — your core Python logic, a cortex.yaml file that describes what models to use and what kinds of compute resources to allocate, and a requirements.txt file to install any needed Python requirements.
The whole package is deployed as a Docker container to AWS or another Docker-compatible hosting system. Compute resources are allocated in a way that echoes the resource definitions used in Kubernetes, and you can use GPUs or Amazon Inferentia ASICs to speed serving.
Feature engineering, or feature creation, involves taking the data used to train a machine learning model and producing, typically by hand, a transformed and aggregated version of the data that’s more useful for the sake of training the model.
Featuretools gives you functions for doing this by way of high-level Python objects built by synthesising data in data frames, and can do this for data extracted from one or multiple data frames. Featuretools also provides common primitives for the synthesis operations (e.g., time_since_previous, to provide time elapsed between instances of time-stamped data), so you don’t have to roll those on your own.
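As a rough illustration of what a primitive like time_since_previous computes, here is a hand-rolled version using only the standard library. It is a sketch of the concept, not the library's implementation; the function signature and the sample timestamps are assumptions.

```python
from datetime import datetime


def time_since_previous(timestamps):
    """Seconds elapsed since the previous timestamp; None for the first row."""
    if not timestamps:
        return []
    result = [None]  # the first event has no predecessor
    for prev, cur in zip(timestamps, timestamps[1:]):
        result.append((cur - prev).total_seconds())
    return result


# Hypothetical time-stamped events, already sorted by time.
events = [
    datetime(2020, 9, 24, 8, 0, 0),
    datetime(2020, 9, 24, 8, 0, 30),
    datetime(2020, 9, 24, 8, 5, 30),
]

gaps = time_since_previous(events)
print(gaps)  # [None, 30.0, 300.0]
```

A library of such primitives saves you from rewriting and retesting this kind of helper for every data set.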
GoLearn, a machine learning library for Google’s Go language, was created with the twin goals of simplicity and customisability, according to developer Stephen Whitworth. The simplicity lies in the way data is loaded and handled in the library, which is patterned after SciPy and R.
The customisability lies in how some of the data structures can be easily extended in an application. Whitworth has also created a Go wrapper for the Vowpal Wabbit library, one of the libraries found in the Shogun toolbox.
One common challenge when building machine learning applications is building a robust and easily customised UI for the model training and prediction-serving mechanisms. Gradio provides tools for creating web-based UIs that allow you to interact with your models in real time.
Several included sample projects, such as input interfaces to the Inception V3 image classifier or the MNIST handwriting-recognition model, give you an idea of how you can use Gradio with your own projects.
H2O, now in its third major revision, provides a whole platform for in-memory machine learning, from training to serving predictions. H2O’s algorithms are geared for business processes—fraud or trend predictions, for instance—rather than, say, image analysis. H2O can interact in a stand-alone fashion with HDFS stores, on top of YARN, in MapReduce, or directly in an Amazon EC2 instance.
Hadoop mavens can use Java to interact with H2O, but the framework also provides bindings for Python, R, and Scala, allowing you to interact with all of the libraries available on those platforms as well. You can also fall back to REST calls as a way to integrate H2O into most any pipeline.
Oryx, courtesy of the creators of the Cloudera Hadoop distribution, uses Apache Spark and Apache Kafka to run machine learning models on real-time data. Oryx provides a way to build projects that require decisions in the moment, like recommendation engines or live anomaly detection, that are informed by both new and historical data.
Version 2.0 is a near-complete redesign of the project, with its components loosely coupled in a lambda architecture. New algorithms, and new abstractions for those algorithms (e.g., for hyper-parameter selection), can be added at any time.
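A lambda architecture combines a batch layer that periodically recomputes results from all historical data, a speed layer that tracks data arriving since the last batch run, and a serving layer that merges the two to answer queries. The toy class below sketches that split for a simple item counter. It is plain Python and has nothing to do with Oryx's actual code or API; the class and method names are made up for illustration.

```python
from collections import Counter


class LambdaCounts:
    """Toy lambda architecture: batch view + speed view, merged at query time."""

    def __init__(self):
        self.history = []             # full event log (input to the batch layer)
        self.batch_view = Counter()   # result of the last batch recomputation
        self.speed_view = Counter()   # increments seen since that recomputation

    def ingest(self, item):
        """Speed layer: record the event and update the real-time view."""
        self.history.append(item)
        self.speed_view[item] += 1

    def run_batch(self):
        """Batch layer: recompute from all history, then reset the speed view."""
        self.batch_view = Counter(self.history)
        self.speed_view.clear()

    def query(self, item):
        """Serving layer: combine historical and recent results."""
        return self.batch_view[item] + self.speed_view[item]


store = LambdaCounts()
for item in ["a", "b", "a"]:
    store.ingest(item)
before_batch = store.query("a")   # answered from the speed view alone

store.run_batch()
store.ingest("a")
after_batch = store.query("a")    # batch view plus one new event
```

Queries stay correct whether or not a batch run has happened yet, which is what lets in-the-moment decisions draw on both new and historical data.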
When a powerful project becomes popular, it’s often complemented by third-party projects that make it easier to use. PyTorch Lightning provides an organisational wrapper for PyTorch, so that you can focus on the code that matters instead of writing boilerplate for each project.
Lightning projects use a class-based structure, so each common step for a PyTorch project is encapsulated in a class method. The training and validation loops are semi-automated, so you only need to provide your logic for each step. It’s also easier to run training across multiple GPUs or different hardware mixes, because the instructions and object references for doing so are centralised.
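The division of labour described here can be sketched without PyTorch at all: one class holds only the per-step logic, and a generic trainer owns the loops. The example below is a simplified stand-in written in plain Python, not the Lightning API; the class names, the learning rate, and the tiny data set are assumptions chosen for illustration.

```python
class LinearModule:
    """User code: only the per-step logic lives here (fits y = w * x)."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.lr = lr

    def training_step(self, batch):
        x, y = batch
        error = self.w * x - y
        # Gradient of the squared error with respect to w is 2 * error * x.
        self.w -= self.lr * 2 * error * x
        return error ** 2

    def validation_step(self, batch):
        x, y = batch
        return (self.w * x - y) ** 2


class Trainer:
    """Generic loop: owns epochs, iteration, and bookkeeping."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs

    def fit(self, module, train_data, val_data):
        history = []
        for _ in range(self.max_epochs):
            for batch in train_data:
                module.training_step(batch)
            val_loss = sum(module.validation_step(b) for b in val_data) / len(val_data)
            history.append(val_loss)
        return history


# Hypothetical data drawn from y = 3 * x.
train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
val = [(4.0, 12.0)]

module = LinearModule(lr=0.05)
history = Trainer(max_epochs=20).fit(module, train, val)
```

Because the loop lives in one place, changing where or how training runs means changing the trainer, not every project's step logic.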
Python has become a go-to programming language for math, science, and statistics due to its ease of adoption and the breadth of libraries available for nearly any application. Scikit-learn leverages this breadth by building on top of several existing Python packages—NumPy, SciPy, and Matplotlib—for math and science work.
The resulting libraries can be used for interactive “workbench” applications or embedded into other software and reused. The kit is available under a BSD license, so it’s fully open and reusable.
Shogun is one of the longest-lived projects in this collection. It was created in 1999 and written in C++, but can be used with Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. The latest major version, 6.0.0, adds native support for Microsoft Windows and the Scala language.
Though popular and wide-ranging, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, but professes to be faster and easier to work with (by way of a more integral API set) than competing libraries.
The machine learning library for Apache Spark and Apache Hadoop, MLlib boasts many common algorithms and useful data types, designed to run at speed and scale. Although Java is the primary language for working in MLlib, Python users can connect MLlib with the NumPy library, Scala users can write code against MLlib, and R users can plug into Spark as of version 1.5.
Version 3 of MLlib focuses on using Spark’s DataFrame API (as opposed to the older RDD API), and provides many new classification and evaluation functions.
Weka, created by the Machine Learning Group at the University of Waikato, is billed as “machine learning without programming.” It’s a GUI workbench that empowers data wranglers to assemble machine learning pipelines, train models, and run predictions without having to write code.
Weka works directly with R, Apache Spark, and Python, the latter by way of a direct wrapper or through interfaces for common numerical libraries like NumPy, Pandas, SciPy, and Scikit-learn. Weka’s big advantage is that it provides browsable, friendly interfaces for every aspect of your job including package management, preprocessing, classification, and visualisation.