ARN

Assessing Dataiku: data science fit for the enterprise

Dataiku’s end-to-end machine learning platform combines visual tools, notebooks, and code to address the needs of data scientists, data engineers, business analysts, and AI consumers.

Dataiku Data Science Studio (DSS) is a platform that tries to span the needs of data scientists, data engineers, business analysts and artificial intelligence (AI) consumers. It mostly succeeds. In addition, Dataiku DSS tries to span the machine learning process from end to end, i.e. from data preparation through MLOps and application support. Again, it mostly succeeds.

The Dataiku DSS user interface is a combination of graphical elements, notebooks, and code, as we’ll see later on in the review. As a user, you often have a choice of how you’d like to proceed, and you’re usually not locked into your initial choice, given that graphical choices can generate editable notebooks and scripts.

During my initial discussion with Dataiku, their senior product marketing manager asked me point blank whether I preferred a GUI or writing code for data science. I said “I usually wind up writing code, but I’ll use a GUI whenever it’s faster and easier.” This met with approval: Many of their customers have the same pragmatic attitude.

Dataiku competes with pretty much every data science and machine learning platform, but also partners with several of them, including Microsoft Azure, Databricks, AWS, and Google Cloud. I consider KNIME similar to DSS in its use of flow diagrams, and at least half a dozen platforms similar to DSS in their use of Jupyter notebooks, including the four partners I mentioned. DSS is similar to DataRobot, H2O.ai, and others in its implementation of AutoML.

Dataiku DSS features

Dataiku says that its key capabilities are data preparation, visualisation, machine learning, DataOps, MLOps, analytic apps, collaboration, governance, explainability, and architecture. It supports additional capabilities through plug-ins.

Dataiku data preparation features a visual flow where users can build data pipelines with datasets, recipes to join and transform datasets, plus code and reusable plug-in elements.

Dataiku performs quick visual analysis of columns, including the distribution of values, top values, outliers, invalids, and overall statistics. For categorical data, the visual analysis includes the distribution by value, with the count and percentage of rows for each value. The visualisation capabilities let you perform exploratory data analysis without resorting to Tableau, although Dataiku and Tableau are partners.
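To make the categorical summary concrete, here is a minimal Python sketch of the kind of per-value count and percentage that such an analysis reports. It is not Dataiku's implementation: the function name, the sample column, and the rule that None and empty strings count as missing are all assumptions made for illustration.

```python
from collections import Counter


def summarize_categorical(values):
    """Return (missing_count, [(value, count, percent), ...]) for one column."""
    # Assumption: None and empty strings are treated as missing values.
    present = [v for v in values if v not in (None, "")]
    missing = len(values) - len(present)
    counts = Counter(present)
    total = len(present)
    # most_common() orders values from most to least frequent.
    rows = [
        (value, count, round(100.0 * count / total, 1))
        for value, count in counts.most_common()
    ]
    return missing, rows


# Hypothetical sample column, not taken from the tutorial data.
campaign = ["email", "web", "email", None, "store", "email", "web", ""]
missing, rows = summarize_categorical(campaign)
print("missing:", missing)
for value, count, pct in rows:
    print(f"{value:<6} {count:>3} {pct:>5}%")
```

Percentages here are computed over non-missing rows only; a real tool may well choose a different denominator.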

Dataiku machine learning includes AutoML and feature engineering, as shown in the figure below. Each Dataiku project has a DataOps visual flow, including the pipeline of datasets and recipes associated with the project.

Figure (IDG): Dataiku DSS offers three kinds of AutoML models and three kinds of expert models.

For MLOps, the Dataiku unified deployer manages project files’ movement between Dataiku design nodes and production nodes for batch and real-time scoring. Project bundles package everything a project needs from the design environment to run on the production environment.

Dataiku makes it easy to create project dashboards and share them with business users. The Dataiku visual flow is the canvas where teams collaborate on data projects; it also represents the DataOps pipeline and provides an easy way to access the details of individual steps. Dataiku permissions control who on the team can access, read, and change a project.

Dataiku provides critical capabilities for explainable AI, including reports on feature importance, partial dependence plots, subpopulation analysis, and individual prediction explanations. These are in addition to providing interpretable models.

DSS has a large collection of plug-ins and connectors. For example, time series prediction models come as a plug-in; so do interfaces to the AI and machine learning services of AWS and Google Cloud, such as Amazon Rekognition APIs for Computer Vision, Amazon SageMaker machine learning, Google Cloud Translation, and Google Cloud Vision. Not all plug-ins and connectors are available to all plans.

Dataiku targets data scientists, data engineers, business analysts, and AI consumers. I went through the Dataiku Data Scientist tutorial, which seems to be the closest match to my skills, and took screenshots as I went.

Figure (IDG): Dataiku currently offers quick start tutorials for four personas: business analysts, data scientists, data engineers, and AI consumers.

Dataiku data preparation and visualisation

The initial state of the flows in this tutorial reflects having some of the setup, data finding, data cleaning, and joining done by someone else, presumably a data analyst or data engineer. In a team effort, that’s likely. For a solo practitioner, it’s not. Dataiku may support both use cases, but has made a considerable effort to support teams in enterprises.

Figure (IDG): The Dataiku DSS Data Scientist Quick Start tutorial has two flows, one for data preparation and one for model assessment.

Clicking into a dataset’s icon in a flow brings it up in a sheet.

Figure (IDG): Dataiku DSS displays tabular data in a spreadsheet-like table. Note the shading on missing values.

Showing the data is useful, but exploratory data analysis is even more useful. Here we are generating a Jupyter notebook for a single dataset, which was in turn created by joining two prepared datasets.

I have to complain a little at this point. All of the prebuilt or generated notebooks I used were written in Python 2, but that’s no longer a valid DSS environment, since Python 2 has (at long last) reached end of life. I had to edit many notebook cells for Python 3, which was annoying and time-consuming. Fortunately, it was fairly simple: The most frequent fix was to add parentheses around the arguments of print, which is a function in Python 3 and so requires them. Dataiku should really update its notebook templates for Python 3.
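For readers who haven’t made this migration, the print fix looks roughly like the sketch below. The variable name and value are hypothetical and are not copied from a Dataiku notebook.

```python
# Python 2 statement form (a SyntaxError under Python 3):
#     print "rows:", n_rows
#
# Python 3 form: print is a function, so its arguments go in parentheses.
n_rows = 1000  # hypothetical value, for illustration only
message = "rows: {}".format(n_rows)
print(message)
print("rows:", n_rows)
```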

Figure (IDG): Dataiku DSS has a number of pre-defined templates for notebooks that can visualise datasets.

The generated notebook uses standard Python libraries such as Pandas, Matplotlib, Seaborn, and SciPy to handle data, generate plots, and compute descriptive statistics.

Figure (IDG): A couple of clicks and a few seconds of computation generated this notebook that does exploratory data analysis on a single dataset. The notebook goes on to display more interesting graphics and descriptive statistics, such as box plots and Shapiro-Wilk tests.
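As a rough illustration of what sits behind a box plot, the following sketch computes quartiles, whisker ends, and outliers with Python's standard library only. It is not the code from the generated notebook: the function name and the sample ages are made up, and the 1.5 × IQR whisker rule is the common Tukey convention, assumed here. The Shapiro-Wilk normality test needs SciPy (`scipy.stats.shapiro`) and is not reproduced.

```python
import statistics


def box_plot_stats(values):
    """Quartiles and Tukey-style whisker limits for one numeric column."""
    data = sorted(values)
    # "inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme points that are still inside the fences.
    inside = [x for x in data if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": [x for x in data if x < low_fence or x > high_fence],
    }


# Hypothetical customer ages, not taken from the tutorial dataset.
ages = [23, 25, 29, 31, 34, 35, 38, 41, 44, 90]
stats = box_plot_stats(ages)
print(stats)
```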

Dataiku machine learning and model assessment

Before I could do anything with the Model Assessment flow zone, I had to add a recipe to check whether a customer’s revenue is over or under a specific threshold variable, which is defined globally. The recipe created the high_value dataset, which has an additional column for the classification. In general, recipes in a flow (other than data preparation steps that remove rows or columns) add a column with the new computed values. Then I had to build all the flow outputs reachable from the split step.
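In plain Python, the logic of that labeling recipe amounts to something like the sketch below. This is not the tutorial's recipe: the cutoff value, the column names other than high_value, and the choice to count revenue exactly at the cutoff as high value are assumptions for illustration.

```python
# Hypothetical stand-in for the globally defined variable.
REVENUE_CUTOFF = 170.0


def add_high_value(rows, cutoff):
    """Return copies of the rows with a boolean high_value column added."""
    # Assumption: revenue equal to the cutoff counts as high value.
    return [dict(row, high_value=row["revenue"] >= cutoff) for row in rows]


# Made-up customer records.
customers = [
    {"customer_id": "c1", "revenue": 250.0},
    {"customer_id": "c2", "revenue": 90.0},
    {"customer_id": "c3", "revenue": 170.0},
]
labeled = add_high_value(customers, REVENUE_CUTOFF)
for row in labeled:
    print(row["customer_id"], row["high_value"])
```

Keeping the cutoff in one shared variable, rather than hard-coding it in the recipe, is what lets a later step change the definition of a high-value customer without editing the flow.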

Figure (IDG): The split step looks at the data_source column and uses it to split the output into test and train datasets. The right-click context menu gives access to, among other options, “Build Flow outputs reachable from here.”
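The split itself is simple to express in code. The sketch below routes rows by the data_source column; the literal values "train" and "test" and the sample rows are assumptions, since the actual values in the tutorial data aren't shown here.

```python
def split_by_source(rows, column="data_source"):
    """Route each row to the train or test list based on one column."""
    train, test = [], []
    for row in rows:
        label = row[column]
        if label == "train":
            train.append(row)
        elif label == "test":
            test.append(row)
        else:
            # Fail loudly rather than silently dropping unexpected rows.
            raise ValueError(f"unexpected {column} value: {label!r}")
    return train, test


# Made-up rows for illustration.
rows = [
    {"customer_id": "c1", "data_source": "train"},
    {"customer_id": "c2", "data_source": "test"},
    {"customer_id": "c3", "data_source": "train"},
]
train_rows, test_rows = split_by_source(rows)
print(len(train_rows), len(test_rows))
```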

Dataiku AutoML, interpretable models, and high-performance models

This tutorial moves on to creating and running an AutoML session with interpretable models, such as Random Forest, rather than high-performance models (just a different initial selection of model choices) or deep learning models (Keras/TensorFlow, using Python code). As it turns out, my Booster Plan Dataiku cloud instance didn’t have a Python environment that could support deep learning, and didn’t have GPUs. Both could be added using a more expensive Orbit plan, which also adds distributed Spark support.

I was restricted to in-memory training with Scikit-learn and custom models on two CPUs, which was fine for exploratory purposes. Most of the feature engineering options in the DSS AutoML model were turned off for the purposes of the tutorial. That was fine for learning purposes, but I would have used them for a real data science project.

Figure (IDG): This session of AutoML using interpretable models, including custom models, showed that Random Forest gave the highest area under the ROC (receiver operating characteristic) curve. The price of the first item purchased and the customer’s age were the most important variables contributing to the prediction of high-value customers.
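For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it that way on made-up labels and scores. It is an illustration of the metric, not how DSS or scikit-learn computes it internally.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = high-value customer) and model scores.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.3, 0.6, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of rows, which is fine for a demonstration but not for large datasets, where a sort-based calculation is the usual choice.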

Dataiku deployment and MLOps

After finding a winning model in the AutoML session, I deployed it and explored some of the MLOps features of DSS, using Scenarios. The scenario supplied with the flow for this tutorial uses a Python script to rebuild the model, and replace the deployed model if the new model has a higher ROC AUC value. The exercise to test this capability uses an external variable to change the definition of a high-value customer, which isn’t all that interesting, but does make the point about MLOps automation.
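The decision rule in that scenario can be sketched in a few lines. This is not the tutorial's script: the model records, names, and AUC values are hypothetical, and the strict comparison (a tie keeps the deployed model) is an assumption.

```python
def choose_model(deployed, candidate):
    """Keep the deployed model unless the candidate has a strictly higher AUC."""
    if candidate["auc"] > deployed["auc"]:
        return candidate
    return deployed


# Hypothetical model records; in practice the AUC values would come from
# evaluating both models on the same held-out data.
deployed = {"name": "rf_v1", "auc": 0.781}
candidate = {"name": "rf_v2", "auc": 0.802}
active = choose_model(deployed, candidate)
print(active["name"])
```

The comparison is only meaningful when both models are scored against the same definition of a high-value customer, which is worth keeping in mind when the external variable changes between runs.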

Overall, Dataiku DSS is a very good, end-to-end platform for data analysis, data engineering, data science, MLOps, and AI browsing. Its self-service cloud pricing is reasonable, but not cheap; the basis for enterprise pricing is reasonable, although I have no concrete information about its actual enterprise pricing.

Dataiku tries hard to support non-programmers in DSS with a graphical UI and visual machine learning. The visual aspects of the product do generate notebooks with code a programmer can customise, which saves a lot of time.

I’m not totally convinced, however, that non-programming “citizen data scientists” can perform data engineering and data science effectively, even with all of the tools and training that Dataiku supplies. Data science teams need at least one member who can program and at least one member with an intuition for feature engineering and model building, not necessarily the same person. In the worst case, you might have to rely on Dataiku’s consultants for guidance.

It’s certainly worth doing a free evaluation of Dataiku DSS. You can use either the downloaded Community Edition (free forever, three users, files or open source databases) or the 14-day hosted cloud trial (five users, two CPUs, 16 GB RAM, 100 GB plus BYO cloud storage).