3 ways to apply agile to data science and dataops

Take an agile approach to dashboards, machine learning models, cleansing data sources, and data governance


Just about every organisation is trying to become more data-driven, hoping to leverage data visualisations, analytics, and machine learning for competitive advantages.

Providing actionable insights through analytics requires a strong dataops program for integrating data and a proactive data governance program to address data quality, privacy, policies, and security.

Delivering dataops, analytics, and governance is a significant scope that requires aligning stakeholders on priorities, implementing multiple technologies, and gathering people with diverse backgrounds and skills. Agile methodologies can form the working process to help multidisciplinary teams prioritise, plan, and successfully deliver incremental business value.

Agile methodologies can also help data and analytics teams capture and process feedback from customers, stakeholders, and end-users. Feedback should drive data visualisation improvements, machine learning model recalibrations, data quality improvements, and data governance compliance.

Defining an agile process for data science and dataops

Applying agile methodologies to the analytics and machine learning lifecycle is a significant opportunity, but it requires redefining some terms and concepts. For example:

  • Instead of an agile product owner, an agile data science team may be led by an analytics owner who is responsible for driving business outcomes from the insights delivered
  • Data science teams sometimes complete new user stories with improvements to dashboards and other tools, but more broadly, they deliver actionable insights, improved data quality, dataops automation, enhanced data governance, and other deliverables. The analytics owner and team should capture the underlying requirements for all these deliverables in the backlog
  • Agile data science teams should be multidisciplinary and may include dataops engineers, data modelers, database developers, data governance specialists, data scientists, citizen data scientists, data stewards, statisticians, and machine learning experts. The team makeup depends on the scope of work and the complexity of data and analytics required

An agile data science team is likely to have several types of work. Here are three primary ones that should fill backlogs and sprint commitments.

1. Developing and upgrading analytics, dashboards, and data visualisations

Data science teams should conceive dashboards to help end-users answer questions. For example, a sales dashboard may answer the question, “What sales territories have seen the most sales activity by rep during the last 90 days?”

A dashboard for agile software development teams may answer, “Over the last three releases, how productive has the team been delivering features, addressing technical debt, and resolving production defects?”
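
To make such dashboard questions concrete, here is a minimal pandas sketch of the sales example above; the file name and column names (rep, territory, activity_date) are assumptions for illustration, not from any specific system.

    # Minimal sketch: rank territories by rep activity over the last 90 days.
    # The file and column names are hypothetical.
    import pandas as pd

    activities = pd.read_csv("sales_activities.csv", parse_dates=["activity_date"])

    cutoff = pd.Timestamp.today() - pd.Timedelta(days=90)
    recent = activities[activities["activity_date"] >= cutoff]

    # Count activities per territory and rep, then rank by activity volume
    summary = (
        recent.groupby(["territory", "rep"])
        .size()
        .rename("activity_count")
        .reset_index()
        .sort_values("activity_count", ascending=False)
    )
    print(summary.head(10))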

Agile user stories should address three questions: Who are the end-users? What problem do they want addressed? Why is the problem important? Questions like the ones above are the basis for writing user stories that deliver analytics, dashboards, or data visualisations, because they capture who intends to use the dashboard and what answers they need from it.

It then helps when stakeholders and end-users offer a hypothesis about the answer and explain how they intend to make the results actionable. How the insights become actionable, and what business impact they deliver, answers the third question a user story should address: why the problem is important.
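
For example, a hypothetical story tying the three questions together might read: “As a regional sales manager (who), I want to see sales activity by rep and territory over the last 90 days (what), so that I can shift coverage toward territories that are falling behind (why).”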

The first version of a Tableau or Power BI dashboard should be a “minimal viable dashboard” that’s good enough to share with end-users for feedback. Users should let the data science team know how well the dashboard answers their questions and how it could be improved. The analytics owner should put these enhancements on the backlog and consider prioritising them in future sprints.

2. Developing and upgrading machine learning models

The process of developing analytical and machine learning models includes segmenting and tagging data, feature extraction, and running data sets through multiple algorithms and configurations.
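
As an illustration of that last step, the sketch below (an assumption, not a prescribed workflow) runs one prepared data set through a few scikit-learn algorithms and configurations and compares them with cross-validation; the bundled breast-cancer data set stands in for a team’s own prepared features.

    # Illustrative sketch: compare several candidate models on one prepared data set.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)  # stand-in for the team's prepared features

    candidates = {
        "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "random_forest_100": RandomForestClassifier(n_estimators=100, random_state=0),
        "random_forest_500": RandomForestClassifier(n_estimators=500, random_state=0),
    }

    # Each experiment could map to its own user story; results inform the next sprint's priorities.
    for name, model in candidates.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name}: mean AUC = {scores.mean():.3f}")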

Agile data science teams might record agile user stories for prepping data for use in model development and then create separate stories for each experiment. This transparency helps teams review the results from experiments, decide on the next priorities, and discuss whether approaches are converging on beneficial results.

There are likely separate user stories to move models from the lab into production environments. These stories are devops for data science and machine learning, and likely include scripting infrastructure, automating model deployments, and monitoring the production processes.

Once models are in production, the data science team has responsibilities to maintain them. As new data comes in, models may drift off target and require recalibration or re-engineering with updated data sets. Advanced machine learning teams from companies like Twitter and Facebook implement continuous training and recalibrate models with new training set data.
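
The sketch below is a simplified illustration of that idea, not any company’s actual pipeline: it scores a deployed model on freshly labelled data and retrains when accuracy has drifted past an assumed threshold.

    # Simplified drift check: retrain when accuracy drops too far below its deployment baseline.
    from sklearn.base import clone
    from sklearn.metrics import accuracy_score

    DRIFT_THRESHOLD = 0.05  # assumed acceptable drop from the baseline accuracy


    def check_and_recalibrate(model, baseline_accuracy, X_new, y_new):
        """Score the live model on fresh labelled data; retrain on it if accuracy has drifted."""
        current_accuracy = accuracy_score(y_new, model.predict(X_new))
        if baseline_accuracy - current_accuracy > DRIFT_THRESHOLD:
            # In practice the retraining set would combine historical and new data.
            return clone(model).fit(X_new, y_new), current_accuracy
        return model, current_accuracy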

3. Discovering, integrating, and cleansing data sources

Agile data science teams should always seek out new data sources to integrate and enrich their strategic data warehouses and data lakes. One important example is data siloed in SaaS tools used by marketing departments for reaching prospects or communicating with customers.

Other data sources might provide additional perspectives around supply chains, customer demographics, or environmental contexts that impact purchasing decisions.

Analytics owners should fill agile backlogs with story cards to research new data sources, validate sample data sets, and integrate prioritised ones into the primary data repositories. When agile teams integrate new data sources, they should consider automating the data integration, implementing data validation and quality rules, and linking data with master data sources.
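
A minimal sketch of such validation and quality rules might look like the following; the customer_id, email, and order_total columns are hypothetical examples of a new source’s fields.

    # Illustrative data quality checks for a newly integrated source (column names assumed).
    import pandas as pd

    def validate_new_source(df: pd.DataFrame) -> dict:
        """Return counts of rows failing a few basic quality rules."""
        return {
            "missing_customer_id": int(df["customer_id"].isna().sum()),
            "duplicate_customer_id": int(df["customer_id"].duplicated().sum()),
            "invalid_email": int((~df["email"].astype(str).str.contains("@", na=False)).sum()),
            "negative_order_total": int((df["order_total"] < 0).sum()),
        }

    # Example run on a small in-memory sample of the new source
    sample = pd.DataFrame({
        "customer_id": [1, 2, 2, None],
        "email": ["a@example.com", "not-an-email", "c@example.com", "d@example.com"],
        "order_total": [120.0, -5.0, 40.0, 10.0],
    })
    print(validate_new_source(sample))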

Julien Sauvage, vice president of product marketing at Talend, proposes the following guidelines for building trust in data sources.

“Today, companies need to gain more confidence in the data used in their reports and dashboards. It’s achievable with a built-in trust score based on data quality, data popularity, compliance, and user-defined ratings. A trust score enables the data practitioner to see the effects of data cleaning tasks in real time, which enables fixing data quality issues iteratively.”
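
As a back-of-the-envelope illustration of how such a composite score might be computed, the sketch below combines the four dimensions from the quote with weights that are purely assumptions, not Talend’s actual formula.

    # Illustrative composite trust score; the weights and 0-1 inputs are assumptions.
    def trust_score(quality: float, popularity: float, compliance: float, user_rating: float) -> float:
        """Combine four 0-1 signals into a single 0-100 trust score using assumed weights."""
        weights = (0.4, 0.2, 0.25, 0.15)
        signals = (quality, popularity, compliance, user_rating)
        return round(100 * sum(w * s for w, s in zip(weights, signals)), 1)

    # Re-scoring after each cleansing task shows the iterative improvement described above.
    print(trust_score(quality=0.62, popularity=0.8, compliance=0.9, user_rating=0.7))  # before cleanup
    print(trust_score(quality=0.85, popularity=0.8, compliance=0.9, user_rating=0.7))  # after cleanup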

The data science team should also capture and prioritise data debt. Historically, data sources lacked owners, stewards, and data governance implementations.

Without the proper controls, many data entry forms and tools did not have sufficient data validation, and integrated data sources did not have cleansing rules or exception handling. Many organisations have a mountain of dirty data sitting in data warehouses and lakes used in analytics and data visualisations.

Just as there is no quick fix for technical debt, agile data science teams should prioritise and address data debt iteratively. As the analytics owner adds user stories for delivering analytics, the team should review them and ask what underlying data debt must be itemised on the backlog and prioritised.

Implementing data governance with agile methodologies

The examples I shared all help data science teams improve data quality and deliver tools for leveraging analytics in decision making, products, and services.

In a proactive data governance program, issues around data policy, privacy, and security get prioritised and addressed in parallel to the work to deliver and improve data visualisations, analytics, machine learning, and dataops. Sometimes data governance work falls under the scope of data science teams, but more often, a separate group or function is responsible for data governance.

Organisations face growing competitive needs around analytics, along with data governance regulations, compliance requirements, and evolving best practices. Applying agile methodologies gives organisations a well-established structure, process, and set of tools to prioritise, plan, and deliver data-driven impacts.

