When developers deploy a new release of an application or microservice to production, how does IT operations know whether the release performs outside defined service levels? Can they proactively recognise issues and address them before they turn into business-impacting incidents?
And when incidents impact performance, stability, and reliability, can they quickly determine the root cause and resolve issues with minimal business impact?
Taking this one step further, can IT ops automate some of the tasks used to respond to these conditions rather than having someone in IT support perform the remediation steps?
And what about the data management and analytics services that run on public and private clouds? How does IT ops receive alerts, review incident details, and resolve issues from data integrations, dataops, data lakes, etc., as well as the machine learning models and data visualisations that data scientists deploy?
These are key questions for IT leaders deploying more applications and analytics as part of digital transformations. Furthermore, as devops teams enable more frequent deployments using CI/CD and infrastructure as code (IaC) automations, the likelihood that changes will cause disruptions increases.
What should developers, data scientists, data engineers, and IT operations do to improve reliability? Should they monitor applications or increase their observability? Are monitoring and observability two competing implementations, or can they be deployed together to improve reliability and shorten the mean time to resolve (MTTR) incidents?
I asked several technology partners who help IT develop applications and support them in production for their perspectives on monitoring, observability, AIops, and automation. Their responses suggest five practice areas to focus on to improve operational reliability.
Develop one source of operational truth between developers and operations
Over the last decade, IT has been trying to close the gap between developers and operations in terms of mindsets, objectives, responsibilities, and tooling. Devops culture and process changes are at the heart of this transformation, and many organisations begin this journey by implementing CI/CD pipelines and IaC.
Agreement on which methodologies, data, reports, and tools to use is a key step toward aligning application development and operations teams in support of application performance and reliability.
Mohan Kompella, vice president of product marketing at BigPanda, agrees, noting the importance of developing a single operational source of truth. “Agile developers and devops teams use their own siloed and specialised observability tools for deep-dive diagnostics and forensics to optimise app performance,” he says. “But in the process, they can lose visibility into other areas of the infrastructure, leading to finger-pointing and trial-and-error approaches to incident investigation.”
The solution? “It becomes necessary to augment the developers’ application-centric visibility with additional 360-degree visibility into the network, storage, virtualisation, and other layers,” Kompella says. “This eliminates friction and lets developers resolve incidents and outages faster.”
Understand how application issues impact customers and business operations
Before diving into an overall approach to application and system reliability, it’s important to put customer needs and business operations at the centre of the discussion.
Jared Blitzstein, director of engineering at Boomi, a Dell Technologies business, stresses that customer and business context are central to developing a strategy. “We have centered observability around our customers and their ability to gather insights and actions into the operation of their business,” he says. “The difference is we use monitoring to understand how our systems are behaving at a point in time, but leverage the concept of observability to understand the context and overall impact those items (and others) have on our customer’s business.”