
What site reliability engineers want application developers to know

5 best practices to keep everyone on the same page and avoid problems

It’s important for everyone working in IT to accept critical feedback and advice on improving processes, quality, and collaboration.

For agile development teams, that feedback often comes from product owners, business relationship managers, stakeholders, customers, and end-users of the applications being developed and supported. If an application is hard to use, performs slowly, or doesn’t address workflow needs, agile teams must act on this critical feedback and adjust backlog priorities.

It’s equally critical to receive feedback from the operational teams supporting applications in development, test, and production environments. SREs (site reliability engineers) are the people most responsible for the reliability and performance of production applications, and they are a critically important source of best practices and feedback for development teams.

In the spirit of walking in your colleagues’ shoes, developers should consider the responsibilities, tools, and activities of SREs. Here is some of their advice on how developers can improve applications, development processes, and tools that impact performance.

Collaborate with SREs as one devops team

Technology organisation leaders assign SREs to work with one or a handful of agile development teams. In many cases, the number of developers and development teams is significantly higher than the number of SREs. It’s common for SREs to split their time across multiple domains and teams, and they must learn the business and technical specifics of many applications.

Regardless of the organization and team structure, developers must consider SREs as part of the team with aligned objectives.

I spoke with Jason Walker, field CTO of BigPanda, about the required alignment since SREs spend most of their time addressing production incidents and investigating performance issues, while developers are likely to be working on the next feature. Walker suggests, “It’s not enough to form an SRE team and assume they will chase down all the issues alone. Developers have to modify and modernise their processes, toolsets, and mindset at the same time.”

In practice, this means developers should address nonfunctional issues and take feedback from SREs on what types of problems to address. I recommend development teams dedicate 30 per cent of a release’s velocity to technical debt, performance issues, security gaps, and reliability improvements.

Most important, developers, test engineers, and SREs must collaborate as a responsible devops team by balancing the pressures to release more features faster with the work necessary to ensure reliability, performance, and security.

Understand the infrastructure, environment, and components

If developers and SREs are partners, each has to understand the other’s roles and environments better. For developers, this means understanding the infrastructure, environments, cloud services, and application components that their application or service depends on or runs in.

I spoke with Will Cappelli, CTO of Europe, Middle East, and Africa and VP of product strategy at Moogsoft, about this challenge. “Development needs to become more ‘mindful.’ This is not about a return to rigid, top-down development processes. Instead, it means that development must continuously anticipate, observe, and respond to the behaviour of components that it releases into the production environment. This, in turn, means the aggressive application of AI to the metrics, logs, and traces being generated by those components.”

Cappelli is suggesting that even though many development teams are developing microservices, automating their testing, deploying with CI/CD (continuous integration/continuous deployment), and configuring runtime environments with infrastructure as code, developers still must understand the environment and anticipate the different types of problems that can arise.
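For example, the kind of metrics and traces Cappelli describes can start with a few lines of instrumentation in the service itself. The sketch below uses the OpenTelemetry Python API; the service name, span, and attributes are illustrative assumptions, and a real service would also configure exporters so the data reaches whatever monitoring backend the SREs rely on.

```python
# A minimal instrumentation sketch using the OpenTelemetry Python API.
# Without a configured provider and exporter these calls are no-ops, so a real
# deployment would also wire the SDK up to its monitoring backend.
from opentelemetry import metrics, trace

tracer = trace.get_tracer("checkout-service")   # illustrative service name
meter = metrics.get_meter("checkout-service")
orders_processed = meter.create_counter(
    "orders_processed", description="Orders successfully processed"
)

def process_order(order_id: str) -> None:
    # Each request produces a trace span that SREs can follow across services.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic goes here ...
        orders_processed.add(1, {"region": "ap-southeast-2"})
```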

Ensure code, log messages, and exceptions are understandable

Developers should also take steps to help SREs learn the applications, services, and development environments.

When a major incident occurs in the production environment, SREs must review all the monitoring alerts, log messages, and exceptions leading up to and during the incident. Their goal is to restore service quickly to minimise the impact on the business and end-users and also perform a root cause analysis.

When developers don’t provide easy-to-understand log messages, exceptions, or code comments, that task becomes more difficult.

Walker of BigPanda agrees and suggests that developers should address the question, “What should monitoring this app require when I have to hand it to someone else? Otherwise, they can forward the error logs to their SRE, but what does it even mean?”
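As a rough illustration, the sketch below shows a log message written for the person who has to triage an incident rather than for the developer who wrote the code. The payment-gateway client and the PaymentRetryableError exception are hypothetical stand-ins; the point is the context and the plain-language guidance in the message.

```python
import logging

logger = logging.getLogger("payments.service")

class PaymentRetryableError(Exception):
    """Raised when a charge failed for a transient reason and can be retried."""

def charge_customer(gateway, customer_id: str, amount_cents: int) -> None:
    # 'gateway' is a hypothetical payment-gateway client used only for illustration.
    try:
        gateway.charge(customer_id, amount_cents)
    except TimeoutError as exc:
        # Say what failed and include the context an SRE needs during an incident:
        # the operation, the customer, the amount, and whether a retry is safe.
        logger.error(
            "Payment gateway timed out charging customer %s for %d cents; "
            "no charge was recorded, safe to retry",
            customer_id,
            amount_cents,
            exc_info=exc,
        )
        raise PaymentRetryableError(
            f"gateway timeout charging customer {customer_id}"
        ) from exc
```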

Label stories that impact reliability, performance, and security

Let’s take this one step further and also consider how best to engage SREs during the development process. If the ratio of developers to SREs is high, the implication is that the number of agile user stories planned or active in the sprint is even higher. It’s unrealistic to expect SREs to read through every requirement and evaluate their operational risks.

Development teams and application architects can help by defining, labelling, and increasing their estimates of higher-risk user stories and defects. I’ve implemented processes that include the following steps:

  • Architects should define criteria that help development teams understand what types of implementations to flag for reliability, performance, and security considerations
  • Product owners and agile technical leads should label stories that meet these risk criteria. Labelling issues and cards can be done easily in agile tools such as Jira Software and Azure DevOps (see the sketch below). This makes it simpler for SREs, architects, and infosec to identify which ones to review
  • Development teams should adjust their agile estimates to reflect the nonfunctional acceptance criteria based on the risks identified
  • Developers should implement sufficient exception handling, testing, and monitoring appropriate to the implementation and risk type
  • Scrum masters should ask SREs, architects, or infosec to participate in the relevant sprint reviews so that they can evaluate the risk remediations implemented

These steps reflect a balance between achieving business goals, ensuring the reliability of applications, and acknowledging the staffing limitations of many IT organisations.
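Labelling can even be scripted. The sketch below assumes Jira Cloud’s REST API along with a hypothetical site URL and label set; a small automation or workflow step like this can apply the agreed risk label so SREs, architects, and infosec can filter for the stories they need to review. Azure DevOps supports the same idea with work item tags.

```python
import requests

JIRA_BASE_URL = "https://your-company.atlassian.net"  # hypothetical Jira Cloud site
RISK_LABELS = {"reliability-risk", "performance-risk", "security-risk"}  # example labels

def label_story(issue_key: str, label: str, auth: tuple[str, str]) -> None:
    """Add a risk label to a Jira issue so reviewers can filter on it."""
    if label not in RISK_LABELS:
        raise ValueError(f"unknown risk label: {label}")
    response = requests.put(
        f"{JIRA_BASE_URL}/rest/api/2/issue/{issue_key}",
        json={"update": {"labels": [{"add": label}]}},
        auth=auth,  # Jira Cloud basic auth: (email, API token)
        timeout=10,
    )
    response.raise_for_status()

# Example: label_story("APP-123", "reliability-risk", ("ci-bot@example.com", "api-token"))
```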

Shift testing left and invest in application monitoring

Acknowledging development risks and implementing remediations at the story level is one tactic in reducing operational risk. This should be part of an overall philosophy of shift-left testing where most of the testing is automated, and agile teams, including developers and test automation engineers, implement an appropriate level of continuous testing in the CI/CD pipeline.
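As a deliberately simple example, the sketch below shows the kind of pytest checks a pipeline could run on every commit, covering both a functional assertion and a nonfunctional latency budget. The create_order function and the 200 ms budget are assumptions for illustration, not a prescribed standard.

```python
# Illustrative pytest checks a CI/CD pipeline could run on every commit.
# 'myservice.orders.create_order' and the 200 ms budget are hypothetical.
import time

from myservice.orders import create_order

def test_create_order_returns_confirmation():
    order = create_order(customer_id="c-42", items=["sku-1"])
    assert order.status == "confirmed"

def test_create_order_meets_latency_budget():
    start = time.perf_counter()
    create_order(customer_id="c-42", items=["sku-1"])
    elapsed_ms = (time.perf_counter() - start) * 1000
    assert elapsed_ms < 200, f"create_order took {elapsed_ms:.0f} ms; budget is 200 ms"
```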

This level of testing is complicated by the pandemic and the shift to remote work. In a recent study by Kobiton on Covid-19’s impact on mobile QA, 55 per cent of respondents suggested investing in remote-working culture, and 50 per cent recommended that IT organisations evaluate tools that enable remote testing teams.

Remote working also impacts agile development, and distributed teams adopting devops cultures and practices must also adapt collaboration practices.

While shift-left testing and building security practices into agile development are best practices, implementing application monitors and deploying AIOps solutions such as BigPanda or Moogsoft also require development team support. These systems bridge the world of knowns that development teams can test with the world of unknowns impacting production environments.
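One way development teams provide that support is by sending structured, descriptive events that an AIOps platform can correlate, rather than raw stack traces. The endpoint and payload below are hypothetical placeholders; BigPanda, Moogsoft, and similar tools each document their own ingest APIs.

```python
import socket
from datetime import datetime, timezone

import requests

# Hypothetical ingest endpoint; substitute the alert API documented by your AIOps vendor.
ALERTS_ENDPOINT = "https://aiops.example.com/api/alerts"

def send_alert(check: str, status: str, description: str) -> None:
    """Forward a structured, human-readable alert instead of a raw stack trace."""
    payload = {
        "host": socket.gethostname(),
        "check": check,              # e.g. "checkout-service.http_5xx_rate"
        "status": status,            # "critical", "warning", or "ok"
        "description": description,  # plain language an on-call SRE can act on
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    requests.post(ALERTS_ENDPOINT, json=payload, timeout=5).raise_for_status()
```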

Development teams should consider feedback from SREs and others working in IT operations. Fewer operational issues mean that everyone can focus more on delivering capabilities, satisfying end-users, and researching new technologies.