What is Chaos Monkey? Chaos engineering explained

By wreaking random havoc on your systems in production, Chaos Monkey teaches you how to make those systems stronger

Pioneered at Netflix during its shift from distributing DVDs to building distributed cloud systems for streaming video, Chaos Monkey introduced an engineering principle that has been embraced by software development organisations of all shapes and sizes: by intentionally breaking systems, you can learn to make them more resilient.

According to the original Netflix blog post on the topic, published in July 2011 by Yury Izrailevsky, then director of cloud and systems infrastructure, and Ariel Tseitlin, director of cloud solutions at the streaming company, Chaos Monkey was designed to randomly disable production instances on its Amazon Web Services (AWS) infrastructure, thus exposing weaknesses that Netflix engineers could eliminate by building better automatic recovery mechanisms.

The catchy name came from “the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption,” the blog post states.

In practice this would involve a simple application “picking an instance at random from each cluster, and at some point during business hours, turn it off without warning. It would do this every workday,” as detailed by ex-Netflix engineers Nora Jones and Casey Rosenthal in their comprehensive book on the topic, Chaos Engineering, published by O’Reilly Media.
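The behaviour described in that quote can be sketched in a few lines of Python. This is a minimal illustration, not Netflix's code: the cluster inventory, the business-hours window, and the `terminate` callback are all assumptions made for the example.

```python
import random
from datetime import datetime, time

# Hypothetical inventory: cluster name -> instance IDs (illustrative values only).
CLUSTERS = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "playback": ["i-play-1", "i-play-2"],
}

# Assumed business hours; the source does not specify a window.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(15, 0)


def is_business_hours(now):
    """True from Monday to Friday between the assumed start and end times."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def pick_victims(clusters, rng):
    """Choose one instance at random from every non-empty cluster."""
    return {name: rng.choice(ids) for name, ids in clusters.items() if ids}


def run_once(clusters, now, rng, terminate):
    """Terminate one random instance per cluster, but only in business hours."""
    if not is_business_hours(now):
        return {}
    victims = pick_victims(clusters, rng)
    for instance_id in victims.values():
        terminate(instance_id)  # a real tool would call a cloud API here
    return victims
```

Calling `run_once(CLUSTERS, datetime.now(), random.Random(), print)` would print the chosen instance IDs rather than shutting anything down, which is a safe way to try the idea out.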

The idea is that once engineers know where their weakest spots are, they can set automated triggers to handle an issue, sparing them a call in the middle of the night when something goes wrong. Chaos Monkey has since grown into a broader set of practices under the banner of chaos engineering.

Chaos Monkey at Netflix

Chaos Monkey grew out of engineering efforts at Netflix around 2010, when Greg Orzell — now leading chaos engineering at Microsoft-owned GitHub — was tasked with building resiliency into the company’s new cloud-based architecture.

“The way I think about Chaos Monkey isn’t a major feat of engineering,” Orzell told InfoWorld. “The value it brings is a change in mindset that was critical at the time as we went from shipping DVDs to streaming via the internet.”

In the early days, Netflix engineers introduced a whole range of outages and issues into their systems using a “Simian Army” of open source tools, each accounting for a certain type of failure, starting with Chaos Monkey taking out AWS instances.

The original army (now mostly retired in favour of new tools) included the likes of Latency Monkey, which induced artificial delays in the RESTful client-server communication layer, and Doctor Monkey, which tapped into the health checks that run on each instance and monitored other external signs of health (e.g. CPU load) to detect unhealthy instances and remove them from service if required.
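To make the Latency Monkey idea concrete, here is a minimal Python sketch of delay injection. It is not the Simian Army's implementation: the `inject_latency` decorator, its parameters, and the `fetch_profile` function are hypothetical names invented for this example.

```python
import random
import time
from functools import wraps


def inject_latency(probability, delay_seconds, rng=None, sleep=time.sleep):
    """Decorator that delays a call with the given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                sleep(delay_seconds)  # artificial delay before the real call
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(probability=0.1, delay_seconds=0.5)
def fetch_profile(user_id):
    # Stand-in for a real REST call; roughly one call in ten is slowed down.
    return {"id": user_id}
```

Passing the `sleep` function in as a parameter is a design choice for testability: a test can record the requested delays instead of actually waiting.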

Chaos Kong took Chaos Monkey to the next level by simulating an outage of an entire AWS region. “It is very rare that an AWS Region becomes unavailable, but it does happen,” a Netflix blog post from 2015 notes.

“By running experiments on a regular basis that simulate a Regional outage, we were able to identify any systemic weaknesses early and fix them,” the post continues. “When US-EAST-1 actually became unavailable, our system was already strong enough to handle a traffic failover.”

As Jones and Rosenthal outline in their book, letting Chaos Kong loose on the infrastructure was “a white-knuckle affair with a ‘war room’ assembled to monitor all aspects of the streaming service, and it lasted hours.”

Two years later, in July 2017, Netflix introduced ChAP, the Chaos Automation Platform, which “interrogates the deployment pipeline for a user-specified service. It then launches experiment and control clusters of that service, and routes a small amount of traffic to each,” a Netflix blog post states.
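One way to picture routing “a small amount of traffic to each” cluster is stable, hash-based bucketing. The Python sketch below is purely illustrative and makes no claim about how ChAP works internally; the `route` function, the cluster labels, and the `sample_percent` parameter are assumptions.

```python
import hashlib


def route(request_id, sample_percent=1.0):
    """Assign a request to 'experiment', 'control' or 'production'.

    sample_percent is the share of traffic sent to EACH of the two test clusters.
    """
    if not 0 <= sample_percent <= 50:
        raise ValueError("sample_percent must be between 0 and 50")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # stable bucket 0..9999
    threshold = round(sample_percent * 100)              # 1% -> 100 buckets
    if bucket < threshold:
        return "experiment"
    if bucket < 2 * threshold:
        return "control"
    return "production"
```

Because the bucket is derived from a hash of the request ID, the same request always lands in the same group, which keeps the control and experiment populations comparable.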

Chaos engineering principles

Basic Chaos Monkey practices evolved quickly, through bigger and bigger deployments such as Chaos Kong, into what was later formalised as chaos engineering. Netflix didn’t build its own formal chaos engineering team until 2015. That team was headed up by Bruce Wong, now director of engineering at Stitch Fix.

The principles of chaos engineering have been formally collated by some of the original authors of Chaos Monkey, defining the practice as: “The discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production.”

In practice this takes the form of a four-step process:

Define the “steady state” of a system to set a baseline for normal behaviour
Hypothesise that this steady state will continue in both the control group and the experimental group
Introduce variables that reflect real-world events such as servers that crash, hard drives that malfunction, or network connections that are severed
Try to disprove the hypothesis by looking for a difference between the control group and the experimental group
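The four steps can be walked through with a small simulation. The Python sketch below is a toy model under stated assumptions: the steady-state metric is a request success rate, the injected variable is an extra failure probability, and the function names and tolerance are hypothetical.

```python
import random


def success_rate(rng, failure_rate, requests=10_000):
    """Steady-state metric: fraction of simulated requests that succeed."""
    successes = sum(rng.random() >= failure_rate for _ in range(requests))
    return successes / requests


def run_experiment(baseline_failure, injected_failure, tolerance, seed=0):
    # Step 1: the steady state is the success rate at the baseline failure rate.
    # Step 2: hypothesis: control and experiment stay within `tolerance`.
    control = success_rate(random.Random(seed), baseline_failure)
    # Step 3: introduce the variable (extra failures) in the experiment only.
    experiment = success_rate(
        random.Random(seed + 1), baseline_failure + injected_failure
    )
    # Step 4: try to disprove the hypothesis by comparing the two groups.
    holds = abs(control - experiment) <= tolerance
    return {"control": control, "experiment": experiment, "hypothesis_holds": holds}
```

A call such as `run_experiment(0.01, 0.2, 0.02)` models a system that does not absorb the injected failures, so the hypothesis is disproved; with `injected_failure=0.0` the two groups stay within tolerance.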

If the steady state is difficult to disrupt, you have a robust system; if there is a weakness then you have something to go and fix.

“In the five years since ‘The Principles’ was published, we have seen chaos engineering evolve to meet new challenges in new industries,” Jones and Rosenthal observe. “The principles and foundation of the practice are sure to continue to evolve as adoption expands through the software industry and into new verticals.”

Chaos engineering with Chaos Monkey

To run the open source version of Chaos Monkey, your systems will have to meet a set of prerequisites, as outlined on GitHub.

Chaos Monkey does not run as a service, so you will have to set up a cron job, as outlined on the GitHub page, which calls Chaos Monkey once every weekday to create a schedule of terminations.
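What “a schedule of terminations” might look like can be sketched in Python. This is not Chaos Monkey's actual scheduling algorithm: the `build_schedule` function, the time window, and the `mean_days_between_kills` parameter are assumptions chosen to illustrate a once-a-day job that plans randomly timed terminations.

```python
import random
from datetime import date, datetime, time, timedelta


def build_schedule(groups, day, rng, start=time(9, 0), end=time(15, 0),
                   mean_days_between_kills=2):
    """Return sorted (when, group) pairs for the terminations planned on `day`."""
    if day.weekday() >= 5:  # no terminations at weekends
        return []
    window_start = datetime.combine(day, start)
    window_seconds = int((datetime.combine(day, end) - window_start).total_seconds())
    schedule = []
    for group in groups:
        # Each group is picked with probability 1/mean, so on average it is
        # terminated once every `mean_days_between_kills` working days.
        if rng.random() < 1 / mean_days_between_kills:
            offset = rng.randrange(window_seconds)
            schedule.append((window_start + timedelta(seconds=offset), group))
    return sorted(schedule)
```

A daily job could call `build_schedule(["api", "playback"], date.today(), random.Random())` and hand each entry to whatever mechanism performs the termination at the planned time.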

To use this version of Chaos Monkey, you must be using Netflix’s own, open source, continuous delivery platform, Spinnaker, which can limit the ability of certain organisations to adopt the method. Chaos Monkey also requires a MySQL-compatible database, version 5.6 or later.

Service owners set their Chaos Monkey configs through Spinnaker. Chaos Monkey works through Spinnaker to get information about how services are deployed and terminates instances — virtual machines or containers — at random on a frequency and schedule you specify.

Of course, implementing Chaos Monkey is only the beginning of the difficult and complex task of resolving system resiliency issues. Chaos Monkey merely uncovers the weaknesses in the system; it’s then up to devops or systems engineering teams to identify their causes and come up with solutions.

“The tooling itself is not expensive, but the investment you have to make to react to the tooling is,” as Orzell puts it. Committing to chaos engineering also requires shifting resources from building new features to beefing up resilience. “Every business is at a different point on that spectrum and they each have to decide how much to dial up or down in that space,” he adds.

Jones and Rosenthal say that in the early days, Netflix engineers “received a lot of pushback from financial institutions in particular.”

Although the stakes are higher for banks, they still suffered outages. Many of those organisations therefore changed their mindset and carefully adopted a “proactive strategy like chaos engineering to understand risks in order to prevent large, uncontrolled outcomes,” with Capital One an early adopter, as detailed in the book.

Chaos engineering resources

The latest and definitive book on the topic is Chaos Engineering by ex-Netflix engineers Nora Jones and Casey Rosenthal, published in April 2020, which builds on much of the work those authors, and others, compiled in the 2017 book of the same name. For a more practical overview, see Russ Miles’s Learning Chaos Engineering.

Netflix provides a wealth of resources on the topic on GitHub, including a tutorial, lots of documentation, an error counter, outage checker, and decryptor tools.

Gremlin — a provider of commercial tools for running chaos engineering experiments — offers its own comprehensive set of resources, which are available for free online and in PDF format. The company also backs various community efforts including Chaos Conf and a Slack channel.

O’Reilly also has a wealth of resources, including a handy playlist of books and videos on the topic.