The bulk update behind Google Cloud’s global service disruptions

The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions

In the late afternoon of 26 March (US Pacific time), Google warned users that several of its cloud services were experiencing disruptions due to issues with Google Cloud infrastructure components, and that Cloud Composer environment creations were failing globally.

At its peak, Google Cloud said the issues had impacted no fewer than 20 of its cloud services in multiple regions, including Dataflow, BigQuery, DialogFlow, Kubernetes Engine, Cloud Firestore, App Engine, Cloud Functions, Cloud Monitoring, Cloud Console and more.

It wasn’t until the following day that Google Cloud said it had resolved the issue with the “Google Cloud infrastructure components” underpinning the disruptions, an issue that had affected users across the globe for several hours.

Now, after conducting an internal investigation and taking steps to improve the resiliency of its services, the cloud vendor has revealed the core issue that caused the global disruptions.

“Many cloud services depend on a distributed Access Control List (ACL) in Identity and Access Management (IAM) for validating permissions, activating new APIs [application programming interfaces], or creating new cloud resources,” Google Cloud said in a post dated 1 April. “These permissions are stored in a distributed database and are heavily cached. 

“Two processes keep the database up-to-date; one real-time, and one batch. However, if the real-time pipeline falls too far behind, stale data is served, which may impact operations in downstream services.

“The trigger of the incident was a bulk update of group memberships that expanded to an unexpectedly high number of modified permissions, which generated a large backlog of queued mutations to be applied in real-time. 

“The processing of the backlog was degraded by a latent issue with the cache servers, which led to them running out of memory; this in turn resulted in requests to IAM timing out. The problem was temporarily exacerbated in various regions by emergency rollouts performed to mitigate the high memory usage,” it said. 
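Google has not published any code for the IAM pipeline it describes, but the failure mode in its explanation — a bulk group change fanning out into far more permission mutations than expected, and a real-time queue that serves stale data once it falls behind — can be sketched in a few lines of Python. The sketch below is purely illustrative: the names (PermissionCache, enqueue_group_update, apply_realtime and so on) are assumptions for the example, not Google’s actual IAM internals, and it models only the fan-out and the resulting staleness, not the cache-server memory exhaustion that compounded the incident.

```python
from collections import deque

# Hypothetical model of the pattern described in Google's post-mortem:
# permissions live in a backing store, are heavily cached for reads, and
# are kept fresh by a real-time mutation queue (plus a periodic batch
# reconcile, omitted here). None of these names are Google's.

class PermissionCache:
    def __init__(self):
        self.store = {}            # authoritative (member, resource) -> role
        self.cache = {}            # heavily-read copy served to callers
        self.realtime_queue = deque()

    def check(self, member, resource):
        # Reads are served from the cache; if the real-time pipeline is
        # behind, this answer can be stale.
        return self.cache.get((member, resource))

    def enqueue_group_update(self, group_members, resources, role):
        # A single "bulk" group change fans out into one mutation per
        # (member, resource) pair -- an unexpectedly large expansion here
        # is what creates the backlog.
        for m in group_members:
            for r in resources:
                self.realtime_queue.append((m, r, role))

    def apply_realtime(self, budget):
        # The real-time pipeline can only apply so many mutations per
        # cycle; anything beyond `budget` stays queued, and the cache
        # keeps serving old data in the meantime.
        applied = 0
        while self.realtime_queue and applied < budget:
            m, r, role = self.realtime_queue.popleft()
            self.store[(m, r)] = role
            self.cache[(m, r)] = role
            applied += 1
        return len(self.realtime_queue)   # remaining backlog

# A bulk membership update that expands far wider than expected:
iam = PermissionCache()
iam.enqueue_group_update(
    group_members=[f"user{i}" for i in range(10_000)],
    resources=[f"project{j}" for j in range(50)],
    role="viewer",
)
backlog = iam.apply_realtime(budget=5_000)
print(f"mutations still queued after one cycle: {backlog}")
# While this backlog drains, check() keeps returning stale permissions --
# the "stale data" impact Google describes.
```

In the real incident the drain was further slowed by the cache servers running out of memory, which is why requests to IAM began timing out rather than merely returning stale results.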

According to Google Cloud, on Thursday 26 March at 4:14PM (US Pacific time), Cloud IAM experienced elevated error rates, which caused stale data and disruption across many services for three-and-a-half hours and led to continued disruption of administrative operations for a subset of services for 14 hours.

Additionally, multiple services experienced bursts of Cloud IAM errors, Google Cloud said, adding that the spikes were largely clustered around a handful of specific times between 4:35PM and 7:40PM on 26 March.

“Error rates reached up to 100 per cent in the later two periods as mitigations propagated globally,” Google Cloud said. “As a result, many cloud services experienced concurrent outages in multiple regions, and most regions experienced some impact.”

