ARN

No, you don’t have to run like Google

Just because Google, Amazon, or Facebook does it doesn’t mean you should. Here are four ‘best practices’ of the hyperscalers you have permission to ignore
  • Matt Asay (InfoWorld)
  • 12 October, 2020 17:13
Sundar Pichai (Google)


Years ago, Google struggled with how to pitch its cloud offerings. Back in 2017 I suggested that the company should help mainstream enterprises to “run like Google,” but a senior Google Cloud product executive told me in conversation that the company shied away from this approach.

The concern? That maybe mainstream enterprises didn’t share Google’s needs, or maybe Google would simply intimidate them.

For the mere mortals that run IT within such mainstream enterprises (read: almost everyone), fear not. It turns out there are many things that Google might do that make no sense for your own IT needs.

Just ask Colm MacCárthaigh, AWS engineer and one of the authors of the Apache HTTP Server, who asked for “examples of technical things that don’t make sense for everyone just because Amazon, Google, Microsoft, Facebook” do them. The answers—excessive uptime guarantees, site reliability engineering, microservices, and mono-repos among the highlights—are instructive.

Excessive uptime guarantees

“Five or five-plus nines availability guarantees,” says Pete Ehlke. “Outside of medicine and 911 call centres, I can’t think of anything shy of FAANG [Facebook, Amazon, Apple, Netflix, and Google] scale that actually needs five nines, and the ROI pretty much never works out.”

I remember this one well from the variety of start-ups for which I worked, as well as when I was at Adobe (whose service-level commitments tend not to be five nines, but are arguably higher than necessary). Are you going to be OK if the multi-player game goes down? Yep. What about Office 365 for a few minutes, or even hours? Yes and yes.
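To put those nines in perspective, the arithmetic is simple enough to sketch. The snippet below is an illustration only: the function name is hypothetical, and it assumes a 365-day year and treats all downtime as counting equally against the target.

```python
# Minutes in a non-leap year (an assumption made for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of `nines` nines (3 means 99.9%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        availability = 100 * (1 - 10 ** -n)
        minutes = allowed_downtime_minutes(n)
        print(f"{n} nines ({availability:.3f}%): {minutes:,.1f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime a year, while three nines leaves close to nine hours. Whether closing that gap is worth the engineering cost is exactly the ROI question Ehlke raises.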

Site reliability engineering

A bit of a spin on devops (though it predates the devops movement), SRE (named in multiple replies to MacCárthaigh) came out of Google in 2003, and was designed to infuse engineering with an operational focus. A few core principles guide SRE:

  • Embrace risk
  • Utilize service level objectives (SLOs)
  • Eliminate toil
  • Monitor distributed systems
  • Leverage automation and embrace simplicity

Or, as Ben Treynor, who developed Google’s SRE practice, describes it:

SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labour. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

SREs spend much of their time on automation, with the ultimate goal being to automate away their job. They spend considerable time on “operations/on-call duties and developing systems and software that help increase site reliability and performance,” says Silvia Pressard.

This sounds important, and even more so if you equate “site reliability” with “business availability.” But do most companies really need their developers to become operational experts? SRE might be critical at Google or Amazon, but it’s arguably a heavy lift for most enterprises, tasking developers with too much of an operational load for them to manage it successfully.

Microservices architecture

As commentator “Buzzy” tells it, “Definitely microservices. The number of 20-staff-in-total companies I’ve had to talk down from that ledge….” Nor is he the only one to call out microservices as a needless complication for most enterprises. Many of the replies to MacCárthaigh’s tweet mentioned microservices.

As Martin Fowler has argued, “While [microservices] is a useful architecture—many, indeed most, situations would do better with a monolith.” Wait, what? Aren’t monoliths an evil relic of the past? Of course it’s not that simple. As I’ve written,

The great promise of microservices is freedom. Freedom to break up an application into distinct services that are independently deployable. Freedom to build these disparate services with different teams using their preferred programming language, tooling, database, etc. In short, freedom for development teams to get stuff done with minimal bureaucracy.

But for many applications, that freedom comes at unnecessary costs, as Fowler highlights:

  • Distribution: Distributed systems are harder to program, since remote calls are slow and are always at risk of failure
  • Eventual consistency: Maintaining strong consistency is extremely difficult for a distributed system, which means everyone has to manage eventual consistency
  • Operational complexity: You need a mature operations team to manage lots of services, which are being redeployed regularly

This last point is underlined by Sam Newman: “For a small team, a microservice architecture can be hard to justify, as there is work required just to handle the deployment and management of the microservices themselves.”
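Fowler’s first cost, distribution, can also be made concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every remote call succeeds independently with the same probability and that a request fails if any call in its chain fails; real systems add retries, timeouts, and correlated failures that this ignores. The function name and the 99.9% figure are hypothetical.

```python
def chain_success_rate(per_call_success: float, calls: int) -> float:
    """Probability that `calls` independent, sequential remote calls all succeed."""
    if not 0.0 <= per_call_success <= 1.0:
        raise ValueError("per_call_success must be a probability between 0 and 1")
    if calls < 0:
        raise ValueError("calls must be non-negative")
    return per_call_success ** calls


if __name__ == "__main__":
    # Hypothetical per-call success rate of 99.9%.
    for calls in (1, 5, 20):
        rate = chain_success_rate(0.999, calls)
        print(f"{calls:>2} remote calls: {rate:.2%} of requests succeed end to end")
```

With these made-up numbers, a request that fans out across 20 services succeeds roughly 98% of the time even though each hop is 99.9% reliable. A monolith making the same 20 calls in-process does not pay that tax.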

This is not to say that a microservices approach is always wrong. It is simply a suggestion that we shouldn’t default to a more complicated (but scalable) approach just because the hyperscalers use it, generally because scale is so critical to them.

A mono-repo to rule them all

Whether microservices or monolithic in nature, you probably shouldn’t store your code in a “mono-repo.” This was a common response to MacCárthaigh’s request. Mono-repos store all of a company’s code in a single version control system (VCS) repository, to seize on the (supposed) benefits of reducing duplication of code and increasing collaboration between teams.

That’s the theory.

The practice, however, is very different. “It quickly becomes unreasonable for a single developer to have the entire repository on their machine, or to search through it using tools like grep,” says Matt Klein, an engineer who has built some of the most sophisticated systems at Amazon, Twitter, and now Lyft.

“Given that a developer will only access small portions of the codebase at a time, is there any real difference between checking out a portion of the tree via a VCS or checking out multiple repositories? There is no difference.”

Klein continues:

In terms of collaboration and code sharing, at scale, developers are exposed to subsections of code through higher layer tooling. Whether the code is in a mono-repo or poly-repo is irrelevant; the problem being solved is the same, and the efficacy of collaboration and code sharing has everything to do with engineering culture and nothing to do with code storage.

You be you

Of course, some companies may benefit from mono-repos or five-nines availability or microservices or SRE. They might also benefit from rolling their own framework, building their own infrastructure, or any of the other things that commenters on MacCárthaigh’s thread deride.

The point is that just because Google, Facebook, Amazon, or another hyperscaler does it, doesn’t mean you should. When in doubt, doubt. Start with the individual needs of your company and figure out the right approach to building and managing software according to who you are, not who you wish you were.