ARN

How to build an internal developer platform, from those who have done it

Twitter, Two Sigma, Yelp, and Zalando explain why they built their own software development platforms and share what lessons they learned along the way

For organisations that want to build and deploy software faster, internal developer platforms (IDPs) have emerged as a key component of their software engineering culture.

Every IDP is different, but what they have in common is a goal: to abstract away cumbersome infrastructure decisions for software developers, easing the operations burden on overstretched devops teams.

That doesn’t mean every organisation should build its own internal developer platform, but for those that find themselves drowning in complexity, constantly wrestling with legacy systems, or unable to scale their engineering team to meet the demands of the business, an IDP could be the answer.

“You have to start at the grassroots level,” said Kaspar von Grünberg, CEO of Humanitec, a startup aimed at helping organisations build IDPs. “We usually see organisations take a small group of their best engineers and ask them to be the glue across segregated toolchains. Then you start to centralise this around a common API that teams can work against and bring structure to that sea of unstructured tools.”

The cultural shift required to move to an IDP—complete with its own internal platform team—should not be underestimated. Transparency, regular communication, and adopting a product-first mindset are all required to ensure the platform achieves its intended goals. Even engineering powerhouses like Netflix will tell you how tough it can be.

“There were moments where application developers felt the platform team was not focused appropriately on their needs, and other times when platform teams felt overtaxed by user demands,” wrote Frank San Miguel, a senior software engineer at Netflix, in a blog post. “We got through these tough spots by being open and honest with each other.”

InfoWorld talked to four companies that have built their own internal developer platforms to hear why they did it, where best to start, what they learned along the way, and what can be achieved if you pull it off.

Zalando: Fast growth and too many systems created the pain that led to an IDP

German e-commerce giant Zalando has thousands of developers spread across the world, all of whom use some form of internal platform to deploy their code. But that wasn’t always the case.

Back in 2014, the company was growing at an extraordinary pace, adding as many as 70 engineers a week to meet growing demand. This growth quickly led to internal bottlenecks, with an IT operations team starting to drown in requests. Simply hiring more people wasn’t going to solve the problem long-term.

“If you need to release faster, you play this game of unblocking impediments and removing bottlenecks and create a strategy to solve this root cause,” said Jan Loeffler, CTO at Plex and former head of platform at Zalando. “It starts with trying things and shortening your lead time to ship software and quickly getting feedback.”

At that time, the tech stack at Zalando was predominantly Java and Python running on all sorts of infrastructure, with no central platform for compiling, building, and testing apps and services. Each team had its own way of doing CI/CD, with limited control or audit capabilities across the whole organisation.

The first approach to solving this was a big bet on the public cloud, Docker containers, and a central CI/CD pipeline. Over years of iteration this eventually coalesced into what we now understand as an IDP.

“Cultural changes were required in how Zalando developed software and how the company can grow from a fast follower to being the market leader,” Loeffler said. “There was a lot of change required in how we hire and onboard people and foster a culture of innovation, and that requires a platform that enables scale and innovation.”

Fortunately, the pain of the existing way of doing things was enough motivation for the business to buy into the idea of an IDP.

So the company identified key engineers to start a platform team to collect requirements. “Don’t have a separate team working alone in a dark corner,” Loeffler said. “They need to be involved early on and meeting the developer teams if they want to gain that credibility and trust.”

The results have been impressive. When Loeffler left the company in 2016, there was a team of about 70 managing the central platform, which was powering 170 production releases a day across thousands of internal developers.

Two Sigma: A sprawl of approaches required a product mentality to create an IDP

New York-based hedge fund and financial services firm Two Sigma has $58 billion in assets under management and is best known for its use of technology in driving trading strategies.

Five years ago, the firm found itself struggling to harness the complexity that comes with having hundreds of developers working on everything from legacy homegrown software running on-premises to complex machine learning projects built on Google Cloud or AWS, and everything in between.

“It tends to become obvious when you need to build your own platform,” said Camille Fournier, head of platform engineering at Two Sigma. “If you are using something like Heroku, you will hit scaling limits and see teams peel off and do their own thing. If a team is supposed to support this platform and you see them leave the paved paths of your current offering, you know you have an opportunity that you need to solve for.”

At Two Sigma today, that platform comprises a Git environment for building, testing, and reviewing code and an internal execution environment for packaging that code in a container, with all of the underlying operational, monitoring, and compliance considerations abstracted away for the developers.

“The most important thing is to approach this from a product perspective,” Fournier said. “Engineers don’t always think about their tools as products and how they work together. That is where internal platform teams tend to really stumble.”

Once that internal team is up and running, the next task is finding developers’ key pain points and identifying the right carrots to dangle in front of them to gain widespread adoption, such as easier operability and reduced toil in getting code deployed, all with enough training and support to bring them along on the journey.

Then there is the problem of technical debt. “A lot of the challenges are around legacy systems that will not easily be mappable to an internal platform,” Fournier noted. “You will have to work with teams to understand how we get them onto this platform without forcing every line of code at your company to be rewritten.”

Twitter: Expecting to double developer productivity by using an IDP

The social network Twitter started to centralise its build team as far back as 2011, before forming its internal Engineering Effectiveness team in 2014 to improve developer productivity and happiness.

Today, “we start by looking for velocity,” said Nick Tornow, platform lead at Twitter. “We define that as the number of features an engineer can deliver in a unit of time, and we want to double that by the end of 2023.”

Achieving that ambitious goal at scale will be a challenge, even for an organisation with as much engineering muscle as Twitter has. As with most companies working with IDPs, the key is to break the problem down into manageable chunks.

“You look for commonalities and common concerns engineers have to deal with,” Tornow said. Like many platform-oriented organisations, Twitter thinks of its IDP as providing a set of paved paths for developers to follow. If those paths have already been built by a piece of open source software, like Bazel for testing or Kafka for streaming data pipelines, then all the better. “Only go your own way when there isn’t an alternative,” he said.

Overall, Tornow and his team want to abstract away fundamental concerns like security, reliability, and compliance so that developers can focus solely on their code.

“Platform is charged with making those fundamentals free,” Tornow said. “We want developers to be able to write code quickly and then automate the steps for testing, canary deploys, monitoring, all of that. Even though we have thousands of microservices here, it is almost impossible to not be confident in that deploy process.”

That doesn’t mean tension doesn’t arise between developers and the platform team from time to time. “The art of the whole thing is you are talking about people with complicated objectives,” Tornow said.

Listening to each other and then explaining each side’s needs clearly and transparently can help ease some of that tension and establish common ground. “If people understand why those decisions are being made, you build empathy,” he said.

Tornow’s parting piece of advice is to build around what you already have, instead of trying to reinvent the wheel with a big shiny new platform. “It is easier to think about incrementally expanding your platform and starting with the tools you have now,” he said. “Carve out some people and build around that—that’s where you start.”

Yelp: The evolution of an IDP

Popular reviews site Yelp’s internal developer platform is so well established that it even comes with its own delicious name: PaaSTA.

Initially developed in 2014, Yelp’s IDP came about as a way for engineers to move away from largely manual deployment processes performed by a dedicated operations team.

“It was obvious that we needed [an IDP] because non-infrastructure developers were spending too much time on infrastructure, we weren’t moving as fast as we wanted to, and that tech debt was getting out of hand, with everything tying back to a slow release process,” said George Bashi, vice president of engineering for infrastructure at Yelp.

As the name would suggest, PaaSTA is Yelp’s own take on a platform as a service. “It allows developers to declare, in config files, exactly how they want the code in their Git repo to be built, deployed, routed, and monitored,” wrote Kyle Anderson, a former site reliability engineer at Yelp who now works at Netflix, in a November 2015 blog post.
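To make the declarative idea concrete, here is a minimal Python sketch of a service described once as data and expanded by the platform into build, deploy, routing, and monitoring steps. It is only an illustration: the `ServiceConfig` fields, the `plan` function, and the example repository URL are hypothetical and do not reflect PaaSTA’s actual config schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical declaration of one service (not PaaSTA's real schema)."""

    name: str
    git_repo: str
    instances: int = 1
    port: int = 8080
    healthcheck_path: str = "/status"
    alert_team: str = "unowned"


def plan(config: ServiceConfig) -> list[str]:
    """Expand a declaration into the ordered steps a platform might run."""
    if config.instances < 1:
        raise ValueError("instances must be at least 1")
    return [
        f"build image from {config.git_repo}",
        f"deploy {config.instances} instance(s) of {config.name}",
        f"route traffic to port {config.port}",
        f"monitor {config.healthcheck_path} and alert {config.alert_team}",
    ]


if __name__ == "__main__":
    # Assumed example values; a developer would only ever edit this declaration.
    example = ServiceConfig(
        name="search",
        git_repo="git@example.com:search.git",
        instances=3,
        alert_team="search-oncall",
    )
    for step in plan(example):
        print(step)
```

The point of the design is that the developer states what they want, while the platform team owns how each step is carried out.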

The resulting platform was a mix of Docker for code delivery and containment, Apache Mesos for code execution and scheduling, Mesosphere Marathon for managing long-running services, Chronos for batch jobs, SmartStack for service registration and discovery, Sensu for monitoring and alerting, and Jenkins (optionally) for continuous deployment.

Since then, the platform has “evolved a lot, in that we have replaced every single component,” Bashi said. “Mesos is now Kubernetes, Spark is now Flink, SmartStack is now Envoy. That is one of the reasons we build this stuff, as it lets the infrastructure team replace the wings of the plane while we are flying and the feature developers can just build stuff.”
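One way to picture how components can be swapped without feature developers noticing is a stable interface with interchangeable backends. The Python sketch below is an assumption-laden illustration, not Yelp’s code: `Scheduler`, `Platform`, and both backend classes are hypothetical stubs that return strings rather than calling any real Mesos or Kubernetes API.

```python
from typing import Protocol


class Scheduler(Protocol):
    """Hypothetical contract every scheduling backend must satisfy."""

    def run(self, service: str, instances: int) -> str: ...


class MesosScheduler:
    """Stub standing in for the older backend; makes no real API calls."""

    def run(self, service: str, instances: int) -> str:
        return f"mesos: launched {instances} x {service}"


class KubernetesScheduler:
    """Stub standing in for the newer backend; makes no real API calls."""

    def run(self, service: str, instances: int) -> str:
        return f"kubernetes: launched {instances} x {service}"


class Platform:
    """The only surface feature developers call."""

    def __init__(self, scheduler: Scheduler) -> None:
        self._scheduler = scheduler

    def swap_scheduler(self, scheduler: Scheduler) -> None:
        # The infrastructure team replaces the backend; callers are unaffected.
        self._scheduler = scheduler

    def deploy(self, service: str, instances: int = 1) -> str:
        return self._scheduler.run(service, instances)


if __name__ == "__main__":
    platform = Platform(MesosScheduler())
    print(platform.deploy("search", 2))
    platform.swap_scheduler(KubernetesScheduler())
    print(platform.deploy("search", 2))  # same call, different backend
```

The call to `deploy` is identical before and after the swap, which is the property Bashi describes: the infrastructure team changes what sits underneath while feature developers keep building.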

Yelp wants there to be a certain level of trust between the platform team and developers, but if a team wants to go off on its own then it has the autonomy to do so. “If that happens, we have to ask how we have lost their trust and invest in fixing that issue,” Bashi said.

A lot of that comes down to “basic product management,” he added. “Be in touch with your users and don’t build an ivory tower.”