At N26, we can spin up a new service in production in at most one hour. You read that right: a new service with all its infrastructure dependencies provisioned and fully operational, with a dedicated database cluster and an API accessible to authenticated requests.

In 2021, when we started assembling the team for our Brazilian operation, we knew we would need to build a robust culture and environment to match the business's ambitions. In a crowded neobank market, being able to release new products and services quickly is vital to acquiring and retaining customers.

The current startup funding winter has shifted attention to survivability and, thus, to profitability. Delivering fast isn't enough anymore. Startups must anticipate customers' needs instead of using their nimbleness to experiment randomly and discover failure only in retrospect. Anticipation is faster and cheaper than experimentation.

That’s why platform engineering is all the rage now. Platforms are essential to raising the productivity of product teams, letting them focus on what matters: satisfying customers’ needs better and delivering business outcomes. Robust platforms take the pain out of dealing with infrastructure by automating repetitive tasks and by handling cross-cutting concerns like security, compliance, and observability.

Two years later, our platform engineering strategy is paying off. We’re a highly mature DevOps organization, performing better than 97% of the 2022 State of DevOps Report respondents in the financial services industry. Our journey may give you useful insights into how to bootstrap your own engineering platform. Fair warning: you’ll need to flex your product management muscles. After that, we’ll see how N26’s platform works.

Culture eats strategy for breakfast

Let’s time travel again. In July 2021, when we started to lay out the Engineering strategy for N26’s Brazilian operation, we were at the peak of the effort to deliver the first version of our mobile app in Brazil. Unlike the European expansion, in Brazil we built everything from scratch. This first version would be a closed beta available only to family and friends. Our goal was to test our operation and lay the groundwork to comply with regulations from BACEN (the Brazilian Central Bank).

You may be wondering if this is a fortunate case of a greenfield project. It isn’t. When I joined N26, all the product and engineering efforts were outsourced. We even depended on our partners’ infrastructures.

This arrangement was good enough for the first release of our app, as it was a closed beta experiment. However, it wouldn’t suit our business ambitions. In a crowded neobank market, a unique value proposition requires dedicated staff with skin in the game, learning continuously and proposing solutions to actual customer problems.

At the time, we were a five-person Engineering team. I set aside time to run a three-day workshop to formalize our strategy and culture. Before diving into the strategy, we discussed our culture. Defining the social game was vital because we were on the brink of scaling up the Engineering team for the next phase of our operation. Hiring quickly and onboarding people while ramping up productivity is a challenging feat.

While discussing our principles, we found ourselves repeating how much responsibility mattered. We wanted autonomous and responsible teams at N26. For us, teams must maintain what they deliver. This brings teams into contact with the operational side of their applications and with the customers, two feedback loops that lead to building quality into what is delivered. This approach is known as you build it, you run it.

Other principles came naturally to us. One was that all infrastructure should be abstracted behind a platform. Another was to use proven management practices and techniques, including product management, to run our Engineering department. So, platform engineering was organically codified into our culture.

With our culture set and the pillars of our strategy in place, it was time to start the strategic planning.

Strategic planning

To start the strategic planning, I ran an Impact Mapping session. The technique is highly collaborative and consists of answering four questions: “Why?”, “Who?”, “How?” and “What?”. It’s simple yet effective. Let’s walk through our impact map to see how it works.

The most important question is, why are we doing this? Knowing the goal lets people adapt to unforeseen circumstances. An impact map always starts with the why.

We were on a path of rapid growth. The business plan was aggressive on customer growth, and we anticipated a 20-fold increase in our team, from 5 to 100 people1, by the end of 2022. That is massive growth in just 1.5 years. More than a robust platform, we needed an excellent developer experience.

Our goal was to make it virtually easy to deploy on day one.

Our answer to the “Why?” question was to make it virtually easy to deploy on day one2. But before jumping into which features we needed to develop to achieve this goal, we discussed who would be impacted by this goal. Who was our customer?

Our customer was the product engineer.

Our answer to the “Who?” question was the product engineer. It may seem odd to identify an internal customer, but at this phase, I knew we should refrain from accruing organizational debt. As Steve Blank — the Lean Startup movement’s godfather — points out, organizational debt can kill a company quicker than technical debt. Our main contribution to the business as an Engineering department was to staff the teams adequately and make them productive. The teams would contribute directly to the business by solving specific customer problems.

Then, we followed up to understand which behaviors we wanted the product engineers to exhibit to help us achieve our goal.

We wanted the product engineers to maintain what they built and to be productive on their first working day.

Our answer to the “How?” was twofold. First, we wanted our product engineers to be responsible for maintaining what they built. Second, we wanted them to be productive on their first working day. These were the impacts we wanted to create. Finally, we discussed what we could do to support the expected impacts.

Deliverables or features support impacts.

Our answer to “What?” was a list of features. Note, however, that these weren’t a shopping list of things we wanted to do. They were fully contextualized features aligned with the desired impacts and the business goal. The presence of non-technological features (e.g., getting-started instructions and on-call documentation) makes this more evident.

With an essential part of our strategy well understood by the attendees, it was time to better understand what kind of experience we envisioned delivering to our fellow product engineers.

Mapping the Developer Experience

The impact map we created revealed two features that made us wonder what kind of developer experience we wanted to provide to the product engineers. They reinforced the infrastructure abstraction principle (the low-level details of the infrastructure should be abstracted for the product engineers) and the homogeneity and consistency3 principle that we had set in our culture document.

The features that guided the mapping of the developer experience.

So, we started discussing our vision for the platform. We wanted to provide an experience that would let product engineers be productive on their first working day. We then moved on to our previous experiences with environment provisioning processes.

Have you ever stopped to think about how the provisioning of new services works at your company? Is it entirely automated? How much manual work does it require?

I previously worked at a neobank that had almost twenty million customers and 350+ services when I joined. The few who knew the steps required to provision a new service were proud of it: it was tribal knowledge. The engineering platform was incredible, but this part of the developer experience lacked self-service capabilities. I remember people saying it took at least two weeks, on average, to deploy a new service to production due to the amount of undocumented manual work.

If two weeks seems like too much, can you imagine waiting three months for a working environment to deploy your service? A recent study by Rafay Systems found that one in four organizations takes three months or longer to move an application or service from code-complete to production, and 9% take six months or longer. That’s a huge waste of resources.

A quick provisioning process seemed like a good idea; we could even deploy a temporary service to production during onboarding classes if we wanted to. We then mapped the journey of our product and platform engineers for provisioning a new service, imagining an experience that required the least possible effort from the product engineers.

The user story map of the developer experience we envisioned creating.

What emerged from our User Story Mapping session was a vision in which our platform would solve this provisioning problem through extensive automation. We wanted a PaaS-like experience: product engineers would only need to open a pull request with a few lines of YAML in a repository that would orchestrate the bootstrapping of new services in our cloud environment. We even discussed a draft of what this file would look like.

After the pull request was approved, the orchestration would prepare the production environment. We wanted it to take care of things like compute cluster management, database provisioning, messaging infrastructure setup, and so on. We also wanted the orchestration to bootstrap the project: it should create the project’s repository from a service chassis and make a first deployment to production to guarantee the environment was configured correctly. Then, the product engineers would clone the repository of the newly created service and start working on it. Every push to the trunk branch would deploy the service to production.

With a broad overview and shared understanding of the developer experience, we were ready to dive deeper into what we needed to build.

Planning to build less and the Thinnest Viable Platform

Our deep dive took the provisioning process and the service chassis as its starting points. We used the Story Workshop technique to decompose both features into three slices: good enough (what are the minimum characteristics that make the feature functional?), better (what would make it better?), and best (what would make it fabulous or ideal?). Slicing techniques help us understand the size of the scope and prioritize better.

Slicing the platform helps to prioritize the Thinnest Viable Platform.

The workshop reminded us of how complex software engineering is. We discussed simple concerns like database migrations and logging, as well as more advanced ones like circuit breakers and a secure software development lifecycle (SSDLC). We knew that some of the discussed features would be required early on, like adopting the Money pattern to prevent rounding bugs in operations such as installment calculation and credit financing; such bugs are costly and more frequent than we think.
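
To give a concrete idea of what we had in mind, here is a minimal Kotlin sketch of the Money pattern (illustrative code, not our chassis implementation): amounts are stored as integer cents, and installment allocation distributes the remainder cent by cent, so no money is created or lost to rounding.

import java.math.BigDecimal

// Minimal Money value type: amounts are kept in minor units (cents)
// so arithmetic never silently drops fractions of a cent.
data class Money(val cents: Long, val currency: String = "BRL") {
    operator fun plus(other: Money): Money {
        require(currency == other.currency) { "Currency mismatch" }
        return Money(cents + other.cents, currency)
    }

    // Splits the amount into n installments, distributing the remainder
    // cent by cent so the parts always add up to the original amount.
    fun allocate(n: Int): List<Money> {
        require(n > 0) { "Installment count must be positive" }
        val base = cents / n
        val remainder = (cents % n).toInt()
        return List(n) { i -> Money(base + if (i < remainder) 1 else 0, currency) }
    }

    override fun toString(): String =
        "$currency ${BigDecimal.valueOf(cents).movePointLeft(2)}"
}

fun main() {
    // R$ 100.00 split into 3 installments: 33.34 + 33.33 + 33.33 = 100.00
    val invoice = Money(10_000)
    val installments = invoice.allocate(3)
    println(installments)                     // [BRL 33.34, BRL 33.33, BRL 33.33]
    println(installments.reduce(Money::plus)) // BRL 100.00
}

The key property is that the installments always add up exactly to the original amount, which is precisely where naive floating-point division tends to fail.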

However, we also knew we had neither enough people nor enough time to build everything we wanted. That’s the harsh truth of software development: you never have enough people, time, or money to build everything you want, so focus and prioritization are key. And that’s the strength of the Story Workshop technique: it helps define the Thinnest Viable Platform (TVP).

A TVP is the simplest platform you should aim to build. In some cases, the TVP may be just documentation of the components or services that serve as building blocks for software development. Defining a TVP is important because software engineers love to build technical things, and a platform has the potential to become a never-ending journey of half-baked features. Without focus, there’s a risk of delivering a bloated platform that ignores the developer experience.

We set a goal to prioritize which features to deliver in our TVP. We wanted to create a provisioning process that could be run within a week4. The initial feature set would help us build the backbone of our platform.

Our first goal was to create a TVP that supported provisioning a service within a week.

So, we prioritized the setup of the computing cluster (Kubernetes), the orchestration automation5 (which we named “Project Bootstrapper”), and the first version of the service chassis. But how could someone deliver something without any kind of database support? We taught the team that they could use in-memory repositories for this purpose. Shortly after the TVP release, the platform team shipped the database provisioning feature. From then on, the product teams could deploy fully operational services in production.
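
As an illustration of that stopgap, here is a minimal Kotlin sketch (hypothetical names, not our actual code) of a repository port with an in-memory implementation: a team could ship it behind an interface and later swap it for a database-backed one once the provisioning feature landed.

import java.util.UUID
import java.util.concurrent.ConcurrentHashMap

// Hypothetical domain entity and repository port, used only for illustration.
data class Account(val id: UUID = UUID.randomUUID(), val ownerName: String, val balanceCents: Long)

interface AccountRepository {
    fun save(account: Account): Account
    fun findById(id: UUID): Account?
    fun findAll(): List<Account>
}

// Stopgap implementation: keeps everything in memory, so a service can be
// deployed and exercised before the database provisioning feature exists.
class InMemoryAccountRepository : AccountRepository {
    private val store = ConcurrentHashMap<UUID, Account>()

    override fun save(account: Account): Account {
        store[account.id] = account
        return account
    }

    override fun findById(id: UUID): Account? = store[id]

    override fun findAll(): List<Account> = store.values.toList()
}

fun main() {
    val repository: AccountRepository = InMemoryAccountRepository()
    val account = repository.save(Account(ownerName = "Maria", balanceCents = 50_000))
    println(repository.findById(account.id))
}

Because callers depend only on the repository interface, replacing the in-memory version with a database-backed implementation later is a local change.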

That’s an important product management lesson for anyone doing platform engineering: your releases can solve the problem only partially. Plan to build less and offer customers and stakeholders alternatives they can implement themselves while you work on the definitive solution. Alignment and communication are essential to working around temporary dependencies.

Our platform has evolved dramatically since then. Every step of its progress was marked by the diligent work of the platform team, who always planned to advance the platform’s capabilities iteratively. Central to this process is understanding the business goals, the product engineers’ needs, and the regulatory requirements in order to prioritize features that deliver value to the company.

How our platform works

Fast forward to July 2023. Our platform has evolved, and the orchestration makes everybody’s lives easier. To create a new service, the product engineer clones the repository that manages the infrastructure and runs a make command:

$ git checkout -b overnight
$ make create-team spaces

The command scaffolds a directory where the product engineer’s team will edit a single configuration file to bootstrap the new service. The file6 allows configuring things like the Service Level Objectives (SLOs) and the database cluster size:

locals {
    projects = {
        overnight = {
            description = "Calculate the balances' CDI-indexed yield."
            repository = {
                bootstrap = true
                language = "kotlin"
            }
            service = {
                bootstrap = true
            }
            monitoring = {
                bootstrap = true
                error_rate = "0.5"
                read_latency = "0.4"
                write_latency = "4"
                apdex = "0.94"
            }
            database = {
                bootstrap = true
                instance = "db.r6g.large"
                replicas = 4
                engine = "aurora-postgresql"
            }
            queues = {
                bootstrap = true
                names = ["invoice-paid", "yield-calculated"]
            }
            logs = {
                archive = true
            }
            cdn = {
                bootstrap = false
            }
        }
    }
}

Then, the product engineer commits the files and opens a pull request, which the platform team reviews. We have a policy in place to prevent microservices envy: as Domain-Driven Design practitioners, we prefer to design our services around bounded contexts and split services only after thoughtful consideration. The platform team’s review checks conformity with these policies.

After the pull request is approved, the orchestration is triggered (a simplified sketch follows the list below). The orchestration:

  1. Creates a service user for the application in AWS IAM.
  2. Creates a Kubernetes service with horizontal and vertical pod autoscaler enabled.
  3. Integrates the API Gateway with the Kubernetes ingress to expose the service’s API.
  4. Creates an Aurora cluster in AWS RDS with horizontal autoscaling enabled.
  5. Creates messaging queues/topics.
  6. Creates an S3 directory integrated with a Content Delivery Network (CDN).
  7. Stores credentials (API keys, database passwords, and so on) in the secrets management system.
  8. Bootstraps the new service codebase using the service chassis.
  9. Creates the new service repository in BitBucket.
  10. Configures the BitBucket repository variables.
  11. Creates the CI/CD pipelines.
  12. Configures the quality and security gateway.
  13. Runs the first deployment of the service.
  14. Runs the first deployment of service workers in sidecars.
  15. Creates a Service Level Objective (SLO) in Datadog.
  16. Configures service alarms in Datadog.
  17. Runs health checks to guarantee the provisioning worked.
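
Conceptually, the orchestration behaves like an ordered pipeline of provisioning steps that aborts on the first failure and ends with health checks. The Kotlin sketch below is a hypothetical illustration of that idea, not our actual implementation (which started life as a bash script running in BitBucket pipelines, as noted in the footnotes).

// Hypothetical sketch: the orchestration modeled as an ordered pipeline of
// named provisioning steps that aborts as soon as one step fails.
data class ProvisioningStep(val name: String, val action: () -> Unit)

fun provision(serviceName: String, steps: List<ProvisioningStep>) {
    for (step in steps) {
        println("[$serviceName] ${step.name}")
        try {
            step.action()
        } catch (e: Exception) {
            // Stop the whole provisioning on the first failure.
            throw IllegalStateException("Provisioning aborted at '${step.name}'", e)
        }
    }
}

fun main() {
    provision(
        serviceName = "overnight",
        steps = listOf(
            ProvisioningStep("create IAM service user") { /* call the cloud API */ },
            ProvisioningStep("create Kubernetes service") { /* apply manifests */ },
            ProvisioningStep("provision database cluster") { /* call the cloud API */ },
            ProvisioningStep("bootstrap repository from the chassis") { /* call the VCS API */ },
            ProvisioningStep("run the first deployment") { /* trigger the CI/CD pipeline */ },
            ProvisioningStep("run health checks") { /* call the service's health endpoint */ }
        )
    )
}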

The service chassis is a Ktor-based opinionated framework that packages a curated set of open-source libraries and internally developed components to reduce the cognitive load on concerns like persistence and logging.
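
To make this concrete, here is a hypothetical sketch of what starting a service on top of such a chassis could look like (the function and route names are illustrative and not N26’s actual API): the chassis owns the Ktor server setup and health-check wiring, and the product team only registers its domain routes.

import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*

// Hypothetical chassis entry point (Ktor 2.x style): the chassis handles server
// setup and the health endpoint; the product team only plugs in domain routes.
fun runChassisService(port: Int = 8080, domainRoutes: Routing.() -> Unit) {
    embeddedServer(Netty, port = port) {
        routing {
            get("/health") { call.respondText("OK") } // provided by the chassis
            domainRoutes()                            // provided by the product team
        }
    }.start(wait = true)
}

fun main() {
    // A product engineer's service built on the chassis: a single domain route.
    runChassisService {
        get("/yield") { call.respondText("CDI-indexed yield goes here") }
    }
}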

A simplified diagram of how the orchestration works.

After the provisioning finishes, the product engineer clones the repository and starts working on the new service. The service chassis automatically generates a Docker Compose file to support local development. All projects share the same bootstrapping structure, which helps the product engineers when they switch between projects. Furthermore, less variability means faster optimization cycles for the platform engineers when enhancing the underlying tooling used by the Engineering team.

The services are also fully instrumented. From the start, product engineers can monitor application performance and troubleshoot issues like slow queries and low Apdex scores. Alarms are configured automatically and sent to standardized Slack channels (each team has a standard set of Slack channels, one of them dedicated to observability alarms).
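
For reference, the Apdex score targeted in the configuration above (0.94) follows the standard definition: given a latency threshold t, requests completing within t count as satisfied, those within 4t as tolerating, and the rest as frustrated. A small illustrative Kotlin function:

// Apdex for a threshold t (in seconds): satisfied requests finish within t,
// tolerating ones within 4t, and the rest are frustrated.
fun apdex(latenciesSeconds: List<Double>, t: Double): Double {
    val satisfied = latenciesSeconds.count { it <= t }
    val tolerating = latenciesSeconds.count { it > t && it <= 4 * t }
    return (satisfied + tolerating / 2.0) / latenciesSeconds.size
}

fun main() {
    val latencies = listOf(0.1, 0.2, 0.3, 0.9, 2.5) // seconds
    println(apdex(latencies, t = 0.5))              // (3 + 1 / 2.0) / 5 = 0.7
}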

To fine-tune the available computing capacity, the product engineer may update the configuration file to increase the number of Kubernetes pods required to run their service. From a product engineer’s standpoint, the only way to change the production environment is through this configuration file, which abstracts the underlying cloud infrastructure. This guarantees that all changes to production are transparently audited.

We also have a set of tasks that require manual work but must follow guidelines or standards. Some guidelines emerged from peer-review processes, like the API design review. We strive to design our APIs semantically: initially, we peer-reviewed every change to an API contract; currently, the team follows the conventions available in a design document.

We can’t stress enough the importance of documentation. We documented the most critical parts of the product engineer’s journey, including the entire Software Development Lifecycle (SDLC). For example, by reading the documentation, product engineers can learn how to create new credentials and store them as secrets in our cloud environment, or how to adjust alarm thresholds. Automated processes and documentation are the foundation of a self-service platform.

Summary table of the platform’s features with links for further information.

When a new person joins our team, our identity provider sends a welcome email with instructions on how to activate the laptop that is shipped to the person’s house a few days before their first day of work. The laptop is fully configured and updated to the latest versions of the operating system and development tools. The only work the software engineer needs to do is clone the repositories and run the startup scripts (generally Docker Compose commands) available in the README files.

To make our web-based tools readily available (e.g., Rancher, Datadog, Jira, Confluence, 1Password, Slack), we use Okta as the identity provider. With a single password, the person unlocks the laptop and accesses all the tools. Expanding our platform’s scope to concerns like access management and laptop setup helped us implement Engineering-wide changes, like rolling out tooling updates and enhancements, without interrupting people to run setup wizards manually.

Closing remarks

Platform engineering at N26 has been a long and winding road. From custom-built orchestration tools to the standardization of manual processes, we found a way to make deployment virtually easy on day one. Along the way, we eliminated the dependency on third-party infrastructure, which resulted in an incredible drop in our API latency. The final payoff was an immense increase in developer productivity and a highly mature DevOps organization.

The main takeaway is to manage your platform strategically. There is always more to build than you have people, time, and money for, so a product mindset is paramount. Take advantage of available tools, especially open-source ones, to cut scope. Focus on the developer experience you want to deliver and the outcomes your company pursues. Invest in building a resilient team and a developer experience that enables it to anticipate customer needs.

Acknowledgments

The platform engineering journey described here was possible due to the efforts of many people. Our strong foundation was born from the minds of seasoned professionals. Antonio Spinelli, Henrique Sloty, and Thiago Costa contributed heavily to strategic planning. All of them worked on the first features of the platform before moving on to work as the first engineering managers of the company.

Henrique Sloty has been managing the Engineering Platform area since the release of our TVP. Abner Maioralli joined the team and worked as the Technical Product Manager, supporting the planning, discovery, and communication processes. Although not discussed in this article, we also have a platform for mobile software development. Matheus Schmidt and Lucas Calandrine are the minds behind this platform.


Footnotes

  1. In late-stage and geographically distributed startups like N26, it is common practice to use business plans to define the funding needs for projects. The leadership of the Brazilian operation had to plan and determine what products and services we envisioned delivering to our local customers. I established a simple guideline and explained to the CEO and the leadership team that teams must have a minimum size to work on the entire SDLC. The projected growth was a tenfold increase in the Brazilian staff, to 300 employees by the end of 2022. The Engineering team would grow to 100 people, a 20-fold increase. 

  2. The inspiration came from Etsy, one of the early DevOps pioneers. 

  3. At the time, we had previous experience in both homogeneous and heterogeneous environments. We found ourselves biased towards homogeneous environments due to hurdles we had faced in previous experiences, like services built around personal technology preferences. The lack of strategic management in Engineering may leave a trail of chaos: services that nobody wants to support because of inexperience in the programming language, database technology, and so on. We were not against using the technology best suited for the problem at hand (use the right tool for the job), but we bet on building a homogeneous and consistent platform first and branching out as needed. 

  4. Remember that we depended on third-party infrastructure? We had three main issues with it: it was a proprietary platform (which created non-portable workflows), it had high latency, and it had a slow development cycle. Provisioning a new service within a week seemed fair. 

  5. Initially, the platform engineers of the Cloud Platform were unsure whether they could deliver this orchestration. Then, Henrique Sloty, who would become the Engineering Manager of the entire platform team, ran a spike solution: basically, a bash script that made the orchestration possible using BitBucket pipelines. The result was that the first release of the platform provisioned new services in under 15 minutes. 

  6. The TVP used YAML for this file. Later, the file syntax was ported to HCL to simplify the platform maintenance. 
