Imagine you work in a modern technical environment, whether at a startup or not. It’s Wednesday, and you’re ready to release a long-awaited feature to your customer base.
You and your team start checking off the release plan item by item. You check the last one just after switching on the feature flag. Now it’s time to monitor the customers using the feature.
A few minutes later, you spot warnings in the observability tool. You investigate the issue while the CX team opens tickets with customer complaints. You check whether the number of running Kubernetes pods is healthy and see that the cluster is starting more pods to cope with the load increase caused by the spike of errors. The database load is adequate, but the dead letter queue accumulates messages in a repeating pattern.
You add some extra log lines to capture what is happening. You copy and paste error-handling code from another project written in the same programming language as your microservice. Then you commit the code and wait while the CI/CD tool deploys it to the cluster.
Later on, you find that the production environment’s access policy is stricter than the sandbox’s. The service can’t publish messages to a secondary queue, which is used to migrate the data the new feature needs for a specific segment of the customer base. It isn’t working because you didn’t follow the naming rules.
You decide to rename the queue by changing a few lines of Terraform code and submit the change for approval by the Infrastructure team. Once approved, you redeploy the service with the fixed queue name, and everything starts working as expected. You’re happy because you solved the incident in under a couple of hours, but frustrated with how social media complaints overshadowed the launch campaign. On top of it all, everyone felt overwhelmed.
You build it, you run it
Our introductory tale shows many attributes of companies running modern technical environments and practices. The team was able to deliver software using automated processes and to set up the production environment by making changes in Terraform, an Infrastructure as Code tool. Operations are available as a self-service, automated, on-demand dependency. A DevOps culture is established.
Besides DevOps, the team developing this feature is fully responsible for its software’s operations. They monitor it actively and immediately act when things go wrong since smooth operation is the top priority. Investments are made in quality to lower the maintenance burden.
The team owns the entire product cycle. The team creates a sense of how regular operation feels by working over the whole software development lifecycle (SDLC). Also, it can understand better how customers react to changes by correlating operational (application telemetry) and business metrics (KPIs, health indicators, and improvement drivers). This company adopts the “You build it, you run it” (YBIYRN) approach.
However, even with these practices, our tale spots hardships common even in highly mature DevOps organizations. The lack of a comprehensive engineering platform increases cognitive load, affecting productivity. Let’s understand what cognitive load is before discussing platform engineering.
Have you ever felt mentally exhausted after learning something new, like a new language, technology, or a different way of working? While learning, have you ever felt tempted to check a notification on your mobile phone?
Those are the forces of different cognitive loads acting on your cognitive system. Cognitive Load Theory was introduced in the 1980s to explain how our learning ability is heavily constrained by our working memory. Learning requires our cognitive system to process information in working memory and then store it in long-term memory.
However, working memory is highly constrained in capacity and duration and only holds information briefly. The exception is when working memory deals with previously learned abilities, since it retrieves information from long-term memory. That’s why, while learning something new, you practice until you can do it automatically. You’ve experienced this many times while learning to speak, read, write, dance, you name it.
The sensory memory filters out most of the incoming information. Learning is the act of encoding new schemas in the long-term memory.
Understanding the kinds of cognitive load is vital to designing an environment that prioritizes learning by adjusting the load on the working memory:
- Intrinsic Load is imposed by the complexity of the information and the previous expertise of the person in the subject matter. That’s why we break down projects into smaller deliverables since building something by joining simpler parts is more manageable
- Extraneous Load is also known as ineffective load. It diverts cognitive resources to irrelevant activities that do not contribute to learning. That’s why we must create workplaces that allow people to focus by diminishing distractions and improving processes
- Germane Load is the effective load imposed on the working memory by the process of learning. Transferring information from the working memory to the long-term memory requires effort. This effort is the Germane Load. In the workplace, this happens when discovering how to solve a (business, customer, programming) problem
Extra Intrinsic and Extraneous Load must be eliminated or minimized to optimize learning. Back to our tale: we want more time for the team to learn. Did the new feature create value for the customers? What should we optimize based on the gathered feedback? Working on these questions is more valuable than fiddling with infrastructure configuration. Providing an engineering platform is critical to increasing Germane Load1 by freeing the team for value-adding activities.
Platform engineering is an emerging practice that improves the developer experience and productivity by providing a compelling integrated product — an engineering platform2 — that will reduce cognitive load by delivering self-service software engineering capabilities.
Put simply, this platform will package everything from developer tools to processes and standards. Imagine, for example, that the team in our introductory tale works in a fintech startup, and their working domain deals with calculating installment payments. What happens when you divide USD 100 into three equal installments? You get three installments, each one valued at USD 33.33. Where’s the missing cent?
Not a penny can be lost in monetary calculations. It may seem trivial, but rounding errors cost Uber tens of millions of dollars in 2017 and devalued a Canadian stock index by 50% in 22 months. So, instead of leaving each team in this startup to repeatedly write code for monetary calculations, a single library should be provided to do it for them.
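A largest-remainder split is the kind of routine such a shared library could encapsulate. The sketch below is a minimal illustration, not a production money library (the function name is an assumption, and it assumes amounts with at most two decimal places):

```python
from decimal import Decimal

def split_installments(total: Decimal, parts: int) -> list[Decimal]:
    """Split an amount into near-equal installments without losing a cent."""
    # Work in integer cents to avoid floating-point drift;
    # assumes `total` has at most two decimal places.
    cents = int(total * 100)
    base, remainder = divmod(cents, parts)
    # The first `remainder` installments carry one extra cent each.
    return [Decimal(base + (1 if i < remainder else 0)) / 100 for i in range(parts)]
```

Splitting USD 100 into three parts yields 33.34, 33.33, and 33.33: the sum is exactly 100.00, and the leftover cent lands in a deterministic place instead of vanishing.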
If monetary calculations seem an obvious example, we can move on to compliance issues. Have you ever had problems complying with GDPR-like privacy laws? Imagine the team from our tale wrongly logging Personal Identifiable Information (PII) like the social security number or credit card for debugging purposes. Now you have in your log history data that shouldn’t be there.
The solution is implementing ways to deal with PII in the engineering platform, preventing them from leaking into the logs. The platform team can redact PII on the application level (in a service chassis) and log stream. This way, the teams are freed from remembering which data they may add to the log. If they mistakenly add anything holding PII data to the log call, they will rest assured that it won’t appear in the records.
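At the application level, such a control can be as simple as a log filter shipped in the chassis. The sketch below uses Python’s standard logging module; the filter name and the two patterns are illustrative assumptions, not a real platform API (a real platform would maintain a curated, versioned pattern set):

```python
import logging
import re

# Hypothetical patterns a platform team might standardize.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like identifiers
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),  # credit-card-like digit runs
]

class PIIRedactionFilter(logging.Filter):
    """Redact PII from log records before they reach any handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        for pattern in PII_PATTERNS:
            message = pattern.sub("[REDACTED]", message)
        record.msg, record.args = message, None  # freeze the sanitized message
        return True  # keep the record, just redacted

logger = logging.getLogger("service-chassis")
logger.addFilter(PIIRedactionFilter())
```

Because the filter is attached by the chassis at bootstrap, teams never have to remember which fields are safe to log: anything matching the standardized patterns is scrubbed before it reaches a handler.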
In our example, the engineering platform improves the last-mile developer experience of the existent DevOps platform. Remember, in our tale, the team was able to provision infrastructure through self-service tools. The engineering platform in this example would also package utilitarian packages (e.g., ORMs, monetary calculation libraries) and cross-cutting concerns (e.g., logging, security, error handling) in a compelling and standardized service chassis used when bootstrapping new projects in the company.
The team consumes the platform using self-service tools. Some services, like CI/CD, the container cluster, and the database, may be highly standardized. However, the team glues everything else together, like logging, the quality gateway, and authentication.
The team uses a service chassis that exposes the platform’s services and features with opt-in or opt-out toggles. The platform services are standardized and consistently consumed throughout the organization.
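Conceptually, the chassis bootstraps platform capabilities behind toggles. The sketch below is a hypothetical illustration of that idea (the `ChassisConfig` type and capability names are assumptions, not a real chassis API):

```python
from dataclasses import dataclass

# Hypothetical chassis configuration: defaults encode the platform's policy.
@dataclass
class ChassisConfig:
    structured_logging: bool = True   # opt-out: standardized by the platform
    pii_redaction: bool = True        # opt-out: compliance control
    distributed_tracing: bool = True  # opt-out: observability baseline
    request_caching: bool = False     # opt-in: not every service needs it

def bootstrap(config: ChassisConfig) -> list[str]:
    """Return the platform capabilities wired into the service at startup."""
    toggles = {
        "structured-logging": config.structured_logging,
        "pii-redaction": config.pii_redaction,
        "distributed-tracing": config.distributed_tracing,
        "request-caching": config.request_caching,
    }
    return [name for name, enabled in toggles.items() if enabled]
```

The design choice worth noting is the defaults: compliance and observability capabilities are opt-out, so a team has to make a deliberate decision to deviate from the organization-wide standard.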
With a comprehensive engineering platform like this, the company would evolve its DevOps approach. The teams could focus more on the applications they need to develop and on learning what is driving the business forward and what is solving their customer needs. They can focus on value-adding and learning activities instead of working on extraneous things (reduced cognitive load). It is worth noting that platform engineering improves productivity, leading to more motivated individuals due to better developer experience.
For me, platform engineering is an organic evolutionary step from different industry-wide experiences and adopted approaches like DevOps. Speaking of DevOps, let’s expand on platform engineering’s relationship with it and how it leads to higher DevOps maturity.
Higher DevOps maturity
If platform engineering is an evolutionary step of DevOps and other practices3, how can it help to lead to higher maturity? First of all, DevOps practitioners have been creating self-serviced platforms for the last ten years. The DevOps culture evolved into an approach where automation became self-serviced platforms managed as products. From the DevOps Handbook:
Instead of IT Operations doing manual work that comes from work tickets, it enables developer productivity through APIs and self-serviced platforms that create environments, test and deploy code (CI/CD), monitor and display production telemetry, and so forth. By doing this, IT Operations become more like Development (…), engaged in product development, where the product is the platform that developers use to safely, quickly, and securely test, deploy, and run their IT services in production.
Moving from the packaging of operations as a platform to the entire developer experience (i.e., packaging more capabilities in the platform, like logging, security, and error handling) means less cognitive load for the teams working on products and features. Enhanced developer experience will drive the DevOps metrics to higher levels.
Indeed, the Puppet 2021 State of DevOps Report found a high degree of correlation between DevOps evolution and the use of internal platforms: 48% of the highly mature organizations used internal platforms against 25% of the mid-level group, and only 8% of the low-level group.
High performers use more internal platforms compared to mid-level and low performers.
Platform engineering is an evolutionary step of DevOps. However, it requires expanding the scope of the platform to embrace the entire developer experience. Before diving into the platform’s capabilities, let’s discuss what is needed to create one. Not surprisingly, you’re going to need a platform team.
Platform engineering also has roots in organizational design, specifically after the groundwork set by Team Topologies, an approach to organize business and technology teams for fast flow.
At the root of the approach is the acknowledgment that organizational design starts with a team-first mindset and that teams need their cognitive load minimized through good boundaries and clear responsibilities. That’s why one of the approach’s most important contributions was the introduction of its topologies and interaction modes. Discussing organizational design with this precise vocabulary is refreshing. The topologies are four:
- Stream-aligned team: aligned to a flow of work, which is usually a segment of the business domain
- Complicated subsystem team: provides libraries to stream-aligned teams that solve computationally complex problems
- Enabling team: helps stream-aligned teams with specialized skills (e.g., testing, Agile practices, database management)
- Platform team: provides internal services to reduce the cognitive load, freeing the stream-aligned teams to accelerate their delivery rate
The four topologies.
The stream-aligned teams are responsible for the entire software delivery lifecycle; they’re YBIYRN teams. Engineering platforms are essential because they let teams own the whole lifecycle, eliminating handoffs and the waste of manual processes. Platforms minimize cognitive load while enabling stream-aligned teams to be fully autonomous.
Counter-intuitively, autonomy means restricting collaboration by reducing the dependencies between the teams. The approach defines three modes to understand the interaction between teams:
- Collaboration: working together for some time to discover new things (products, APIs, technologies)
- X-as-a-Service (XaaS): one team provides something as a service to another team
- Facilitation: one team mentors another team
The interaction modes.
Your goal should be to restrict the collaboration interaction mode to specific periods, like new business opportunities that require discovering novel solutions. Two or more teams will collaborate closely to define a solution. Once the solution is delivered, the relationship changes to XaaS in an upstream/downstream perspective.
The evolving organizational landscape will trigger interaction mode rearrangements that guide the relationships between teams of all topology types. To frame an example closer to our subject, imagine that a stream-aligned team finds it needs some new cloud provider services to implement a feature. At this moment, changing the relationship with the platform team to collaboration mode could help determine whether the new technology needs only a change in orchestration (i.e., changes in the Infrastructure as Code tool to provision the new service) or whether it is something that may be abstracted for later reuse by other teams.
A team topology with two teams in collaboration mode while consuming services from the Platform team.
In this setting, imagine that the platform team’s first deliverable is the provisioning of the new cloud service, freeing up the stream-aligned team to deal solely with value delivery to the end customer. After this delivery, the platform team may prioritize exposing the aforementioned service as a self-service tool or place it in the roadmap for a future release. Nonetheless, the interaction mode of both teams changes back to an XaaS mode.
After some time, the teams resume their relationship in XaaS mode.
It is worth understanding the approach before designing your teams. Also, the platform team may be composed of multiple teams of other topologies. As your organization grows, you may specialize teams in the different capabilities that compose the platform: numerous teams will work under the “platform team” or “platform area” umbrella.
The Platform team may be composed of multiple teams.
A platform can be a huge endeavor. As such, it is easy to get lost in the details and deliver something that is neither compelling nor minimizes the cognitive load of the stakeholders. Wait a minute. Stakeholders?
Building a platform means having a clear understanding of the platform’s users and their needs. As with any product, you’ll never have the time, money, and people required to build your product vision. So, platforms must be strategically managed like any other product. You’ll need a product vision and management processes like roadmapping and goal setting to align the stakeholders and communicate with them about the upcoming features and priorities.
Product management goes beyond strategic planning. The platform team needs to constantly survey users to see if the provided solutions are solving their problems. Every product must be delivered with complete support, including up-to-date documentation. Simplified onboarding mechanics should be in place to minimize manual tasks, granting a smooth self-service experience.
You may have heard the job titles “Platform Engineer” and “Product Engineer.” In Team Topologies terms, the Platform Engineer works in a platform team while the Product Engineer works in a stream-aligned team.
Is the Platform Engineer a different kind of engineer? What’s the difference compared with a Product or “regular” Software Engineer? I like to keep things simple. They’re different roles instead of job titles4. A Platform Engineer will develop platform-based solutions, which other Software Engineers will use. Product Engineers will develop solutions for end users, which may be other Software Engineers if the developed product is technical (e.g., a payments API). That’s why this distinction is more about the work being done, i.e., the role, than about a job title.
However, job titles aside, people working on platform teams require a different mentality. First, their motivation must be aligned with the fact that a platform team delivers value indirectly to the business. They’ll help to improve the organization’s productivity and also help to lower costs and standardize compliance and security controls.
Secondly, they must know data structures, algorithms, programming languages, operational systems, databases, shell scripting, and so on. It isn’t a different skill set from a “regular” Software Engineer working in a stream-aligned team. However, they must hone their API and software design skills, delivering technical products with excellent developer experience. These products must have well-designed APIs that are intuitive and easy to use. The code must set a high bar of measurable quality and solve well-defined use cases.
Capabilities of engineering platforms
Each engineering platform will evolve differently due to the pressing needs of its organization. However, some capabilities are valuable to have in the earliest versions of your platform product. Let’s take a cue from the DevOps Handbook:
By adding the expertise of QA, IT Operations, and Infosec into delivery teams and automated self-service tools and platforms, teams are able to use that expertise in their daily work without being dependent on other teams.
Most internal engineering platforms already package some of the mentioned capabilities in their self-service tools. We generally see some kind of automated infrastructure provisioning that spins up a working environment in a cluster like Kubernetes. The same automation typically configures the CI/CD pipeline, running the automated tests for every code commit pushed to the version control system. The same pipeline may run tools like Snyk or Sonarcloud to inspect code quality and security.
That’s solid groundwork. However, an engineering platform requires delivering an integrated product. We must integrate the entire software delivery lifecycle’s processes, practices, and components into a compelling product. Software Engineers constantly work with third-party code to ease dealing with the HTTP stack, database connections, and cross-cutting concerns (e.g., logging, authentication, authorization, caching). Packaging the frameworks and libraries in a standardized service chassis creates vast benefits.
A service chassis will help you standardize practices and processes while improving the governance of software dependencies if you’re using a microservices approach. Teams will also benefit from easier navigation between different software projects due to the standardization. Internal mobility of Software Engineers will also improve, as the ramp-up to learn a different business context will be smoother when the software development stack is well-known5.
Another benefit of this standardization is that platform teams can automate the platform further as they now control part of the application layer. For example, in our introductory tale, the platform team may fix the leaking of PII data to the log by changing the log component to redact a standardized data set. After building up the feature, the CI/CD would trigger a rebuild of every service. In a matter of minutes, the entire production environment would have a new compliance and security policy applied.
Platform engineering is a trend with great potential for organizations willing to reach higher DevOps maturity levels. It is also an approach with the added benefit of improving cross-cutting concerns by packaging capabilities like security, compliance, and observability into a compelling product, leading to more productive teams and motivated individuals.
Recent formulations of the Cognitive Load Theory state that Germane Load is not an independent source of cognitive load. John Sweller (2010) explains that Germane Load is a function of working memory resources devoted to the interacting elements that determine Intrinsic Load. Germane Load is merely the quantity of working memory resources available to learn and thus is indirectly affected by external sources of information. ↩
Commonly referred to as Internal Developer Platform (IDP). ↩
Big Tech companies are used to having platform teams and internal platforms. For example, Google developed communication protocols (Protobuf) and CI/CD tools (including its own version control — Piper).
But Big Techs aren’t an exception. Big banks like Goldman Sachs and Bradesco have had platforms since the 2000s (at least). These examples predate DevOps. ↩
Do you remember the old discussions about DevOps being a job title or not? The argument against it was that DevOps was a cultural movement and a set of practices (which is correct). Despite that, the market will always capture a trend and transform it into business opportunities. DevOps certifications, DevOps tools, DevOps jobs.
I prefer to stay as far away from them as I can and use more neutral names because I think they communicate better. A DevOps Engineer is an Infrastructure or Cloud Engineer for me. Trends come and go, and now Platform Engineering is on the rise, with many arguing it is not DevOps despite the strong correlation between them. My recommendation is to name things after trends only if you’re a vendor in the related space. ↩
You may be wondering if this kills one of the advantages of a microservices architecture. Indeed, I suggest having a single framework and using a curated set of libraries instead of leaving the teams with open-ended technological choices.
The benefits outlined are amplified with the passing of time due to the constant optimization of this homogeneous set of tools. Also, from the business management perspective, hiring people to work with a smaller set of programming languages is easier, even when not using a mainstream language.
I worked at Nubank in 2019, and they’re huge advocates of Clojure, a functional programming language running on the JVM. While sourcing people with past Clojure experience was practically impossible, people joined the company and were able to develop in Clojure after a few months. All the learning resources were concentrated on Clojure. Also, the service chassis was mature enough to let people use pre-existing building blocks to develop new services. More importantly, the toolset is heavily constrained.
At N26 Brasil, we used a similar strategy (further explained in an upcoming article). We chose Kotlin as our backend programming language and ktor as the base framework for our service chassis. A minority of our Software Engineers had prior experience with the language. But even so, the service chassis helped the development of projects with rich domain models that are easy to navigate due to the sharing of the same package structure. ↩
- Richard Atkinson and Richard Shiffrin, 1986. Human Memory: A Proposed System and its Control Processes
- John Sweller et al., 2019. Cognitive Architecture and Instructional Design: 20 Years Later
- Slava Kalyuga, 2011. Cognitive Load Theory: How Many Types of Load Does It Really Need?
- John Sweller, 2010. Element Interactivity and Intrinsic, Extraneous, and Germane Cognitive Load
- Luca Galante, 2022. What is platform engineering?
- Bill Murphy Jr, 2017. Uber's Simple Math Mistake Will Cost It Tens of Millions of Dollars
- Lav Varshney, 2019. The Deadly Consequences of Rounding Errors
- W. Edwards Deming, 2000. Out of Crisis
- Patrick Debois et al., 2016. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations
- Nigel Kersten et al., 2021. 2021 Puppet State of DevOps Report
- N26 DevOps Metrics
- Manuel Pais and Matthew Skelton, 2019. Team Topologies: Organizing Business and Technology Teams for Fast Flow
- Chris Richardson. Service chassis pattern
- Felipe Hummel, 2020. The value of canonicity
Want to discuss this post? Reach me on Twitter!