Stuff about Software Engineering

Month: November 2024

Gaia: How We Built a Platform to Transform Infrastructure Creation

The Evolution of Infrastructure Management

In Software Engineering in Growth Products at Carlsberg, our journey towards modern infrastructure management began with a familiar challenge: as the number of development teams grew, the traditional approach of manually provisioning and managing infrastructure became a significant bottleneck. The DevOps team, tasked with building and maintaining infrastructure for multiple development teams, found themselves overwhelmed by the increasing demand for infrastructure resources.

Each new project required careful setup of networking, permissions, and cloud resources, all of which had to be manually configured by a small DevOps team. As development velocity increased and more teams came onboard, this model proved unsustainable. New projects faced delays waiting for infrastructure provisioning, while the DevOps team struggled to keep pace with mounting requests.

Reimagining Infrastructure Creation

The solution emerged when the DevOps team envisioned a different approach: what if developers could create their own infrastructure while adhering to organizational standards? The challenge was to enable self-service infrastructure without requiring developers to understand the complexities of building secure, scalable, and compliant cloud resources.

This vision led to the creation of Gaia, a platform that automates infrastructure creation while maintaining strict security and compliance standards. Built by the DevOps team, Gaia represents a fundamental shift in how infrastructure is provisioned and managed at Carlsberg.

The Platform Engineering Approach

Infrastructure as Code Evolution

Gaia elevates infrastructure creation beyond basic scripting by providing a comprehensive platform engineering solution. The platform utilizes Terraform for infrastructure provisioning but abstracts its complexity through a higher-level interface. This approach allows developers to focus on their applications while ensuring infrastructure deployments follow organizational best practices.

Standardized Module Library

The platform provides an extensive library of pre-built, production-ready modules covering the complete spectrum of AWS infrastructure components:

  • Compute Services: EC2, ECS, EKS, Lambda
  • Data Stores: Aurora, RDS, DynamoDB, DocumentDB, Redis, Elasticsearch
  • Networking: VPC, Load Balancers, API Gateway, Route53
  • Security: IAM, ACM, Secrets Manager
  • Messaging: SQS, Kafka
  • Monitoring: CloudWatch, Managed Grafana

Each module encapsulates best practices, security controls, and compliance requirements, ensuring consistent infrastructure deployment across the organization.

Developer Experience

Simplified Workflow

Gaia integrates seamlessly with existing development workflows through GitHub. Developers request infrastructure by:

  1. Creating a configuration file with simple key-value pairs
  2. Submitting a pull request
  3. Awaiting automated validation and deployment

Example configuration for a serverless function with a storage layer and an API Gateway:

The configuration for the API Gateway is very simple:

Once the developer is ready to create the infrastructure, they open a pull request. A “code owner” (in this case a member of the platform engineering team) approves the request, and the infrastructure is deployed automatically.
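The validation step in this flow can be pictured with a small sketch. It is an illustration only, not Gaia’s actual format or code: the `key = value` syntax, the key names (`module`, `name`, `environment`) and the `parse_request` helper are assumptions made for this example. Only the module names are taken from the module library listed earlier.

```python
# Hypothetical sketch: Gaia's real configuration format is not shown in this post.
# Module names come from the module library list; everything else is assumed.
SUPPORTED_MODULES = {
    "ec2", "ecs", "eks", "lambda",
    "aurora", "rds", "dynamodb", "documentdb", "redis", "elasticsearch",
    "vpc", "load_balancer", "api_gateway", "route53",
    "iam", "acm", "secrets_manager",
    "sqs", "kafka",
    "cloudwatch", "managed_grafana",
}
REQUIRED_KEYS = {"module", "name", "environment"}  # assumed keys


def parse_request(text: str) -> dict[str, str]:
    """Parse 'key = value' lines and reject requests that break the assumed rules."""
    config: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key-value pair: {line!r}")
        config[key.strip()] = value.strip()

    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if config["module"] not in SUPPORTED_MODULES:
        raise ValueError(f"unsupported module: {config['module']}")
    return config


request = """
module = lambda
name = orders-api
environment = dev
"""
print(parse_request(request))
```

A check like this would typically run in the pull request pipeline, so that a malformed request fails before a code owner is asked to review it.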

Automated Compliance

The platform automatically enforces organizational standards and security policies. Developers don’t need to worry about:

  • Network configuration
  • Security group settings
  • Access control policies
  • Compliance requirements

All these aspects are handled automatically by Gaia’s pre-configured modules.

Technical Architecture

Terragrunt Integration

Gaia leverages Terragrunt as a wrapper around Terraform to provide enhanced functionality:

  • Automatic variable injection based on environment context
  • Template-based module generation
  • Configuration reuse across environments
  • Simplified state management
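One way to picture “automatic variable injection based on environment context” is as a layered merge: platform-owned defaults are combined with the few values a developer supplies. The sketch below is illustrative only. Terragrunt offers its own mechanisms for this (such as `include` blocks and `inputs`); the environment names, variable names, and the `inject_variables` function here are assumptions, not Terragrunt’s or Gaia’s actual API.

```python
# Hypothetical sketch of environment-based variable injection.
# All environments, variables, and values below are invented for illustration.
ENVIRONMENT_DEFAULTS = {
    "dev": {"region": "eu-west-1", "instance_count": 1, "deletion_protection": False},
    "prod": {"region": "eu-west-1", "instance_count": 3, "deletion_protection": True},
}
# Variables the platform owns; developer input must not override them (assumed policy).
LOCKED_VARIABLES = {"region", "deletion_protection"}


def inject_variables(environment: str, developer_input: dict) -> dict:
    """Merge platform defaults with developer input, keeping locked variables fixed."""
    defaults = ENVIRONMENT_DEFAULTS[environment]
    overrides = {k: v for k, v in developer_input.items() if k not in LOCKED_VARIABLES}
    return {**defaults, **overrides, "environment": environment}


print(inject_variables("prod", {"name": "orders-db", "deletion_protection": False}))
```

In this sketch the developer’s attempt to switch off `deletion_protection` in production is ignored, because that variable is on the locked list. A stricter design could reject the request instead of dropping the value silently.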

Monitoring and Observability

The platform includes native integration with monitoring tools:

  • Automated Datadog dashboard creation
  • Standardized monitoring configurations
  • Built-in health checks and alerts
  • Custom metric collection

Organizational Impact

DevOps Transformation

  • Reduced manual infrastructure work by approximately 80%
  • Shifted focus from repetitive tasks to platform improvements
  • Enabled scaling of development operations without proportional increase in DevOps resources

Development Velocity

  • Eliminated infrastructure provisioning bottlenecks
  • Reduced time-to-deployment for new projects
  • Enabled consistent implementation across teams

Governance and Security

  • Centralized policy enforcement
  • Automated compliance checking
  • Infrastructure drift detection and remediation
  • Standardized security controls

Future Directions

Multi-Cloud Strategy

While currently focused on AWS, Gaia is being extended to support Azure. This expansion presents unique challenges due to fundamental differences in how cloud platforms implement similar services. The team is working to maintain the same simple developer experience while adapting to Azure’s distinct architecture.

Platform Evolution

Planned enhancements include:

  • Enhanced monitoring capabilities
  • Expanded multi-cloud support
  • Deeper integration with development tools
  • Advanced automation features

Conclusion

Gaia represents a successful transformation from traditional DevOps to platform engineering. By providing developers with self-service infrastructure capabilities while maintaining security and compliance, the platform has eliminated a major organizational bottleneck. The success of this approach demonstrates how well-designed abstractions and automation can make infrastructure management accessible to development teams while maintaining enterprise-grade standards.

The platform has fundamentally transformed how Carlsberg manages cloud infrastructure. As cloud infrastructure continues to evolve, Gaia’s modular architecture and focus on developer experience position it well for future adaptations and enhancements. The platform serves as a testament to how modern platform engineering can effectively bridge the gap between development velocity and operational excellence.

Gaia was conceived by https://www.linkedin.com/in/josesganjos/ and built with help from:

Evolve Commerce Club Expert Session #031

I was honoured to be invited to speak at the 31st Expert Session of the Evolve Commerce Club (https://www.evolve-community.com).

We touched on many subjects around software engineering, and mostly on AI.

Thank you Carlos Monteiro and Gustavo Valle for inviting me.

Here are some links to posts that cover some of the topics we spoke about in more detail:

Why 100% Utilization Kills Innovation: The Mathematical Reality

Imagine a highway at 100% capacity. Traffic doesn’t just slow down—it stops completely. A single broken-down car causes massive ripple effects because there’s no buffer space to absorb the variation. This isn’t just an analogy; it’s mathematics. And the same principle explains why running teams at full capacity mathematically guarantees the death of innovation.

The Queue Theory Reality

In 1961, mathematician J.F.C. Kingman published a remarkable result: as utilization approaches 100%, delays grow without bound. His approximation, known as Kingman’s Formula, shows that expected waiting time is proportional to ρ/(1 − ρ), where ρ is utilization, so each step closer to full capacity costs more than the last. Systems operating at full capacity don’t just slow down linearly; they break down dramatically. Hopp and Spearman’s seminal work “Factory Physics” (2000) further established that optimal system performance occurs at around 80% utilization, giving rise to the “80% Rule” in operations management.

This isn’t opinion or management theory—it’s mathematics. When utilization exceeds 80-85%, systems experience:

  • Sharply, nonlinearly increasing delays
  • Inability to handle normal variation
  • Cascading disruptions from small problems
  • Deteriorating performance across all metrics
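Kingman’s approximation makes the shape of this curve concrete. For a single-server queue, the expected wait in the queue is roughly (ρ / (1 − ρ)) × ((c_a² + c_s²) / 2) × τ, where ρ is utilization, c_a and c_s are the coefficients of variation of the arrival and service times, and τ is the mean service time. The sketch below evaluates it at a few utilization levels. Setting c_a = c_s = 1 and measuring the wait in units of one mean service time are assumptions made for illustration.

```python
def kingman_wait(utilization: float, ca: float = 1.0, cs: float = 1.0,
                 mean_service_time: float = 1.0) -> float:
    """Approximate mean time spent waiting in queue for a single-server (G/G/1) queue."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    variability = (ca**2 + cs**2) / 2
    return (utilization / (1 - utilization)) * variability * mean_service_time


for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    print(f"utilization {rho:.0%}: wait ≈ {kingman_wait(rho):.1f} service times")
```

Under these assumptions the wait is about 1 service time at 50% utilization, 4 at 80%, 9 at 90%, 19 at 95%, and 99 at 99%. The last few percentage points of utilization are by far the most expensive.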

The Human System Connection

People and teams are systems too. Just as a machine’s productivity is limited by its operational capacity, humans are constrained by cognitive load. The cognitive load research pioneered by Sweller and Chandler suggests that mental capacity follows a similar pattern: minds at 100% capacity lose the ability to process new information effectively. Just as a fully utilized highway can’t absorb a single additional car, a fully utilized mind can’t absorb new ideas or opportunities.

The implications are profound: innovation requires spare capacity. This isn’t about working less—it’s about maintaining the mental and temporal space required for creative thinking and problem-solving. Studies of innovation consistently show that breakthrough ideas emerge when people have the bandwidth to:

  • Notice unexpected patterns
  • Explore new connections
  • Experiment with different approaches
  • Learn from failures

The Three Horizons Impact

McKinsey’s Three Horizons Framework provides a useful lens for understanding innovation timeframes:

  • Horizon 1: Improving current business
  • Horizon 2: Extending into new areas
  • Horizon 3: Creating transformative opportunities

Here’s where queue theory delivers its killing blow to innovation: At 100% utilization, everything becomes Horizon 1 by mathematical necessity. When a system (human or organizational) operates at full capacity, it can only handle what’s already in the queue. New opportunities, no matter how promising, must wait. Over time, Horizons 2 and 3 don’t just suffer—they become mathematically impossible.

To keep Horizons 2 and 3 viable, companies need to intentionally limit Horizon 1 resource utilization and leave room for creative and exploratory projects.

The Innovation Impossibility

Queue theory proves that running at 100% utilization:

  • Makes delays inevitable
  • Eliminates flexibility
  • Prevents absorption of variation
  • Blocks capacity for new initiatives

Therefore, organizations face a mathematical certainty: maintain 100% utilization or maintain innovation capability. You cannot have both. This isn’t a management choice or cultural issue—it’s as fundamental as gravity.

The solution isn’t working less—it’s working smarter. Just as highways need buffer capacity to function effectively, organizations need spare capacity to innovate. The 80% rule isn’t about reduced output; it’s about maintaining the space required for sustainable performance and growth.

The choice is clear: accept the mathematical reality that innovation requires spare capacity, or continue pushing for 100% utilization while wondering why transformative innovation never seems to happen.

References:

  • Kingman, J.F.C. (1961). “The Single Server Queue in Heavy Traffic”
  • Hopp, W.J., & Spearman, M.L. (2000). “Factory Physics”
  • McKinsey & Company. “Three Horizons of Growth”
  • Chandler, P., & Sweller, J. (1991). “Cognitive Load Theory and the Format of Instruction”

Beyond DevOps: The Rise of Full-Stack Platform Engineering

The Evolution of Infrastructure Management

DevOps promised to bridge the gap between development and operations, aiming to deliver infrastructure faster and more efficiently. However, in many organizations, the reality often fell short of this ideal. DevOps frequently became a practice where operations teams learned to script infrastructure without fully embracing key software engineering principles. It became more about scripting than true engineering.

The Need for a Higher Abstraction

As infrastructure needs grew more complex, it became clear that traditional DevOps approaches were not scaling effectively. Tools like Terraform, while powerful, often proved to be terse and not particularly developer-friendly. They got the job done, but they weren’t providing the streamlined experience that developers needed. A new approach was necessary – one that would raise the level of abstraction and make infrastructure more accessible.

The Golden Path as a Product

Enter the concept of the “golden path” – a set of pre-built, standardized infrastructure solutions that developers can easily use and customize. This approach treats infrastructure as a product, designed with the end-user – the developer – in mind. 

The golden path isn’t just a set of scripts or configurations; it’s a carefully crafted product that encapsulates best practices, security considerations, and organizational policies. It automates infrastructure creation while maintaining alignment with company standards, allowing developers to provision cloud resources without needing to worry about governance, security, or configuration inconsistencies.

Raising the Abstraction Level

To understand the significance of this shift, consider this analogy: Terraform, while powerful, is often like the assembly language of infrastructure. Platform engineering, and the golden path approach, is about raising that abstraction, creating reusable and maintainable infrastructure solutions that developers can work with seamlessly. 

Just as high-level programming languages made software development more accessible and efficient compared to assembly language, the golden path aims to do the same for infrastructure management. By creating higher-level abstractions, we’re making infrastructure more understandable, manageable, and aligned with modern software development practices.

The Role of Full-Stack Platform Engineers

This new approach requires a new kind of professional: the full-stack platform engineer. These engineers think like developers while solving infrastructure challenges. They build scalable, reliable, and developer-friendly infrastructure that empowers teams.

Full-stack platform engineers focus on creating robust, scalable infrastructure solutions that directly support business needs, rather than getting bogged down in low-level configuration details. They apply the same rigor expected in software development to infrastructure design, treating infrastructure truly as code.

Enhancing Developer Experience and Security

The golden path approach significantly enhances the developer experience. By integrating infrastructure provisioning directly into familiar development workflows (like those in GitHub), it allows developers to request and manage infrastructure as part of their normal process, without delays or context switching.

This approach also allows for the seamless integration of security practices. By baking security considerations into the golden path from the start, organizations can shift security left in the development process, addressing vulnerabilities at their source without compromising developer productivity.

A New Era of Infrastructure Management

The rise of full-stack platform engineering and the golden path approach represents a significant evolution in how we think about and manage infrastructure. It’s not just DevOps 2.0; it’s a fundamental shift in mindset that treats infrastructure as a product designed for developer success.

By raising the abstraction level, applying software engineering principles to infrastructure, and focusing on creating reusable, maintainable solutions, this approach promises to make infrastructure more accessible, secure, and aligned with modern development practices. As organizations continue to grapple with increasing complexity, the golden path offers a way forward – empowering developers, enhancing security, and ultimately accelerating innovation.

At Carlsberg, this approach has been embodied in Gaia, our golden path platform built by full-stack platform engineers. Gaia exemplifies how treating infrastructure as a product can transform development processes, making them more efficient and developer-friendly. It stands as a testament to the power of full-stack platform engineering in creating solutions that truly serve the needs of modern development teams.

As more organizations embrace this shift, we can expect to see a new landscape of infrastructure management emerge – one where the golden path, crafted by skilled full-stack platform engineers, leads the way to more innovative, secure, and efficient software development practices.

© 2024 Peter Birkholm-Buch
