Peter Birkholm-Buch

Stuff about Software Engineering

The Evolution of AI: From Frontier Models to Specialized Small Language Models

Where We Came From: The Frontier Model Plateau

Over the past 12–18 months, the large language model (LLM) ecosystem has continued to advance—but largely in an incremental, not disruptive, fashion. Models from OpenAI, Anthropic, and Google have steadily improved across reasoning, multimodality, and scientific benchmarks, yet the relative ordering and qualitative capabilities have remained broadly stable.

Public benchmark suites such as MMLU (Massive Multitask Language Understanding), GPQA (Graduate‑Level Google‑Proof Q&A), and HELM (Stanford Holistic Evaluation of Language Models) show year‑over‑year gains measured in percentage points rather than step‑function breakthroughs. This is not a criticism—these are remarkable systems—but it does indicate a phase of maturation rather than rupture. Frontier models are converging: better, more reliable, more general—but not fundamentally different.

For scientific research, this means frontier GenAI has become a dependable horizontal capability: excellent for literature synthesis, reasoning assistance, explanation, and orchestration—but no longer the sole locus of rapid innovation.

Where We Are Now: The Rise of Small and Specialized Models

In parallel, a very different dynamic is unfolding.

Small Language Models (SLMs) and domain‑specific foundation models are advancing rapidly, particularly in scientific domains such as genomics, protein science, chemistry, and materials research. These models fall broadly into two categories:

  1. Domain‑adapted language models – smaller LLMs fine‑tuned on specific scientific corpora (e.g. chemistry, biology, materials science).
  2. Non‑linguistic foundation models – transformer‑based models trained on alternative “languages” such as DNA, protein sequences, or molecular graphs (e.g. Evo2, ESM, AlphaFold‑class models).

These models are not generalists—and that is precisely their strength. They encode deep inductive bias for their domain, deliver strong signal from sparse data, and increasingly outperform general LLMs on narrowly scoped scientific tasks.

Critically, most of these models do not fit the SaaS GenAI paradigm. They are rarely available via Azure AI Foundry, Amazon Bedrock, or similar managed services. Running them typically requires:

  • Dedicated GPU infrastructure (often NVIDIA‑specific)
  • Local fine‑tuning or adaptation
  • Tight coupling to data and experimental context

This creates a structural mismatch between where scientific model innovation is happening and where traditional enterprise AI platforms operate.

External Validation: SLMs as First-Class Scientific Tools

Recent academic work explicitly supports this shift toward small, specialized models. A 2025 paper, “SLMs as Scientific Tools” (arXiv:2512.15943), argues that capability in scientific AI is task-relative rather than size-relative. The authors show that domain-specialized SLMs can match or outperform frontier LLMs on constrained scientific tasks when correctness, structure, and tool integration matter more than linguistic breadth.

Several conclusions from the paper closely align with CRL’s direction:

  • Inference locality beats central intelligence: running models close to data improves latency, reproducibility, validation, and cost control—supporting local, HPC-adjacent, and desk-side deployment.
  • SLMs scale scientifically, not just economically: smaller models are easier to interpret, benchmark, and falsify—critical properties for hypothesis generation and experimental decision-making.
  • Tool integration matters more than prompt engineering: structured inputs and deterministic tool calls outperform free-form prompting in scientific workflows.

The paper ultimately reinforces a hybrid architectural stance: LLMs orchestrate; SLMs execute. This provides external academic validation that SLMs are not a compromise, but the correct abstraction for scientific computing.

A Practical Shift: From Cloud‑Only to Desk‑Side AI

This is where a meaningful, practical shift is occurring.

With the arrival of systems such as NVIDIA DGX Spark, small language models become physically accessible to individual researchers. Instead of renting over‑provisioned H100 or Grace‑Blackwell cloud instances, scientists can:

  • Run and fine‑tune SLMs locally
  • Experiment rapidly without cloud friction or cost surprises
  • Work directly with models that are otherwise unavailable as managed services

In effect, this enables a “small model on every scientist’s desk” paradigm. The value is not raw scale, but immediacy, ownership, and experimentation velocity.
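
To make this concrete, the sketch below loads and queries a small model locally with the open-source Hugging Face transformers library. The model identifier is a hypothetical placeholder, not a real checkpoint; substitute whichever open-weight domain model applies.

```python
# Minimal desk-side inference sketch using Hugging Face transformers.
# "example-org/small-science-model" is a placeholder, not a real model id.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "example-org/small-science-model"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

prompt = "Summarize the likely effect of this promoter variant:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding keeps the run deterministic and easy to reproduce.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```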

At CRL, this aligns tightly with how scientific progress actually happens: iterative, exploratory, domain‑specific, and data‑proximate.

Looking Toward 2026: A Hybrid, Orchestrated Future

Looking ahead—without making speculative predictions—the most plausible trajectory is not LLMs versus SLMs, but LLMs plus SLMs.

A likely pattern is:

  • Frontier LLMs acting as generalist reasoning, planning, and orchestration layers
  • Specialized small models performing high‑fidelity domain work (genomics, proteins, chemistry, simulation)
  • Tool‑ and model‑calling as the primary integration mechanism

In this model, the LLM does not replace scientific models—it coordinates them. It becomes the interface and glue, while the real scientific signal is generated by specialized systems running locally or on targeted infrastructure.
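
A minimal sketch of that coordination pattern follows. Both tool functions are hypothetical stand-ins for real specialist models; the point is the shape: the generalist produces a plan, the specialists produce the signal.

```python
# Sketch of "generalist orchestrates, specialists execute".
# Both tool functions are hypothetical stand-ins for real domain models.

def predict_structure(sequence: str) -> dict:
    # Stand-in for a protein model (e.g. an ESM- or AlphaFold-class tool).
    return {"tool": "predict_structure", "input": sequence, "result": "stub"}

def annotate_variant(variant: str) -> dict:
    # Stand-in for a locally hosted, domain-adapted SLM.
    return {"tool": "annotate_variant", "input": variant, "result": "stub"}

TOOLS = {
    "predict_structure": predict_structure,
    "annotate_variant": annotate_variant,
}

def orchestrate(plan: list[dict]) -> list[dict]:
    # The generalist LLM's job is to produce the plan: which specialist
    # to call, with what input, in what order.
    return [TOOLS[step["tool"]](step["input"]) for step in plan]

# Example plan, shaped like the output of an LLM tool-calling layer.
print(orchestrate([
    {"tool": "annotate_variant", "input": "chr2:g.123A>T"},
    {"tool": "predict_structure", "input": "MKTAYIAKQR"},
]))
```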

This is not speculative technology. The building blocks already exist:

  • Tool‑calling and agent frameworks
  • Domain foundation models
  • Local GPU systems capable of running serious scientific workloads

What changes in 2026 is not the theory, but the accessibility.

Summary

  • Frontier LLMs are improving steadily, but incrementally
  • Scientific innovation is accelerating fastest in small, specialized models
  • These models do not fit cloud‑only GenAI platforms
  • Desk‑side systems like DGX Spark make SLMs practically accessible
  • The near‑term future is hybrid: generalist orchestration + specialist execution

Appendix: The Emerging Scientific SLM Ecosystem (snapshot as of 2026-01-21)

| Vendor / Origin | Domain Focus | Representative Models | Typical Scientific Use Cases |
| --- | --- | --- | --- |
| NVIDIA | Biology, Chemistry, Climate | BioNeMo, ChemGPT, MegaMolBART, FourCastNet | Molecule generation, QSAR, virtual screening, protein design, weather & climate modeling |
| DeepMind | High-impact scientific modeling | AlphaFold 3, GraphCast | Protein structure prediction, climate forecasting, large-scale simulation |
| Meta | Proteins, Scientific Literature | ESMFold, ProtBERT, SciBERT | Protein folding, sequence modeling, scientific text analysis |
| Arc Institute / Profluent | DNA & Protein Design | Evo2, E1 | DNA sequence design, protein design, strain optimization |
| Academic & Research Consortia | Genomics, Materials Science | OpenFold, MaterialsBERT, MatSciBERT | Crystal property prediction, materials discovery |
| Emerging Vendors | Supply Chain & Optimization | SCGPT, Logistics-LLaMA, OR-LLM | Demand forecasting, route optimization, constraint planning |

Notes

  • Most models listed above are open, open‑weight, or research‑licensed, and evolve in close collaboration with the scientific community.
  • The ecosystem is interoperable and tool‑oriented, designed to be embedded into pipelines rather than accessed via chat interfaces.
  • In contrast, enterprise GenAI platforms primarily target closed, managed, productivity‑oriented workloads.
  • NVIDIA’s role is increasingly that of a horizontal scientific AI platform provider, spanning models, tooling, and local compute rather than acting as a single‑model vendor.
  • Unlike enterprise GenAI platforms, which are predominantly closed and productivity-oriented, the scientific SLM ecosystem is characterized by open models, research licensing, and composability—properties that align naturally with exploratory research environments such as CRL.

SpecOps: A Middle Layer for Verifiable AI-Assisted Software Engineering


Abstract

AI-assisted development has increased implementation throughput but not correctness. This paper introduces SpecOps, a middle-layer approach that bridges informal specifications and formal models. SpecOps defines a machine-interpretable, executable specification layer that constrains AI agents, enables continuous conformance checking, and integrates with modern development workflows. It positions specifications as operational artifacts—compiled into tests, policies, and governance—rather than static documentation.


Motivation

Current systems optimize for:

  • generating code
  • refining code
  • reviewing code

But not for:

  • defining correctness

This leads to a structural mismatch: high-capability implementation systems operating on low-fidelity intent.


Concept: SpecOps

Intent → Structured Spec → Executable Constraints → Implementation → Continuous Validation

Key idea:
Specifications are not read—they are executed.


Positioning

SpecOps sits between:

  • informal specs (natural language, SpecKit)
  • formal methods (RAISE, TLA+, Z)

It provides:

  • more structure than the former
  • more usability than the latter

Core Principles

Constrained Semantics

Specifications must be structured enough to eliminate ambiguity.

Executability

All elements of a spec must compile into something testable or enforceable.

Continuous Conformance

Validation is not a phase—it is enforced on every change.

Traceability by Construction

Every implementation artifact must link back to a spec element.

Bounded Solution Space

Agents operate within constraints, not open-ended search.


Architecture Overview

SpecOps consists of four layers:

Specification Layer

Defines:

  • use cases
  • invariants
  • non-functional constraints

Compilation Layer

Transforms spec into:

  • tests
  • contract checks
  • policy rules

Implementation Layer

AI agents and developers generate code within constraints.

Governance Layer

Continuously enforces:

  • spec conformance
  • traceability
  • drift detection
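
To illustrate the compilation idea end to end, here is a minimal sketch in which one spec element is compiled into an executable conformance test. The dict-based spec schema is an assumption for illustration, not a defined SpecOps format.

```python
# Sketch: one spec element compiled into an executable check.
# The spec schema below is hypothetical.

SPEC = {
    "id": "UC-42",
    "invariant": "balance is never negative",
    "check": lambda state: state["balance"] >= 0,
}

def compile_to_test(spec: dict):
    """Compile a spec element into a callable conformance test."""
    def test(state: dict) -> bool:
        ok = spec["check"](state)
        # Traceability by construction: every result carries a spec id.
        print(f"{spec['id']}: {'pass' if ok else 'FAIL'} ({spec['invariant']})")
        return ok
    return test

test_balance = compile_to_test(SPEC)
test_balance({"balance": 100})  # UC-42: pass
test_balance({"balance": -5})   # UC-42: FAIL
```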

Operational Model

For each change (the governance gate in steps 4 and 5 is sketched after this list):

  1. Spec is updated or referenced
  2. Compilation layer regenerates constraints
  3. Implementation is produced or modified
  4. Governance layer evaluates:
    • test outcomes
    • invariant satisfaction
    • traceability completeness
  5. Change is accepted, rejected, or flagged
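
A minimal sketch of that gate, with a hypothetical report structure standing in for whatever the compilation and test layers would actually emit:

```python
# Sketch of the accept/reject/flag decision in steps 4-5.
# ChangeReport is a hypothetical structure for illustration.
from dataclasses import dataclass, field

@dataclass
class ChangeReport:
    tests_passed: bool
    invariants_satisfied: bool
    untraced_artifacts: list = field(default_factory=list)

def evaluate(report: ChangeReport) -> str:
    if not (report.tests_passed and report.invariants_satisfied):
        return "rejected"
    if report.untraced_artifacts:
        # Conformant but incompletely traced: surface for human review.
        return "flagged"
    return "accepted"

print(evaluate(ChangeReport(True, True)))              # accepted
print(evaluate(ChangeReport(True, True, ["util.py"]))) # flagged
print(evaluate(ChangeReport(False, True)))             # rejected
```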

Role of AI Agents

  • implement within constraints
  • propose spec refinements (subject to governance)
  • cannot bypass invariants or policies

Agents are no longer decision-makers on correctness—only executors.


Relationship to Existing Methods

  • extends SpecKit with formal structure and enforcement
  • adapts ideas from RUP/UML into executable artifacts
  • retains compatibility with CI/CD and GitHub workflows
  • avoids the complexity barrier of full formal methods

Expected Outcomes

  • reduced implementation drift
  • higher alignment between intent and system behavior
  • less reliance on post-hoc review
  • more predictable delivery through integrated estimation


Conclusion

SpecOps reframes software development as a constrained synthesis problem rather than an open-ended search process. By making specifications executable and continuously enforced, it aligns AI-assisted implementation with formally defined intent.


Skills as a Supply Chain Risk

We’ve Seen This Before

We’ve been here before. First with open source packages, then CI/CD, then infrastructure-as-code. Each time we optimized for speed and reuse, and only later realized the real risk wasn’t what we built, but what we pulled in.

Now it’s happening again. This time with “skills.”

Skills Are a Supply Chain

Skills are emerging as reusable units in the AI stack—installable capabilities executed by agents with access to tools, data, and decisions.

They can contain code. Which means the moment you install and execute them, you’ve created a supply chain.

Early Evidence, Familiar Patterns

A recent large-scale study analyzed more than 238,000 skills across marketplaces and GitHub and found a measurable fraction to be malicious [1]. The numbers are not dramatic, but they are real. Roughly half a percent of skills were confirmed malicious after filtering noise.

More importantly, the attack patterns are familiar. The same study identifies hijacking of skills hosted in abandoned GitHub repositories as an active attack vector [1].

In other words, this is not new risk. It is old risk in a new place.

The Difference Is Execution

What is new is how these components run.

Skills are not just libraries sitting in your build. They are instructions plus executable code, often running with the same privileges as the agent invoking them, and selected dynamically at runtime [2].

That changes the boundary. You are no longer just managing dependencies. You are allowing a system to choose and execute code on your behalf.

Why This Matters

Traditional controls assume stable systems: known dependencies, predictable execution paths, and validation at build time.

That model breaks here.

When selection is dynamic and execution happens at runtime, static analysis and dependency scanning still help—but they no longer describe the system you are actually running. Broader studies of the ecosystem already show a significant portion of skills contain security weaknesses, including supply chain-style vulnerabilities and privilege escalation paths [3].

This Is Still Fixable

None of this requires new principles.

Treat skills as untrusted code.

  • Use only skills from trusted sources with security code scanning
  • Limit what agents can do by default
  • Isolate execution
  • Require provenance
  • Observe behavior at runtime

This is just software engineering discipline applied at the right boundary.
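
As a sketch of that discipline at the install boundary: the manifest format and allowlist below are assumptions for illustration, and a real deployment would add signature verification, sandboxed execution, and runtime monitoring on top.

```python
# Sketch of a pre-install gate for agent skills.
# The manifest format and trusted-source list are hypothetical.

TRUSTED_SOURCES = {"github.com/your-org"}
REQUIRED_FIELDS = {"name", "source", "checksum", "permissions"}

def vet_skill(manifest: dict) -> bool:
    """Reject skills that lack provenance or ask for broad privileges."""
    if not REQUIRED_FIELDS <= manifest.keys():
        return False  # no provenance, no install
    if not any(manifest["source"].startswith(s) for s in TRUSTED_SOURCES):
        return False  # untrusted origin
    # Least privilege by default: deny shell and network access.
    return not ({"shell", "network"} & set(manifest["permissions"]))

print(vet_skill({
    "name": "summarize-pdf",
    "source": "github.com/your-org/skills",
    "checksum": "sha256:...",
    "permissions": ["filesystem:read"],
}))  # True
```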

Final Thought

Skills are not just features; they are code executing on your behalf.

We’ve learned how to manage this before. The only question is how quickly we apply those lessons this time.

References

[1] Malicious or Not: Measuring the Security of Agent Skill Ecosystems. https://doi.org/10.48550/arXiv.2603.16572

[2] Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study. https://doi.org/10.48550/arXiv.2602.06547

[3] Agent Skills in the Wild: Vulnerabilities and Supply Chain Risks. https://doi.org/10.48550/arXiv.2601.10338

[4] On the Security of LLM Agents: Prompt Injection and Skill-Based Attacks. https://doi.org/10.48550/arXiv.2602.20156

When the Model Breaks

Introduction

Over the past year, I’ve written three posts that—at the time—felt consistent.

First, I described four categories of AI solutions, arguing that complexity determines where AI works. Then I introduced the trade-off between speed and precision, where fast systems are imprecise and precise systems are slow.

Both were true at the time.

Lastly, I introduced the Wiggum Loop, which argues that institutional memory is useless.

The original model

The underlying assumption in the two first posts was simple. AI is most effective when problems are well-bounded, precision requirements are low, and iteration costs are small. It struggles when precision is critical, domain knowledge is deep, and errors are expensive. In other words, AI accelerates simple work, while humans remain essential for complex work.

The crack in the model

The Wiggum Loop challenges that assumption. If solutions can be reached through repeated iteration rather than upfront understanding, then precision is no longer a prerequisite—it becomes something you converge on. This changes the equation. Complexity no longer blocks AI in the same way; it simply increases the number of iterations required.

From capability to convergence

The original model was about capability—what AI can do well. The emerging model is about convergence—how quickly a system can explore the solution space and arrive at something that works. Once iteration is cheap and automated, the constraint shifts. It is no longer about whether we can solve a problem, but whether we can recognize when it has been solved.

Reinterpreting the three posts

Seen together, the three posts describe a transition from a model about capability to a model about convergence.

The model does not disappear—it shifts.

The new boundary

The real boundary is no longer complexity or precision. It is whether a problem can be expressed in a way that supports iteration. That requires a clearly defined outcome, explicit constraints, and a way to evaluate results. If those exist, iteration can often replace deep understanding; if they do not, it cannot.

This does not remove expertise—it relocates it. The hard part is no longer solving the problem directly, but defining what success looks like, encoding the right constraints, and deciding how results are evaluated.

What this means for organizations

This is not just a technical shift—it changes how organizations create value. Historically, value came from expertise, experience, and accumulated knowledge. Increasingly, it comes from defining problems clearly, encoding constraints explicitly, and running and governing iterative systems. The center of gravity moves.

The uncomfortable alignment

Taken together, the three posts lead to a slightly uncomfortable conclusion. Much of what we treat as essential organizational knowledge is actually context-bound constraint—decisions made under conditions that no longer apply.

If iteration can rediscover solutions faster than we can recall them, then memory becomes less valuable than exploration. That has consequences. Expertise shifts from knowing answers to defining problems and constraints. Institutional memory becomes less of an authority and more of a hypothesis archive—useful, but not decisive. Roles built around recall and experience start to erode, while roles focused on framing, validation, and governance become more central.

This does not remove humans, but it changes what humans are for—from remembering why things failed to defining what success looks like.

Where this leaves us

The original model still holds, but it is no longer the full picture. AI is not just a tool for solving known problems faster—it is becoming a system for exploring unknown solutions through iteration.

There is a subtle tension here. This trilogy itself depends on cumulative understanding, where each post builds on the last—a small act of institutional memory arguing against institutional memory. Exploration does not replace memory entirely; it changes what kind of memory matters. Constraint-memory becomes less valuable, while model-building and interpretation become more important.

Final thought

We started by asking where AI works. We then asked how precise it needs to be. The emerging question is different: how fast can we iterate—and how well can we recognize success?

That is the thread connecting all three posts, and it is where the model begins to break.

The Wiggum Loop: Brute-Forcing Business with AI

What if persistence beats knowledge?

We’ve spent decades optimizing how organizations think. We built processes, governance structures, architecture reviews, and layers of institutional knowledge. Entire careers are built on knowing why something won’t work.

But what if the fastest path to solving a problem is no longer thinking harder—but trying more? Not smarter. Not deeper. Just… more.

This pattern—often referred to as the Ralph Wiggum loop in AI coding circles—is already well established (https://www.leanware.co/insights/ralph-wiggum-ai-coding). What’s interesting is not the name, but what happens when we apply the same idea outside of coding.

The shift: from knowing to looping

AI coding agents, orchestration platforms, and cheap, elastic compute have changed the economics of problem solving. What used to require deep domain expertise and careful design can now be approached differently. Instead of relying on understanding upfront, we can define the outcome, set guardrails—legal, ethical, and architectural—let agents iterate, and then select what works. This can be repeated at scale.

It is already visible in modern coding workflows, where agents generate, test, and refine code in loops, where skills and tools extend capabilities dynamically, and where tasks can be scheduled, retried, and recomposed. We are no longer limited by how fast we can think, but by how fast we can iterate.

The Wiggum Loop

Named after Ralph Wiggum from The Simpsons, this approach embraces a simple idea:

Try. Fail. Try again. Repeat until something works.

At scale, this stops being naive and starts becoming powerful.

Because the world changes. What failed before may succeed now as technology evolves, constraints shift, data improves, costs drop, and interfaces change. Organizational memory often encodes past constraints as permanent truths, but the Wiggum Loop ignores that and re-attempts relentlessly.

Removing the wrong human from the loop

This is not about removing humans entirely. It is about removing a specific role humans play in organizations—the carrier of historical constraints.

This is the person who says, “We’ve tried that before.” In many cases, that statement is technically correct and strategically wrong.

The Wiggum Loop removes this layer from execution. Humans define the goal and the boundaries, while machines explore the solution space. Humans still decide, but they no longer prematurely constrain.

From knowledge-driven to search-driven organizations

Traditionally, organizations solve problems by gathering expertise, modeling the problem, designing the solution, and then executing.

The Wiggum Loop flips this. Instead, we define the outcome, encode constraints—a kind of “constitution”—generate and test many solutions, and keep what works.

This represents a shift from knowledge-driven systems to search-driven systems. Where knowledge is incomplete or outdated, search wins.

When search beats knowledge—and when it doesn’t

This only works under specific conditions.

Search dominates when outcomes are testable, feedback loops are fast, and failures are cheap or reversible. This describes a large portion of business problems—optimization, configuration, planning, and software-enabled processes.

But the loop breaks when failures are silent or slow, when consequences are irreversible, or when correctness cannot be evaluated. In these cases, iteration can outrun detection, and brute force becomes risk.

The point is not that knowledge disappears. It is that in many domains, it is no longer the primary constraint.

Why this is suddenly viable

Three things have changed at the same time.

  1. Agents can act. They do not just generate outputs but can execute, test, retry, and adapt.
  2. Loops are native, meaning iterative workflows can be run programmatically rather than manually.
  3. Compute is cheap enough that brute force is no longer absurd—it is often practical.

Together, these changes enable systematic, automated exploration of solution spaces at scale.

A practical example: procurement

Consider procurement. Traditionally, sourcing decisions rely heavily on experience, supplier relationships, and historical outcomes, which also means they inherit historical biases and constraints.

Now imagine a Wiggum Loop approach. The objective is defined in terms of cost, reliability, sustainability, and risk. Constraints such as contracts, regulations, and policies are encoded. Agents then explore supplier combinations, simulate scenarios, generate negotiation strategies, and rerun the process with variations.

This results in thousands of iterations, where most will be wrong, but some will be better than anything previously attempted. Crucially, no one needs to remember why something didn’t work in 2018.
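
A toy sketch of such a loop is shown below. Suppliers, the constraint, and the objective are illustrative stand-ins; in practice the evaluation would be a simulation or a real scoring model.

```python
# Toy sketch of a Wiggum-style search over sourcing options.
import itertools

SUPPLIERS = ["A", "B", "C", "D"]

def allowed(combo: tuple) -> bool:
    # The encoded "constitution": e.g. policy forbids single-sourcing.
    return len(combo) >= 2

def score(combo: tuple) -> float:
    # Stand-in for a real cost/reliability/sustainability evaluation.
    return len(combo) - (2.0 if "D" in combo else 0.0)

best, best_score = None, float("-inf")
for size in range(1, len(SUPPLIERS) + 1):
    for combo in itertools.combinations(SUPPLIERS, size):
        if not allowed(combo):
            continue                    # constraints filter, not memory
        s = score(combo)
        if s > best_score:              # keep what works, discard the rest
            best, best_score = combo, s

print("best combination:", best)        # ('A', 'B', 'C')
```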

Governance without paralysis

This approach only works if guardrails are explicit—and this is the hard part.

Think of it as a constitution that defines what is allowed, what is forbidden, and what must be optimized. Instead of embedding constraints in people, we embed them in systems.

In practice, this means turning intent into executable constraints—tests, policies, specifications, and evaluation criteria that can be applied automatically at scale. We are early in this transition, and most organizations are not yet good at it.

Without this, the loop becomes chaos. With it, the loop becomes power.

The uncomfortable implication

If this works, it challenges something fundamental: how much of organizational value is knowledge, and how much is inertia?

A significant portion of what we call “knowledge” is accumulated constraint—decisions made under conditions that no longer apply. When those constraints are encoded in people, they persist long after the world has changed.

If problems can be solved through clear intent, explicit constraints, and massive iteration, then much of that embedded knowledge becomes optional.

This does not remove humans, but it changes what humans are for—from remembering why things failed, to defining what success looks like.

So the real question is not technical

We already have agents, loops, orchestration, and compute.

The real question is cultural: do we have the courage to try again? To ignore “we’ve done that before,” to let systems explore without prematurely shutting them down, and to trust iteration over intuition—at least long enough to see what emerges.

Final thought

The Wiggum Loop is not about being careless. It is about being relentless in a changing world.

And maybe—just maybe—the organizations that win won’t be the ones that know the most, but the ones that search the best.

From Roles to Work: What Each IT Architect Actually Does

Introduction

In a previous post (Different Roles and Responsibilities for an IT Architect), I outlined the different roles in architecture. The natural next question is: what work actually sits with each role?

This is where I see organizational struggle—not because roles are unclear, but because the work boundaries are.

A useful lens here comes from Svyatoslav Kotusev’s The Practice of Enterprise Architecture, where architecture is described not as a set of roles, but as practices operating at different levels of the organization.

What follows is a practical way to make that explicit.

Note: In my previous post I also included Infrastructure Architects. They are intentionally left out here to keep the focus on how application and solution-level architecture work is split. Infrastructure Architecture operates with similar principles, but across platform and environment concerns.

The Core Principle

For clarity on naming:

  • Enterprise Architect (EA)
  • Domain Architect (DA) — equivalent to what many organizations call Solution Architect
  • Software Architect (SA) — equivalent to Tech Lead

The SA abbreviation is overloaded in many organizations, so in this post SA refers to Software Architect, not Solution Architect.

Each role operates on a different level of abstraction and time horizon:

  • Enterprise Architecture (EA) → direction and constraints  — Sets business-driven direction and guardrails that shape all downstream decisions.
  • Domain Architecture (DA) → alignment and structure  — Translates direction into coherent structures and boundaries across a business area.
  • Software Architecture (SA) → design and execution  — Turns structures into concrete, implementable systems and makes final design decisions.

Enterprise is horizontal across the organization (cross-cutting capabilities, standards, and direction), while Domain/Software is vertical (aligned to specific business areas and initiatives).

Examples:

  • Enterprise looks at things like Customer Management, Product Management, Order Management, Finance, or Supply Chain across all business areas.
  • Domain Architects work within a specific area or initiative and ensure systems in that context fit together.
  • Software Architects decide on software architecture implementation patterns.

If those are confused, enterprise architects turn into domain or software architects—and everything fragments.

Enterprise Architect — The Direction Layer

This layer focuses on business-driven direction and constraints.

Primary work:

  • Define architectural principles and guardrails
  • Align architecture with business strategy and operating model
  • Set direction based on business capabilities and needs
  • Establish governance and decision frameworks

Artifacts:

  • Principles
  • Target architecture (at capability level — e.g. Customer Management, Product Management, Order Management, Finance, or Supply Chain as cross-cutting business capabilities shared across the organization — not specific systems or tools)
  • Strategic direction

What it’s not:

  • Deciding architectural styles (e.g. event-driven vs request/response)
  • Choosing integration patterns or technologies
  • Designing systems or interactions
  • Translating direction into technical solutions

Enterprise architecture answers why and in which direction, not how.

Domain Architect — The Alignment, Design, and Execution Layer

This is where architecture becomes concrete.

Primary work:

  • Shape how business capabilities are realized across systems in a given domain or initiative
  • Ensure consistency and coherence across solutions
  • Design the solution end-to-end
  • Translate enterprise direction into a working architecture
  • Make concrete design choices (e.g. event-driven vs request/response)
  • Define APIs, data flows, and interactions
  • Make trade-offs under real constraints
  • Ensure compliance with standards and principles

This is where architectural intent meets real delivery and must align with defined rules and processes.

Artifacts:

  • Solution designs
  • Architecture decision records
  • Reference patterns (within the context of the domain/initiative)

What it’s not:

  • Defining enterprise-wide principles
  • Working purely at strategy level without delivery responsibility
  • Escalating every decision upward

This is the level where decisions like event-driven vs request/response, Kafka vs REST, data ownership, and consistency models are actually made.

Software Architect — The Reality Check

This is where architecture meets code.

Primary work:

  • Translate architecture into implementation
  • Own technical quality and execution
  • Challenge designs based on reality
  • Ensure operability

What it’s not:

  • Redefining architecture because it’s inconvenient
  • Ignoring constraints set at higher levels
  • Acting only as a senior developer

How the Work Connects

  1. Enterprise (EA) defines direction
  2. Domain (DA) shapes, designs, and makes decisions
  3. Software Architect (SA) ensures it works in practice

The key is that decisions are made at the lowest responsible level.

If Enterprise work is not protected, it will collapse into Domain work.

Final Thought

Architecture breaks down when decisions are made at the wrong level:

  • If enterprise architects decide on Kafka, you lose flexibility.
  • If solution architects define enterprise principles, you lose coherence.

Kotusev’s point is simple: architecture is a system of practices and the value comes from keeping those practices separate—and connected.

AI doesn’t create advantage, distribution does

Introduction

In the early 20th century, factories did not gain much by simply replacing steam engines with electric motors. The real gains came later, when they reorganized how work was done—redesigning layouts, workflows, and roles to take advantage of distributed power [9]. AI is following the same pattern. The technology itself is not the differentiator. How it is distributed inside the organization is.

Across domains, the pattern is already visible.

In scientific research, systems like AlphaFold and other AI models in biology and chemistry are shifting the frontier of what good looks like. Researchers who integrate these tools into their workflows move faster, explore more hypotheses, and expand output. Others are not just slower—they are operating below a moving baseline.

In software engineering, the dynamic is different but related. AI compresses the time required to produce code, but also compresses the time required to produce failure. Teams that combine strong engineering practices with AI accelerate safely. Teams that rely on generated output without discipline introduce risk at speed.

In both cases, the effect is not uniform improvement. It is divergence.

The hidden failure mode: uneven distribution

What is emerging is not a lack of AI capability, but an uneven distribution of it.

Some individuals and teams gain early access, experiment, and build fluency through use. Others wait for guidance, are constrained by governance, or never fully integrate AI into how they work. Over time, this creates a gap in capability that compounds.

This is where the A and B teams begin to appear—not as a deliberate strategy, but as a consequence of how access, learning, and incentives are structured.

AI literacy beats AI elites

Organizations that scale AI successfully distribute capability rather than concentrate it [1][2].

When AI is centralized, teams depend on specialists. Demand exceeds capacity, and most of the organization remains passive. When capability is distributed, teams solve problems locally, and learning happens through application rather than instruction.

McKinsey consistently finds that only a minority of companies capture meaningful value from AI, and those that do embed it across functions rather than isolating it [1]. Experimental evidence reinforces that productivity gains depend on how individuals integrate AI into their work, not just whether they have access to it [11][12].

The constraint is not the model. It is whether people know how to use it effectively in context.

The Center of Excellence trap

The default enterprise response is to centralize AI into a Center of Excellence. This improves oversight and consistency, but it also creates a structural bottleneck. Every team now depends on a central unit for access, prioritization, and delivery, which does not scale with demand.

More importantly, it concentrates knowledge. Patterns, practices, and hard-won lessons accumulate inside the CoE rather than flowing through the organization. Capability becomes something you request, not something you build.

This is why many organizations are exploring federated and embedded operating models [3][4], though the transition is often incomplete and uneven. The goal is not just to distribute execution—it is to distribute capability.

This is where platform engineering provides a better mental model. Instead of acting as a delivery function, the central team builds golden paths: paved, opinionated ways of working that make the right thing the easy thing. Tooling, templates, guardrails, and reusable components are exposed directly to teams, enabling them to move independently while staying within defined boundaries.

The difference is fundamental. A CoE pulls work toward itself. A platform pushes capability outward. One creates queues. The other creates flow.

If AI is treated as a centralized service, it will scale linearly at best. If it is treated as a platform, it can scale with the organization.

AI creates uneven gains, not uniform uplift

Research consistently shows average productivity gains in the range of 10–20%, combined with substantial variation across users and tasks [5][10][12]. The variation is the important part.

In some contexts, less experienced workers benefit significantly because AI transfers best practices and reduces barriers to entry. In others, highly skilled workers gain more when operating within the effective frontier of the technology. Outcomes depend on skill, task, and how well AI is integrated into the workflow.

The result is not a level playing field, but a changing gradient. People and teams that adapt effectively accelerate. Those who do not fall behind, even when they have access to the same tools.

Governance is becoming the bottleneck

Organizations respond to AI risk by increasing control: approvals, restrictions, and policy layers. While necessary, this often introduces systemic friction.

Industry and institutional research consistently identify organizational barriers—not technical limitations—as the primary constraint on AI value creation [1][3]. The issue is less about building capability and more about enabling its use.

A more effective approach is proportional governance. Low-risk, individual use cases require minimal control. Team-level workflows benefit from lightweight oversight. High-impact, enterprise-critical systems require full governance. This aligns with risk-based approaches such as those from the OECD [8].

Without this proportionality, governance becomes a bottleneck rather than a safeguard.

How the divide compounds

The gap between A and B teams develops through small, compounding differences in access, learning environments, and culture.

Some teams have direct access to tools and are encouraged to experiment. Others operate through restricted interfaces and formal processes. Some learn through iteration; others wait for approval.

Over time, these differences accumulate. One part of the organization develops new capabilities and ways of working, while another continues with established practices. Eventually, they are no longer operating at the same level.

Distribution requires giving up some control

Avoiding this outcome requires accepting a degree of decentralization. Teams need the ability to experiment locally, and organizations need to tolerate variation in tools and approaches.

This introduces a temporary phase where things feel less controlled and less consistent. That phase is where learning happens. Eliminating it too early suppresses adoption and reinforces the divide.

AI as infrastructure

If AI remains confined to specialists, organizations create internal inequality and limit their ability to adapt. If it becomes embedded in everyday work—more like electricity than expertise—it enables continuous, distributed improvement.

The objective is not to build a stronger AI team, but to remove the distinction altogether. Because the organizations that benefit most from AI will not be those with the most advanced models, but those where its use is widespread, routine, and integrated into how work gets done.

References

[1] McKinsey & Company – The State of AI
https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai

[2] Boston Consulting Group – Artificial Intelligence Capabilities
https://www.bcg.com/capabilities/artificial-intelligence

[3] Deloitte – State of AI in the Enterprise
https://www2.deloitte.com/us/en/insights/focus/cognitive-technologies/state-of-ai-and-intelligent-automation-in-business-survey.html

[4] Gartner – How to Scale AI in the Enterprise
https://www.gartner.com/en/articles/how-to-scale-ai-in-the-enterprise

[5] National Bureau of Economic Research – Generative AI at Work
https://www.nber.org/papers/w31161

[8] OECD – AI Principles
https://oecd.ai/en/ai-principles

[9] Paul A. David – The Dynamo and the Computer: An Historical Perspective on the Modern Productivity Paradox
https://doi.org/10.3386/w5099

[10] Quarterly Journal of Economics – Generative AI at Work
https://academic.oup.com/qje/article/140/2/889/7990658

[11] MIT Sloan – How Generative AI Can Boost Highly Skilled Workers’ Productivity
https://mitsloan.mit.edu/ideas-made-to-matter/how-generative-ai-can-boost-highly-skilled-workers-productivity

[12] MIT Economics – Experimental Evidence on Generative AI
https://economics.mit.edu/sites/default/files/inline-files/Noy_Zhang_1.pdf

Move the Security Boundary to the Software Supply Chain

Introduction

Something interesting is happening in software engineering right now. For a long time, infrastructure was the constraint. In the early days of enterprise IT, creating an environment meant ordering hardware, waiting for deliveries, configuring networks, and physically installing machines in data centers. It was slow, expensive, and operationally heavy.

Cloud computing changed that. With infrastructure-as-code and software-defined infrastructure, environments could suddenly be created in minutes. For the first time, infrastructure could move faster than the software being built on top of it. Developers could spin up databases, networks, and compute resources almost instantly, and many of the traditional operational bottlenecks disappeared.

But something has shifted again.

With the arrival of AI coding agents and increasingly powerful developer tooling, software can now be produced faster than compliant infrastructure can be created inside many enterprises. A developer with modern tools can explore ideas and produce working solutions at a remarkable pace. Meanwhile, creating environments that satisfy internal compliance requirements, governance processes, and security reviews can take days or weeks.

Infrastructure has become slow again—not because of technology, but because of process.

The result is a growing mismatch between the speed at which developers can innovate and the speed at which corporate governance allows experimentation to happen.

The Traditional Model

Most corporate security models are built on a simple assumption: control must begin at the developer machine. Developers work inside tightly managed environments, with locked-down laptops, restricted networks, controlled development environments, and tightly governed access to infrastructure.

The intention is understandable. These controls are meant to reduce risk and protect corporate systems and data.

In practice, however, they often produce the opposite outcome. Developers end up spending significant time navigating internal restrictions rather than experimenting with new ideas. The environment becomes optimized for compliance rather than exploration.

The problem is not security itself. The problem is where security is applied.

A Different Boundary

There is another way to think about this.

Instead of trying to tightly control the environments in which developers work, we can move the security boundary. Developers can operate in open, flexible environments outside the corporate firewall—using their own machines, cloud sandboxes, or experimental infrastructure where they can explore ideas quickly.

In this model the corporate firewall does not attempt to contain developer experimentation. Instead, it protects production systems and enterprise infrastructure.

The boundary between these two worlds becomes the one artifact that truly matters: the code.

Code as the Gateway

If code becomes the mechanism through which innovation enters the enterprise, then the logical place to apply security controls is at that gateway.

Platforms such as GitHub already provide the building blocks for this approach. Modern development platforms make it possible to apply automated verification whenever code enters a repository. Static analysis, secret scanning, dependency checks, policy enforcement through workflows, automated testing, and mandatory peer review can all be applied before code moves further downstream.
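
As an illustrative sketch, a gate like the one below could run in a pipeline before code is promoted. The report fields are assumptions about what upstream scanners and review tooling would emit, not the output of any specific product.

```python
# Sketch of a supply-chain gate evaluated before promotion.
# The report format is hypothetical.

def gate(report: dict) -> bool:
    """Code moves downstream only if every control holds."""
    checks = [
        report["secrets_found"] == 0,    # secret scanning
        report["critical_vulns"] == 0,   # dependency / static analysis
        report["reviews_approved"] >= 2, # mandatory peer review
        report["provenance_attested"],   # e.g. an SLSA-style attestation
    ]
    return all(checks)

example = {
    "secrets_found": 0,
    "critical_vulns": 0,
    "reviews_approved": 2,
    "provenance_attested": True,
}
print("promote" if gate(example) else "block")
```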

Security moves away from controlling developer workstations and toward controlling the software supply chain.

This shift aligns closely with several modern security frameworks. The recommendations from the Open Source Security Foundation and the model defined in SLSA both focus on protecting the integrity of builds, artifacts, and deployment pipelines rather than attempting to control the environments where developers write code. The same philosophy is reflected in the NIST Secure Software Development Framework.

In these models, the build pipeline itself becomes the security boundary.

Platform Engineering Inside the Enterprise

Once code passes these verification gates, it can move into the enterprise environment where platform engineering and DevOps teams take over. At this stage the organization can apply its full set of governance controls. Infrastructure patterns can be standardized, network policies enforced, runtime security monitoring enabled, and additional compliance checks applied.

Governance does not disappear in this model. It simply moves to a more effective location in the process.

Instead of governing experimentation, the organization governs what ultimately runs in production.

Why This Matters Now

The pace of technological change has accelerated dramatically. AI-assisted development means that developers can prototype ideas, test technologies, and explore new architectures faster than ever before.

If corporate processes require weeks to create compliant environments for experimentation, developers simply cannot move at the speed modern tools allow. When that happens, organizations risk something more serious than slow development. They risk becoming unable to explore new technologies at all.

Innovation requires the ability to try things quickly, discard ideas that do not work, and double down on the ones that do. When experimentation becomes difficult, innovation quietly disappears.

Trust the Process, Not the Laptop

Traditional enterprise security assumes that control must begin with the developer workstation. Modern software supply chain thinking suggests a different perspective.

What matters most is not where code is written. What matters is how code is verified before it reaches production systems.

The open source ecosystem has operated this way for decades. Thousands of developers contribute code from anywhere in the world, yet the most critical infrastructure software on the planet is built using this model. The security controls focus on review, testing, and artifact verification rather than on controlling contributor laptops.

Enterprises can adopt the same principle.

A Practical Balance

Allowing developers to experiment outside the firewall while enforcing strong controls on the code entering the enterprise creates a more balanced system. Developers retain the freedom required to explore ideas and work with modern tooling, while organizations maintain governance, compliance, and security verification where it matters most.

In an age where AI is accelerating the speed of software creation, the most effective place to apply control is no longer the developer machine.

It is the software supply chain.

References

  • OpenSSF – Software Supply Chain Security: https://openssf.org
  • SLSA – Supply-chain Levels for Software Artifacts: https://slsa.dev
  • NIST Secure Software Development Framework (SSDF): https://csrc.nist.gov/Projects/ssdf
  • GitHub Advanced Security: https://github.com/security/advanced-security

From Tools to Orchestrators: A General Architecture for AI-Native Scientific Research

Abstract

Scientific computing has reached an inflection point. High-performance computing, cloud-native data platforms, and foundation models have dramatically accelerated individual steps in research workflows. Yet most scientific environments remain structurally fragmented: data is generated in one system, workflows execute in another, analytical summaries live elsewhere, and interpretation remains largely manual.

This post argues for a general architecture for AI-native scientific research in which artificial intelligence functions not as a standalone analytical tool, but as an orchestration layer across computation, metadata, and analytics systems. Rather than replacing existing infrastructure, this approach integrates it through structured interfaces and provenance-aware data layers. Although the architecture is illustrated through a genomics example, the principles generalize to any domain in which in-silico methods accelerate discovery.

The Real Bottleneck: Fragmentation

Across disciplines such as genomics, metabolomics, sensory science, spectroscopy, materials research, and fermentation science, a common pattern appears. Experimental data is generated within specialized platforms. Computational workflows are executed in separate environments. Results are stored as files in object storage or local servers. Cross-experiment comparison is often manual, and metadata capture is inconsistent. Reproducibility depends more on institutional memory than on system design.

In most research environments today, computational power is not the limiting factor. The constraint lies in orchestration, integration, and structured interpretation. Scientific acceleration increasingly depends on how effectively systems connect, not on how fast individual tools operate.

A Layered Architecture for AI-Native Research

The proposed architecture separates responsibilities into four conceptual layers, each with a clearly defined role.

The first layer is the execution layer, which remains the authoritative source of computational truth. This layer is responsible for heavy computation, workflow execution, and the generation of primary artifacts. Depending on the domain, it may consist of cloud-based genomic pipelines, HPC clusters, digital twin simulations of fermentation processes, robotics-controlled experimentation, or large-scale analytical workflows. The central principle is that this layer computes deterministically and preserves reproducibility. It is not replaced by AI; it is coordinated by it.

The second layer is the structured interpretation layer. Raw artifacts such as alignment files, chromatograms, spectral matrices, or process simulations are rarely suitable for reasoning across experiments. This layer extracts structured summaries, registers parameters and reference versions, and links findings to explicit provenance. In doing so, it transforms scientific reasoning from file-centric to finding-centric. The layer must remain lightweight, rebuildable, and explicit about version identity. Without it, any AI system attempting cross-run reasoning would be forced to reconstruct context from heterogeneous raw files, a fragile and non-scalable approach.

The third layer is the analytical layer. Here, structured outputs are aggregated, modeled, visualized, and integrated across domains. Statistical workflows, machine learning pipelines, and reporting systems operate at this level. It supports exploration and synthesis but does not execute primary experimental computation. It complements the execution layer rather than replacing it.

The fourth and most transformative layer is the conversational orchestration layer. A large language model, connected through structured tool interfaces, interprets researcher intent and coordinates actions across the other layers. It translates natural language questions into structured queries, triggers workflows when appropriate, integrates results across systems, and documents reasoning paths. Importantly, it does not modify raw data or override execution engines. It orchestrates rather than computes.

When these layers are properly separated, AI evolves from a chatbot into a scientific coordinator.
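
A minimal sketch of that separation follows. All names are hypothetical; the point is the shape: the orchestrator coordinates structured tools, while execution stays deterministic and findings carry explicit provenance.

```python
# Sketch of the layer separation. All names are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    """Interpretation layer: finding-centric, provenance-first."""
    run_id: str
    reference_version: str
    summary: str

def run_workflow(params: dict) -> str:
    # Execution layer stub: deterministic compute, returns a run id.
    return "run-001"

def summarize_run(run_id: str) -> Finding:
    # Interpretation layer stub: structured summary with provenance.
    return Finding(run_id, "GRCh38.p14", "3 novel variants vs. reference")

def answer(question: str) -> Finding:
    # Orchestration layer: translates intent into tool calls. A real
    # system would let an LLM choose the calls; it coordinates, and the
    # layers below compute.
    run_id = run_workflow({"question": question})
    return summarize_run(run_id)

print(answer("Which variants changed between strains?"))
```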

From Queries to Long-Running Co-Scientist Workflows

The next frontier is not single-prompt interaction but long-running, goal-directed research processes. An AI-native orchestrator can maintain contextual awareness across sessions, track hypotheses over time, coordinate multi-step analyses, and integrate intermediate results into evolving reasoning chains.

When domain-specific reasoning patterns are formalized into versioned and reusable “skills,” scientific workflows become auditable and collaborative. Instead of isolated prompts, research evolves into structured AI-mediated projects in which multiple scientists interact with shared computational guardrails. The system preserves reproducibility while accelerating iteration.

In this model, AI becomes a persistent scientific co-orchestrator rather than a transient assistant.

A Genomics Reference Implementation

One instantiation of this architecture can be observed in a genomics context. In that environment, a cloud-based execution engine processes sequencing data and generates alignment and variant artifacts. A lightweight, provenance-first interpretation layer structures variant findings across runs, capturing reference identities and parameter differences. An analytical platform aggregates results for cross-project exploration. A conversational AI interface connects them through structured tool interfaces.

Within such a system, scientists can compare variants across strains, identify changes in reference genomes between runs, detect parameter differences, trigger new workflows, and iteratively refine hypotheses without reopening raw alignment files or reconstructing workflow logs manually. Raw data remains immutable. Provenance remains explicit. Every step is traceable.

Although domain-specific in its implementation, the architectural principles are domain-agnostic.

Generalization Across Scientific Domains

The same structure applies well beyond genomics. Laboratories working with LC-MS and GC-MS data face persistent challenges in analytical reproducibility and cross-instrument transfer. Sensory science groups contend with variability and latent structure in panel data. Spectroscopy platforms require ongoing calibration maintenance across instruments and environments.

In fermentation and ingredient characterization, digital twins and predictive process models increasingly complement physical experimentation, yet their outputs often remain isolated from historical runs and analytical metadata. The opportunity is not merely to build better models, but to connect those models into a structured reasoning fabric that spans experiments, instruments, and time.

In each of these domains, in-silico iteration accelerates discovery. The architectural shift lies not in introducing new models, but in enabling structured orchestration across existing systems.

Design Principles

Several principles emerge as foundational:

  • Execution engines remain authoritative and deterministic.
  • Interpretation layers must be fully rebuildable from primary artifacts.
  • Provenance must be first-class rather than implicit.
  • AI orchestrates systems but does not own data.
  • Reproducibility must be enforced architecturally rather than culturally.

When these principles are respected, AI-native research becomes both scalable and governable.

Conclusion

The future of scientific computing is unlikely to be another monolithic platform. Instead, it will be a layered architecture in which computation remains deterministic, metadata is structured, analytics are scalable, and AI coordinates interactions across systems.

The real competitive advantage will not belong to those who adopt the largest models, but to those who design systems where models can reason safely and coherently across structured scientific context.

Scientific acceleration, in this view, is no longer primarily a question of faster models. It is a question of who learns to build research environments that think.

Understanding Agentic Architectures and Why They Differ Fundamentally from Event-Driven Design

Introduction

In recent months, an increasing number of vendors and practitioners have begun describing event-driven architectures (EDA) as the foundation for agentic systems.

While both paradigms involve distributed and asynchronous systems, they address entirely different architectural concerns:

  • Event-driven design enables reliable data movement and temporal decoupling across systems.
  • Agentic design enables autonomous reasoning, coordination, and adaptive decision-making.

This post clarifies these differences through the lens of established architectural literature and pattern theory, helping distinguish between data-flow infrastructure and cognitive control-flow systems.

Pattern Lineage and Conceptual Heritage

Software architecture has evolved through distinct, well-documented pattern families—each solving a different class of problems:

| Domain | Canonical Source | Architectural Concern |
| --- | --- | --- |
| Software Design Patterns | Gamma et al., Design Patterns (1994) | Structuring software components and behaviour |
| Enterprise Integration Patterns | Hohpe & Woolf, Enterprise Integration Patterns (2003) | Asynchronous communication and integration |
| Pattern-Oriented Software Architecture | Buschmann et al., Pattern-Oriented Software Architecture (1996–2007) | Component interaction, brokers, and coordination |
| Agent-Oriented Systems | Wooldridge, An Introduction to MultiAgent Systems (2009) | Reasoning, autonomy, and collaboration |

Each of these domains emerged to address a specific layer of system complexity.
Event-driven architectures belong to the integration layer.

Agentic architectures operate at the reasoning and control layer.

Event-Driven Architecture: Integration and Temporal Decoupling

Event-driven architecture decouples producers and consumers through asynchronous communication. Common patterns include Publish–Subscribe, Event Bus, and Event Sourcing.

Its core strengths are:

  • High scalability and throughput
  • Loose coupling and resilience
  • Near-real-time responsiveness

EDA is therefore ideal for information propagation, but it does not address why or how a system acts. It transports information; it does not interpret or decide.

Agentic Architecture: Reasoning and Adaptation

Agentic systems focus on autonomous goal-directed behaviour. They implement a cognitive control loop:

Observe → Plan → Act → Learn

This structure—present since early multi-agent research—now underpins modern frameworks such as LangGraph, AutoGen, and Microsoft’s Autonomous Agents Framework.

Core principles include:

  • Control Flow: deciding what to do next based on context
  • Memory and Context: maintaining state across reasoning cycles
  • Tool Use: interacting with APIs or systems to execute plans
  • Collaboration: coordinating with other agents to achieve shared goals

Agentic architectures are thus control-graph frameworks, not messaging infrastructures.
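
A deliberately toy-sized sketch makes the distinction concrete: the loop below is control flow, deciding the next action from state and goal; any event infrastructure would merely deliver observations into it.

```python
# Minimal Observe -> Plan -> Act -> Learn control loop.
# A control-flow skeleton, not a messaging topology.

def observe(env: dict) -> dict:
    return {"temperature": env["temperature"]}

def plan(state: dict, goal: float) -> str:
    return "cool" if state["temperature"] > goal else "idle"

def act(env: dict, action: str) -> None:
    if action == "cool":
        env["temperature"] -= 1

def agent_loop(env: dict, goal: float = 21.0, max_steps: int = 10) -> None:
    memory = []                         # state persists across cycles
    for _ in range(max_steps):
        state = observe(env)
        action = plan(state, goal)
        act(env, action)
        memory.append((state, action))  # "learn": retain outcomes
        if action == "idle":
            break

env = {"temperature": 25.0}
agent_loop(env)
print(env)  # {'temperature': 21.0}
```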

Why “Event-Driven Agentic Architecture” Is a Conceptual Misstep

Confusing event-driven integration with agentic reasoning conflates communication with cognition.

| Common Assertion | Correct Interpretation |
| --- | --- |
| “Agents communicate through Kafka topics.” | That describes data transport, not reasoning or collaboration. |
| “Event streaming enables autonomy.” | Autonomy arises from goal-based planning and local state, not from asynchronous I/O. |
| “Event mesh = Agent mesh.” | An event mesh routes bytes; an agent mesh coordinates intent. |
| “Streaming platforms enable multi-agent collaboration.” | They enable message exchange; collaboration requires shared semantic context and decision logic. |

EDA can support agentic systems—for example, as a trigger or observation channel—but it does not constitute their architectural foundation.

Maintaining Conceptual Precision

Architectural vocabulary should map to the corresponding canonical lineage:

| Concern | Canonical Reference |
| --- | --- |
| Integration, routing, replay | Hohpe & Woolf, Enterprise Integration Patterns |
| Reasoning, autonomy, coordination | Wooldridge, An Introduction to MultiAgent Systems |
| System decomposition, blackboard, broker styles | Buschmann et al., Pattern-Oriented Software Architecture |
| Modern control-flow frameworks for AI agents | LangGraph, Microsoft Autonomous Agents Framework (2024–2025) |

Anchoring terminology to established pattern families preserves conceptual integrity and prevents marketing-driven drift.

Practical Implications

  1. Use event-driven design for system integration, data propagation, and observability.
  2. Use agentic design for autonomy, reasoning, and goal-oriented workflows.
  3. Keep a strict separation between data flow (how information moves) and control flow (how decisions are made).
  4. Evaluate vendor claims by tracing them back to canonical architectural literature.
  5. Foster literacy in software and integration pattern theory to maintain shared architectural clarity across teams.

Recommended Reading

  • Wooldridge, M. (2009). An Introduction to MultiAgent Systems (2nd ed.). Wiley.
  • Hohpe, G., & Woolf, B. (2003). Enterprise Integration Patterns. Addison-Wesley.
  • Buschmann, F. et al. (1996–2007). Pattern-Oriented Software Architecture Vols 1–5. Wiley.
  • Gamma, E. et al. (1994). Design Patterns. Addison-Wesley.
  • LangChain / LangGraph Documentation (2024–2025). “Agentic Design Patterns.”
  • Microsoft Autonomous Agents Framework (Preview 2025).

Conclusion

Architectural precision is not academic—it determines how systems scale, adapt, and remain intelligible.

Event-driven architectures will continue to serve as the backbone of data movement.

Agentic architectures will increasingly govern how intelligent systems reason, plan, and act.

Understanding where one ends and the other begins is essential for designing systems that are both well-connected and truly intelligent.
