
Small Models, Real Acceleration: Notes from the Field

Over the past year, I’ve spent more time than I’d like to admit trying to make AI models actually work in scientific environments. Not as demos or slides, but in the kind of messy, disconnected setups that real research runs on. And after enough of those experiments, one thing keeps repeating itself: the smaller models often get the job done better.

Trying to fine-tune or adapt a multi-hundred-billion-parameter model sounds impressive until you’ve actually tried it. The cost, the infrastructure, the data wrangling: it’s a full-time job for a team of specialists. Most research teams don’t have that. But give them a 3B or 7B model that runs locally, and suddenly they’re in control. These smaller models are fast, predictable, and easy to bend to specific problems. You can fine-tune them, distill from a larger one, or just shape them around the way your own data behaves.
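To make that concrete, here is a minimal sketch of what local adaptation can look like, assuming the Hugging Face transformers, peft, and datasets libraries; the model name and the lab_notes.jsonl file are placeholders for whatever you actually run and own:

```python
# Minimal LoRA fine-tune of a small local model (a sketch, not a recipe).
# Assumes: pip install transformers peft datasets
# The model name and data file below are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Llama-3.2-3B"  # any ~3B model you can host locally
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# LoRA trains a few million adapter weights instead of all ~3B parameters,
# which is what makes this feasible on a single workstation GPU.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

data = load_dataset("json", data_files="lab_notes.jsonl")["train"]  # your own data
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapter-out",
                           per_device_train_batch_size=2, num_train_epochs=1),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The specific libraries matter less than the shape of the loop: everything above fits on a single machine you control.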

That’s the difference between something theoretical and something usable. Scientists can now build domain-specific models on their own machines, without waiting for external infrastructure or a cloud budget to clear. You don’t need a new foundation model—you just need one that understands your work.

Working Close to the Data

Running models locally changes how you think about performance. When your data can’t leave the lab, a local model doesn’t just make sense—it’s the only option. And you start realizing that “good enough” isn’t vague at all. It’s measurable. In genomic analysis, it means sequence alignment accuracy within half a percent of a cloud model. In sensory analysis, it means predicted clusters that match what human panels taste nine times out of ten. That’s good enough to move forward.
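That kind of “good enough” check is deliberately boring to compute. A minimal sketch, with hypothetical file and column names, comparing a local model’s cluster assignments against a reference (the human panel, or a cloud model’s output):

```python
# Compare a local model's labels against a reference, row by row.
# File and column names are hypothetical; the metric is plain agreement rate.
import csv

def agreement(path_a: str, path_b: str, key: str, label: str) -> float:
    def load(path):
        with open(path, newline="") as f:
            return {row[key]: row[label] for row in csv.DictReader(f)}
    a, b = load(path_a), load(path_b)
    shared = a.keys() & b.keys()
    return sum(a[k] == b[k] for k in shared) / len(shared)

# "Good enough" as a number: the local model matches the panel 9 times out of 10.
score = agreement("local_model_clusters.csv", "panel_clusters.csv", "sample_id", "cluster")
print(f"agreement: {score:.1%}")
```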

I’ve seen small models running on local hardware produce the same analytical outputs as flagship cloud models—only faster and without the overhead. That’s when you stop talking about scale and start talking about speed.

Collaboration is the Multiplier

The real unlock isn’t just the model size—it’s the mix of people using it. Scientists who can code, or who have access to someone who can, change the pace completely. Pair one scientist with one software engineer and you often get a tenfold increase in research velocity. That combination of curiosity and technical fluency is where acceleration really happens.

And the hardware helps. With a workstation-class machine like the NVIDIA DGX Spark, you can fine-tune a model on your own data, automate repetitive analysis, and test ideas before running a single physical experiment. It’s not about replacing scientists; it’s about removing the waiting time between ideas.

Where It’s Heading

This is the new normal for scientific computing:

  • Small, specialized models embedded directly into the research environment.
  • Agentic systems coordinating tools, data, and models in real time.
  • Scientists and engineers working side by side to shape AI tools that mirror experimental logic.

At some point, AI stops observing science and starts participating in it. And that’s where things start to get interesting.

Software is Eating Science – and That’s a Good Thing

Introduction

I love internet memos like Jeff Bezos’s API mandate and Marc Andreessen’s 2011 prediction that “software is eating the world.” Over a decade later, software has devoured more than commerce and media; it’s now eating science, and quite frankly, it’s about time.

Scientific research, especially in domains like biology, chemistry, and medicine, has historically been a software backwater. Experiments were designed in paper notebooks, data was handled in Excel, and results were shared as PowerPoint screenshots. It’s only recently that leading institutions began embedding software engineering at the core of how science gets done. And the results speak for themselves. The 2024 Nobel Prize in Chemistry, awarded in part for AlphaFold’s advances in protein structure prediction, is a striking example of how software, developed and scaled by engineers, has become as fundamental to scientific breakthroughs as any wet-lab technique.

The Glue That Holds Modern Science Together

Software engineers aren’t just building tools. At institutions like the Broad Institute, Allen Institute, and EMBL-EBI, they’re building scientific platforms. Terra, Code Ocean, Benchling—these aren’t developer toys, they’re scientific instruments. They standardize experimentation, automate reproducibility, and unlock collaboration at scale.

The Broad Institute’s Data Sciences Platform employs over 200 engineers supporting a staff of 3,000. Recursion Pharmaceuticals operates with an almost 1:1 engineer-to-scientist ratio. These are not exceptions—they’re exemplars.

The Real Payoff: Research Acceleration

When you embed software engineers into scientific teams, magic happens:

  • Setup time drops by up to 70%
  • Research iteration speeds triple
  • Institutional knowledge gets preserved, not lost in SharePoint folders
  • AI becomes usable beyond ChatGPT prompts—supporting actual data analysis, modeling, and automation

These are not hypothetical. They’re documented results from public case studies and internal programs at peer institutions.

From Hype to Hypothesis

While many institutions obsess over full lab digitization (think IoT pipettes), the smarter move is to prioritize where digital already exists: in workflows, data, and knowledge. Tools like Microsoft Copilot, OpenAI Enterprise, Evo2 for genomic language modeling, AlphaFold for protein structure prediction, and DeepVariant for variant calling only become truly impactful when they are integrated, orchestrated, and maintained by skilled engineers who understand both the research goals and the computational landscape. With that engineering in place, researchers are unlocking years of buried insights and accelerating modeling at scale.

Scientific software engineers are the missing link. Their work turns ad hoc experiments into reproducible pipelines. Their platforms turn pet projects into institutional capability. And their mindset—rooted in abstraction, testing, and scalability—brings scientific rigor to the scientific process itself.
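A toy sketch of what “reproducible” means in practice (the paths and the analysis function are placeholders you would supply yourself): every output ships with a manifest pinning the exact input, parameters, and seed that produced it.

```python
# A toy reproducible step: every output ships with a manifest recording
# exactly what produced it. The analysis function is whatever you supply.
import hashlib
import json
import random

def sha256(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def run_step(analyze, input_path: str, params: dict, seed: int = 42):
    random.seed(seed)  # deterministic, not "whatever the laptop did that day"
    result = analyze(input_path, **params)
    manifest = {
        "input_sha256": sha256(input_path),  # pin the exact data
        "params": params,                    # pin the exact configuration
        "seed": seed,
    }
    with open("manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return result
```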

What many underestimate is that building software—like conducting experiments—requires skill, discipline, and experience. Until AI is truly capable of writing production-grade code end-to-end (and it’s not—see Speed vs. Precision in AI Development), we need real software engineering best practices. Otherwise, biology labs will unknowingly recreate decades of software evolution from scratch—complete with Y2K-level tech debt, spaghetti code, and glaring security gaps.

What Now?

If you’re in research leadership and haven’t staffed up engineering talent, you’re already behind. A 1:3–1:5 engineer-to-scientist ratio is emerging as the new standard—at least in data-intensive fields like genomics, imaging, and molecular modeling—where golden-path workflows, scalable AI tools, and reproducible science demand deep software expertise.

That said, one size does not fit all. Theoretical physics or field ecology may have very different needs. What’s critical is not the exact ratio, but the recognition that modern science needs engineering—not just tools.

There are challenges. Many scientists weren’t trained to work with software engineers, and collaboration across disciplines takes time and mutual learning. There’s also a cultural risk of over-engineering—replacing rapid experimentation with too much process. But when done right, the gains are exponential.

Science isn’t just done in the lab anymore; it’s also done on GitHub. And the sooner we treat software engineers as core members of scientific teams, not as service providers, the faster we’ll unlock the discoveries that matter.

Let’s stop treating software like overhead. It’s the infrastructure of modern science.

AI for Data (Not Data and AI)

Cold Open

Most companies get it backwards.

They say “Data and AI,” as if AI is dessert—something you get to enjoy only after you’ve finished your vegetables. And by vegetables, they mean years of data modeling, integration work, and master‑data management. AI ends up bolted onto the side of a data office that’s already overwhelmed.

That mindset isn’t just outdated—it’s actively getting in the way.

It’s time to flip the script. It’s not Data and AI. It’s AI for Data.

AI as a Data Appendage: The Legacy View

In most org charts, AI still reports to the head of data. That tells you everything: AI is perceived as a tool to be used on top of clean data. The assumption is that AI becomes useful only after you’ve reached some mythical level of data maturity.

So what happens? You wait. You delay. You burn millions building taxonomies and canonical models that never quite deliver. When AI finally shows up, it generates dashboards or slide‑deck summaries. Waste of potential.

What If AI Is Your Integration Layer?

Here’s the mental flip: AI isn’t just a consumer of data. It’s a synthesizer. A translator. An integrator. An enabler.

Instead of cleaning, mapping, and modeling everything up front, what if you simply exposed your data—as is—and let the AI figure it out?

That’s not fantasy. Today, you can feed an AI messy order tables, half‑finished invoice exports, inconsistent SKU lists—and it still works out the joins. Sales and finance data follow patterns the model has seen a million times.

The magic isn’t that AI understands perfect data. The magic is that it doesn’t need to.
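Here is roughly what that looks like, as a sketch assuming an OpenAI-compatible client; the file names are hypothetical. You sample a few rows from each messy export and ask the model to propose the joins:

```python
# Ask a model to propose the joins across two messy exports instead of
# hand-building mapping tables. Assumes an OpenAI-compatible client and
# hypothetical file names.
import csv
from openai import OpenAI

def sample_rows(path: str, n: int = 5) -> str:
    """Return the header plus the first n rows of a CSV as plain text."""
    with open(path, newline="") as f:
        rows = [",".join(row) for _, row in zip(range(n + 1), csv.reader(f))]
    return f"{path}:\n" + "\n".join(rows)

client = OpenAI()
prompt = (
    "Here are samples from two exports. Propose how to join them: "
    "which columns, what key transformations, and what caveats to watch for.\n\n"
    + sample_rows("orders_dump.csv")
    + "\n\n"
    + sample_rows("invoice_export.csv")
)
reply = client.chat.completions.create(
    model="gpt-4o",  # any capable model; the name is an assumption
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```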

MCP: OData for Agents

Remember OData? It promised introspectable, queryable APIs—you could ask the endpoint what it supported. Now meet MCP (Model Context Protocol). Think OData, but for AI agents.

With MCP, an agent can introspect a tool, learn what actions exist, what inputs it needs, what outputs to expect. No glue code. No brittle integrations. You expose a capability, and the AI takes it from there.

OData made APIs discoverable. MCP makes tools discoverable to AIs.

Expose your data with just enough structure, and let the agent reason. No mapping tables. No MDM. Just AI doing what it’s good at: figuring things out.
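A minimal sketch of what “exposing a capability” can look like, using the official Python MCP SDK; the SQLite database and the read-only policy are my own placeholder choices, not a prescribed design:

```python
# A minimal MCP server exposing a database "as is" to any MCP-capable agent.
# Assumes the official Python SDK (pip install "mcp[cli]"); lab.db is a placeholder.
import sqlite3
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lab-data")

@mcp.tool()
def list_tables() -> list[str]:
    """List the tables available in the lab database."""
    con = sqlite3.connect("lab.db")
    rows = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    return [name for (name,) in rows]

@mcp.tool()
def run_query(sql: str) -> list:
    """Run a read-only SQL query and return the rows."""
    # Read-only connection: let the agent explore freely without mutating anything.
    con = sqlite3.connect("file:lab.db?mode=ro", uri=True)
    return con.execute(sql).fetchall()

if __name__ == "__main__":
    mcp.run()  # serves over stdio; the agent discovers the tools and their schemas
```

Note what’s missing: no mapping tables, no schema documentation beyond the tool signatures. Discovery is the agent’s job.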

Why It Works in Science—And Why It’ll Work in Business

Need proof? Look at biology.

Scientific data is built on shared, Latin‑based taxonomies. Tools like Claude or ChatGPT navigate these datasets without manual schema work. At Carlsberg we’ve shown an AI connecting yeast strains ➜ genes ➜ flavor profiles in minutes.

Business data is easier. You don’t need to teach AI what an invoice is. Or a GL account. These concepts are textbook. Give the AI access and it infers relationships. If it can handle yeast genomics, it can handle your finance tables.

Stop treating AI like glass. It’s ready.

The Dream: MCP‑Compliant OData Servers

Imagine every system—ERP, CRM, LIMS, SharePoint—exposing itself via an AI‑readable surface. No ETLs, no integration middleware, no months of project time.

Combine OData’s self‑describing endpoints with MCP’s agent capabilities. You don’t write connectors. You don’t centralize everything first. The AI layer becomes the system‑of‑systems—a perpetual integrator, analyst, translator.

Integration disappears. Master data becomes a footnote.
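No such product exists off the shelf today, but a toy version of the bridge is already writable. A sketch, again with the Python MCP SDK, pointed at a public OData demo service (the URL and its availability are assumptions, not endorsements):

```python
# Hypothetical bridge: expose an OData service's self-description and query
# surface as MCP tools. The service URL is a public demo endpoint.
import urllib.request
from mcp.server.fastmcp import FastMCP

SERVICE = "https://services.odata.org/V4/Northwind/Northwind.svc"

mcp = FastMCP("odata-bridge")

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as response:
        return response.read().decode()

@mcp.tool()
def service_metadata() -> str:
    """Return the OData $metadata document describing every entity exposed."""
    return fetch(f"{SERVICE}/$metadata")

@mcp.tool()
def query(path: str) -> str:
    """Run an OData query path, e.g. 'Customers?$top=5&$format=json'."""
    return fetch(f"{SERVICE}/{path}")

if __name__ == "__main__":
    mcp.run()  # the agent reads $metadata, then writes its own queries
```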

When Do You Still Need Clean Data?

Let’s address the elephant in the room: there are still scenarios where data quality matters deeply.

Regulatory reporting. Financial reconciliation. Mission-critical operations where a mistake could be costly. In these domains, AI is a complement to—not a replacement for—rigorous data governance.

But here’s the key insight: you can pursue both paths simultaneously. Critical systems maintain their rigor, while the vast majority of your data landscape becomes accessible through AI-powered approaches.

AI for Data: The Flip That Changes Everything

You don’t need perfect data to start using AI. The belief that you do is Data and AI thinking.

AI for Data starts with intelligence and lets structure emerge. Let your AI discover, join, and reason across your real‑world mess—not just your sanitized warehouse.

It’s a shift from enforcing models to exposing capabilities. From building integrations to unleashing agents. From waiting to acting while you learn.

If your organization is still waiting to “get the data right,” here’s your wake‑up call: you’re waiting for something AI no longer needs.

AI is ready. Your data is ready enough.

The only question left: Are you ready to flip the model?

Accelerating Research at Carlsberg Research Laboratory using Scientific Computing

Introduction

Scientific discovery is no longer just about what happens in the lab—it’s about how we enable research through computing, automation, and AI. At Carlsberg Research Laboratory (CRL), our Accelerate Research initiative is designed to remove bottlenecks and drive breakthroughs by embedding cutting-edge technology into every step of the scientific process.

The Five Core Principles of Acceleration

To ensure our researchers can spend more time on discovery, we are focusing on:

  • Digitizing the Laboratory – Moving beyond manual processes to automated, IoT-enabled research environments.
  • Data Platform – Creating scalable, accessible, and AI-ready data infrastructure that eliminates data silos.
  • Reusable Workflows – Standardizing and automating research pipelines to improve efficiency and reproducibility.
  • High-Performance Computing (HPC) – Powering complex simulations and large-scale data analysis. We are also preparing for the future of quantum computing, which promises to transform how we model molecular behavior and simulate complex biochemical systems at unprecedented speed and scale.
  • Artificial Intelligence – Enhancing data analysis, predictions, and research automation beyond just generative AI.

The Expected Impact

By modernizing our approach, we aim to:

  • Reduce research setup time by up to 70%
  • Accelerate experiment iteration by 3x
  • Improve cross-team collaboration efficiency by 5x
  • Unlock deeper insights through AI-driven analysis and automation

We’re not just improving research at CRL; we’re redefining how scientific computing fuels innovation. The future of research is fast, automated, and AI-driven.
