Multi-agent systems: what actually works in production

Not the research version. The version that has to not fail inside a real product.

I've been running multi-agent systems in internal tooling for several months. Not research experiments. Things that run daily and break workflows when they fail. Here's what I've found.

What the research version looks like vs. what production requires

The research framing: autonomous agents that reason, plan, use tools, and produce complex outputs with minimal human input. The demos are compelling. An agent that searches the web, writes code, runs it, interprets the results, and iterates genuinely works in controlled demonstrations.

The production requirement is different: a system that produces consistent output, fails gracefully, and doesn't require babysitting. The autonomous research agent fails on the second and third conditions. It hallucinates tool calls. It loops. It produces confident outputs that are quietly wrong. In a demo, a human is watching and can intervene. In production, nobody is watching.

What actually works

Narrow-scope agents with verification steps. We have an agent that reads our incident retrospective notes, extracts action items, and drafts follow-up tickets. The scope is narrow: a specific document format in, a specific output format out. The verification step is a human review of the drafted tickets before they're created. The combination of narrow scope and human-in-the-loop verification is the pattern that works.
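The shape of this pattern is simple enough to sketch. The function and field names below are illustrative, and `draft_tickets` stands in for the model call; the point is the structure: a deterministic parse, one narrow model step, and a review gate that nothing bypasses.

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    title: str
    body: str

def extract_action_items(retro_notes: str) -> list[str]:
    # Deterministic pre-parse: in this sketch, action items are
    # lines in the retro doc that start with "AI:".
    return [line[3:].strip() for line in retro_notes.splitlines()
            if line.startswith("AI:")]

def draft_tickets(items: list[str]) -> list[Ticket]:
    # Placeholder for the model call that turns action items
    # into ticket drafts. Narrow input, narrow output.
    return [Ticket(title=item, body=f"Follow-up from retro: {item}")
            for item in items]

def create_tickets(drafts: list[Ticket], review) -> list[Ticket]:
    # Human-in-the-loop gate: only drafts the reviewer approves
    # are actually created. The agent never creates anything itself.
    return [t for t in drafts if review(t)]

notes = "We missed the alert.\nAI: add paging for queue depth\nAI: document rollback steps"
drafts = draft_tickets(extract_action_items(notes))
created = create_tickets(drafts, review=lambda t: "paging" in t.title)
```

The review callback is where the human sits; in the real system it's a UI, not a lambda, but the invariant is the same: drafts are cheap, creation requires approval.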

Agents as first-pass filters, not decision makers. We use an agent to triage incoming integration requests from new clients: categorize the request type, extract the technical requirements, route to the relevant team member. The agent doesn't decide anything. It organizes information. The decision is still human. Triage agents work reliably because the cost of an incorrect triage is low and the human who receives it applies the final judgment.

Pipeline orchestration with deterministic steps. Calling a model to parse input, passing the parsed output to a deterministic processing step, calling another model to generate a summary: chaining language model calls with deterministic steps between them is more reliable than pure agent autonomy. The deterministic steps are checkpoints. When something goes wrong, you know where.
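A minimal sketch of that chaining, with placeholder names: `call_parser` and `call_summarizer` stand in for real model calls, and the validation step between them is plain deterministic code. When the pipeline fails, it fails at the checkpoint, with a named error, instead of silently poisoning the next step.

```python
import json

def call_parser(raw: str) -> str:
    # Placeholder model call: parses raw input into JSON text.
    # Hard-coded here so the sketch runs standalone.
    return json.dumps({"customer": "acme", "issue": "timeouts", "severity": "high"})

def validate(parsed_json: str) -> dict:
    # Deterministic checkpoint: a schema check, not another model call.
    # A malformed parse stops here, at a known location.
    record = json.loads(parsed_json)
    missing = {"customer", "issue", "severity"} - record.keys()
    if missing:
        raise ValueError(f"parser output missing fields: {sorted(missing)}")
    return record

def call_summarizer(record: dict) -> str:
    # Placeholder model call for the summary step; it only ever
    # sees input that passed the checkpoint.
    return f"{record['customer']}: {record['issue']} ({record['severity']})"

summary = call_summarizer(validate(call_parser("raw support email ...")))
```

The checkpoints don't make the model calls more accurate; they bound where a bad output can travel.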

What doesn't work

Open-ended planning with real consequences. Agents that need to devise a plan and execute it across multiple tool calls without checkpoints produce failure modes that are hard to predict and hard to debug. The failure is usually a combination of a plausible-but-wrong plan and confident execution of that plan. The result is work done in the wrong direction with no easy reversal.

Tool use at scale. A single agent making five tool calls is manageable. An agent making fifty tool calls in the course of a task accumulates errors in ways that compound. Each tool call can fail silently. Each silent failure changes the context for subsequent calls. At some chain length, output reliability becomes unacceptably low.

Replacing human judgment in ambiguous situations. When the task is genuinely ambiguous, the agent produces an answer with the same confidence it has on unambiguous tasks. The confidence is not calibrated to the ambiguity. A human knows when they don't know. Current agents don't.

The actual deployment pattern

We build agents for tasks where the scope is defined, the output is verifiable, and the cost of failure is bounded. For anything outside that envelope, we use agents to prepare information for human decision-making rather than to make decisions themselves.

This is a narrower use case than the research literature suggests. It's also the one that runs reliably in production without requiring a human to monitor it constantly.

With gusto, Fatih.