# Anthropic's Multi-Agent Architecture: How the Orchestrator-Worker Pattern Beats Single-Agent by 90.2%

> Source: https://sukruyusufkaya.com/en/blog/anthropic-multi-agent-orchestrator-worker-pattern-2026
> Updated: 2026-05-27T18:12:33.273Z
> Type: blog
> Category: yapay-zeka
**TLDR:** Anthropic's Multi-Agent Research system beat single-agent Claude Opus by 90.2% on internal research evals using an orchestrator-worker pattern. This guide covers lead agent + parallel subagent architecture, structured artifact handoffs, planner-generator-evaluator loops, Claude Agent SDK with .claude/agents/, cost caps, deadlock prevention, comparisons with CrewAI/LangGraph/AutoGen, and a Turkish law-firm contract-analysis case.

<tldr data-summary="[&quot;Anthropic&apos;s Multi-Agent Research system beat single-agent Claude Opus by 90.2% on its internal research eval, one of the largest documented production-grade gains in agentic AI to date.&quot;,&quot;Pattern: a Lead Orchestrator agent (Opus 4.7) plans and dispatches 3-5 parallel Worker subagents (Sonnet 4.6/Haiku 4.5). Each worker runs in its own isolated context; output flows back as a structured artifact.&quot;,&quot;Context isolation is critical: clean per-subagent context reduces lost-in-the-middle errors on long-chain tasks.&quot;,&quot;Token efficiency is a paradox: Anthropic reports multi-agent systems use about 15x the tokens of a chat interaction (single agents about 4x), but on deep research the economic value of the output can far exceed the token cost.&quot;,&quot;Practical implementation: Claude Agent SDK + .claude/agents/ folder + structured handoff schema. Cost caps, error handling, and deadlock detection must be designed on day one.&quot;]" data-one-line="The multi-agent orchestrator-worker pattern is the 2026 production-grade agentic architecture that beats single-agent by 90.2% on deep research and complex tasks."></tldr>

## 1. Why a Single Agent is Not Enough

A single LLM agent with a strong model, tools, and a good system prompt can handle a lot. But on **deep research, multi-document analysis, and multi-stakeholder reporting**, single agents hit three bottlenecks:

1. **Context dilution.** As the task grows, files, observations, and intermediate outputs accumulate in the context window. The *lost in the middle* effect then causes the model to under-use information buried mid-context, including earlier steps.
2. **Sequential bottleneck.** A single agent cannot think in parallel; it analyzes 10 documents one by one. A 1-hour task becomes 10 hours.
3. **Divided attention.** The same agent strategizes and grinds details, doing both at medium quality.

Anthropic solved these with the **multi-agent orchestrator-worker pattern**, documented in *"How we built our multi-agent research system"* (June 2025). Result: a system with Claude Opus 4 as lead and Claude Sonnet 4 subagents scored **90.2% better than single-agent Claude Opus 4** on internal research evals.

<definition-box data-term="Multi-Agent Orchestrator-Worker Pattern" data-definition="An agentic AI architecture where a Lead/Orchestrator agent plans the task and dispatches 3-5 parallel Worker subagents. Each subagent runs in isolated context; results return to the Lead as structured artifacts. Documented by Anthropic Engineering and reference-implemented in the Claude Agent SDK." data-also="Multi-Agent System, MAS, Orchestrator-Worker" data-wikidata="Q1064782"></definition-box>

<stat-callout data-value="90.2%" data-context="Lift of Anthropic's Multi-Agent Research system over single-agent Claude Opus on internal research evals" data-outcome="— one of the largest single-architecture jumps in production-grade agentic AI documented to date." data-source="{&quot;label&quot;:&quot;Anthropic Engineering: Multi-Agent Research System&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;,&quot;date&quot;:&quot;2025-06&quot;}"></stat-callout>

### Multi-Agent Timeline: From ReAct to Anthropic

Four waves between 2022 and 2025:

- **2022 — ReAct.** Single-agent think/act/observe loop. Tool use standardized.
- **2023 — Reflexion, Tree of Thoughts, Generative Agents.** Self-reflection, multi-path reasoning, agent simulation.
- **2024 — AutoGen, CrewAI, LangGraph.** First multi-agent Python frameworks. Experimental.
- **2025 — Anthropic Multi-Agent Research + Claude Agent SDK.** First production-grade, benchmark-backed evidence. Pattern matured.

Anthropic's June 2025 publication mattered because it was a *production report* rather than academic research: real users, real workloads, real cost/quality trade-offs.

### What Multi-Agent is Not

- Multi-step ≠ multi-agent.
- LLM chaining ≠ multi-agent.
- Tool use ≠ subagent.
- Mixture of Experts ≠ multi-agent.

Keeping these distinctions clear makes it easier to identify the cases where multi-agent truly adds value.

## 2. Architecture Anatomy: Lead + Workers

The pattern has five components.

### 2.1. Lead Orchestrator Agent

- **Model:** Best-in-class (Claude Opus 4 in Anthropic's published system; Opus 4.7 in this guide's examples).
- **Role:** Understand task, decompose into subtasks, dispatch subagents in parallel, merge results, present to user.
- **Context:** High-level plan + subagent output summaries. No details.

### 2.2. Worker Subagents

- **Model:** Fast and cheap (Sonnet 4.6 or Haiku 4.5).
- **Role:** Execute the Lead's subtask in clean context.
- **Context isolation:** Each subagent has its own context; cannot see other subagents.

### 2.3. Tools

- Subagents have access to web search, code execution, file read, MCP tools.
- The Lead typically only has a "spawn subagent" tool, not direct tools.

### 2.4. Structured Artifact Handoffs

- Subagent output is not free text but a **JSON-schema artifact**.
- Example: `{ key_finding: ..., sources: [...], confidence: 0.x }`.
- The Lead parses and merges artifacts.
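
The merge step can be sketched as follows. This is a minimal illustration, assuming the artifact shape from the inline example above; the `Artifact` and `MergedReport` types, the `mergeArtifacts` function, and the 0.5 confidence floor are hypothetical names and values, not part of Anthropic's system or the SDK.

```typescript
// Hypothetical artifact shape, modeled on the inline example above.
interface Artifact {
  key_finding: string;
  sources: string[];
  confidence: number; // 0 to 1
}

interface MergedReport {
  findings: string[]; // kept findings, highest confidence first
  sources: string[];  // de-duplicated union of the kept artifacts' sources
  dropped: number;    // artifacts below the confidence floor
}

// Parse raw subagent outputs, drop low-confidence ones, and merge the rest.
// Assumes each string is valid JSON in the Artifact shape (no validation here).
function mergeArtifacts(rawOutputs: string[], minConfidence = 0.5): MergedReport {
  const artifacts = rawOutputs.map((raw) => JSON.parse(raw) as Artifact);
  const kept = artifacts
    .filter((a) => a.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence);
  const allSources = kept.reduce<string[]>((acc, a) => acc.concat(a.sources), []);
  return {
    findings: kept.map((a) => a.key_finding),
    sources: Array.from(new Set(allSources)),
    dropped: artifacts.length - kept.length,
  };
}
```

Because the Lead only ever sees `findings`, `sources`, and a drop count, its context stays small regardless of how much raw material each worker read.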

### 2.5. Evaluator / Critic (optional)

- A third subagent type: audits a worker's output for quality/accuracy.
- The "Evaluator" part of Planner-Generator-Evaluator.

### 2.6. Subagent Lifecycle: Seven Phases

1. **Spawn** (Lead requests a new subagent instance).
2. **Initialize** (system prompt + tool catalog loaded).
3. **Receive Task** (parsed from JSON input).
4. **Execute** (ReAct loop).
5. **Validate Output** (schema validation).
6. **Return Artifact** (structured handoff).
7. **Cleanup** (release memory, files, network).

A solid orchestrator defines a fallback and retry policy per phase.
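
One way to express a per-phase policy is a lookup table consulted by a small runner. The sketch below is illustrative only: the `PhasePolicy` shape, the retry counts, and `runLifecycle` are assumptions, and a production version would also run cleanup in a `finally` block.

```typescript
// The seven phases listed above, as a union type.
type Phase =
  | "spawn" | "initialize" | "receive_task" | "execute"
  | "validate_output" | "return_artifact" | "cleanup";

// Hypothetical policy: how many retries, and what to do when they run out.
interface PhasePolicy {
  maxRetries: number;
  onExhausted: "fail" | "skip";
}

const PHASES: Phase[] = [
  "spawn", "initialize", "receive_task", "execute",
  "validate_output", "return_artifact", "cleanup",
];

// Illustrative defaults, not recommendations from the source.
const defaultPolicy: Record<Phase, PhasePolicy> = {
  spawn: { maxRetries: 2, onExhausted: "fail" },
  initialize: { maxRetries: 2, onExhausted: "fail" },
  receive_task: { maxRetries: 0, onExhausted: "fail" },
  execute: { maxRetries: 2, onExhausted: "fail" },
  validate_output: { maxRetries: 1, onExhausted: "fail" },
  return_artifact: { maxRetries: 1, onExhausted: "fail" },
  cleanup: { maxRetries: 0, onExhausted: "skip" }, // a failed cleanup should not discard a good artifact
};

type PhaseHandlers = Record<Phase, () => Promise<void>>;

// Runs the phases in order and returns the ones that completed.
async function runLifecycle(
  handlers: PhaseHandlers,
  policy: Record<Phase, PhasePolicy> = defaultPolicy
): Promise<Phase[]> {
  const completed: Phase[] = [];
  for (const phase of PHASES) {
    const { maxRetries, onExhausted } = policy[phase];
    let done = false;
    for (let attempt = 0; attempt <= maxRetries && !done; attempt++) {
      try {
        await handlers[phase]();
        done = true;
      } catch {
        // Swallow the error and fall through to the next attempt.
      }
    }
    if (done) {
      completed.push(phase);
    } else if (onExhausted === "fail") {
      throw new Error(`Phase "${phase}" failed after ${maxRetries + 1} attempts`);
    }
  }
  return completed;
}
```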

### 2.7. Artifact Schema Principles

1. Stable top-level fields across all artifacts.
2. Type-safe with Pydantic/Zod.
3. Confidence score required (0-1).
4. Source attribution per finding.
5. Open questions flagged when uncertain.
6. Hash signature for tamper detection.
7. Timestamp for cache invalidation.

Anti-pattern: each subagent uses a different schema, forcing the orchestrator into ad-hoc string parsing.
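
Principles 1, 3, 4, 5, and 7 can be enforced with a shared envelope and a validator. The sketch below uses a hand-written check so it stays dependency-free; in practice Zod or Pydantic would replace it. Field names follow the researcher artifact used later in this guide, `created_at` is an assumed name for the timestamp, and the hash signature (principle 6) is left out.

```typescript
interface Source { url: string; title: string; date: string }

// Hypothetical common envelope shared by every subagent (principle 1).
interface ArtifactEnvelope {
  subtask_id: string;
  key_findings: string[];
  sources: Source[];        // principle 4
  confidence: number;       // principle 3: 0 to 1
  open_questions: string[]; // principle 5
  created_at: string;       // principle 7: ISO-8601 timestamp (assumed field name)
}

function isStringArray(v: unknown): v is string[] {
  return Array.isArray(v) && v.every((x) => typeof x === "string");
}

function isSource(v: unknown): v is Source {
  if (typeof v !== "object" || v === null) return false;
  const s = v as Record<string, unknown>;
  return typeof s.url === "string" && typeof s.title === "string" && typeof s.date === "string";
}

// Returns a list of validation errors; an empty list means the artifact is valid.
function validateArtifact(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["artifact must be an object"];
  const a = value as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof a.subtask_id !== "string" || a.subtask_id === "") {
    errors.push("subtask_id must be a non-empty string");
  }
  if (!isStringArray(a.key_findings)) errors.push("key_findings must be an array of strings");
  if (!Array.isArray(a.sources) || !a.sources.every(isSource)) {
    errors.push("sources must be an array of {url, title, date}");
  }
  if (typeof a.confidence !== "number" || !(a.confidence >= 0 && a.confidence <= 1)) {
    errors.push("confidence must be a number in [0, 1]");
  }
  if (!isStringArray(a.open_questions)) errors.push("open_questions must be an array of strings");
  if (typeof a.created_at !== "string" || Number.isNaN(Date.parse(a.created_at))) {
    errors.push("created_at must be a parseable timestamp");
  }
  return errors;
}
```

Returning every error at once, rather than failing on the first, gives the orchestrator enough detail to send a targeted retry prompt.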

## 3. Planner-Generator-Evaluator Pattern

The multi-agent architecture is enriched by a sub-pattern: **Planner-Generator-Evaluator**.

<comparison-table data-caption="Multi-Agent Roles and Responsibilities" data-headers="[&quot;Role&quot;,&quot;Responsibility&quot;,&quot;Typical Model&quot;,&quot;Context Size&quot;]" data-rows="[{&quot;feature&quot;:&quot;Planner&quot;,&quot;values&quot;:[&quot;Decompose task into subtasks&quot;,&quot;Opus 4.7&quot;,&quot;High (200K-1M)&quot;]},{&quot;feature&quot;:&quot;Generator (Worker)&quot;,&quot;values&quot;:[&quot;Execute subtask&quot;,&quot;Sonnet 4.6 / Haiku 4.5&quot;,&quot;Medium (own context)&quot;]},{&quot;feature&quot;:&quot;Evaluator / Critic&quot;,&quot;values&quot;:[&quot;Audit output&quot;,&quot;Sonnet 4.6&quot;,&quot;Medium&quot;]},{&quot;feature&quot;:&quot;Writer&quot;,&quot;values&quot;:[&quot;Compose final&quot;,&quot;Opus 4.7&quot;,&quot;High&quot;]},{&quot;feature&quot;:&quot;Reviewer&quot;,&quot;values&quot;:[&quot;QA the final&quot;,&quot;Sonnet 4.6&quot;,&quot;Medium&quot;]}]"></comparison-table>

### Flow

1. **Planner** (Lead, Opus 4.7): reads query, produces n subtasks.
2. **n parallel Generators/Workers:** each executes a subtask, returns a structured artifact.
3. **Evaluator/Critic:** scores artifacts; signals a retry for low-quality ones.
4. **Writer** (Lead): merges high-quality artifacts and writes the answer.
5. **Reviewer** (optional): QAs the final answer before delivery.
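
Steps 1 to 4 of the flow can be sketched as a loop over injected role functions, which keeps the control flow testable without model calls. Everything here is an assumption made for illustration: the `Roles` interface, `runPGE`, the 0.7 score threshold, and the single retry. The optional Reviewer step is omitted.

```typescript
interface Draft { subtask: string; text: string }

// The four automated roles, injected as plain async functions.
interface Roles {
  plan: (query: string) => Promise<string[]>;
  generate: (subtask: string, attempt: number) => Promise<Draft>;
  evaluate: (draft: Draft) => Promise<number>; // quality score, 0 to 1
  write: (drafts: Draft[]) => Promise<string>;
}

async function runPGE(
  query: string,
  roles: Roles,
  minScore = 0.7,
  maxRetries = 1
): Promise<string> {
  const subtasks = await roles.plan(query); // 1. Planner
  const scored = await Promise.all(
    subtasks.map(async (subtask) => {
      // 2-3. Generate, evaluate, and retry while the score is too low.
      let best = { draft: await roles.generate(subtask, 0), score: 0 };
      best.score = await roles.evaluate(best.draft);
      for (let attempt = 1; attempt <= maxRetries && best.score < minScore; attempt++) {
        const draft = await roles.generate(subtask, attempt);
        const score = await roles.evaluate(draft);
        if (score > best.score) best = { draft, score };
      }
      return best;
    })
  );
  // 4. The Writer sees only drafts that passed evaluation.
  const accepted = scored.filter((s) => s.score >= minScore).map((s) => s.draft);
  return roles.write(accepted);
}
```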

### Pattern Variations

- **Hierarchical PGE** (Lead → Sub-lead → Worker) for very large tasks.
- **PGE with Self-Reflection** (Generator double-checks itself).
- **Adversarial PGE** (one Evaluator red-teams another) for high-stakes cases.
- **Iterative PGE** (low-confidence triggers loop, 2-3 iterations).

### When PGE is Not the Right Pattern

- Very short tasks solvable in one prompt.
- Deterministic audit-required pipelines.
- Tight token budgets.

## 4. Practical Implementation: Claude Agent SDK

The Claude Agent SDK (2025) is the reference implementation. Folder layout:

~~~text
my-project/
├── .claude/
│   ├── settings.json
│   ├── agents/
│   │   ├── researcher.md
│   │   ├── critic.md
│   │   ├── writer.md
│   │   └── reviewer.md
│   └── mcp.json
└── src/
~~~

### Subagent Definition (.claude/agents/researcher.md)

~~~~markdown
---
name: researcher
description: |
  Use for deep research tasks. Given a question and authorized sources,
  returns a structured artifact with findings, citations, and confidence.
tools: WebSearch, WebFetch, Read
model: claude-sonnet-4-6
---

You are a research subagent. For each subtask:

1. Search authoritative sources.
2. Extract key findings with direct quotes.
3. Cite every claim with source URL + date.
4. Return JSON artifact:

~~~json
{
  "subtask_id": "<id>",
  "key_findings": ["..."],
  "sources": [{"url":"...","title":"...","date":"..."}],
  "confidence": 0.0-1.0,
  "open_questions": ["..."]
}
~~~
~~~~

### Lead Orchestrator (application code)

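~~~typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

// Assumption: role prompts are inlined for brevity. In a real project they
// mirror the bodies of .claude/agents/researcher.md and critic.md.
const RESEARCHER = "You are a research subagent. Return the JSON artifact described in researcher.md.";
const CRITIC = "You are a critic. Score the artifact you are given for quality and accuracy."; // hypothetical

type RunOptions = { model: string; systemPrompt: string };

// query() streams messages; the final "result" message carries the answer text.
async function run(prompt: string, options: RunOptions): Promise<string> {
  for await (const message of query({ prompt, options })) {
    if (message.type !== "result") continue;
    if (message.subtype === "success") return message.result;
    throw new Error("Agent run failed: " + message.subtype);
  }
  throw new Error("Agent run ended without a result");
}

async function multiAgentResearch(question: string): Promise<string> {
  // Planner: assumes the reply is bare JSON shaped like {"subtasks": [...]}.
  const plan = await run("Decompose into 3-5 parallel subtasks: " + question, {
    model: "claude-opus-4-7",
    systemPrompt: "You are a research planner...",
  });
  const subtasks: unknown[] = JSON.parse(plan).subtasks;

  // Workers: one isolated session per subtask, dispatched in parallel.
  const workerResults = await Promise.all(
    subtasks.map((task) =>
      run(JSON.stringify(task), { model: "claude-sonnet-4-6", systemPrompt: RESEARCHER })
    )
  );

  // Critics: one evaluation per worker artifact.
  const evaluations = await Promise.all(
    workerResults.map((r) => run(r, { model: "claude-sonnet-4-6", systemPrompt: CRITIC }))
  );

  // Writer: synthesize subtasks, artifacts, and evaluations into the answer.
  return run(JSON.stringify({ subtasks, results: workerResults, evals: evaluations }), {
    model: "claude-opus-4-7",
    systemPrompt: "You are a research writer. Synthesize the artifacts...",
  });
}
~~~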

### Config Details

- **Cost cap:** Max tokens + time per subagent.
- **Concurrency:** Max parallel subagents (4-6 most common).
- **Retry policy:** Retry up to 2x on failure, then hand the artifact to the critic to flag as low-confidence.
- **Telemetry:** Latency, tokens, model, success per subagent.
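
The concurrency limit is the one setting here that plain `Promise.all` cannot express. A minimal sketch of a bounded parallel map follows; `mapWithConcurrency` is a hypothetical helper, not an SDK function.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (limit < 1) throw new RangeError("limit must be at least 1");
  const results: R[] = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed item until none are left.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      await worker(items[i], i).then((value) => {
        results[i] = value;
      });
    }
  }

  const lanes: Promise<void>[] = [];
  for (let k = 0; k < Math.min(limit, items.length); k++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}
```

With `limit` set to 4-6, a plan of 12 subtasks runs in two or three waves instead of firing 12 requests at once.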

### Markdown Frontmatter for Subagents

Anthropic's `.claude/agents/*.md` format ships config in frontmatter and the prompt in the body. The file lives in git, gets code-reviewed, and can be shared across teams.

### MCP Integration

A multi-agent system pairs naturally with MCP: orchestrator gets spawn/store meta-tools; workers get domain MCP tools (web_search, github, sql, vector_db); critics get read/score tools. `.claude/mcp.json` controls which MCPs are visible to which subagent.

### State Management

State lives in three tiers: per-subagent (transient), orchestrator (combined), persistent (Redis/Postgres/object storage for long tasks).

## 5. Why 90.2%? Performance Analysis

Four factors behind the gap:

### Context Isolation

Each subagent works in clean context — no lost-in-the-middle. Instead of cramming 200 pages into one context, you give 5 subagents 40 pages each.

### Parallelism

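With 5 parallel subagents, the research phase can approach ~1/5 of the sequential wall-clock time; planning and merging add fixed overhead on top. Anthropic reported that parallel subagents and parallel tool calls cut research time by up to 90% on complex queries.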
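
A rough latency model makes the parallelism claim concrete. The sketch below is a back-of-the-envelope estimate under stated assumptions: subtask durations are known up front, scheduling is greedy, and `planMinutes` and `mergeMinutes` are hypothetical fixed overheads for the Lead.

```typescript
// Greedy list scheduling: each subtask goes to whichever lane frees up first.
function parallelMakespan(durations: number[], concurrency: number): number {
  const lanes: number[] = new Array<number>(Math.max(1, concurrency)).fill(0);
  for (const d of durations) {
    const earliest = lanes.indexOf(Math.min(...lanes));
    lanes[earliest] += d;
  }
  return Math.max(...lanes);
}

interface LatencyEstimate {
  sequential: number; // single agent doing every subtask in turn
  parallel: number;   // plan + slowest lane + merge
  ratio: number;      // parallel / sequential
}

function estimateLatency(
  subtaskMinutes: number[],
  concurrency: number,
  planMinutes = 0,
  mergeMinutes = 0
): LatencyEstimate {
  const sequential = subtaskMinutes.reduce((sum, d) => sum + d, 0);
  const parallel = planMinutes + parallelMakespan(subtaskMinutes, concurrency) + mergeMinutes;
  return { sequential, parallel, ratio: parallel / sequential };
}
```

For five 12-minute subtasks at concurrency 5, the model gives 12 minutes against 60 sequentially. Adding 2 minutes of planning and 3 of merging raises the parallel figure to 17 minutes, so the realized gain depends on how large the fixed overhead is relative to the subtasks.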

### Model Optimization

Lead with Opus, workers with Sonnet/Haiku — right model in the right place. Strategic thinking on premium model, grunt work on cheap model.

### Specialization

A "researcher" agent prompted explicitly for its role beats a generalist single agent that does everything at average quality.

### Where the Gap Widens

The 90.2% gap is not universal. It widens on:

- Multi-document synthesis (20+ documents, conflicting findings).
- Multi-stakeholder reporting (legal + financial + ops in one report).
- Deep web research (50+ sources, citation aggregation).
- Parallel hypothesis testing.

It is close to zero on single-file code review, short Q&A, summaries, classic RAG retrieval, and code completion.

### Benchmark Caveat

90.2% is Anthropic's internal eval on their task set. Independent benchmarks (AgentBench, GAIA, SWE-bench) show gaps in the 15-40% range — still meaningful, but architecture is not the only lever; prompt engineering and role design also matter.

<stat-callout data-value="4-15x" data-context="Token use relative to a chat interaction per the Anthropic report: about 4x for a single agent, about 15x for a multi-agent system" data-outcome="— but for deep research tasks the economic value of the output (hours saved, quality lift) can far exceed the token cost." data-source="{&quot;label&quot;:&quot;Anthropic Engineering: Multi-Agent Research System&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;,&quot;date&quot;:&quot;2025-06&quot;}"></stat-callout>

## 6. Comparison with Other Multi-Agent Frameworks

<comparison-table data-caption="2026 Multi-Agent Framework Comparison" data-headers="[&quot;Framework&quot;,&quot;Type&quot;,&quot;Model Agnostic&quot;,&quot;Production-Ready&quot;,&quot;Community&quot;]" data-rows="[{&quot;feature&quot;:&quot;Claude Agent SDK&quot;,&quot;values&quot;:[&quot;Orchestrator-Worker&quot;,&quot;Claude only&quot;,&quot;Yes&quot;,&quot;Medium-High&quot;]},{&quot;feature&quot;:&quot;LangGraph&quot;,&quot;values&quot;:[&quot;Graph-based&quot;,&quot;Multi-provider&quot;,&quot;Yes&quot;,&quot;High&quot;]},{&quot;feature&quot;:&quot;CrewAI&quot;,&quot;values&quot;:[&quot;Role-based&quot;,&quot;Multi-provider&quot;,&quot;Yes&quot;,&quot;High&quot;]},{&quot;feature&quot;:&quot;AutoGen (Microsoft)&quot;,&quot;values&quot;:[&quot;Conversation&quot;,&quot;Multi-provider&quot;,&quot;Yes&quot;,&quot;High&quot;]},{&quot;feature&quot;:&quot;OpenAI Swarm&quot;,&quot;values&quot;:[&quot;Lightweight handoff&quot;,&quot;OpenAI&quot;,&quot;Experimental&quot;,&quot;Medium&quot;]},{&quot;feature&quot;:&quot;Atomic Agents&quot;,&quot;values&quot;:[&quot;Minimal&quot;,&quot;Multi-provider&quot;,&quot;New&quot;,&quot;Low&quot;]}]"></comparison-table>

### Which Framework, Which Use Case?

- **Claude Agent SDK:** Claude-based stack, .claude/ workflow, deep MCP integration.
- **LangGraph:** Complex state machines and loops; agentic graphs.
- **CrewAI:** Fast POC + role-based design; Python ecosystem.
- **AutoGen:** Agent-to-agent conversation + human-in-the-loop.

### Hybrid Stacks

In practice teams combine frameworks: Claude Agent SDK as lead + LangGraph workers; CrewAI for POC then Claude Agent SDK for production; AutoGen for human-in-the-loop + LangGraph for the deterministic part. The trade-off is reduced lock-in vs added complexity.

### "Don't Build Multi-Agents" Counterpoint

In June 2025 Cognition AI (the team behind Devin) published *"Don't Build Multi-Agents"*, arguing that multi-agent stacks compound latency and error surface, and that a single agent with careful context management can match the output. That argument holds for **interactive coding**; Anthropic's reported cases are deep research, a different profile. The honest answer: it depends on the use case.

## 7. Turkish Angle: KVKK, BDDK, and Multi-Stakeholder Work

Three scenarios where multi-agent shines for Turkish companies.

### Compliance Automation

KVKK breach review: planner reads complaint, three workers run in parallel — (1) pull VERBİS record, (2) search precedent KVKK rulings, (3) compare against internal policy. Evaluator scores, writer produces report. 4-8 human-hours → 12-18 minutes.

### Multi-Document Analysis

Law firms, audit firms, M&A consultancy: simultaneous analysis of 50-200 documents. Single-agent insufficient; multi-agent is the natural choice.

### Research + Report Generation

Strategy consulting, sector reports: parallel scanning of multiple sources, merged structured findings, executive summary.

### KVKK Considerations

PII redaction pre-orchestrator; subagents bound to EU/TR-hosted endpoints; artifacts written to audit logs.
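
A pre-orchestrator redaction pass might look like the sketch below. The three patterns (email, Turkish mobile number, 11-digit national ID format) are illustrative assumptions and far from complete; names, addresses, and IBANs need a dedicated PII detection step, and the ID pattern checks format only, not the checksum.

```typescript
// Hypothetical pattern set; order matters (phone before the bare 11-digit ID).
const PII_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "PHONE", pattern: /(?:\+90|0)\s?5\d{2}\s?\d{3}\s?\d{2}\s?\d{2}/g },
  { label: "TCKN", pattern: /\b[1-9]\d{10}\b/g },
];

// Replaces each match with a [LABEL] placeholder and counts replacements,
// so the counts can be written to the audit log without the raw values.
function redactPII(text: string): { text: string; counts: Record<string, number> } {
  const counts: Record<string, number> = {};
  let redacted = text;
  for (const { label, pattern } of PII_PATTERNS) {
    redacted = redacted.replace(pattern, () => {
      counts[label] = (counts[label] || 0) + 1;
      return `[${label}]`;
    });
  }
  return { text: redacted, counts };
}
```

Running this before the planner sees the input means no subagent context ever contains the raw identifiers.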

### Use Case Map for Turkish Sectors

- **Banking/Insurance:** M&A contract DD, credit risk evaluation (parallel KYC/AML/financials), KVKK breach automation, fraud triage.
- **Legal:** Contract DD, case-law search + synthesis, regulation impact analysis.
- **Healthcare:** Multi-specialist case discussion (cardiologist + endocrinologist subagents), clinical literature triage (reviewer mandatory).
- **E-commerce/Marketing:** Competitor catalog analysis, customer segmentation research.
- **Manufacturing/Logistics:** Supply-chain risk, supplier due diligence, ops dashboards.

High-stakes sectors (healthcare, legal) require iterative + adversarial PGE.

### Turkish Subagent Design

For Turkish customers: (1) system prompt in Turkish, (2) Turkish-first sources (mevzuat, içtihat — Said Surucu MCPs), (3) Turkish citation format, (4) embed KVKK + BDDK awareness in system prompts. This yields a *Turkey-first* multi-agent architecture rather than "global + translated."

## 8. Case Study (Anonymized): Turkish Law Firm Contract Analysis

### Problem

An Istanbul-based corporate law firm must analyze every contract (90-300) at a target during M&A due diligence. Typical M&A: 12-18 lawyers × 3-4 weeks × 60 hours/week ≈ 2,200-4,300 hours of manual work.

### Solution

Multi-agent orchestrator-worker pattern:

- **Lead (Opus 4.7):** categorizes contracts (NDA, services, license, lease, employment, financial); spawns a pipeline per category.
- **6 parallel workers (Sonnet 4.6):** one per category. Risk clauses, change-of-control, indemnity caps, KVKK compliance, error clauses.
- **Evaluator (Sonnet 4.6):** flags conflicting findings and low-confidence artifacts.
- **Writer (Opus 4.7):** drafts executive due diligence report.
- **Reviewer (senior lawyer):** human QA.

KVKK: PII redaction pre-orchestrator; Anthropic EU endpoint; audit log per artifact.

### Result

- Time: 2,500 hours → 65 human-hours + ~80 AI-processing hours, a ~17x reduction in total hours.
- Risk clauses detected: 23% higher (AI caught issues humans missed).
- Subjective quality: partners said "the report is now more consistent" (human teams were fatigue-inconsistent).
- Cost: ~$8,500 LLM cost per M&A vs ~$2.4M saved in human hours.

### Case 2 (Anonymized) — Turkish Bank: Credit Application Triage

**Problem.** 8,000-12,000 corporate credit applications/day. Per application: KYC + AML + financial statements + precedent (blacklist, restructuring history) + customer profile. Manual review: 35-50 analyst-minutes per application.

**Solution.** Multi-agent triage with five parallel subagents (KYC/AML, financial statements, precedent, sector/macro risk, customer profile + cross-sell), one evaluator, a writer producing the credit memo, and a senior credit analyst reviewer. KVKK/BDDK: PII redaction, EU/TR endpoints, on-prem MCP gateway, audit logs.

**Result.** Review time 35 min → 8 min human + 6 min AI; risk-score stdev cut from 15 to 6; high-risk catch rate up 18%; ops capacity 1.4x without new hires.

### Case 3 (Anonymized) — Turkish Strategy Consultancy: Sector Reports

**Problem.** A typical sector report: 4 weeks × 3 analysts ≈ 480 research-hours + 80 writing-hours.

**Solution.** Lead spawns 6 researcher subagents (one per section: market sizing, players, regulation, trends, risks, opportunities), one data subagent pulling TÜİK/KAP/sector association data, a critic, a writer, and a senior partner reviewer.

**Result.** Report cycle 4 weeks → 6 days; each analyst can run 3x more parallel projects; source diversity 40% higher than human-only; client NPS 8 → 9.4.

## 9. Risks, Costs, and Operational Concerns

### Token Cost

Anthropic reports that multi-agent systems use about 15x the tokens of a chat interaction (single agents about 4x). The decision criterion: **"Does the per-subagent output's value exceed the token cost?"** Easily yes for deep research, easily no for casual chat.

### Deadlock and Infinite Loops

If subagents can spawn subagents (recursive), infinite loops are possible. Mitigations: call-depth limit (max 3), per-task timeout, global cost cap.
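
The three mitigations can live in one guard object that every spawn must pass through. This is a sketch under assumptions: `SpawnGuard` and its limit names are hypothetical, and cost is whatever per-call figure your telemetry reports.

```typescript
interface GuardLimits {
  maxDepth: number;   // e.g. 3
  maxCostUsd: number; // global cost cap for the whole task
  timeoutMs: number;  // wall-clock budget for the whole task
}

class SpawnGuard {
  private spentUsd = 0;
  private readonly startedAt: number;
  private readonly limits: GuardLimits;
  private readonly now: () => number;

  // `now` is injectable so the timeout can be tested with a fake clock.
  constructor(limits: GuardLimits, now: () => number = Date.now) {
    this.limits = limits;
    this.now = now;
    this.startedAt = now();
  }

  // Call before every spawn. Throws rather than letting recursion run away.
  check(depth: number): void {
    if (depth > this.limits.maxDepth) {
      throw new Error(`call depth ${depth} exceeds limit ${this.limits.maxDepth}`);
    }
    if (this.spentUsd >= this.limits.maxCostUsd) {
      throw new Error(`cost cap of $${this.limits.maxCostUsd} reached`);
    }
    if (this.now() - this.startedAt > this.limits.timeoutMs) {
      throw new Error("task timeout exceeded");
    }
  }

  // Call after every subagent finishes, with its measured cost.
  record(costUsd: number): void {
    this.spentUsd += costUsd;
  }
}
```

The orchestrator creates one guard per user task and passes it down with an incremented `depth` on each recursive spawn, so a runaway branch fails fast instead of draining the budget.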

### Error Handling

If a subagent fails, the orchestrator has three options: (1) fail the whole task, (2) skip the subtask and continue, or (3) retry up to 2x. The most robust approach combines them: the critic marks the artifact as low-confidence and the orchestrator decides.
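
The three failure options map onto a single policy switch. The sketch below is illustrative; `FailurePolicy`, `attempt`, and `runWithPolicy` are hypothetical names, and real code would add backoff between retries.

```typescript
type FailurePolicy = "fail_task" | "skip" | "retry";
type Settled<R> = { ok: true; value: R } | { ok: false; error: unknown };

// Try `fn` up to `tries` times and report the outcome instead of throwing.
async function attempt<R>(fn: () => Promise<R>, tries: number): Promise<Settled<R>> {
  let last: Settled<R> = { ok: false, error: new Error("not attempted") };
  for (let i = 0; i < tries && !last.ok; i++) {
    last = await fn().then(
      (value): Settled<R> => ({ ok: true, value }),
      (error: unknown): Settled<R> => ({ ok: false, error })
    );
  }
  return last;
}

// Runs all subtasks in parallel and applies the chosen policy to failures.
// `failed` holds the indexes of subtasks that produced no artifact.
async function runWithPolicy<T, R>(
  tasks: T[],
  worker: (task: T) => Promise<R>,
  policy: FailurePolicy,
  maxRetries = 2
): Promise<{ results: R[]; failed: number[] }> {
  const tries = policy === "retry" ? maxRetries + 1 : 1;
  const settled = await Promise.all(tasks.map((task) => attempt(() => worker(task), tries)));
  const results: R[] = [];
  const failed: number[] = [];
  settled.forEach((s, i) => {
    if (s.ok) results.push(s.value);
    else failed.push(i);
  });
  if (policy === "fail_task" && failed.length > 0) {
    throw new Error(`${failed.length} subagent(s) failed`);
  }
  return { results, failed };
}
```

Returning `failed` alongside `results` lets the writer state which subtasks are missing rather than silently producing a thinner report.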

### Observability

Track per subagent: latency, tokens in/out, model, success, output size, evaluator score. Tools: Langfuse, Arize Phoenix, Helicone, OpenTelemetry.
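
Before wiring in a tracing vendor, the per-subagent fields can be captured with a thin wrapper. This sketch assumes the caller can report token counts in an `AgentReply`; both that type and `traced` are hypothetical, and the `sink` array stands in for whatever exporter you use.

```typescript
// Hypothetical reply shape: the text plus token counts reported by the caller.
interface AgentReply { text: string; tokensIn: number; tokensOut: number }

interface SubagentSpan {
  agent: string;
  model: string;
  latencyMs: number;
  tokensIn: number;
  tokensOut: number;
  outputChars: number;
  success: boolean;
}

// Wraps one subagent call and appends a span to `sink`, whether it succeeds or fails.
async function traced(
  agent: string,
  model: string,
  call: () => Promise<AgentReply>,
  sink: SubagentSpan[],
  now: () => number = Date.now
): Promise<AgentReply> {
  const start = now();
  try {
    const reply = await call();
    sink.push({
      agent, model,
      latencyMs: now() - start,
      tokensIn: reply.tokensIn,
      tokensOut: reply.tokensOut,
      outputChars: reply.text.length,
      success: true,
    });
    return reply;
  } catch (err) {
    sink.push({
      agent, model,
      latencyMs: now() - start,
      tokensIn: 0, tokensOut: 0, outputChars: 0,
      success: false,
    });
    throw err; // telemetry must not swallow the failure
  }
}
```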

### Non-Determinism

Subagent outputs are stochastic. The same task run twice can yield different results, which is a challenge for deterministic pipelines (audit, financial). Mitigation: temperature=0 (which reduces but does not eliminate variance), structured output, and an eval harness.

<callout-box data-variant="warning" data-title="Multi-Agent = Complexity Multiplier">

Multi-agent raises operational complexity 3-5x. If single-agent + good RAG delivers 80% of the value, treat multi-agent as post-POC, not production day-one. ROI should drive the decision, not coolness.

</callout-box>

## 10. Frequently Asked Questions

<callout-box data-variant="answer" data-title="Is multi-agent always better?">

No. Single-step tasks (simple Q&A, short summaries, code completion) are fine for single agents. Multi-agent shines on **parallel + multi-step** tasks like deep research, multi-document analysis, and multi-stakeholder reporting. Anthropic's 90.2% is for *research* — not chat.

</callout-box>

<callout-box data-variant="answer" data-title="How many subagents are ideal?">

3-5 is most common per Anthropic. More raises orchestration overhead and cost-cap risk. Fewer loses parallelism advantages.

</callout-box>

<callout-box data-variant="answer" data-title="Which model for subagents?">

A Sonnet 4.6 + Haiku 4.5 mix is most common. Low-stakes workers run on Haiku (cheap, fast), high-stakes workers on Sonnet. The Lead is almost always Opus 4.7.

</callout-box>

<callout-box data-variant="answer" data-title="How do I design the structured artifact schema?">

Optimize for the Lead's merging step. Typical fields: subtask_id, key_findings, sources, confidence, open_questions, next_actions. JSON Schema + Pydantic/Zod validation on the orchestrator side is mandatory.

</callout-box>

<callout-box data-variant="answer" data-title="How do I control cost?">

Three layers: (1) global cost cap per task, (2) per-subagent token cap, (3) early termination (planner signals "enough info"). Claude Agent SDK exposes these as config.

</callout-box>

<callout-box data-variant="answer" data-title="LangGraph or Claude Agent SDK?">

Claude-centric stack + .claude/ workflow + MCP: Claude Agent SDK. Multi-provider + complex state machines + conditional branching: LangGraph. They can also coexist.

</callout-box>

<callout-box data-variant="answer" data-title="How do I debug multi-agent?">

Track each subagent: full prompt + response + tools called + latency + token usage. Use Langfuse traces or Anthropic's workflow viewer at console.anthropic.com. Record all subagent calls for reproducibility.

</callout-box>

## 10.5. Cost Optimization Strategies

Ten practical levers:

1. **Use Haiku for workers** — reserve Sonnet for medium-stakes work, Opus for lead/writer only.
2. **Prompt caching** — subagent system prompts repeat; with Anthropic prompt caching, cached input tokens are billed at roughly a tenth of the base input price.
3. **Early termination** — let the planner stop spawning if "enough info."
4. **Cache subagent outputs** — same task returns cached artifact.
5. **Batch subagents** — pair small subtasks into one subagent.
6. **Tool-output truncation** — summarize huge tool returns before feeding to subagent.
7. **Context optimization** — store summaries in Lead context, not raw outputs.
8. **Streaming** — start the writer as soon as the first artifact arrives.
9. **Selective re-runs** — re-run only the subagent flagged by the critic.
10. **Provisioned throughput** — committed capacity through a cloud provider (for example Amazon Bedrock) for high volume.

Together, these levers typically cut multi-agent cost 3-7x.
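
Lever 4 (cache subagent outputs) is the simplest to prototype. The sketch below is a minimal in-memory version under assumptions: `ArtifactCache` is a hypothetical class, artifacts are stored as strings, and the key is the agent name plus `JSON.stringify(task)`, which is sensitive to property order.

```typescript
class ArtifactCache {
  private readonly entries = new Map<string, { artifact: string; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;
  hits = 0;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Assumption: identical (agent, task) pairs serialize identically.
  private keyFor(agent: string, task: unknown): string {
    return agent + ":" + JSON.stringify(task);
  }

  // Returns the cached artifact if still fresh; otherwise runs the subagent and stores the result.
  async getOrRun(agent: string, task: unknown, run: () => Promise<string>): Promise<string> {
    const key = this.keyFor(agent, task);
    const cached = this.entries.get(key);
    if (cached && cached.expiresAt > this.now()) {
      this.hits++;
      return cached.artifact;
    }
    const artifact = await run();
    this.entries.set(key, { artifact, expiresAt: this.now() + this.ttlMs });
    return artifact;
  }
}
```

The same structure supports lever 9: when the critic flags one artifact, evicting only that key forces a single re-run while the other subtasks stay cached.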

## 10.7. Multi-Agent Eval Harness

Track intermediate metrics (subagent faithfulness/recall, critic accuracy, plan quality) alongside final metrics (end-to-end accuracy, coherence, citation accuracy, latency, cost). Measure them with Langfuse hierarchical traces, the Anthropic Console workflow viewer, or custom Grafana + OpenTelemetry GenAI dashboards. Without an eval harness, regressions go unseen.

## 10.8. Multi-Agent Anti-Patterns

Production failure modes:

1. **Echo chamber** — all subagents pull from the same source.
2. **Hyper-granular decomposition** — 12 subtasks for a small task.
3. **No evaluator** — quality unmeasured.
4. **Free-form artifacts** — schemaless strings → regex parsing → flake.
5. **Shared mutable state** — race conditions.
6. **Synchronous waits** — losing parallelism.
7. **Unbounded recursion** — runaway cost.
8. **Long-lived Lead context** — lost-in-the-middle returns.
9. **No cost cap** — blowups.
10. **Skipping reviewer on high-stakes tasks** — legal/reputational risk.

## 10.9. Production Readiness Checklist (16 Items)

- [ ] Lead + worker + critic + writer + reviewer roles documented.
- [ ] Per-subagent system prompts + tools + models documented.
- [ ] Structured artifact JSON schema written + enforced.
- [ ] Pydantic/Zod validation; fail-fast on schema mismatch.
- [ ] Cost caps (global + per subagent).
- [ ] Concurrency limit (≤ 4-6 parallel).
- [ ] Retry policy (≤ 2 retries, exponential backoff).
- [ ] Timeout per subagent (60-300s).
- [ ] Cancellation propagation.
- [ ] PII redaction pre-orchestrator.
- [ ] Audit logging per subagent call.
- [ ] KVKK/BDDK compliance sign-off.
- [ ] Observability stack (Langfuse + Helicone).
- [ ] Eval harness (intermediate + final).
- [ ] Production smoke test (10+ real tasks).
- [ ] Runbook (incident response, rollback).

Fewer than 16/16 green is a launch risk.

## 11. Next Steps

Practical roadmap to bring multi-agent to your enterprise:

1. **POC evaluation.** Review current single-agent workloads; pick 1-2 use-cases where multi-agent makes sense. 2-3 weeks.
2. **Pattern design.** Lead + worker roles, structured artifact schema, evaluator strategy, cost caps, KVKK layer. 4-6 weeks.
3. **Production deploy + observability.** Langfuse traces, retry policy, deadlock detection, eval harness. 6-10 weeks.
4. **Training.** Workshop for devs and domain teams on Claude Agent SDK + .claude/ workflow + MCP integration.

Reach out via the contact form on the site.

<references-list data-items="[{&quot;title&quot;:&quot;How we built our multi-agent research system&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot;,&quot;author&quot;:&quot;Anthropic Engineering&quot;,&quot;publishedAt&quot;:&quot;2025-03&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Building Effective Agents&quot;,&quot;url&quot;:&quot;https://www.anthropic.com/research/building-effective-agents&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2024-12&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Claude Agent SDK&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/agents/sdk&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Subagents and .claude/agents/&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/claude-code/subagents&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Lost in the Middle: How Language Models Use Long Contexts&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2307.03172&quot;,&quot;author&quot;:&quot;Liu et al.&quot;,&quot;publishedAt&quot;:&quot;2023-07&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;AutoGen — Microsoft Multi-Agent 
Conversation&quot;,&quot;url&quot;:&quot;https://microsoft.github.io/autogen/&quot;,&quot;author&quot;:&quot;Microsoft&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;Microsoft&quot;},{&quot;title&quot;:&quot;LangGraph&quot;,&quot;url&quot;:&quot;https://langchain-ai.github.io/langgraph/&quot;,&quot;author&quot;:&quot;LangChain&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;LangChain&quot;},{&quot;title&quot;:&quot;CrewAI&quot;,&quot;url&quot;:&quot;https://docs.crewai.com/&quot;,&quot;author&quot;:&quot;CrewAI&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;CrewAI&quot;},{&quot;title&quot;:&quot;OpenAI Swarm&quot;,&quot;url&quot;:&quot;https://github.com/openai/swarm&quot;,&quot;author&quot;:&quot;OpenAI&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;GitHub&quot;},{&quot;title&quot;:&quot;ReAct: Synergizing Reasoning and Acting&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2210.03629&quot;,&quot;author&quot;:&quot;Yao et al.&quot;,&quot;publishedAt&quot;:&quot;2022&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Reflexion&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2303.11366&quot;,&quot;author&quot;:&quot;Shinn et al.&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Tree of Thoughts&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2305.10601&quot;,&quot;author&quot;:&quot;Yao et al.&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;Generative Agents&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2304.03442&quot;,&quot;author&quot;:&quot;Park et al.&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;Stanford / Google&quot;},{&quot;title&quot;:&quot;Anthropic 
Console&quot;,&quot;url&quot;:&quot;https://console.anthropic.com/&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Langfuse&quot;,&quot;url&quot;:&quot;https://langfuse.com/&quot;,&quot;author&quot;:&quot;Langfuse&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Langfuse&quot;},{&quot;title&quot;:&quot;Arize Phoenix&quot;,&quot;url&quot;:&quot;https://docs.arize.com/phoenix&quot;,&quot;author&quot;:&quot;Arize&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Arize&quot;},{&quot;title&quot;:&quot;OpenTelemetry GenAI&quot;,&quot;url&quot;:&quot;https://opentelemetry.io/docs/specs/semconv/gen-ai/&quot;,&quot;author&quot;:&quot;CNCF&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;CNCF&quot;},{&quot;title&quot;:&quot;Fountain City — Multi-Agent Production Patterns&quot;,&quot;url&quot;:&quot;https://fountaincity.com/multi-agent&quot;,&quot;author&quot;:&quot;Fountain City&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Fountain City&quot;},{&quot;title&quot;:&quot;The AI Engineer Substack: Multi-Agent Deep Dive&quot;,&quot;url&quot;:&quot;https://theaiengineer.substack.com/p/multi-agent&quot;,&quot;author&quot;:&quot;The AI Engineer&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Substack&quot;},{&quot;title&quot;:&quot;Cognition AI — Don&apos;&apos;t Build Multi-Agents (Counterpoint)&quot;,&quot;url&quot;:&quot;https://cognition.ai/blog/dont-build-multi-agents&quot;,&quot;author&quot;:&quot;Cognition AI&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Cognition&quot;},{&quot;title&quot;:&quot;AgentBench&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2308.03688&quot;,&quot;author&quot;:&quot;Liu et 
al.&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;GAIA&quot;,&quot;url&quot;:&quot;https://arxiv.org/abs/2311.12983&quot;,&quot;author&quot;:&quot;Mialon et al.&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;arXiv&quot;},{&quot;title&quot;:&quot;SWE-bench&quot;,&quot;url&quot;:&quot;https://www.swebench.com/&quot;,&quot;author&quot;:&quot;SWE-bench&quot;,&quot;publishedAt&quot;:&quot;2024&quot;,&quot;publisher&quot;:&quot;Princeton&quot;},{&quot;title&quot;:&quot;Claude Code Documentation&quot;,&quot;url&quot;:&quot;https://docs.anthropic.com/en/docs/claude-code&quot;,&quot;author&quot;:&quot;Anthropic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Anthropic&quot;},{&quot;title&quot;:&quot;Pydantic&quot;,&quot;url&quot;:&quot;https://docs.pydantic.dev/&quot;,&quot;author&quot;:&quot;Pydantic&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Pydantic&quot;},{&quot;title&quot;:&quot;Zod&quot;,&quot;url&quot;:&quot;https://zod.dev/&quot;,&quot;author&quot;:&quot;Zod&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Zod&quot;},{&quot;title&quot;:&quot;KVKK - Law No. 
6698&quot;,&quot;url&quot;:&quot;https://www.kvkk.gov.tr/&quot;,&quot;author&quot;:&quot;Republic of Turkiye - KVKK&quot;,&quot;publishedAt&quot;:&quot;2016-04-07&quot;,&quot;publisher&quot;:&quot;Republic of Turkiye&quot;},{&quot;title&quot;:&quot;BDDK Cloud Services Regulation&quot;,&quot;url&quot;:&quot;https://www.bddk.org.tr/&quot;,&quot;author&quot;:&quot;BDDK&quot;,&quot;publishedAt&quot;:&quot;2023&quot;,&quot;publisher&quot;:&quot;BDDK&quot;},{&quot;title&quot;:&quot;Helicone&quot;,&quot;url&quot;:&quot;https://helicone.ai/&quot;,&quot;author&quot;:&quot;Helicone&quot;,&quot;publishedAt&quot;:&quot;2025&quot;,&quot;publisher&quot;:&quot;Helicone&quot;},{&quot;title&quot;:&quot;EU AI Act&quot;,&quot;url&quot;:&quot;https://artificialintelligenceact.eu/&quot;,&quot;author&quot;:&quot;European Commission&quot;,&quot;publishedAt&quot;:&quot;2024-03&quot;,&quot;publisher&quot;:&quot;EU&quot;}]"></references-list>

---

A living document; the multi-agent ecosystem shifts every quarter, so this guide is **revised every quarter**.