# Anthropic's Multi-Agent Architecture: How the Orchestrator-Worker Pattern Beats Single-Agent by 90.2%

> Source: https://sukruyusufkaya.com/en/blog/anthropic-multi-agent-orchestrator-worker-pattern-2026
> Updated: 2026-07-11T19:45:44.110Z

**TLDR:** Anthropic's Multi-Agent Research system beat single-agent Claude Opus by 90.2% on internal research evals using an orchestrator-worker pattern. This guide covers lead agent + parallel subagent architecture, structured artifact handoffs, planner-generator-evaluator loops, Claude Agent SDK with .claude/agents/, cost caps, deadlock prevention, comparisons with CrewAI/LangGraph/AutoGen, and a Turkish law-firm contract-analysis case.

## 1. Why a Single Agent is Not Enough

A single LLM agent with a strong model, tools, and a good system prompt can handle a lot. But on **deep research, multi-document analysis, and multi-stakeholder reporting**, single agents hit three bottlenecks:

1. **Context dilution.** As the task grows, files, observations, and intermediate outputs accumulate in the context window. The *lost in the middle* effect makes the model lose track of early steps.
2. **Sequential bottleneck.** A single agent cannot work in parallel; it analyzes 10 documents one by one. Ten documents at one hour each take ten hours.
3. **Divided attention.** The same agent both sets strategy and grinds through details, and does both at medium quality.

Anthropic addressed these with the **multi-agent orchestrator-worker pattern**, documented in *"How we built our multi-agent research system"* (June 2025). Result: **90.2% better than single-agent Claude Opus** on internal research evals.

### Multi-Agent Timeline: From ReAct to Anthropic

Four waves between 2022 and 2025:

- **2022: ReAct.** Single-agent think/act/observe loop. Tool use standardized.
- **2023: Reflexion, Tree of Thoughts, Generative Agents.** Self-reflection, multi-path reasoning, agent simulation.
- **2024: AutoGen, CrewAI, LangGraph.** First multi-agent Python frameworks. Experimental.
- **2025: Anthropic Multi-Agent Research + Claude Agent SDK.** First production-grade, benchmark-backed evidence. Pattern matured.

Anthropic's June 2025 publication mattered because it was a *production report*, not academic research: real users, real workloads, real cost/quality trade-offs.

### What Multi-Agent is Not

- Multi-step ≠ multi-agent.
- LLM chaining ≠ multi-agent.
- Tool use ≠ subagent.
- Mixture of Experts ≠ multi-agent.

These distinctions matter for identifying the cases where multi-agent truly adds value.

## 2. Architecture Anatomy: Lead + Workers

The pattern has five components.

### 2.1. Lead Orchestrator Agent

- **Model:** Best-in-class (Opus 4.7 in Anthropic's example).
- **Role:** Understand the task, decompose it into subtasks, dispatch subagents in parallel, merge results, present to the user.
- **Context:** High-level plan + subagent output summaries. No details.

### 2.2. Worker Subagents

- **Model:** Fast and cheap (Sonnet 4.6 or Haiku 4.5).
- **Role:** Execute the Lead's subtask in a clean context.
- **Context isolation:** Each subagent has its own context and cannot see other subagents.

### 2.3. Tools

- Subagents have access to web search, code execution, file read, and MCP tools.
- The Lead typically has only a "spawn subagent" tool, not direct tools.

### 2.4. Structured Artifact Handoffs

- Subagent output is not free text but a **JSON-schema artifact**.
- Example: `{ key_finding: ..., sources: [...], confidence: 0.x }`.
- The Lead parses and merges artifacts.

### 2.5. Evaluator / Critic (optional)

- A third subagent type: audits a worker's output for quality and accuracy.
- This is the "Evaluator" part of Planner-Generator-Evaluator.

### 2.6. Subagent Lifecycle: Seven Phases

1. **Spawn** (Lead requests a new subagent instance).
2. **Initialize** (system prompt + tool catalog loaded).
3. **Receive Task** (parsed from JSON input).
4. **Execute** (ReAct loop).
5. **Validate Output** (schema validation).
6. **Return Artifact** (structured handoff).
7. **Cleanup** (release memory, files, network).

A solid orchestrator defines a fallback and retry policy per phase.

### 2.7. Artifact Schema Principles

1. Stable top-level fields across all artifacts.
2. Type-safe with Pydantic/Zod.
3. Confidence score required (0-1).
4. Source attribution per finding.
5. Open questions flagged when uncertain.
6. Hash signature for tamper detection.
7. Timestamp for cache invalidation.

Anti-pattern: each subagent uses a different schema, forcing the orchestrator into ad-hoc string parsing.

## 3. Planner-Generator-Evaluator Pattern

The multi-agent architecture is enriched by a sub-pattern: **Planner-Generator-Evaluator** (PGE).

### Flow

1. **Planner** (Lead, Opus 4.7): reads the query, produces n subtasks.
2. **n parallel Generators/Workers:** each executes a subtask and returns a structured artifact.
3. **Evaluator/Critic:** scores artifacts; signals a retry for low-quality ones.
4. **Writer** (Lead): merges high-quality artifacts and writes the answer.
5. **Reviewer** (optional): QAs the final answer before delivery.

### Pattern Variations

- **Hierarchical PGE** (Lead → Sub-lead → Worker) for very large tasks.
- **PGE with Self-Reflection** (Generator double-checks itself).
- **Adversarial PGE** (one Evaluator red-teams another) for high-stakes cases.
- **Iterative PGE** (low confidence triggers a loop, 2-3 iterations).

### When PGE is Not the Right Pattern

- Very short tasks solvable in one prompt.
- Deterministic pipelines that must be auditable.
- Tight token budgets.

## 4. Practical Implementation: Claude Agent SDK

The Claude Agent SDK (2025) is the reference implementation. Folder layout:

~~~text
my-project/
├── .claude/
│   ├── settings.json
│   ├── agents/
│   │   ├── researcher.md
│   │   ├── critic.md
│   │   ├── writer.md
│   │   └── reviewer.md
│   └── mcp.json
└── src/
~~~

### Subagent Definition (.claude/agents/researcher.md)

The `tools` and `model` values below use Claude Code's built-in tool names and model aliases; adjust them to whatever your SDK version accepts.

~~~~markdown
---
name: researcher
description: |
  Use for deep research tasks. Given a question and authorized sources,
  returns a structured artifact with findings, citations, and confidence.
tools: WebSearch, WebFetch, Read
model: sonnet
---
You are a research subagent. For each subtask:

1. Search authoritative sources.
2. Extract key findings with direct quotes.
3. Cite every claim with source URL + date.
4. Return a JSON artifact (confidence is a number between 0 and 1):

~~~json
{
  "subtask_id": "",
  "key_findings": ["..."],
  "sources": [{"url": "...", "title": "...", "date": "..."}],
  "confidence": 0.0,
  "open_questions": ["..."]
}
~~~
~~~~

### Lead Orchestrator (application code)

A sketch of the orchestration loop. `query()` streams messages rather than returning a single object, so a small helper collects the final result text. Routing to a named subagent through the prompt is an assumption about how delegation is triggered; verify it against the SDK reference for your version.

~~~typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

type RunOptions = {
  model?: string;
  systemPrompt?: string;
  settingSources?: ("user" | "project" | "local")[];
};

// query() yields a stream of messages; return the text of the final result.
async function runToText(prompt: string, options: RunOptions = {}): Promise<string> {
  for await (const message of query({ prompt, options })) {
    if (message.type === "result") {
      if (message.subtype === "success") return message.result;
      throw new Error(`Agent run failed: ${message.subtype}`);
    }
  }
  throw new Error("Agent run ended without a result message");
}

// Assumption: naming a subagent from .claude/agents/ in the prompt makes the
// main agent delegate to it; settingSources loads the project's .claude/ config.
const viaSubagent = (name: string, payload: string) =>
  runToText(`Use the ${name} subagent for this task:\n${payload}`, {
    settingSources: ["project"],
  });

async function multiAgentResearch(question: string): Promise<string> {
  // 1. Planner: decompose the question into parallel subtasks.
  const plan = await runToText(
    `Decompose into 3-5 parallel subtasks, as JSON {"subtasks": [...]}: ${question}`,
    { model: "claude-opus-4-7", systemPrompt: "You are a research planner..." },
  );
  const subtasks: unknown[] = JSON.parse(plan).subtasks;

  // 2. Generators: one researcher run per subtask, in parallel.
  const workerResults = await Promise.all(
    subtasks.map((task) => viaSubagent("researcher", JSON.stringify(task))),
  );

  // 3. Evaluator: the critic scores each artifact.
  const evaluations = await Promise.all(
    workerResults.map((artifact) => viaSubagent("critic", artifact)),
  );

  // 4. Writer: synthesize artifacts and evaluations into the final answer.
  return runToText(
    JSON.stringify({ subtasks, results: workerResults, evals: evaluations }),
    { model: "claude-opus-4-7", systemPrompt: "You are a research writer. Synthesize the artifacts..." },
  );
}
~~~

### Config Details

- **Cost cap:** Max tokens + time per subagent.
- **Concurrency:** Max parallel subagents (4-6 most common).
- **Retry policy:** Retry up to 2 times on failure, then fall back to the critic.
- **Telemetry:** Latency, tokens, model, success per subagent.

### Markdown Frontmatter for Subagents

Anthropic's `.claude/agents/*.md` format puts config in the frontmatter and the prompt in the body. The file lives in git, gets code-reviewed, and can be shared across teams.
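To make the frontmatter/body split concrete, here is a minimal sketch of a loader that separates the two parts of a subagent file. `parseSubagentFile` and `SubagentFile` are hypothetical names, not part of the Claude Agent SDK, and the parser only understands flat `key: value` lines; a real loader would use a proper YAML parser.

```typescript
// Hypothetical helper (not an SDK API): split a subagent file into config and prompt.
interface SubagentFile {
  config: Record<string, string>;
  prompt: string;
}

function parseSubagentFile(source: string): SubagentFile {
  // Frontmatter is the block between the first two `---` lines.
  const match = /^---\r?\n([\s\S]*?)\r?\n---\r?\n?([\s\S]*)$/.exec(source);
  if (!match) {
    throw new Error("missing frontmatter block");
  }
  const [, header = "", body = ""] = match;
  const config: Record<string, string> = {};
  for (const line of header.split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx === -1) continue; // skip lines that are not `key: value`
    config[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  return { config, prompt: body.trim() };
}

const exampleFile = [
  "---",
  "name: researcher",
  "model: sonnet",
  "---",
  "You are a research subagent.",
].join("\n");

const parsed = parseSubagentFile(exampleFile);
console.log(parsed.config, parsed.prompt);
```

Keeping config and prompt in one reviewable file is what lets a change to a subagent's model or tools go through the same pull-request flow as code.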
### MCP Integration

A multi-agent system pairs naturally with MCP: the orchestrator gets spawn/store meta-tools; workers get domain MCP tools (web_search, github, sql, vector_db); critics get read/score tools. `.claude/mcp.json` controls which MCPs are visible to which subagent.

### State Management

State lives in three tiers: per-subagent (transient), orchestrator (combined), and persistent (Redis/Postgres/object storage for long tasks).

## 5. Why 90.2%? Performance Analysis

Four factors are behind the gap.

### Context Isolation

Each subagent works in a clean context, so lost-in-the-middle does not set in. Instead of cramming 200 pages into one context, you give 5 subagents 40 pages each.

### Parallelism

With 5 parallel subagents, wall-clock time for the parallel phase approaches 1/5 of the sequential time, before planning and merge overhead. Anthropic showed the parallel system dominates single-agent on the latency-quality trade-off.

### Model Optimization

Lead on Opus, workers on Sonnet/Haiku: the right model in the right place. Strategic thinking runs on the premium model, grunt work on the cheap one.

### Specialization

A "researcher" agent prompted explicitly for its role beats a generalist single agent that does everything at average quality.

### Where the Gap Widens

The 90.2% gap is not universal. It widens on:

- Multi-document synthesis (20+ documents, conflicting findings).
- Multi-stakeholder reporting (legal + financial + ops in one report).
- Deep web research (50+ sources, citation aggregation).
- Parallel hypothesis testing.

It is close to zero on single-file code review, short Q&A, summaries, classic RAG retrieval, and code completion.

### Benchmark Caveat

90.2% is Anthropic's internal eval on their own task set. Independent benchmarks (AgentBench, GAIA, SWE-bench) show gaps in the 15-40% range. That is still meaningful, but architecture is not the only lever; prompt engineering and role design also matter.

## 6. Comparison with Other Multi-Agent Frameworks

### Which Framework, Which Use Case?
- **Claude Agent SDK:** Claude-based stack, .claude/ workflow, deep MCP integration.
- **LangGraph:** Complex state machines and loops; agentic graphs.
- **CrewAI:** Fast POC + role-based design; Python ecosystem.
- **AutoGen:** Agent-to-agent conversation + human-in-the-loop.

### Hybrid Stacks

In practice teams combine frameworks: Claude Agent SDK as lead + LangGraph workers; CrewAI for the POC, then Claude Agent SDK for production; AutoGen for human-in-the-loop + LangGraph for the deterministic part. The trade-off is reduced lock-in versus added complexity.

### "Don't Build Multi-Agents" Counterpoint

In June 2025 Cognition AI (Devin) published *"Don't Build Multi-Agents"*, arguing that multi-agent stacks compound latency and error surface, and that a single agent with careful context management can match the output. The argument is valid for **interactive coding**; Anthropic's reported cases are deep research, which is a different profile. The honest answer: it depends on the use case.

## 7. Turkish Angle: KVKK, BDDK, and Multi-Stakeholder Work

Three scenarios where multi-agent shines for Turkish companies.

### Compliance Automation

KVKK breach review: the planner reads the complaint, then three workers run in parallel: (1) pull the VERBİS record, (2) search precedent KVKK rulings, (3) compare against internal policy. The evaluator scores, the writer produces the report. 4-8 human-hours → 12-18 minutes.

### Multi-Document Analysis

Law firms, audit firms, M&A consultancies: simultaneous analysis of 50-200 documents. A single agent is insufficient; multi-agent is the natural choice.

### Research + Report Generation

Strategy consulting, sector reports: parallel scanning of multiple sources, merged structured findings, executive summary.

### KVKK Considerations

PII redaction before the orchestrator; subagents bound to EU/TR-hosted endpoints; artifacts written to audit logs.

### Use Case Map for Turkish Sectors

- **Banking/Insurance:** M&A contract DD, credit risk evaluation (parallel KYC/AML/financials), KVKK breach automation, fraud triage.
- **Legal:** Contract DD, case-law search + synthesis, regulation impact analysis.
- **Healthcare:** Multi-specialist case discussion (cardiologist + endocrinologist subagents), clinical literature triage (reviewer mandatory).
- **E-commerce/Marketing:** Competitor catalog analysis, customer segmentation research.
- **Manufacturing/Logistics:** Supply-chain risk, supplier due diligence, ops dashboards.

High-stakes sectors (healthcare, legal) require iterative + adversarial PGE.

### Turkish Subagent Design

For Turkish customers: (1) system prompt in Turkish, (2) Turkish-first sources (mevzuat, içtihat; Said Surucu MCPs), (3) Turkish citation format, (4) KVKK + BDDK awareness embedded in system prompts. This yields a *Turkey-first* multi-agent architecture rather than "global + translated."

## 8. Case Study (Anonymized): Turkish Law Firm Contract Analysis

### Problem

An Istanbul-based corporate law firm must analyze every contract (90-300) at a target during M&A due diligence. Typical M&A: 12-18 lawyers × 3-4 weeks × 60 hours/week ≈ 2,200-4,300 hours of manual work.

### Solution

Multi-agent orchestrator-worker pattern:

- **Lead (Opus 4.7):** categorizes contracts (NDA, services, license, lease, employment, financial); spawns a pipeline per category.
- **6 parallel workers (Sonnet 4.6):** one per category. Each checks risk clauses, change-of-control, indemnity caps, KVKK compliance, and error clauses.
- **Evaluator (Sonnet 4.6):** flags conflicting findings and low-confidence artifacts.
- **Writer (Opus 4.7):** drafts the executive due diligence report.
- **Reviewer (senior lawyer):** human QA.

KVKK: PII redaction before the orchestrator; Anthropic EU endpoint; audit log per artifact.

### Result

- Time: 2,500 hours → 65 human-hours + ~80 AI-processing hours, roughly a 17x reduction in total hours (2,500 / 145).
- Risk clauses detected: 23% higher (AI caught issues humans missed).
- Subjective quality: partners said "the report is now more consistent" (human teams grew inconsistent with fatigue).
- Cost: ~$8,500 LLM cost per M&A vs ~$2.4M saved in human hours.

### Case 2 (Anonymized): Turkish Bank, Credit Application Triage

**Problem.** 8,000-12,000 corporate credit applications/day. Per application: KYC + AML + financial statements + precedent (blacklist, restructuring history) + customer profile. Manual review: 35-50 analyst-minutes per application.

**Solution.** Multi-agent triage with five parallel subagents (KYC/AML, financial statements, precedent, sector/macro risk, customer profile + cross-sell), one evaluator, a writer producing the credit memo, and a senior credit analyst as reviewer. KVKK/BDDK: PII redaction, EU/TR endpoints, on-prem MCP gateway, audit logs.

**Result.** Review time 35 min → 8 min human + 6 min AI; risk-score standard deviation cut from 15 to 6; high-risk catch rate up 18%; ops capacity 1.4x without new hires.

### Case 3 (Anonymized): Turkish Strategy Consultancy, Sector Reports

**Problem.** A typical sector report: 4 weeks × 3 analysts ≈ 480 research-hours + 80 writing-hours.

**Solution.** The Lead spawns 6 researcher subagents (one per section: market sizing, players, regulation, trends, risks, opportunities), one data subagent pulling TÜİK/KAP/sector association data, a critic, a writer, and a senior partner as reviewer.

**Result.** Report cycle 4 weeks → 6 days; each analyst can run 3x more parallel projects; source diversity 40% higher than human-only; client NPS 8 → 9.4.

## 9. Risks, Costs, and Operational Concerns

### Token Cost

Multi-agent spends 4-15x more tokens. In Anthropic's data, agents use about 4x the tokens of a chat interaction and multi-agent systems about 15x. The decision criterion: **"Does the value of each subagent's output exceed its token cost?"** Easily yes for deep research, easily no for casual chat.

### Deadlock and Infinite Loops

If subagents can spawn subagents (recursion), infinite loops are possible. Mitigations: call-depth limit (max 3), per-task timeout, global cost cap.

### Error Handling

If a subagent fails, there are three options: (1) fail the whole task, (2) skip and continue, (3) retry up to 2x.
The most robust approach: the critic marks the artifact as low-confidence and the orchestrator decides.

### Observability

Track per subagent: latency, tokens in/out, model, success, output size, evaluator score. Tools: Langfuse, Arize Phoenix, Helicone, OpenTelemetry.

### Non-Determinism

Subagent outputs are stochastic. The same task run twice can yield different results, which is a challenge for deterministic pipelines (audit, financial). Mitigation: temperature=0, structured output, eval harness.

Multi-agent raises operational complexity 3-5x. If single-agent + good RAG delivers 80% of the value, treat multi-agent as a post-POC step, not a day-one production requirement. ROI should drive the decision, not coolness.

## 10. Frequently Asked Questions

**Is multi-agent necessary for every task?**

No. Single-step tasks (simple Q&A, short summaries, code completion) are fine for single agents. Multi-agent shines on **parallel + multi-step** tasks like deep research, multi-document analysis, and multi-stakeholder reporting. Anthropic's 90.2% is for *research*, not chat.

**How many subagents should run in parallel?**

3-5 is most common per Anthropic. More raises orchestration overhead and the risk of hitting cost caps. Fewer gives up the parallelism advantage.
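One way to enforce such a ceiling is a small concurrency-limited map, sketched below. `mapWithLimit` and `demoBoundedFanOut` are hypothetical helpers, and the worker is a stub that sleeps instead of calling a model; swap in the real subagent call.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }
  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Stub run: six subtasks, at most four at a time, tracking peak concurrency.
async function demoBoundedFanOut(): Promise<{ outputs: string[]; peak: number }> {
  let inFlight = 0;
  let peak = 0;
  const subtasks = ["market", "players", "regulation", "trends", "risks", "opportunities"];
  const outputs = await mapWithLimit(subtasks, 4, async (task) => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, 10)); // simulated latency
    inFlight--;
    return `artifact:${task}`;
  });
  return { outputs, peak };
}
```

Compared with a bare `Promise.all`, the lane approach keeps the number of simultaneous model calls fixed regardless of how many subtasks the planner emits.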

**Which model should subagents run on?**

A Sonnet 4.6 + Haiku 4.5 mix is most common. Low-stakes workers run on Haiku (cheap, fast), high-stakes workers on Sonnet. The Lead is almost always Opus 4.7.

**How should the artifact schema be designed?**

Optimize for the Lead's merging step. Typical fields: subtask_id, key_findings, sources, confidence, open_questions, next_actions. JSON Schema + Pydantic/Zod validation on the orchestrator side is mandatory.
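As an illustration of orchestrator-side validation, the sketch below checks an artifact by hand in plain TypeScript. It is a dependency-free stand-in for a Zod or Pydantic schema, not a replacement for one; `parseArtifact` and the interface names are hypothetical, and `next_actions` is left out to keep the example short.

```typescript
interface ArtifactSource {
  url: string;
  title: string;
  date: string;
}

interface Artifact {
  subtask_id: string;
  key_findings: string[];
  sources: ArtifactSource[];
  confidence: number;
  open_questions: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

function isSource(value: unknown): value is ArtifactSource {
  if (typeof value !== "object" || value === null) return false;
  const { url, title, date } = value as Record<string, unknown>;
  return typeof url === "string" && typeof title === "string" && typeof date === "string";
}

// Fail fast: throw on the first field that does not match the schema.
function parseArtifact(raw: string): Artifact {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) throw new Error("artifact must be an object");
  const { subtask_id, key_findings, sources, confidence, open_questions } =
    data as Record<string, unknown>;
  if (typeof subtask_id !== "string" || subtask_id === "") throw new Error("subtask_id is required");
  if (!isStringArray(key_findings)) throw new Error("key_findings must be string[]");
  if (!Array.isArray(sources) || !sources.every(isSource)) throw new Error("sources are malformed");
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number in [0, 1]");
  }
  if (!isStringArray(open_questions)) throw new Error("open_questions must be string[]");
  return { subtask_id, key_findings, sources, confidence, open_questions };
}

const sampleArtifact = parseArtifact(JSON.stringify({
  subtask_id: "t1",
  key_findings: ["Clause 7 caps indemnity"],
  sources: [{ url: "https://example.com/doc", title: "Contract", date: "2026-01-01" }],
  confidence: 0.8,
  open_questions: [],
}));
console.log(sampleArtifact.subtask_id, sampleArtifact.confidence);
```

Rejecting a malformed artifact at the boundary keeps the Lead's merge step free of ad-hoc string parsing.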

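**How do I keep costs under control?**

Three layers: (1) a global cost cap per task, (2) a per-subagent token cap, (3) early termination (the planner signals "enough info"). The Claude Agent SDK exposes part of this as configuration (for example a turn limit per run); the remaining checks live in orchestrator code.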
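The three layers can be sketched as a small budget tracker that the orchestrator consults between subagent calls. `TokenBudget` is a hypothetical class, not an SDK feature, and it measures both caps in tokens for simplicity; a real system would likely convert to currency per model.

```typescript
class TokenBudget {
  private spentGlobal = 0;
  private readonly spentBySubagent = new Map<string, number>();
  private readonly globalCap: number;
  private readonly perSubagentCap: number;

  constructor(globalCap: number, perSubagentCap: number) {
    this.globalCap = globalCap;
    this.perSubagentCap = perSubagentCap;
  }

  // Record usage reported after a subagent call. The tokens are already spent,
  // so they are always recorded; the throw tells the orchestrator to stop.
  charge(subagentId: string, tokens: number): void {
    const subtotal = (this.spentBySubagent.get(subagentId) ?? 0) + tokens;
    this.spentBySubagent.set(subagentId, subtotal);
    this.spentGlobal += tokens;
    if (subtotal > this.perSubagentCap) {
      throw new Error(`per-subagent cap exceeded by ${subagentId}`);
    }
    if (this.spentGlobal > this.globalCap) {
      throw new Error("global cap exceeded");
    }
  }

  get remaining(): number {
    return this.globalCap - this.spentGlobal;
  }

  // Early termination: stop spawning when the planner has enough information
  // or when less than `reserve` tokens are left for the writer step.
  shouldStopSpawning(enoughInfo: boolean, reserve: number): boolean {
    return enoughInfo || this.remaining < reserve;
  }
}

const budget = new TokenBudget(1000, 400);
budget.charge("researcher-1", 300);
budget.charge("researcher-2", 300);
const stopSpawning = budget.shouldStopSpawning(false, 500);
console.log(budget.remaining, stopSpawning);
```

Reserving part of the global budget for the writer avoids the failure mode where workers consume everything and no tokens remain to produce the final answer.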

**Claude Agent SDK or LangGraph?**

For a Claude-centric stack with the .claude/ workflow and MCP: Claude Agent SDK. For multi-provider setups with complex state machines and conditional branching: LangGraph. They can also coexist.

**How do I debug a multi-agent system?**

Track each subagent: full prompt + response + tools called + latency + token usage. Use Langfuse traces or Anthropic's workflow viewer at console.anthropic.com. Record all subagent calls for reproducibility.

## 10.5. Cost Optimization Strategies

Ten practical levers:

1. **Use Haiku for workers.** Reserve Sonnet for medium-stakes work and Opus for lead/writer only.
2. **Prompt caching.** Subagent system prompts repeat; with Anthropic prompt caching, repeated input is ~10x cheaper.
3. **Early termination.** Let the planner stop spawning once there is "enough info."
4. **Cache subagent outputs.** The same task returns the cached artifact.
5. **Batch subagents.** Pair small subtasks into one subagent.
6. **Tool-output truncation.** Summarize huge tool returns before feeding them to the subagent.
7. **Context optimization.** Store summaries in the Lead's context, not raw outputs.
8. **Streaming.** Start the writer as soon as the first artifact arrives.
9. **Selective re-runs.** Re-run only the subagent flagged by the critic.
10. **Provisioned throughput.** Committed capacity via Amazon Bedrock or Azure for high volume.

Together, these levers typically cut multi-agent cost 3-7x.

## 10.7. Multi-Agent Eval Harness

Track intermediate metrics (subagent faithfulness/recall, critic accuracy, plan quality) plus final metrics (end-to-end accuracy, coherence, citation accuracy, latency, cost). Measure them with Langfuse hierarchical traces, the Anthropic Console workflow viewer, or custom Grafana + OpenTelemetry GenAI dashboards. Without an eval harness, regressions go unseen.

## 10.8. Multi-Agent Anti-Patterns

Production failure modes:

1. **Echo chamber:** all subagents pull from the same source.
2. **Hyper-granular decomposition:** 12 subtasks for a small task.
3. **No evaluator:** quality goes unmeasured.
4. **Free-form artifacts:** schemaless strings → regex parsing → flaky runs.
5. **Shared mutable state:** race conditions.
6. **Synchronous waits:** parallelism is lost.
7. **Unbounded recursion:** runaway cost.
8. **Long-lived Lead context:** lost-in-the-middle returns.
9. **No cost cap:** cost blowups.
10. **Skipping the reviewer on high-stakes tasks:** legal/reputational risk.

## 10.9. Production Readiness Checklist (16 Items)

- [ ] Lead + worker + critic + writer + reviewer roles documented.
- [ ] Per-subagent system prompts + tools + models documented.
- [ ] Structured artifact JSON schema written + enforced.
- [ ] Pydantic/Zod validation; fail-fast on schema mismatch.
- [ ] Cost caps (global + per subagent).
- [ ] Concurrency limit (≤ 4-6 parallel).
- [ ] Retry policy (≤ 2 retries, exponential backoff).
- [ ] Timeout per subagent (60-300s).
- [ ] Cancellation propagation.
- [ ] PII redaction before the orchestrator.
- [ ] Audit logging per subagent call.
- [ ] KVKK/BDDK compliance sign-off.
- [ ] Observability stack (Langfuse + Helicone).
- [ ] Eval harness (intermediate + final).
- [ ] Production smoke test (10+ real tasks).
- [ ] Runbook (incident response, rollback).

Anything less than 16/16 green is a launch risk.

## 11. Next Steps

A practical roadmap for bringing multi-agent to your enterprise:

1. **POC evaluation.** Review current single-agent workloads; pick 1-2 use cases where multi-agent makes sense. 2-3 weeks.
2. **Pattern design.** Lead + worker roles, structured artifact schema, evaluator strategy, cost caps, KVKK layer. 4-6 weeks.
3. **Production deploy + observability.** Langfuse traces, retry policy, deadlock detection, eval harness. 6-10 weeks.
4. **Training.** Workshop for devs and domain teams on Claude Agent SDK + .claude/ workflow + MCP integration.

Reach out via the contact form on the site.

---

A living document; the multi-agent ecosystem shifts every quarter, so this guide is **revised every quarter**.