Artificial Intelligence·35 min·May 27, 2026

Anthropic's Multi-Agent Architecture: How the Orchestrator-Worker Pattern Beats Single-Agent by 90.2%

Anthropic's Multi-Agent Research system beat single-agent Claude Opus by 90.2% on internal research evals using an orchestrator-worker pattern. This guide covers lead agent + parallel subagent architecture, structured artifact handoffs, planner-generator-evaluator loops, Claude Agent SDK with .claude/agents/, cost caps, deadlock prevention, comparisons with CrewAI/LangGraph/AutoGen, and a Turkish law-firm contract-analysis case.

Şükrü Yusuf KAYA
AI Expert · Enterprise AI Consultant
1. Why a Single Agent is Not Enough

A single LLM agent with a strong model, tools, and a good system prompt can handle a lot. But on deep research, multi-document analysis, and multi-stakeholder reporting, single agents hit three bottlenecks:

  1. Context dilution. As the task grows, files, observations, and intermediate outputs accumulate in the context window. The lost-in-the-middle effect then makes the model overlook material buried deep in that context, including its own early steps.
  2. Sequential bottleneck. A single agent cannot think in parallel; it analyzes 10 documents one by one, so work that could finish in about an hour when parallelized takes 10.
  3. Divided attention. The same agent both strategizes and grinds through details, doing each at medium quality.

Anthropic addressed these with the multi-agent orchestrator-worker pattern, documented in "How we built our multi-agent research system" (June 2025). Result: the multi-agent system scored 90.2% higher than single-agent Claude Opus on Anthropic's internal research evals.

Definition
Multi-Agent Orchestrator-Worker Pattern
An agentic AI architecture where a Lead/Orchestrator agent plans the task and dispatches 3-5 parallel Worker subagents. Each subagent runs in isolated context; results return to the Lead as structured artifacts. Documented by Anthropic Engineering and reference-implemented in the Claude Agent SDK.
Also known as: Multi-Agent System, MAS, Orchestrator-Worker
Wikidata: Q1064782

Multi-Agent Timeline: From ReAct to Anthropic

Four waves between 2022 and 2026:

  • 2022 — ReAct. Single-agent think/act/observe loop. Tool use standardized.
  • 2023 — Reflexion, Tree of Thoughts, Generative Agents. Self-reflection, multi-path reasoning, agent simulation.
  • 2024 — AutoGen, CrewAI, LangGraph. First multi-agent Python frameworks. Experimental.
  • 2025 — Anthropic Multi-Agent Research + Claude Agent SDK. First production-grade, benchmark-backed evidence. Pattern matured.

Anthropic's June 2025 publication mattered because it was a production report rather than academic research: real users, real workloads, real cost/quality trade-offs.

What Multi-Agent is Not

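  • Multi-step ≠ multi-agent. One agent looping through many steps in a single context is still one agent.
  • LLM chaining ≠ multi-agent. A fixed pipeline of prompts has no agent deciding what to do next.
  • Tool use ≠ subagent. A tool returns a result; a subagent runs its own reasoning loop in its own context.
  • Mixture of Experts ≠ multi-agent. MoE routes tokens between sub-networks inside one model; it is not a set of cooperating agents.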

Clarifying these distinctions is critical to identify cases where multi-agent truly adds value.

2. Architecture Anatomy: Lead + Workers

The pattern has five components.

2.1. Lead Orchestrator Agent

  • Model: Best-in-class (Claude Opus 4 in Anthropic's published report; Opus 4.7 in this guide's examples).
  • Role: Understand task, decompose into subtasks, dispatch subagents in parallel, merge results, present to user.
  • Context: High-level plan + subagent output summaries. No details.

2.2. Worker Subagents

  • Model: Fast and cheap (Sonnet 4.6 or Haiku 4.5).
  • Role: Execute the Lead's subtask in clean context.
  • Context isolation: Each subagent has its own context; cannot see other subagents.

2.3. Tools

  • Subagents have access to web search, code execution, file read, MCP tools.
  • The Lead typically only has a "spawn subagent" tool, not direct tools.

2.4. Structured Artifact Handoffs

  • Subagent output is not free text but a JSON-schema artifact.
  • Example: { key_finding: ..., sources: [...], confidence: 0.x }.
  • The Lead parses and merges artifacts.
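
To make the handoff concrete, here is a minimal sketch of how a Lead might parse and merge artifacts. The `Artifact` shape follows the example fields above; `parseArtifact`, `mergeArtifacts`, and the 0.5 confidence floor are hypothetical names and values chosen for illustration, not part of any SDK.

```typescript
// Hypothetical artifact shape, following the example fields above.
interface Artifact {
  key_finding: string;
  sources: string[];
  confidence: number; // expected range 0..1
}

// Parse one subagent reply; throw instead of guessing when the shape is wrong.
function parseArtifact(raw: string): Artifact {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("artifact must be a JSON object");
  }
  const record = data as Record<string, unknown>;
  const finding = record.key_finding;
  const sources = record.sources;
  const confidence = record.confidence;
  if (typeof finding !== "string") throw new Error("key_finding must be a string");
  if (!Array.isArray(sources) || !sources.every((s) => typeof s === "string")) {
    throw new Error("sources must be an array of strings");
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("confidence must be a number between 0 and 1");
  }
  return { key_finding: finding, sources: sources as string[], confidence };
}

// Merge: keep findings at or above a confidence floor, most confident first,
// and collect the distinct sources behind them in first-seen order.
function mergeArtifacts(artifacts: Artifact[], minConfidence = 0.5) {
  const kept = artifacts
    .filter((a) => a.confidence >= minConfidence)
    .sort((x, y) => y.confidence - x.confidence);
  const sources: string[] = [];
  for (const artifact of kept) {
    for (const source of artifact.sources) {
      if (sources.indexOf(source) === -1) sources.push(source);
    }
  }
  return { findings: kept.map((a) => a.key_finding), sources };
}
```

Rejecting a malformed artifact at the boundary keeps the merge step free of defensive checks; whether low-confidence artifacts are dropped or sent back for a retry is a policy choice this sketch leaves open.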

2.5. Evaluator / Critic (optional)

  • A third subagent type: audits a worker's output for quality/accuracy.
  • The "Evaluator" part of Planner-Generator-Evaluator.

2.6. Subagent Lifecycle: Seven Phases

  1. Spawn (Lead requests a new subagent instance).
  2. Initialize (system prompt + tool catalog loaded).
  3. Receive Task (parsed from JSON input).
  4. Execute (ReAct loop).
  5. Validate Output (schema validation).
  6. Return Artifact (structured handoff).
  7. Cleanup (release memory, files, network).

A solid orchestrator defines a fallback and retry policy per phase.
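
One way to read "a fallback and retry policy per phase" is a small driver that walks the seven phases in order. The sketch below is synchronous for brevity; `PhasePolicy`, `runLifecycle`, and the `"abort"`/`"skip"` fallbacks are hypothetical names, and a production version would also guarantee that cleanup runs after an abort.

```typescript
// The seven phases listed above, in order.
const PHASES = [
  "spawn", "initialize", "receive_task", "execute",
  "validate_output", "return_artifact", "cleanup",
] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical per-phase policy: how many retries, and what to do when they run out.
interface PhasePolicy {
  maxRetries: number;
  onExhausted: "abort" | "skip";
}

type PhaseHandler = () => void; // a handler signals failure by throwing

function runLifecycle(
  handlers: Record<Phase, PhaseHandler>,
  policies: Partial<Record<Phase, PhasePolicy>>,
): { completed: Phase[]; skipped: Phase[] } {
  const completed: Phase[] = [];
  const skipped: Phase[] = [];
  for (const phase of PHASES) {
    // Phases without an explicit policy get no retries and abort on failure.
    const policy: PhasePolicy = policies[phase] ?? { maxRetries: 0, onExhausted: "abort" };
    let done = false;
    for (let attempt = 0; attempt <= policy.maxRetries && !done; attempt++) {
      try {
        handlers[phase]();
        done = true;
      } catch (error) {
        // Swallow the error and retry until attempts are exhausted.
      }
    }
    if (done) {
      completed.push(phase);
    } else if (policy.onExhausted === "skip") {
      skipped.push(phase);
    } else {
      throw new Error(`phase "${phase}" failed after ${policy.maxRetries + 1} attempt(s)`);
    }
  }
  return { completed, skipped };
}
```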

2.7. Artifact Schema Principles

  1. Stable top-level fields across all artifacts.
  2. Type-safe with Pydantic/Zod.
  3. Confidence score required (0-1).
  4. Source attribution per finding.
  5. Open questions flagged when uncertain.
  6. Hash signature for tamper detection.
  7. Timestamp for cache invalidation.

Anti-pattern: each subagent uses a different schema, forcing the orchestrator into ad-hoc string parsing.
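
The principles above can be sketched as one shared envelope plus a check the orchestrator runs on receipt. Every name here (`ArtifactEnvelope`, `seal`, `checkEnvelope`) is hypothetical. To stay dependency-free the digest is FNV-1a, which is not cryptographic; a real deployment would presumably swap in SHA-256 or an HMAC.

```typescript
// Hypothetical envelope: the same top-level fields for every subagent type.
interface ArtifactEnvelope {
  subtaskId: string;
  producedAt: number;      // epoch milliseconds, used for cache invalidation
  confidence: number;      // required, 0..1
  sources: string[];       // attribution for the findings in `payload`
  openQuestions: string[]; // flagged uncertainties
  payload: unknown;        // subagent-specific content
  checksum: string;        // digest of the payload, for tamper detection
}

// FNV-1a (32-bit) over UTF-16 code units: a NON-cryptographic stand-in so the
// sketch has no dependencies. Use SHA-256 or an HMAC in production.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Attach the checksum when the subagent returns its artifact.
function seal(fields: Omit<ArtifactEnvelope, "checksum">): ArtifactEnvelope {
  return { ...fields, checksum: fnv1a(JSON.stringify(fields.payload)) };
}

// Returns a list of violations; an empty list means the envelope is acceptable.
function checkEnvelope(env: ArtifactEnvelope, now: number, maxAgeMs: number): string[] {
  const problems: string[] = [];
  if (!(env.confidence >= 0 && env.confidence <= 1)) problems.push("confidence out of range");
  if (env.sources.length === 0) problems.push("no source attribution");
  if (now - env.producedAt > maxAgeMs) problems.push("artifact is stale");
  if (fnv1a(JSON.stringify(env.payload)) !== env.checksum) problems.push("checksum mismatch");
  return problems;
}
```

Returning a list of violations rather than throwing lets the orchestrator decide per violation whether to retry, downgrade confidence, or drop the artifact.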

3. Planner-Generator-Evaluator Pattern

The multi-agent architecture is enriched by a sub-pattern: Planner-Generator-Evaluator.

Multi-Agent Roles and Responsibilities
| Role | Responsibility | Typical Model | Context Size |
| --- | --- | --- | --- |
| Planner | Decompose task into subtasks | Opus 4.7 | High (200K-1M) |
| Generator (Worker) | Execute subtask | Sonnet 4.6 / Haiku 4.5 | Medium (own context) |
| Evaluator / Critic | Audit output | Sonnet 4.6 | Medium |
| Writer | Compose final | Opus 4.7 | High |
| Reviewer | QA the final | Sonnet 4.6 | Medium |

Flow

  1. Planner (Lead, Opus 4.7): reads query, produces n subtasks.
  2. n parallel Generators/Workers: each executes a subtask, returns a structured artifact.
  3. Evaluator/Critic: scores artifacts; signals re-try for low-quality ones.
  4. Writer (Lead): merges high-quality artifacts and writes the answer.
  5. Reviewer (optional): QA's the final answer before delivery.

Pattern Variations

  • Hierarchical PGE (Lead → Sub-lead → Worker) for very large tasks.
  • PGE with Self-Reflection (Generator double-checks itself).
  • Adversarial PGE (one Evaluator red-teams another) for high-stakes cases.
  • Iterative PGE (low-confidence triggers loop, 2-3 iterations).
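
The iterative variant can be sketched as a bounded loop. This is a minimal, synchronous illustration: `generate` and `evaluate` stand in for subagent calls, and the 0.8 threshold and 3-iteration cap are assumed defaults, not values from Anthropic.

```typescript
interface Draft { content: string; }
interface Verdict { score: number; feedback: string; } // score in 0..1

// Hypothetical signatures: in a real system both would be subagent calls.
type Generate = (task: string, feedback: string | null) => Draft;
type Evaluate = (task: string, draft: Draft) => Verdict;

function iterativePGE(
  task: string,
  generate: Generate,
  evaluate: Evaluate,
  threshold = 0.8,
  maxIterations = 3,
): { draft: Draft; verdict: Verdict; iterations: number; accepted: boolean } {
  let feedback: string | null = null;
  let best: { draft: Draft; verdict: Verdict } | null = null;
  for (let i = 1; i <= maxIterations; i++) {
    const draft = generate(task, feedback);
    const verdict = evaluate(task, draft);
    // Remember the strongest draft in case no iteration clears the threshold.
    if (best === null || verdict.score > best.verdict.score) best = { draft, verdict };
    if (verdict.score >= threshold) {
      return { draft, verdict, iterations: i, accepted: true };
    }
    feedback = verdict.feedback; // the next attempt sees the critic's notes
  }
  if (best === null) throw new Error("maxIterations must be at least 1");
  return { draft: best.draft, verdict: best.verdict, iterations: maxIterations, accepted: false };
}
```

Returning the best unaccepted draft with `accepted: false` leaves the final call to the orchestrator or a human reviewer instead of silently shipping a weak answer.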

When PGE is Not the Right Pattern

  • Very short tasks solvable in one prompt.
  • Deterministic audit-required pipelines.
  • Tight token budgets.

4. Practical Implementation: Claude Agent SDK

The Claude Agent SDK (2025) is the reference implementation. Folder layout:

~~~text
my-project/
├── .claude/
│   ├── settings.json
│   ├── agents/
│   │   ├── researcher.md
│   │   ├── critic.md
│   │   ├── writer.md
│   │   └── reviewer.md
│   └── mcp.json
└── src/
~~~

Subagent Definition (.claude/agents/researcher.md)

~~~~markdown
---
name: researcher
description: |
  Use for deep research tasks. Given a question and authorized sources,
  returns a structured artifact with findings, citations, and confidence.
tools: [web_search, fetch, read_file]
model: claude-sonnet-4-6
---

You are a research subagent. For each subtask:

1. Search authoritative sources.
2. Extract key findings with direct quotes.
3. Cite every claim with source URL + date.
4. Return JSON artifact:

~~~json
{
  "subtask_id": "<id>",
  "key_findings": ["..."],
  "sources": [{"url":"...","title":"...","date":"..."}],
  "confidence": 0.0-1.0,
  "open_questions": ["..."]
}
~~~
~~~~

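Lead Orchestrator (application code)

~~~typescript
import { query, type Options } from "@anthropic-ai/claude-agent-sdk";

// query() returns an async stream of messages; the final "result" message
// carries the answer text on success.
async function run(prompt: string, options: Options = {}): Promise<string> {
  for await (const message of query({ prompt, options })) {
    if (message.type === "result") {
      if (message.subtype === "success") return message.result;
      throw new Error(`Agent run failed: ${message.subtype}`);
    }
  }
  throw new Error("Agent run ended without a result message");
}

// Assumption: subagents defined in .claude/agents/ are reached by delegation,
// so the session loads project settings and is allowed the subagent-spawning
// tool ("Task" in the SDK versions this sketch targets; check your version).
const delegate = (agent: string, payload: string): Promise<string> =>
  run(`Use the ${agent} subagent on this input and return only its output:\n${payload}`, {
    settingSources: ["project"],
    allowedTools: ["Task"],
  });

async function multiAgentResearch(question: string): Promise<string> {
  // 1. Plan: the lead decomposes the question. Assumes the planner's system
  //    prompt asks for JSON of the form {"subtasks": [...]}.
  const plan = await run("Decompose into 3-5 parallel subtasks: " + question, {
    model: "claude-opus-4-7",
    systemPrompt: "You are a research planner...",
  });
  const subtasks: unknown[] = JSON.parse(plan).subtasks;

  // 2. Generate: one researcher per subtask, all in parallel.
  const workerResults = await Promise.all(
    subtasks.map((task) => delegate("researcher", JSON.stringify(task)))
  );

  // 3. Evaluate: the critic scores each artifact.
  const evaluations = await Promise.all(
    workerResults.map((artifact) => delegate("critic", artifact))
  );

  // 4. Write: the lead synthesizes artifacts and evaluations.
  return run(JSON.stringify({ subtasks, results: workerResults, evals: evaluations }), {
    model: "claude-opus-4-7",
    systemPrompt: "You are a research writer. Synthesize the artifacts...",
  });
}
~~~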

Config Details

  • Cost cap: Max tokens + time per subagent.
  • Concurrency: Max parallel subagents (4-6 most common).
  • Retry policy: Up to 2 retries on failure, then escalate to the critic.
  • Telemetry: Latency, tokens, model, success per subagent.
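
As an illustration of the concurrency setting, the helper below caps how many subagent calls are in flight at once. `mapWithConcurrency` is a hypothetical utility written for this guide, not an SDK function; `worker` stands in for whatever call spawns a subagent.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results are returned in input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  if (limit < 1) throw new Error("limit must be at least 1");
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the queue is empty.
  // JavaScript is single-threaded, so `next++` needs no locking.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(limit, items.length);
  const lanes: Promise<void>[] = [];
  for (let i = 0; i < laneCount; i++) lanes.push(lane());
  await Promise.all(lanes);
  return results;
}
```

With a limit of 4, a plan with 10 subtasks starts four workers immediately and launches the next one each time a worker finishes. Per-subagent token and time caps would wrap the `worker` function itself.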

Markdown Frontmatter for Subagents

Anthropic's .claude/agents/*.md format keeps config in the frontmatter and the prompt in the body. The file lives in git, gets code-reviewed, and can be shared across teams.

MCP Integration

A multi-agent system pairs naturally with MCP: orchestrator gets spawn/store meta-tools; workers get domain MCP tools (web_search, github, sql, vector_db); critics get read/score tools. .claude/mcp.json controls which MCPs are visible to which subagent.

State Management

State lives in three tiers: per-subagent (transient), orchestrator (combined), persistent (Redis/Postgres/object storage for long tasks).

5. Why 90.2%? Performance Analysis

Four factors behind the gap:

Context Isolation

Each subagent works in clean context — no lost-in-the-middle. Instead of cramming 200 pages into one context, you give 5 subagents 40 pages each.

Parallelism

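With 5 parallel subagents, the worker phase takes roughly 1/5 as long as running the same subtasks one after another; planning and synthesis still run sequentially, so the end-to-end speedup is lower. Anthropic showed the parallel system dominates single-agent on the latency-quality tradeoff.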
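
A quick back-of-the-envelope model shows how much of the ideal speedup survives once the sequential planning and merge steps are counted. `estimateWallClock` is a hypothetical helper and the durations in the usage note are made-up inputs, not measurements.

```typescript
// All durations share one unit (for example, minutes).
function estimateWallClock(
  planTime: number,
  subtaskTimes: number[],
  mergeTime: number,
  parallelism: number,
): { sequential: number; parallel: number; speedup: number } {
  // Baseline: the same plan, subtasks, and merge executed one after another.
  const sequential = planTime + subtaskTimes.reduce((sum, t) => sum + t, 0) + mergeTime;

  // Greedy scheduling: each subtask goes to the lane that frees up first.
  const lanes = new Array<number>(Math.max(1, parallelism)).fill(0);
  for (const t of subtaskTimes) {
    const earliest = lanes.indexOf(Math.min(...lanes));
    lanes[earliest] += t;
  }

  // Planning and merging stay sequential; only the worker phase overlaps.
  const parallel = planTime + Math.max(...lanes) + mergeTime;
  return { sequential, parallel, speedup: sequential / parallel };
}
```

For example, with a 2-minute plan, five 10-minute subtasks, a 3-minute merge, and 5 lanes, the model gives 55 minutes sequential against 15 minutes parallel: about 3.7x rather than 5x.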

Model Optimization

Lead with Opus, workers with Sonnet/Haiku — right model in the right place. Strategic thinking on premium model, grunt work on cheap model.

Specialization

A "researcher" agent prompted explicitly for its role beats a generalist single agent that does everything at average quality.

Where the Gap Widens

The 90.2% gap is not universal. It widens on:

  • Multi-document synthesis (20+ documents, conflicting findings).
  • Multi-stakeholder reporting (legal + financial + ops in one report).
  • Deep web research (50+ sources, citation aggregation).
  • Parallel hypothesis testing.

It is close to zero on single-file code review, short Q&A, summaries, classic RAG retrieval, and code completion.

Benchmark Caveat

90.2% is Anthropic's internal eval on their task set. Independent benchmarks (AgentBench, GAIA, SWE-bench) show gaps in the 15-40% range — still meaningful, but architecture is not the only lever; prompt engineering and role design also matter.

6. Comparison with Other Multi-Agent Frameworks

2026 Multi-Agent Framework Comparison
| Framework | Type | Model Agnostic | Production-Ready | Community |
| --- | --- | --- | --- | --- |
| Claude Agent SDK | Orchestrator-Worker | Claude only | Yes | Medium-High |
| LangGraph | Graph-based | Multi-provider | Yes | High |
| CrewAI | Role-based | Multi-provider | Yes | High |
| AutoGen (Microsoft) | Conversation | Multi-provider | Yes | High |
| OpenAI Swarm | Lightweight handoff | OpenAI | Experimental | Medium |
| Atomic Agents | Minimal | Multi-provider | New | Low |

Which Framework, Which Use Case?

  • Claude Agent SDK: Claude-based stack, .claude/ workflow, deep MCP integration.
  • LangGraph: Complex state machines and loops; agentic graphs.
  • CrewAI: Fast POC + role-based design; Python ecosystem.
  • AutoGen: Agent-to-agent conversation + human-in-the-loop.

Hybrid Stacks

In practice teams combine frameworks: Claude Agent SDK as lead + LangGraph workers; CrewAI for POC then Claude Agent SDK for production; AutoGen for human-in-the-loop + LangGraph for the deterministic part. The trade-off is reduced lock-in vs added complexity.

"Don't Build Multi-Agents" Counterpoint

In June 2025 Cognition AI (Devin) published "Don't Build Multi-Agents", arguing that multi-agent stacks compound latency and error surface, and that a single agent with careful context management can match the output. That holds for interactive coding; Anthropic's reported cases are deep research, a different profile. The honest answer: it depends on the use case.

7. Turkish Angle: KVKK, BDDK, and Multi-Stakeholder Work

Three scenarios where multi-agent shines for Turkish companies.

Compliance Automation

KVKK breach review: planner reads complaint, three workers run in parallel — (1) pull VERBİS record, (2) search precedent KVKK rulings, (3) compare against internal policy. Evaluator scores, writer produces report. 4-8 human-hours → 12-18 minutes.

Multi-Document Analysis

Law firms, audit firms, M&A consultancy: simultaneous analysis of 50-200 documents. Single-agent insufficient; multi-agent is the natural choice.

Research + Report Generation

Strategy consulting, sector reports: parallel scanning of multiple sources, merged structured findings, executive summary.

KVKK Considerations

PII redaction pre-orchestrator; subagents bound to EU/TR-hosted endpoints; artifacts written to audit logs.
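
A minimal sketch of the pre-orchestrator redaction step follows. The three patterns (email, Turkish IBAN, 11-digit national ID number) are illustrative assumptions and far from exhaustive: names, addresses, and phone numbers need a proper PII detection layer, and the ID rule checks only the shape, not the checksum digits.

```typescript
// Illustrative patterns only. Order matters: IBANs are replaced before the
// 11-digit ID rule so that digits inside an IBAN are never partially matched.
const PII_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // Turkish IBAN: "TR" + 2 check digits + 22 digits, optionally space-grouped.
  { label: "IBAN", pattern: /\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b/g },
  // Turkish national ID (TCKN): 11 digits, never starting with 0.
  { label: "TCKN", pattern: /\b[1-9]\d{10}\b/g },
];

// Replace each match with a typed placeholder and count what was removed,
// so the count (not the value) can go into the audit log.
function redactPII(text: string): { redacted: string; counts: Record<string, number> } {
  const counts: Record<string, number> = {};
  let redacted = text;
  for (const { label, pattern } of PII_PATTERNS) {
    redacted = redacted.replace(pattern, () => {
      counts[label] = (counts[label] ?? 0) + 1;
      return `[${label}]`;
    });
  }
  return { redacted, counts };
}
```

Running the redaction before the planner sees the input means no subagent context, artifact, or trace ever contains the raw values.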

Use Case Map for Turkish Sectors

  • Banking/Insurance: M&A contract DD, credit risk evaluation (parallel KYC/AML/financials), KVKK breach automation, fraud triage.
  • Legal: Contract DD, case-law search + synthesis, regulation impact analysis.
  • Healthcare: Multi-specialist case discussion (cardiologist + endocrinologist subagents), clinical literature triage (reviewer mandatory).
  • E-commerce/Marketing: Competitor catalog analysis, customer segmentation research.
  • Manufacturing/Logistics: Supply-chain risk, supplier due diligence, ops dashboards.

High-stakes sectors (healthcare, legal) require iterative + adversarial PGE.

Turkish Subagent Design

For Turkish customers: (1) system prompt in Turkish, (2) Turkish-first sources (mevzuat [legislation] and içtihat [case law], via Said Surucu MCPs), (3) Turkish citation format, (4) embed KVKK + BDDK awareness in system prompts. This yields a Turkey-first multi-agent architecture rather than "global + translated."

8. Case Study (Anonymized): Turkish Law Firm Contract Analysis

Problem

An Istanbul-based corporate law firm must analyze every contract (90-300) at a target during M&A due diligence. Typical M&A: 12-18 lawyers × 3-4 weeks × 60 hours/week ≈ 2,500-4,000 hours of manual work.

Solution

Multi-agent orchestrator-worker pattern:

  • Lead (Opus 4.7): categorizes contracts (NDA, services, license, lease, employment, financial); spawns a pipeline per category.
  • 6 parallel workers (Sonnet 4.6): one per category. Each extracts risk clauses, change-of-control terms, indemnity caps, KVKK compliance issues, and error clauses.
  • Evaluator (Sonnet 4.6): flags conflicting findings and low-confidence artifacts.
  • Writer (Opus 4.7): drafts executive due diligence report.
  • Reviewer (senior lawyer): human QA.

KVKK: PII redaction pre-orchestrator; Anthropic EU endpoint; audit log per artifact.

Result

  • Time: 2,500 hours → 65 human-hours + ~80 AI-processing hours (145 hours total), a ~17x speedup.
  • Risk clauses detected: 23% higher (AI caught issues humans missed).
  • Subjective quality: partners said "the report is now more consistent" (human teams were fatigue-inconsistent).
  • Cost: ~$8,500 LLM cost per M&A vs ~$2.4M saved in human hours.

Case 2 (Anonymized) — Turkish Bank: Credit Application Triage

Problem. 8,000-12,000 corporate credit applications/day. Per application: KYC + AML + financial statements + precedent (blacklist, restructuring history) + customer profile. Manual review: 35-50 analyst-minutes per application.

Solution. Multi-agent triage with five parallel subagents (KYC/AML, financial statements, precedent, sector/macro risk, customer profile + cross-sell), one evaluator, a writer producing the credit memo, and a senior credit analyst reviewer. KVKK/BDDK: PII redaction, EU/TR endpoints, on-prem MCP gateway, audit logs.

Result. Review time 35 min → 8 min human + 6 min AI; risk-score stdev cut from 15 to 6; high-risk catch rate up 18%; ops capacity 1.4x without new hires.

Case 3 (Anonymized) — Turkish Strategy Consultancy: Sector Reports

Problem. A typical sector report: 4 weeks × 3 analysts ≈ 480 research-hours + 80 writing-hours.

Solution. Lead spawns 6 researcher subagents (one per section: market sizing, players, regulation, trends, risks, opportunities), one data subagent pulling TÜİK/KAP/sector association data, a critic, a writer, and a senior partner reviewer.

Result. Report cycle 4 weeks → 6 days; analyst can run 3x more parallel projects; source diversity 40% higher than human-only; client NPS 8 → 9.4.

9. Risks, Costs, and Operational Concerns

Token Cost

Multi-agent spends 4-15x more tokens. The decision criterion: "Does the per-subagent output's value exceed the token cost?" Easily yes for deep research, easily no for casual chat.
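
The decision criterion reduces to simple arithmetic. In the sketch below, `Pricing` values are parameters you would fill in from your provider's current price list; the function names and the idea of expressing output value in USD are assumptions made for illustration.

```typescript
interface RunEstimate {
  inputTokens: number;
  outputTokens: number;
}

// USD per million tokens. Deliberately a parameter: take it from the
// provider's current price list rather than hard-coding it.
interface Pricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

function costUsd(run: RunEstimate, price: Pricing): number {
  return (run.inputTokens * price.inputPerMTok + run.outputTokens * price.outputPerMTok) / 1e6;
}

// Multi-agent pays off when the extra value of its output exceeds the extra spend.
function worthGoingMultiAgent(
  single: RunEstimate,
  multi: RunEstimate,
  price: Pricing,
  extraValueUsd: number,
): { extraCostUsd: number; worthIt: boolean } {
  const extraCostUsd = costUsd(multi, price) - costUsd(single, price);
  return { extraCostUsd, worthIt: extraValueUsd > extraCostUsd };
}
```

The hard part in practice is `extraValueUsd`: for a due-diligence report it can be estimated from analyst hours saved, while for casual chat it is close to zero, which is why the answer flips between those two cases.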

Deadlock and Infinite Loops

If subagents can spawn subagents (recursive), infinite loops are possible. Mitigations: call-depth limit (max 3), per-task timeout, global cost cap.
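
The three mitigations can live in one guard object that every spawn passes through. This is a sketch under assumed names (`SpawnGuard`, `GuardLimits`); timestamps are passed in explicitly so the logic stays easy to test.

```typescript
interface GuardLimits {
  maxDepth: number;        // e.g. 3, as suggested above
  maxTotalCostUsd: number; // global cost cap across all subagents
  taskTimeoutMs: number;   // per-task timeout
}

class SpawnGuard {
  private spentUsd = 0;
  private limits: GuardLimits;

  constructor(limits: GuardLimits) {
    this.limits = limits;
  }

  // Call before spawning. `parentDepth` is 0 for the Lead.
  // Returns the depth the new subagent will run at, or throws.
  checkSpawn(parentDepth: number): number {
    const depth = parentDepth + 1;
    if (depth > this.limits.maxDepth) {
      throw new Error(`spawn depth ${depth} exceeds limit ${this.limits.maxDepth}`);
    }
    if (this.spentUsd >= this.limits.maxTotalCostUsd) {
      throw new Error("global cost cap reached");
    }
    return depth;
  }

  // Call after each subagent finishes, with its measured cost.
  recordCost(usd: number): void {
    this.spentUsd += usd;
  }

  // True when a task started at `startedAt` has outlived its timeout.
  isTimedOut(startedAt: number, now: number): boolean {
    return now - startedAt > this.limits.taskTimeoutMs;
  }
}
```

Because the depth travels with each spawn request, a subagent cannot reset it; a runaway recursion fails at the guard instead of in the billing report.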

Error Handling

If a subagent fails: (1) fail the whole task, (2) skip and continue, (3) retry up to 2x. Most robust: critic marks artifact as low-confidence; orchestrator decides.
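
Written as a policy function, the options above look like the following sketch. The `required` flag, the function names, and the 0.6 critic floor are hypothetical; the point is that the choice between retry, skip, and failing the task is explicit data, not scattered `try/catch` blocks.

```typescript
type FailureAction = "retry" | "skip" | "fail_task";

interface FailureContext {
  attempts: number;   // attempts already made for this subtask, including the failed one
  maxRetries: number; // e.g. 2, as suggested above
  required: boolean;  // can the final answer be written without this subtask?
}

// Hard failures (crash, timeout, schema violation): retry first, then either
// skip an optional subtask or fail the whole task.
function decideOnFailure(ctx: FailureContext): FailureAction {
  if (ctx.attempts <= ctx.maxRetries) return "retry";
  return ctx.required ? "fail_task" : "skip";
}

// Soft failures: the subagent returned something, but the critic scored it low.
// The artifact is kept and marked so the orchestrator can decide what to do.
function markArtifact(criticScore: number, floor = 0.6): "accepted" | "low_confidence" {
  return criticScore >= floor ? "accepted" : "low_confidence";
}
```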

Observability

Track per subagent: latency, tokens in/out, model, success, output size, evaluator score. Tools: Langfuse, Arize Phoenix, Helicone, OpenTelemetry.

Non-Determinism

Subagent outputs are stochastic. The same task run twice can yield different results, which is a challenge for deterministic pipelines (audit, financial). Mitigation: temperature=0 (which reduces but does not eliminate variance), structured output, eval harness.

10. Frequently Asked Questions

10.5. Cost Optimization Strategies

Ten practical levers:

  1. Use Haiku for workers: reserve Sonnet for medium-stakes subtasks and Opus for lead/writer only.
  2. Prompt caching: subagent system prompts repeat, and Anthropic bills cached prompt reads at roughly a tenth of the base input price.
  3. Early termination: let the planner stop spawning once it has enough information.
  4. Cache subagent outputs: an identical task returns the cached artifact.
  5. Batch subagents: combine small subtasks into one subagent.
  6. Tool-output truncation: summarize huge tool returns before feeding them to the subagent.
  7. Context optimization: store summaries in the Lead context, not raw outputs.
  8. Streaming: start the writer as soon as the first artifact arrives.
  9. Selective re-runs: re-run only the subagent flagged by the critic.
  10. Provisioned throughput: committed capacity on Amazon Bedrock or Azure for high volume.

Together, these levers typically cut multi-agent cost 3-7x.
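
Lever 4 is the easiest to prototype. The sketch below is an in-memory cache keyed by agent and normalized task text with a time-to-live; `ArtifactCache` and its normalization rule are assumptions for illustration, and a long-running system would presumably back this with Redis or a database instead.

```typescript
interface CachedArtifact {
  artifact: string; // serialized artifact JSON
  storedAt: number; // epoch milliseconds
}

class ArtifactCache {
  private entries = new Map<string, CachedArtifact>();
  private ttlMs: number;
  hits = 0;
  misses = 0;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Normalize so trivially different phrasings of the same subtask share a key.
  static key(agent: string, task: string): string {
    return `${agent}::${task.trim().toLowerCase().replace(/\s+/g, " ")}`;
  }

  get(agent: string, task: string, now: number): string | undefined {
    const key = ArtifactCache.key(agent, task);
    const entry = this.entries.get(key);
    if (entry === undefined || now - entry.storedAt > this.ttlMs) {
      this.entries.delete(key); // drop expired entries eagerly
      this.misses++;
      return undefined;
    }
    this.hits++;
    return entry.artifact;
  }

  set(agent: string, task: string, artifact: string, now: number): void {
    this.entries.set(ArtifactCache.key(agent, task), { artifact, storedAt: now });
  }
}
```

The TTL matters for research workloads: a cached artifact about a regulation or a market price goes stale, so the lifetime should follow how quickly the underlying sources change.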

10.7. Multi-Agent Eval Harness

Intermediate metrics (subagent faithfulness/recall, critic accuracy, plan quality) plus final metrics (end-to-end accuracy, coherence, citation accuracy, latency, cost), measured with Langfuse hierarchical traces, Anthropic Console workflow viewer, or custom Grafana + OpenTelemetry GenAI dashboards. Without an eval harness, regressions go unseen.
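
As one example of a final metric, citation accuracy can be approximated mechanically. The definition used here is an assumption made for this sketch: a claim counts as correctly cited only if it cites at least one URL and every cited URL is among the sources the subagents actually retrieved. It does not check whether the source supports the claim, which needs a model-based or human judge.

```typescript
interface Finding {
  claim: string;
  citedUrls: string[];
}

// Share of claims whose citations all point at retrieved sources.
// Claims without any citation count as failures.
function citationAccuracy(findings: Finding[], retrievedUrls: string[]): number {
  if (findings.length === 0) return 1; // nothing claimed, nothing miscited
  const retrieved = new Set(retrievedUrls);
  const grounded = findings.filter(
    (f) => f.citedUrls.length > 0 && f.citedUrls.every((url) => retrieved.has(url)),
  );
  return grounded.length / findings.length;
}
```

Tracked per release, a drop in this number flags a regression such as a prompt change that makes the writer cite from memory instead of from artifacts.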

10.8. Multi-Agent Anti-Patterns

Production failure modes:

  1. Echo chamber — all subagents pull from the same source.
  2. Hyper-granular decomposition — 12 subtasks for a small task.
  3. No evaluator — quality unmeasured.
  4. Free-form artifacts — schemaless strings → regex parsing → flake.
  5. Shared mutable state — race conditions.
  6. Synchronous waits — losing parallelism.
  7. Unbounded recursion — runaway cost.
  8. Long-lived Lead context — lost-in-the-middle returns.
  9. No cost cap — blowups.
  10. Skipping reviewer on high-stakes tasks — legal/reputational risk.
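
Some of these failure modes can be detected automatically. As an example, the echo-chamber case (item 1) shows up as high source overlap between subagents. The sketch below uses Jaccard similarity over each subagent's source URLs; the function names and the 0.8 threshold are assumptions for illustration.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B| over distinct items.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let shared = 0;
  setA.forEach((item) => {
    if (setB.has(item)) shared++;
  });
  return shared / (setA.size + setB.size - shared);
}

// Flag every pair of subagents whose source sets overlap at or above the threshold.
function echoChamberPairs(
  sourcesByAgent: Record<string, string[]>,
  threshold = 0.8,
): Array<[string, string, number]> {
  const names = Object.keys(sourcesByAgent);
  const flagged: Array<[string, string, number]> = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const score = jaccard(sourcesByAgent[names[i]], sourcesByAgent[names[j]]);
      if (score >= threshold) flagged.push([names[i], names[j], score]);
    }
  }
  return flagged;
}
```

A flagged pair is a hint, not a verdict: two subagents may legitimately share a primary source, but when most pairs overlap the plan probably did not split the research space.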

10.9. Production Readiness Checklist (16 Items)

  • Lead + worker + critic + writer + reviewer roles documented.
  • Per-subagent system prompts + tools + models documented.
  • Structured artifact JSON schema written + enforced.
  • Pydantic/Zod validation; fail-fast on schema mismatch.
  • Cost caps (global + per subagent).
  • Concurrency limit (≤ 4-6 parallel).
  • Retry policy (≤ 2 retries, exponential backoff).
  • Timeout per subagent (60-300s).
  • Cancellation propagation.
  • PII redaction pre-orchestrator.
  • Audit logging per subagent call.
  • KVKK/BDDK compliance sign-off.
  • Observability stack (Langfuse + Helicone).
  • Eval harness (intermediate + final).
  • Production smoke test (10+ real tasks).
  • Runbook (incident response, rollback).

Anything short of 16/16 green is a launch risk.

11. Next Steps

Practical roadmap to bring multi-agent to your enterprise:

  1. POC evaluation. Review current single-agent workloads; pick 1-2 use-cases where multi-agent makes sense. 2-3 weeks.
  2. Pattern design. Lead + worker roles, structured artifact schema, evaluator strategy, cost caps, KVKK layer. 4-6 weeks.
  3. Production deploy + observability. Langfuse traces, retry policy, deadlock detection, eval harness. 6-10 weeks.
  4. Training. Workshop for devs and domain teams on Claude Agent SDK + .claude/ workflow + MCP integration.

Reach out via the contact form on the site.

A living document; the multi-agent ecosystem shifts every quarter, so this guide is revised every quarter.
