Anthropic's Multi-Agent Architecture: How the Orchestrator-Worker

1. Why a Single Agent is Not Enough

A single LLM agent with a strong model, tools, and a good system prompt can handle a lot. But on deep research, multi-document analysis, and multi-stakeholder reporting, single agents hit three bottlenecks:

Context dilution. As the task grows, files, observations, and intermediate outputs accumulate in the context window. The "lost in the middle" effect (Liu et al., 2023) then makes the model under-use information buried deep in that context.
Sequential bottleneck. A single agent cannot work in parallel; it analyzes 10 documents one by one, so ten one-hour analyses take 10 hours.
Divided attention. The same agent both strategizes and grinds through details, doing each at medium quality.

Anthropic addressed these with the multi-agent orchestrator-worker pattern, documented in "How we built our multi-agent research system" (June 2025). Result: a Claude Opus 4 lead with Claude Sonnet 4 subagents scored 90.2% higher than single-agent Claude Opus 4 on Anthropic's internal research eval.

Definition

Multi-Agent Orchestrator-Worker Pattern: An agentic AI architecture where a Lead/Orchestrator agent plans the task and dispatches 3-5 parallel Worker subagents. Each subagent runs in isolated context; results return to the Lead as structured artifacts. Documented by Anthropic Engineering and reference-implemented in the Claude Agent SDK.
Also known as: Multi-Agent System, MAS, Orchestrator-Worker. Wikidata: Q1064782

Multi-Agent Timeline: From ReAct to Anthropic

Four waves between 2022 and 2025:

2022 — ReAct. Single-agent think/act/observe loop. Tool use standardized.
2023 — Reflexion, Tree of Thoughts, Generative Agents. Self-reflection, multi-path reasoning, agent simulation.
2024 — AutoGen, CrewAI, LangGraph. First multi-agent Python frameworks. Experimental.
2025 — Anthropic Multi-Agent Research + Claude Agent SDK. First production-grade, benchmark-backed evidence. Pattern matured.

Anthropic's June 2025 publication mattered because it was a production report, not academic research: real users, real workloads, real cost/quality trade-offs.

What Multi-Agent is Not

Multi-step ≠ multi-agent.
LLM chaining ≠ multi-agent.
Tool use ≠ subagent.
Mixture of Experts ≠ multi-agent.

Clarifying these distinctions is critical for identifying the cases where multi-agent truly adds value.

2. Architecture Anatomy: Lead + Workers

The pattern has five components.

2.1. Lead Orchestrator Agent

Model: Best-in-class (Anthropic's published system used Claude Opus 4; this guide assumes Opus 4.7).
Role: Understand task, decompose into subtasks, dispatch subagents in parallel, merge results, present to user.
Context: High-level plan + subagent output summaries. No details.

2.2. Worker Subagents

Model: Fast and cheap (Sonnet 4.6 or Haiku 4.5).
Role: Execute the Lead's subtask in clean context.
Context isolation: Each subagent has its own context; cannot see other subagents.

2.3. Tools

Subagents have access to web search, code execution, file read, MCP tools.
The Lead typically only has a "spawn subagent" tool, not direct tools.

2.4. Structured Artifact Handoffs

Subagent output is not free text but a JSON-schema artifact.
Example: { key_finding: ..., sources: [...], confidence: 0.x }.
The Lead parses and merges artifacts.

2.5. Evaluator / Critic (optional)

A third subagent type: audits a worker's output for quality/accuracy.
The "Evaluator" part of Planner-Generator-Evaluator.

2.6. Subagent Lifecycle: Seven Phases

Spawn (Lead requests a new subagent instance).
Initialize (system prompt + tool catalog loaded).
Receive Task (parsed from JSON input).
Execute (ReAct loop).
Validate Output (schema validation).
Return Artifact (structured handoff).
Cleanup (release memory, files, network).

A solid orchestrator defines a fallback and retry policy per phase.
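The per-phase policy can be written down as plain data. The sketch below is one possible shape, not something taken from the Claude Agent SDK: the `PhasePolicy` type, the `nextAction` helper, and all retry counts are illustrative assumptions.

```typescript
// The seven lifecycle phases listed above, in order.
const PHASES = [
  "spawn", "initialize", "receive_task", "execute",
  "validate_output", "return_artifact", "cleanup",
] as const;
type Phase = (typeof PHASES)[number];

// Hypothetical policy shape: retries allowed, and the fallback once they run out.
type Fallback = "abort" | "skip" | "low_confidence";
type PhasePolicy = { maxRetries: number; onExhausted: Fallback };

// Illustrative values only; tune them per workload.
const POLICY: Record<Phase, PhasePolicy> = {
  spawn: { maxRetries: 2, onExhausted: "abort" },
  initialize: { maxRetries: 1, onExhausted: "abort" },
  receive_task: { maxRetries: 0, onExhausted: "abort" },
  execute: { maxRetries: 2, onExhausted: "low_confidence" },
  validate_output: { maxRetries: 1, onExhausted: "low_confidence" },
  return_artifact: { maxRetries: 2, onExhausted: "skip" },
  cleanup: { maxRetries: 3, onExhausted: "skip" },
};

// Decide what to do after `failures` consecutive failures in a phase.
function nextAction(phase: Phase, failures: number): "retry" | Fallback {
  const policy = POLICY[phase];
  return failures <= policy.maxRetries ? "retry" : policy.onExhausted;
}

console.log(nextAction("execute", 1)); // retry
console.log(nextAction("execute", 3)); // low_confidence
```

Keeping the policy in a table rather than in scattered try/catch blocks makes it reviewable alongside the subagent definitions.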

2.7. Artifact Schema Principles

Stable top-level fields across all artifacts.
Type-safe with Pydantic/Zod.
Confidence score required (0-1).
Source attribution per finding.
Open questions flagged when uncertain.
Hash signature for tamper detection.
Timestamp for cache invalidation.

Anti-pattern: each subagent uses a different schema, forcing the orchestrator into ad-hoc string parsing.
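As a dependency-free illustration of these principles, the sketch below validates an artifact by hand. In practice Zod or Pydantic would do this job; the `created_at` field name and the `validateArtifact` helper are assumptions made for the example, and the hash signature is left out.

```typescript
// Hypothetical artifact shape. Fields other than created_at follow the
// example artifact used elsewhere in this article.
interface Source { url: string; title: string; date: string }
interface Artifact {
  subtask_id: string;
  key_findings: string[];
  sources: Source[];
  confidence: number;       // required, between 0 and 1
  open_questions: string[]; // flagged when the subagent is uncertain
  created_at: string;       // ISO timestamp, usable for cache invalidation
}

// Returns a list of problems; an empty list means the artifact is acceptable.
function validateArtifact(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) return ["artifact must be an object"];
  const { subtask_id, key_findings, sources, confidence, open_questions, created_at } =
    raw as Record<string, unknown>;
  const errors: string[] = [];
  if (typeof subtask_id !== "string" || subtask_id === "") errors.push("subtask_id is required");
  if (!Array.isArray(key_findings) || !key_findings.every((f) => typeof f === "string"))
    errors.push("key_findings must be a string array");
  if (!Array.isArray(sources) || sources.length === 0)
    errors.push("at least one source is required");
  if (typeof confidence !== "number" || !(confidence >= 0 && confidence <= 1))
    errors.push("confidence must be a number in [0, 1]");
  if (!Array.isArray(open_questions)) errors.push("open_questions must be an array");
  if (typeof created_at !== "string" || Number.isNaN(Date.parse(created_at)))
    errors.push("created_at must be a parseable timestamp");
  return errors;
}

const sample: Artifact = {
  subtask_id: "t1",
  key_findings: ["finding"],
  sources: [{ url: "https://example.com", title: "Example", date: "2025-01-01" }],
  confidence: 0.8,
  open_questions: [],
  created_at: "2025-01-01T00:00:00Z",
};
console.log(validateArtifact(sample)); // []
```

Returning a list of errors rather than throwing lets the orchestrator pass the problems back to the subagent on a retry.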

3. Planner-Generator-Evaluator Pattern

The multi-agent architecture is enriched by a sub-pattern: Planner-Generator-Evaluator.

Multi-Agent Roles and Responsibilities
Role	Responsibility	Typical Model	Context Size
Planner	Decompose task into subtasks	Opus 4.7	High (200K-1M)
Generator (Worker)	Execute subtask	Sonnet 4.6 / Haiku 4.5	Medium (own context)
Evaluator / Critic	Audit output	Sonnet 4.6	Medium
Writer	Compose final	Opus 4.7	High
Reviewer	QA the final	Sonnet 4.6	Medium

Flow

Planner (Lead, Opus 4.7): reads query, produces n subtasks.
n parallel Generators/Workers: each executes a subtask, returns a structured artifact.
Evaluator/Critic: scores artifacts; signals a retry for low-quality ones.
Writer (Lead): merges high-quality artifacts and writes the answer.
Reviewer (optional): QA's the final answer before delivery.

Pattern Variations

Hierarchical PGE (Lead → Sub-lead → Worker) for very large tasks.
PGE with Self-Reflection (Generator double-checks itself).
Adversarial PGE (one Evaluator red-teams another) for high-stakes cases.
Iterative PGE (low-confidence triggers loop, 2-3 iterations).
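The iterative variation can be sketched as a small loop. This is a minimal illustration, assuming the worker and critic calls are passed in as the hypothetical `generate` and `evaluate` functions; the 0.7 threshold and 3-iteration budget are example values.

```typescript
type Scored<T> = { artifact: T; score: number };
type PgeResult<T> = Scored<T> & { iterations: number; accepted: boolean };

// Iterative PGE: regenerate until the evaluator's score clears the threshold
// or the iteration budget is spent. generate/evaluate stand in for the
// worker and critic calls.
async function iterativePge<T>(
  generate: (round: number) => Promise<T>,
  evaluate: (artifact: T) => Promise<number>,
  threshold = 0.7,
  maxIterations = 3
): Promise<PgeResult<T>> {
  if (maxIterations < 1) throw new Error("maxIterations must be at least 1");
  let best: Scored<T> | undefined;
  for (let i = 1; i <= maxIterations; i++) {
    const artifact = await generate(i);
    const score = await evaluate(artifact);
    if (best === undefined || score > best.score) best = { artifact, score };
    if (score >= threshold) return { ...best, iterations: i, accepted: true };
  }
  // Budget spent: return the best attempt so the orchestrator can flag it.
  return { ...(best as Scored<T>), iterations: maxIterations, accepted: false };
}
```

Returning the best rejected attempt with `accepted: false`, instead of throwing, leaves the final decision with the orchestrator.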

When PGE is Not the Right Pattern

Very short tasks solvable in one prompt.
Deterministic audit-required pipelines.
Tight token budgets.

4. Practical Implementation: Claude Agent SDK

The Claude Agent SDK (2025) is the reference implementation. Folder layout:

my-project/
├── .claude/
│   ├── settings.json
│   ├── agents/
│   │   ├── researcher.md
│   │   ├── critic.md
│   │   ├── writer.md
│   │   └── reviewer.md
│   └── mcp.json
└── src/

Subagent Definition (.claude/agents/researcher.md)

---
name: researcher
description: |
  Use for deep research tasks. Given a question and authorized sources,
  returns a structured artifact with findings, citations, and confidence.
tools: WebSearch, WebFetch, Read
model: claude-sonnet-4-6
---

You are a research subagent. For each subtask:

1. Search authoritative sources.
2. Extract key findings with direct quotes.
3. Cite every claim with source URL + date.
4. Return JSON artifact:

~~~json
{
  "subtask_id": "<id>",
  "key_findings": ["..."],
  "sources": [{"url": "...", "title": "...", "date": "..."}],
  "confidence": 0.0,
  "open_questions": ["..."]
}
~~~

`confidence` is a number between 0.0 and 1.0.

Lead Orchestrator (application code)

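~~~typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

type Subtask = { id: string; question: string };
type RunOptions = { model: string; systemPrompt: string };

// query() streams messages; return the text of the final "result" message.
// Option and field names should be checked against the installed SDK version.
async function run(prompt: string, options: RunOptions): Promise<string> {
  for await (const message of query({ prompt, options })) {
    if (message.type === "result") {
      if (message.subtype === "success") return message.result;
      throw new Error(`Agent run failed: ${message.subtype}`);
    }
  }
  throw new Error("Agent run ended without a result message");
}

// Each role runs as its own query() call with an inline role prompt that
// mirrors the body of the matching .claude/agents/*.md file.
const RESEARCHER: RunOptions = { model: "claude-sonnet-4-6", systemPrompt: "You are a research subagent..." };
const CRITIC: RunOptions = { model: "claude-sonnet-4-6", systemPrompt: "You are a critic. Score the artifact..." };

async function multiAgentResearch(question: string): Promise<string> {
  // 1. Plan: the lead decomposes the question into subtasks.
  const plan = await run(
    'Decompose into 3-5 parallel subtasks. Reply only with JSON {"subtasks": [...]}: ' + question,
    { model: "claude-opus-4-7", systemPrompt: "You are a research planner..." }
  );
  const subtasks: Subtask[] = JSON.parse(plan).subtasks;

  // 2. Generate: one isolated worker run per subtask, in parallel.
  const workerResults = await Promise.all(
    subtasks.map((task) => run(JSON.stringify(task), RESEARCHER))
  );

  // 3. Evaluate: the critic scores each artifact.
  const evaluations = await Promise.all(workerResults.map((r) => run(r, CRITIC)));

  // 4. Write: the lead synthesizes the scored artifacts into the final answer.
  return run(JSON.stringify({ subtasks, results: workerResults, evals: evaluations }), {
    model: "claude-opus-4-7",
    systemPrompt: "You are a research writer. Synthesize the artifacts...",
  });
}
~~~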

Config Details

Cost cap: Max tokens + time per subagent.
Concurrency: Max parallel subagents (4-6 most common).
Retry policy: Retry up to 2x on failure, then hand the case to the critic.
Telemetry: Latency, tokens, model, success per subagent.
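The concurrency and retry settings above can be enforced with a small pool in application code. This is a sketch under assumptions: `runPool` and `Outcome` are hypothetical names, not SDK features, and the defaults (4 in flight, 2 retries) simply echo the values in the list.

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Run tasks with at most `limit` in flight; each task gets one initial
// attempt plus up to `retries` retries. Results keep the input order.
async function runPool<T>(
  tasks: Array<() => Promise<T>>,
  limit = 4,
  retries = 2
): Promise<Outcome<T>[]> {
  const results: Outcome<T>[] = new Array(tasks.length);
  let next = 0;

  // Each lane pulls the next unclaimed task until none are left.
  async function lane(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      for (let attempt = 0; ; attempt++) {
        try {
          results[index] = { ok: true, value: await tasks[index]() };
          break;
        } catch (error) {
          if (attempt >= retries) {
            results[index] = { ok: false, error };
            break;
          }
        }
      }
    }
  }

  const lanes = Array.from({ length: Math.min(limit, tasks.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}
```

Failures are returned as values rather than thrown, so one dead subagent does not cancel its siblings; the orchestrator can then apply whatever failure policy it prefers.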

Markdown Frontmatter for Subagents

Anthropic's .claude/agents/*.md format ships config in frontmatter and prompt in the body. The file lives in git, gets code-reviewed, and can be shared across teams.

MCP Integration

A multi-agent system pairs naturally with MCP: orchestrator gets spawn/store meta-tools; workers get domain MCP tools (web_search, github, sql, vector_db); critics get read/score tools. .claude/mcp.json controls which MCPs are visible to which subagent.

State Management

State lives in three tiers: per-subagent (transient), orchestrator (combined), persistent (Redis/Postgres/object storage for long tasks).

5. Why 90.2%? Performance Analysis

Four factors behind the gap:

Context Isolation

Each subagent works in clean context — no lost-in-the-middle. Instead of cramming 200 pages into one context, you give 5 subagents 40 pages each.

Parallelism

With 5 parallel subagents, the worker phase can approach ~1/5 of the sequential time; planning and synthesis remain sequential, so the end-to-end gain is smaller. Anthropic showed the parallel system dominates single-agent on the latency-quality tradeoff.
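A back-of-the-envelope model makes the latency claim concrete. The timings below are made-up illustrative numbers, not measurements, and the two helper functions are hypothetical.

```typescript
// Single agent: subtasks run one after another.
function sequentialLatency(workerSeconds: number[]): number {
  return workerSeconds.reduce((sum, s) => sum + s, 0);
}

// Orchestrated run: planning and writing are sequential, workers overlap,
// so the slowest worker sets the length of the parallel phase.
function orchestratedLatency(planSeconds: number, workerSeconds: number[], writeSeconds: number): number {
  return planSeconds + Math.max(0, ...workerSeconds) + writeSeconds;
}

const workers = [60, 55, 70, 65, 50]; // illustrative per-subtask timings in seconds
const single = sequentialLatency(workers);          // 300
const multi = orchestratedLatency(20, workers, 30); // 20 + 70 + 30 = 120
console.log(single / multi); // 2.5
```

With these numbers the speedup is 2.5x rather than 5x: the slowest worker plus the sequential plan and write steps bound the gain.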

Model Optimization

Lead with Opus, workers with Sonnet/Haiku — right model in the right place. Strategic thinking on premium model, grunt work on cheap model.

Specialization

A "researcher" agent prompted explicitly for its role beats a generalist single agent that does everything at average quality.

Where the Gap Widens

The 90.2% gap is not universal. It widens on:

Multi-document synthesis (20+ documents, conflicting findings).
Multi-stakeholder reporting (legal + financial + ops in one report).
Deep web research (50+ sources, citation aggregation).
Parallel hypothesis testing.

It is close to zero on single-file code review, short Q&A, summaries, classic RAG retrieval, and code completion.

Benchmark Caveat

90.2% is Anthropic's internal eval on their task set. Independent benchmarks (AgentBench, GAIA, SWE-bench) show gaps in the 15-40% range — still meaningful, but architecture is not the only lever; prompt engineering and role design also matter.

6. Comparison with Other Multi-Agent Frameworks

2026 Multi-Agent Framework Comparison
Framework	Type	Model Agnostic	Production-Ready	Community
Claude Agent SDK	Orchestrator-Worker	Claude only	Yes	Medium-High
LangGraph	Graph-based	Multi-provider	Yes	High
CrewAI	Role-based	Multi-provider	Yes	High
AutoGen (Microsoft)	Conversation	Multi-provider	Yes	High
OpenAI Swarm	Lightweight handoff	OpenAI	Experimental	Medium
Atomic Agents	Minimal	Multi-provider	New	Low

Which Framework, Which Use Case?

Claude Agent SDK: Claude-based stack, .claude/ workflow, deep MCP integration.
LangGraph: Complex state machines and loops; agentic graphs.
CrewAI: Fast POC + role-based design; Python ecosystem.
AutoGen: Agent-to-agent conversation + human-in-the-loop.

Hybrid Stacks

In practice teams combine frameworks: Claude Agent SDK as lead + LangGraph workers; CrewAI for POC then Claude Agent SDK for production; AutoGen for human-in-the-loop + LangGraph for the deterministic part. The trade-off is reduced lock-in vs added complexity.

"Don't Build Multi-Agents" Counterpoint

In June 2025 Cognition AI (Devin) published "Don't Build Multi-Agents", arguing that multi-agent stacks compound latency and error surface, and that single-agent + careful context can match the output. Valid for interactive coding; Anthropic's reported cases are deep research, a different profile. The honest answer: it depends on the use case.

7. Turkish Angle: KVKK, BDDK, and Multi-Stakeholder Work

Three scenarios where multi-agent shines for Turkish companies.

Compliance Automation

KVKK breach review: planner reads complaint, three workers run in parallel — (1) pull VERBİS record, (2) search precedent KVKK rulings, (3) compare against internal policy. Evaluator scores, writer produces report. 4-8 human-hours → 12-18 minutes.

Multi-Document Analysis

Law firms, audit firms, M&A consultancy: simultaneous analysis of 50-200 documents. Single-agent insufficient; multi-agent is the natural choice.

Research + Report Generation

Strategy consulting, sector reports: parallel scanning of multiple sources, merged structured findings, executive summary.

KVKK Considerations

PII redaction pre-orchestrator; subagents bound to EU/TR-hosted endpoints; artifacts written to audit logs.
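A pre-orchestrator redaction step might look like the sketch below. It is a regex-only illustration under stated assumptions: the patterns cover email addresses, TR-format IBANs, Turkish mobile numbers, and 11-digit national ID numbers (TCKN), and they will miss names, addresses, and free-text identifiers, which need a proper PII detection model. The `PII_PATTERNS` table and `redactPii` helper are hypothetical names.

```typescript
// Order matters: longer, more specific patterns run before the bare 11-digit TCKN rule.
const PII_PATTERNS: Array<{ label: string; pattern: RegExp }> = [
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  // TR IBAN: "TR" + 24 digits, optionally grouped in blocks of four.
  { label: "IBAN", pattern: /\bTR\d{2}(?: ?\d{4}){5} ?\d{2}\b/g },
  // Turkish mobile: +90 or 0 prefix, then 5xx xxx xx xx with optional spaces.
  { label: "PHONE", pattern: /(?:\+90|0)\s?5\d{2}\s?\d{3}\s?\d{2}\s?\d{2}/g },
  // TCKN: 11 digits, first digit non-zero (checksum not verified here).
  { label: "TCKN", pattern: /\b[1-9]\d{10}\b/g },
];

// Replace every match with a bracketed label before the text reaches any agent.
function redactPii(text: string): string {
  return PII_PATTERNS.reduce(
    (out, { label, pattern }) => out.replace(pattern, `[${label}]`),
    text
  );
}

console.log(redactPii("Contact ayse@example.com, TCKN 12345678901"));
// Contact [EMAIL], TCKN [TCKN]
```

Running redaction before the Lead sees the input means no subagent context, artifact, or audit log ever contains the raw identifiers.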

Use Case Map for Turkish Sectors

Banking/Insurance: M&A contract DD, credit risk evaluation (parallel KYC/AML/financials), KVKK breach automation, fraud triage.
Legal: Contract DD, case-law search + synthesis, regulation impact analysis.
Healthcare: Multi-specialist case discussion (cardiologist + endocrinologist subagents), clinical literature triage (reviewer mandatory).
E-commerce/Marketing: Competitor catalog analysis, customer segmentation research.
Manufacturing/Logistics: Supply-chain risk, supplier due diligence, ops dashboards.

High-stakes sectors (healthcare, legal) require iterative + adversarial PGE.

Turkish Subagent Design

For Turkish customers: (1) system prompt in Turkish, (2) Turkish-first sources (mevzuat, içtihat — Said Surucu MCPs), (3) Turkish citation format, (4) embed KVKK + BDDK awareness in system prompts. This yields a Turkey-first multi-agent architecture rather than "global + translated."

8. Case Study (Anonymized): Turkish Law Firm Contract Analysis

Problem

An Istanbul-based corporate law firm must analyze every contract (90-300) at a target during M&A due diligence. Typical M&A: 12-18 lawyers × 3-4 weeks × 60 hours/week ≈ 2,500-4,000 hours of manual work.

Solution

Multi-agent orchestrator-worker pattern:

Lead (Opus 4.7): categorizes contracts (NDA, services, license, lease, employment, financial); spawns a pipeline per category.
6 parallel workers (Sonnet 4.6): one per category. Risk clauses, change-of-control, indemnity caps, KVKK compliance, error clauses.
Evaluator (Sonnet 4.6): flags conflicting findings and low-confidence artifacts.
Writer (Opus 4.7): drafts executive due diligence report.
Reviewer (senior lawyer): human QA.

KVKK: PII redaction pre-orchestrator; Anthropic EU endpoint; audit log per artifact.

Result

Time: 2,500 hours → 65 human-hours + ~80 AI-processing hours (145 total). ~17x speedup.
Risk clauses detected: 23% higher (AI caught issues humans missed).
Subjective quality: partners said "the report is now more consistent" (human teams were fatigue-inconsistent).
Cost: ~$8,500 LLM cost per M&A vs ~$2.4M saved in human hours.

Case 2 (Anonymized) — Turkish Bank: Credit Application Triage

Problem. 8,000-12,000 corporate credit applications/day. Per application: KYC + AML + financial statements + precedent (blacklist, restructuring history) + customer profile. Manual review: 35-50 minutes per analyst.

Solution. Multi-agent triage with five parallel subagents (KYC/AML, financial statements, precedent, sector/macro risk, customer profile + cross-sell), one evaluator, a writer producing the credit memo, and a senior credit analyst reviewer. KVKK/BDDK: PII redaction, EU/TR endpoints, on-prem MCP gateway, audit logs.

Result. Review time 35 min → 8 min human + 6 min AI; risk-score stdev cut from 15 to 6; high-risk catch rate up 18%; ops capacity 1.4x without new hires.

Case 3 (Anonymized) — Turkish Strategy Consultancy: Sector Reports

Problem. A typical sector report: 4 weeks × 3 analysts ≈ 480 research-hours + 80 writing-hours.

Solution. Lead spawns 6 researcher subagents (one per section: market sizing, players, regulation, trends, risks, opportunities), one data subagent pulling TÜİK/KAP/sector association data, a critic, a writer, and a senior partner reviewer.

Result. Report cycle 4 weeks → 6 days; analyst can run 3x more parallel projects; source diversity 40% higher than human-only; client NPS 8 → 9.4.

9. Risks, Costs, and Operational Concerns

Token Cost

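Multi-agent spends far more tokens: Anthropic reports roughly 4x the tokens of a chat interaction for a single agent and roughly 15x for a multi-agent system. The decision criterion: "Does the per-subagent output's value exceed the token cost?" Easily yes for deep research, easily no for casual chat.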

Deadlock and Infinite Loops

If subagents can spawn subagents (recursive), infinite loops are possible. Mitigations: call-depth limit (max 3), per-task timeout, global cost cap.
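The three mitigations can be combined in one guard object that is passed down the spawn tree. This is a minimal sketch: `SpawnGuard` and `withTimeout` are hypothetical helpers, and the default limits (depth 3, 500,000 tokens) are example values.

```typescript
// Shared guard for call depth and global token spend.
class SpawnGuard {
  private spentTokens = 0;
  private readonly maxDepth: number;
  private readonly maxTokens: number;

  constructor(maxDepth = 3, maxTokens = 500_000) {
    this.maxDepth = maxDepth;
    this.maxTokens = maxTokens;
  }

  // Call before spawning a subagent at `depth` (the lead is depth 0).
  checkSpawn(depth: number): void {
    if (depth > this.maxDepth) throw new Error(`depth ${depth} exceeds limit ${this.maxDepth}`);
    if (this.spentTokens >= this.maxTokens) throw new Error("global token cap reached");
  }

  // Call after each subagent returns, with its reported token usage.
  record(tokens: number): void {
    this.spentTokens += tokens;
  }

  get remaining(): number {
    return Math.max(0, this.maxTokens - this.spentTokens);
  }
}

// Per-task timeout: reject if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}
```

Note that `withTimeout` only stops waiting; cancelling the underlying subagent call needs whatever abort mechanism the client library provides.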

Error Handling

If a subagent fails, there are three options: (1) fail the whole task, (2) skip it and continue, or (3) retry up to 2x. The most robust approach: the critic marks the artifact as low-confidence and the orchestrator decides.
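Once retries are exhausted, the remaining choice between failing the task and skipping the subagent can be made explicit at merge time. The sketch below assumes subagent results arrive as a hypothetical `Outcome` union; `collect` and `FailurePolicy` are illustrative names.

```typescript
type Outcome<T> = { ok: true; value: T } | { ok: false; error: string };
type FailurePolicy = "fail_task" | "skip";

// Merge subagent outcomes under an explicit failure policy.
function collect<T>(outcomes: Outcome<T>[], policy: FailurePolicy): { values: T[]; skipped: number } {
  const values: T[] = [];
  let skipped = 0;
  for (const outcome of outcomes) {
    if (outcome.ok) {
      values.push(outcome.value);
    } else if (policy === "fail_task") {
      throw new Error(`subagent failed: ${outcome.error}`);
    } else {
      skipped++; // report the gap so the writer can mention missing coverage
    }
  }
  return { values, skipped };
}
```

Surfacing the `skipped` count keeps a partial result from being presented as a complete one.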

Observability

Track per subagent: latency, tokens in/out, model, success, output size, evaluator score. Tools: Langfuse, Arize Phoenix, Helicone, OpenTelemetry.

Non-Determinism

Subagent outputs are stochastic. The same task twice can yield different results — a challenge for deterministic pipelines (audit, financial). Mitigation: temperature=0, structured output, eval harness.

10. Frequently Asked Questions

10.5. Cost Optimization Strategies

Ten practical levers:

Use Haiku for workers — reserve Sonnet for medium-stake, Opus for lead/writer only.
Prompt caching — subagent system prompts repeat; with Anthropic prompt caching, cached input tokens are billed at roughly a tenth of the base input price.
Early termination — let the planner stop spawning if "enough info."
Cache subagent outputs — same task returns cached artifact.
Batch subagents — pair small subtasks into one subagent.
Tool-output truncation — summarize huge tool returns before feeding to subagent.
Context optimization — store summaries in Lead context, not raw outputs.
Streaming — start the writer as soon as the first artifact arrives.
Selective re-runs — re-run only the subagent flagged by the critic.
Provisioned throughput — committed capacity via Amazon Bedrock or Azure for high volume.

Together, these levers typically cut multi-agent cost 3-7x.
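Of these levers, caching subagent outputs is the simplest to sketch. The in-memory cache below is an illustration only: `ArtifactCache` is a hypothetical class, the key is a canonical JSON form of a flat task object, and a production system would more likely use Redis or a database.

```typescript
// In-memory artifact cache with a time-to-live, keyed by the task contents.
class ArtifactCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Sort top-level keys so {a, b} and {b, a} map to the same entry.
  private key(task: Record<string, unknown>): string {
    return JSON.stringify(Object.keys(task).sort().map((k) => [k, task[k]]));
  }

  get(task: Record<string, unknown>, now: number = Date.now()): T | undefined {
    const k = this.key(task);
    const entry = this.entries.get(k);
    if (entry === undefined) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(k); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(task: Record<string, unknown>, value: T, now: number = Date.now()): void {
    this.entries.set(this.key(task), { value, expiresAt: now + this.ttlMs });
  }
}
```

The orchestrator would call `get` before spawning a subagent and `set` after a validated artifact comes back; the `now` parameter exists only to make expiry testable.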

10.7. Multi-Agent Eval Harness

Intermediate metrics (subagent faithfulness/recall, critic accuracy, plan quality) plus final metrics (end-to-end accuracy, coherence, citation accuracy, latency, cost) are measured with Langfuse hierarchical traces, the Anthropic Console workflow viewer, or custom Grafana + OpenTelemetry GenAI dashboards. Without an eval harness, regressions go unseen.

10.8. Multi-Agent Anti-Patterns

Production failure modes:

Echo chamber — all subagents pull from the same source.
Hyper-granular decomposition — 12 subtasks for a small task.
No evaluator — quality unmeasured.
Free-form artifacts — schemaless strings → regex parsing → flake.
Shared mutable state — race conditions.
Synchronous waits — losing parallelism.
Unbounded recursion — runaway cost.
Long-lived Lead context — lost-in-the-middle returns.
No cost cap — blowups.
Skipping reviewer on high-stakes tasks — legal/reputational risk.

10.9. Production Readiness Checklist (16 Items)

Less than 16/16 green is a launch risk.

11. Next Steps

Practical roadmap to bring multi-agent to your enterprise:

POC evaluation. Review current single-agent workloads; pick 1-2 use-cases where multi-agent makes sense. 2-3 weeks.
Pattern design. Lead + worker roles, structured artifact schema, evaluator strategy, cost caps, KVKK layer. 4-6 weeks.
Production deploy + observability. Langfuse traces, retry policy, deadlock detection, eval harness. 6-10 weeks.
Training. Workshop for devs and domain teams on Claude Agent SDK + .claude/ workflow + MCP integration.

Reach out via the contact form on the site.

References

How we built our multi-agent research system — Anthropic Engineering, Anthropic · 2025-06
Building Effective Agents — Anthropic, Anthropic · 2024-12
Claude Agent SDK — Anthropic, Anthropic · 2025
Subagents and .claude/agents/ — Anthropic, Anthropic · 2025
Lost in the Middle: How Language Models Use Long Contexts — Liu et al., arXiv · 2023-07
AutoGen — Microsoft Multi-Agent Conversation — Microsoft, Microsoft · 2024
LangGraph — LangChain, LangChain · 2024
CrewAI — CrewAI, CrewAI · 2024
OpenAI Swarm — OpenAI, GitHub · 2024
ReAct: Synergizing Reasoning and Acting — Yao et al., arXiv · 2022
Reflexion — Shinn et al., arXiv · 2023
Tree of Thoughts — Yao et al., arXiv · 2023
Generative Agents — Park et al., Stanford / Google · 2023
Anthropic Console — Anthropic, Anthropic · 2025
Langfuse — Langfuse, Langfuse · 2025
Arize Phoenix — Arize, Arize · 2025
OpenTelemetry GenAI — CNCF, CNCF · 2025
Fountain City — Multi-Agent Production Patterns — Fountain City, Fountain City · 2025
The AI Engineer Substack: Multi-Agent Deep Dive — The AI Engineer, Substack · 2025
Cognition AI — Don't Build Multi-Agents (Counterpoint) — Cognition AI, Cognition · 2025
AgentBench — Liu et al., arXiv · 2023
GAIA — Mialon et al., arXiv · 2023
SWE-bench — SWE-bench, Princeton · 2024
Claude Code Documentation — Anthropic, Anthropic · 2025
Pydantic — Pydantic, Pydantic · 2025
Zod — Zod, Zod · 2025
KVKK - Law No. 6698 — Republic of Turkiye - KVKK, Republic of Turkiye · 2016-04-07
BDDK Cloud Services Regulation — BDDK, BDDK · 2023
Helicone — Helicone, Helicone · 2025
EU AI Act — European Commission, EU · 2024-03

A living document; the multi-agent ecosystem shifts every quarter, so this guide is revised every quarter.
