Subagents

The subagent and multi-agent orchestration patterns referenced throughout Emergence › Subunits draw on the 2024-2026 multi-agent research lineage. This page catalogs the canonical systems, their key innovations, ideal workloads, failure modes, and selection criteria the platform uses when configuring a Subunit composition for a given Unitt.

Reference Systems

Claude Code Subagents

Claude Code subagents are Markdown files in .claude/agents/*.md with YAML frontmatter (name, description, tools, optional model). Each runs in an isolated context window with its own system prompt and a declarative tool allowlist; parent-to-child communication is a single prompt string, child-to-parent is a summary. Innovation: file-based, version-controlled subagent definitions with least-privilege tool scoping. Best for: noisy, self-contained side tasks (repo exploration, log triage, doc review). Failure mode: loss of nuance across the prompt / summary boundary. Reference: Claude API subagents docs.
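As an illustration of the file layout described above, the following sketch renders a subagent definition as YAML frontmatter followed by a system prompt. The helper `render_subagent`, the `log-triage` agent, and its prompt are hypothetical; the comma-separated `tools` line is an assumption about the frontmatter format and should be checked against the subagents docs.

```python
from typing import Optional, Sequence


def render_subagent(
    name: str,
    description: str,
    tools: Sequence[str],
    system_prompt: str,
    model: Optional[str] = None,
) -> str:
    """Render a subagent definition: YAML frontmatter, then the system prompt.

    Values are written verbatim, so this sketch only suits simple strings
    that need no YAML quoting.
    """
    lines = [
        "---",
        f"name: {name}",
        f"description: {description}",
        # Assumption: the tool allowlist is a comma-separated list.
        f"tools: {', '.join(tools)}",
    ]
    if model is not None:
        # The model field is optional, so it is emitted only when set.
        lines.append(f"model: {model}")
    lines.append("---")
    return "\n".join(lines) + "\n\n" + system_prompt.strip() + "\n"


# Hypothetical example: a read-only log triage subagent (least privilege).
log_triage = render_subagent(
    name="log-triage",
    description="Scans build logs and reports the first failing step.",
    tools=["Read", "Grep", "Glob"],
    system_prompt="You triage CI logs. Report only the root-cause line and its file.",
)
```

Writing the returned text to `.claude/agents/log-triage.md` would put the definition under version control alongside the rest of the repository.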

Anthropic Multi-Agent Research System

The Anthropic multi-agent research system implements an orchestrator-worker pattern. A lead Opus agent decomposes a query and spawns 3-5 Sonnet subagents in parallel, each of which runs 3+ tool calls in parallel. The system outperformed single-agent Opus by 90.2% on Anthropic's internal research evaluation, at a cost of roughly 15× the tokens of a chat interaction; in Anthropic's BrowseComp analysis, token usage alone explained roughly 80% of performance variance. Innovation: explicit scaling rules embedded in prompts so the lead allocates effort in proportion to task complexity. Failure mode: over-spawning and duplicate work caused by vague subtask descriptions. See also: Building Effective Agents.
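The orchestrator-worker flow can be sketched in a few lines of `asyncio`. This is a minimal illustration, not Anthropic's implementation: `subagent_budget`, `decompose`, and `run_worker` are hypothetical stand-ins (a real system would call models and tools), and the complexity-to-count mapping is an assumed scaling rule.

```python
import asyncio
from typing import List


def subagent_budget(complexity: str) -> int:
    """Hypothetical scaling rule: map task complexity to a subagent count."""
    return {"simple": 1, "comparison": 3, "complex": 5}[complexity]


def decompose(query: str, n: int) -> List[str]:
    """Stand-in for the lead agent's decomposition step (no LLM call here)."""
    return [f"{query} :: subtask {i + 1} of {n}" for i in range(n)]


async def run_worker(subtask: str) -> str:
    """Stand-in for a worker subagent; returns a condensed summary string."""
    await asyncio.sleep(0)  # yield control, as an I/O-bound model call would
    return f"summary({subtask})"


async def orchestrate(query: str, complexity: str) -> List[str]:
    subtasks = decompose(query, subagent_budget(complexity))
    # Drop exact duplicates before spawning so the same work is not paid for twice.
    unique = list(dict.fromkeys(subtasks))
    # Workers run concurrently; results come back in subtask order.
    return await asyncio.gather(*(run_worker(s) for s in unique))
```

For example, `asyncio.run(orchestrate("compare vector databases", "comparison"))` spawns three stand-in workers and returns their three summaries. Capping the spawn count in code is one way to guard against the over-spawning failure mode.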

OpenAI Swarm / Agents SDK

OpenAI Swarm introduced two primitives: Agents (instructions + tools) and handoffs (tools that return another Agent, transferring control). The OpenAI Agents SDK is the production successor, adding guardrails, tracing, sessions, MCP integration, and TypeScript support. Innovation: handoff-as-tool, where control transfer is just a function the model can call and no central router is required. Best for: customer-service-style flows in which one agent triages and routes to specialists. Failure mode: with no supervisor, peers can fall into handoff loops and oscillate. See Orchestrating Agents.
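The handoff-as-tool idea can be shown without any SDK. The sketch below is plain Python, not the Swarm or Agents SDK API: the `Agent` dataclass, `pick_tool` (a stand-in for the model's tool choice), and the `max_handoffs` guard are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union


@dataclass
class Agent:
    name: str
    # Stand-in for the model choosing which tool to call for a message.
    pick_tool: Callable[[str], str]
    # A tool returns either a reply string or another Agent (a handoff).
    tools: Dict[str, Callable[[str], Union[str, "Agent"]]]


billing = Agent(
    name="billing",
    pick_tool=lambda msg: "answer",
    tools={"answer": lambda msg: f"billing handled: {msg}"},
)

triage = Agent(
    name="triage",
    pick_tool=lambda msg: "transfer_to_billing" if "refund" in msg else "answer",
    tools={
        "answer": lambda msg: f"triage handled: {msg}",
        # The handoff is an ordinary tool whose result is the next agent.
        "transfer_to_billing": lambda msg: billing,
    },
)


def run(agent: Agent, message: str, max_handoffs: int = 4) -> Tuple[str, List[str]]:
    """Follow handoffs until some agent replies; fail if the limit is exceeded."""
    path = [agent.name]
    for _ in range(max_handoffs + 1):
        result = agent.tools[agent.pick_tool(message)](message)
        if isinstance(result, Agent):
            agent = result  # control transfers to the returned agent
            path.append(agent.name)
            continue
        return result, path
    raise RuntimeError(f"handoff limit exceeded: {' -> '.join(path)}")
```

Because no supervisor watches the exchange, the `max_handoffs` bound is what stops two peers from bouncing a request back and forth indefinitely.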

LangGraph

LangGraph is graph-based orchestration where nodes are agents / functions and conditional edges route based on state. The supervisor pattern and hierarchical teams (subgraphs of supervisors) are well-documented. Routing logic lives in code (conditional edges), not LLM prompts. Innovation: explicit state graph with checkpointing, deterministic routing, and human-in-the-loop interrupts. Best for: production agents needing observable state machines, replay, and complex branching. Failure mode: hierarchical layers add latency and cost; avoid hierarchy until roughly six concurrent workers. Reference: Choosing the Right Multi-Agent Architecture.
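To make "routing logic lives in code" concrete, here is a plain-Python stand-in for a state graph with one conditional edge and per-step checkpoints. It deliberately does not use the LangGraph API; the node names, the `run_graph` helper, and the two-revision approval rule are hypothetical.

```python
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
END = "__end__"


def research(state: State) -> State:
    # Each pass through this node counts as one revision.
    return {**state, "notes": "draft notes", "revisions": state.get("revisions", 0) + 1}


def review(state: State) -> State:
    # Hypothetical rule: approve once the draft has been revised twice.
    return {**state, "approved": state["revisions"] >= 2}


def route_after_review(state: State) -> str:
    # Conditional edge: deterministic code, not an LLM prompt, picks the next node.
    return END if state["approved"] else "research"


NODES: Dict[str, Callable[[State], State]] = {"research": research, "review": review}
EDGES: Dict[str, Callable[[State], str]] = {
    "research": lambda state: "review",
    "review": route_after_review,
}


def run_graph(entry: str, state: State, max_steps: int = 10) -> List[Dict[str, Any]]:
    """Run the graph, recording a checkpoint (node plus state copy) per step."""
    checkpoints: List[Dict[str, Any]] = []
    node = entry
    for _ in range(max_steps):
        if node == END:
            break
        state = NODES[node](state)
        checkpoints.append({"node": node, "state": dict(state)})
        node = EDGES[node](state)
    return checkpoints
```

The checkpoint list is what makes replay and inspection possible: each entry records which node ran and the state it produced, so a run can be audited or resumed from any step.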

CrewAI

CrewAI declares agents with role / goal / backstory plus tools, grouped into a Crew. It offers two process types: Process.sequential (linear pipeline) and Process.hierarchical (manager LLM delegates and validates). A consensual mode that adds voting has been described as planned; confirm its availability in the current CrewAI docs before relying on it. Innovation: persona-first ergonomics, in which agents resemble job descriptions, lowering authoring friction. Best for: linear content / research pipelines and team-simulation use cases. Failure mode: backstory / role prompts encourage roleplay drift, and crews are harder to test deterministically than graph-based systems.

AutoGen / AG2

AG2 (formerly AutoGen) provides ConversableAgent, GroupChat with GroupChatManager for more than two agents over a shared transcript with speaker-selection logic, and nested chats that package a sub-conversation behind a single agent. AG2 v0.9 unified these into one Group Chat architecture. Innovation: conversation as a first-class abstraction, with collaboration emerging through message-passing rather than fixed graphs. Best for: coding / reasoning tasks with iterative critic-and-executor loops. Failure mode: speaker-selection drift and runaway transcripts. Bound cost with explicit max_round caps.
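The effect of a round cap on a shared-transcript chat can be sketched without the AG2 library. Everything below is a plain-Python stand-in: `run_group_chat`, `round_robin`, and the scripted `coder` / `critic` agents are hypothetical, and only the `max_round` name is borrowed from the text above.

```python
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (speaker, content)
AgentFn = Callable[[List[Message]], str]


def coder(transcript: List[Message]) -> str:
    # Scripted stand-in: each turn produces the next patch version.
    attempt = sum(1 for speaker, _ in transcript if speaker == "coder") + 1
    return f"patch v{attempt}"


def critic(transcript: List[Message]) -> str:
    # Scripted stand-in: approves only the second patch.
    last = transcript[-1][1]
    return "APPROVE" if last == "patch v2" else f"revise {last}"


def round_robin(transcript: List[Message], names: List[str]) -> str:
    """Simplest speaker selection: cycle through agents in declaration order."""
    return names[(len(transcript) - 1) % len(names)]


def run_group_chat(
    agents: Dict[str, AgentFn],
    task: str,
    max_round: int,
    select_speaker: Callable[[List[Message], List[str]], str] = round_robin,
) -> List[Message]:
    transcript: List[Message] = [("user", task)]
    for _ in range(max_round):  # hard cap on turns bounds transcript growth and cost
        speaker = select_speaker(transcript, list(agents))
        reply = agents[speaker](transcript)
        transcript.append((speaker, reply))
        if reply == "APPROVE":
            break
    return transcript
```

With `max_round=6` the scripted pair converges after four turns; with `max_round=2` the chat stops at the critic's first revision request, which shows how the cap trades completeness for a bounded cost.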

MetaGPT

MetaGPT (ICLR 2024) encodes a software-org SOP: Product Manager → Architect → Project Manager → Engineer → QA Engineer. Agents exchange structured artifacts (PRDs, file lists, interface definitions) rather than free text, which sharply raises code-gen success on benchmarks. Innovation: structured intermediate outputs as the inter-agent contract, eliminating much hallucination drift. Best for: greenfield software generation from a one-line spec, and any pipeline that benefits from a rigid SOP. Failure mode: rigidity, since non-software domains and ambiguous specs map poorly to the fixed role chain. Source code: FoundationAgents/MetaGPT.
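The "structured artifacts as contract" idea can be illustrated with typed records passed between roles. This is a simplified sketch, not MetaGPT's code: the `PRD` and `Design` fields, the three role functions, and the hard-coded file names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PRD:
    goal: str
    user_stories: List[str]


@dataclass(frozen=True)
class Design:
    file_list: List[str]
    interfaces: List[str]


def product_manager(spec: str) -> PRD:
    # Stand-in for an LLM call that expands a one-line spec into a PRD.
    return PRD(goal=spec, user_stories=[f"As a user, I can {spec.lower()}"])


def architect(prd: PRD) -> Design:
    # The contract is checked at the boundary instead of trusting free text.
    if not prd.user_stories:
        raise ValueError("PRD has no user stories; refusing to design")
    return Design(
        file_list=["main.py", "game.py"],
        interfaces=["class Game: def step(self) -> None"],
    )


def engineer(design: Design) -> Dict[str, str]:
    if not design.file_list:
        raise ValueError("design lists no files; nothing to implement")
    header = f"# implements: {', '.join(design.interfaces)}\n"
    return {path: header for path in design.file_list}
```

Calling `engineer(architect(product_manager("Play snake")))` yields one stub per file in the design. An empty or malformed artifact fails loudly at the role boundary instead of propagating downstream as plausible-looking text.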

AgentVerse / ChatDev

AgentVerse supports task-solving and social simulation with dynamic role recruitment; ChatDev (ACL 2024) uses a chat chain dividing software development into phased dialogues with "communicative dehallucination." Innovation: role-playing simulation as a research instrument for studying emergent multi-agent behavior. Best for: academic exploration, scenario simulation, design templates for phase-decomposed pipelines. Failure mode: emergent behavior is hard to constrain; production reliability lags supervisor / graph systems.

Coordination Patterns

Across the systems above, three coordination patterns recur. Each has a distinct cost, observability, and failure-mode profile.

| Pattern | Description | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Supervisor / Orchestrator-Worker | Central LLM decides who runs next. | Clean traces; easy to govern; parallelizes naturally. | Supervisor bottleneck and token sink. |
| Hierarchical Decomposition | Supervisors of supervisors with sub-teams. | Scales to many specialists across sub-domains. | Token cost and latency multiply per layer. |
| Peer Handoffs / Debate | Agents transfer control or critique each other. | Independent critique improves reasoning. | Handoff loops; majority conformity in debate. |

Failure Modes (MAST Taxonomy)

The MAST taxonomy analyzed 1,600+ traces across multi-agent systems and identified 14 failure modes clustering into three groups:

  • Specification Problems (41.8%): under-specified briefs, ambiguous task contracts.
  • Coordination Failures (36.9%): context loss across boundaries, loops, oscillation.
  • Verification Gaps (21.3%): silent retries, missing validation of subagent output.

Recent research (arXiv:2511.07784, arXiv:2509.11035) further shows that agents in debate compositions tend to conform to the majority rather than reason independently; gains depend on minority agents being willing to push back. See also: Multi-Agent Collaboration Mechanisms: A Survey.

Selection Criteria

The platform selects a subagent composition per Unitt by reading the workload profile (coordination style, isolation needs, parallelism, cost budget, debug-ability, customization required) and matching against the table below.

| System / Pattern | Coordination | Isolation | Parallelism | Cost | Debug-ability | Customization |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Code subagents | Supervisor (parent delegates) | Strong (separate ctx) | Yes (parent spawns) | Low (Haiku-able) | Moderate (summaries only) | High (MD + tool allowlist) |
| Anthropic research system | Orchestrator-worker | Strong | High (3-5 wide) | Very High (~15×) | Hard (parallel traces) | Medium (prompt-tuned) |
| OpenAI Agents SDK | Peer handoffs | Per-agent | Limited | Medium | Good (built-in tracing) | High (Python / TS) |
| LangGraph | Supervisor / graph | Per-node state | Yes (parallel nodes) | Tunable | Strong (state, replay) | Very High (code-defined) |
| CrewAI | Sequential / hierarchical | Per-agent | Limited | Medium | Moderate | High (role personas) |
| AutoGen / AG2 | Group chat | Shared transcript | Limited | High (chatty) | Hard (free-form msgs) | Very High |
| MetaGPT | Fixed SOP pipeline | Per-role, structured artifacts | Sequential | Medium | Good (structured outputs) | Low (role chain rigid) |
| ChatDev / AgentVerse | Chat-chain / simulation | Per-role | Phase-parallel | High | Moderate | High (research-grade) |
| Debate / consensus | Peer | Shared | Round-parallel | High | Hard | Medium |

Picking Heuristic

  • Single Unitt for short, single-domain workloads.
  • Claude Code subagents for context isolation on noisy side-tasks.
  • Supervisor / orchestrator as the default for most enterprise workloads.
  • LangGraph when explicit state machines, replay, and human-in-the-loop are required.
  • CrewAI for linear content pipelines where persona ergonomics dominate.
  • AG2 for conversational, critique-heavy reasoning tasks with strict round caps.
  • MetaGPT for greenfield software generation with rigid SOPs.
  • Hierarchical when there are more than roughly six concurrent specialists.
  • Peer debate only when independent critique demonstrably improves outcomes against the workload's success oracle.
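The heuristic above could be encoded as a small rule function. This is a sketch under stated assumptions, not the platform's actual selector: the `WorkloadProfile` fields are hypothetical names for the workload traits, and the order in which rules are checked is an assumption, since the list does not say which rule wins when several apply.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    # Hypothetical field names; each mirrors one bullet of the heuristic.
    short_single_domain: bool = False
    noisy_side_tasks: bool = False
    concurrent_specialists: int = 1
    needs_state_machine_replay_hitl: bool = False
    greenfield_software_sop: bool = False
    linear_content_pipeline: bool = False
    debate_improves_oracle_score: bool = False
    critique_heavy_reasoning: bool = False


def pick_composition(p: WorkloadProfile) -> str:
    """Return the first matching composition; the rule order is an assumption."""
    if p.short_single_domain:
        return "Single Unitt"
    if p.noisy_side_tasks:
        return "Claude Code subagents"
    if p.concurrent_specialists > 6:  # "more than roughly six" treated as > 6
        return "Hierarchical"
    if p.needs_state_machine_replay_hitl:
        return "LangGraph"
    if p.greenfield_software_sop:
        return "MetaGPT"
    if p.linear_content_pipeline:
        return "CrewAI"
    if p.debate_improves_oracle_score:
        return "Peer debate"
    if p.critique_heavy_reasoning:
        return "AG2 (with strict round caps)"
    # Default for most enterprise workloads.
    return "Supervisor / orchestrator"
```

For instance, `pick_composition(WorkloadProfile(concurrent_specialists=8))` returns `"Hierarchical"`, while an empty profile falls through to the supervisor default. Keeping the rules as ordered code makes the precedence explicit and easy to test.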

Cross-References