Fabric Setup

The multi-agent governance and configuration patterns referenced throughout Fabric › Setup draw on current multi-agent configuration research. This page catalogs the canonical archetypes, their measured outcomes, and the selection criteria the platform uses when configuring a fabric for a given workload.

Governance Principles

Governance of an agent fabric means making five concerns explicit at configuration time: identity (who the agent is and who delegated it), authorization (which tools and data, under whose policy), audit (a per-step trace with replayable provenance), escalation (which actions need human approval), and budget (token, wall-clock, and hop ceilings). AWS frames this as four pillars (Boundaries, Identity, Visibility, Evaluation), implemented via Bedrock AgentCore Identity (OAuth-scoped agent principals) and AgentCore Policy on Cedar. Google Vertex AI Agent Builder treats each agent as a first-class IAM principal with a dedicated service account and routes all tool calls through an Agent Gateway policy enforcement point. The OpenAI Agents SDK exposes input, output, and tool guardrails with blocking and parallel execution modes.
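The five concerns can be captured in a single configuration object that is validated before any agent starts. The following is a minimal sketch, not a platform API: `GovernanceConfig`, `Budget`, and every field name are hypothetical, chosen only to mirror the five concerns listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical ceilings; all three must be positive."""
    max_tokens: int
    max_wall_clock_s: float
    max_hops: int


@dataclass(frozen=True)
class GovernanceConfig:
    """Hypothetical container for the five governance concerns."""
    agent_id: str                      # identity: who the agent is
    delegated_by: str                  # identity: who delegated it
    allowed_tools: frozenset           # authorization: callable tools
    policy_id: str                     # authorization: governing policy
    audit_sink: str                    # audit: where step traces are written
    approval_required: frozenset       # escalation: tools needing a human
    budget: Budget                     # budget: token / time / hop ceilings

    def validate(self) -> list:
        """Return a list of human-readable problems (empty means valid)."""
        problems = []
        if not self.agent_id or not self.delegated_by:
            problems.append("identity: agent_id and delegated_by are required")
        if not self.policy_id:
            problems.append("authorization: policy_id is required")
        if not self.audit_sink:
            problems.append("audit: audit_sink is required")
        unknown = self.approval_required - self.allowed_tools
        if unknown:
            problems.append(
                f"escalation: approval listed for tools not allowed: {sorted(unknown)}"
            )
        b = self.budget
        if b.max_tokens <= 0 or b.max_wall_clock_s <= 0 or b.max_hops <= 0:
            problems.append("budget: all ceilings must be positive")
        return problems


config = GovernanceConfig(
    agent_id="research-worker-1",
    delegated_by="user:alice",
    allowed_tools=frozenset({"web_search", "send_email"}),
    policy_id="policy/research-v1",
    audit_sink="traces/research",
    approval_required=frozenset({"send_email"}),
    budget=Budget(max_tokens=200_000, max_wall_clock_s=600.0, max_hops=7),
)
print(config.validate())
```

Failing fast on an incomplete configuration keeps governance gaps from surfacing only at run time.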

Configuration Archetypes And Measured Outcomes

Supervisor / orchestrator-worker: best for research, planning, and decomposable retrieval. The Anthropic multi-agent research system (Opus lead + Sonnet workers) outperformed single-agent Opus by 90.2% on internal research evaluations.
Hierarchical team-of-teams: a tree topology (Google ADK style) that scales but adds translation overhead. LangChain reports that supervisor routing accuracy degrades after 8-12 round trips.
Peer handoff / swarm: LangGraph-Swarm reports roughly 40% lower end-to-end latency than a supervisor for conversational routing, at the cost of weaker governance.
Debate / consensus: useful on math, code review, and issue resolution. M3MAD-Bench / Free-MAD (2025) report that consensus pressure hurts accuracy and inflates token cost; SWE-Debate uses competitive (not conformity) debate.
MetaGPT-style SOP: encodes assembly-line standard operating procedures (SOPs) into prompt sequences. MetaGPT reports state-of-the-art 85.9% / 87.7% Pass@1 on the HumanEval / MBPP code-generation benchmarks.

Role Design

A role spec needs five fields: objective, output format, tool set and sources, memory scope, and task boundaries. Anthropic's published subagent guidance names vague specs as the number-one cause of duplicated work. CrewAI encodes role / backstory / goal plus Pydantic-typed outputs; LangGraph models roles as graph nodes with reducer-merged state; MetaGPT pins roles to SOP stages. Claude Code subagents are Markdown files with YAML frontmatter in .claude/agents/ and carry scoped tool allowlists; granular tool access is the primary blast-radius control.
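A role spec with the five fields above can be checked for completeness before it is handed to a worker. This is a minimal sketch under stated assumptions: `RoleSpec` and `missing_fields` are hypothetical names, and treating an empty value as "vague" is a deliberately crude stand-in for a real review.

```python
from dataclasses import dataclass, fields


@dataclass
class RoleSpec:
    """Hypothetical role spec mirroring the five fields named above."""
    objective: str
    output_format: str
    tools_and_sources: list
    memory_scope: str
    task_boundaries: str


def missing_fields(spec: RoleSpec) -> list:
    """Return the names of fields left empty (empty string or empty list)."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


spec = RoleSpec(
    objective="Summarize 2024 benchmark results for retrieval models",
    output_format="Markdown table with one row per model",
    tools_and_sources=["web_search"],
    memory_scope="task-local",
    task_boundaries="",  # left empty on purpose to show the check firing
)
print(missing_fields(spec))
```

Rejecting a spec with any missing field is one simple way to keep two workers from being given overlapping, underspecified tasks.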

Model / Tier Mix

The Anthropic BrowseComp ablation reports that token budget alone explains roughly 80% of performance variance; tool-call count and model choice account for roughly another 15%, so the three factors together explain about 95%. Upgrading to Sonnet 4 yielded a larger gain than doubling the Sonnet 3.7 token budget. Heterogeneous mixing (a strong orchestrator with cheaper workers) outperforms homogeneous high-capability fleets. A counterweight: Liu et al. (arXiv 2604.02460) find that single-agent LLMs beat multi-agent systems on multi-hop reasoning at an equal thinking-token budget; multi-agent wins only when problems are genuinely parallel.

Authorization And Policy Layers

Policy must live outside the agent. The established pattern is OPA / Rego at the tool-calling layer: the agent does not decide what is allowed; the engine does. AWS AgentCore Policy uses Cedar with formal verification, and Vertex's Agent Gateway is the equivalent central policy enforcement point. Scoped credential vaults (Anthropic workspace-scoped keys, short-lived narrowly-scoped credentials per Remote Control session) and per-step gates (OpenAI guardrails with run_in_parallel=False to prevent token / tool side-effects when a tripwire fires) round out the layering.
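The "engine decides, agent does not" idea can be illustrated with a tiny policy enforcement point wrapped around every tool call. This is a sketch only: it does not use OPA, Rego, or Cedar, and `Rule`, `PolicyEngine`, `PolicyDenied`, and `call_tool` are hypothetical names. The default-deny, forbid-overrides-permit evaluation is an assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class Rule:
    principal: str
    tool: str
    effect: str  # "permit" or "forbid"


class PolicyDenied(Exception):
    """Raised when the engine refuses a tool call."""


class PolicyEngine:
    """Hypothetical engine: default deny, and any forbid beats any permit."""

    def __init__(self, rules: Iterable[Rule]) -> None:
        self._rules = tuple(rules)

    def is_allowed(self, principal: str, tool: str) -> bool:
        matching = [
            r for r in self._rules if r.principal == principal and r.tool == tool
        ]
        if any(r.effect == "forbid" for r in matching):
            return False
        return any(r.effect == "permit" for r in matching)


def call_tool(
    engine: PolicyEngine,
    principal: str,
    tool_name: str,
    tool_fn: Callable[..., Any],
    *args: Any,
    **kwargs: Any,
) -> Any:
    """Run tool_fn only if the engine permits it; the agent never self-authorizes."""
    if not engine.is_allowed(principal, tool_name):
        raise PolicyDenied(f"{principal} may not call {tool_name}")
    return tool_fn(*args, **kwargs)


engine = PolicyEngine(
    [
        Rule("research-worker", "web_search", "permit"),
        Rule("research-worker", "delete_file", "permit"),
        Rule("research-worker", "delete_file", "forbid"),
    ]
)
print(call_tool(engine, "research-worker", "web_search", str.upper, "cedar"))
```

Because the check sits in `call_tool` rather than in the agent's prompt, a compromised or confused agent cannot talk its way past it.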

Identity And Isolation

Treat every agent as a non-human identity with cryptographic provenance: what code, model, and environment produced it (the CSA "identity explosion" framing). The November 2025 MCP specification added tool-scoped authorization (SEP-835) and namespace isolation. Bind tokens to clients via DPoP (RFC 9449) to defeat replay, and prefer short-lived tokens. Map controls to the OWASP Agentic Top 10 / ASI (ASI03 Identity & Privilege Abuse, ASI04 Supply Chain, ASI07 Insecure Inter-Agent Comms, ASI10 Rogue Agents) and the proposed NIST AI RMF Agentic Profile.

Budget And Rate-Limit Configuration

Published Anthropic heuristics:

Simple fact-finding: 1 agent, 3-10 tool calls.
Direct comparison: 2-4 subagents, 10-15 calls each.
Complex research: more than 10 subagents with non-overlapping responsibilities.

Agents use roughly 4× the tokens of a chat baseline; multi-agent systems use roughly 15×. Set hop limits below the 8-12 round-trip range where supervisor routing accuracy degrades. AgentDropout (ACL 2025) and SupervisorAgent show that runtime adaptive supervision cuts 29.45% of tokens on GAIA with no loss in success rate.
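The heuristics and multipliers above translate directly into configuration values. The sketch below encodes them as data; `EffortPlan`, `HEURISTICS`, `estimated_tokens`, and `HopCounter` are hypothetical names, and the default hop limit of 7 is an assumption (one below the low end of the 8-12 range), not a published figure.

```python
from dataclasses import dataclass
from typing import Optional

AGENT_TOKEN_MULTIPLIER = 4         # single agent vs chat baseline
MULTI_AGENT_TOKEN_MULTIPLIER = 15  # multi-agent system vs chat baseline
MAX_HOPS = 7                       # assumption: just under the 8-12 range


@dataclass(frozen=True)
class EffortPlan:
    min_agents: int
    max_agents: Optional[int]           # None = no stated upper bound
    max_calls_per_agent: Optional[int]  # None = not stated in the heuristic

    def max_total_calls(self) -> Optional[int]:
        """Upper bound on tool calls, or None when the heuristic gives none."""
        if self.max_agents is None or self.max_calls_per_agent is None:
            return None
        return self.max_agents * self.max_calls_per_agent


HEURISTICS = {
    "simple_fact_finding": EffortPlan(1, 1, 10),
    "direct_comparison": EffortPlan(2, 4, 15),
    "complex_research": EffortPlan(11, None, None),  # "more than 10" subagents
}


def estimated_tokens(chat_baseline_tokens: int, agent_count: int) -> int:
    """Rough token estimate using the 4x / 15x multipliers."""
    multiplier = (
        MULTI_AGENT_TOKEN_MULTIPLIER if agent_count > 1 else AGENT_TOKEN_MULTIPLIER
    )
    return chat_baseline_tokens * multiplier


class HopCounter:
    """Refuses further supervisor round trips once the limit is reached."""

    def __init__(self, limit: int = MAX_HOPS) -> None:
        self.limit = limit
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.limit:
            raise RuntimeError(f"hop limit of {self.limit} reached")
        self.used += 1


print(estimated_tokens(2_000, agent_count=3))
print(HEURISTICS["direct_comparison"].max_total_calls())
```

For a 2,000-token chat baseline, the estimate is 8,000 tokens for one agent and 30,000 for a multi-agent run, which is the gap the budget has to absorb before fanning out.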

Observability Hooks

Standardize on the OpenTelemetry GenAI semantic conventions (GenAI SIG, experimental since April 2024). Three span operations matter most here: chat, invoke_agent, and execute_tool, alongside standard attributes for prompts, tokens, cost, and tool I/O. Backends that consume the conventions include Arize Phoenix (native OpenInference instrumentors), LangSmith, Langfuse, Helicone, and Traceloop. Standardizing on OTel avoids vendor lock-in.
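To show what the three operations look like as span data, the sketch below builds span names and attribute dictionaries in plain Python, without the OpenTelemetry SDK. The helper names `span_name` and `genai_attributes` are hypothetical. The `gen_ai.*` attribute keys follow the experimental conventions and may change between releases, so they should be checked against the current specification before use.

```python
from typing import Optional

SUPPORTED_OPERATIONS = {"chat", "invoke_agent", "execute_tool"}


def span_name(operation: str, target: str) -> str:
    """Span name as '<operation> <target>', e.g. 'execute_tool web_search'."""
    return f"{operation} {target}"


def genai_attributes(
    operation: str,
    *,
    model: Optional[str] = None,
    agent_name: Optional[str] = None,
    tool_name: Optional[str] = None,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
) -> dict:
    """Build a span attribute dict, omitting anything left unset."""
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    candidates = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.agent.name": agent_name,
        "gen_ai.tool.name": tool_name,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return {key: value for key, value in candidates.items() if value is not None}


tool_span = (
    span_name("execute_tool", "web_search"),
    genai_attributes("execute_tool", tool_name="web_search"),
)
print(tool_span)
```

In a real deployment these dictionaries would be passed as attributes when starting a span with the OpenTelemetry SDK; keeping the key names in one helper makes a later convention change a one-file edit.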

Outcome-Optimal Configuration Research

Anthropic engineering: token budget, tool-call count, and model choice (in that order) explain 95% of performance variance. Mixed-tier centralized topologies beat homogeneous fleets.
"Towards a Science of Scaling Agent Systems" (arXiv 2512.08296) and AI Sweden's Practical Approach to Optimize Multi-Agent Systems (Dec 2025) give scaling laws for agent count vs accuracy plateau.
REALM-Bench and MultiAgentBench measure collaboration / competition over 11 real-world planning scenarios.
LangChain's supervisor-architecture benchmark reports roughly 50% improvement from supervisor implementation fixes alone (prompt + handoff translation).
SWE-Debate: competitive debate beats consensus debate on software issue resolution.

Selection Criteria

Archetype	Best Problem Class	Cost (Relative)	Governance Ease	Parallelism	Debuggability	Outcome Evidence
Single agent	Serial multi-hop reasoning, tight budgets	1×	High	None	Highest	arXiv 2604.02460; beats MAS at equal thinking-token budget
Supervisor / orchestrator	Open-ended research, decomposable retrieval	~15× chat	High (single PEP at supervisor)	Medium (with `Send`-style primitives)	High (one trace root)	Anthropic: +90.2% over single-agent Opus; token budget explains ~80% of variance
Hierarchical (team-of-teams)	Large planning, multi-domain workflows	High	High	Medium-High	Medium (multi-root traces)	Google ADK; supervisor accuracy cliff at 8-12 hops
Peer / swarm handoff	Conversational routing, customer support	Low-Medium	Low (distributed policy)	High	Low	LangGraph-Swarm: ~40% latency reduction
Debate / consensus	Math, code review, fact verification	High (round-multiplied)	Medium	Medium	Medium	SWE-Debate (competitive) beats consensus MAD; conformity hurts
MetaGPT-style SOP	Software dev with known stages	Medium	High (SOP-gated)	Medium	High (stage gates)	85.9% / 87.7% Pass@1 on code-gen benchmarks

Picking Heuristic

Pick single agent until there is a measurable bottleneck. Move to supervisor when sub-tasks are independent and the budget can absorb roughly 15× the chat baseline. Pick hierarchical only when a single supervisor would need more than 12 round trips, past the range where routing accuracy degrades. Pick swarm only when latency matters more than auditability. Pick SOP when stages are stable and stage-typed artifacts exist. Pick debate only with a competitive (not conformity) protocol and a verifier that can break ties.
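The heuristic reads naturally as a small decision function. The sketch below is one possible encoding, not the platform's selector: `Workload` and `pick_archetype` are hypothetical, and the order in which the rules are tried is an assumption that simply follows the order of the sentences above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload description; defaults describe a simple task."""
    measurable_bottleneck: bool = False
    subtasks_independent: bool = False
    budget_multiplier: float = 1.0       # affordable spend vs chat baseline
    supervisor_round_trips: int = 0      # expected trips through one supervisor
    latency_over_auditability: bool = False
    stable_typed_stages: bool = False    # stable stages with typed artifacts
    competitive_debate_with_verifier: bool = False


def pick_archetype(w: Workload) -> str:
    """Apply the picking rules in document order; fall back to single agent."""
    if not w.measurable_bottleneck:
        return "single agent"
    if w.subtasks_independent and w.budget_multiplier >= 15:
        # One supervisor is enough until it would exceed 12 round trips.
        if w.supervisor_round_trips > 12:
            return "hierarchical"
        return "supervisor"
    if w.latency_over_auditability:
        return "swarm"
    if w.stable_typed_stages:
        return "sop"
    if w.competitive_debate_with_verifier:
        return "debate"
    return "single agent"


research = Workload(
    measurable_bottleneck=True,
    subtasks_independent=True,
    budget_multiplier=15,
    supervisor_round_trips=6,
)
print(pick_archetype(research))
```

Making the fallback "single agent" keeps the default cheap and debuggable, which matches the first rule of the heuristic.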

Cross-References

Fabric › Setup: the developer-facing platform layer that consumes these patterns.
Reference › Research › Subagents: the multi-agent compositions that Setup wires together.
Reference › Research › Fabric Flow: the orchestration patterns that Setup configures.