Fabric Setup

The multi-agent governance and configuration patterns referenced throughout Fabric › Setup draw on current multi-agent configuration research. This page catalogs the canonical archetypes, their measured outcomes, and the selection criteria the platform uses when configuring a fabric for a given workload.

Governance Principles

Governance of an agent fabric means making five concerns explicit at configuration time: identity (who the agent is and who delegated it), authorization (which tools and data, under whose policy), audit (a per-step trace with replayable provenance), escalation (which actions need human approval), and budget (token, wall-clock, and hop ceilings). AWS frames this as four pillars (Boundaries, Identity, Visibility, Evaluation), implemented via Bedrock AgentCore Identity (OAuth-scoped agent principals) and AgentCore Policy on Cedar. Google Vertex AI Agent Builder treats each agent as a first-class IAM principal with a dedicated service account and routes all tool calls through an Agent Gateway policy enforcement point. The OpenAI Agents SDK exposes input, output, and tool guardrails with blocking and parallel execution modes.
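The five concerns above can be captured in a single configuration object that is validated before an agent starts. The following is a minimal sketch, not a platform API: the class names, field names, and the example values (`researcher-1`, `user:alice`, the audit path) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_tokens: int          # token ceiling for the whole run
    max_wall_clock_s: float  # wall-clock ceiling in seconds
    max_hops: int            # ceiling on agent-to-agent handoffs


@dataclass(frozen=True)
class AgentGovernance:
    agent_id: str                  # identity: who the agent is
    delegated_by: str              # identity: who delegated it
    allowed_tools: frozenset[str]  # authorization: deny anything not listed
    audit_sink: str                # audit: where per-step traces are written
    needs_approval: frozenset[str] # escalation: tools gated on a human
    budget: Budget                 # budget: token, wall-clock, hop ceilings

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means all five concerns are explicit."""
        problems = []
        if not self.agent_id or not self.delegated_by:
            problems.append("identity: agent_id and delegated_by are required")
        if not self.allowed_tools:
            problems.append("authorization: allowed_tools is empty")
        if not self.audit_sink:
            problems.append("audit: audit_sink is required")
        if not self.needs_approval <= self.allowed_tools:
            problems.append("escalation: needs_approval lists tools outside allowed_tools")
        b = self.budget
        if min(b.max_tokens, b.max_wall_clock_s, b.max_hops) <= 0:
            problems.append("budget: all ceilings must be positive")
        return problems


cfg = AgentGovernance(
    agent_id="researcher-1",
    delegated_by="user:alice",
    allowed_tools=frozenset({"web_search", "send_email"}),
    audit_sink="traces/researcher-1.jsonl",
    needs_approval=frozenset({"send_email"}),
    budget=Budget(max_tokens=200_000, max_wall_clock_s=600.0, max_hops=7),
)
print(cfg.validate())  # []
```

Failing fast on an incomplete configuration keeps the five concerns from being decided implicitly at run time.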

Configuration Archetypes And Measured Outcomes

  • Supervisor / orchestrator-worker: best for research, planning, and decomposable retrieval. The Anthropic multi-agent research system (Opus lead + Sonnet workers) outperformed single-agent Opus by 90.2% on internal research evaluations.
  • Hierarchical team-of-teams: tree topology (Google ADK style); scales but adds translation overhead. LangChain reports that supervisor routing accuracy degrades after 8-12 round trips.
  • Peer handoff / swarm: LangGraph-Swarm reports roughly 40% lower end-to-end latency than a supervisor for conversational routing, at the cost of weaker governance.
  • Debate / consensus: useful for math, code review, and issue resolution. M3MAD-Bench and Free-MAD (2025) report that consensus pressure hurts accuracy and inflates token cost; SWE-Debate uses competitive (not conformity) debate.
  • MetaGPT-style SOP: encodes assembly-line SOPs into prompt sequences. MetaGPT reports state-of-the-art 85.9% / 87.7% Pass@1 on the HumanEval / MBPP code-generation benchmarks.

Role Design

A role spec needs five fields: objective, output format, tool set and sources, memory scope, and task boundaries. Anthropic's published subagent guidance names vague specs as the number-one cause of duplicated work. CrewAI encodes role / backstory / goal plus Pydantic-typed outputs; LangGraph models roles as graph nodes with reducer-merged state; MetaGPT pins roles to SOP stages. Claude Code subagents are Markdown files with YAML frontmatter in .claude/agents/, each with a scoped tool allowlist; granular tool access is the primary blast-radius control.
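As an illustration of the five-field role spec, the sketch below rejects a spec with any empty field and renders the rest as a Markdown file with YAML frontmatter. The `RoleSpec` class and `render_subagent` function are hypothetical helpers, the frontmatter keys (`name`, `description`, `tools`) follow the Claude Code subagent layout as best understood here, and the body layout is an assumption. Values are written unquoted, so a value containing a colon would need YAML quoting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleSpec:
    name: str
    objective: str
    output_format: str
    tools: tuple[str, ...]    # tool allowlist
    sources: tuple[str, ...]  # where the role may look
    memory_scope: str
    boundaries: str           # what the role must not do

    def missing_fields(self) -> list[str]:
        # Any empty string or empty tuple counts as a vague spec.
        return [key for key, value in vars(self).items() if not value]


def render_subagent(spec: RoleSpec) -> str:
    """Render a role spec as frontmatter plus a short prompt body."""
    missing = spec.missing_fields()
    if missing:
        raise ValueError(f"vague role spec, missing: {', '.join(missing)}")
    lines = [
        "---",
        f"name: {spec.name}",
        f"description: {spec.objective}",
        f"tools: {', '.join(spec.tools)}",
        "---",
        "",
        f"Objective: {spec.objective}",
        f"Output format: {spec.output_format}",
        f"Sources: {', '.join(spec.sources)}",
        f"Memory scope: {spec.memory_scope}",
        f"Boundaries: {spec.boundaries}",
    ]
    return "\n".join(lines) + "\n"


spec = RoleSpec(
    name="citation-checker",
    objective="Verify that every claim in a draft has a supporting source",
    output_format="JSON list of {claim, source, verdict}",
    tools=("Read", "Grep", "WebSearch"),
    sources=("project docs", "public web"),
    memory_scope="current task only",
    boundaries="Do not edit files; do not rewrite the draft",
)
print(render_subagent(spec))
```

Keeping the allowlist in the same object as the objective makes it harder to widen tool access without also revisiting the role's stated boundaries.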

Model / Tier Mix

The Anthropic BrowseComp ablation reports that token budget alone explains roughly 80% of performance variance; tool-call count and model choice account for roughly a further 15%, for about 95% combined. Upgrading to Sonnet 4 gave a larger gain than doubling the Sonnet 3.7 token budget. Heterogeneous mixing (strong orchestrator + cheaper workers) outperforms homogeneous high-capability fleets. A counterweight: Liu et al. (arXiv 2604.02460) find that single-agent LLMs beat multi-agent systems on multi-hop reasoning at equal thinking-token budget; multi-agent wins only when problems are genuinely parallel.

Authorization And Policy Layers

Policy must live outside the agent. OPA / Rego at the tool-calling layer (the agent does not decide what is allowed; the engine does) is the established pattern; AWS AgentCore Policy uses Cedar with formal verification; Vertex's Agent Gateway is the equivalent central policy enforcement point. Scoped credential vaults (Anthropic workspace-scoped keys, short-lived narrowly-scoped credentials per Remote Control session) and per-step gates (OpenAI guardrails with run_in_parallel=False to prevent token / tool side-effects when a tripwire fires) round out the layering.
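The "policy lives outside the agent" pattern can be sketched as a small policy enforcement point that sits between the agent and its tools. This is a plain-Python stand-in for an engine such as OPA or Cedar, not their APIs; the `POLICY` table, `ToolCall`, `decide`, and `enforce` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    tool: str
    args: dict


# Policy as data, owned by the platform rather than by the agent.
POLICY = {
    "researcher-1": {
        "allow": {"web_search", "read_file"},
        "approve": {"read_file"},  # allowed, but only after human approval
    },
}


def decide(call: ToolCall, policy: dict) -> str:
    """Return 'allow', 'needs_approval', or 'deny' (deny by default)."""
    rule = policy.get(call.agent_id)
    if rule is None or call.tool not in rule["allow"]:
        return "deny"
    if call.tool in rule["approve"]:
        return "needs_approval"
    return "allow"


def enforce(call: ToolCall, tools: dict[str, Callable[..., Any]],
            policy: dict, approved: bool = False) -> Any:
    """Run the tool only if the policy engine, not the agent, permits it."""
    decision = decide(call, policy)
    if decision == "deny":
        raise PermissionError(f"{call.agent_id} may not call {call.tool}")
    if decision == "needs_approval" and not approved:
        raise PermissionError(f"{call.tool} requires human approval")
    return tools[call.tool](**call.args)


TOOLS = {
    "web_search": lambda query: f"results for {query}",
    "read_file": lambda path: f"contents of {path}",
}

call = ToolCall("researcher-1", "web_search", {"query": "cedar policy"})
print(enforce(call, TOOLS, POLICY))  # results for cedar policy
```

Because the agent only ever sees `enforce`, changing what is allowed means editing policy data, not prompts.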

Identity And Isolation

Treat every agent as a non-human identity with cryptographic provenance: what code, model, and environment produced it (the CSA "identity explosion" framing). The November 2025 MCP specification added tool-scoped authorization (SEP-835) and namespace isolation. Bind tokens to clients via DPoP (RFC 9449) to defeat replay, and prefer short-lived tokens. Map controls to the OWASP Agentic Top 10 / ASI (ASI03 Identity & Privilege Abuse, ASI04 Supply Chain, ASI07 Insecure Inter-Agent Comms, ASI10 Rogue Agents) and the proposed NIST AI RMF Agentic Profile.

Budget And Rate-Limit Configuration

Published Anthropic heuristics:

  • Simple fact-finding: 1 agent, 3-10 tool calls.
  • Direct comparison: 2-4 subagents, 10-15 calls each.
  • Complex research: more than 10 subagents with non-overlapping responsibilities.

Agents use roughly 4× the tokens of a chat baseline; multi-agent systems use roughly 15×. Set hop limits below the supervisor's 8-12-turn routing-accuracy cliff. AgentDropout (ACL 2025) and SupervisorAgent show that runtime adaptive supervision cuts token use on GAIA by about 29.45% with no loss in success rate.
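The heuristics and multipliers above translate into simple arithmetic for setting ceilings. The sketch below is illustrative only: the table keys, function names, and the one-hop safety margin are assumptions, and the complex-research entry leaves calls per subagent unset because the heuristic does not state one.

```python
# Token multipliers relative to a chat baseline, as quoted above.
TOKEN_MULTIPLIER = {"chat": 1, "single_agent": 4, "multi_agent": 15}

# Routing accuracy is reported to degrade somewhere in this range of turns.
ROUTING_CLIFF_HOPS = (8, 12)

# (min, max) ranges; None means the heuristic gives no figure.
HEURISTICS = {
    "simple_fact": {"subagents": (1, 1), "calls_each": (3, 10)},
    "comparison": {"subagents": (2, 4), "calls_each": (10, 15)},
    "complex_research": {"subagents": (10, None), "calls_each": None},
}


def token_ceiling(chat_baseline_tokens: int, mode: str) -> int:
    """Scale a measured chat baseline by the quoted multiplier."""
    return chat_baseline_tokens * TOKEN_MULTIPLIER[mode]


def hop_limit(margin: int = 1) -> int:
    """Stay below the low end of the cliff; the margin is an assumption."""
    return ROUTING_CLIFF_HOPS[0] - margin


def max_tool_calls(task_class: str):
    """Upper bound on total tool calls, or None where no figure is given."""
    h = HEURISTICS[task_class]
    max_agents = h["subagents"][1]
    if h["calls_each"] is None or max_agents is None:
        return None
    return max_agents * h["calls_each"][1]


print(token_ceiling(2_000, "multi_agent"))  # 30000
print(hop_limit())                          # 7
print(max_tool_calls("comparison"))         # 60
```

A 2,000-token chat baseline therefore implies a ceiling of about 30,000 tokens for a multi-agent run, and a direct comparison tops out at 4 subagents × 15 calls = 60 tool calls.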

Observability Hooks

Standardize on the OpenTelemetry GenAI semantic conventions (GenAI SIG, experimental since April 2024). Three span operations matter for an agent fabric: chat, invoke_agent, and execute_tool, alongside standard attributes for prompts, tokens, cost, and tool I/O. Backends that consume the conventions include Arize Phoenix (native OpenInference instrumentors), LangSmith, Langfuse, Helicone, and Traceloop. Standardizing on OTel avoids vendor lock-in.

Outcome-Optimal Configuration Research

Selection Criteria

| Archetype | Best Problem Class | Cost (Relative) | Governance Ease | Parallelism | Debuggability | Outcome Evidence |
|---|---|---|---|---|---|---|
| Single agent | Serial multi-hop reasoning, tight budgets | | High | None | Highest | arXiv 2604.02460; beats MAS at equal thinking-token budget |
| Supervisor / orchestrator | Open-ended research, decomposable retrieval | ~15× chat | High (single PEP at supervisor) | Medium (with Send-style primitives) | High (one trace root) | Anthropic: +90.2% over single Opus; 80% variance ≈ token budget |
| Hierarchical (team-of-teams) | Large planning, multi-domain workflows | High | High | Medium-High | Medium (multi-root traces) | Google ADK; supervisor accuracy cliff at 8-12 hops |
| Peer / swarm handoff | Conversational routing, customer support | Low-Medium | Low (distributed policy) | High | Low | LangGraph-Swarm: ~40% latency reduction |
| Debate / consensus | Math, code review, fact verification | High (round-multiplied) | Medium | Medium | Medium | SWE-Debate (competitive) beats consensus MAD; conformity hurts |
| MetaGPT-style SOP | Software dev with known stages | Medium | High (SOP-gated) | Medium | High (stage gates) | 85.9% / 87.7% Pass@1 on code-gen benchmarks |

Picking Heuristic

Pick single agent until there is a measurable bottleneck. Move to supervisor when sub-tasks are independent and the budget can absorb roughly 15×. Pick hierarchical only when one supervisor's context starts overflowing past 12 hops. Pick swarm only when latency matters more than auditability. Pick SOP when stages are stable and stage-typed artifacts exist. Pick debate only with a competitive (not conformity) protocol and a verifier that can break ties.
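The picking heuristic can be written down as a small decision function. This is a sketch under stated assumptions: the `Workload` fields and `pick_archetype` are hypothetical, and the order in which the rules are checked is an assumption, since the heuristic gives the conditions but no priority among them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    measurable_bottleneck: bool               # is single agent demonstrably too slow or weak?
    independent_subtasks: bool                # can sub-tasks run without each other's output?
    budget_multiple: float                    # affordable spend as a multiple of chat baseline
    supervisor_hops: int                      # round trips one supervisor would need
    latency_over_auditability: bool           # does latency matter more than audit?
    stable_stages_with_typed_artifacts: bool  # known stages with stage-typed artifacts?
    competitive_debate_with_verifier: bool    # competitive protocol plus a tie-breaking verifier?


def pick_archetype(w: Workload) -> str:
    # Rule order below is an assumption; only the conditions come from the heuristic.
    if not w.measurable_bottleneck:
        return "single agent"
    if w.latency_over_auditability:
        return "peer / swarm handoff"
    if w.stable_stages_with_typed_artifacts:
        return "MetaGPT-style SOP"
    if w.competitive_debate_with_verifier:
        return "debate / consensus"
    if w.independent_subtasks and w.budget_multiple >= 15:
        if w.supervisor_hops > 12:
            return "hierarchical (team-of-teams)"
        return "supervisor / orchestrator"
    # Nothing else qualifies, so stay with the cheapest, most debuggable option.
    return "single agent"


research = Workload(
    measurable_bottleneck=True,
    independent_subtasks=True,
    budget_multiple=20,
    supervisor_hops=6,
    latency_over_auditability=False,
    stable_stages_with_typed_artifacts=False,
    competitive_debate_with_verifier=False,
)
print(pick_archetype(research))  # supervisor / orchestrator
```

Falling back to single agent when no rule fires mirrors the first sentence of the heuristic: added topology has to earn its cost.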

Cross-References