Fabric Flow¶
The multi-agent flow and failure-protection patterns referenced throughout Fabric › Flow draw on the 2024-2026 agent-reliability research lineage. This page catalogs the canonical flow patterns, the MAST failure taxonomy, the protection mechanisms that prevent runaway loops and deadlocks, and the selection criteria the platform uses when configuring Flow defaults for a given workload.
Staged Flow Patterns¶
Anthropic's "Building Effective Agents" (December 2024) defines the canonical taxonomy still dominant in 2026: prompt chaining (sequential), routing (classifier dispatches to specialists), parallelization (sectioning for speed, voting for confidence), orchestrator-workers (dynamic decomposition by a planner), evaluator-optimizer (generator / critic loop), and fully autonomous agents. Hierarchical / supervisor variants formalize this with a supervisor node routing to peer workers; debate patterns run N agents in parallel rounds and aggregate. Canonical staged flow: Plan → Fan-out(Workers) → Fan-in(Aggregator) → Validate → Finalize, with failure transitions back to Plan only when a retry budget allows.
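The canonical staged flow can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: `run_staged_flow` and its callable parameters (`plan`, `work`, `validate`) are hypothetical names, and the aggregator simply joins worker outputs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_staged_flow(
    task: str,
    plan: Callable[[str, int], list[str]],
    work: Callable[[str], str],
    validate: Callable[[list[str]], bool],
    retry_budget: int = 2,
) -> str:
    """Plan -> Fan-out -> Fan-in -> Validate -> Finalize.

    A failed validation transitions back to Plan only while the
    retry budget allows; otherwise the flow aborts.
    """
    for attempt in range(retry_budget + 1):
        subtasks = plan(task, attempt)                 # Plan
        with ThreadPoolExecutor() as pool:             # Fan-out to workers
            results = list(pool.map(work, subtasks))   # Fan-in, input order preserved
        if validate(results):                          # Validate
            return " | ".join(results)                 # Finalize (toy aggregator)
    raise RuntimeError("retry budget exhausted; aborting instead of replanning")
```

The loop bound is the retry budget itself, so the only path back to Plan is an explicit, counted one.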
MAST Failure Taxonomy¶
Cemri et al. (arXiv 2503.13657) analyzed 150+ traces across seven frameworks, with inter-annotator agreement of κ = 0.88, and codified 14 failure modes under three categories: Specification & System Design (~42%), Inter-Agent Misalignment (~37%), and Task Verification & Termination (~21%). Concrete production examples: GPT-4-based ChatDev agents agreeing to abandon tasks; supervisors closing tickets before sub-agents finished; AutoGen GroupChats devolving into agreement loops without progress.
Infinite Loop Prevention¶
Every major framework ships a hard hop ceiling: LangGraph's recursion_limit defaults to 25 and raises GraphRecursionError; AutoGen GroupChat exposes max_round plus MaxMessageTermination / TokenUsageTermination / TimeoutTermination; OpenAI Agents SDK uses max_turns raising MaxTurnsExceededError; LangChain ReAct exposes max_iterations (LangGraph error guide). Hop limits alone miss semantic loops, in which agents cycle through paraphrases of the same call without making progress. Production teams add a LoopDetector middleware that SHA-hashes (tool_name, normalized_args) in a sliding window (typical N = 5, threshold = 3) and forces termination (gantz).
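A hash-based loop detector of the kind described above might look like the following sketch. The class shape and the `observe` method are assumptions for illustration; the normalization shown (key-sorted JSON) is the simplest possible choice, and real middleware would likely normalize more aggressively.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    """Flags a tool call that recurs `threshold` times within the last `window` calls."""

    def __init__(self, window: int = 5, threshold: int = 3) -> None:
        self.recent = deque(maxlen=window)  # sliding window of call fingerprints
        self.threshold = threshold

    @staticmethod
    def _fingerprint(tool_name: str, args: dict) -> str:
        # Assumed normalization: sort keys so argument order does not matter.
        normalized = json.dumps(args, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(f"{tool_name}:{normalized}".encode()).hexdigest()

    def observe(self, tool_name: str, args: dict) -> bool:
        """Record a call; return True when the caller should force termination."""
        fp = self._fingerprint(tool_name, args)
        self.recent.append(fp)
        return self.recent.count(fp) >= self.threshold
```

Because the window is bounded, an occasional legitimate repeat (for example, polling a status endpoint twice) does not trip the detector, while three identical calls within five hops do.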
Circuit Breakers¶
Adapted from Hystrix / resilience4j, agent circuit breakers track soft failures invisible to HTTP-layer breakers: schema-invalid outputs, semantic-invariant violations, identical-tool-call streaks, cost-velocity overruns (Portkey, NeuralTrust). Trip conditions in production: ≥ 3 consecutive identical tool calls; ≥ 2 consecutive JSON-schema validation failures; rolling cost rate exceeding threshold; provider 429 / 503 rates exceeding threshold. State machine: CLOSED → OPEN → HALF_OPEN → CLOSED | OPEN.
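The state machine above can be sketched as follows. This is an illustrative skeleton rather than any vendor's breaker: `AgentCircuitBreaker`, `allow`, and `record` are hypothetical names, the cooldown value is an assumption, and a "failure" is whatever soft signal the caller decides to report (for example, a JSON-schema validation failure).

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class AgentCircuitBreaker:
    def __init__(self, failure_threshold: int = 2, cooldown_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock            # injectable so the breaker can be tested
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; OPEN becomes HALF_OPEN after the cooldown."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record(self, ok: bool) -> None:
        """Report the outcome of a call that was allowed."""
        if ok:
            self.failures = 0
            self.state = State.CLOSED          # HALF_OPEN probe succeeded, or normal success
            return
        self.failures += 1
        # A failed probe re-opens immediately; otherwise trip on consecutive failures.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

Each trip condition listed above (identical tool calls, schema failures, cost velocity, provider error rates) would typically get its own breaker instance or its own counter feeding `record`.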
Retry Budgets And Exponential Backoff¶
Per-call retry uses exponential backoff with full jitter (sleep = random(0, base * 2^attempt)), but the 2025 consensus is that per-call retries must be governed by a system-wide token-bucket retry budget capping retry traffic at 10-20% of normal load (SRE School); otherwise a model outage triggers retry storms. AI-specific extension: a hard pre-execution budget gate evaluates token and dollar spend before every model call or tool invocation; exhaustion is a deterministic deny, not a retry. Cycles and Truefoundry both report 10× cost blowups when sub-agent spawning is uncapped.
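The three mechanisms in this section compose roughly as sketched below. All names (`full_jitter_delay`, `RetryBudget`, `BudgetGate`) are hypothetical, the `cap` on the backoff is an added assumption not stated above, and the gate tracks only tokens to keep the example short; a dollar budget would follow the same shape.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Sleep time before a retry: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


class RetryBudget:
    """System-wide token bucket; integer tokens avoid floating-point drift."""

    def __init__(self, tokens_per_retry: int = 10, max_tokens: int = 100) -> None:
        self.tokens_per_retry = tokens_per_retry  # 10 => retries capped near 10% of traffic
        self.max_tokens = max_tokens
        self.tokens = 0

    def record_request(self) -> None:
        # Every first-attempt request deposits one token.
        self.tokens = min(self.max_tokens, self.tokens + 1)

    def try_spend_retry(self) -> bool:
        # A retry is allowed only if the bucket can pay for it.
        if self.tokens >= self.tokens_per_retry:
            self.tokens -= self.tokens_per_retry
            return True
        return False


class BudgetGate:
    """Pre-execution gate: deny deterministically once the budget cannot cover a call."""

    def __init__(self, max_tokens: int) -> None:
        self.remaining = max_tokens

    def authorize(self, estimated_tokens: int) -> bool:
        if estimated_tokens > self.remaining:
            return False              # deny; the caller must not retry this decision
        self.remaining -= estimated_tokens
        return True
```

With `tokens_per_retry=10`, ten normal requests fund one retry, which approximates the lower end of the 10-20% cap; setting it to 5 approximates the upper end.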
Timeout Policies¶
Three layers compose: per-step timeout (single LLM call, typically 30-120 s), per-stage timeout (a worker's full sub-flow), and overall workflow timeout (top-level deadline). AutoGen's TimeoutTermination and LangGraph's step_timeout configure step-level limits; durable engines (Temporal, Restate) propagate a workflow deadline as context to every activity. Timeout must propagate as an absolute deadline, not a duration, so retries do not reset the clock.
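To illustrate why an absolute deadline matters, the sketch below derives each step's timeout from a single fixed expiry time, so a retried step gets only what is left of the workflow budget. The `Deadline` class is hypothetical, and the explicit `now` parameter exists only to make the behavior easy to demonstrate.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deadline:
    expires_at: float  # absolute instant on the monotonic clock, fixed at workflow start

    @classmethod
    def after(cls, seconds: float, now: Optional[float] = None) -> "Deadline":
        start = time.monotonic() if now is None else now
        return cls(expires_at=start + seconds)

    def remaining(self, now: Optional[float] = None) -> float:
        current = time.monotonic() if now is None else now
        return max(0.0, self.expires_at - current)

    def step_timeout(self, per_step_cap: float, now: Optional[float] = None) -> float:
        """Timeout for the next step: the per-step cap, clipped to the time left overall."""
        return min(per_step_cap, self.remaining(now))
```

For example, with a 100 s workflow deadline and a 30 s per-step cap, a step started at t = 85 s receives a 15 s timeout, and a step started after t = 100 s receives 0 s and should not be attempted.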
Escalation Paths¶
LangGraph interrupt() (GA 2024, expanded throughout 2025) pauses the graph at a checkpoint, persists state via AsyncPostgresSaver, and waits for a Command(resume=...), supporting approve / edit / reject / respond actions. OpenAI Agents SDK models human escalation as a handoff to a "human agent" (often via Temporal or Dapr Diagrid integrations for durability). Claude tool-use approval uses per-tool tool_choice plus client-side approval gates. Confidence-threshold escalation: if a critic / verifier's score is below τ, route to human review rather than retry.
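Confidence-threshold escalation and the four reviewer actions can be sketched framework-free as below. Nothing here is LangGraph or OpenAI Agents SDK API: `route`, `resolve`, and `Review` are hypothetical names, τ = 0.7 is an arbitrary default, and the meaning given to each action (in particular that "respond" sends feedback back for a revision) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Review:
    action: str                  # "approve" | "edit" | "reject" | "respond"
    text: Optional[str] = None   # edited output or reviewer feedback, when relevant


def route(score: float, tau: float = 0.7) -> str:
    """Escalate to a human when the verifier's score is below tau; never blind-retry."""
    return "human_review" if score < tau else "accept"


def resolve(draft: str, review: Review) -> tuple[str, Optional[str]]:
    """Map a reviewer decision to (next_state, payload)."""
    if review.action == "approve":
        return ("final", draft)
    if review.action == "edit":
        return ("final", review.text)      # reviewer's edited version replaces the draft
    if review.action == "reject":
        return ("aborted", None)
    if review.action == "respond":
        return ("revise", review.text)     # feedback goes back to the agent
    raise ValueError(f"unknown review action: {review.action!r}")
```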
Idempotency And Saga Patterns¶
Each agent step should be idempotent under an idempotency_key (typically hash(step_id, inputs)) so durable replay returns cached results. Saga compensation: every state-mutating action registers a compensate() function; on failure, the orchestrator runs compensations in LIFO order over actually-executed steps. Temporal, Dapr Workflows (stable in v1.15, 2025), and Restate (Cloud GA 2025) journal each step and replay on crash, returning cached results for completed activities.
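A minimal in-memory sketch of both ideas follows. It assumes a hypothetical `Saga` class with a dict standing in for the durable journal; engines such as Temporal or Restate persist this journal and handle replay themselves, so this only illustrates the control flow.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(step_id: str, inputs: dict) -> str:
    """hash(step_id, inputs), using key-sorted JSON so equal inputs hash equally."""
    payload = json.dumps({"step": step_id, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class Saga:
    def __init__(self) -> None:
        self.journal: dict[str, Any] = {}                     # stand-in for a durable journal
        self.compensations: list[Callable[[], None]] = []

    def run_step(self, step_id: str, inputs: dict,
                 action: Callable[[dict], Any],
                 compensate: Callable[[], None]) -> Any:
        key = idempotency_key(step_id, inputs)
        if key in self.journal:
            return self.journal[key]       # replay: return the cached result, no side effect
        result = action(inputs)
        self.journal[key] = result
        self.compensations.append(compensate)  # registered only for steps that actually ran
        return result

    def rollback(self) -> None:
        """Run compensations in LIFO order over the executed steps."""
        while self.compensations:
            self.compensations.pop()()
```

Registering the compensation only after the action succeeds is what restricts rollback to actually-executed steps.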
Replanning Versus Aborting¶
Replan when the failure is recoverable and bounded: the tool returned an error decodable as a constraint violation, schema validation failed once, or a sub-agent returned partial results. Abort when the retry budget is exhausted, the circuit breaker is open, the failure is a MAST-class "specification" failure, or a repeated-state hash indicates no semantic progress for K iterations. MAS-Orchestra and Cogent's 2026 playbook converge on the same rule: replan at most twice per stage; on the third failure, escalate or abort.
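The replan / abort rule reduces to a small decision function. The sketch below is one possible encoding, with hypothetical field names and K = 3 chosen arbitrarily; it checks the abort conditions first, then applies the replan-at-most-twice rule.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REPLAN = "replan"
    ESCALATE_OR_ABORT = "escalate_or_abort"
    ABORT = "abort"


@dataclass
class StageStatus:
    failures_in_stage: int      # failures seen so far in this stage, including the current one
    retry_budget_left: bool
    breaker_open: bool
    spec_failure: bool          # MAST-class "specification" failure
    no_progress_iters: int      # consecutive iterations with a repeated state hash


def decide(s: StageStatus, max_replans: int = 2, k: int = 3) -> Decision:
    # Hard abort conditions take priority over any replan allowance.
    if (not s.retry_budget_left or s.breaker_open
            or s.spec_failure or s.no_progress_iters >= k):
        return Decision.ABORT
    # Failures 1 and 2 earn a replan; the third escalates or aborts.
    if s.failures_in_stage > max_replans:
        return Decision.ESCALATE_OR_ABORT
    return Decision.REPLAN
```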
Deadlock Detection¶
Two classes dominate: peer handoff oscillation (A → B → A …) and supervisor stuck state (supervisor repeatedly dispatches to the same worker that returns "needs more info"). Cogent's 2026 playbook: an agent cannot self-diagnose a loop; an external monitor must prove it. Implementations track a handoff graph and trigger on cycles of length ≤ 3 occurring ≥ 2 times, or on supervisor states whose (active_worker, last_message_hash) repeats. LLMDR (arXiv 2503.00717) demonstrates LLM-driven detection-then-resolution where a separate model arbitrates the stalled pair.
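An external monitor for peer handoff oscillation can be sketched as a periodicity check over the handoff history. `HandoffMonitor` is a hypothetical name, and the sketch covers only the oscillation class; supervisor stuck-state detection would additionally track repeated (active_worker, last_message_hash) pairs.

```python
class HandoffMonitor:
    """Detects handoff cycles of length <= max_cycle_len that repeat >= min_repeats times."""

    def __init__(self, max_cycle_len: int = 3, min_repeats: int = 2) -> None:
        self.max_cycle_len = max_cycle_len
        self.min_repeats = min_repeats
        self.history: list[str] = []   # agents in the order they received control

    def record(self, agent: str) -> bool:
        """Record a handoff to `agent`; return True if an oscillation is detected."""
        self.history.append(agent)
        return self._oscillating()

    def _oscillating(self) -> bool:
        h = self.history
        for period in range(2, self.max_cycle_len + 1):
            # A cycle of `period` hops repeated `min_repeats` times spans this many entries.
            need = period * self.min_repeats + 1
            if len(h) < need:
                continue
            tail = h[-need:]
            periodic = all(tail[i] == tail[i + period] for i in range(need - period))
            # Require more than one distinct agent so a single busy agent is not a "cycle".
            if periodic and len(set(tail[:period])) > 1:
                return True
        return False
```

With the defaults, the sequence A, B, A, B, A (the cycle A → B → A seen twice) triggers on the fifth handoff.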
Validation Gates Between Stages¶
Every stage transition should pass through a typed gate: schema validation (Pydantic / JSON Schema on tool outputs), policy validation (AWS Bedrock Guardrails, OPA, or AgentCore Policy authorizing each action pre-execution), semantic invariants (e.g., "answer cites at least one retrieved doc"), and grounding / hallucination checks. AWS's Automated Reasoning checks in Bedrock Guardrails use formal logic for verifiable correctness on policy-encoded facts. Gates produce a structured ValidationResult consumed by the orchestrator's retry / replan / abort decision.
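A typed gate producing a structured ValidationResult might be sketched as follows. Only the `ValidationResult` name comes from the text above; its fields, the `run_gate` helper, and the two example checks are assumptions, and plain functions stand in for Pydantic / JSON Schema and policy engines to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# A check returns None on success or a failure message.
Check = Callable[[dict], Optional[str]]


@dataclass
class ValidationResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def schema_check(output: dict) -> Optional[str]:
    # Stand-in for schema validation of a stage output.
    if not isinstance(output.get("answer"), str):
        return "schema: 'answer' must be a string"
    if not isinstance(output.get("citations"), list):
        return "schema: 'citations' must be a list"
    return None


def cites_retrieved_doc(output: dict) -> Optional[str]:
    # Semantic invariant: the answer cites at least one retrieved doc.
    return None if output.get("citations") else "invariant: answer cites no retrieved doc"


def run_gate(output: dict, checks: list[Check]) -> ValidationResult:
    """Run every check and collect all failures for the orchestrator's decision."""
    failures = [msg for check in checks if (msg := check(output)) is not None]
    return ValidationResult(passed=not failures, failures=failures)
```

Collecting every failure, rather than stopping at the first, gives the orchestrator enough detail to choose between retry, replan, and abort.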
Selection Decision Table¶
| Failure Type | Low-Latency Interactive | Long-Running Batch | High-Stakes / Regulated |
|---|---|---|---|
| Infinite tool loop | max_turns + hash loop detector | recursion_limit + durable journal | Loop detector + human escalate |
| Sub-agent deadlock | Short per-stage timeout + abort | External monitor + saga compensation | Monitor + mandatory HITL gate |
| Cost overrun | Per-session token cap, fail-fast | Pre-execution budget gate + alert | Hard budget gate + finance approve |
| Provider 5xx / rate limit | Backoff + jitter, fallback model | Circuit breaker + retry budget | Breaker + escalate; no silent swap |
| Schema / grounding fail | 1 retry then surface to user | Replan ≤ 2 then abort | Validation gate blocks; HITL fix |
| Spec / planner malformed | Abort, return error | Replan with critic; bound replans | Abort + ticket; no auto-replan |
| Semantic no-progress | Hash detector → terminate | Replan once then escalate | Escalate immediately |
| Catastrophic / unsafe output | Guardrail block + safe fallback | Guardrail block + halt workflow | Guardrail block + audit log + HITL |
Picking Heuristic¶
- Always wire hop limits and loop detectors on any production fabric.
- Pick circuit breakers when cost velocity or consecutive failures are observable signals.
- Use durable workflow engines (Temporal / Restate / Dapr) when horizon exceeds minutes or human approval is required.
- Use validation gates between every stage transition; never let an agent decide its own gate.
- Apply the replan-at-most-twice rule by default and treat deviations as explicit per-fabric configuration.
- Always use external monitors for deadlock detection; agents cannot reliably self-diagnose loops.
Cross-References¶
- Fabric › Flow: developer-facing platform layer that consumes these patterns.
- Reference › Research › Subagents: coordination patterns Flow enforces.
- Reference › Research › Fabric Test: how the protection mechanisms are validated.