Flow¶
Fabric Flow defines how the configured multi-agent system actually executes: the staged process by which agents work together, the failure states each stage can produce, and the protection mechanisms that keep a failure from cascading into a runaway loop, a deadlock, or an uncontrolled cost spike. Where Setup commits the topology and Data commits the data plane, the Flow layer commits the operational behavior: which stage runs when, what passes between stages, what happens when a stage fails, and what guarantees the fabric makes about termination, cost, and safety.
Fabric Flow is informed by the active agent-reliability research lineage, including Anthropic's Building Effective Agents (which defined the canonical taxonomy of prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and autonomous agents), the MAST multi-agent failure taxonomy (which analyzed 150+ traces across seven frameworks and codified 14 failure modes), LangGraph's human-in-the-loop interrupts and recursion_limit ceiling, AutoGen / AG2 termination conditions, the OpenAI Agents SDK max_turns primitive, durable-execution engines including Temporal, Dapr Workflows, and Restate, and the AWS Bedrock AgentCore policy and automated-reasoning layer. Selection criteria for protection mechanisms are documented in Reference › Research › Fabric Flow.
Canonical Staged Flow¶
Every Flow run advances through five stages: plan, fan-out, fan-in, validate, and finalize. Each stage transition passes through a typed validation gate. Each stage is bounded by an explicit budget. Each stage emits structured events to the audit trail. Failure transitions are always explicit; they do not silently retry, escalate, or abort.
flowchart LR
REQ[Request] --> PLAN[Plan]
PLAN --> G1{Plan Gate}
G1 -->|pass| FO[Fan-Out: Worker 1..N]
G1 -->|fail| ESC[Escalation]
FO --> FI[Fan-In: Aggregate]
FI --> G2{Validate Gate}
G2 -->|pass| FIN[Finalize]
G2 -->|replan| PLAN
G2 -->|escalate| ESC
FIN --> OUT[Outcome]
ESC --> HR[Human Review]
ESC --> AB[Controlled Abort]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class REQ,PLAN,G1,FO,FI,G2,FIN,OUT,ESC,HR,AB stage
The canonical flow is supervisor-shaped by default: the supervisor plans, dispatches workers, aggregates their summaries, validates the aggregate, and either finalizes or replans. Hierarchical, peer-handoff, debate, and SOP topologies are variations on this canonical shape; each ships the same gate, budget, and protection semantics.
Stage Anatomy¶
Every stage in the Flow has the same internal anatomy. The platform applies the same invariants to every stage regardless of which Unitt runs it, which model is pinned, and which connectors it accesses.
flowchart LR
IN[Stage Input + Schema] --> PRE[Pre-Condition Gate]
PRE --> EX[Execute Agent Step]
EX --> POST[Post-Condition Gate]
POST --> OUT[Stage Output + Schema]
EX -. tool calls .-> T[Tools / Connectors]
EX -. memory writes .-> M[Memory Writers]
EX -. tracing .-> O[Observability]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class IN,PRE,EX,POST,OUT,T,M,O stage
- Pre-condition gate validates input schema, policy scope, budget headroom, and required upstream artifacts.
- Execute runs the agent step under the topology's runtime pattern from Emergence › System.
- Post-condition gate validates output schema, policy compliance, confidence threshold, and downstream contract obligations.
- Output + schema is the typed payload that the next stage's pre-condition gate consumes.
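The stage anatomy above can be sketched as a small wrapper that runs a pre-condition gate, the agent step, and a post-condition gate in sequence. This is a minimal illustration rather than the platform's implementation: `GateResult`, `require_keys`, `StageError`, and `run_stage` are hypothetical names, and the gates check only for required keys, where a real gate would also check policy scope, budget headroom, and confidence.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class GateResult:
    ok: bool
    reason: str = ""


# A gate inspects a payload and reports whether the stage may proceed.
Gate = Callable[[Dict[str, Any]], GateResult]


class StageError(Exception):
    """Raised when a pre- or post-condition gate rejects a payload."""


def require_keys(*keys: str) -> Gate:
    """Build a gate that checks the payload carries every required key."""
    def gate(payload: Dict[str, Any]) -> GateResult:
        missing = [k for k in keys if k not in payload]
        return GateResult(not missing, f"missing keys: {missing}" if missing else "")
    return gate


def run_stage(payload: Dict[str, Any], pre: Gate,
              execute: Callable[[Dict[str, Any]], Dict[str, Any]],
              post: Gate) -> Dict[str, Any]:
    """Run one stage: pre-condition gate, agent step, post-condition gate."""
    pre_result = pre(payload)
    if not pre_result.ok:
        raise StageError(f"pre-condition failed: {pre_result.reason}")
    output = execute(payload)
    post_result = post(output)
    if not post_result.ok:
        raise StageError(f"post-condition failed: {post_result.reason}")
    return output
```

Because the output of one stage is the input of the next, the post-condition gate of stage N and the pre-condition gate of stage N+1 can share the same schema definition.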
Failure Mode Taxonomy¶
The MAST taxonomy categorizes multi-agent failures into three clusters: Specification & System Design (~42%), Inter-Agent Misalignment (~37%), and Task Verification & Termination (~21%). Every Flow protection mechanism in the platform maps explicitly to one or more failure classes.
| Failure Class | Representative Cases | Protection Mechanism |
|---|---|---|
| Specification | Ambiguous role, malformed plan, vague brief | Pre-condition schema gate; supervisor critique pass; replan budget |
| Inter-Agent | Information loss across handoff, conflicting sub-goals, premature termination | Brief + summary schemas; hop limit; loop detector; deadlock monitor |
| Verification | "Looks done" but unverified; missing exit criterion | Post-condition gate; outcome oracle; cost-per-success metric |
Protection Mechanism Catalog¶
The platform ships nine protection mechanisms that are wired automatically by Setup and enforced at runtime by the Flow layer. Each is configured per fabric and per stage.
Hop Limit And Recursion Limit¶
Most agent frameworks ship a hard hop ceiling. LangGraph's recursion_limit defaults to 25 and raises GraphRecursionError when exceeded; AutoGen exposes max_round on GroupChat alongside the MaxMessageTermination condition; the OpenAI Agents SDK uses max_turns and raises MaxTurnsExceeded. The platform applies a default hop limit of 5 on peer handoff chains and a default recursion limit of 8 on supervisor / hierarchical depth, both configurable per fabric.
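As a rough illustration of a hop ceiling on a peer handoff chain, the sketch below counts handoffs and refuses the one that would exceed the limit. `HopCounter` and `HopLimitExceeded` are hypothetical names, not platform APIs; only the default of 5 comes from the text above.

```python
class HopLimitExceeded(RuntimeError):
    """Raised when a handoff would push the chain past its hop limit."""


class HopCounter:
    def __init__(self, limit: int = 5):
        self.limit = limit
        self.hops = 0

    def record_handoff(self, src: str, dst: str) -> None:
        # Deny before the handoff happens so the ceiling is a hard bound.
        if self.hops >= self.limit:
            raise HopLimitExceeded(
                f"handoff {src} -> {dst} would exceed hop limit {self.limit}"
            )
        self.hops += 1
```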
Loop Detector¶
Hop limits alone miss semantic loops, where agents cycle through paraphrases of the same call. The platform runs a LoopDetector middleware that hashes (tool_name, normalized_args) over a sliding window (default N = 5, threshold = 3 identical) and either injects a corrective system message or strips tool_calls to force termination.
flowchart LR
TC[Tool Call] --> HSH[Hash tool + args]
HSH --> WIN[Sliding Window N=5]
WIN --> CNT{Count >= 3?}
CNT -->|no| EX[Execute]
CNT -->|yes| INJ[Inject Correction or Force Terminate]
INJ --> AUD[Audit Event]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class TC,HSH,WIN,CNT,EX,INJ,AUD stage
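The sliding-window check in the diagram can be sketched as follows. This is an assumption-laden illustration: the class shape and method names are hypothetical, and "normalization" here only canonicalizes argument key order via sorted JSON. Catching true paraphrases would need a stronger normalizer that this sketch does not attempt.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    def __init__(self, window: int = 5, threshold: int = 3):
        self.threshold = threshold
        # Only the most recent `window` fingerprints are kept.
        self.recent = deque(maxlen=window)

    @staticmethod
    def fingerprint(tool_name: str, args: dict) -> str:
        # sort_keys makes the hash independent of argument order.
        canonical = json.dumps({"tool": tool_name, "args": args},
                               sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def observe(self, tool_name: str, args: dict) -> bool:
        """Record a tool call; return True when it looks like a loop."""
        fp = self.fingerprint(tool_name, args)
        self.recent.append(fp)
        return self.recent.count(fp) >= self.threshold
```

When `observe` returns True, the caller decides between injecting a corrective message and forcing termination; the detector itself only reports the signal.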
Circuit Breaker¶
Agent-aware circuit breakers track soft failures invisible to HTTP-layer breakers: schema-invalid outputs, semantic-invariant violations, identical-tool-call streaks, cost velocity overruns. The state machine follows the Hystrix tradition: CLOSED → OPEN on consecutive failure or cost-rate threshold, OPEN → HALF_OPEN after a cooldown, HALF_OPEN → CLOSED on probe success or OPEN on probe failure. When OPEN, calls route to a configured fallback (cheaper model, static response, or escalation).
flowchart LR
START((Start)) --> CLOSED[CLOSED]
CLOSED -->|failure threshold or cost limit| OPEN[OPEN]
OPEN -->|cooldown elapsed| HALF[HALF_OPEN]
HALF -->|probe success| CLOSED
HALF -->|probe failure| OPEN
OPEN -->|max trips exceeded| STOP((Stop))
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class START,CLOSED,OPEN,HALF,STOP stage
Trip conditions in production: at least 3 consecutive identical tool calls, at least 2 consecutive JSON-schema validation failures, rolling cost rate above the configured per-hour or per-session ceiling, provider 429 / 503 rates exceeding threshold.
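The state machine above can be sketched in a few lines. This is a simplified illustration under stated assumptions: `AgentCircuitBreaker` is a hypothetical name, only consecutive failures trip the breaker (cost-rate and max-trip handling are omitted), and the clock is injectable so the cooldown can be exercised without sleeping.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class AgentCircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = State.CLOSED
        self.consecutive_failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; moves OPEN -> HALF_OPEN after cooldown."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self) -> None:
        # A success (including a successful probe) closes the breaker.
        self.consecutive_failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        # Soft failures (schema-invalid output, repeated tool call) count here too.
        self.consecutive_failures += 1
        if (self.state is State.HALF_OPEN
                or self.consecutive_failures >= self.failure_threshold):
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller checks `allow()` before each model or tool call and routes to the configured fallback when it returns False.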
Retry Budget With Backoff¶
Per-call retry uses exponential backoff with full jitter (sleep = random(0, base * 2^attempt)), governed by a system-wide token-bucket retry budget capping retry traffic at 10-20% of normal load. Without a system-wide budget, a model outage triggers retry storms that compound cost. A hard pre-execution budget gate evaluates token and dollar spend before every model call or tool invocation; exhaustion is a deterministic deny, never an additional retry.
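A minimal sketch of the two pieces described above: the full-jitter delay and a token-bucket retry budget. The names `full_jitter_delay` and `RetryBudget` are hypothetical, and the `cap` parameter is an added assumption (the formula in the text has no upper bound). Each normal request deposits a fraction of a token and each retry spends a whole one, so retries stay near `ratio` of normal traffic.

```python
import random


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng=random.random) -> float:
    """Sleep time drawn uniformly from [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


class RetryBudget:
    """Token bucket shared by all callers of one fabric."""

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0):
        self.ratio = ratio            # tokens earned per normal request
        self.max_tokens = max_tokens  # bounds how many retries can burst at once
        self.tokens = 0.0

    def record_request(self) -> None:
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self) -> bool:
        """Spend one token for a retry; False means the retry is denied."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A denied `try_spend()` should surface as a failure or escalation rather than a queued retry, which keeps the deny deterministic.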
Timeout Propagation¶
Three layers compose: per-step timeout (single LLM call, typically 30-120 s), per-stage timeout (a worker's full sub-flow), and overall workflow timeout (top-level deadline). Durable engines such as Temporal or Restate propagate a workflow deadline as context to every activity so a child activity can short-circuit when the parent budget is nearly exhausted. The platform propagates timeouts as absolute deadlines, never as durations, so retries do not reset the clock.
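To illustrate why absolute deadlines behave differently from durations, the sketch below derives child deadlines that can never outlive their parent. `Deadline` is a hypothetical helper, not a Temporal or Restate API, and the injectable clock exists only to make the behavior easy to check.

```python
import time


class Deadline:
    def __init__(self, expires_at: float, clock=time.monotonic):
        self.expires_at = expires_at  # absolute time on the given clock
        self.clock = clock

    @classmethod
    def after(cls, seconds: float, clock=time.monotonic) -> "Deadline":
        return cls(clock() + seconds, clock)

    def remaining(self) -> float:
        return max(0.0, self.expires_at - self.clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0

    def child(self, max_seconds: float) -> "Deadline":
        """Per-step or per-stage deadline, clamped to the parent's expiry."""
        return Deadline(min(self.expires_at, self.clock() + max_seconds), self.clock)
```

A retry that asks for a fresh 30-second step timeout late in the workflow receives only the time the workflow has left, so retrying cannot extend the top-level deadline.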
Escalation And Human-In-The-Loop¶
The Flow layer integrates with the LangGraph interrupt() pattern (pauses the graph at a checkpoint, persists state, waits for an approve / edit / reject / respond decision), the OpenAI Agents SDK handoff-to-human primitive (often via Temporal or Dapr Diagrid integrations), and the Claude tool-use approval flow. Confidence-threshold escalation routes to human review when a critic / verifier's score is below the configured τ, rather than retrying.
flowchart LR
ST[Stage Result] --> CONF{Confidence >= τ?}
CONF -->|yes| NEXT[Next Stage]
CONF -->|no| ESC[Escalate]
ESC --> CK[Checkpoint State]
CK --> WAIT[Wait For Approval]
WAIT -->|approve| NEXT
WAIT -->|edit| EDIT[Inject Edit] --> NEXT
WAIT -->|reject| AB[Controlled Abort]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class ST,CONF,NEXT,ESC,CK,WAIT,EDIT,AB stage
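The routing in the diagram can be sketched without any framework dependency. Everything here is illustrative: `route_stage_result`, `Decision`, and the `ask_human` callback are hypothetical stand-ins for whatever checkpoint-and-wait mechanism the fabric uses, and the "respond" decision mentioned above is left out for brevity.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Decision(Enum):
    APPROVE = "approve"
    EDIT = "edit"
    REJECT = "reject"


# A reviewer returns a decision plus an optional edited payload.
Reviewer = Callable[[dict], Tuple[Decision, Optional[dict]]]


def route_stage_result(result: dict, confidence: float, tau: float,
                       ask_human: Reviewer) -> Tuple[str, Optional[dict]]:
    """Return ("next", payload) or ("abort", None) for one stage result."""
    if confidence >= tau:
        return "next", result
    # Below threshold: escalate instead of retrying.
    decision, edited = ask_human(result)
    if decision is Decision.APPROVE:
        return "next", result
    if decision is Decision.EDIT and edited is not None:
        return "next", edited
    # Reject, or an edit with no payload, ends in a controlled abort.
    return "abort", None
```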
Idempotency And Saga Compensation¶
Each agent step is idempotent under an idempotency_key (typically hash(step_id, inputs)) so durable replay returns cached results. State-mutating actions register a compensate() function; on failure, the orchestrator runs compensations in LIFO order across the actually-executed steps. Durable execution engines (Temporal, Dapr Workflows v1.15+, Restate Cloud GA 2025) journal each step and replay on crash, returning cached results for completed activities.
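A compact sketch of both ideas, assuming an in-memory cache in place of a durable journal. `idempotency_key` and `Saga` are hypothetical names. The compensation is supplied together with the step before it runs, and is placed on the undo stack only once the step has actually executed.

```python
import hashlib
import json


def idempotency_key(step_id: str, inputs: dict) -> str:
    """Stable hash of (step_id, inputs); key order in inputs does not matter."""
    payload = json.dumps([step_id, inputs], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class Saga:
    def __init__(self):
        self.cache = {}          # idempotency_key -> cached result
        self.compensations = []  # undo stack for executed steps

    def run_step(self, step_id, inputs, action, compensate=None):
        key = idempotency_key(step_id, inputs)
        if key in self.cache:
            # Replay: return the recorded result without re-running the action.
            return self.cache[key]
        result = action(inputs)
        self.cache[key] = result
        if compensate is not None:
            self.compensations.append(lambda: compensate(inputs, result))
        return result

    def rollback(self) -> None:
        """Run compensations in LIFO order over the steps that executed."""
        while self.compensations:
            self.compensations.pop()()
```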
Deadlock Monitor¶
Two classes of deadlock dominate: peer handoff oscillation (A → B → A …) and supervisor stuck state (the supervisor repeatedly dispatches to the same worker, which returns "needs more info"). An external monitor, not the agents themselves, tracks the handoff graph and triggers on cycles of length ≤ 3 occurring more than once, or on supervisor states whose (active_worker, last_message_hash) repeats. On trigger, the monitor routes the stalled pair to a higher tier or to escalation.
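One way to sketch the external monitor is to keep the ordered list of agents that received control and test whether its tail is periodic with a short period. The interpretation of "occurring more than once" as "the same cycle completed twice in a row" is an assumption, as are the names `DeadlockMonitor`, `record_handoff`, and `record_supervisor_state`.

```python
import hashlib


class DeadlockMonitor:
    def __init__(self, max_cycle_len: int = 3, allowed_repeats: int = 1):
        self.max_cycle_len = max_cycle_len
        self.allowed_repeats = allowed_repeats
        self.path = []                  # agents in the order they took control
        self.supervisor_states = set()  # (active_worker, last_message_hash)

    def record_handoff(self, agent: str) -> bool:
        """Return True when a short cycle has repeated more than allowed."""
        self.path.append(agent)
        for length in range(1, self.max_cycle_len + 1):
            # A cycle of `length` hops completed k times spans length*k + 1 entries.
            span = length * (self.allowed_repeats + 1) + 1
            tail = self.path[-span:]
            if len(tail) == span and all(
                tail[i] == tail[i + length] for i in range(span - length)
            ):
                return True
        return False

    def record_supervisor_state(self, active_worker: str, last_message: str) -> bool:
        """Return True when the supervisor revisits an identical state."""
        digest = hashlib.sha256(last_message.encode("utf-8")).hexdigest()
        state = (active_worker, digest)
        if state in self.supervisor_states:
            return True
        self.supervisor_states.add(state)
        return False
```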
Validation Gates Between Stages¶
Every stage transition passes through a typed gate that runs four checks in order: schema validation (Pydantic / JSON Schema), policy validation (OPA / Cedar / Bedrock AgentCore Policy authorizing each action pre-execution), semantic invariants ("answer cites at least one retrieved doc"), and grounding / hallucination checks. Gates produce a structured ValidationResult consumed by the orchestrator's retry / replan / abort decision.
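The ordered checks can be sketched as a list of named functions evaluated in sequence. The field names on `ValidationResult` and the stop-at-first-failure behavior are assumptions, since the text specifies only the order of the checks. Each check is a plain callable here; in practice the schema step might delegate to Pydantic or JSON Schema and the policy step to OPA or Cedar.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationResult:
    passed: bool
    failed_check: Optional[str] = None
    detail: str = ""


# A check returns None on success or an error message on failure.
Check = Callable[[dict], Optional[str]]


def run_gate(payload: dict, checks: List[Tuple[str, Check]]) -> ValidationResult:
    """Run checks in order and stop at the first failure."""
    for name, check in checks:
        error = check(payload)
        if error is not None:
            return ValidationResult(False, name, error)
    return ValidationResult(True)


# Hypothetical checks in the order: schema, policy, semantic, grounding.
EXAMPLE_CHECKS: List[Tuple[str, Check]] = [
    ("schema", lambda p: None if isinstance(p.get("answer"), str) else "answer must be a string"),
    ("policy", lambda p: None if p.get("action") != "delete" else "delete is not authorized"),
    ("semantic", lambda p: None if p.get("citations") else "answer cites no retrieved doc"),
    ("grounding", lambda p: None if set(p.get("citations", [])) <= set(p.get("retrieved", []))
        else "citation not in retrieved set"),
]
```

The orchestrator can branch on `failed_check`: for example, a schema failure may be retried while a policy failure is escalated.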
Failure Decision Table¶
The decision a stage makes on a failure is deterministic: replan, retry, escalate, or controlled abort. Defaults are based on the failure class and the workload tier, and are documented in Reference › Research › Fabric Flow.
| Failure Type | Low-Latency Interactive | Long-Running Batch | High-Stakes / Regulated |
|---|---|---|---|
| Infinite tool loop | max_turns + loop detector | Recursion limit + durable journal | Loop detector + human escalate |
| Sub-agent deadlock | Per-stage timeout + abort | External monitor + saga compensation | Monitor + mandatory HITL gate |
| Cost overrun | Per-session token cap, fail-fast | Pre-execution budget gate + alert | Hard budget gate + finance approve |
| Provider 5xx / rate-limit | Backoff + jitter, fallback model | Circuit breaker + retry budget | Breaker + escalate; no silent swap |
| Schema / grounding fail | 1 retry then surface to user | Replan ≤ 2 then abort | Validation gate blocks; HITL fix |
| Spec / planner malformed | Abort, return error | Replan with critic; bounded replans | Abort + ticket; no auto-replan |
| Semantic no-progress | Loop detector → terminate | Replan once then escalate | Escalate immediately |
| Catastrophic / unsafe output | Guardrail block + safe fallback | Guardrail block + halt workflow | Guardrail block + audit log + HITL |
The MAS-Orchestra and Cogent 2026 orchestration playbooks both converge on the same rule of thumb: replan at most twice per stage; on the third failure, escalate or abort. The Flow layer enforces this rule by default and treats deviations as explicit per-fabric configuration.
Retry With Escalation¶
Every step that can recover follows the same retry-with-escalation flow. Classification at error time decides whether the failure is transient (backoff + retry within budget), semantic (replan within budget), policy (guardrail block, escalate), or budget-exhausted (fail-fast, escalate).
flowchart TD
CALL[Step Call] --> RES{Success?}
RES -->|yes| DONE[Continue]
RES -->|no| CLS{Classify Error}
CLS -->|transient| BO[Backoff + Jitter]
BO --> BC{Budget OK?}
BC -->|yes| CALL
BC -->|no| ESC[Escalate]
CLS -->|semantic| RPL{Replans < 2?}
RPL -->|yes| PLAN[Replan]
RPL -->|no| ESC
CLS -->|policy| GB[Guardrail Block]
GB --> ESC
CLS -->|budget| FF[Fail-Fast]
FF --> ESC
ESC --> HRI[Human Interrupt]
HRI --> RES2{Resume Action}
RES2 -->|approve| CALL
RES2 -->|edit| CALL
RES2 -->|reject| AB[Controlled Abort]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class CALL,RES,CLS,BO,BC,RPL,PLAN,GB,FF,ESC,HRI,RES2,DONE,AB stage
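The classification branch of the flowchart reduces to a small pure function. The enum and function names below are hypothetical; the replan bound of 2 follows the default stated earlier in this section.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    SEMANTIC = "semantic"
    POLICY = "policy"
    BUDGET = "budget"


class Action(Enum):
    RETRY = "retry"        # backoff + jitter, then call again
    REPLAN = "replan"
    ESCALATE = "escalate"  # human interrupt decides approve / edit / reject


def next_action(error_class: ErrorClass, retry_budget_ok: bool,
                replans_used: int, max_replans: int = 2) -> Action:
    """Map a classified failure to the next step in the flow."""
    if error_class is ErrorClass.TRANSIENT:
        return Action.RETRY if retry_budget_ok else Action.ESCALATE
    if error_class is ErrorClass.SEMANTIC:
        return Action.REPLAN if replans_used < max_replans else Action.ESCALATE
    # Policy blocks and budget exhaustion never retry; they escalate directly.
    return Action.ESCALATE
```

Keeping the decision pure (no I/O, no hidden state) makes it straightforward to log the inputs and the chosen action as a single audit event.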
Replanning Versus Aborting¶
Replan when the failure is recoverable and bounded: tool returned a decodable constraint violation, schema validation failed once, or a sub-agent returned a partial result. Abort when retry budget is exhausted, circuit breaker is open, the failure is MAST-class specification (plan itself is malformed), or repeated-state hashing indicates no semantic progress for K iterations. The decision is recorded as an explicit governance event; no silent retry, no silent abort.
Cost Velocity Monitoring¶
Cost is a first-class failure signal. The Flow layer integrates a cost-velocity monitor that tracks rolling token and dollar spend per fabric run, per agent, and per stage. When velocity exceeds the configured threshold, the monitor opens the cost circuit breaker, routes the run to a cheaper-model fallback when configured, or escalates to human review. Production reports of 10× cost blowups from uncapped sub-agent spawning are the motivating evidence; the platform applies the cap by default.
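A rolling-window monitor along these lines can be sketched as follows. `CostVelocityMonitor`, its parameters, and the caller-supplied timestamps are assumptions for illustration; a production monitor would track separate windows per fabric run, per agent, and per stage.

```python
from collections import deque


class CostVelocityMonitor:
    def __init__(self, window_s: float = 60.0, ceiling_usd: float = 1.0):
        self.window_s = window_s
        self.ceiling_usd = ceiling_usd
        self.events = deque()  # (timestamp_s, usd), oldest first

    def spend_in_window(self) -> float:
        return sum(usd for _, usd in self.events)

    def record(self, timestamp_s: float, usd: float) -> bool:
        """Record spend; return True when the rolling spend exceeds the ceiling."""
        self.events.append((timestamp_s, usd))
        # Drop events that have aged out of the rolling window.
        while self.events and self.events[0][0] <= timestamp_s - self.window_s:
            self.events.popleft()
        return self.spend_in_window() > self.ceiling_usd
```

A True return is the signal to open the cost circuit breaker, switch to a cheaper fallback, or escalate, depending on the fabric's configuration.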
Governance Of Flow¶
Every Flow decision is a governance event. Validation gates emit structured ValidationResult events. Retries, replans, escalations, and aborts are recorded with cause and recovery action. Loop and deadlock detector trips are tagged with the cycle they observed. Cost circuit breaker trips include rolling velocity and ceiling. The audit trail is sufficient to reconstruct any Flow run end-to-end.
Flow Governance Requirements
- Every stage has explicit pre-condition and post-condition gates.
- Every fabric has explicit hop, recursion, retry, timeout, and cost budgets.
- Loop and deadlock detection are mandatory; agents cannot self-diagnose loops.
- Escalation paths are explicit; silent retry is not permitted.
- State-mutating side effects have compensating actions registered before execution.
- Every Flow decision emits a structured event to the audit trail.
Cross-References¶
- Setup: topology, identity, and budgets that the Flow layer enforces.
- Data: schema-validated payloads that pass through Flow gates.
- Test: how Flow behavior is validated end-to-end, including failure modes.
- Emergence › System: runtime patterns each stage can run.
- Emergence › Subunits: composition rules consumed by Flow.
- Reference › Research › Fabric Flow: citations, MAST taxonomy, and protection-mechanism research.