Flow¶

Fabric Flow defines how the configured multi-agent system actually executes: the staged process by which agents work together, the failure states each stage can produce, and the protection mechanisms that keep any failure from cascading into a runaway loop, a deadlock, or an uncontrolled cost spike. Where Setup commits the topology and Data commits the data plane, the Flow layer commits the operational behavior: which stage runs when, what passes between them, what happens when a stage fails, and what guarantees the fabric makes about termination, cost, and safety.

Fabric Flow draws on current agent-reliability research and tooling: Anthropic's Building Effective Agents (which defined the canonical taxonomy of prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and autonomous agents), the MAST multi-agent failure taxonomy (which analyzed 150+ traces across seven frameworks and codified 14 failure modes), LangGraph human-in-the-loop interrupts and the recursion_limit ceiling, AutoGen / AG2 termination conditions, the OpenAI Agents SDK max_turns primitive, durable-execution engines including Temporal, Dapr Workflows, and Restate, and the AWS Bedrock AgentCore policy and automated-reasoning layer. Selection criteria for protection mechanisms are documented in Reference › Research › Fabric Flow.

Canonical Staged Flow¶

Every Flow run advances through five stages: plan, fan-out, fan-in, validate, and finalize. Each stage transition passes through a typed validation gate. Each stage is bounded by an explicit budget. Each stage emits structured events to the audit trail. Failure transitions are always explicit; they do not silently retry, escalate, or abort.

flowchart LR
    REQ[Request] --> PLAN[Plan]
    PLAN --> G1{Plan Gate}
    G1 -->|pass| FO[Fan-Out: Worker 1..N]
    G1 -->|fail| ESC[Escalation]

    FO --> FI[Fan-In: Aggregate]
    FI --> G2{Validate Gate}
    G2 -->|pass| FIN[Finalize]
    G2 -->|replan| PLAN
    G2 -->|escalate| ESC

    FIN --> OUT[Outcome]
    ESC --> HR[Human Review]
    ESC --> AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class REQ,PLAN,G1,FO,FI,G2,FIN,OUT,ESC,HR,AB stage

The canonical flow is supervisor-shaped by default: the supervisor plans, dispatches workers, aggregates their summaries, validates the aggregate, and either finalizes or replans. Hierarchical, peer-handoff, debate, and SOP topologies are variations on this canonical shape, and each ships the same gate, budget, and protection semantics.

Stage Anatomy¶

Every stage in the Flow has the same internal anatomy. The platform applies the same invariants to every stage regardless of which Unitt runs it, which model is pinned, and which connectors it accesses.

flowchart LR
    IN[Stage Input + Schema] --> PRE[Pre-Condition Gate]
    PRE --> EX[Execute Agent Step]
    EX --> POST[Post-Condition Gate]
    POST --> OUT[Stage Output + Schema]

    EX -. tool calls .-> T[Tools / Connectors]
    EX -. memory writes .-> M[Memory Writers]
    EX -. tracing .-> O[Observability]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class IN,PRE,EX,POST,OUT,T,M,O stage

Pre-condition gate validates input schema, policy scope, budget headroom, and required upstream artifacts.
Execute runs the agent step under the topology's runtime pattern from Emergence › System.
Post-condition gate validates output schema, policy compliance, confidence threshold, and downstream contract obligations.
Output + schema is the typed payload that the next stage's pre-condition gate consumes.
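The stage anatomy above can be sketched as a small wrapper that runs a pre-condition gate, the agent step, and a post-condition gate in sequence. This is a minimal illustration, not the platform's implementation: `Stage`, `GateError`, and `require` are hypothetical names, and the gates check only field presence rather than policy scope, budget headroom, or confidence.

```python
from dataclasses import dataclass
from typing import Any, Callable

Payload = dict[str, Any]
Gate = Callable[[Payload], list[str]]  # returns violations; an empty list means pass


class GateError(Exception):
    """Raised when a pre- or post-condition gate rejects a payload."""


def require(*keys: str) -> Gate:
    """Hypothetical schema gate: every listed key must be present."""
    return lambda payload: [f"missing field: {k}" for k in keys if k not in payload]


@dataclass
class Stage:
    name: str
    pre_gate: Gate
    execute: Callable[[Payload], Payload]
    post_gate: Gate

    def run(self, payload: Payload) -> Payload:
        # Pre-condition gate: reject bad input before any agent work is done.
        violations = self.pre_gate(payload)
        if violations:
            raise GateError(f"{self.name} pre-condition failed: {violations}")
        output = self.execute(payload)
        # Post-condition gate: the next stage only ever sees validated output.
        violations = self.post_gate(output)
        if violations:
            raise GateError(f"{self.name} post-condition failed: {violations}")
        return output


plan = Stage(
    name="plan",
    pre_gate=require("request"),
    execute=lambda p: {"steps": [f"research: {p['request']}"]},
    post_gate=require("steps"),
)
```

Calling `plan.run({"request": "q"})` returns the typed output for the next stage, while `plan.run({})` raises `GateError` before the agent step executes.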

Failure Mode Taxonomy¶

The MAST taxonomy categorizes multi-agent failures into three clusters: Specification & System Design (~42%), Inter-Agent Misalignment (~37%), and Task Verification & Termination (~21%). Every Flow protection mechanism in the platform maps explicitly to one or more failure classes.

Failure Class	Representative Cases	Protection Mechanism
Specification	Ambiguous role, malformed plan, vague brief	Pre-condition schema gate; supervisor critique pass; replan budget
Inter-Agent	Information loss across handoff, conflicting sub-goals, premature termination	Brief + summary schemas; hop limit; loop detector; deadlock monitor
Verification	"Looks done" but unverified; missing exit criterion	Post-condition gate; outcome oracle; cost-per-success metric

Protection Mechanism Catalog¶

The platform ships nine protection mechanisms that are wired automatically by Setup and enforced at runtime by the Flow layer. Each is configured per fabric and per stage.

Hop Limit And Recursion Limit¶

The major agent frameworks each ship a hard hop ceiling: LangGraph's recursion_limit defaults to 25 and raises GraphRecursionError when exceeded, AutoGen GroupChat exposes max_round, which can be combined with MaxMessageTermination, and the OpenAI Agents SDK uses max_turns, raising MaxTurnsExceeded in the Python SDK when the limit is hit. The platform applies a default hop limit of 5 on peer handoff chains and a default recursion limit of 8 on supervisor / hierarchical depth, both configurable per fabric.

Loop Detector¶

Hop limits alone miss semantic loops, where agents cycle through paraphrases of the same call. The platform runs a LoopDetector middleware that hashes (tool_name, normalized_args) over a sliding window (default N = 5, threshold = 3 identical) and either injects a corrective system message or strips tool_calls to force termination.

flowchart LR
    TC[Tool Call] --> HSH[Hash tool + args]
    HSH --> WIN[Sliding Window N=5]
    WIN --> CNT{Count >= 3?}
    CNT -->|no| EX[Execute]
    CNT -->|yes| INJ[Inject Correction or Force Terminate]
    INJ --> AUD[Audit Event]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class TC,HSH,WIN,CNT,EX,INJ,AUD stage
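A sliding-window detector of this kind could look roughly like the sketch below. It is not the platform's LoopDetector middleware: the class shape, the `observe` method, and the normalization rule (trimming and lowercasing string arguments, sorting keys) are assumptions chosen for illustration.

```python
import hashlib
import json
from collections import deque
from typing import Any


class LoopDetector:
    def __init__(self, window: int = 5, threshold: int = 3) -> None:
        self.recent: deque[str] = deque(maxlen=window)  # last N call fingerprints
        self.threshold = threshold

    @staticmethod
    def fingerprint(tool_name: str, args: dict[str, Any]) -> str:
        # Assumed normalization: trim and lowercase strings, then sort keys.
        normalized = {
            k: v.strip().lower() if isinstance(v, str) else v
            for k, v in args.items()
        }
        blob = json.dumps([tool_name, normalized], sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()

    def observe(self, tool_name: str, args: dict[str, Any]) -> bool:
        """Record a pending tool call; return True if it should be blocked."""
        fp = self.fingerprint(tool_name, args)
        self.recent.append(fp)
        return self.recent.count(fp) >= self.threshold


detector = LoopDetector()
for query in ["Foo", "foo ", " FOO"]:
    looping = detector.observe("search", {"q": query})
# `looping` is True after the third paraphrase of the same call
```

The caller decides what a trip means, whether that is injecting a corrective message or stripping the tool calls. The detector only reports that the count within the window reached the threshold.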

Circuit Breaker¶

Agent-aware circuit breakers track soft failures invisible to HTTP-layer breakers: schema-invalid outputs, semantic-invariant violations, identical-tool-call streaks, cost velocity overruns. The state machine follows the Hystrix tradition: CLOSED → OPEN on consecutive failure or cost-rate threshold, OPEN → HALF_OPEN after a cooldown, HALF_OPEN → CLOSED on probe success or OPEN on probe failure. When OPEN, calls route to a configured fallback (cheaper model, static response, or escalation).

flowchart LR
    START((Start)) --> CLOSED[CLOSED]
    CLOSED -->|failure threshold or cost limit| OPEN[OPEN]
    OPEN -->|cooldown elapsed| HALF[HALF_OPEN]
    HALF -->|probe success| CLOSED
    HALF -->|probe failure| OPEN
    OPEN -->|max trips exceeded| STOP((Stop))

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class START,CLOSED,OPEN,HALF,STOP stage

Trip conditions in production: at least 3 consecutive identical tool calls, at least 2 consecutive JSON-schema validation failures, rolling cost rate above the configured per-hour or per-session ceiling, provider 429 / 503 rates exceeding threshold.
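The state machine can be sketched in a few lines. This is a simplified illustration under stated assumptions: `AgentCircuitBreaker` is a hypothetical name, only consecutive soft failures trip it (cost-rate and provider-error trips are omitted), and the HALF_OPEN state does not limit the number of concurrent probes.

```python
import time
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class AgentCircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        """Return True if a call may proceed; moves OPEN to HALF_OPEN after cooldown."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self) -> None:
        # A success (including a successful probe) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        # Soft failures count too: schema-invalid output, invariant violation, etc.
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

When `allow()` returns False, the caller would route to the configured fallback rather than invoke the model. Injecting `clock` keeps the cooldown testable without sleeping.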

Retry Budget With Backoff¶

Per-call retry uses exponential backoff with full jitter (sleep = random(0, base * 2^attempt)), governed by a system-wide token-bucket retry budget capping retry traffic at 10-20% of normal load. Without a system-wide budget, a model outage triggers retry storms that compound cost. A hard pre-execution budget gate evaluates token and dollar spend before every model call or tool invocation; exhaustion is a deterministic deny, never an additional retry.
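The two pieces, the per-call delay and the system-wide budget, can be sketched as follows. The names `full_jitter_delay` and `RetryBudget`, the `cap` parameter, and the integer-unit accounting are assumptions made for this example. The formula in the text has no cap, so the cap here is an added safeguard.

```python
import random
from typing import Callable


def full_jitter_delay(
    attempt: int,
    base: float = 0.5,
    cap: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Full jitter: a uniform draw from [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


class RetryBudget:
    """Token bucket shared by all callers.

    Each first-attempt request deposits `retry_percent` units and each retry
    costs 100 units, so sustained retry traffic is held to roughly
    retry_percent % of normal load. Integer units avoid float drift.
    """

    def __init__(self, retry_percent: int = 10, max_banked_retries: int = 10) -> None:
        self.deposit = retry_percent
        self.cap = max_banked_retries * 100
        self.units = 0

    def record_request(self) -> None:
        self.units = min(self.cap, self.units + self.deposit)

    def try_spend(self) -> bool:
        """Return True and spend one retry if the budget allows it."""
        if self.units >= 100:
            self.units -= 100
            return True
        return False


budget = RetryBudget(retry_percent=10)
for _ in range(10):
    budget.record_request()
first_retry_allowed = budget.try_spend()   # ten requests bank one retry
second_retry_allowed = budget.try_spend()  # the bucket is now empty
```

A denied `try_spend()` should surface as a failure to the caller instead of queueing another attempt, which is what prevents retry storms during a model outage.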

Timeout Propagation¶

Three layers compose: per-step timeout (single LLM call, typically 30-120 s), per-stage timeout (a worker's full sub-flow), and overall workflow timeout (top-level deadline). Durable engines such as Temporal or Restate propagate a workflow deadline as context to every activity so a child activity can short-circuit when the parent budget is nearly exhausted. The platform propagates timeouts as absolute deadlines, never as durations, so retries do not reset the clock.
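The difference between propagating a deadline and propagating a duration is easiest to see in code. The `Deadline` class below is a hypothetical helper, not an API from Temporal, Restate, or the platform. It shows how a child deadline can narrow the parent's but never extend it, so a retry inherits the original expiry.

```python
import time
from typing import Callable


class DeadlineExceeded(Exception):
    pass


class Deadline:
    def __init__(self, expires_at: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.expires_at = expires_at  # absolute point on the clock, not a duration
        self.clock = clock

    @classmethod
    def after(cls, seconds: float, clock: Callable[[], float] = time.monotonic) -> "Deadline":
        return cls(clock() + seconds, clock)

    def remaining(self) -> float:
        return max(0.0, self.expires_at - self.clock())

    def child(self, seconds: float) -> "Deadline":
        """Deadline for a stage or step: may be tighter than the parent, never looser."""
        return Deadline(min(self.expires_at, self.clock() + seconds), self.clock)

    def check(self, min_headroom: float = 0.0) -> None:
        """Short-circuit when the remaining budget is at or below `min_headroom`."""
        if self.remaining() <= min_headroom:
            raise DeadlineExceeded(f"{self.remaining():.1f}s remaining")


now = [100.0]                                        # fake clock for the example
workflow = Deadline.after(60, clock=lambda: now[0])  # expires at t=160
stage = workflow.child(45)                           # expires at t=145
now[0] = 130.0
step = stage.child(30)   # a 30 s step timeout is clipped to the stage's t=145
```

Because `step.expires_at` is clipped to 145, a retried step created later from the same `stage` gets the same expiry and cannot reset the clock.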

Escalation And Human-In-The-Loop¶

The Flow layer integrates with the LangGraph interrupt() pattern (pauses the graph at a checkpoint, persists state, waits for an approve / edit / reject / respond decision), the OpenAI Agents SDK handoff-to-human primitive (often via Temporal or Dapr Diagrid integrations), and the Claude tool-use approval flow. Confidence-threshold escalation routes to human review when a critic / verifier's score is below the configured τ, rather than retrying.

flowchart LR
    ST[Stage Result] --> CONF{Confidence >= τ?}
    CONF -->|yes| NEXT[Next Stage]
    CONF -->|no| ESC[Escalate]
    ESC --> CK[Checkpoint State]
    CK --> WAIT[Wait For Approval]
    WAIT -->|approve| NEXT
    WAIT -->|edit| EDIT[Inject Edit] --> NEXT
    WAIT -->|reject| AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class ST,CONF,NEXT,ESC,CK,WAIT,EDIT,AB stage

Idempotency And Saga Compensation¶

Each agent step is idempotent under an idempotency_key (typically hash(step_id, inputs)) so durable replay returns cached results. State-mutating actions register a compensate() function; on failure, the orchestrator runs compensations in LIFO order across the actually-executed steps. Durable execution engines (Temporal, Dapr Workflows v1.15+, Restate Cloud GA 2025) journal each step and replay on crash, returning cached results for completed activities.
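A minimal in-memory sketch of both ideas follows. The `Saga` class and its dictionary journal are stand-ins for a durable engine's journal, and all names are hypothetical. One design choice is assumed here: the compensation is supplied with the step before it runs, but it joins the undo stack only once the step has actually executed.

```python
import hashlib
import json
from typing import Any, Callable, Optional


def idempotency_key(step_id: str, inputs: dict[str, Any]) -> str:
    """hash(step_id, inputs) with stable key ordering."""
    blob = json.dumps({"step": step_id, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


class Saga:
    def __init__(self) -> None:
        self.journal: dict[str, Any] = {}  # stand-in for a durable journal
        self.compensations: list[Callable[[], None]] = []

    def run_step(
        self,
        step_id: str,
        inputs: dict[str, Any],
        action: Callable[[dict[str, Any]], Any],
        compensate: Optional[Callable[[dict[str, Any], Any], None]] = None,
    ) -> Any:
        key = idempotency_key(step_id, inputs)
        if key in self.journal:
            return self.journal[key]  # replay: return the cached result
        try:
            result = action(inputs)
        except Exception:
            self.unwind()  # undo every step that actually executed
            raise
        self.journal[key] = result
        if compensate is not None:
            self.compensations.append(lambda: compensate(inputs, result))
        return result

    def unwind(self) -> None:
        # LIFO: the most recently executed step is compensated first.
        while self.compensations:
            self.compensations.pop()()
```

Replaying `run_step` with the same `step_id` and inputs returns the journaled result without invoking `action` again. A failing step triggers `unwind()` and then re-raises so the orchestrator can record the abort.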

Deadlock Monitor¶

Two classes of deadlock dominate: peer handoff oscillation (A → B → A …) and supervisor stuck state (the supervisor repeatedly dispatches to the same worker, which returns "needs more info"). An external monitor, not the agents themselves, tracks the handoff graph and triggers on cycles of length ≤ 3 that occur more than once, or on supervisor states whose (active_worker, last_message_hash) repeats. On trigger, the monitor routes the stalled pair to a higher tier or to escalation.
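Both triggers can be approximated with very little state. The sketch below is one possible reading of the rule, not the platform's monitor: `DeadlockMonitor` is a hypothetical name, and a cycle is treated as having "occurred more than once" when the most recent handoffs consist of the same block of two or three agents twice in a row.

```python
import hashlib


class DeadlockMonitor:
    def __init__(self, max_cycle_len: int = 3) -> None:
        self.max_cycle_len = max_cycle_len
        self.holders: list[str] = []                        # agents in handoff order
        self.supervisor_states: set[tuple[str, str]] = set()

    def record_handoff(self, to_agent: str) -> bool:
        """Return True when a short handoff cycle has just repeated."""
        self.holders.append(to_agent)
        for n in range(2, self.max_cycle_len + 1):
            tail = self.holders[-2 * n:]
            # Same block of n agents twice in a row, e.g. A, B, A, B.
            if len(tail) == 2 * n and tail[:n] == tail[n:]:
                return True
        return False

    def record_supervisor_state(self, active_worker: str, last_message: str) -> bool:
        """Return True when (active_worker, last_message_hash) has been seen before."""
        digest = hashlib.sha256(last_message.encode()).hexdigest()
        state = (active_worker, digest)
        if state in self.supervisor_states:
            return True
        self.supervisor_states.add(state)
        return False


monitor = DeadlockMonitor()
trips = [monitor.record_handoff(agent) for agent in ["A", "B", "A", "B"]]
# trips == [False, False, False, True]
```

The monitor runs outside the agents and only observes handoff events, so it keeps working even when the agents involved are the ones that are stuck.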

Validation Gates Between Stages¶

Every stage transition passes through a typed gate that runs four checks in order: schema validation (Pydantic / JSON Schema), policy validation (OPA / Cedar / Bedrock AgentCore Policy authorizing each action pre-execution), semantic invariants ("answer cites at least one retrieved doc"), and grounding / hallucination checks. Gates produce a structured ValidationResult consumed by the orchestrator's retry / replan / abort decision.
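The ordered, short-circuiting structure of a gate might be sketched as below. Only the ValidationResult name comes from the text; its fields, the `run_gate` function, and the four toy checks are assumptions. The checks stand in for what a real deployment would delegate to Pydantic or JSON Schema, a policy engine, and a grounding evaluator.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

Payload = dict[str, Any]
Check = Callable[[Payload], Optional[str]]  # returns an error message, or None on pass


@dataclass
class ValidationResult:
    passed: bool
    failed_check: Optional[str] = None
    detail: str = ""


ALLOWED_ACTIONS = {"read", "summarize"}  # hypothetical policy scope


def check_schema(p: Payload) -> Optional[str]:
    if not isinstance(p.get("answer"), str):
        return "answer must be a string"
    if not isinstance(p.get("citations"), list):
        return "citations must be a list"
    return None


def check_policy(p: Payload) -> Optional[str]:
    return None if p.get("action") in ALLOWED_ACTIONS else f"action not permitted: {p.get('action')}"


def check_semantic(p: Payload) -> Optional[str]:
    # Invariant from the text: the answer cites at least one retrieved doc.
    return None if p["citations"] else "answer cites no documents"


def check_grounding(p: Payload) -> Optional[str]:
    unknown = set(p["citations"]) - set(p.get("retrieved_docs", []))
    return f"citations not in retrieved set: {sorted(unknown)}" if unknown else None


CHECKS: list[tuple[str, Check]] = [
    ("schema", check_schema),
    ("policy", check_policy),
    ("semantic", check_semantic),
    ("grounding", check_grounding),
]


def run_gate(payload: Payload, checks: list[tuple[str, Check]] = CHECKS) -> ValidationResult:
    for name, check in checks:  # order matters: later checks assume earlier ones passed
        error = check(payload)
        if error:
            return ValidationResult(False, name, error)
    return ValidationResult(True)
```

Running the checks in a fixed order lets the cheap schema check protect the later checks from malformed payloads, and `failed_check` tells the orchestrator which failure class to act on.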

Failure Decision Table¶

The decision a stage makes on a failure is deterministic and takes one of four forms: replan, retry, escalate, or controlled abort. Defaults are based on the failure class and the workload tier, and are documented in Reference › Research › Fabric Flow.

Failure Type	Low-Latency Interactive	Long-Running Batch	High-Stakes / Regulated
Infinite tool loop	`max_turns` + loop detector	Recursion limit + durable journal	Loop detector + human escalate
Sub-agent deadlock	Per-stage timeout + abort	External monitor + saga compensation	Monitor + mandatory HITL gate
Cost overrun	Per-session token cap, fail-fast	Pre-execution budget gate + alert	Hard budget gate + finance approve
Provider 5xx / rate-limit	Backoff + jitter, fallback model	Circuit breaker + retry budget	Breaker + escalate; no silent swap
Schema / grounding fail	1 retry then surface to user	Replan ≤ 2 then abort	Validation gate blocks; HITL fix
Spec / planner malformed	Abort, return error	Replan with critic; bounded replans	Abort + ticket; no auto-replan
Semantic no-progress	Loop detector → terminate	Replan once then escalate	Escalate immediately
Catastrophic / unsafe output	Guardrail block + safe fallback	Guardrail block + halt workflow	Guardrail block + audit log + HITL

The MAS-Orchestra and Cogent 2026 orchestration playbooks both converge on the same rule of thumb: replan at most twice per stage; on the third failure, escalate or abort. The Flow layer enforces this rule by default and treats deviations as explicit per-fabric configuration.

Retry With Escalation¶

Every step that can recover follows the same retry-with-escalation flow. Classification at error time decides whether the failure is transient (backoff + retry within budget), semantic (replan within budget), policy (guardrail block, escalate), or budget-exhausted (fail-fast, escalate).

flowchart TD
    CALL[Step Call] --> RES{Success?}
    RES -->|yes| DONE[Continue]
    RES -->|no| CLS{Classify Error}

    CLS -->|transient| BO[Backoff + Jitter]
    BO --> BC{Budget OK?}
    BC -->|yes| CALL
    BC -->|no| ESC[Escalate]

    CLS -->|semantic| RPL{Replans < 2?}
    RPL -->|yes| PLAN[Replan]
    RPL -->|no| ESC

    CLS -->|policy| GB[Guardrail Block]
    GB --> ESC

    CLS -->|budget| FF[Fail-Fast]
    FF --> ESC

    ESC --> HRI[Human Interrupt]
    HRI --> RES2{Resume Action}
    RES2 -->|approve| CALL
    RES2 -->|edit| CALL
    RES2 -->|reject| AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class CALL,RES,CLS,BO,BC,RPL,PLAN,GB,FF,ESC,HRI,RES2,DONE,AB stage
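The branching in this flow reduces to a small pure function. The sketch below is illustrative only: the enum names, the `BudgetExhausted` exception, and the mapping from Python exception types to error classes are assumptions, and a real classifier would likely inspect provider status codes and gate results instead.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    SEMANTIC = "semantic"
    POLICY = "policy"
    BUDGET = "budget"


class Action(Enum):
    RETRY = "retry"        # backoff + jitter, then call again
    REPLAN = "replan"
    ESCALATE = "escalate"  # hand off to the human interrupt


class BudgetExhausted(Exception):
    """Hypothetical error raised by a pre-execution budget gate."""


def classify(exc: Exception) -> ErrorClass:
    # Assumed mapping, for illustration only.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    if isinstance(exc, PermissionError):
        return ErrorClass.POLICY
    if isinstance(exc, BudgetExhausted):
        return ErrorClass.BUDGET
    return ErrorClass.SEMANTIC


def decide(
    error_class: ErrorClass,
    retry_budget_ok: bool,
    replans_used: int,
    max_replans: int = 2,
) -> Action:
    if error_class is ErrorClass.TRANSIENT:
        return Action.RETRY if retry_budget_ok else Action.ESCALATE
    if error_class is ErrorClass.SEMANTIC:
        return Action.REPLAN if replans_used < max_replans else Action.ESCALATE
    # Policy blocks and budget exhaustion never retry; they go straight to escalation.
    return Action.ESCALATE
```

Keeping `decide` free of side effects makes the decision easy to log as a governance event alongside its inputs.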

Replanning Versus Aborting¶

Replan when the failure is recoverable and bounded: a tool returned a decodable constraint violation, schema validation failed once, or a sub-agent returned a partial result. Abort when the retry budget is exhausted, the circuit breaker is open, the failure falls in the MAST specification class (the plan itself is malformed), or repeated-state hashing indicates no semantic progress for K iterations. The decision is recorded as an explicit governance event: no silent retry, no silent abort.

Cost Velocity Monitoring¶

Cost is a first-class failure signal. The Flow layer integrates a cost-velocity monitor that tracks rolling token and dollar spend per fabric run, per agent, and per stage. When velocity exceeds the configured threshold, the monitor opens the cost circuit breaker, routes the run to a cheaper-model fallback when configured, or escalates to human review. Production reports of 10× cost blowups from uncapped sub-agent spawning are the motivating evidence; the platform applies the cap by default.
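A rolling-window spend tracker is enough to illustrate the velocity check. `CostVelocityMonitor`, its parameters, and the choice to pass timestamps in explicitly are assumptions for this sketch. It tracks a single scope, whereas the text describes separate tracking per run, per agent, and per stage.

```python
from collections import deque


class CostVelocityMonitor:
    def __init__(self, ceiling_usd: float, window_s: float = 3600.0) -> None:
        self.ceiling_usd = ceiling_usd
        self.window_s = window_s
        self.events: deque[tuple[float, float]] = deque()  # (timestamp, usd)

    def record(self, timestamp: float, usd: float) -> bool:
        """Record spend; return True when rolling spend exceeds the ceiling."""
        self.events.append((timestamp, usd))
        # Evict spend that has aged out of the rolling window.
        while self.events and self.events[0][0] <= timestamp - self.window_s:
            self.events.popleft()
        return self.rolling_spend() > self.ceiling_usd

    def rolling_spend(self) -> float:
        return sum(usd for _, usd in self.events)


monitor = CostVelocityMonitor(ceiling_usd=5.0, window_s=60.0)
signals = [
    monitor.record(0.0, 2.0),
    monitor.record(30.0, 2.0),
    monitor.record(45.0, 2.0),   # 6.0 spent inside 60 s: over the ceiling
    monitor.record(100.0, 2.0),  # the first two events have aged out
]
# signals == [False, False, True, False]
```

A True result is the signal that would open the cost circuit breaker or trigger the cheaper-model fallback, and `rolling_spend()` supplies the velocity figure for the audit event.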

Governance Of Flow¶

Every Flow decision is a governance event. Validation gates emit structured ValidationResult events. Retries, replans, escalations, and aborts are recorded with cause and recovery action. Loop and deadlock detector trips are tagged with the cycle they observed. Cost circuit breaker trips include rolling velocity and ceiling. The audit trail is sufficient to reconstruct any Flow run end-to-end.

Flow Governance Requirements

Every stage has explicit pre-condition and post-condition gates.
Every fabric has explicit hop, recursion, retry, timeout, and cost budgets.
Loop and deadlock detection are mandatory; agents cannot self-diagnose loops.
Escalation paths are explicit; silent retry is not permitted.
State-mutating side effects have compensating actions registered before execution.
Every Flow decision emits a structured event to the audit trail.

Cross-References¶

Setup: topology, identity, and budgets that the Flow layer enforces.
Data: schema-validated payloads that pass through Flow gates.
Test: how Flow behavior is validated end-to-end, including failure modes.
Emergence › System: runtime patterns each stage can run.
Emergence › Subunits: composition rules consumed by Flow.
Reference › Research › Fabric Flow: citations, MAST taxonomy, and protection-mechanism research.