Flow

Fabric Flow defines how the configured multi-agent system actually executes: the staged process by which agents work together, the failure states each stage can produce, and the protection mechanisms that keep any failure from cascading into a runaway loop, a deadlock, or an uncontrolled cost spike. Where Setup commits the topology and Data commits the data plane, the Flow layer commits the operational behavior: which stage runs when, what passes between stages, what happens when a stage fails, and what guarantees the fabric makes about termination, cost, and safety.

Fabric Flow is informed by the active agent reliability research lineage, including Anthropic Building Effective Agents (which defined the canonical taxonomy of prompt chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and autonomous agents), the MAST multi-agent failure taxonomy (which analyzed 150+ traces across seven frameworks and codified 14 failure modes), LangGraph human-in-the-loop interrupts and the recursion_limit ceiling, AutoGen / AG2 termination conditions, the OpenAI Agents SDK max_turns primitive, durable-execution engines including Temporal, Dapr Workflows, and Restate, and the AWS Bedrock AgentCore policy and automated-reasoning layer. Selection criteria for protection mechanisms are documented in Reference › Research › Fabric Flow.

Canonical Staged Flow

Every Flow run advances through five stages: plan, fan-out, fan-in, validate, and finalize. Each stage transition passes through a typed validation gate. Each stage is bounded by an explicit budget. Each stage emits structured events to the audit trail. Failure transitions are always explicit; they do not silently retry, escalate, or abort.

flowchart LR
    REQ[Request] --> PLAN[Plan]
    PLAN --> G1{Plan Gate}
    G1 -->|pass| FO[Fan-Out: Worker 1..N]
    G1 -->|fail| ESC[Escalation]

    FO --> FI[Fan-In: Aggregate]
    FI --> G2{Validate Gate}
    G2 -->|pass| FIN[Finalize]
    G2 -->|replan| PLAN
    G2 -->|escalate| ESC

    FIN --> OUT[Outcome]
    ESC --> HR[Human Review]
    ESC --> AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class REQ,PLAN,G1,FO,FI,G2,FIN,OUT,ESC,HR,AB stage

The canonical flow is supervisor-shaped by default: the supervisor plans, dispatches workers, aggregates their summaries, validates the aggregate, and either finalizes or replans. Hierarchical, peer-handoff, debate, and SOP topologies are variations on this canonical shape; each ships the same gate, budget, and protection semantics.
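The supervisor-shaped flow can be sketched as a small driver loop. This is a minimal illustration, not the platform's orchestrator: `run_flow`, `Outcome`, and the callable parameters are hypothetical names, fan-out is sequential for simplicity, and `MAX_REPLANS = 2` mirrors the replan rule stated in the Failure Decision Table section.

```python
from dataclasses import dataclass

MAX_REPLANS = 2  # assumption: mirrors the "replan at most twice" default


@dataclass
class Outcome:
    status: str            # "finalized" or "escalated"
    result: object = None
    replans: int = 0


def run_flow(request, plan, plan_gate, workers, aggregate, validate_gate, finalize):
    """Plan -> plan gate -> fan-out -> fan-in -> validate gate -> finalize."""
    replans = 0
    while True:
        current_plan = plan(request)
        if not plan_gate(current_plan):
            return Outcome("escalated", replans=replans)

        summaries = [worker(current_plan) for worker in workers]  # fan-out
        aggregated = aggregate(summaries)                         # fan-in

        verdict = validate_gate(aggregated)  # "pass" | "replan" | "escalate"
        if verdict == "pass":
            return Outcome("finalized", finalize(aggregated), replans)
        if verdict == "replan" and replans < MAX_REPLANS:
            replans += 1
            continue
        # Either an explicit escalate verdict or the replan budget is spent.
        return Outcome("escalated", replans=replans)
```

Every exit from the loop is an explicit outcome, which matches the requirement that failure transitions never happen silently.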

Stage Anatomy

Every stage in the Flow has the same internal anatomy. The platform applies the same invariants to every stage regardless of which Unitt runs it, which model is pinned, and which connectors it accesses.

flowchart LR
    IN[Stage Input + Schema] --> PRE[Pre-Condition Gate]
    PRE --> EX[Execute Agent Step]
    EX --> POST[Post-Condition Gate]
    POST --> OUT[Stage Output + Schema]

    EX -. tool calls .-> T[Tools / Connectors]
    EX -. memory writes .-> M[Memory Writers]
    EX -. tracing .-> O[Observability]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class IN,PRE,EX,POST,OUT,T,M,O stage
  • Pre-condition gate validates input schema, policy scope, budget headroom, and required upstream artifacts.
  • Execute runs the agent step under the topology's runtime pattern from Emergence › System.
  • Post-condition gate validates output schema, policy compliance, confidence threshold, and downstream contract obligations.
  • Output + schema is the typed payload that the next stage's pre-condition gate consumes.
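The stage anatomy above can be sketched as a wrapper that runs both gates around the agent step. The names `StageSpec`, `GateError`, and `run_stage` are hypothetical, and only a subset of the listed checks (required fields, budget headroom, confidence threshold) is modelled; policy scope and downstream contracts are left out.

```python
from dataclasses import dataclass


class GateError(Exception):
    """Raised when a pre- or post-condition gate rejects a payload."""


@dataclass
class StageSpec:
    name: str
    required_inputs: tuple
    required_outputs: tuple
    min_confidence: float
    est_cost_usd: float


def run_stage(spec, payload, budget_usd, execute):
    # Pre-condition gate: input shape and budget headroom.
    missing = [k for k in spec.required_inputs if k not in payload]
    if missing:
        raise GateError(f"{spec.name}: missing inputs {missing}")
    if spec.est_cost_usd > budget_usd:
        raise GateError(f"{spec.name}: insufficient budget headroom")

    output = execute(payload)  # the agent step itself

    # Post-condition gate: output shape and confidence threshold.
    missing = [k for k in spec.required_outputs if k not in output]
    if missing:
        raise GateError(f"{spec.name}: missing outputs {missing}")
    if output.get("confidence", 0.0) < spec.min_confidence:
        raise GateError(f"{spec.name}: confidence below threshold")
    return output
```

Because the output of one stage is validated against a declared shape, the next stage's pre-condition gate can consume it without re-deriving the contract.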

Failure Mode Taxonomy

The MAST taxonomy categorizes multi-agent failures into three clusters: Specification & System Design (~42%), Inter-Agent Misalignment (~37%), and Task Verification & Termination (~21%). Every Flow protection mechanism in the platform maps explicitly to one or more failure classes.

| Failure Class | Representative Cases | Protection Mechanism |
| --- | --- | --- |
| Specification | Ambiguous role, malformed plan, vague brief | Pre-condition schema gate; supervisor critique pass; replan budget |
| Inter-Agent | Information loss across handoff, conflicting sub-goals, premature termination | Brief + summary schemas; hop limit; loop detector; deadlock monitor |
| Verification | "Looks done" but unverified; missing exit criterion | Post-condition gate; outcome oracle; cost-per-success metric |

Protection Mechanism Catalog

The platform ships nine protection mechanisms that are wired automatically by Setup and enforced at runtime by the Flow layer. Each is configured per fabric and per stage.

Hop Limit And Recursion Limit

Each major agent framework ships a hard hop ceiling: LangGraph's recursion_limit defaults to 25 and raises GraphRecursionError; AutoGen GroupChat exposes max_round, which can be combined with MaxMessageTermination; and the OpenAI Agents SDK uses max_turns, raising MaxTurnsExceeded. The platform applies a default hop limit of 5 on peer handoff chains and a default recursion limit of 8 on supervisor / hierarchical depth, both configurable per fabric.
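As a rough sketch of how the two platform defaults might be enforced, the counter below tracks peer handoffs and supervisor depth for one run. `RunLimits` and both exception names are hypothetical; only the default values 5 and 8 come from the text above.

```python
from dataclasses import dataclass


class HopLimitExceeded(RuntimeError):
    pass


class RecursionLimitExceeded(RuntimeError):
    pass


@dataclass
class RunLimits:
    max_hops: int = 5    # peer handoff chain ceiling
    max_depth: int = 8   # supervisor / hierarchical depth ceiling
    hops: int = 0

    def record_handoff(self):
        """Count one peer handoff; fail hard once the ceiling is passed."""
        self.hops += 1
        if self.hops > self.max_hops:
            raise HopLimitExceeded(f"{self.hops} hops > limit {self.max_hops}")

    def check_depth(self, depth):
        """Reject a dispatch that would nest deeper than the ceiling."""
        if depth > self.max_depth:
            raise RecursionLimitExceeded(f"depth {depth} > limit {self.max_depth}")
```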

Loop Detector

Hop limits alone miss semantic loops, where agents cycle through paraphrases of the same call. The platform runs a LoopDetector middleware that hashes (tool_name, normalized_args) over a sliding window (default N = 5, threshold = 3 identical) and either injects a corrective system message or strips tool_calls to force termination.

flowchart LR
    TC[Tool Call] --> HSH[Hash tool + args]
    HSH --> WIN[Sliding Window N=5]
    WIN --> CNT{Count >= 3?}
    CNT -->|no| EX[Execute]
    CNT -->|yes| INJ[Inject Correction or Force Terminate]
    INJ --> AUD[Audit Event]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class TC,HSH,WIN,CNT,EX,INJ,AUD stage
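The detector described above can be approximated in a few lines. This is a sketch under assumptions: the class mirrors the `LoopDetector` name from the text, but the normalization shown (key sorting plus trimming and lower-casing string values) is only one plausible choice, and the corrective action is left to the caller.

```python
import hashlib
import json
from collections import deque


class LoopDetector:
    def __init__(self, window=5, threshold=3):
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # sliding window of fingerprints

    @staticmethod
    def _fingerprint(tool_name, args):
        # Assumed normalization: trim and lower-case strings, sort keys.
        normalized = {
            k: v.strip().lower() if isinstance(v, str) else v
            for k, v in args.items()
        }
        blob = json.dumps([tool_name, normalized], sort_keys=True, default=str)
        return hashlib.sha256(blob.encode()).hexdigest()

    def observe(self, tool_name, args):
        """Record a pending tool call; return True when it should be blocked."""
        fp = self._fingerprint(tool_name, args)
        self.recent.append(fp)
        return self.recent.count(fp) >= self.threshold
```

A caller would invoke `observe` before executing each tool call and, on `True`, inject the corrective message or strip the tool calls and emit the audit event.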

Circuit Breaker

Agent-aware circuit breakers track soft failures that are invisible to HTTP-layer breakers: schema-invalid outputs, semantic-invariant violations, identical-tool-call streaks, and cost-velocity overruns. The state machine follows the Hystrix tradition: CLOSED → OPEN on a consecutive-failure or cost-rate threshold, OPEN → HALF_OPEN after a cooldown, and HALF_OPEN → CLOSED on probe success or back to OPEN on probe failure. A breaker that exceeds its maximum trip count stops the run. When OPEN, calls route to a configured fallback (cheaper model, static response, or escalation).

flowchart LR
    START((Start)) --> CLOSED[CLOSED]
    CLOSED -->|failure threshold or cost limit| OPEN[OPEN]
    OPEN -->|cooldown elapsed| HALF[HALF_OPEN]
    HALF -->|probe success| CLOSED
    HALF -->|probe failure| OPEN
    OPEN -->|max trips exceeded| STOP((Stop))

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class START,CLOSED,OPEN,HALF,STOP stage

Trip conditions in production: at least 3 consecutive identical tool calls, at least 2 consecutive JSON-schema validation failures, rolling cost rate above the configured per-hour or per-session ceiling, provider 429 / 503 rates exceeding threshold.
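A minimal version of the breaker state machine is sketched below. `AgentCircuitBreaker` is a hypothetical name; the sketch counts any reported soft failure toward a single consecutive-failure threshold, lets every call through while HALF_OPEN rather than limiting probes, and omits the max-trips stop state.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class AgentCircuitBreaker:
    def __init__(self, failure_threshold=2, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock          # injectable so tests can control time
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow(self):
        """Return True when a call may proceed (CLOSED or probing)."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self):
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self):
        # Soft failures (schema-invalid output, repeated call, ...) land here.
        self.failures += 1
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

When `allow()` returns `False`, the caller routes to the configured fallback instead of invoking the model or tool.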

Retry Budget With Backoff

Per-call retry uses exponential backoff with full jitter (sleep = random(0, base * 2^attempt)), governed by a system-wide token-bucket retry budget capping retry traffic at 10-20% of normal load. Without a system-wide budget, a model outage triggers retry storms that compound cost. A hard pre-execution budget gate evaluates token and dollar spend before every model call or tool invocation; exhaustion is a deterministic deny, never an additional retry.
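The two pieces, per-call full jitter and a shared retry budget, can be sketched as follows. The `cap` argument is an added assumption (the formula above has no ceiling), and `RetryBudget` is a hypothetical token bucket that uses integer credits so a 10% ratio accumulates without floating-point drift.

```python
import random


def full_jitter_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """sleep = random(0, min(cap, base * 2^attempt)); rng returns a value in [0, 1)."""
    return rng() * min(cap, base * 2 ** attempt)


class RetryBudget:
    """Each normal call earns retry_percent credits; one retry costs 100."""

    def __init__(self, retry_percent=10, max_banked_retries=10):
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * 100
        self.credits = 0

    def record_call(self):
        self.credits = min(self.cap, self.credits + self.retry_percent)

    def try_spend(self):
        """Return True if a retry is allowed; a deny is final, not retried."""
        if self.credits >= 100:
            self.credits -= 100
            return True
        return False
```

With `retry_percent=10`, ten first-attempt calls fund one retry, which keeps retry traffic near 10% of normal load even during a provider outage.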

Timeout Propagation

Three layers compose: per-step timeout (single LLM call, typically 30-120 s), per-stage timeout (a worker's full sub-flow), and overall workflow timeout (top-level deadline). Durable engines such as Temporal or Restate propagate a workflow deadline as context to every activity so a child activity can short-circuit when the parent budget is nearly exhausted. The platform propagates timeouts as absolute deadlines, never as durations, so retries do not reset the clock.
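The difference between propagating a deadline and propagating a duration is easiest to see in code. In this sketch, `Deadline` is a hypothetical helper: each child deadline is clamped to its parent's absolute expiry, so a retried step receives whatever time is left rather than a fresh allowance.

```python
import time


class DeadlineExceeded(TimeoutError):
    pass


class Deadline:
    def __init__(self, expires_at, clock=time.monotonic):
        self.expires_at = expires_at  # absolute instant on the given clock
        self.clock = clock

    @classmethod
    def after(cls, seconds, clock=time.monotonic):
        return cls(clock() + seconds, clock)

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def child(self, seconds):
        """Narrower deadline for a stage or step; never outlives the parent."""
        return Deadline(min(self.expires_at, self.clock() + seconds), self.clock)

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("deadline passed")
```

A workflow would create one `Deadline.after(...)` at the top level, derive a `child` per stage and per step, and call `check()` before each model or tool call.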

Escalation And Human-In-The-Loop

The Flow layer integrates with the LangGraph interrupt() pattern (pauses the graph at a checkpoint, persists state, waits for an approve / edit / reject / respond decision), the OpenAI Agents SDK handoff-to-human primitive (often via Temporal or Dapr Diagrid integrations), and the Claude tool-use approval flow. Confidence-threshold escalation routes to human review when a critic / verifier's score is below the configured τ, rather than retrying.

flowchart LR
    ST[Stage Result] --> CONF{Confidence >= τ?}
    CONF -->|yes| NEXT[Next Stage]
    CONF -->|no| ESC[Escalate]
    ESC --> CK[Checkpoint State]
    CK --> WAIT[Wait For Approval]
    WAIT -->|approve| NEXT
    WAIT -->|edit| EDIT[Inject Edit] --> NEXT
    WAIT -->|reject| AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class ST,CONF,NEXT,ESC,CK,WAIT,EDIT,AB stage
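The confidence-threshold path can be sketched without any framework. `Checkpoint`, `gate_on_confidence`, `resume`, and `ControlledAbort` are hypothetical names, state is held in memory rather than persisted, and only the approve / edit / reject decisions from the diagram are handled (the "respond" decision is omitted).

```python
from dataclasses import dataclass
from typing import Optional


class ControlledAbort(Exception):
    pass


@dataclass
class Checkpoint:
    stage: str
    result: dict
    confidence: float


def gate_on_confidence(stage, result, confidence, tau) -> Optional[Checkpoint]:
    """Return None to proceed, or a checkpoint to park until a human decides."""
    if confidence >= tau:
        return None
    return Checkpoint(stage, dict(result), confidence)


def resume(checkpoint, decision, edit=None):
    """Apply the reviewer's decision and return the payload for the next stage."""
    if decision == "approve":
        return checkpoint.result
    if decision == "edit":
        return {**checkpoint.result, **(edit or {})}
    if decision == "reject":
        raise ControlledAbort(f"{checkpoint.stage} rejected by reviewer")
    raise ValueError(f"unknown decision: {decision}")
```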

Idempotency And Saga Compensation

Each agent step is idempotent under an idempotency_key (typically hash(step_id, inputs)) so durable replay returns cached results. State-mutating actions register a compensate() function; on failure, the orchestrator runs compensations in LIFO order across the actually-executed steps. Durable execution engines (Temporal, Dapr Workflows v1.15+, Restate Cloud GA 2025) journal each step and replay on crash, returning cached results for completed activities.
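The combination of idempotent replay and LIFO compensation can be illustrated with an in-memory stand-in for a durable journal. `Saga` and `idempotency_key` are hypothetical helpers; a real durable engine would persist the journal, and this sketch treats a step as "actually executed" only once its action has returned.

```python
import hashlib
import json


def idempotency_key(step_id, inputs):
    """hash(step_id, inputs), stable across dict key order."""
    blob = json.dumps([step_id, inputs], sort_keys=True, default=str)
    return hashlib.sha256(blob.encode()).hexdigest()


class Saga:
    def __init__(self):
        self.journal = {}    # idempotency key -> cached result
        self.executed = []   # compensations for completed steps, in order

    def run_step(self, step_id, inputs, action, compensate=None):
        key = idempotency_key(step_id, inputs)
        if key in self.journal:
            return self.journal[key]  # replay: skip the side effect
        try:
            result = action(inputs)
        except Exception:
            self.rollback()
            raise
        self.journal[key] = result
        if compensate is not None:
            self.executed.append((compensate, inputs, result))
        return result

    def rollback(self):
        # Undo completed steps newest-first (LIFO).
        while self.executed:
            compensate, inputs, result = self.executed.pop()
            compensate(inputs, result)
```

The compensation is supplied alongside the action, before it runs, but is only queued for rollback after the action succeeds.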

Deadlock Monitor

Two classes of deadlock dominate: peer handoff oscillation (A → B → A …) and supervisor stuck state (the supervisor repeatedly dispatches to the same worker, which keeps returning "needs more info"). An external monitor, not the agents themselves, tracks the handoff graph and triggers on cycles of length ≤ 3 occurring more than once, or on supervisor states whose (active_worker, last_message_hash) repeats. On trigger, the monitor routes the stalled pair to a higher tier or to escalation.
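Both triggers can be sketched with a short external monitor. `DeadlockMonitor` is a hypothetical name, and the cycle rule is one interpretation of "occurring more than once": a cycle of length k counts as repeated when the last 2k + 1 holders show the same k-step pattern twice and return to the starting agent.

```python
import hashlib
from collections import Counter


class DeadlockMonitor:
    def __init__(self, max_cycle_len=3):
        self.max_cycle_len = max_cycle_len
        self.holders = []        # agents in the order they received control
        self.states = Counter()  # (active_worker, message hash) -> times seen

    def record_handoff(self, to_agent):
        """Return True when a short handoff cycle has completed twice."""
        self.holders.append(to_agent)
        for k in range(1, self.max_cycle_len + 1):
            tail = self.holders[-(2 * k + 1):]
            if (len(tail) == 2 * k + 1
                    and tail[:k] == tail[k:2 * k]
                    and tail[0] == tail[-1]):
                return True
        return False

    def record_supervisor_state(self, active_worker, last_message):
        """Return True when the same (worker, message) state is seen again."""
        digest = hashlib.sha256(last_message.encode()).hexdigest()
        self.states[(active_worker, digest)] += 1
        return self.states[(active_worker, digest)] > 1
```

Because the monitor only sees handoff events and message hashes, it runs outside the agents and does not depend on them noticing their own loop.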

Validation Gates Between Stages

Every stage transition passes through a typed gate that runs four checks in order: schema validation (Pydantic / JSON Schema), policy validation (OPA / Cedar / Bedrock AgentCore Policy authorizing each action pre-execution), semantic invariants ("answer cites at least one retrieved doc"), and grounding / hallucination checks. Gates produce a structured ValidationResult consumed by the orchestrator's retry / replan / abort decision.
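The ordered, short-circuiting gate can be sketched with plain functions. `ValidationResult` mirrors the name in the text, but its fields are assumptions, and the four example checks are toy stand-ins: a real gate would delegate to Pydantic or JSON Schema for the schema check and to a policy engine for authorization.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValidationResult:
    passed: bool
    failed_check: Optional[str] = None
    detail: str = ""


def run_gate(payload, checks):
    """Run (name, check) pairs in order; stop at the first failure."""
    for name, check in checks:
        error = check(payload)  # each check returns an error string or None
        if error is not None:
            return ValidationResult(False, name, error)
    return ValidationResult(True)


def schema_check(p):
    return None if isinstance(p.get("answer"), str) else "answer must be a string"


def policy_check(p):
    return None if p.get("action") in {"read", "summarize"} else "action not permitted"


def cites_a_doc(p):
    return None if p.get("citations") else "answer cites no retrieved doc"


def grounding_check(p):
    unknown = set(p.get("citations", [])) - set(p.get("retrieved_ids", []))
    return f"citations not in retrieved set: {sorted(unknown)}" if unknown else None


GATE = [
    ("schema", schema_check),
    ("policy", policy_check),
    ("semantic", cites_a_doc),
    ("grounding", grounding_check),
]
```

The `failed_check` field lets the orchestrator pick between retry, replan, and abort without parsing free-text errors.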

Failure Decision Table

The decision a stage makes on a failure is deterministic: replan, retry, escalate, or controlled abort. Defaults are based on the failure class and the workload tier, and are documented in Reference › Research › Fabric Flow.

| Failure Type | Low-Latency Interactive | Long-Running Batch | High-Stakes / Regulated |
| --- | --- | --- | --- |
| Infinite tool loop | max_turns + loop detector | Recursion limit + durable journal | Loop detector + human escalate |
| Sub-agent deadlock | Per-stage timeout + abort | External monitor + saga compensation | Monitor + mandatory HITL gate |
| Cost overrun | Per-session token cap, fail-fast | Pre-execution budget gate + alert | Hard budget gate + finance approve |
| Provider 5xx / rate-limit | Backoff + jitter, fallback model | Circuit breaker + retry budget | Breaker + escalate; no silent swap |
| Schema / grounding fail | 1 retry then surface to user | Replan ≤ 2 then abort | Validation gate blocks; HITL fix |
| Spec / planner malformed | Abort, return error | Replan with critic; bounded replans | Abort + ticket; no auto-replan |
| Semantic no-progress | Loop detector → terminate | Replan once then escalate | Escalate immediately |
| Catastrophic / unsafe output | Guardrail block + safe fallback | Guardrail block + halt workflow | Guardrail block + audit log + HITL |

The MAS-Orchestra and Cogent 2026 orchestration playbooks both converge on the same rule of thumb: replan at most twice per stage; on the third failure, escalate or abort. The Flow layer enforces this rule by default and treats deviations as explicit per-fabric configuration.

Retry With Escalation

Every step that can recover follows the same retry-with-escalation flow. Classification at error time decides whether the failure is transient (backoff + retry within budget), semantic (replan within budget), policy (guardrail block, escalate), or budget-exhausted (fail-fast, escalate).

flowchart TD
    CALL[Step Call] --> RES{Success?}
    RES -->|yes| DONE[Continue]
    RES -->|no| CLS{Classify Error}

    CLS -->|transient| BO[Backoff + Jitter]
    BO --> BC{Budget OK?}
    BC -->|yes| CALL
    BC -->|no| ESC[Escalate]

    CLS -->|semantic| RPL{Replans < 2?}
    RPL -->|yes| PLAN[Replan]
    RPL -->|no| ESC

    CLS -->|policy| GB[Guardrail Block]
    GB --> ESC

    CLS -->|budget| FF[Fail-Fast]
    FF --> ESC

    ESC --> HRI[Human Interrupt]
    HRI --> RES2{Resume Action}
    RES2 -->|approve| CALL
    RES2 -->|edit| CALL
    RES2 -->|reject| AB[Controlled Abort]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class CALL,RES,CLS,BO,BC,RPL,PLAN,GB,FF,ESC,HRI,RES2,DONE,AB stage
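The branch structure of the diagram reduces to a small pure function. In this sketch the enum and function names are hypothetical, and the mapping from Python exception types to error classes in `classify` is purely illustrative; a real classifier would inspect provider status codes and gate results.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    SEMANTIC = "semantic"
    POLICY = "policy"
    BUDGET = "budget"


class Action(Enum):
    RETRY = "retry"
    REPLAN = "replan"
    ESCALATE = "escalate"


class BudgetExhausted(Exception):
    pass


def classify(exc):
    # Illustrative mapping only; unknown errors default to semantic here.
    if isinstance(exc, BudgetExhausted):
        return ErrorClass.BUDGET
    if isinstance(exc, PermissionError):
        return ErrorClass.POLICY
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return ErrorClass.TRANSIENT
    return ErrorClass.SEMANTIC


def decide(error_class, *, retry_budget_ok, replans_used, max_replans=2):
    if error_class is ErrorClass.TRANSIENT:
        return Action.RETRY if retry_budget_ok else Action.ESCALATE
    if error_class is ErrorClass.SEMANTIC:
        return Action.REPLAN if replans_used < max_replans else Action.ESCALATE
    # Policy blocks and exhausted budgets never retry.
    return Action.ESCALATE
```

Keeping `decide` free of side effects makes the decision easy to log as a governance event alongside its inputs.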

Replanning Versus Aborting

Replan when the failure is recoverable and bounded: tool returned a decodable constraint violation, schema validation failed once, or a sub-agent returned a partial result. Abort when retry budget is exhausted, circuit breaker is open, the failure is MAST-class specification (plan itself is malformed), or repeated-state hashing indicates no semantic progress for K iterations. The decision is recorded as an explicit governance event; no silent retry, no silent abort.

Cost Velocity Monitoring

Cost is a first-class failure signal. The Flow layer integrates a cost-velocity monitor that tracks rolling token and dollar spend per fabric run, per agent, and per stage. When velocity exceeds the configured threshold, the monitor opens the cost circuit breaker, routes the run to a cheaper-model fallback when configured, or escalates to human review. Production reports of 10× cost blowups from uncapped sub-agent spawning are the motivating evidence; the platform applies the cap by default.
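A rolling-window velocity check might look like the sketch below. `CostVelocityMonitor` is a hypothetical name, the five-minute default window is an assumption, and one instance would be kept per scope (fabric run, agent, or stage); what happens on a trip is left to the caller.

```python
import time
from collections import deque


class CostVelocityMonitor:
    def __init__(self, ceiling_usd_per_hour, window_s=300.0, clock=time.monotonic):
        self.ceiling = ceiling_usd_per_hour
        self.window_s = window_s
        self.clock = clock       # injectable so tests can control time
        self.events = deque()    # (timestamp, usd) pairs inside the window

    def _evict(self, now):
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def record(self, usd):
        now = self.clock()
        self.events.append((now, usd))
        self._evict(now)

    def velocity_usd_per_hour(self):
        """Spend inside the window, extrapolated to an hourly rate."""
        self._evict(self.clock())
        return sum(usd for _, usd in self.events) * 3600.0 / self.window_s

    def tripped(self):
        return self.velocity_usd_per_hour() > self.ceiling
```

A tripped monitor would open the cost circuit breaker, switch to the cheaper-model fallback when one is configured, or escalate.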

Governance Of Flow

Every Flow decision is a governance event. Validation gates emit structured ValidationResult events. Retries, replans, escalations, and aborts are recorded with cause and recovery action. Loop and deadlock detector trips are tagged with the cycle they observed. Cost circuit breaker trips include rolling velocity and ceiling. The audit trail is sufficient to reconstruct any Flow run end-to-end.

Flow Governance Requirements

  • Every stage has explicit pre-condition and post-condition gates.
  • Every fabric has explicit hop, recursion, retry, timeout, and cost budgets.
  • Loop and deadlock detection are mandatory; agents cannot self-diagnose loops.
  • Escalation paths are explicit; silent retry is not permitted.
  • State-mutating side effects have compensating actions registered before execution.
  • Every Flow decision emits a structured event to the audit trail.

Cross-References