Test

Fabric Test defines how the configured multi-agent system is validated end-to-end against real and synthetic data, with every release decision tied to the full cost of a single successful run of the entire fabric. Where Flow commits the operational behavior, the Test layer commits the validation strategy: which scenarios the fabric must pass, which judges score the outcome, which metrics gate the release, and which cost ceilings define "successful run" before any candidate fabric is permitted to advance to Publish.

Fabric Test is informed by the active agent-evaluation research lineage, including τ-bench and τ²-bench (the canonical pass^k reliability metric for tool-using agents), SWE-bench Verified, MLE-bench, AgentBoard sub-goal trajectory evaluation, the Anthropic Demystifying Evals guidance, Inspect AI by UK AISI, AgentDojo adversarial prompt-injection eval, DSPy assertions, RAGAS faithfulness, Cleanlab TLM, and shadow / canary deployment patterns. Selection criteria for testing techniques are documented in Reference › Research › Fabric Test.

What The Test Layer Validates

The Test layer commits five validation surfaces (outcome correctness, trajectory quality, faithfulness, cost-per-success, and safety) and treats every release decision as a function of all five. No surface can be skipped; a candidate fabric that passes outcome but fails cost-per-success is not permitted to advance.

flowchart LR
    TC[Test Layer] --> OUT[Outcome Correctness]
    TC --> TRAJ[Trajectory Quality]
    TC --> FAITH[Faithfulness / Hallucination]
    TC --> COST[Cost-Per-Success]
    TC --> SAFE[Safety / Adversarial]

    OUT --> REL{Release Gate}
    TRAJ --> REL
    FAITH --> REL
    COST --> REL
    SAFE --> REL

    REL -->|pass| PUB[Publish]
    REL -->|regress| BLK[Block + Issue]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class TC,OUT,TRAJ,FAITH,COST,SAFE,REL,PUB,BLK stage

Test Suites

Test material in the platform is partitioned into five suites, each with a distinct purpose, scoring method, and refresh cadence. Suites correspond to the same partitioning used in WorldSim so scenarios flow seamlessly between simulation-time and production-time evaluation.

| Suite | Source | Purpose | Refresh |
| --- | --- | --- | --- |
| Validation | Curated golden trajectories | Release gate; pass-or-block | Manual review per release |
| Regression | Captured production traces (PII-scrubbed) | Detect outcome drift on every change | Auto-promoted from production traffic |
| Exploration | Synthetic scenarios generated from persona pools | Probe adjacent behaviors not covered by current scenarios | Refreshed weekly |
| Stress | Cost / latency / token-budget worst cases | Verify the fabric survives load and edge tails | Each release |
| Adversarial | AgentDojo + Promptfoo + bespoke red-team | Prompt injection, memory poisoning, tool misuse | Pre-launch + monthly |

Outcome Scoring

Outcome scoring is multi-axis by design; a single pass / fail signal hides regressions in cost or safety. The platform scores every test run on at least five axes and surfaces them as independent metrics, never collapsed into a single number.

Pass^k Reliability

Production-grade fabrics report pass^k, the probability that all k independent trials of the same task succeed, rather than pass^1 alone. τ-bench shows frontier models routinely drop to pass^8 < 25% even when pass^1 ≈ 50%, so single-shot success rate is not a sufficient reliability signal for stochastic policies. The platform applies a minimum pass^k threshold per release tier, configurable in Setup.
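
As an illustration of how pass^k can be estimated from repeated trials, the sketch below uses the combinatorial estimator popularized by τ-bench: with n trials of a task and c successes, the per-task estimate is C(c, k) / C(n, k), averaged over tasks. The function names and the shape of the results mapping are assumptions made for this example, not platform APIs.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task from n trials with c successes.

    Returns C(c, k) / C(n, k): the chance that k trials drawn without
    replacement from the observed n are all successes.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must lie in [0, num_trials]")
    # math.comb returns 0 when num_successes < k, so the estimate is 0.0 there.
    return comb(num_successes, k) / comb(num_trials, k)


def suite_pass_hat_k(results: dict[str, list[bool]], k: int) -> float:
    """Average the per-task estimate over a suite (task id -> trial outcomes)."""
    if not results:
        raise ValueError("results must contain at least one task")
    per_task = [pass_hat_k(len(trials), sum(trials), k) for trials in results.values()]
    return sum(per_task) / len(per_task)
```

A task that succeeds in 4 of 8 trials has an estimated pass^1 of 0.5 but a pass^8 of 0.0, which is the gap between single-shot and repeated-run reliability described above.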

Sub-Goal Trajectory Progress

AgentBoard sub-goal progress rate scores partial completion when the overall outcome fails, allowing operators to localize a regression to a specific stage or agent. The Test layer uses trajectory progress as a triage signal rather than a release gate; AgentRewardBench shows trajectory judges disagree with humans roughly 30% of the time, which is acceptable for triage but not for a hard gate.
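
A simplified sketch of sub-goal progress scoring follows. It assumes each scenario carries an ordered list of labeled sub-goals and that the run reports which of them were reached; AgentBoard's own matching rules are more detailed, and the names here are hypothetical.

```python
from typing import Optional


def progress_rate(subgoals: list[str], completed: set[str]) -> float:
    """Fraction of a scenario's labeled sub-goals that the run reached."""
    if not subgoals:
        raise ValueError("a scenario needs at least one sub-goal")
    reached = sum(1 for goal in subgoals if goal in completed)
    return reached / len(subgoals)


def first_unmet_subgoal(subgoals: list[str], completed: set[str]) -> Optional[str]:
    """Return the earliest sub-goal that was not reached, or None if all were.

    Used as a triage hint: it points at the stage where a failing run stalled.
    """
    for goal in subgoals:
        if goal not in completed:
            return goal
    return None
```

For a scenario with sub-goals `["lookup_order", "check_policy", "issue_refund"]` where only the first was reached, the progress rate is one third and the triage hint points at `check_policy`.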

Faithfulness And Hallucination

For RAG-using agents, faithfulness is scored with RAGAS atomic-claim decomposition, Vectara HHEM-2.1, and Cleanlab TLM trustworthiness scoring. TLM combines self-reflection, sample-consistency, and token-logprob uncertainty, and ranks first in benchmarks across FinanceBench, PubMedQA, and four RAG suites.

Safety And Adversarial

Adversarial evaluation runs against AgentDojo (97 realistic user tasks and 629 security test cases over real tool execution), Promptfoo's 500+ attack-vector library, and bespoke red-team scenarios derived from the OWASP Top 10 for Agentic Applications. The platform applies a maximum Attack Success Rate threshold per release tier.

Cost-Per-Success

Cost-per-success is the platform's primary release-economics metric. It is computed as total_dollars / successful_tasks over a fixed scenario set, multiplied by a retry-amortization factor, with cache hits, batch discounts, and reserved capacity included so the metric reflects the production cost surface rather than the rack-rate forecast. The full computation is documented in the next section.

Cost-Per-Success Computation

Every release decision is anchored on cost-per-success. The metric is computed against the union of the validation and regression suites and is logged per release in the audit trail. The platform's evaluation pipeline emits the metric automatically; operators are not permitted to hand-compute or override it.

flowchart LR
    SC[Scenarios] --> RUN[Run Fabric in Sandbox]
    RUN --> OUT[Outcome Per Scenario]
    OUT --> SUM[Aggregate Success Count]
    RUN --> TOK[Token Spend]
    RUN --> TC[Tool Spend]
    RUN --> CC[Connector Spend]

    TOK --> CSP[Total Dollar Spend]
    TC --> CSP
    CC --> CSP

    CSP --> CPS[Cost / Successful Task]
    SUM --> CPS

    CPS --> AMORT[Retry-Amortized]
    AMORT --> REL{Release Gate}

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class SC,RUN,OUT,SUM,TOK,TC,CC,CSP,CPS,AMORT,REL stage

| Input | Source |
| --- | --- |
| Successful tasks | Outcome oracle predicate (state hash, unit tests, schema match, judge verdict). |
| Token spend | Provider billing API, with cache-hit and batch-discount adjustments applied. |
| Tool spend | Per-tool cost ledger. |
| Connector spend | Per-connector cost ledger (API charges, data transfer, vector store reads). |
| Retry amortization | Mean retry count across the run set; cost is scaled by (1 + retry_mean). |

The release gate applies a per-fabric cost-per-success ceiling, defaulted from prior production cost and refined by WorldSim evolutionary runs.
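
The arithmetic described in this section can be sketched as follows. The `RunCosts` type and function names are hypothetical, and the sketch assumes the three spend figures are already net of cache-hit and batch-discount adjustments; the platform's actual pipeline may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCosts:
    """Dollar spend for one run of the scenario set (hypothetical shape)."""
    token_dollars: float
    tool_dollars: float
    connector_dollars: float


def cost_per_success(costs: RunCosts, successful_tasks: int, retry_mean: float) -> float:
    """total_dollars / successful_tasks, scaled by (1 + retry_mean)."""
    if successful_tasks <= 0:
        raise ValueError("cost-per-success is undefined with no successful tasks")
    if retry_mean < 0:
        raise ValueError("retry_mean cannot be negative")
    total_dollars = costs.token_dollars + costs.tool_dollars + costs.connector_dollars
    return total_dollars / successful_tasks * (1.0 + retry_mean)


def within_ceiling(value: float, ceiling: float) -> bool:
    """Release-gate check against the per-fabric cost-per-success ceiling."""
    return value <= ceiling
```

With $30 of token spend, $6 of tool spend, $4 of connector spend, 80 successful tasks, and a mean retry count of 0.25, the metric is 40 / 80 × 1.25 = $0.625 per successful task.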

Real-Data Validation

The platform's production-data validation pipeline follows the four-stage industry pattern: log → replay → shadow → canary. Each stage is governed by an explicit promotion criterion before traffic advances.

flowchart LR
    PROD[Production Traffic] --> LOG[Log + PII Scrub]
    LOG --> DS[Replay Dataset]
    DS --> RP[Replay Against Candidate]
    RP --> SH[Shadow Mode 0%]
    SH --> CA[Canary 5%]
    CA --> CA2[Canary 25%]
    CA2 --> GA[Full Rollout]

    SH -. regress .-> BLK[Block]
    CA -. regress .-> RB[Rollback]
    CA2 -. regress .-> RB

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class PROD,LOG,DS,RP,SH,CA,CA2,GA,BLK,RB stage
  • Log + PII Scrub captures every production request and strips sensitive material. LangSmith and Phoenix expose dataset-from-trace promotion so last week's traffic becomes an eval set with one operation.
  • Replay runs the candidate fabric against the dataset in an isolated runtime with no live side effects.
  • Shadow runs the candidate in parallel with production but does not surface its outputs to users; humans remain decision-makers.
  • Canary ramps the candidate to 5% → 25% → 100% with rolling pass^k and cost-per-success gates between each step.
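
The promotion logic between shadow and canary steps might look like the sketch below. The ramp values come from the stages above; the function name, the action labels, and the threshold parameters are assumptions for illustration only.

```python
# Traffic share per stage: shadow (0%), then canary 5% and 25%, then full rollout.
RAMP = (0, 5, 25, 100)


def gate_step(
    current_percent: int,
    pass_k: float,
    cost_per_success: float,
    *,
    min_pass_k: float,
    max_cost_per_success: float,
) -> tuple[str, int]:
    """Decide the next traffic share from rolling pass^k and cost-per-success."""
    if current_percent not in RAMP:
        raise ValueError(f"unknown ramp stage: {current_percent}")
    healthy = pass_k >= min_pass_k and cost_per_success <= max_cost_per_success
    if not healthy:
        # A regression in shadow blocks the candidate; in canary it rolls back.
        action = "block" if current_percent == 0 else "rollback"
        return action, 0
    if current_percent == RAMP[-1]:
        return "hold", current_percent
    return "promote", RAMP[RAMP.index(current_percent) + 1]
```

Keeping the gate a pure function of the current stage and the two rolling metrics makes each promotion decision easy to log and replay.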

Synthetic Data

Synthetic scenarios fill gaps where production traces are scarce, where edge cases are rare, or where adversarial coverage is required. The platform supports three synthetic-data sources:

  • Persona-driven synthetic users (the τ-bench / τ²-bench foundation): an LLM simulates the user under a persona card and a goal, enabling pass^k by re-sampling.
  • DSPy assertions: boolean constraints embedded in the agent program that double as programmatic eval gates and self-refinement signals; stable enough to run in CI.
  • Inspect AI procedurally generated scenarios: sandbox-isolated scenarios programmatically generated across solver / task primitives.

Hybrid real + synthetic data is the dominant 2025-2026 evaluation pattern: production traces anchor distribution and prevent collapse, synthetic traces add coverage of rare and adversarial cases.

LLM-As-Judge

For open-ended outputs where a deterministic oracle is unavailable, the platform uses calibrated LLM-as-judge scoring. Judge configuration applies the 2024-2026 mitigations against the four documented biases (position, verbosity, self-preference, and style):

| Bias | Mitigation |
| --- | --- |
| Position | Pairwise scoring with position swap; aggregate both orientations. |
| Verbosity | Length-controlled rubric; reward density rather than volume. |
| Self-preference | Use a different model family for the judge than the agent. |
| Style | Style-controlled prompts; calibration set with known-equivalent outputs. |

Judge ensembles plus calibrated rubrics plus bias-corrected confidence intervals are required before any judge score is treated as a release metric.
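
The position-swap mitigation can be sketched as below. The `Judge` callable is a hypothetical stand-in for a real judge-model call, and treating disagreement between the two orientations as a tie is one common aggregation rule, assumed here rather than taken from the platform.

```python
from typing import Callable

# A judge sees the prompt and two answers in order, and returns
# "first", "second", or "tie". Hypothetical interface for this sketch.
Judge = Callable[[str, str, str], str]


def swap_debiased_verdict(judge: Judge, prompt: str, baseline: str, candidate: str) -> str:
    """Run the pairwise judge in both orientations and aggregate.

    Returns "baseline", "candidate", or "tie". A winner is declared only
    when both orientations agree, so a purely positional preference
    cancels out.
    """
    forward = judge(prompt, baseline, candidate)
    reverse = judge(prompt, candidate, baseline)
    forward_label = {"first": "baseline", "second": "candidate", "tie": "tie"}[forward]
    reverse_label = {"first": "candidate", "second": "baseline", "tie": "tie"}[reverse]
    return forward_label if forward_label == reverse_label else "tie"
```

A judge that always prefers whichever answer appears first produces opposite labels in the two orientations, so the aggregate verdict is a tie rather than a spurious win.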

Regression Replay

Every release pins a "golden trajectory" dataset of 200-1000 production traces. On every change, the platform replays the candidate fabric against the dataset, compares outputs with a calibrated pairwise judge, and gates on regression-rate plus cost-delta. Any regression beyond the configured threshold automatically opens an issue with the offending trace IDs attached.

flowchart LR
    TR[Production Traces] --> DS[Golden Dataset]
    DS --> RB[Run Baseline]
    DS --> RC[Run Candidate]
    RB --> DJ[Calibrated Judge]
    RC --> DJ
    DJ --> SC[Δ pass^k + Δ $/task + Δ Trajectory]
    SC --> G{Regress > Threshold?}
    G -->|no| PROMO[Promote]
    G -->|yes| ISS[Auto-File Issue + Block]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class TR,DS,RB,RC,DJ,SC,G,PROMO,ISS stage
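
A minimal sketch of the regression gate, assuming the pairwise judge has already labeled each trace and that per-trace dollar costs are available for both runs. The record type, field names, and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceComparison:
    trace_id: str
    verdict: str            # "baseline", "candidate", or "tie" from the pairwise judge
    baseline_cost: float    # dollars spent by the baseline on this trace
    candidate_cost: float   # dollars spent by the candidate on this trace


def regression_gate(
    comparisons: list[TraceComparison],
    max_regression_rate: float,
    max_cost_delta: float,
) -> dict:
    """Gate on regression rate plus relative cost delta over the golden dataset."""
    if not comparisons:
        raise ValueError("the golden dataset must contain at least one trace")
    # A trace regresses when the judge prefers the baseline output.
    offending = [c.trace_id for c in comparisons if c.verdict == "baseline"]
    regression_rate = len(offending) / len(comparisons)
    baseline_total = sum(c.baseline_cost for c in comparisons)
    candidate_total = sum(c.candidate_cost for c in comparisons)
    cost_delta = (candidate_total - baseline_total) / baseline_total
    blocked = regression_rate > max_regression_rate or cost_delta > max_cost_delta
    return {
        "blocked": blocked,
        "regression_rate": regression_rate,
        "cost_delta": cost_delta,
        "offending_trace_ids": offending,  # attached to the auto-filed issue
    }
```

Returning the offending trace IDs alongside the decision keeps the block and the issue report derived from the same computation.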

Reliability Under Stress

Agent stress tests differ from web load tests. The platform's stress suite covers three documented failure modes: context-length blow-up (latency and token cost grow super-linearly once context exceeds a workload-specific threshold), horizontal-scaling limits (model weights and KV caches are GPU-bound), and token-budget exhaustion under unexpected load. Stress tests are run pre-release and continuously in production via a small fraction of synthetic traffic.

Eval Pipeline

The platform composes synthetic and real evaluation into a single pipeline that is gated, observable, and reproducible. The pipeline emits structured EvalResult records keyed to the candidate fabric version and the scenario set version.

flowchart LR
    PP[Persona Pool] --> SCN[Synthetic Scenarios]
    DS[Production Trace Dataset] --> SCN
    DSPY[DSPy Assertions] --> SCN

    SCN --> SBX[Sandbox: Inspect AI]
    SBX --> JB[Outcome Judge]
    SBX --> JT[Trajectory Judge]
    SBX --> JF[Faithfulness: RAGAS + TLM]
    SBX --> RT[Red-Team: AgentDojo + Promptfoo]

    JB --> AGG[Aggregate: pass^k + $/task + regression-delta]
    JT --> AGG
    JF --> AGG
    RT --> AGG

    AGG --> REL{Release Gate}
    REL -->|pass| SH[Shadow + Canary]
    REL -->|block| RG[Regression Report]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class PP,SCN,DS,DSPY,SBX,JB,JT,JF,RT,AGG,REL,SH,RG stage
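
The document does not specify the EvalResult schema, so the record below is only an illustrative shape. It assumes the record is keyed by fabric and scenario-set version and carries the aggregate metrics named in the diagram plus an attack success rate; every field name is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalResult:
    """One pipeline run, keyed to the candidate and scenario-set versions."""
    fabric_version: str
    scenario_set_version: str
    pass_k: float
    cost_per_success: float
    regression_delta: float
    attack_success_rate: float

    def key(self) -> tuple[str, str]:
        """Identity of the run: the same key means the same candidate and scenarios."""
        return (self.fabric_version, self.scenario_set_version)

    def to_json(self) -> str:
        """Serialize with sorted keys so stored records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Freezing the dataclass keeps a recorded result immutable once emitted, which suits an audit trail.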

Open-Source Stack

The platform integrates with the open agent-evaluation stack rather than reinventing it. Recommended integrations:

| Layer | Tool |
| --- | --- |
| Sandboxing + scaffolding | Inspect AI |
| Adversarial vectors | AgentDojo, Promptfoo |
| Pytest-style agent metrics | DeepEval |
| RAG faithfulness | RAGAS |
| Trust score | Cleanlab TLM |
| Trajectory eval | LangChain agentevals |
| Trace capture + dataset promotion | LangSmith, Arize Phoenix, Helicone, Langfuse |
| OTel-standard tracing | OpenTelemetry GenAI semantic conventions |

Governance Of Testing

Test is the gate between candidate and production. Every Test decision is versioned, signed, and auditable.

Test Governance Requirements

  • Every release records the scenario-set version, candidate fabric version, model pins, and aggregate metrics.
  • Cost-per-success is computed automatically and cannot be hand-overridden.
  • Production traces used as eval data are PII-scrubbed and consented per policy.
  • Adversarial coverage is mandatory; the OWASP Agentic Top 10 must score within threshold.
  • Shadow / canary gates are mandatory; full rollout without canary is not permitted for production fabrics.
  • Regressions automatically file an issue with offending trace IDs; silent regressions are not permitted.

Cross-References

  • Setup: release tiers, budgets, and governance scope that the Test layer enforces.
  • Data: data shapes the Test layer ingests and the egress chain it validates.
  • Flow: failure modes the Test layer exercises end-to-end.
  • Publish: what happens after the Test layer's release gate passes.
  • Emergence › WorldSim: simulation-time validation that flows scenarios into the Test layer.
  • Reference › Research › Fabric Test: citations, selection criteria, and source research.