Test¶
Fabric Test defines how the configured multi-agent system is validated end-to-end against real and synthetic data, with every release decision tied to the full cost of a single successful run of the entire fabric. Where Flow commits the operational behavior, the Test layer commits the validation strategy: which scenarios the fabric must pass, which judges score the outcome, which metrics gate the release, and which cost ceilings define "successful run" before any candidate fabric is permitted to advance to Publish.
Fabric Test is informed by the active agent-evaluation research lineage, including τ-bench and τ²-bench (the source of the pass^k reliability metric for tool-using agents), SWE-bench Verified, MLE-bench, AgentBoard sub-goal trajectory evaluation, the Anthropic Demystifying Evals guidance, Inspect AI by UK AISI, AgentDojo adversarial prompt-injection eval, DSPy assertions, RAGAS faithfulness, Cleanlab TLM, and shadow / canary deployment patterns. Selection criteria for testing techniques are documented in Reference › Research › Fabric Test.
What The Test Layer Validates¶
The Test layer commits five validation surfaces and treats every release decision as a function of all five. No surface can be skipped; a candidate fabric that passes outcome but fails cost-per-success is not permitted to advance.
flowchart LR
TC[Test Layer] --> OUT[Outcome Correctness]
TC --> TRAJ[Trajectory Quality]
TC --> FAITH[Faithfulness / Hallucination]
TC --> COST[Cost-Per-Success]
TC --> SAFE[Safety / Adversarial]
OUT --> REL{Release Gate}
TRAJ --> REL
FAITH --> REL
COST --> REL
SAFE --> REL
REL -->|pass| PUB[Publish]
REL -->|regress| BLK[Block + Issue]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class TC,OUT,TRAJ,FAITH,COST,SAFE,REL,PUB,BLK stage
Test Suites¶
Test material in the platform is partitioned into five suites, each with a distinct purpose, scoring method, and refresh cadence. Suites correspond to the same partitioning used in WorldSim so scenarios flow seamlessly between simulation-time and production-time evaluation.
| Suite | Source | Purpose | Refresh |
|---|---|---|---|
| Validation | Curated golden trajectories | Release gate; pass-or-block | Manual review per release |
| Regression | Captured production traces (PII-scrubbed) | Detect outcome drift on every change | Auto-promoted from production traffic |
| Exploration | Synthetic scenarios generated from persona pools | Probe adjacent behaviors not covered by current scenarios | Refreshed weekly |
| Stress | Cost / latency / token-budget worst cases | Verify the fabric survives load and edge tails | Each release |
| Adversarial | AgentDojo + Promptfoo + bespoke red-team | Prompt injection, memory poisoning, tool misuse | Pre-launch + monthly |
Outcome Scoring¶
Outcome scoring is multi-axis by design; a single pass / fail signal hides regressions in cost or safety. The platform scores every test run on at least five axes and surfaces them as independent metrics, never collapsed into a single number.
Pass^k Reliability¶
Production-grade fabrics report pass^k (the probability of k consecutive successes on the same task) rather than pass^1. τ-bench shows frontier models routinely drop to pass^8 < 25% even when pass^1 ≈ 50%, so single-shot success rate is not a sufficient reliability signal for stochastic policies. The platform applies a minimum pass^k threshold per release tier, configurable in Setup.
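As a minimal sketch of how pass^k can be estimated from repeated trials, the snippet below uses the combinatorial estimator C(c, k) / C(n, k) associated with the τ-bench work, where n is the number of trials per task and c the number of successes. The function names and the `(successes, trials)` input shape are assumptions made for illustration, not the platform's API.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k for one task as C(c, k) / C(n, k)."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    # comb(c, k) is 0 when c < k, so a task with fewer than k successes scores 0.
    return comb(successes, k) / comb(trials, k)


def suite_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    if not results:
        raise ValueError("at least one task result is required")
    return sum(pass_hat_k(c, n, k) for c, n in results) / len(results)
```

A task that succeeds on 4 of 8 trials has pass^1 = 0.5 but pass^8 = 0 under this estimator, which is why the two numbers can diverge so sharply.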
Sub-Goal Trajectory Progress¶
AgentBoard sub-goal progress rate scores partial completion when the overall outcome fails, allowing operators to localize a regression to a specific stage or agent. The Test layer uses trajectory progress as a triage signal rather than a release gate; AgentRewardBench shows trajectory judges disagree with humans roughly 30% of the time, which is acceptable for triage but not for a hard gate.
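The triage use of sub-goal progress can be illustrated with a deliberately simplified sketch: it scores the fraction of annotated sub-goals a run reached and reports the first one it missed. This is not AgentBoard's implementation; the sub-goal names and both helper functions are hypothetical.

```python
from typing import Optional


def progress_rate(subgoals: list[str], achieved: set[str]) -> float:
    """Fraction of the annotated sub-goals that the trajectory reached."""
    if not subgoals:
        raise ValueError("at least one sub-goal is required")
    return sum(1 for goal in subgoals if goal in achieved) / len(subgoals)


def first_unmet(subgoals: list[str], achieved: set[str]) -> Optional[str]:
    """First sub-goal, in annotation order, that the trajectory did not reach."""
    return next((goal for goal in subgoals if goal not in achieved), None)


# Hypothetical refund workflow used only to show the call shape.
SUBGOALS = ["lookup_order", "verify_identity", "issue_refund", "notify_user"]
reached = {"lookup_order", "verify_identity"}
rate = progress_rate(SUBGOALS, reached)
stalled_at = first_unmet(SUBGOALS, reached)
```

Here the run failed overall but completed half of its sub-goals, and the first unmet sub-goal points the operator at the stage to inspect.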
Faithfulness And Hallucination¶
For RAG-using agents, faithfulness is scored with RAGAS atomic-claim decomposition, Vectara HHEM-2.1, and Cleanlab TLM trustworthiness scoring. TLM combines self-reflection, sample-consistency, and token-logprob uncertainty, and is reported to rank first across FinanceBench, PubMedQA, and four RAG suites.
Safety And Adversarial¶
Adversarial evaluation runs against AgentDojo (97 realistic tasks and 629 security test cases over real tool execution), Promptfoo's 500+ attack-vector library, and bespoke red-team scenarios derived from the OWASP Top 10 for Agentic Applications. The platform applies a maximum Attack Success Rate threshold per release tier.
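A minimal sketch of the Attack Success Rate gate, assuming each adversarial case is reduced to a boolean "attack succeeded" flag. The function names and the threshold values in the usage note are hypothetical; real thresholds come from the release tier configured in Setup.

```python
def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """Fraction of adversarial cases in which the attacker's goal was achieved."""
    if not attack_succeeded:
        raise ValueError("at least one adversarial case is required")
    return sum(attack_succeeded) / len(attack_succeeded)


def passes_asr_gate(attack_succeeded: list[bool], max_asr: float) -> bool:
    """True when the observed rate is at or below the tier's maximum."""
    return attack_success_rate(attack_succeeded) <= max_asr
```

For example, one successful injection in four cases gives a rate of 0.25, which passes a hypothetical ceiling of 0.5 and fails one of 0.1.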
Cost-Per-Success¶
Cost-per-success is the platform's primary release-economics metric. It is computed as total_dollars / successful_tasks over a fixed scenario set, multiplied by a retry-amortization factor, with cache hits, batch discounts, and reserved capacity included so the metric reflects the production cost surface rather than the rack-rate forecast. The full computation is documented in the next section.
Cost-Per-Success Computation¶
Every release decision is anchored on cost-per-success. The metric is computed against the union of the validation and regression suites and is logged per release in the audit trail. The platform's evaluation pipeline emits the metric automatically; operators are not permitted to hand-compute or override it.
flowchart LR
SC[Scenarios] --> RUN[Run Fabric in Sandbox]
RUN --> OUT[Outcome Per Scenario]
OUT --> SUM[Aggregate Success Count]
RUN --> TOK[Token Spend]
RUN --> TC[Tool Spend]
RUN --> CC[Connector Spend]
TOK --> CSP[Total Dollar Spend]
TC --> CSP
CC --> CSP
CSP --> CPS[Cost / Successful Task]
SUM --> CPS
CPS --> AMORT[Retry-Amortized]
AMORT --> REL{Release Gate}
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class SC,RUN,OUT,SUM,TOK,TC,CC,CSP,CPS,AMORT,REL stage
| Input | Source |
|---|---|
| Successful tasks | Outcome oracle predicate (state hash, unit tests, schema match, judge verdict). |
| Token spend | Provider billing API, with cache-hit and batch-discount adjustments applied. |
| Tool spend | Per-tool cost ledger. |
| Connector spend | Per-connector cost ledger (API charges, data transfer, vector store reads). |
| Retry amortization | Mean retry count across the run set; cost is scaled by (1 + retry_mean). |
The release gate applies a per-fabric cost-per-success ceiling, defaulted from prior production cost and refined by WorldSim evolutionary runs.
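The computation described above can be sketched as follows, assuming the three spend figures already include cache-hit and batch-discount adjustments. The `RunCosts` type, its field names, and the example figures are illustrative assumptions, not the platform's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunCosts:
    token_dollars: float      # provider billing, after cache and batch adjustments
    tool_dollars: float       # per-tool cost ledger
    connector_dollars: float  # per-connector cost ledger
    successful_tasks: int     # count from the outcome oracle
    retry_mean: float         # mean retry count across the run set


def cost_per_success(run: RunCosts) -> float:
    """total_dollars / successful_tasks, scaled by (1 + retry_mean)."""
    if run.successful_tasks <= 0:
        return float("inf")  # no successes: the run can never clear a ceiling
    total = run.token_dollars + run.tool_dollars + run.connector_dollars
    return total / run.successful_tasks * (1.0 + run.retry_mean)


def within_ceiling(run: RunCosts, ceiling: float) -> bool:
    """Apply a per-fabric cost-per-success ceiling."""
    return cost_per_success(run) <= ceiling


# Hypothetical run: $100 total spend, 80 successes, 0.25 mean retries.
example = RunCosts(60.0, 25.0, 15.0, successful_tasks=80, retry_mean=0.25)
```

With these figures the metric is 100 / 80 × 1.25 = 1.5625 dollars per successful task. Returning infinity for a run with zero successes is a design choice in this sketch so that such a run always fails the gate.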
Real-Data Validation¶
The platform's production-data validation pipeline follows the four-stage industry pattern: log → replay → shadow → canary. Each stage is governed by an explicit promotion criterion before traffic advances.
flowchart LR
PROD[Production Traffic] --> LOG[Log + PII Scrub]
LOG --> DS[Replay Dataset]
DS --> RP[Replay Against Candidate]
RP --> SH[Shadow Mode 0%]
SH --> CA[Canary 5%]
CA --> CA2[Canary 25%]
CA2 --> GA[Full Rollout]
SH -. regress .-> BLK[Block]
CA -. regress .-> RB[Rollback]
CA2 -. regress .-> RB
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class PROD,LOG,DS,RP,SH,CA,CA2,GA,BLK,RB stage
- Log + PII Scrub captures every production request and strips sensitive material. LangSmith and Phoenix expose dataset-from-trace promotion so last week's traffic becomes an eval set with one operation.
- Replay runs the candidate fabric against the dataset in an isolated runtime with no live side effects.
- Shadow runs the candidate in parallel with production but does not surface its outputs to users; humans remain decision-makers.
- Canary ramps the candidate to 5% → 25% → 100% with rolling pass^k and cost-per-success gates between each step.
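The canary ramp can be sketched as a small state function: given the current traffic share and the rolling metrics, it returns the next share, or zero to signal rollback. The ramp tuple mirrors the 5% → 25% → 100% steps above; the function name and threshold parameters are hypothetical.

```python
RAMP = (0.05, 0.25, 1.0)  # canary 5%, canary 25%, full rollout


def next_traffic_share(
    current: float,
    pass_k: float,
    cost_per_success: float,
    *,
    min_pass_k: float,
    cost_ceiling: float,
) -> float:
    """Return the next traffic share, or 0.0 to roll back on a failed gate."""
    if pass_k < min_pass_k or cost_per_success > cost_ceiling:
        return 0.0  # regression: roll back to the production fabric
    step = RAMP.index(current)  # raises ValueError for a share not on the ramp
    return RAMP[min(step + 1, len(RAMP) - 1)]
```

Both gates are evaluated before every promotion, so a candidate that is reliable but too expensive is rolled back just like one that regresses on pass^k.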
Synthetic Data¶
Synthetic scenarios fill gaps where production traces are scarce, where edge cases are rare, or where adversarial coverage is required. The platform supports three synthetic-data sources:
- Persona-driven synthetic users (the τ-bench / τ²-bench foundation): an LLM simulates the user under a persona card and a goal, enabling pass^k by re-sampling.
- DSPy assertions: boolean constraints embedded in the agent program that double as programmatic eval gates and self-refinement signals; stable enough to run in CI.
- Inspect AI procedurally generated scenarios: sandbox-isolated scenarios programmatically generated across solver / task primitives.
Hybrid real + synthetic data is the dominant 2025-2026 evaluation pattern: production traces anchor distribution and prevent collapse, synthetic traces add coverage of rare and adversarial cases.
LLM-As-Judge¶
For open-ended outputs where a deterministic oracle is unavailable, the platform uses calibrated LLM-as-judge scoring. Judge configuration applies the 2024-2026 mitigations against the four documented biases: position, verbosity, self-preference, and style.
| Bias | Mitigation |
|---|---|
| Position | Pairwise scoring with position swap; aggregate both orientations. |
| Verbosity | Length-controlled rubric; reward density rather than volume. |
| Self-preference | Use a different model family for the judge than the agent. |
| Style | Style-controlled prompts; calibration set with known-equivalent outputs. |
Judge ensembles plus calibrated rubrics plus bias-corrected confidence intervals are required before any judge score is treated as a release metric.
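The position-swap mitigation can be illustrated with a short sketch. It assumes a hypothetical `judge` callable that sees two outputs in order and returns which position it prefers; a win is counted only when both orientations agree, and anything else is treated as a tie.

```python
from typing import Callable, Literal

Verdict = Literal["first", "second", "tie"]
Winner = Literal["baseline", "candidate", "tie"]


def swap_debiased_verdict(
    judge: Callable[[str, str], Verdict], baseline: str, candidate: str
) -> Winner:
    """Score both orientations and keep only position-consistent wins."""
    forward = judge(baseline, candidate)   # baseline shown first
    backward = judge(candidate, baseline)  # candidate shown first
    if forward == "first" and backward == "second":
        return "baseline"
    if forward == "second" and backward == "first":
        return "candidate"
    return "tie"  # disagreement across orientations signals position bias


def always_first(a: str, b: str) -> Verdict:
    """Stand-in judge with pure position bias, for demonstration only."""
    return "first"


def prefers_longer(a: str, b: str) -> Verdict:
    """Stand-in judge that is position-independent, for demonstration only."""
    if len(a) == len(b):
        return "tie"
    return "first" if len(a) > len(b) else "second"
```

A judge that always favors the first position produces a tie under this scheme, so its bias cannot move the release metric. The stand-in judges exist only to exercise the aggregation; a real judge would be an LLM call from a different model family than the agent.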
Regression Replay¶
Every release pins a "golden trajectory" dataset of 200-1000 production traces. On every change, the platform replays the candidate fabric against the dataset, compares outputs with a calibrated pairwise judge, and gates on regression-rate plus cost-delta. Any regression beyond the configured threshold automatically opens an issue with the offending trace IDs attached.
flowchart LR
TR[Production Traces] --> DS[Golden Dataset]
DS --> RB[Run Baseline]
DS --> RC[Run Candidate]
RB --> DJ[Calibrated Judge]
RC --> DJ
DJ --> SC[Δ pass^k + Δ $/task + Δ Trajectory]
SC --> G{Regress > Threshold?}
G -->|no| PROMO[Promote]
G -->|yes| ISS[Auto-File Issue + Block]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class TR,DS,RB,RC,DJ,SC,G,PROMO,ISS stage
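A minimal sketch of the replay gate, assuming the calibrated judge has already produced a per-trace verdict and that per-trace costs are available for both runs. The `TraceComparison` type, its fields, and the threshold parameters are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceComparison:
    trace_id: str
    verdict: str           # "baseline", "candidate", or "tie" from the judge
    baseline_cost: float   # dollars spent by the baseline on this trace
    candidate_cost: float  # dollars spent by the candidate on this trace


def regression_gate(
    comparisons: list[TraceComparison],
    max_regression_rate: float,
    max_cost_delta: float,
) -> tuple[bool, list[str]]:
    """Return (promote, offending_trace_ids) for one replay run."""
    if not comparisons:
        raise ValueError("the golden dataset must not be empty")
    # A trace regresses when the judge prefers the baseline output.
    offending = [c.trace_id for c in comparisons if c.verdict == "baseline"]
    regression_rate = len(offending) / len(comparisons)
    base_total = sum(c.baseline_cost for c in comparisons)
    cand_total = sum(c.candidate_cost for c in comparisons)
    cost_delta = (cand_total - base_total) / base_total  # relative change
    promote = regression_rate <= max_regression_rate and cost_delta <= max_cost_delta
    return promote, offending


# Hypothetical four-trace replay with one regression and unchanged cost.
replay = [
    TraceComparison("t1", "tie", 1.0, 1.0),
    TraceComparison("t2", "baseline", 1.0, 1.0),
    TraceComparison("t3", "candidate", 1.0, 1.0),
    TraceComparison("t4", "tie", 1.0, 1.0),
]
```

The returned trace IDs are what an auto-filed issue would attach. Using the relative change in total cost is one reasonable reading of "cost-delta" and is an assumption of this sketch.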
Reliability Under Stress¶
Agent stress tests differ from web load tests. The platform's stress suite covers three documented failure modes: context-length blow-up (latency and token cost grow super-linearly once context exceeds a workload-specific threshold), horizontal-scaling limits (model weights and KV caches are GPU-bound), and token-budget exhaustion under unexpected load. Stress tests are run pre-release and continuously in production via a small fraction of synthetic traffic.
Eval Pipeline¶
The platform composes synthetic and real evaluation into a single pipeline that is gated, observable, and reproducible. The pipeline emits structured EvalResult records keyed to the candidate fabric version and the scenario set version.
flowchart LR
PP[Persona Pool] --> SCN[Synthetic Scenarios]
DS[Production Trace Dataset] --> SCN
DSPY[DSPy Assertions] --> SCN
SCN --> SBX[Sandbox: Inspect AI]
SBX --> JB[Outcome Judge]
SBX --> JT[Trajectory Judge]
SBX --> JF[Faithfulness: RAGAS + TLM]
SBX --> RT[Red-Team: AgentDojo + Promptfoo]
JB --> AGG[Aggregate: pass^k + $/task + regression-delta]
JT --> AGG
JF --> AGG
RT --> AGG
AGG --> REL{Release Gate}
REL -->|pass| SH[Shadow + Canary]
REL -->|block| RG[Regression Report]
classDef stage fill:#ffd541,stroke:#222021,color:#222021
class PP,SCN,DS,DSPY,SBX,JB,JT,JF,RT,AGG,REL,SH,RG stage
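One possible shape for an EvalResult record is sketched below. Only the two version keys are stated in the text above; the metric fields, the `key` property, the JSON serialization, and the example version strings are assumptions chosen to match the aggregate step in the diagram.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalResult:
    fabric_version: str        # candidate fabric version
    scenario_set_version: str  # scenario set version
    pass_k: float
    cost_per_success: float
    regression_delta: float

    @property
    def key(self) -> tuple[str, str]:
        """Identity of the record: (fabric version, scenario set version)."""
        return (self.fabric_version, self.scenario_set_version)

    def to_json(self) -> str:
        """Serialize with sorted keys so identical records produce identical text."""
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical record; the version strings are placeholders.
record = EvalResult("fabric-1.4.0", "scenarios-2025.06", 0.82, 1.56, 0.01)
```

Freezing the dataclass and sorting keys on output keeps records immutable and byte-stable, which helps when the same evaluation is re-run for reproducibility checks.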
Open-Source Stack¶
The platform integrates with the open agent-evaluation stack rather than reinventing it. Recommended integrations:
| Layer | Tool |
|---|---|
| Sandboxing + scaffolding | Inspect AI |
| Adversarial vectors | AgentDojo, Promptfoo |
| Pytest-style agent metrics | DeepEval |
| RAG faithfulness | RAGAS |
| Trust score | Cleanlab TLM |
| Trajectory eval | LangChain agentevals |
| Trace capture + dataset promotion | LangSmith, Arize Phoenix, Helicone, Langfuse |
| OTel-standard tracing | OpenTelemetry GenAI semantic conventions |
Governance Of Testing¶
Test is the gate between candidate and production. Every Test decision is versioned, signed, and auditable.
Test Governance Requirements
- Every release records the scenario-set version, candidate fabric version, model pins, and aggregate metrics.
- Cost-per-success is computed automatically and cannot be hand-overridden.
- Production traces used as eval data are PII-scrubbed and consented per policy.
- Adversarial coverage is mandatory; the OWASP Agentic Top 10 must score within threshold.
- Shadow / canary gates are mandatory; full rollout without canary is not permitted for production fabrics.
- Regressions automatically file an issue with offending trace IDs; silent regressions are not permitted.
Cross-References¶
- Setup: release tiers, budgets, and governance scope that the Test layer enforces.
- Data: data shapes the Test layer ingests and the egress chain it validates.
- Flow: failure modes the Test layer exercises end-to-end.
- Publish: what happens after the Test layer's release gate passes.
- Emergence › WorldSim: simulation-time validation that flows scenarios into the Test layer.
- Reference › Research › Fabric Test: citations, selection criteria, and source research.