Fabric Test

The agent evaluation and testing methodology referenced throughout Fabric › Test draws on the 2024-2026 agent-evaluation research lineage. This page catalogs the canonical benchmarks, scoring metrics, synthetic and real data approaches, judge frameworks, and the selection criteria the platform uses when configuring Test for a given workload.

Benchmarks For Full Agent Fabrics

Full-system / outcome-graded benchmarks:

  • τ-bench (Sierra) measures end-to-end tool-agent-user conversations against domain APIs and policies, scoring on goal-state hash, not intermediate steps; τ²-bench (Jun 2025) adds dual-control where both user and agent mutate shared state.
  • SWE-bench Verified (500 human-validated GitHub issues) and MLE-bench (75 Kaggle competitions, ICLR 2025 Oral) score on patch / submission outcomes vs real test suites and leaderboards.
  • GAIA and BrowseComp (1,266 inverted-question browsing problems) grade final-answer correctness only.

Process / intermediate benchmarks:

  • AgentBench (ICLR 2024) reports per-task success and exposes failure categories.
  • AgentBoard (NeurIPS 2024 Oral) is the canonical sub-goal / progress-rate benchmark.
  • WebArena (812 templated tasks) and OSWorld are outcome-graded but include URL-prefix / DOM-state checkers that effectively award partial credit.
  • OSUniverse (2025) modernizes OSWorld with modular evaluators.

Cost-Per-Success Metrics

The dominant 2025 reframing is to report $/successful task, not $/token. τ-bench's pass^k (the probability that all k independent trials of the same task succeed) is the de facto reliability metric: GPT-4-class agents drop to pass^8 < 25% on retail despite roughly 50% pass^1. Production teams (Anthropic engineering, Braintrust, QuantumBlack) report a success-weighted cost: total tokens (or dollars) divided by tasks completed end-to-end, multiplied by a retry-amortization factor. Rack-rate forecasts overstate true spend 3-10× because cache hits, batch discounts, and reserved capacity dominate. Anthropic's Demystifying Evals recommends logging cost per trajectory alongside outcome to detect runaway-loop regressions before they hit billing.
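The two metrics above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names are hypothetical, pass^k is estimated per task as C(c, k) / C(n, k) for c successes in n trials and then averaged over tasks, and the retry-amortization factor is treated as a plain multiplier because the text does not define it further.

```python
from math import comb


def pass_hat_k(trials_per_task: list[list[bool]], k: int) -> float:
    """Estimate pass^k: the chance that k trials of the same task all succeed.

    For each task with n recorded trials and c successes, the per-task
    estimate is C(c, k) / C(n, k); the result is the mean over tasks.
    """
    if not trials_per_task:
        raise ValueError("need at least one task")
    scores = []
    for outcomes in trials_per_task:
        n, c = len(outcomes), sum(outcomes)
        if k < 1 or k > n:
            raise ValueError("k must be between 1 and the number of trials")
        # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)


def cost_per_success(total_cost_usd: float, successes: int,
                     retry_factor: float = 1.0) -> float:
    """Dollars per task completed end-to-end, scaled by an assumed retry factor."""
    if successes == 0:
        return float("inf")  # nothing succeeded, so the cost is unbounded
    return total_cost_usd / successes * retry_factor
```

Feeding the same trial log to `pass_hat_k` with k = 1 and a larger k shows how quickly reliability falls off relative to single-shot success.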

Synthetic Data Generation

Persona-driven synthetic users are the τ-bench / τ²-bench foundation; an LLM simulates the customer with a persona card and a goal, enabling pass^k by re-sampling. DSPy provides dspy.Assert / dspy.Suggest boolean constraints that double as programmatic eval gates and self-refinement signals; assertions are stable enough to power CI. Inspect AI (UK AISI) ships solver / task primitives for procedurally generating scenarios with sandbox isolation. LMArena-style pairwise crowd ratings remain the gold reference but are too slow for inner-loop dev.

Real-Data Validation

The standard pipeline is log → replay → shadow → canary. Log-and-replay captures every production request, re-runs it through the candidate agent in an isolated runtime (no DB writes), and diffs the outputs with an LLM judge. Shadow mode runs candidate agents in parallel with the production system while humans remain the decision-makers. Canary expands traffic 5% → 10% → 25% → 100% with rate-of-error gates at each step. LangSmith and Phoenix both expose dataset-from-trace promotion, which turns last week's production traffic into an eval set with one click. (Sources: Brightlume on shadow-mode rollouts; Tian Pan on canary releases.)
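A rate-of-error gate for the canary ramp might look like the sketch below. The stage percentages come from the text; the function name, the 1% error threshold, and the minimum sample size are illustrative assumptions rather than recommended values.

```python
CANARY_STAGES = (5, 10, 25, 100)  # percent of traffic, as listed above


def next_canary_stage(current_pct: int, errors: int, requests: int,
                      max_error_rate: float = 0.01,
                      min_requests: int = 500) -> int:
    """Return the next traffic percentage for the candidate agent.

    0 means roll back, the current value means hold, and a larger value
    means promote. Thresholds are hypothetical defaults.
    """
    if current_pct not in CANARY_STAGES:
        raise ValueError(f"unknown stage: {current_pct}")
    if requests < min_requests:
        return current_pct  # not enough traffic observed yet: hold
    if errors / requests > max_error_rate:
        return 0  # error gate tripped: roll back
    idx = CANARY_STAGES.index(current_pct)
    # Promote one stage at a time; stay at 100% once fully rolled out.
    return CANARY_STAGES[min(idx + 1, len(CANARY_STAGES) - 1)]
```

Returning a percentage rather than a boolean keeps hold, promote, and roll back distinguishable for whatever controller drives the rollout.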

LLM-As-Judge

The 2024-2026 literature converged on four documented biases: position, verbosity, self-preference, and style. Style is now considered dominant (0.76-0.92 effect size across models, larger than position). Pairwise judging (MT-Bench, Chatbot Arena) is more reliable than pointwise scoring but costs roughly 2× as much; G-Eval (chain-of-thought + form-filled scoring) and Auto-J (open 13B judge) remain the open-weights baselines. Mitigations combine position swapping, multi-judge ensembles, calibrated rubrics, and bias-corrected estimators with confidence intervals.
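Position swapping and multi-judge ensembling can be sketched as below. The `Judge` callable is a hypothetical stand-in for an LLM call that returns "first", "second", or "tie"; no real judge API is assumed, and the majority rule is one simple choice among several.

```python
from collections import Counter
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict.
Judge = Callable[[str, str, str], str]


def swap_consistent_verdict(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Ask the judge in both orders; keep a winner only if both orders agree."""
    forward = judge(prompt, a, b)   # a is shown first
    backward = judge(prompt, b, a)  # b is shown first
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_winner = {"first": "b", "second": "a"}.get(backward, "tie")
    # A judge that merely prefers one position disagrees with itself here.
    return forward_winner if forward_winner == backward_winner else "tie"


def ensemble_verdict(judges: list[Judge], prompt: str, a: str, b: str) -> str:
    """Strict-majority vote over swap-consistent verdicts from several judges."""
    votes = Counter(swap_consistent_verdict(j, prompt, a, b) for j in judges)
    top, count = votes.most_common(1)[0]
    return top if count > len(judges) / 2 else "tie"
```

A judge that always picks whichever answer it sees first collapses to "tie" under this scheme, which is the intended effect of the swap.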

Trajectory Versus Outcome Evaluation

Outcome eval (SWE-bench, GAIA, τ-bench goal hash) is cheap and unambiguous but hides "right answer, wrong reason." Trajectory eval (AgentBoard progress rate, LangChain agentevals, TRACE's Hierarchical Trajectory Utility Function, Feb 2026) scores the full trace: tool-call order, intermediate reasoning, and sub-goal achievement. Multi-agent fabrics need both: outcome eval to gate releases, trajectory eval to localize regressions. AgentRewardBench (Apr 2025) shows trajectory judges disagree with humans roughly 30% of the time, so trajectory eval is best used as a triage signal.
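To make the trajectory side concrete, here is a deliberately simplified progress-rate sketch. It is not AgentBoard's actual scorer: it assumes sub-goals are an ordered list of step labels and that a step reaches a sub-goal by exact string match, both of which are illustrative assumptions.

```python
def progress_rate(trajectory: list[str], subgoals: list[str]) -> float:
    """Fraction of ordered sub-goals reached by the trajectory.

    Sub-goals must be hit in order; steps that match nothing are ignored.
    """
    if not subgoals:
        raise ValueError("need at least one sub-goal")
    reached = 0
    for step in trajectory:
        if reached < len(subgoals) and step == subgoals[reached]:
            reached += 1  # advance to the next expected sub-goal
    return reached / len(subgoals)
```

A run that fails the outcome check but scores well here failed late in the trajectory, which is the localization signal the paragraph above describes.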

Hallucination / Faithfulness For RAG

RAGAS faithfulness decomposes the answer into atomic claims and verifies each against the retrieved context with an LLM judge; agentic RAG (2025) added task_completion and tool-context-faithfulness metrics. Vectara HHEM-2.1 is a fine-tuned classifier and is more robust than RAGAS' LLM-judge approach. Cleanlab TLM combines self-reflection, sample-consistency, and token-logprob uncertainty, and is reported to rank first across FinanceBench, PubMedQA, and four RAG suites.
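The claim-level scoring idea reduces to a ratio of supported claims to total claims. The sketch below shows only that ratio and does not use the RAGAS library: claims are assumed to be already extracted, and `is_supported` is a hypothetical verifier that an LLM judge or classifier would fill in.

```python
from typing import Callable


def faithfulness(claims: list[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Share of atomic claims that the verifier finds supported by the context."""
    if not claims:
        raise ValueError("need at least one claim to score")
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def substring_verifier(claim: str, context: str) -> bool:
    """Toy verifier for demonstration only: case-insensitive substring match."""
    return claim.lower() in context.lower()
```

In practice the verifier carries most of the difficulty; the substring version exists only so the scoring arithmetic can be exercised without a model.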

Adversarial / Red-Team Testing

AgentDojo (ETH Zurich; integrated into UK AISI Inspect) provides 97 realistic user tasks and 629 security test cases for prompt injection. A NIST / CAISI / UK AISI / Gray Swan 2025 competition logged 250k+ attacks from 400+ participants, with at least one successful hijack on every frontier model; novel attack templates raised the task-hijack rate from 11% to 81%. The OWASP Top 10 for Agentic Applications (December 2025) names indirect prompt injection, memory poisoning, tool misuse, and excessive agency as canonical risks. Promptfoo ships 500+ attack vectors.

Regression Testing And Replay

  • LangSmith: node-level state diffs, full agent graphs, and replay against new model versions; the deepest integration for LangGraph fabrics.
  • Arize Phoenix: drift detection, embedding analysis, and OSS eval primitives.
  • Helicone: proxy-level capture; shallower but install-free.

Standard pattern: pin a "golden trajectory" dataset of 200-1000 production traces; on every change, replay the traces, compare outputs with a calibrated pairwise judge, and gate on regression rate plus cost delta.
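The gating step of that pattern can be sketched as below, assuming the replay and pairwise judging have already run. The `ReplayResult` fields and both thresholds are hypothetical; none of this reflects a LangSmith, Phoenix, or Helicone API.

```python
from dataclasses import dataclass


@dataclass
class ReplayResult:
    trace_id: str
    baseline_preferred: bool  # judge preferred the old output: a regression
    baseline_cost_usd: float
    candidate_cost_usd: float


def regression_gate(results: list[ReplayResult],
                    max_regression_rate: float = 0.05,
                    max_cost_delta: float = 0.10) -> dict:
    """Summarize a golden-set replay and decide whether the change may ship."""
    if not results:
        raise ValueError("golden set is empty")
    regression_rate = sum(r.baseline_preferred for r in results) / len(results)
    base = sum(r.baseline_cost_usd for r in results)
    cand = sum(r.candidate_cost_usd for r in results)
    # Relative cost change of the candidate versus the pinned baseline.
    cost_delta = (cand - base) / base if base > 0 else float("inf")
    passed = regression_rate <= max_regression_rate and cost_delta <= max_cost_delta
    return {"regression_rate": regression_rate,
            "cost_delta": cost_delta,
            "passed": passed}
```

Returning the two rates alongside the verdict lets a CI job report why a change was blocked, not just that it was.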

Reliability Under Stress

Agent stress tests fail differently from web load tests (The New Stack). Documented failure modes: context-length blow-up (latency 1.2 s → 8.7 s with 9× token cost when context hit 21k tokens in one case); horizontal scaling limits because model weights / KV-cache are GPU-bound; token-budget exhaustion (one team burned the API budget in 90 minutes). Test for context-scaling curve, tool-orchestration depth under realistic queries, and chain runaway from low-temperature regressions.
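One way a stress harness might surface chain runaway and token-budget exhaustion is to wrap each agent step in a hard budget and record when it trips. The class below is a hypothetical sketch; the names and limits are assumptions, and real token counts would come from the model provider's usage metadata.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a run exceeds its token or step allowance."""


class TokenBudget:
    """Tracks cumulative tokens and steps for a single agent run."""

    def __init__(self, max_tokens: int, max_steps: int) -> None:
        self.max_tokens = max_tokens
        self.max_steps = max_steps
        self.used = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one step's token usage; raise once either limit is exceeded."""
        self.steps += 1
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {self.used} > {self.max_tokens}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step budget exceeded: {self.steps} > {self.max_steps}")
```

Logging the step at which the budget trips, per context length, gives a rough view of the context-scaling curve without letting a runaway loop reach billing.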

Open-Source Frameworks

  • Inspect AI (UK AISI): sandboxing toolkit + AgentDojo + frontier-grade scaffolding.
  • Promptfoo: YAML config, 55+ assertions including G-Eval, 500+ red-team vectors.
  • DeepEval: Python pytest-style, 50+ metrics.
  • RAGAS: RAG-specific, claim decomposition.
  • lm-eval-harness (EleutherAI): base-model academic benchmarks.
  • OpenAI Evals: graders for OpenAI-hosted runs.
  • HELM (Stanford CRFM): holistic transparency leaderboard.
  • Cleanlab Cortex / TLM: uncertainty-based trust scoring.

Selection Criteria

| Technique | Best Workload | Cost / Run | Signal Quality | Use When |
|---|---|---|---|---|
| τ-bench / τ²-bench pass^k | Tool-using customer-service agents | High | Very high (reliability, not just capability) | Release gate for a stable fabric |
| AgentBoard progress rate | Early-stage agents missing sub-goals | Medium | High (localizes failure) | Diagnosing where in the trajectory failure occurs |
| SWE-bench Verified / MLE-bench | Coding & ML-engineering agents | High (compute heavy) | High (real test suites) | Domain-specific capability claims |
| WebArena / OSWorld / GAIA / BrowseComp | Browser / desktop / general agents | Medium-High | Outcome-only | Cross-vendor capability comparison |
| DSPy assertions | Inner-loop dev | Very low | Medium (boolean constraints) | Every commit, CI gate |
| RAGAS / TLM / HHEM | RAG-using agents | Low-Medium | High for faithfulness | Any RAG step in the fabric |
| LLM-as-judge (pairwise, calibrated) | Open-ended outputs | Medium | Medium (needs bias control) | When ground truth is absent |
| AgentDojo / Promptfoo red-team | Tool-calling agents | Medium | High for security | Pre-launch + monthly |
| LangSmith / Phoenix replay | Regression after model / prompt change | Low (cached traces) | Very high (real distribution) | Every release |
| Shadow / canary | Pre-GA validation | Operational | Highest (real users) | Final stage before 100% |
| Stress / load | Scaling reliability | Medium | High for ops | Before traffic ramp |

Picking Heuristic

  • Default release gate: pass^k on the validation suite plus cost-per-success ceiling plus regression replay.
  • Add adversarial: AgentDojo + Promptfoo plus bespoke OWASP Agentic Top 10 scenarios before any production launch.
  • Add faithfulness: RAGAS + TLM whenever the fabric performs retrieval.
  • Add trajectory: AgentBoard progress rate when the failure surface needs localization.
  • Always shadow + canary any new fabric or new model pin before full rollout.

Cross-References