World Simulation¶
The simulation environments and outcome-validation benchmarks referenced throughout Emergence › WorldSim draw on the 2023-2026 agentic benchmarking and simulation research lineage. This page catalogs the canonical environments, their scoring methodologies, and the selection criteria the platform uses when configuring a WorldSim environment for a given Unitt.
Reference Environments¶
Generative Agents / Smallville¶
Generative Agents (Park et al., 2023) placed 25 LLM-driven townspeople in a Sims-like grid with memory streams, reflection, and planning. Scoring is qualitative: human raters judge believability and emergent social behavior (for example, a Valentine's Day party invitation spreading by word of mouth). Excellent fidelity-of-emergence model; weak for outcome validation because there is no machine-checkable success criterion.
Voyager¶
Voyager (Wang et al., 2023) demonstrated a Minecraft agent with an automatic curriculum and a self-curated library of skills stored as code. Scored on tech-tree progress (items crafted, biomes visited, distance traveled). Strong demonstration of skill compounding and replayable telemetry; narrow domain limits transfer.
AgentBench / AgentBoard¶
AgentBench and AgentBoard are multi-environment harnesses spanning OS, DB, web, card games, and household tasks. AgentBoard adds fine-grained progress sub-goals instead of binary success. Good breadth and per-step credit assignment; integration cost is non-trivial because of heterogeneous task schemas.
WebArena / VisualWebArena¶
WebArena and VisualWebArena are self-hosted reproductions of GitLab, Reddit, OpenStreetMap, a Magento-based e-commerce store, and similar sites. Scoring is programmatic: URL / DOM / state predicates check the final world state. Gold standard for outcome fidelity on the web; demands Docker-stack maintenance.
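State-predicate scoring of this kind can be sketched as follows. This is a minimal illustration, assuming the final world state can be snapshotted as a nested dictionary; `url_endswith`, `field_equals`, `score_final_state`, and all field names are hypothetical and are not WebArena's actual evaluator API.

```python
from typing import Any, Callable, Dict, List

# Hypothetical snapshot of the world once the agent has finished.
State = Dict[str, Any]
Predicate = Callable[[State], bool]


def url_endswith(suffix: str) -> Predicate:
    """Predicate on the page the agent ended on."""
    return lambda s: str(s.get("url", "")).endswith(suffix)


def field_equals(path: str, expected: Any) -> Predicate:
    """Predicate on a dotted path into the snapshotted backend state."""
    def check(s: State) -> bool:
        node: Any = s
        for key in path.split("."):
            if not isinstance(node, dict) or key not in node:
                return False
            node = node[key]
        return node == expected
    return check


def score_final_state(state: State, predicates: List[Predicate]) -> bool:
    """A task passes only if every predicate holds on the final state."""
    return all(p(state) for p in predicates)


# Illustrative data only: the URL and backend fields are made up.
final_state = {
    "url": "http://localhost:8023/issues/42",
    "backend": {"issue": {"status": "closed", "assignee": "alice"}},
}
checks = [
    url_endswith("/issues/42"),
    field_equals("backend.issue.status", "closed"),
]
print(score_final_state(final_state, checks))
```

Because the oracle inspects only the final state, it is indifferent to the path the agent took, which is what makes the score reproducible across runs.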
OSWorld¶
OSWorld (Xie et al., 2024) provides 369 real-OS tasks across Ubuntu / Windows / macOS covering multi-app workflows. Each task ships an executable verification script that introspects filesystem, registry, and UI state. Highest realism among desktop benchmarks; very slow and flaky to run at scale.
ALFWorld / ScienceWorld¶
ALFWorld bridges the text-based TextWorld engine and the embodied ALFRED housework environment, so that policies learned in text can transfer to embodied tasks. ScienceWorld is a text-only elementary-school science sandbox with roughly 30 tasks. Both score against deterministic goal predicates; cheap to run, but toy semantics under-represent real failure modes.
SWE-bench / SWE-bench Verified¶
SWE-bench provides 2,294 real GitHub issues from popular Python repositories; the agent must produce a patch that passes the hidden test suite. SWE-bench Verified is a 500-task human-validated subset, screened for well-specified issues and fair unit tests. Outcome validation is essentially perfect (tests pass or do not); coverage is Python-only and patch-shaped.
τ-bench¶
τ-bench (Sierra, 2024) is a customer-service tool-use evaluation (retail, airline) where the agent must satisfy a simulated user and update backend state correctly. Scores include pass^k, the probability that all k independent trials of the same task succeed, which exposes policy fragility. Best-in-class for stochastic-policy validation; small domain.
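A small sketch of how pass^k can be estimated from repeated trials. It assumes the combinatorial estimator C(c, k) / C(n, k) averaged over tasks, where c is the number of successes observed in n trials of a task; the function name `pass_hat_k` and the sample counts are illustrative, not taken from the τ-bench codebase.

```python
from math import comb
from typing import List


def pass_hat_k(successes_per_task: List[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k i.i.d. trials of a task all succeed,
    averaged over tasks. Each task contributes C(c, k) / C(n, k)."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= n_trials:
            raise ValueError("success count out of range")
        total += comb(c, k) / comb(n_trials, k)  # comb(c, k) is 0 when c < k
    return total / len(successes_per_task)


# Three hypothetical tasks, four trials each: 4, 3, and 0 successes.
counts = [4, 3, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n_trials=4, k=k), 3))
# prints: 1 0.583, then 2 0.5, then 4 0.333
```

The second task succeeds three times out of four, yet contributes nothing at k = 4, which is the fragility signal the metric is meant to surface.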
GAIA¶
GAIA (Mialon et al., 2023) is 466 questions requiring web browsing, file I/O, multimodality, and tool use, with single-string ground-truth answers. Exact-match scoring tiered by difficulty (L1 / L2 / L3). Cheap, human-easy / AI-hard; binary scoring loses partial-credit signal.
BrowseComp¶
BrowseComp (OpenAI, 2025) is 1,266 deliberately hard browsing questions with verified answers requiring deep, long-horizon search. Exact-match against a hidden gold answer; deliberately resistant to memorization. Great for live-web realism; non-deterministic web causes flaky reruns.
MLE-bench¶
MLE-bench (OpenAI, 2024) provides 75 Kaggle competitions. The agent must train and submit and is scored on the held-out leaderboard metric, then bucketed against human medal thresholds. Authentic ML-engineering loop; runs are expensive (GPU-hours).
MiroFish / OASIS Social Simulation¶
MiroFish is an open-source multi-agent prediction engine. It ingests a seed document, uses GraphRAG to extract entities and relationships into a knowledge graph, spawns large populations of LLM-driven personas (reportedly up to one million in demos), and drops them into simulated Twitter-like and Reddit-like social platforms. The output is a structured prediction report tracking sentiment evolution, topic propagation, and influence dynamics. The underlying social-sim runtime is OASIS from CAMEL-AI (23 social-action types, scales to roughly one million agents). Notable forks include nikmcfly/MiroFish-Offline (English fork, local-only via Neo4j + Ollama) and amadad/mirofish-cli. Emergence clones the same pipeline, which takes a seed document through four stages (GraphRAG → persona population → social-sim runtime → emergent-behavior report), inside WorldSim as one available environment mode.
Selection Criteria¶
The platform selects a WorldSim environment per Unitt by reading the workload profile (scenario fidelity required, scoring rigor required, scale, feedback granularity, integration cost tolerance) and matching against the table below.
| Environment | Scenario Fidelity | Scoring Rigor | Scale (Tasks) | Feedback Granularity | Integration Ease |
|---|---|---|---|---|---|
| Generative Agents | High (social) | Low (human-rated) | 1 world | Trajectory-level | Low (custom engine) |
| Voyager | Med (game) | Med (tech-tree) | Open-ended | Per-skill | Med (MC server) |
| AgentBench | Med | High (programmatic) | 8 envs | Task-level | Med |
| AgentBoard | Med | High (sub-goals) | 9 envs | Per sub-goal | Med |
| WebArena | High (web) | High (state preds) | 812 | Task-level | Low (Docker) |
| OSWorld | Very High (OS) | High (verify scripts) | 369 | Task-level | Low (VMs) |
| ALFWorld / ScienceWorld | Low (toy) | High (predicates) | ~30-100 | Per-step | High |
| SWE-bench Verified | High (code) | Very High (unit tests) | 500 | Test-level | High |
| τ-bench | High (CS) | Very High (pass^k) | 2 domains | Turn + state | Med |
| GAIA | High (real tasks) | Med (exact match) | 466 | Final-answer only | High |
| BrowseComp | High (live web) | Med (exact match) | 1,266 | Final-answer only | High |
| MLE-bench | Very High (Kaggle) | High (leaderboard) | 75 | Metric value | Low (GPU) |
| MiroFish / OASIS social-sim | High (social population) | Med (emergent metrics) | Scenarios × 10⁶ agents | Population-level signals | Med (GraphRAG + sim runtime) |
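The matching step can be illustrated with a small filter over the table. This is a sketch only: the platform's real selection logic is not described here, so `WorkloadProfile`, `candidates`, and the ranking rule are assumptions, and the rows are a hand-transcribed subset of the table with the parenthetical qualifiers dropped.

```python
from dataclasses import dataclass
from typing import List

# Ordinal encoding of the table's qualitative levels.
LEVELS = {"Low": 0, "Med": 1, "High": 2, "Very High": 3}


@dataclass(frozen=True)
class EnvRow:
    name: str
    fidelity: str
    rigor: str
    integration_ease: str


# Subset of the table above, transcribed by hand.
TABLE = [
    EnvRow("WebArena", "High", "High", "Low"),
    EnvRow("OSWorld", "Very High", "High", "Low"),
    EnvRow("SWE-bench Verified", "High", "Very High", "High"),
    EnvRow("τ-bench", "High", "Very High", "Med"),
    EnvRow("GAIA", "High", "Med", "High"),
    EnvRow("ALFWorld / ScienceWorld", "Low", "High", "High"),
]


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical profile: minimum acceptable level on each axis."""
    min_fidelity: str
    min_rigor: str
    min_integration_ease: str


def candidates(profile: WorkloadProfile, table: List[EnvRow] = TABLE) -> List[str]:
    """Keep rows meeting every minimum, ranked by rigor, fidelity, then ease."""
    ok = [
        r for r in table
        if LEVELS[r.fidelity] >= LEVELS[profile.min_fidelity]
        and LEVELS[r.rigor] >= LEVELS[profile.min_rigor]
        and LEVELS[r.integration_ease] >= LEVELS[profile.min_integration_ease]
    ]
    ok.sort(
        key=lambda r: (LEVELS[r.rigor], LEVELS[r.fidelity], LEVELS[r.integration_ease]),
        reverse=True,
    )
    return [r.name for r in ok]


profile = WorkloadProfile(min_fidelity="High", min_rigor="High", min_integration_ease="Med")
print(candidates(profile))
```

The sketch ignores domain fit (web, OS, code, customer service), scale, and feedback granularity, all of which a real selector would also need to weigh.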
Patterns For An Internal Sim Environment¶
The Emergence WorldSim layer assembles its own internal sim environment using primitives drawn from the systems above:
- A World owns an authoritative state store (entities, relations, locations, inventories) and a tick clock.
- Events are immutable and append-only; every NPC action, agent tool call, and environment perturbation is an event.
- NPCs are stochastic policies (rule-based or small LLMs) seeded from the world's RNG.
- A Scenario is the tuple (initial world snapshot, agent goal / prompt, success oracle, time budget, resource budget).
- The ScoringRubric is multi-axis: outcome correctness (oracle predicate), efficiency (ticks / tokens / dollars), safety (constraint violations), and trajectory quality (sub-goal hits in the AgentBoard style).
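One way these primitives might be laid out in code is sketched below. All class names, field names, and the scoring formulas are illustrative assumptions rather than the WorldSim API; in particular, efficiency is reduced to a token budget and safety to a zero-violations check.

```python
import random
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Event:
    """Immutable record of one NPC action, tool call, or perturbation."""
    tick: int
    actor: str
    kind: str
    payload: Tuple[Tuple[str, Any], ...] = ()


@dataclass
class World:
    seed: int
    state: Dict[str, Any] = field(default_factory=dict)  # authoritative store
    tick: int = 0
    log: List[Event] = field(default_factory=list)       # append-only by convention

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)  # NPC policies draw from this RNG

    def emit(self, actor: str, kind: str, **payload: Any) -> Event:
        event = Event(self.tick, actor, kind, tuple(sorted(payload.items())))
        self.log.append(event)
        return event

    def step(self) -> None:
        self.tick += 1


@dataclass(frozen=True)
class Scenario:
    initial_state: Dict[str, Any]
    goal: str
    oracle: Callable[[Dict[str, Any]], bool]
    time_budget_ticks: int
    resource_budget_tokens: int


@dataclass(frozen=True)
class RubricScore:
    outcome: float
    efficiency: float
    safety: float
    trajectory: float


def score(world: World, scenario: Scenario, tokens_used: int,
          violations: int, subgoals_hit: int, subgoals_total: int) -> RubricScore:
    """Multi-axis score; each axis is normalized to [0, 1]."""
    return RubricScore(
        outcome=1.0 if scenario.oracle(world.state) else 0.0,
        efficiency=max(0.0, 1.0 - tokens_used / scenario.resource_budget_tokens),
        safety=1.0 if violations == 0 else 0.0,
        trajectory=subgoals_hit / subgoals_total if subgoals_total else 0.0,
    )


scenario = Scenario({"door": "closed"}, "Open the door",
                    lambda s: s.get("door") == "open", 10, 1000)
world = World(seed=7, state=dict(scenario.initial_state))
world.emit("agent", "tool_call", tool="open_door")
world.state["door"] = "open"
world.step()
result = score(world, scenario, tokens_used=250, violations=0,
               subgoals_hit=1, subgoals_total=2)
print(result)
```

Keeping the four axes separate, rather than collapsing them into one number at scoring time, leaves the weighting decision to whatever consumes the rubric.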
Every run is identified by a (scenario_id, agent_config_hash, seed) triple. The event log is persisted; replay reconstructs the world without re-querying the LLM by swapping the model for a recorded-response shim. The evolutionary loop treats the Unitt configuration as a genome: it samples a population, runs each genome across N scenarios in parallel, aggregates the scoring rubric into a fitness signal, and mutates via prompt-level edits and pattern swaps. LLM-response caching keyed by (prompt, model, seed) lets cousin genomes share work and is the single largest cost lever.
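The caching and evolutionary loop can be sketched together. This is a toy under stated assumptions: a genome is reduced to a hypothetical (planner prompt, executor prompt) pair, `fake_model` is a deterministic stand-in for a real model call, the fitness function is arbitrary, and scenarios run sequentially rather than in parallel.

```python
import hashlib
import random
from typing import Callable, Dict, List, Tuple

Genome = Tuple[str, str]  # (planner prompt, executor prompt), an assumption


def fake_model(prompt: str, model: str, seed: int) -> str:
    """Deterministic stand-in for an LLM call."""
    return hashlib.sha256(f"{model}|{seed}|{prompt}".encode()).hexdigest()


class CachedLLM:
    """Response cache keyed by (prompt, model, seed)."""

    def __init__(self, call_model: Callable[[str, str, int], str]) -> None:
        self._call = call_model
        self._cache: Dict[Tuple[str, str, int], str] = {}
        self.misses = 0  # number of real model calls made

    def complete(self, prompt: str, model: str, seed: int) -> str:
        key = (prompt, model, seed)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._call(prompt, model, seed)
        return self._cache[key]


def run_id(scenario_id: str, agent_config: str, seed: int) -> Tuple[str, str, int]:
    """Build the (scenario_id, agent_config_hash, seed) triple."""
    config_hash = hashlib.sha256(agent_config.encode()).hexdigest()[:12]
    return (scenario_id, config_hash, seed)


def fitness(genome: Genome, scenarios: List[str], llm: CachedLLM,
            model: str = "stub-model", seed: int = 0) -> float:
    """Toy fitness: mean of a pseudo-score derived from the executor output."""
    scores = []
    for scenario in scenarios:
        plan = llm.complete(f"{genome[0]}\n{scenario}", model, seed)
        act = llm.complete(f"{genome[1]}\n{plan}", model, seed)
        scores.append(int(act[:2], 16) / 255)
    return sum(scores) / len(scores)


def evolve(population: List[Genome], scenarios: List[str], llm: CachedLLM,
           rng: random.Random, executors: List[str], generations: int = 3) -> Genome:
    """Keep the top half each generation; children swap in a new executor."""
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: fitness(g, scenarios, llm), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
        children = [(parent[0], rng.choice(executors)) for parent in survivors]
        population = survivors + children
    return max(population, key=lambda g: fitness(g, scenarios, llm))


llm = CachedLLM(fake_model)
scenarios = ["s1", "s2", "s3"]
g1 = ("plan carefully", "act tersely")
g2 = ("plan carefully", "act verbosely")  # cousin: same planner, new executor
fitness(g1, scenarios, llm)
fitness(g2, scenarios, llm)
print(llm.misses)  # 9 model calls instead of 12: the planner calls were shared
```

The saving comes entirely from cousins issuing byte-identical prompts, so genomes that share sub-prompts benefit while genomes that differ in every prompt do not.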
Picking Heuristic¶
- WebArena + OSWorld for state-fidelity outcome validation across web and OS workloads.
- SWE-bench Verified + τ-bench for hard-pass rigor on coding and customer-service workloads.
- AgentBoard when sub-goal credit assignment matters for trajectory-quality scoring.
- GAIA + BrowseComp as open-web smoke tests across general assistant workloads.
- MiroFish / OASIS social-sim for outcome forecasting and influence-dynamics workloads where the success signal is population-level.
- Custom internal sim for any production failure that needs to be promoted into a deterministic regression scenario.
Practical build order: start with WebArena-style state-predicate scoring on a tiny self-hosted world (3-5 entities, 1 NPC, 1 tool); add deterministic replay; add the multi-axis rubric; then add population-scale or domain-specific environments; only after that does the evolutionary outer loop produce signal worth optimizing.
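The first two steps of that build order fit in a few dozen lines. The sketch below is a hypothetical tiny world (a shelf, a cart, and one shopper NPC, with a single `restock` tool); `live_agent` stands in for an LLM and is deliberately non-deterministic so that the recorded-response shim has something to pin down. None of these names come from WorldSim.

```python
import random
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Agent = Callable[[State], str]


def run_episode(seed: int, agent: Agent, ticks: int = 5) -> Tuple[State, List[dict]]:
    """Run the tiny world for a fixed number of ticks and return state and log."""
    rng = random.Random(seed)  # world RNG; the NPC draws only from this
    state: State = {"shelf": 3, "cart": 0, "restocks": 0}
    log: List[dict] = []
    for tick in range(ticks):
        # NPC: stochastic rule-based shopper.
        if state["shelf"] > 0 and rng.random() < 0.6:
            state["shelf"] -= 1
            state["cart"] += 1
            log.append({"tick": tick, "actor": "npc", "kind": "take"})
        # Agent: one decision per tick, using the single available tool.
        action = agent(state)
        log.append({"tick": tick, "actor": "agent", "kind": action})
        if action == "restock":
            state["shelf"] += 2
            state["restocks"] += 1
    return state, log


def oracle(state: State) -> bool:
    """State predicate: the shelf must not end the episode empty."""
    return state["shelf"] > 0


def live_agent(state: State) -> str:
    """Stand-in for an LLM policy; uses unseeded randomness on purpose."""
    return "restock" if state["shelf"] == 0 or random.random() < 0.2 else "wait"


class Recorder:
    """Wraps an agent and records every response it gives."""

    def __init__(self, agent: Agent) -> None:
        self.agent = agent
        self.responses: List[str] = []

    def __call__(self, state: State) -> str:
        response = self.agent(state)
        self.responses.append(response)
        return response


class ReplayShim:
    """Replaces the model during replay by returning recorded responses in order."""

    def __init__(self, responses: List[str]) -> None:
        self._responses = iter(responses)

    def __call__(self, state: State) -> str:
        return next(self._responses)


recorder = Recorder(live_agent)
live_state, live_log = run_episode(seed=11, agent=recorder)
replay_state, replay_log = run_episode(seed=11, agent=ReplayShim(recorder.responses))
print(replay_state == live_state and replay_log == live_log)  # True
print(oracle(replay_state))
```

Replay is exact because the only two sources of randomness are pinned: the NPC by the world seed and the agent by the recorded responses.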
Cross-References¶
- Emergence › WorldSim: the developer-facing platform layer that consumes these environments.
- Emergence › Subunits: multi-agent compositions that WorldSim stress-tests.
- Fabric › Test: fabric-level validation strategy that consumes WorldSim outputs.