World Simulation

The simulation environments and outcome-validation benchmarks referenced throughout Emergence › WorldSim draw on the 2023-2026 agentic benchmarking and simulation research lineage. This page catalogs the canonical environments, their scoring methodologies, and the selection criteria the platform uses when configuring a WorldSim environment for a given Unitt.

Reference Environments

Generative Agents / Smallville

Generative Agents (Park et al., 2023) placed 25 LLM-driven townspeople in a Sims-like grid with memory streams, reflection, and planning. Scoring is qualitative: human raters judge believability and emergent social behavior (for example, a Valentine's Day party invitation spreading by word of mouth). Excellent model of emergent-behavior fidelity; weak for outcome validation because there is no machine-checkable success criterion.

Voyager

Voyager (Wang et al., 2023) demonstrated a Minecraft agent with a self-curated, code-as-skill library and curriculum. Scored on tech-tree progress (items crafted, biomes visited, distance traveled). Strong demonstration of skill compounding and replayable telemetry; narrow domain limits transfer.

AgentBench / AgentBoard

AgentBench and AgentBoard are multi-environment harnesses spanning OS, DB, web, card games, and household tasks. AgentBoard adds fine-grained progress sub-goals instead of binary success. Good breadth and per-step credit assignment; integration cost is non-trivial because of heterogeneous task schemas.
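
The difference between binary success and sub-goal progress can be sketched as follows. This is a minimal illustration, not AgentBoard's own code: the `progress_rate` and `binary_success` helpers and the sub-goal names are hypothetical, and the sketch assumes progress is reported as the best fraction of sub-goals satisfied at any step of the trajectory.

```python
from typing import Sequence, Set


def progress_rate(subgoals: Sequence[str], achieved_per_step: Sequence[Set[str]]) -> float:
    """Highest fraction of sub-goals satisfied at any step (assumed definition)."""
    if not subgoals:
        raise ValueError("at least one sub-goal is required")
    goal_set = set(subgoals)
    best = 0.0
    for achieved in achieved_per_step:
        best = max(best, len(goal_set & achieved) / len(goal_set))
    return best


def binary_success(subgoals: Sequence[str], achieved_per_step: Sequence[Set[str]]) -> bool:
    """True only if every sub-goal holds in the final step."""
    return bool(achieved_per_step) and set(subgoals) <= achieved_per_step[-1]


# Hypothetical household task: the agent gets halfway and then stalls.
subgoals = ["open_fridge", "take_apple", "heat_apple", "place_apple"]
trajectory = [
    {"open_fridge"},
    {"open_fridge", "take_apple"},
    {"open_fridge", "take_apple"},
]
rate = progress_rate(subgoals, trajectory)
done = binary_success(subgoals, trajectory)
```

A binary scorer reports this run as a plain failure, while the progress rate still distinguishes it from a run that never opened the fridge.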

WebArena / VisualWebArena

WebArena and VisualWebArena are self-hosted reproductions of common website types: a GitLab instance, a Reddit-style forum, an OpenStreetMap-based map, an e-commerce storefront, and similar sites. Scoring is programmatic: URL, DOM, and state predicates check the final world state. Gold standard for outcome fidelity on the web; demands Docker-stack maintenance.

OSWorld

OSWorld (Xie et al., 2024) provides 369 real-OS tasks across Ubuntu / Windows / macOS covering multi-app workflows. Each task ships an executable verification script that introspects filesystem, registry, and UI state. Highest realism among desktop benchmarks; very slow and flaky to run at scale.

ALFWorld / ScienceWorld

ALFWorld bridges the text-based TextWorld engine and the embodied ALFRED housework environment, so that policies pretrained on text can transfer to embodied tasks. ScienceWorld is a text-only K-12 science sandbox with roughly 30 tasks. Both score against deterministic goal predicates; cheap to run, but toy semantics under-represent real failure modes.

SWE-bench / SWE-bench Verified

SWE-bench provides 2,294 real GitHub issues from popular Python repositories; the agent must produce a patch that passes the held-out tests for that issue. SWE-bench Verified is a 500-task subset that human annotators screened for well-specified issues and valid unit tests. Outcome validation is unambiguous (tests pass or do not); coverage is Python-only and patch-shaped.

τ-bench

τ-bench (Sierra, 2024) is a customer-service tool-use evaluation (retail, airline) where the agent must satisfy a user and update backend state correctly. Scores include pass^k, the probability that all k independent trials of the same task succeed, which exposes policy fragility. Best-in-class for stochastic-policy validation; small domain.
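
As a worked example of pass^k, the sketch below uses the standard unbiased estimator: with n trials per task and c successes, the chance that k trials drawn without replacement all succeed is C(c, k) / C(n, k), averaged over tasks. The function name `pass_hat_k` and the sample counts are illustrative assumptions, not taken from the τ-bench codebase.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], n_trials: int, k: int) -> float:
    """Estimate pass^k from per-task success counts over n_trials runs each."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    if any(not 0 <= c <= n_trials for c in successes_per_task):
        raise ValueError("success counts must lie in [0, n_trials]")
    denominator = comb(n_trials, k)
    # comb(c, k) is 0 whenever c < k, so fragile tasks contribute nothing.
    total = sum(comb(c, k) for c in successes_per_task)
    return total / (denominator * len(successes_per_task))


# Three hypothetical tasks, four trials each: always passes, passes 3/4, never passes.
counts = [4, 3, 0]
pass_1 = pass_hat_k(counts, n_trials=4, k=1)  # ordinary average success rate
pass_2 = pass_hat_k(counts, n_trials=4, k=2)
pass_4 = pass_hat_k(counts, n_trials=4, k=4)  # only the always-passing task counts
```

The second task passes three times out of four yet contributes nothing at k = 4, which is the fragility signal a single-trial success rate hides.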

GAIA

GAIA (Mialon et al., 2023) comprises 466 questions requiring web browsing, file I/O, multimodality, and tool use, with single-string ground-truth answers. Scoring is exact-match, tiered by difficulty (L1 / L2 / L3). Cheap to run and designed to be easy for humans but hard for AI; binary scoring loses partial-credit signal.

BrowseComp

BrowseComp (OpenAI, 2025) comprises 1,266 deliberately hard browsing questions with verified answers that require deep, long-horizon search. Scoring is exact-match against a hidden gold answer, and the questions are built to resist memorization. Great for live-web realism; the non-deterministic web causes flaky reruns.

MLE-bench

MLE-bench (OpenAI, 2024) provides 75 Kaggle competitions. The agent must train and submit and is scored on the held-out leaderboard metric, then bucketed against human medal thresholds. Authentic ML-engineering loop; runs are expensive (GPU-hours).

MiroFish / OASIS Social Simulation

MiroFish is an open-source multi-agent prediction engine. It ingests a seed document, uses GraphRAG to extract entities and relationships into a knowledge graph, spawns large populations of LLM-driven personas (reportedly up to one million in demos), and drops them into simulated Twitter-like and Reddit-like social platforms. The output is a structured prediction report tracking sentiment evolution, topic propagation, and influence dynamics. The underlying social-sim runtime is OASIS from CAMEL-AI (23 social-action types, scales to roughly one million agents). Notable forks include nikmcfly/MiroFish-Offline (English fork, local-only via Neo4j + Ollama) and amadad/mirofish-cli. Emergence reproduces the same pipeline inside WorldSim as one available environment mode: a seed document passes through four stages (GraphRAG extraction → persona population → social-sim runtime → emergent-behavior report).

Selection Criteria

The platform selects a WorldSim environment per Unitt by reading the workload profile (scenario fidelity required, scoring rigor required, scale, feedback granularity, integration cost tolerance) and matching against the table below.

| Environment | Scenario Fidelity | Scoring Rigor | Scale (Tasks) | Feedback Granularity | Integration Ease |
| --- | --- | --- | --- | --- | --- |
| Generative Agents | High (social) | Low (human-rated) | 1 world | Trajectory-level | Low (custom engine) |
| Voyager | Med (game) | Med (tech-tree) | Open-ended | Per-skill | Med (MC server) |
| AgentBench | Med | High (programmatic) | 8 envs | Task-level | Med |
| AgentBoard | Med | High (sub-goals) | 9 envs | Per sub-goal | Med |
| WebArena | High (web) | High (state preds) | 812 | Task-level | Low (Docker) |
| OSWorld | Very High (OS) | High (verify scripts) | 369 | Task-level | Low (VMs) |
| ALFWorld / ScienceWorld | Low (toy) | High (predicates) | ~30-100 | Per-step | High |
| SWE-bench Verified | High (code) | Very High (unit tests) | 500 | Test-level | High |
| τ-bench | High (customer service) | Very High (pass^k) | 2 domains | Turn + state | Med |
| GAIA | High (real tasks) | Med (exact match) | 466 | Final-answer only | High |
| BrowseComp | High (live web) | Med (exact match) | 1,266 | Final-answer only | High |
| MLE-bench | Very High (Kaggle) | High (leaderboard) | 75 | Metric value | Low (GPU) |
| MiroFish / OASIS social-sim | High (social population) | Med (emergent metrics) | Scenarios × 10⁶ agents | Population-level signals | Med (GraphRAG + sim runtime) |
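
One way to read the table programmatically is as a threshold filter over ordinal levels. The sketch below is an assumption about how such matching could work, not the platform's actual selector: `EnvProfile`, `candidates`, and the Low < Med < High < Very High ordering are hypothetical, and only a few rows are copied in.

```python
from dataclasses import dataclass
from typing import List

# Assumed ordinal scale for the qualitative ratings in the table.
LEVELS = {"Low": 0, "Med": 1, "High": 2, "Very High": 3}


@dataclass(frozen=True)
class EnvProfile:
    name: str
    fidelity: str
    rigor: str
    integration_ease: str


# A subset of the table rows, with the parenthetical qualifiers dropped.
ENVIRONMENTS = [
    EnvProfile("Generative Agents", "High", "Low", "Low"),
    EnvProfile("WebArena", "High", "High", "Low"),
    EnvProfile("OSWorld", "Very High", "High", "Low"),
    EnvProfile("SWE-bench Verified", "High", "Very High", "High"),
    EnvProfile("τ-bench", "High", "Very High", "Med"),
    EnvProfile("GAIA", "High", "Med", "High"),
]


def candidates(min_fidelity: str, min_rigor: str, min_integration_ease: str) -> List[str]:
    """Return environments that meet every minimum level in the workload profile."""
    def meets(env: EnvProfile) -> bool:
        return (
            LEVELS[env.fidelity] >= LEVELS[min_fidelity]
            and LEVELS[env.rigor] >= LEVELS[min_rigor]
            and LEVELS[env.integration_ease] >= LEVELS[min_integration_ease]
        )
    return [env.name for env in ENVIRONMENTS if meets(env)]


hard_pass = candidates("High", "Very High", "Med")
desktop = candidates("Very High", "High", "Low")
```

The filter ignores workload domain, scale, and feedback granularity, so in practice it would narrow the candidate list rather than make the final choice.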

Patterns For An Internal Sim Environment

The Emergence WorldSim layer assembles its own internal sim environment using primitives drawn from the systems above. A World owns an authoritative state store (entities, relations, locations, inventories) and a tick clock. Events are immutable and append-only: every NPC action, agent tool call, and environment perturbation is recorded as an event. NPCs are stochastic policies (rule-based or small LLMs) seeded from the world's RNG. A Scenario is the tuple (initial world snapshot, agent goal / prompt, success oracle, time budget, resource budget). The ScoringRubric is multi-axis: outcome correctness (oracle predicate), efficiency (ticks / tokens / dollars), safety (constraint violations), and trajectory quality (sub-goal hits in the AgentBoard style).
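
The primitives above can be sketched in a few dozen lines. Everything here is a hypothetical illustration rather than the WorldSim implementation: the class and function names, the key-on-a-shelf world, the single `transfer` action, and the way each rubric axis is normalized are all assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Inventories = Dict[str, Dict[str, int]]


@dataclass(frozen=True)
class Event:
    """One immutable record in the append-only log."""
    tick: int
    actor: str
    kind: str
    detail: Tuple[str, ...]


class World:
    """Authoritative state store plus a tick clock and an event log."""

    def __init__(self, inventories: Inventories, seed: int) -> None:
        self.inventories = {owner: dict(items) for owner, items in inventories.items()}
        self.tick = 0
        self.rng = random.Random(seed)  # the only randomness source NPCs may use
        self._events: List[Event] = []

    @property
    def events(self) -> Tuple[Event, ...]:
        return tuple(self._events)  # read-only copy: history cannot be edited

    def transfer(self, actor: str, src: str, dst: str, item: str) -> bool:
        """Move one item if available; the attempt is logged either way."""
        ok = self.inventories[src].get(item, 0) > 0
        if ok:
            self.inventories[src][item] -= 1
            self.inventories[dst][item] = self.inventories[dst].get(item, 0) + 1
        kind = "transfer" if ok else "transfer_failed"
        self._events.append(Event(self.tick, actor, kind, (src, dst, item)))
        return ok


def wandering_npc(world: World) -> None:
    """Rule-based stochastic NPC: takes a key off the shelf half the time."""
    if world.rng.random() < 0.5:
        world.transfer("npc", "shelf", "npc", "key")


@dataclass(frozen=True)
class Scenario:
    initial: Inventories
    goal: str
    oracle: Callable[[World], bool]  # success predicate over the final state
    time_budget: int                 # ticks
    token_budget: int                # stand-in for the resource budget


def run(scenario: Scenario, seed: int) -> World:
    world = World(scenario.initial, seed)
    for _ in range(scenario.time_budget):
        wandering_npc(world)
        world.tick += 1
    world.transfer("agent", "shelf", "agent", "key")  # the agent's single tool call
    return world


def score(world: World, scenario: Scenario, tokens_used: int,
          violations: int, subgoals_hit: int, subgoals_total: int) -> Dict[str, float]:
    """Multi-axis rubric; each axis is normalized to [0, 1] (assumed convention)."""
    return {
        "outcome": 1.0 if scenario.oracle(world) else 0.0,
        "efficiency": max(0.0, 1.0 - tokens_used / scenario.token_budget),
        "safety": 1.0 if violations == 0 else 0.0,
        "trajectory": subgoals_hit / subgoals_total,
    }


scenario = Scenario(
    initial={"shelf": {"key": 4}, "agent": {}, "npc": {}},
    goal="Hold at least one key",
    oracle=lambda w: w.inventories["agent"].get("key", 0) >= 1,
    time_budget=3,
    token_budget=1000,
)
world = run(scenario, seed=7)
rubric = score(world, scenario, tokens_used=250, violations=0,
               subgoals_hit=1, subgoals_total=1)
```

Because the NPC draws only from the world's seeded RNG, running the same scenario with the same seed yields an identical event log, which is the property deterministic replay relies on.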

Every run is identified by a (scenario_id, agent_config_hash, seed) triple. The event log is persisted; replay reconstructs the world without re-querying the LLM by replacing the model with a recorded-response shim. The evolutionary loop treats the Unitt configuration as a genome, samples a population, runs each across N scenarios in parallel, aggregates the scoring rubric into a fitness signal, and mutates via prompt-level edits and pattern swaps. LLM-response caching keyed by (prompt, model, seed) lets cousin genomes share work, which makes it the single largest cost lever.
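
A minimal sketch of the replay shim and the response cache, under stated assumptions: `fake_llm` stands in for a real model call, the `complete(prompt, seed)` interface is hypothetical, and the toy agent loop simply feeds each response into the next prompt.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple


def config_hash(config: dict) -> str:
    """Stable short hash of an agent configuration (key order does not matter)."""
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12]


class CachingModel:
    """Wraps a model call and caches responses by (prompt, model, seed)."""

    def __init__(self, call: Callable[[str, str, int], str], model: str) -> None:
        self._call = call
        self.model = model
        self.cache: Dict[Tuple[str, str, int], str] = {}
        self.misses = 0

    def complete(self, prompt: str, seed: int) -> str:
        key = (prompt, self.model, seed)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self._call(prompt, self.model, seed)
        return self.cache[key]


class Recorder:
    """Passes calls through and keeps (prompt, response) pairs for the event log."""

    def __init__(self, inner) -> None:
        self.inner = inner
        self.records: List[Tuple[str, str]] = []

    def complete(self, prompt: str, seed: int) -> str:
        response = self.inner.complete(prompt, seed)
        self.records.append((prompt, response))
        return response


class ReplayShim:
    """Stands in for the model during replay: returns recorded responses in order."""

    def __init__(self, records: List[Tuple[str, str]]) -> None:
        self._records = list(records)
        self._cursor = 0

    def complete(self, prompt: str, seed: int) -> str:
        expected_prompt, response = self._records[self._cursor]
        if prompt != expected_prompt:
            raise RuntimeError(f"replay diverged at call {self._cursor}")
        self._cursor += 1
        return response


def run_episode(model, seed: int, steps: int = 3) -> List[str]:
    """Toy agent loop: each response becomes the next observation."""
    log, observation = [], "start"
    for tick in range(steps):
        observation = model.complete(f"tick={tick} obs={observation}", seed)
        log.append(observation)
    return log


llm_calls: List[str] = []


def fake_llm(prompt: str, model: str, seed: int) -> str:
    llm_calls.append(prompt)  # count real model invocations
    return f"{model}|{seed}|{len(prompt)}"


cached = CachingModel(fake_llm, model="toy-model")
live = Recorder(cached)
run_key = ("scenario-001", config_hash({"system_prompt": "v1"}), 11)
live_log = run_episode(live, seed=run_key[2])
replay_log = run_episode(ReplayShim(live.records), seed=run_key[2])
cousin_log = run_episode(cached, seed=run_key[2])  # same prompts: served from cache
```

The replay and the cousin run both reproduce the live log without any further call to `fake_llm`; the shim also fails loudly if the replayed trajectory ever issues a prompt that differs from the recording.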

Picking Heuristic

  • WebArena + OSWorld for state-fidelity outcome validation across web and OS workloads.
  • SWE-bench Verified + τ-bench for hard-pass rigor on coding and customer-service workloads.
  • AgentBoard when sub-goal credit assignment matters for trajectory-quality scoring.
  • GAIA + BrowseComp as open-web smoke tests across general assistant workloads.
  • MiroFish / OASIS social-sim for outcome forecasting and influence-dynamics workloads where the success signal is population-level.
  • Custom internal sim for any production failure that needs to be promoted into a deterministic regression scenario.

Practical build order: start with WebArena-style state-predicate scoring on a tiny self-hosted world (3-5 entities, 1 NPC, 1 tool); add deterministic replay; add the multi-axis rubric; then add population-scale or domain-specific environments; only after that does the evolutionary outer loop produce signal worth optimizing.

Cross-References