World Simulation¶

The simulation environments and outcome-validation benchmarks referenced throughout Emergence › WorldSim draw on the 2023-2026 agentic benchmarking and simulation research lineage. This page catalogs the canonical environments, their scoring methodologies, and the selection criteria the platform uses when configuring a WorldSim environment for a given Unitt.

Reference Environments¶

Generative Agents / Smallville¶

Generative Agents (Park et al., 2023) placed 25 LLM-driven townspeople in a Sims-like grid with memory streams, reflection, and planning. Scoring is qualitative: human raters judge believability and emergent social behavior (for example, news of a Valentine's Day party spreading by word of mouth). Excellent fidelity-of-emergence model; weak for outcome validation because there is no machine-checkable success criterion.

Voyager¶

Voyager (Wang et al., 2023) demonstrated a Minecraft agent with a self-curated, code-as-skill library and curriculum. Scored on tech-tree progress (items crafted, biomes visited, distance traveled). Strong demonstration of skill compounding and replayable telemetry; narrow domain limits transfer.

AgentBench / AgentBoard¶

AgentBench and AgentBoard are multi-environment harnesses spanning OS, DB, web, card games, and household tasks. AgentBoard adds fine-grained progress sub-goals instead of binary success. Good breadth and per-step credit assignment; integration cost is non-trivial because of heterogeneous task schemas.

WebArena / VisualWebArena¶

WebArena and VisualWebArena are self-hosted reproductions of common site types: GitLab, a Reddit-style forum, OpenStreetMap, an e-commerce storefront, and similar sites. Scoring is programmatic: URL, DOM, and state predicates check the final world state. Gold standard for outcome fidelity on the web; demands Docker-stack maintenance.
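
State-predicate scoring of this kind can be sketched in a few lines. The snapshot layout and the helper names (`url_is`, `dom_contains`, `field_equals`, `score`) are hypothetical and are not WebArena's actual evaluator API; the sketch only illustrates the idea of checking the final world state rather than the trajectory.

```python
from typing import Callable, Mapping

# Hypothetical final-state snapshot; a real harness would read the live URL,
# the rendered DOM, and backend records instead of a plain dict.
State = Mapping[str, object]
Predicate = Callable[[State], bool]


def url_is(expected: str) -> Predicate:
    """Predicate: the browser ended on the expected URL."""
    return lambda s: s.get("url") == expected


def dom_contains(text: str) -> Predicate:
    """Predicate: the final page text contains the given substring."""
    return lambda s: text in str(s.get("dom_text", ""))


def field_equals(key: str, value: object) -> Predicate:
    """Predicate: a backend field has the expected value."""
    return lambda s: s.get(key) == value


def score(state: State, predicates: list[Predicate]) -> float:
    """Return 1.0 only if every predicate holds on the final state, else 0.0."""
    return 1.0 if all(p(state) for p in predicates) else 0.0


final_state = {
    "url": "/issues/42",
    "dom_text": "Status: closed",
    "issue_42_state": "closed",
}
checks = [
    url_is("/issues/42"),
    dom_contains("closed"),
    field_equals("issue_42_state", "closed"),
]
print(score(final_state, checks))  # 1.0
```

Because the score depends only on the end state, two agents that reach the same state by different routes receive the same outcome score.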

OSWorld¶

OSWorld (Xie et al., 2024) provides 369 real-OS tasks across Ubuntu / Windows / macOS covering multi-app workflows. Each task ships an executable verification script that introspects filesystem, registry, and UI state. Highest realism among desktop benchmarks; very slow and flaky to run at scale.

ALFWorld / ScienceWorld¶

ALFWorld bridges the text-based TextWorld engine and the embodied ALFRED housework environment, enabling transfer from text pretraining to embodied tasks. ScienceWorld is a text-only K-12 science sandbox with roughly 30 tasks. Both score against deterministic goal predicates; cheap to run, but toy semantics under-represent real failure modes.

SWE-bench / SWE-bench Verified¶

SWE-bench provides 2,294 real GitHub issues from popular Python repositories; the agent must produce a patch that passes the hidden test suite. SWE-bench Verified is a 500-task human-validated subset that screens out underspecified issues and unreliable unit tests. Outcome validation is essentially perfect (tests pass or do not); coverage is Python-only and patch-shaped.

τ-bench¶

τ-bench (Sierra, 2024) is a customer-service tool-use evaluation (retail, airline) where the agent must satisfy a user and update backend state correctly. Scores include pass^k: a task counts as solved only if all k independent trials succeed, which exposes policy fragility. Best-in-class for stochastic-policy validation; small domain.
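
A minimal sketch of estimating pass^k from repeated trials follows. It assumes n independent trials per task and uses the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful trials. This mirrors the familiar pass@k estimator; whether it matches the exact formula in the τ-bench release is an assumption here, and the function name `pass_hat_k` is hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], n: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials drawn from n all succeed,
    averaged over tasks. successes_per_task[i] is the number of successful
    trials (out of n) observed for task i."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not successes_per_task:
        raise ValueError("at least one task is required")
    # comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
    per_task = [comb(c, k) / comb(n, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three tasks, four trials each: one always passes, one passes half the
# time, one never passes.
counts = [4, 2, 0]
for k in (1, 2, 4):
    print(k, round(pass_hat_k(counts, n=4, k=k), 4))
# 1 0.5
# 2 0.3889
# 4 0.3333
```

The drop from 0.5 at k=1 to 0.33 at k=4 comes entirely from the half-reliable task, which is the fragility signal the metric is meant to expose.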

GAIA¶

GAIA (Mialon et al., 2023) comprises 466 questions requiring web browsing, file I/O, multimodality, and tool use, each with a single-string ground-truth answer. Scoring is exact-match, tiered by difficulty (L1 / L2 / L3). Cheap, human-easy / AI-hard; binary scoring loses partial-credit signal.

BrowseComp¶

BrowseComp (OpenAI, 2025) comprises 1,266 hard browsing questions with verified answers that require deep, long-horizon search. Scoring is exact-match against a hidden gold answer, and the questions are designed to resist memorization. Great for live-web realism; non-deterministic web causes flaky reruns.

MLE-bench¶

MLE-bench (OpenAI, 2024) provides 75 Kaggle competitions. The agent must train a model and submit predictions; the submission is scored on the held-out leaderboard metric, then bucketed against human medal thresholds. Authentic ML-engineering loop; runs are expensive (GPU-hours).

MiroFish / OASIS¶

MiroFish is an open-source multi-agent prediction engine. It ingests a seed document, uses GraphRAG to extract entities and relationships into a knowledge graph, spawns large populations of LLM-driven personas (reportedly up to one million in demos), and drops them into simulated Twitter-like and Reddit-like social platforms. The output is a structured prediction report tracking sentiment evolution, topic propagation, and influence dynamics. The underlying social-sim runtime is OASIS from CAMEL-AI (23 social-action types, scales to roughly one million agents). Notable forks include nikmcfly/MiroFish-Offline (English fork, local-only via Neo4j + Ollama) and amadad/mirofish-cli. Emergence clones the same pipeline (seed → GraphRAG → persona population → social-sim runtime → emergent-behavior report) inside WorldSim as one available environment mode.

Selection Criteria¶

The platform selects a WorldSim environment per Unitt by reading the workload profile (scenario fidelity required, scoring rigor required, scale, feedback granularity, integration cost tolerance) and matching against the table below.

Environment	Scenario Fidelity	Scoring Rigor	Scale (Tasks)	Feedback Granularity	Integration Ease
Generative Agents	High (social)	Low (human-rated)	1 world	Trajectory-level	Low (custom engine)
Voyager	Med (game)	Med (tech-tree)	Open-ended	Per-skill	Med (MC server)
AgentBench	Med	High (programmatic)	8 envs	Task-level	Med
AgentBoard	Med	High (sub-goals)	9 envs	Per sub-goal	Med
WebArena	High (web)	High (state preds)	812	Task-level	Low (Docker)
OSWorld	Very High (OS)	High (verify scripts)	369	Task-level	Low (VMs)
ALFWorld / ScienceWorld	Low (toy)	High (predicates)	~30-100	Per-step	High
SWE-bench Verified	High (code)	Very High (unit tests)	500	Test-level	High
τ-bench	High (CS)	Very High (pass^k)	2 domains	Turn + state	Med
GAIA	High (real tasks)	Med (exact match)	466	Final-answer only	High
BrowseComp	High (live web)	Med (exact match)	1,266	Final-answer only	High
MLE-bench	Very High (Kaggle)	High (leaderboard)	75	Metric value	Low (GPU)
MiroFish / OASIS social-sim	High (social population)	Med (emergent metrics)	Scenarios × 10⁶ agents	Population-level signals	Med (GraphRAG + sim runtime)
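
The matching step can be sketched as a filter over the table's ordinal ratings. This is an illustration under assumptions, not the platform's actual selector: the `EnvProfile` type, the `select_environments` function, and the numeric ordering of Low / Med / High / Very High are all hypothetical, and only a few rows of the table are transcribed.

```python
from dataclasses import dataclass

# Assumed ordinal scale for the table's qualitative ratings.
LEVELS = {"Low": 1, "Med": 2, "High": 3, "Very High": 4}


@dataclass(frozen=True)
class EnvProfile:
    name: str
    fidelity: str          # Scenario Fidelity column
    rigor: str             # Scoring Rigor column
    integration_ease: str  # Integration Ease column


# A subset of rows copied from the table above.
CATALOG = [
    EnvProfile("WebArena", "High", "High", "Low"),
    EnvProfile("OSWorld", "Very High", "High", "Low"),
    EnvProfile("SWE-bench Verified", "High", "Very High", "High"),
    EnvProfile("τ-bench", "High", "Very High", "Med"),
    EnvProfile("GAIA", "High", "Med", "High"),
    EnvProfile("ALFWorld / ScienceWorld", "Low", "High", "High"),
]


def select_environments(min_fidelity: str, min_rigor: str, min_ease: str,
                        catalog: list[EnvProfile] = CATALOG) -> list[str]:
    """Return the names of environments meeting every minimum rating."""
    def meets(env: EnvProfile) -> bool:
        return (LEVELS[env.fidelity] >= LEVELS[min_fidelity]
                and LEVELS[env.rigor] >= LEVELS[min_rigor]
                and LEVELS[env.integration_ease] >= LEVELS[min_ease])
    return [env.name for env in catalog if meets(env)]


# A workload that needs hard-pass rigor and cannot absorb heavy integration.
print(select_environments("High", "Very High", "Med"))
# ['SWE-bench Verified', 'τ-bench']
```

A fuller selector would also weigh scale and feedback granularity, which are omitted here because the table expresses them in mixed units.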

Patterns For An Internal Sim Environment¶

The Emergence WorldSim layer assembles its own internal sim environment using primitives drawn from the systems above. A World owns an authoritative state store (entities, relations, locations, inventories) and a tick clock. Events are immutable and append-only: every NPC action, agent tool call, and environment perturbation is an event. NPCs are stochastic policies (rule-based or small LLMs) seeded from the world's RNG. A Scenario is the tuple (initial world snapshot, agent goal / prompt, success oracle, time budget, resource budget). The ScoringRubric is multi-axis: outcome correctness (oracle predicate), efficiency (ticks / tokens / dollars), safety (constraint violations), and trajectory quality (sub-goal hits in the AgentBoard style).
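
These primitives can be sketched as a few small types. Every name below (`Event`, `World`, `Scenario`, `score_run`), the single `move` event kind, and the efficiency formula are illustrative assumptions rather than the WorldSim API.

```python
import random
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Event:
    """Immutable record of one NPC action, tool call, or perturbation."""
    tick: int
    actor: str
    kind: str
    payload: tuple  # sorted (key, value) pairs, kept immutable


@dataclass
class World:
    seed: int
    state: dict = field(default_factory=dict)
    tick: int = 0
    log: list = field(default_factory=list)  # append-only event log

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)  # all NPC randomness flows from here

    def emit(self, actor: str, kind: str, **payload) -> Event:
        event = Event(self.tick, actor, kind, tuple(sorted(payload.items())))
        self.log.append(event)
        if kind == "move":  # the only state transition in this sketch
            self.state.setdefault("locations", {})[actor] = payload["to"]
        return event

    def step(self) -> None:
        """Advance one tick; a rule-based NPC moves to a random place."""
        self.emit("npc_1", "move", to=self.rng.choice(["market", "dock", "square"]))
        self.tick += 1


@dataclass(frozen=True)
class Scenario:
    initial_state: dict
    goal: str
    oracle: Callable[[dict], bool]  # success predicate over world state
    tick_budget: int
    token_budget: int


def score_run(world: World, scenario: Scenario, tokens_used: int,
              violations: int, subgoals_hit: int, subgoals_total: int) -> dict:
    """Multi-axis rubric: outcome, efficiency, safety, trajectory quality."""
    spent = max(world.tick / scenario.tick_budget, tokens_used / scenario.token_budget)
    return {
        "outcome": 1.0 if scenario.oracle(world.state) else 0.0,
        "efficiency": max(0.0, 1.0 - spent),
        "safety": 1.0 if violations == 0 else 0.0,
        "trajectory": subgoals_hit / subgoals_total if subgoals_total else 0.0,
    }


scenario = Scenario(
    initial_state={"locations": {"agent": "square"}},
    goal="Reach the dock",
    oracle=lambda s: s.get("locations", {}).get("agent") == "dock",
    tick_budget=10,
    token_budget=1000,
)
world = World(seed=7, state={"locations": dict(scenario.initial_state["locations"])})
world.emit("agent", "move", to="dock")  # agent tool call recorded as an event
world.step()
print(score_run(world, scenario, tokens_used=200, violations=0,
                subgoals_hit=1, subgoals_total=2))
```

Keeping the rubric as separate axes, rather than collapsing it to one number at this layer, leaves the weighting decision to whatever consumes the scores.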

Every run is identified by a (scenario_id, agent_config_hash, seed) triple. The event log is persisted; replay reconstructs the world without re-querying the LLM by replacing the model with a recorded-response shim. The evolutionary loop treats the Unitt configuration as a genome, samples a population, runs each across N scenarios in parallel, aggregates the scoring rubric into a fitness signal, and mutates via prompt-level edits and pattern swaps. LLM-response caching keyed by (prompt, model, seed) allows cousin genomes to share work, which makes it the single largest cost lever.
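
The recorded-response shim and the (prompt, model, seed) cache can share one store. The sketch below is a minimal illustration under assumptions: `cache_key`, `RecordingModel`, and `ReplayModel` are hypothetical names, the store is an in-memory dict standing in for persisted storage, and the live model is a stub function.

```python
import hashlib
import json
from typing import Callable


def cache_key(prompt: str, model: str, seed: int) -> str:
    """Stable key over (prompt, model, seed), as described above."""
    raw = json.dumps([prompt, model, seed], ensure_ascii=False)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class RecordingModel:
    """Wraps a live model call and records each response in the store."""

    def __init__(self, call: Callable[[str], str], model: str, seed: int, store: dict):
        self.call, self.model, self.seed, self.store = call, model, seed, store
        self.live_calls = 0

    def __call__(self, prompt: str) -> str:
        key = cache_key(prompt, self.model, self.seed)
        if key not in self.store:  # cache miss: pay for one live call
            self.store[key] = self.call(prompt)
            self.live_calls += 1
        return self.store[key]


class ReplayModel:
    """Recorded-response shim: serves stored responses, never calls a model."""

    def __init__(self, model: str, seed: int, store: dict):
        self.model, self.seed, self.store = model, seed, store

    def __call__(self, prompt: str) -> str:
        key = cache_key(prompt, self.model, self.seed)
        if key not in self.store:
            raise KeyError("no recorded response; replay diverged from the log")
        return self.store[key]


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call."""
    return prompt.upper()


store: dict = {}
genome_a = RecordingModel(stub_llm, "stub-model", seed=0, store=store)
genome_b = RecordingModel(stub_llm, "stub-model", seed=0, store=store)
genome_a("plan next move")
genome_b("plan next move")  # cousin genome reuses the cached response
print(genome_a.live_calls, genome_b.live_calls)  # 1 0

replay = ReplayModel("stub-model", seed=0, store=store)
print(replay("plan next move"))  # PLAN NEXT MOVE
```

Raising on a missing key during replay is a deliberate choice in this sketch: a miss means the replayed trajectory no longer matches the recorded one, which is better surfaced than silently patched with a fresh model call.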

Picking Heuristic¶

WebArena + OSWorld for state-fidelity outcome validation across web and OS workloads.
SWE-bench Verified + τ-bench for hard-pass rigor on coding and customer-service workloads.
AgentBoard when sub-goal credit assignment matters for trajectory-quality scoring.
GAIA + BrowseComp as open-web smoke tests across general assistant workloads.
MiroFish / OASIS social-sim for outcome forecasting and influence-dynamics workloads where the success signal is population-level.
Custom internal sim for any production failure that needs to be promoted into a deterministic regression scenario.

Practical build order: start with WebArena-style state-predicate scoring on a tiny self-hosted world (3-5 entities, 1 NPC, 1 tool); add deterministic replay; add the multi-axis rubric; then add population-scale or domain-specific environments; only after that does the evolutionary outer loop produce signal worth optimizing.

Cross-References¶

Emergence › WorldSim: the developer-facing platform layer that consumes these environments.
Emergence › Subunits: multi-agent compositions that WorldSim stress-tests.
Fabric › Test: fabric-level validation strategy that consumes WorldSim outputs.