WorldSim

WorldSim defines the simulation environments and outcome-validation systems that Emergence uses to stress-test, score, mutate, and evolve agentic runtime configurations before they are promoted into production. Where Memory, System, State, and Subunits define how a Unitt is composed, WorldSim defines how the composed Unitt is proven: it is placed inside a deterministic, replayable world, run against a library of validated scenarios, scored against a structured rubric, and then accepted, mutated, or rejected on the basis of measured outcomes. WorldSim is the empirical core of the platform's outcome-driven design principle.

WorldSim is informed by the agentic benchmarking and simulation lineage of 2020-2026, including Generative Agents (Smallville), Voyager skill curricula, AgentBench and AgentBoard multi-env harnesses, WebArena and VisualWebArena state-predicate scoring, OSWorld real-OS task verification, ALFWorld and ScienceWorld, SWE-bench Verified, τ-bench tool-use evaluations, GAIA, BrowseComp, and MLE-bench. The platform additionally clones the social-simulation primitives popularized by MiroFish and the underlying OASIS social-sim runtime to stand up multi-agent prediction environments for outcome forecasting. Selection criteria for environment choice are documented in Research › World Simulation.

Why WorldSim

Outcome-driven systems require outcome-driven validation. A reasoning trace can look correct and produce the wrong end state. A tool call can succeed and still violate policy. A subunit composition can pass on one scenario and silently regress on another. WorldSim addresses this by treating every validated outcome as a fixed reference point that the platform can replay, score, and compare across runtime configurations. The four functions WorldSim performs for every Unitt are validation, regression, exploration, and evolution.

flowchart LR
    UC[Unitt Config] --> SIM[WorldSim Run]
    SIM --> SC[Scenario Library]
    SIM --> SCO[Scoring Rubric]
    SIM --> OUT[Outcome Report]
    OUT --> MUT[Mutate / Accept / Reject]
    MUT --> UC

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class UC,SIM,SC,SCO,OUT,MUT stage

World Model

Every WorldSim run instantiates a deterministic World: an authoritative state store of entities, relations, locations, inventories, NPCs, connectors, and external systems, plus an explicit tick clock that advances either by discrete steps or by "until next scheduled event." Every NPC action, agent tool call, environment perturbation, and external event is recorded as an immutable append-only event in the world log. Determinism is non-negotiable: every run is identified by a (scenario_id, agent_config_hash, seed) triple, and the world log is sufficient to reconstruct the entire run without re-querying the model.

flowchart TD
    W[World State Store] --- E[Entities]
    W --- L[Locations]
    W --- N[NPCs]
    W --- C[Connectors]
    W --- TK[Tick Clock]
    W --- EV[Event Log]

    EV -. immutable .-> RP[Replay Engine]
    TK -. advances .-> EV

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class W,E,L,N,C,TK,EV,RP stage

NPCs are bounded stochastic policies (rule-based, small-model, or full Unitt) seeded from the world's RNG so that they behave identically on replay. Connectors are stubbed against the real connector schema but resolved by the simulation rather than the real backend, allowing deterministic reproduction of network responses and external state changes.
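The world model described above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's implementation: the `World` and `Event` classes, the `npc:clerk` actor, its three moves, and the `log_digest` helper are all hypothetical names chosen for the example. The sketch shows the three properties the text relies on: a tick clock, an append-only event log, and a single seeded RNG that makes NPC behavior repeat exactly on replay.

```python
import hashlib
import json
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable entry in the world log."""
    tick: int
    actor: str
    action: str
    payload: dict


@dataclass
class World:
    # The (scenario_id, agent_config_hash, seed) triple identifies the run.
    scenario_id: str
    agent_config_hash: str
    seed: int
    tick: int = 0
    state: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def __post_init__(self):
        # All in-world randomness draws from this one seeded RNG.
        self.rng = random.Random(self.seed)

    def record(self, actor, action, payload):
        """Append an event; existing entries are never modified."""
        self.log.append(Event(self.tick, actor, action, payload))

    def step(self):
        """Advance the tick clock by one discrete step and let one NPC act."""
        self.tick += 1
        move = self.rng.choice(["wait", "restock", "complain"])  # illustrative moves
        self.state["last_npc_move"] = move
        self.record("npc:clerk", move, {})

    def log_digest(self):
        """Stable hash of the event log, used to compare two runs."""
        rows = [[e.tick, e.actor, e.action, e.payload] for e in self.log]
        return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()


def run(seed, ticks=5):
    world = World("demo-scenario", "cfg-abc123", seed)  # hypothetical identifiers
    for _ in range(ticks):
        world.step()
    return world
```

Two calls to `run` with the same seed produce identical logs, which is the property a replay engine depends on.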

Scenario Library

A Scenario is a tuple of (initial world snapshot, agent objective, success oracle, time budget, resource budget, policy overrides). Scenarios are versioned, namespaced by domain, and tagged by the runtime capabilities they exercise: memory recall, planner depth, subunit coordination, governance enforcement, long-horizon stability, and regression coverage. Every production failure observed in deployment can be promoted into a Scenario by serializing the world state at the moment of failure plus a synthesized oracle ("the agent should eventually reach state X").

flowchart LR
    PFL[Production Failure] -. promote .-> NS[New Scenario]
    NS --> SL[Scenario Library]
    SL --> VS[Validation Suite]
    SL --> RS[Regression Suite]
    SL --> ES[Exploration Suite]
    SL --> SS[Stress Suite]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class PFL,NS,SL,VS,RS,ES,SS stage

The Scenario Library is partitioned into four practical suites. The Validation Suite gates every Unitt promotion. The Regression Suite re-runs every prior production scenario on every change. The Exploration Suite probes adjacent behaviors that no current scenario covers. The Stress Suite drives the runtime to budget, latency, and reliability limits.
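A Scenario and the failure-promotion step might be represented as follows. This is a sketch under stated assumptions: the field names, the `regression/<id>@v1` naming scheme, the default budgets, and the `promote_failure` helper are illustrative and do not come from the platform. The oracle is a plain predicate over the final world state, matching the "agent should eventually reach state X" form.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    scenario_id: str                    # namespaced and versioned (illustrative scheme)
    initial_snapshot: dict              # serialized world state at the start
    objective: str                      # what the agent is asked to achieve
    oracle: Callable[[dict], bool]      # success predicate over the final state
    time_budget_ticks: int
    resource_budget_tokens: int
    policy_overrides: dict = field(default_factory=dict)
    tags: frozenset = frozenset()       # runtime capabilities exercised


def promote_failure(failure_id, snapshot, target_key, target_value):
    """Turn a captured production failure into a regression Scenario."""

    def oracle(final_state):
        # Synthesized oracle: the run succeeds once the target state is reached.
        return final_state.get(target_key) == target_value

    return Scenario(
        scenario_id=f"regression/{failure_id}@v1",
        initial_snapshot=dict(snapshot),
        objective=f"reach {target_key} == {target_value!r}",
        oracle=oracle,
        time_budget_ticks=50,           # assumed default budgets
        resource_budget_tokens=20_000,
        tags=frozenset({"regression coverage"}),
    )


# Hypothetical failure: a support ticket that was never resolved.
scenario = promote_failure(
    "inc-001", {"ticket_status": "open"}, "ticket_status", "resolved"
)
```

The oracle rejects the captured snapshot itself and accepts any final state in which the ticket is resolved.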

Scoring Rubric

Scoring is multi-axis by design. A single pass / fail signal hides too much information; a single-axis quality score hides cost. The recommended WorldSim rubric scores every run on at least four axes (outcome correctness, efficiency, safety, and trajectory quality), each independently weighted by the Unitt's operational profile.

flowchart LR
    R[Run Trajectory] --> O[Oracle Predicate]
    R --> EF[Efficiency Cost]
    R --> SF[Safety Violations]
    R --> TQ[Trajectory Quality]

    O --> SCO[Composite Score]
    EF --> SCO
    SF --> SCO
    TQ --> SCO

    SCO --> ACC[Accept / Mutate / Reject]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class R,O,EF,SF,TQ,SCO,ACC stage

  • Outcome correctness. A deterministic oracle predicate over the final world state, modeled on WebArena state-predicate scoring, OSWorld verification scripts, and SWE-bench unit tests.
  • Efficiency. Tick count, token consumption, dollar cost, wall-clock latency, retry count.
  • Safety. Policy violations, escalation rate, sensitive-data leakage events, budget overruns.
  • Trajectory quality. Sub-goal hits in the style of AgentBoard's fine-grained progress signal, with credit for partial completion even when the run fails overall.
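One way to combine the four axes is a weighted mean, sketched below. The normalization of every axis to [0, 1], the `AxisScores` type, and the example weights are assumptions made for illustration; the source does not specify how the composite is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AxisScores:
    outcome: float      # 1.0 if the oracle predicate holds, else 0.0
    efficiency: float   # normalized to [0, 1]; higher means cheaper or faster
    safety: float       # 1.0 means no violations were observed
    trajectory: float   # fraction of sub-goals hit


def composite(scores, weights):
    """Weighted mean over the axes; weights need not sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(getattr(scores, axis) * w for axis, w in weights.items()) / total


run_scores = AxisScores(outcome=1.0, efficiency=0.5, safety=1.0, trajectory=0.75)

# Hypothetical operational profile that weights outcome and safety most heavily.
profile = {"outcome": 4, "efficiency": 1, "safety": 4, "trajectory": 1}

score = composite(run_scores, profile)
print(round(score, 3))  # 0.925
```

A report built on this would keep `run_scores` alongside `score`, so the per-axis values stay visible to the operator rather than being collapsed into one number.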

Replay And Regression

WorldSim is fully replayable. Every run's event log can be replayed against an updated Unitt configuration by replacing the live model call with a recorded-response shim, so regression testing adds almost no model-inference cost once the original trajectory has been captured. When the agent or memory layer changes, the platform re-scores the historical event logs and surfaces the diff as a regression signal.

flowchart LR
    EL[Event Log] --> RP[Replay Engine]
    UC1[Original Config] -. recorded .-> RP
    UC2[Updated Config] --> RP
    RP --> NEW[New Trajectory]
    NEW --> DF[Outcome Diff]
    DF --> REG[Regression Report]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class EL,RP,UC1,UC2,NEW,DF,REG stage

The replay engine pairs naturally with the durable-state checkpoints described in State. Any production session can be promoted into a regression scenario by attaching the captured checkpoint stream plus a synthesized oracle.
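The recorded-response shim can be illustrated with two small wrappers. This is a sketch, not the platform's replay engine: `RecordingModel`, `ReplayShim`, `run_agent`, and the uppercase-echo `fake_live_model` are hypothetical stand-ins, and keying recordings by a hash of the prompt is an assumption.

```python
import hashlib


def _key(prompt):
    """Hash the prompt so recordings can be looked up on replay."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


class RecordingModel:
    """Wraps a live model callable and stores every response by prompt hash."""

    def __init__(self, live_call):
        self.live_call = live_call
        self.recorded = {}

    def __call__(self, prompt):
        response = self.live_call(prompt)
        self.recorded[_key(prompt)] = response
        return response


class ReplayShim:
    """Serves recorded responses and never calls the live model."""

    def __init__(self, recorded):
        self.recorded = dict(recorded)

    def __call__(self, prompt):
        key = _key(prompt)
        if key not in self.recorded:
            # The replayed run asked something the original run never asked.
            raise KeyError("no recorded response for this prompt")
        return self.recorded[key]


def fake_live_model(prompt):
    # Stand-in for a real model call; it only upper-cases the prompt.
    return prompt.upper()


def run_agent(model, steps):
    """Toy agent loop: one model call per step."""
    return [model(step) for step in steps]


steps = ["open ticket", "issue refund"]
recorder = RecordingModel(fake_live_model)
original = run_agent(recorder, steps)
replayed = run_agent(ReplayShim(recorder.recorded), steps)
```

A `KeyError` during replay indicates that the updated configuration diverged from the recorded trajectory. Whether that case falls back to a live call or is reported as a regression is a design decision the text above leaves open.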

Multi-Agent Sim Cloned From MiroFish

For Unitts whose objective is to forecast or influence multi-agent dynamics (market sentiment, support-volume forecasting, social diffusion, policy reaction modeling), WorldSim clones the population-scale social-simulation primitives popularized by MiroFish and built on the OASIS social-sim runtime. The pattern is a five-stage pipeline that the platform stands up for any Unitt that declares a social-sim validation target.

flowchart LR
    SD[Seed Document or Topic] --> KG[GraphRAG: Entity + Relation Extraction]
    KG --> POP[Persona Population]
    POP --> SIM[Social Simulation Runtime]
    SIM --> EV[Emergent-Behavior Report]
    EV --> SC[Scoring Rubric]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class SD,KG,POP,SIM,EV,SC stage

  • Seed. A document, topic, ticket, financial report, news story, or policy brief becomes the simulation seed.
  • GraphRAG. Entities and relationships are extracted into a knowledge graph that anchors the simulation.
  • Persona population. Up to hundreds of thousands of LLM-driven personas are generated, each with a personality, long-term memory, and social history, seeded from the world RNG.
  • Social runtime. Personas post, reply, follow, argue, and shift opinions inside simulated social platforms; the runtime supports the same action types popularized by OASIS.
  • Emergent-behavior report. Sentiment evolution, topic propagation, influence dynamics, and outcome forecasts are emitted as structured signals to the scoring rubric.

This cloned MiroFish-style environment is a single mode of WorldSim; Unitts that operate over web, OS, code, or workflow domains use the corresponding domain-specific environments (WebArena, OSWorld, SWE-bench-style, τ-bench-style) instead. Pattern selection is documented in Research › World Simulation.
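The persona-population and social-runtime stages can be illustrated with a toy model. The sketch below replaces LLM-driven personas with a simple rule-based opinion update, so it shows only the seeded, replayable shape of the loop and the sentiment signal it emits. It does not use or imitate the MiroFish or OASIS APIs, and every name in it is hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Persona:
    name: str
    opinion: float    # in [-1, 1]
    openness: float   # fraction of the gap closed per interaction


def make_population(size, rng):
    """Generate personas from the world RNG so the population is reproducible."""
    return [
        Persona(f"p{i}", rng.uniform(-1, 1), rng.uniform(0.05, 0.3))
        for i in range(size)
    ]


def simulate(seed, size=50, ticks=20):
    """Return mean sentiment per tick for a seeded population."""
    rng = random.Random(seed)
    population = make_population(size, rng)
    sentiment = []
    for _ in range(ticks):
        for listener in population:
            speaker = rng.choice(population)
            # The listener moves part of the way toward the speaker's opinion.
            listener.opinion += listener.openness * (speaker.opinion - listener.opinion)
        sentiment.append(sum(p.opinion for p in population) / size)
    return sentiment


trace = simulate(seed=3)
```

Because each update is a convex combination of two opinions, every value stays within [-1, 1], and the same seed always yields the same sentiment trace.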

Evolutionary Loop

WorldSim is the substrate behind the platform's evolutionary optimization loop. The Unitt configuration (system prompt, tool allowlist, memory hyperparameters, planner depth, reflection cadence, subunit composition) is treated as a genome. Each generation samples a population, runs every variant across the validation, regression, and stress suites in parallel, aggregates the scoring rubric into a fitness signal, and selects survivors. Mutation operates on prompt fragments, tool sets, pattern selections, and memory configuration; crossover recombines sub-prompts and pattern compositions across the population.

flowchart LR
    POP[Population: Unitt Variants] --> RUN[Parallel WorldSim Runs]
    RUN --> SCO[Aggregate Fitness]
    SCO --> SEL[Selection]
    SEL --> MUT[Mutation + Crossover]
    MUT --> POP
    SEL -. surviving champions .-> PROD[Promotion to Production]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class POP,RUN,SCO,SEL,MUT,PROD stage

LLM-response caching keyed by (prompt, model, seed) allows cousin genomes to share work, which is the single largest cost lever in the evolutionary loop. Multi-objective optimization over outcome correctness, cost, safety, and latency is supported through Pareto selection, so the platform does not collapse to a single fitness axis.
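Both mechanisms are small enough to sketch. The helpers below are illustrative only: `cache_key`, `dominates`, and `pareto_front` are hypothetical names, the separator byte in the cache key is an arbitrary choice, and the score tuples assume that every axis has already been oriented so that higher is better (cost and latency inverted).

```python
import hashlib


def cache_key(prompt, model, seed):
    """Deterministic key for the (prompt, model, seed) triple."""
    raw = "\x1f".join([prompt, model, str(seed)])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def dominates(a, b):
    """a dominates b if it is no worse on every axis and better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, score in candidates.items()
        if not any(dominates(other, score) for other in candidates.values())
    )


# Axes: (outcome correctness, cost, safety, latency), all "higher is better".
variants = {
    "A": (0.90, 0.40, 1.00, 0.50),
    "B": (0.80, 0.90, 1.00, 0.60),
    "C": (0.70, 0.30, 0.90, 0.40),  # worse than A on every axis
    "D": (0.90, 0.40, 1.00, 0.50),  # ties with A, so neither dominates
}
front = pareto_front(variants)
print(front)  # ['A', 'B', 'D']
```

Variant C is dropped because A beats it on every axis, while A and B both survive because each wins on an axis the other loses.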

Practical Build Order

WorldSim is intentionally layered so a Unitt can opt into the level of validation rigor it needs. The recommended build order matches the path successful teams take in production:

  1. Single-scenario validation with deterministic state-predicate scoring on a small self-hosted world.
  2. Replay-based regression once the event log captures real production trajectories.
  3. Multi-axis scoring once the operational profile of the Unitt is stable enough to weight outcome, cost, safety, and trajectory.
  4. Population-scale or domain-specific environments (MiroFish-style social-sim, WebArena clones, code-test-suite scenarios) when the Unitt's objective demands them.
  5. Evolutionary optimization as the outer loop, only after the inner validation and scoring loops are themselves trustworthy.

Validation Surfaces

WorldSim validation surfaces map cleanly onto Emergence components. Every Emergence layer is independently testable inside WorldSim, and the combined runtime is also testable end-to-end.

| Emergence Layer | WorldSim Validation Surface |
| --- | --- |
| Memory | Recall accuracy, recall latency, eviction correctness, multi-session continuity. |
| System | Pattern outcome, deliberation depth, replan cadence, tool-use fidelity. |
| State | Context-cache hit rate, compaction fidelity, lost-in-the-middle resistance. |
| Subunits | Coordination overhead, brief / summary fidelity, failure mode coverage. |
| Composed Unitt | End-to-end outcome correctness, cost, safety, latency. |

Governance Of Simulation

WorldSim itself is governed. Scenarios may contain sensitive material; simulation runs may consume substantial compute; evolutionary loops may produce candidate configurations that violate policy. Every WorldSim run is recorded in the audit trail with full configuration, scenario list, scoring rubric, and outcome diff. Configurations promoted from WorldSim into production pass an explicit governance gate that re-verifies policy compliance against the production environment.

WorldSim Governance Requirements

  • Every run records scenario IDs, configuration hash, model versions, seed, and rubric.
  • Sensitive scenarios are policy-scoped so they cannot run against live connectors.
  • Evolutionary candidates pass an explicit governance gate before any production promotion.
  • Replay outputs are deterministic and reproducible from the event log alone.
  • Multi-axis scores are surfaced to the operator; single-axis collapse is not permitted.
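Two of these requirements lend themselves to simple mechanical checks, sketched below. The record field names, the `sensitive` tag, and both helper functions are assumptions made for the example; the actual audit-trail schema is not specified here.

```python
# Fields every run record is expected to carry (names are illustrative).
REQUIRED_FIELDS = ("scenario_ids", "config_hash", "model_versions", "seed", "rubric")


def audit_gaps(run_record):
    """Return the required fields that are missing or empty in a run record."""
    return [f for f in REQUIRED_FIELDS if run_record.get(f) in (None, "", [], {})]


def may_use_live_connectors(scenario_tags):
    """Sensitive scenarios are restricted to stubbed connectors."""
    return "sensitive" not in scenario_tags


record = {
    "scenario_ids": ["regression/inc-001@v1"],
    "config_hash": "cfg-abc123",
    "model_versions": {"planner": "model-x"},  # hypothetical model name
    "seed": 0,                                 # a seed of 0 is valid, not a gap
    "rubric": {"outcome": 4, "efficiency": 1, "safety": 4, "trajectory": 1},
}
gaps = audit_gaps(record)
```

A non-empty result from `audit_gaps` would be grounds for refusing to write the run to the audit trail.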

Cross-References

  • Memory supplies the durable substrate whose behavior WorldSim validates.
  • System supplies the runtime pattern whose outcomes WorldSim scores.
  • State supplies the durable checkpoints that drive replay-based regression.
  • Subunits supplies the composed multi-agent systems WorldSim stress-tests.
  • Fabric › Test supplies the fabric-level validation strategy that consumes WorldSim outputs.
  • Research › World Simulation documents the citations, selection criteria, and reference environments behind every WorldSim mode.