Runtime Systems

The runtime execution patterns referenced throughout Emergence › System draw on the active agentic-reasoning research lineage from 2022 through 2026. This page catalogs the canonical patterns, their loop mechanics, key innovations, ideal workloads, and selection criteria the platform uses when configuring the active runtime pattern for a given Unitt.

Reference Patterns

ReAct

ReAct (Yao et al., ICLR 2023) emits an interleaved Thought → Action → Observation trace from a single LLM. Each Action calls a tool, the environment returns an Observation, and the next Thought conditions on the accumulated prefix; the loop terminates on a Finish action. Innovation: unifies reasoning traces with grounded action; reasoning steers tool use, observations correct hallucination. Best for: short-to-medium-horizon tool use (QA, web shopping, ALFWorld-style navigation) where each step needs fresh evidence. Failure modes: context bloat from accumulating observations, thrashing on tasks needing global planning, early "first-step lock-in." Source code: ysymyth/ReAct.
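
The loop above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `policy` callable stands in for the LLM, and `scripted_policy`, `calc`, and the `Action: name[arg]` trace format are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical policy signature: given the trace so far, return (thought, action, argument).
Policy = Callable[[str], Tuple[str, str, str]]
Tool = Callable[[str], str]


def react_loop(question: str, policy: Policy, tools: Dict[str, Tool],
               max_steps: int = 8) -> Tuple[str, List[str]]:
    """Run Thought -> Action -> Observation until a Finish action or the step budget."""
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        # The next thought conditions on the whole accumulated prefix.
        thought, action, arg = policy("\n".join(trace))
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}[{arg}]")
        if action == "Finish":
            return arg, trace
        tool = tools.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        trace.append(f"Observation: {observation}")
    return "", trace  # step budget exhausted without a Finish action


def scripted_policy(prefix: str) -> Tuple[str, str, str]:
    # Stand-in for an LLM call: finish once the needed observation is in the prefix.
    if "Observation: 4" in prefix:
        return ("I have the result.", "Finish", "4")
    return ("I need to compute 2+2.", "calc", "2+2")


def calc(expr: str) -> str:
    # Toy tool: adds two integers written as "a+b".
    left, right = expr.split("+")
    return str(int(left) + int(right))


answer, trace = react_loop("What is 2+2?", scripted_policy, {"calc": calc})
```

Because every observation is appended to `trace` and re-sent on the next step, the context-bloat failure mode is visible directly in the loop body.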

Reflexion

Reflexion (Shinn et al., NeurIPS 2023) wraps an Actor (typically ReAct) inside a trial / evaluate / reflect cycle. An Evaluator scores the trajectory; a Self-Reflection module writes a natural-language post-mortem into persistent episodic memory; the next trial prepends that memory. Innovation: verbal reinforcement learning; policy improvement via written reflections rather than gradient updates. Best for: tasks with a verifiable reward signal across multiple attempts (HumanEval-style coding, tool-use benchmarks, sandboxed games). Failure modes: requires an Evaluator; reflections can ossify into superstition; high per-task cost. Source code: noahshinn/reflexion.
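
A minimal sketch of the trial / evaluate / reflect cycle follows. The three callables stand in for LLM-backed modules; `toy_actor`, `toy_evaluator`, and `toy_reflect` are hypothetical stubs, and the real system stores richer reflections than the one-line strings used here.

```python
from typing import Callable, List, Tuple

Actor = Callable[[List[str]], str]              # memory -> attempt
Evaluator = Callable[[str], Tuple[bool, str]]   # attempt -> (passed, feedback)
Reflector = Callable[[str, str], str]           # (attempt, feedback) -> reflection


def reflexion(actor: Actor, evaluator: Evaluator, reflect: Reflector,
              max_trials: int = 3) -> Tuple[str, List[str], int]:
    """Retry the actor, prepending written reflections from earlier failed trials."""
    memory: List[str] = []  # episodic memory that persists across trials
    attempt = ""
    for trial in range(1, max_trials + 1):
        attempt = actor(memory)
        passed, feedback = evaluator(attempt)
        if passed:
            return attempt, memory, trial
        memory.append(reflect(attempt, feedback))
    return attempt, memory, max_trials


def toy_actor(memory: List[str]) -> str:
    # Stand-in for the Actor: it only gets the task right once a reflection exists.
    return "return a + b" if memory else "return a - b"


def toy_evaluator(attempt: str) -> Tuple[bool, str]:
    # Stand-in for a unit-test Evaluator.
    if attempt == "return a + b":
        return True, ""
    return False, "add(2, 3) returned -1, expected 5"


def toy_reflect(attempt: str, feedback: str) -> str:
    return f"Attempt {attempt!r} failed ({feedback}); use addition, not subtraction."


solution, memory, trials = reflexion(toy_actor, toy_evaluator, toy_reflect)
```

No weights change between trials; the only thing that improves the second attempt is the text appended to `memory`.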

ReWOO

ReWOO (Xu et al., 2023) decouples planning from observation. A Planner emits the complete dependency graph of tool calls upfront using variable placeholders (#E1, #E2…); Workers execute the tool calls (parallel where dependencies allow); a Solver receives the original question plus the resolved evidence table and writes the final answer. Innovation: roughly 5× token reduction versus ReAct because intermediate tool outputs do not re-enter the planner's context. Best for: predictable multi-hop retrieval / QA where the plan is inferable from the question alone (HotpotQA, fixed-schema research). Failure modes: cannot adapt to surprising observations; brittle when tool outputs invalidate later plan steps.
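
The placeholder mechanism can be illustrated with a small sketch. It assumes the Planner has already emitted steps in dependency order and runs them sequentially; parallel dispatch is omitted for brevity. The `Search` tool, the `FACTS` table, and the plan contents are invented toy data.

```python
import re
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (evidence id, tool name, argument template)


def run_plan(plan: List[Step], tools: Dict[str, Callable[[str], str]]) -> Dict[str, str]:
    """Worker stage: resolve #E placeholders from earlier evidence, then call the tool."""
    evidence: Dict[str, str] = {}
    for var, tool_name, template in plan:
        # Substitute any previously resolved #E<n> reference into the argument.
        arg = re.sub(r"#E\d+", lambda m: evidence[m.group(0)], template)
        evidence[var] = tools[tool_name](arg)
    return evidence


def solver_prompt(question: str, evidence: Dict[str, str]) -> str:
    """Solver stage input: the question plus the evidence table, with no planner trace."""
    rows = "\n".join(f"{var} = {value}" for var, value in evidence.items())
    return f"Question: {question}\nEvidence:\n{rows}\nAnswer:"


# Toy knowledge base standing in for a real search tool (fictional data).
FACTS = {
    "capital of Freedonia": "Fredville",
    "mayor of Fredville": "R. Firefly",
}
tools = {"Search": lambda query: FACTS.get(query, "no result")}

# The plan is fixed before any tool runs; #E2 depends on #E1.
plan = [
    ("#E1", "Search", "capital of Freedonia"),
    ("#E2", "Search", "mayor of #E1"),
]
evidence = run_plan(plan, tools)
prompt = solver_prompt("Who is the mayor of Freedonia's capital?", evidence)
```

The planner never sees `evidence`, which is where the token saving comes from and also why a surprising tool output cannot change later steps.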

Plan-and-Execute / Plan-and-Solve

Plan-and-Solve (Wang et al., 2023) and the LangChain Planning Agents pattern separate strategic planning from tactical execution. A heavyweight Planner writes an explicit ordered plan; an Executor (often a smaller ReAct sub-agent) runs one step at a time; a Replanner sees the updated past_steps after each step and either revises the remaining plan or emits the final answer. Innovation: separates expensive planning from cheap execution while supporting mid-flight replanning (unlike ReWOO). Best for: multi-step workflows where the plan structure is non-trivial and the environment is only partially observable, so the plan must be revised as step results arrive. Failure modes: planner over- or under-specification; replanner churn when step results are ambiguous; coordination overhead on short tasks.
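
The planner / executor / replanner split can be sketched as below. The convention that the replanner returns a string for a final answer and a list for a revised plan is an assumption of this example, as are the toy step names and stub callables; `past_steps` follows the name used above.

```python
from typing import Callable, List, Optional, Tuple, Union

PastSteps = List[Tuple[str, str]]  # (step, result) pairs
Replan = Callable[[str, List[str], PastSteps], Union[str, List[str]]]


def plan_and_execute(goal: str, planner: Callable[[str], List[str]],
                     executor: Callable[[str], str], replanner: Replan,
                     max_steps: int = 10) -> Tuple[Optional[str], PastSteps]:
    plan = list(planner(goal))
    past_steps: PastSteps = []
    for _ in range(max_steps):
        if not plan:
            break
        step = plan.pop(0)
        past_steps.append((step, executor(step)))  # cheap executor runs one step
        decision = replanner(goal, plan, past_steps)
        if isinstance(decision, str):              # a string means "final answer"
            return decision, past_steps
        plan = list(decision)                      # otherwise: revised remaining plan
    return None, past_steps


def toy_planner(goal: str) -> List[str]:
    # Deliberately over-specified: the last step repeats the first.
    return ["get price", "get quantity", "get price"]


def toy_executor(step: str) -> str:
    return {"get price": "3", "get quantity": "4"}[step]


def toy_replanner(goal: str, remaining: List[str], past_steps: PastSteps):
    done = dict(past_steps)
    remaining = [s for s in remaining if s not in done]  # drop redundant steps
    if not remaining:
        return str(int(done["get price"]) * int(done["get quantity"]))
    return remaining


answer, past_steps = plan_and_execute("total cost", toy_planner, toy_executor, toy_replanner)
```

Here the replanner prunes the redundant third step after the first result arrives, so only two executor calls are made.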

Tree-of-Thoughts

Tree-of-Thoughts (Yao et al., NeurIPS 2023) lifts inference from autoregressive generation to deliberate search. At each frontier state, a Thought Generator proposes k candidate next-thoughts; a State Evaluator scores each (value or vote); a search algorithm (BFS / DFS with pruning) expands the most promising frontier and optionally backtracks. Innovation: enables lookahead and backtracking over discrete reasoning states. Best for: combinatorial reasoning with checkable intermediate states (Game of 24, crosswords, constrained generation). Failure modes: token cost explodes with branching factor × depth; evaluator quality is the bottleneck. Source code: princeton-nlp/tree-of-thought-llm.
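
A breadth-first variant with pruning might look like the sketch below. The toy task (pick three numbers from `CHOICES` that sum to `TARGET`), and the `propose` and `score` functions standing in for the LLM-backed Thought Generator and State Evaluator, are assumptions for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]
TARGET = 11
CHOICES = (1, 3, 5)


def propose(state: State) -> List[State]:
    # Thought Generator stand-in: extend the partial solution by one choice.
    return [state + (choice,) for choice in CHOICES]


def score(state: State) -> float:
    # State Evaluator stand-in: closer to the target sum is better.
    return -abs(TARGET - sum(state))


def tot_bfs(root: State, propose: Callable[[State], List[State]],
            score: Callable[[State], float], depth: int, beam_width: int) -> State:
    """Expand every frontier state, score all children, keep the best beam_width."""
    frontier = [root]
    for _ in range(depth):
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]  # prune the rest of the level
    return max(frontier, key=score)


best = tot_bfs((), propose, score, depth=3, beam_width=2)
```

Each level costs `beam_width × len(CHOICES)` evaluator calls, which is the branching-factor × depth cost noted above; with an LLM evaluator every one of those calls is a model invocation.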

Graph-of-Thoughts

Graph-of-Thoughts (Besta et al., AAAI 2024) generalizes ToT: thoughts are nodes in an arbitrary directed graph rather than a tree. Beyond expansion, GoT supports aggregation (merge multiple thoughts into one) and refinement (self-loop edges); a Controller schedules generate / aggregate / refine / score operations per a user-defined Graph-of-Operations. Innovation: non-tree topology, useful when subproblems share substructure or when synthesizing partial solutions yields a stronger whole. Best for: sorting / merging, set-intersection reasoning, document synthesis from chunked drafts, multi-source summarization. Failure modes: operator-graph design burden falls on the developer; harder to debug than ToT. Source code: spcl/graph-of-thoughts.
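
The four operations can be illustrated on the sorting use case. This sketch does not use the spcl/graph-of-thoughts API; the function names and the fixed generate → refine → aggregate → score schedule are assumptions, and plain Python sorting stands in for the LLM at each node.

```python
import heapq
from typing import List

Thought = List[int]


def generate(thought: Thought, parts: int) -> List[Thought]:
    """Split one thought into up to `parts` sub-thoughts (fan-out edges)."""
    size = -(-len(thought) // parts)  # ceiling division
    return [thought[i:i + size] for i in range(0, len(thought), size)]


def refine(thought: Thought) -> Thought:
    """Improve a thought in place (self-loop edge); sorted() stands in for the LLM."""
    return sorted(thought)


def aggregate(thoughts: List[Thought]) -> Thought:
    """Merge several thoughts into one (fan-in edges, which a tree cannot express)."""
    return list(heapq.merge(*thoughts))


def score(thought: Thought) -> int:
    """Count adjacent out-of-order pairs; 0 means the thought is fully sorted."""
    return sum(1 for a, b in zip(thought, thought[1:]) if a > b)


data = [5, 2, 9, 1, 7, 3]
chunks = generate(data, parts=3)            # three independent sub-problems
refined = [refine(chunk) for chunk in chunks]
merged = aggregate(refined)                 # partial solutions synthesized into one
```

The `aggregate` step has several parents and one child, which is the topology that distinguishes this pattern from Tree-of-Thoughts.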

LATS

LATS (Zhou et al., ICML 2024) runs MCTS over agent trajectories. Each iteration selects a leaf via UCB, expands by sampling ReAct-style actions, simulates and evaluates with a language-model value function (plus optional environment rollout), backpropagates value up the tree, and on failure reflects to update a verbal critique that biases future selection. Innovation: unifies reasoning, acting, planning, and reflection inside a principled search. Best for: high-stakes tasks worth heavy compute (code generation, web shopping, decisional agents). Failure modes: very expensive; needs a reversible environment or simulator for true rollouts; value-LM miscalibration wastes exploration. Project page: LanguageAgentTreeSearch.
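
The selection and backpropagation steps are standard MCTS and can be sketched independently of any model. The `Node` fields, the exploration constant, and the hard-coded rewards below are assumptions for the example; in the full pattern the reward would come from the language-model value function or the environment.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    action: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0


def add_child(parent: Node, action: str) -> Node:
    child = Node(action=action, parent=parent)
    parent.children.append(child)
    return child


def uct(node: Node, c: float = 1.4) -> float:
    """Upper-confidence score: mean value plus an exploration bonus."""
    if node.visits == 0:
        return math.inf  # always try unvisited children first
    mean = node.value_sum / node.visits
    return mean + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def select(root: Node) -> Node:
    """Descend from the root, following the highest-scoring child, until a leaf."""
    node = root
    while node.children:
        node = max(node.children, key=uct)
    return node


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Add the reward to every node on the path back to the root."""
    while node is not None:
        node.visits += 1
        node.value_sum += reward
        node = node.parent


root = Node(action="root")
weak = add_child(root, "search[shoes]")
strong = add_child(root, "search[running shoes size 10]")
backpropagate(weak, 0.2)    # hypothetical value estimates
backpropagate(strong, 0.9)
chosen = select(root)
```

Expansion, simulation, and the reflection step are omitted; they are the parts that call the model and dominate the cost.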

Self-Refine

Self-Refine (Madaan et al., NeurIPS 2023) uses the same LLM to generate an initial output, produce structured feedback on that output, and then refine the output using the feedback, iterating until a stop condition is met. Innovation: single-model iterative refinement with no extra training, no tools, no external evaluator; ~20% absolute task gain on average. Best for: constrained generation (style, math, code review, dialogue) where the model is a competent critic of its own output. Failure modes: critic blindness; convergence to a local optimum; sycophantic feedback. Project site: selfrefine.info.
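
The generate / feedback / refine loop is small enough to sketch directly. In the pattern all three roles are prompts to the same model; here they are hypothetical stub functions, and the convention that `feedback` returns `None` to signal "stop" is an assumption of the example.

```python
from typing import Callable, List, Optional, Tuple


def self_refine(task: str,
                generate: Callable[[str], str],
                feedback: Callable[[str, str], Optional[str]],
                refine: Callable[[str, str, str], str],
                max_iters: int = 4) -> Tuple[str, List[str]]:
    """Iterate feedback -> refine until the critic has nothing left to say."""
    output = generate(task)
    history = [output]
    for _ in range(max_iters):
        critique = feedback(task, output)
        if critique is None:  # stop condition: no remaining issues
            break
        output = refine(task, output, critique)
        history.append(output)
    return output, history


def toy_generate(task: str) -> str:
    return "Hello there, dear friend of mine"


def toy_feedback(task: str, output: str) -> Optional[str]:
    # Stand-in critic for the task "greeting of at most 20 characters ending in '!'".
    if len(output) > 20:
        return "too long"
    if not output.endswith("!"):
        return "missing exclamation mark"
    return None


def toy_refine(task: str, output: str, critique: str) -> str:
    if critique == "too long":
        return "Hello friend"
    return output + "!"


final, history = self_refine("Write a short greeting.", toy_generate, toy_feedback, toy_refine)
```

If `toy_feedback` returned `None` on the first draft regardless of its flaws, the loop would exit immediately, which is the critic-blindness failure mode in miniature.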

CodeAct

CodeAct (Wang et al., ICML 2024) is ReAct-shaped, but the Action channel is Python source executed in a stateful interpreter rather than a JSON tool call. The Observation is stdout / stderr / return values. Variables persist across turns, enabling composition, loops, conditionals, and library calls as first-class agent actions. Innovation: code is a strictly more expressive action space than JSON; fewer turns, native control flow, self-debugging via exception traces; up to ~30% fewer action steps than JSON tool calls in the authors' benchmarks. Best for: data analysis, scientific workflows, anything with a Python ecosystem; the default for Manus, Open Interpreter, OpenDevin. Failure modes: sandbox-escape risk, long-running code, state pollution. Source code: xingyaoww/code-act.
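
A stateful action channel can be sketched with the standard library alone. The `Interpreter` class is a hypothetical name for this example, and plain `exec` provides no isolation at all, so a real deployment would need an actual sandbox.

```python
import contextlib
import io
from typing import Any, Dict


class Interpreter:
    """Executes agent-written Python; the namespace persists across turns."""

    def __init__(self) -> None:
        self.namespace: Dict[str, Any] = {}

    def run(self, source: str) -> str:
        """Return the Observation: captured stdout, plus the error line if any."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                # WARNING: exec is NOT a sandbox; shown for illustration only.
                exec(source, self.namespace)
        except Exception as exc:
            # The exception text becomes an observation the agent can debug from.
            buffer.write(f"{type(exc).__name__}: {exc}\n")
        return buffer.getvalue()


interpreter = Interpreter()
obs1 = interpreter.run("xs = [1, 2, 3]\nprint(sum(xs))")   # defines xs
obs2 = interpreter.run("print(max(xs) * 10)")              # reuses xs from turn 1
obs3 = interpreter.run("print(undefined_name)")            # error surfaces as text
```

The second turn works only because `xs` survived the first, which is the same mechanism behind the state-pollution failure mode: stale variables also survive.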

Voyager

Voyager (Wang et al., 2023) composes an Automatic Curriculum (proposes the next task to maximize novelty / exploration), Iterative Prompting (generates code attempts and refines until passing), and a Skill Library (stores verified code keyed by embedding for future retrieval) into a lifelong-learning runtime. Innovation: open-ended learning without weight updates; capabilities compound as the skill library grows. Best for: open-ended embodied or sandbox environments (Minecraft, web-of-the-world agents, robot skill bootstrapping). Failure modes: curriculum can wander; bad skills poison retrieval; needs a verifier signal; domain-specific scaffolding. Project: voyager.minedojo.org.
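
The Skill Library component can be sketched as a verified-only store with similarity retrieval. A bag-of-words vector with cosine similarity stands in for a learned embedding model; the class name, method names, and skill descriptions are assumptions for the example.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    # Toy embedding: word counts. A real system would call an embedding model.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)  # Counter yields 0 for missing keys
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SkillLibrary:
    def __init__(self) -> None:
        self.skills: List[Tuple[Counter, str, str]] = []

    def add(self, description: str, code: str, verified: bool) -> bool:
        """Store a skill only if it passed verification; report whether it was kept."""
        if not verified:
            return False
        self.skills.append((embed(description), description, code))
        return True

    def retrieve(self, query: str, k: int = 1) -> List[Tuple[str, str]]:
        """Return the k stored skills whose descriptions are most similar to the query."""
        q = embed(query)
        ranked = sorted(self.skills, key=lambda s: cosine(q, s[0]), reverse=True)
        return [(description, code) for _, description, code in ranked[:k]]


library = SkillLibrary()
library.add("mine wood log from a tree", "def mine_wood(bot): ...", verified=True)
library.add("craft a wooden pickaxe", "def craft_pickaxe(bot): ...", verified=True)
kept = library.add("fight zombie at night", "def fight(bot): ...", verified=False)
top = library.retrieve("craft pickaxe")
```

The `verified` gate is what keeps bad skills from poisoning retrieval; without a trustworthy verifier signal the library degrades as it grows.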

Reasoning-Native Models In Agent Loops

DeepSeek-R1 and OpenAI o-series reasoning models emit a long internal chain-of-thought before each action (delimited by <think>…</think> in R1; hidden from the caller in the o-series); RL training (GRPO-style for R1) rewarded correctness, producing emergent self-reflection, verification, and strategy-switching inside one model call. The agentic harness collapses to "thin ReAct": the reasoning model handles deliberation, the harness handles tools and termination. Innovation: moves deliberation from a multi-call scaffold (ToT / LATS) into a single test-time-scaled generation. Best for: hard single-shot reasoning embedded in agent steps. Failure modes: long thinking tokens dominate latency / cost; weaker on broad world-knowledge planning than on closed reasoning; thought leakage. Source code: deepseek-ai/DeepSeek-R1.
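
One job left to the thin harness is separating the reasoning span from the visible action before parsing, which also guards against thought leakage into tool arguments. The sketch below assumes the model output arrives as one string with `<think>` tags; the function name and the sample output are invented for illustration.

```python
import re
from typing import List, Tuple

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> Tuple[List[str], str]:
    """Return (reasoning spans, visible text with those spans removed)."""
    thoughts = [span.strip() for span in THINK_BLOCK.findall(raw)]
    visible = THINK_BLOCK.sub("", raw).strip()
    return thoughts, visible


raw_output = (
    "<think>The user wants the sum; call the calculator.</think>\n"
    "Action: calc[2+2]"
)
thoughts, visible = split_reasoning(raw_output)
```

Only `visible` is passed to the action parser; `thoughts` can be logged or discarded depending on the observability requirement.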

Agentic-OS / Long-Horizon Runtimes

Long-horizon agentic runtimes (Manus, Devin-style) decompose a goal into a hierarchical task graph; specialized sub-agents (Browser, Shell, Knowledge, Editor) run asynchronously in a persistent cloud VM with a real filesystem, terminal, and browser. State is durable across hours or days; CodeAct is the action substrate; a scheduler resumes paused tasks on external events. Innovation: treats the agent as an OS process, with durable state, multi-process concurrency, async wake-ups, and recoverable failure. Best for: software engineering autopilots, research / analyst workflows, multi-day ops. Failure modes: drift over long horizons, expensive failures, credential / sandbox blast radius, coordination bugs between sub-agents, cost spikes. Reference: arXiv:2505.02024.
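
The "resume paused tasks on external events" behavior can be sketched with generators as cooperative tasks. This is an in-memory toy, so it shows the wake-up mechanism but not durability; the `Scheduler` class, the event names, and the two task functions are assumptions for the example and do not describe any named product.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Generator, List, Tuple

Task = Generator[str, None, None]  # yields the name of the event it waits for


class Scheduler:
    def __init__(self) -> None:
        self.ready: Deque[Tuple[str, Task]] = deque()
        self.waiting: Dict[str, List[Tuple[str, Task]]] = defaultdict(list)
        self.log: List[str] = []

    def spawn(self, name: str, task: Task) -> None:
        self.ready.append((name, task))

    def emit(self, event: str) -> None:
        """External event arrives: move every task parked on it back to ready."""
        self.ready.extend(self.waiting.pop(event, []))

    def run(self) -> None:
        """Advance ready tasks until each one finishes or parks on an event."""
        while self.ready:
            name, task = self.ready.popleft()
            try:
                event = next(task)
            except StopIteration:
                self.log.append(f"{name}: done")
                continue
            self.log.append(f"{name}: waiting for {event}")
            self.waiting[event].append((name, task))


def browser_task() -> Task:
    yield "page_loaded"


def shell_task() -> Task:
    yield "build_finished"
    yield "tests_finished"


scheduler = Scheduler()
scheduler.spawn("browser", browser_task())
scheduler.spawn("shell", shell_task())
scheduler.run()                 # both sub-agents park on their first event
scheduler.emit("page_loaded")   # external wake-up
scheduler.run()                 # only the browser task resumes and completes
```

A production runtime would persist `waiting` to storage so that parked tasks survive a process restart, which is the durable-state property described above.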

Selection Criteria

The platform selects a runtime pattern per Unitt by reading the workload profile (task length, latency target, cost budget, deliberation depth, branching needs, tool fidelity, observability requirement) and matching against the table below. Legend: L = Low, M = Medium, H = High, V = Very high.

Pattern	Task Length	Latency	Cost	Deliberation	Branching	Tool Fidelity	Observability
ReAct	Short-Med	L	L	L	None	M	H
Reflexion	Med (multi-trial)	M	M	M	None	M	H
ReWOO	Short-Med	L (parallel workers)	L	L (planner-only)	None	M	M
Plan-and-Execute	Med-Long	M	L-M	M	None	M	H
Tree-of-Thoughts	Short (per query)	H	H	H	H	L	M
Graph-of-Thoughts	Short-Med	H	H	H	V (DAG)	L	L
LATS	Med	V	V	V	V (MCTS)	H	L
Self-Refine	Short	M	M	M	None	L	H
CodeAct	Short-Long	L-M	L	L-M	None	V	M
Voyager	Long (lifelong)	H	H	M	M (retrieval)	H	M
Reasoning-native	Short-Med	M-H	M-H	V (intra-model)	None visible	M-H	L (hidden CoT)
Agentic-OS	V Long (hours-days)	V	V	M-H per step	M (sub-agents)	V	L-M
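
One way to apply the table programmatically is to treat the legend as an ordinal scale and filter by budget. The sketch below transcribes only the Latency and Cost columns, drops parenthetical notes, and reads a range such as L-M at its upper bound; those simplifications and the `within_budget` helper are assumptions of this example, not the platform's documented matching logic.

```python
from typing import Dict, List, Tuple

LEVEL = {"L": 0, "M": 1, "H": 2, "V": 3}

# (latency, cost) per pattern, copied from the table above.
TABLE: Dict[str, Tuple[str, str]] = {
    "ReAct": ("L", "L"),
    "Reflexion": ("M", "M"),
    "ReWOO": ("L", "L"),
    "Plan-and-Execute": ("M", "L-M"),
    "Tree-of-Thoughts": ("H", "H"),
    "Graph-of-Thoughts": ("H", "H"),
    "LATS": ("V", "V"),
    "Self-Refine": ("M", "M"),
    "CodeAct": ("L-M", "L"),
    "Voyager": ("H", "H"),
    "Reasoning-native": ("M-H", "M-H"),
    "Agentic-OS": ("V", "V"),
}


def upper(cell: str) -> int:
    """Ordinal value of a cell, taking the high end of a range like 'L-M'."""
    return LEVEL[cell.split("-")[-1]]


def within_budget(max_latency: str, max_cost: str) -> List[str]:
    """Patterns whose worst-case latency and cost both fit the given ceilings."""
    return [
        name
        for name, (latency, cost) in TABLE.items()
        if upper(latency) <= LEVEL[max_latency] and upper(cost) <= LEVEL[max_cost]
    ]


strict = within_budget("L", "L")
moderate = within_budget("M", "M")
```

A budget filter only narrows the candidates; the remaining columns (deliberation, branching, tool fidelity, observability) still decide among them.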

Picking Heuristic

ReAct when the loop is short, tools are well-typed, and debuggability matters.
ReWOO when the plan is inferable upfront and token savings plus parallel tool execution dominate.
Plan-and-Execute for multi-step ops needing a cheap executor and a smart planner.
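Reflexion when there is a verifiable reward (tests, judge, oracle) across repeated attempts; Self-Refine when no external evaluator exists but the model can reliably critique its own output.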
ToT / GoT / LATS when correctness is more important than cost and intermediate states are scorable.
CodeAct when actions are computationally expressive (data, scripting, composition).
Voyager for open-ended skill accumulation in a durable environment.
Reasoning-native models in the loop to flatten scaffolds: let the model deliberate internally instead of orchestrating ToT.
Agentic-OS only when the horizon is hours-to-days, state must persist, and async wake-ups are required.

Production stacks routinely compose these. A typical configuration uses an Agentic-OS orchestrator running CodeAct actions on a reasoning-native model, with Reflexion-style memory across runs and LATS reserved for the hardest sub-task.

Cross-References

Emergence › System: the developer-facing platform layer that consumes these patterns.
Emergence › Subunits: composing patterns across sub-agents.
Emergence › WorldSim: empirical validation and pattern selection feedback.