Context & State

The context management patterns referenced throughout Emergence › State draw on the 2024-2026 context-engineering and durable-state research lineage. This page catalogs the canonical patterns, their tradeoffs, and the selection criteria the platform uses when configuring the State strategy for a given Unitt.

Reference Patterns

Context Window Architecture

Modern agentic SDKs assemble a request from ordered segments. Anthropic's Messages API processes them as tools → system → message history → user turn with assistant scratchpad and thinking blocks interleaved. OpenAI's Responses API is the equivalent agentic primitive: an instructions field plus a tools array plus a typed event stream including function_call, function_call_output, reasoning, and message. The OpenAI Agents SDK wraps this with Agent(name, instructions, tools) plus handoffs and guardrails. Layout rule of thumb: static prefix → warm memory → hot window → current user turn, with cache breakpoints aligned to layer boundaries.
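The layout rule of thumb can be sketched as follows. This is a minimal illustration, not any SDK's API: the `Segment` type, the layer names, and the `assemble` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    layer: str  # one of "static", "warm", "hot", "turn" (hypothetical labels)
    text: str

# Layers ordered from most stable to most volatile.
LAYER_ORDER = ["static", "warm", "hot", "turn"]

def assemble(segments: list[Segment]) -> tuple[list[Segment], list[int]]:
    """Order segments by layer and return cache breakpoint indices.

    A breakpoint index i means "cache everything up to and including
    ordered[i]". No breakpoint follows the final segment, since the
    current turn is the volatile tail.
    """
    # sorted() is stable, so segments keep their relative order within a layer.
    ordered = sorted(segments, key=lambda s: LAYER_ORDER.index(s.layer))
    breakpoints = [
        i for i in range(len(ordered) - 1)
        if ordered[i].layer != ordered[i + 1].layer
    ]
    return ordered, breakpoints

segments = [
    Segment("turn", "What failed in the last deploy?"),
    Segment("static", "tool definitions"),
    Segment("hot", "recent messages"),
    Segment("static", "system prompt"),
    Segment("warm", "retrieved memory"),
]
ordered, breakpoints = assemble(segments)
```

With four layers this yields at most three breakpoints, which fits under the four-breakpoint limit mentioned for Anthropic prompt caching.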

Prompt Caching

Anthropic prompt caching caches a token prefix at explicit cache_control breakpoints (up to four). The cache key is the exact byte prefix, so ordering is load-bearing: tools first, then system, then stable history, then the volatile tail. The minimum cacheable block is 1024 tokens (2048 for Haiku). Cache writes cost roughly 25% more than base input tokens and cache reads roughly 10% of base, and longer-TTL blocks must precede shorter-TTL blocks. Anthropic reports up to 85% latency reduction on long prompts.
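The pricing figures above imply a simple break-even calculation. The sketch below works in relative units (one base input token = 1.0) and assumes the first call writes the cache and every later call hits it; the function names are illustrative, and real bills also depend on TTL expiry and output tokens.

```python
WRITE_MULTIPLIER = 1.25  # cache write: roughly 25% more than base input
READ_MULTIPLIER = 0.10   # cache read: roughly 10% of base input

def cached_cost(prefix_tokens: int, tail_tokens: int, calls: int) -> float:
    """Relative input cost of `calls` requests sharing one cached prefix.

    Assumes one cache write on the first call and a cache hit on each
    later call; the uncached tail is billed at base rate every time.
    """
    if calls < 1:
        return 0.0
    write = prefix_tokens * WRITE_MULTIPLIER
    reads = prefix_tokens * READ_MULTIPLIER * (calls - 1)
    tails = tail_tokens * calls
    return write + reads + tails

def uncached_cost(prefix_tokens: int, tail_tokens: int, calls: int) -> float:
    """Relative input cost when every request pays base rate for everything."""
    return float((prefix_tokens + tail_tokens) * calls)
```

Under these assumptions a single call is more expensive with caching (1.25× the prefix), and caching pays for itself from the second call onward (1.35× versus 2× the prefix).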

Hierarchical / Tiered Context (MemGPT)

MemGPT frames the LLM as an OS kernel managing a two-tier memory: main context (system + working + FIFO message queue) and external context (archival + recall stores). The model calls functions to page data between tiers. Follow-ons include Letta, Mem0, and A-MEM.
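A toy version of the two-tier idea is sketched below. It is not the MemGPT or Letta API: `TieredMemory` and `archival_search` are hypothetical names, eviction here is automatic on overflow rather than model-directed, and substring matching stands in for a real archival index.

```python
from collections import deque

class TieredMemory:
    """Main context as a bounded FIFO queue, with overflow paged to an archive."""

    def __init__(self, max_main_messages: int):
        self.max_main = max_main_messages
        self.queue: deque[str] = deque()  # main context: FIFO message queue
        self.archive: list[str] = []      # external context: evicted messages

    def append(self, message: str) -> None:
        """Add a message; evict the oldest ones to the archive when over budget."""
        self.queue.append(message)
        while len(self.queue) > self.max_main:
            self.archive.append(self.queue.popleft())

    def archival_search(self, term: str, limit: int = 3) -> list[str]:
        """Function the model would call to page archived data back in."""
        needle = term.lower()
        hits = [m for m in self.archive if needle in m.lower()]
        return hits[:limit]

memory = TieredMemory(max_main_messages=2)
for msg in ["order A17 shipped on Monday", "hello again", "thanks"]:
    memory.append(msg)
```

After the three appends, the oldest message has left the main context but remains retrievable through `archival_search("a17")`.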

Sliding Window + Attention Sinks (StreamingLLM)

StreamingLLM (Xiao et al., ICLR 2024) showed that naive sliding-window KV eviction destroys quality because the softmax depends on early-token "attention sinks." Keeping the first 4 tokens' KV plus a rolling recent window restores stability up to 4M tokens with up to 22× speedup over recompute. This is a serving-layer technique, not a prompt-engineering one. Source code: mit-han-lab/streaming-llm.
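The eviction policy itself is small enough to state in code. The sketch below only computes which token positions keep their KV entries; it does not touch a real KV cache, and the function name and default window size are assumptions made for illustration.

```python
def kept_positions(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Return the token positions whose KV entries are retained.

    Keeps the first `n_sink` positions (the attention sinks) plus the most
    recent `window` positions; everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing to evict yet
    sinks = list(range(n_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent
```

For example, `kept_positions(10, n_sink=4, window=3)` retains positions 0 to 3 and 7 to 9, so the cache size stays fixed at `n_sink + window` however long the stream runs.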

Summarization Compaction

Rolling and hierarchical summarization replaces older turns with a <summary> block. Claude Code runs autocompact at roughly 95% capacity and recommends /compact proactively at roughly 60% to keep the working set sharp. The Anthropic API exposes server-side compaction that monitors per-turn tokens, injects a summary prompt, and replaces stale tool I/O while preserving completed work, current state, files modified, in-progress work, next steps, and constraints.
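A client-side version of threshold-triggered compaction might look like the sketch below. It is not Claude Code's or the Anthropic API's implementation: `compact`, `count_tokens`, and `summarize` are hypothetical, with the last two supplied by the caller (in practice a tokenizer and a model call).

```python
from typing import Callable

def compact(
    turns: list[str],
    count_tokens: Callable[[str], int],
    summarize: Callable[[list[str]], str],
    capacity: int,
    threshold: float = 0.95,
    keep_recent: int = 4,
) -> list[str]:
    """Replace older turns with one <summary> block once usage crosses the threshold."""
    used = sum(count_tokens(t) for t in turns)
    split = len(turns) - keep_recent
    if used < threshold * capacity or split <= 0:
        return turns  # under budget, or nothing old enough to fold away
    old, recent = turns[:split], turns[split:]
    return [f"<summary>{summarize(old)}</summary>"] + recent

# Stand-ins for a real tokenizer and a real summarization call.
word_count = lambda text: len(text.split())
stub_summary = lambda old: f"{len(old)} earlier turns"

history = ["a b c"] * 10  # 30 "tokens" under the stand-in counter
compacted = compact(history, word_count, stub_summary, capacity=30)
```

Lowering `threshold` toward 0.6 gives the proactive behavior described above, at the cost of summarizing more often.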

Hybrid Retrieval

2025 production stacks converge on a two-stage funnel: parallel BM25 + dense retrieval fused with Reciprocal Rank Fusion, followed by a cross-encoder or ColBERT late-interaction rerank of the top N. BM25 catches exact identifiers (error codes, SKUs, function names) that dense embeddings miss; dense catches paraphrase. Late chunking (Jina) embeds long documents first and pools per-chunk vectors after, preserving cross-chunk context in the embedding.
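The fusion step is the easiest part of the funnel to show concretely. A minimal sketch of Reciprocal Rank Fusion follows; each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the commonly used default, and the document IDs are made up.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["d3", "d1", "d2"]   # lexical ranking (hypothetical)
dense_hits = ["d1", "d4", "d3"]  # embedding ranking (hypothetical)
fused = rrf([bm25_hits, dense_hits])
```

Because RRF uses only ranks, the BM25 and dense scores never need to be normalized against each other; the fused top N is then passed to the reranker.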

Anthropic Contextual Retrieval

Per the Anthropic Contextual Retrieval post (September 2024), prepending a 50-100 token chunk-specific context preamble, generated by Claude, to each chunk before embedding reduces the top-20 retrieval failure rate by 35% (49% when combined with contextual BM25, 67% with a reranker added). The preamble generation is itself prompt-cacheable against the whole document.

Long Context (LongRoPE)

LongRoPE (Microsoft, ICML 2024) uses evolutionary search over per-dimension non-uniform rescaling to extend context beyond pretraining length, reaching 2M tokens with roughly 1k fine-tuning steps. LongRoPE2 (2025) is near-lossless. Needle-in-haystack (Kamradt) and successors (BABILong, Sequential-NIAH, NeedleBench v2) show that nominal context length overstates usable context; even frontier models drop below 65% on multi-needle sequential tasks at 128k+. Source code: microsoft/LongRoPE.

Context Isolation Via Subagents

Anthropic's multi-agent researcher and Claude Code subagents spawn workers with private context windows. Each subagent receives only its task brief and tools and returns a condensed summary; the orchestrator never sees the worker's verbose tool I/O. Cognition's "Don't Build Multi-Agents" (June 2025) counters that subagents can produce inconsistent decisions when the orchestrator cannot see their reasoning; the pragmatic middle ground used in Claude Code is to use subagents for read-only, bounded tasks (search, test runs, log scans) and single-thread for decision-making.
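The isolation boundary can be illustrated with plain functions. This is a schematic, not how Claude Code implements subagents: `run_subagent`, `orchestrate`, and the stand-in worker are hypothetical, and a real worker would be a model loop with tools.

```python
from typing import Callable

def run_subagent(brief: str, worker: Callable[[list[str]], str]) -> str:
    """Run a worker in a private context and return only its summary."""
    private_context = [brief]           # the worker starts from the brief alone
    summary = worker(private_context)   # worker may append verbose tool I/O here
    return summary                      # private_context is discarded on return

def orchestrate(parent_context: list[str], briefs: list[str],
                worker: Callable[[list[str]], str]) -> list[str]:
    """Append one condensed line per subagent to the parent context."""
    for brief in briefs:
        parent_context.append(f"[subagent] {run_subagent(brief, worker)}")
    return parent_context

def noisy_worker(context: list[str]) -> str:
    """Stand-in worker that generates bulky intermediate output."""
    context.extend(f"raw tool output line {i}" for i in range(100))
    return f"{context[0]}: 3 matches found"

parent = orchestrate(["goal: triage failing build"],
                     ["scan logs", "run tests"], noisy_worker)
```

The parent context grows by one line per subagent regardless of how much the worker generated, which is also why the orchestrator cannot inspect the worker's reasoning.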

State Checkpointing

LangGraph persistence provides a Checkpointer (SqliteSaver, PostgresSaver, DynamoDBSaver, Bedrock Session Service) that writes a state snapshot per super-step keyed by thread_id, supporting resume, replay, time-travel, and human-in-the-loop interrupts. Per-task writes provide pending-write recovery so successful nodes need not re-execute on resume. For crash-safe long-running workflows, pair checkpoints with an external workflow engine (Temporal, Dapr Workflows) for at-least-once tool invocation semantics.
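To make the snapshot-per-step idea concrete without depending on LangGraph, here is a standard-library sketch using `sqlite3`. It is not LangGraph's Checkpointer interface: the `SqliteCheckpoints` class, its table schema, and its method names are assumptions for illustration.

```python
import json
import sqlite3
from typing import Optional

class SqliteCheckpoints:
    """Stores one JSON state snapshot per (thread_id, step)."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def put(self, thread_id: str, step: int, state: dict) -> None:
        """Write (or overwrite) the snapshot for this step in one transaction."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, step, json.dumps(state)),
            )

    def latest(self, thread_id: str) -> Optional[tuple[int, dict]]:
        """Return (step, state) for the newest snapshot, used to resume."""
        row = self.db.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

    def at(self, thread_id: str, step: int) -> Optional[dict]:
        """Return the state at an earlier step, used for replay or time-travel."""
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
            (thread_id, step),
        ).fetchone()
        return None if row is None else json.loads(row[0])

store = SqliteCheckpoints()
store.put("thread-1", 0, {"done": []})
store.put("thread-1", 1, {"done": ["fetch"]})
```

On restart, `store.latest("thread-1")` yields the step to resume from. Note that this sketch only snapshots state; it does not provide the per-task pending-write recovery or the at-least-once invocation semantics discussed above.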

Episode Boundaries

Three signals drive a rollover: token pressure (compact at the configured threshold), goal completion (commit a summary to long-term memory, then start a fresh thread), and topic shift (the embedding distance between recent and prior turns exceeds the configured threshold). Anthropic's Claude memory feature (September 2025, reaching Pro / Max in October 2025) auto-synthesizes a memory summary roughly every 24 hours, which acts as an explicit session boundary.
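The three signals can be combined into a single check, sketched below. The function names, the default thresholds, the priority order among signals, and the use of cosine distance are all assumptions for illustration; the embeddings would come from whatever model the deployment already uses.

```python
import math
from typing import Optional, Sequence

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def rollover_reason(
    tokens_used: int,
    capacity: int,
    goal_done: bool,
    recent_vec: Sequence[float],
    prior_vec: Sequence[float],
    token_threshold: float = 0.95,  # assumed default
    shift_threshold: float = 0.5,   # assumed default
) -> Optional[str]:
    """Return which signal (if any) calls for an episode rollover."""
    if goal_done:
        return "goal_completion"  # commit a summary, start a fresh thread
    if tokens_used >= token_threshold * capacity:
        return "token_pressure"   # compact
    if cosine_distance(recent_vec, prior_vec) > shift_threshold:
        return "topic_shift"
    return None
```

Checking goal completion first is a design choice in this sketch: a finished goal warrants a fresh thread even when the window is also nearly full.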

Context Engineering As A Discipline

The discipline was framed by Karpathy and formalized by Anthropic ("Effective context engineering") and LangChain ("Context Engineering for Agents", 2025). LangChain's taxonomy: write, select, compress, isolate. Anthropic's principle: "find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome," favoring just-in-time retrieval over upfront stuffing.

Selection Criteria

The platform selects context-management strategies per Unitt by reading the workload profile (context length, multi-session continuity, cost budget, latency target, fidelity sensitivity, complexity tolerance) and matching against the table below.

| Pattern | Context Length | Multi-Session | Cost | Latency | Fidelity Loss | Complexity |
| --- | --- | --- | --- | --- | --- | --- |
| Layered window + caching | Small-Med | No | Very Low (on hit) | Very Low | None | Low |
| Tiered memory (MemGPT) | Unbounded | Yes | Med (tool calls) | Med | Low-Med | High |
| StreamingLLM | Unbounded stream | No | Low | Very Low | High (no recall) | Med (serving) |
| Compaction | Large | Partial | Low | Low | Med (lossy) | Low |
| Hybrid RAG | Unbounded | Yes | Med (index + rerank) | Med | Low | Med |
| Contextual Retrieval | Unbounded | Yes | Med (index) / Low (query) | Med | Very Low | Med |
| Long-context (LongRoPE) | Very Large | No | High (tokens) | High | Med (lost-in-middle) | Low (model-side) |
| Subagent isolation | Large | No | Med (parallel) | Med | Low (with good summaries) | Med |
| Checkpointing | n/a | Yes | Low (DB) | Low | None | Med |
| Episode rollover | n/a | Yes | Low | Low | Med | Med |

Picking Heuristic

  • Default starting point: layered window with prompt caching + lightweight compaction.
  • Add hybrid retrieval when corpora exceed the working window or exact-match recall is required.
  • Add contextual retrieval preamble when chunks lose meaning in isolation (legal, financial, code).
  • Add MemGPT-style tiering only when multi-session continuity is a strict requirement.
  • Add subunit isolation when individual workflow stages would otherwise pollute the parent context.
  • Add StreamingLLM for real-time streaming workloads (voice agents, log monitoring).
  • Add LongRoPE only after NIAH-style validation confirms the workload genuinely benefits.
  • Always pair with checkpointing for any session running longer than a minute.
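The heuristic above reads naturally as a rule list. The sketch below encodes it literally; it is not the platform's actual selector, and the `Workload` fields and `pick_patterns` name are hypothetical stand-ins for the workload profile.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """Hypothetical workload profile; every field defaults to 'not needed'."""
    corpus_exceeds_window: bool = False
    needs_exact_match: bool = False
    chunks_lose_meaning: bool = False
    strict_multi_session: bool = False
    stages_pollute_parent: bool = False
    realtime_stream: bool = False
    niah_validated: bool = False
    session_seconds: float = 0.0

def pick_patterns(w: Workload) -> list[str]:
    """Apply the picking heuristic in order, starting from the default stack."""
    picks = ["layered window + prompt caching", "lightweight compaction"]
    if w.corpus_exceeds_window or w.needs_exact_match:
        picks.append("hybrid retrieval")
    if w.chunks_lose_meaning:
        picks.append("contextual retrieval preamble")
    if w.strict_multi_session:
        picks.append("MemGPT-style tiering")
    if w.stages_pollute_parent:
        picks.append("subagent isolation")
    if w.realtime_stream:
        picks.append("StreamingLLM")
    if w.niah_validated:
        picks.append("LongRoPE")
    if w.session_seconds > 60:
        picks.append("checkpointing")
    return picks

support_bot = Workload(needs_exact_match=True, session_seconds=120)
chosen = pick_patterns(support_bot)
```

Each rule only adds a pattern, so the default stack is always present and the output order mirrors the order of the bullets.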

Cross-References