Context & State¶

The context management patterns referenced throughout Emergence › State draw on the 2024-2026 context-engineering and durable-state research lineage. This page catalogs the canonical patterns, their tradeoffs, and the selection criteria the platform uses when configuring the State strategy for a given Unitt.

Reference Patterns¶

Context Window Architecture¶

Modern agentic SDKs assemble a request from ordered segments. Anthropic's Messages API processes them as tools → system → message history → user turn with assistant scratchpad and thinking blocks interleaved. OpenAI's Responses API is the equivalent agentic primitive: an instructions field plus a tools array plus a typed event stream including function_call, function_call_output, reasoning, and message. The OpenAI Agents SDK wraps this with Agent(name, instructions, tools) plus handoffs and guardrails. Layout rule of thumb: static prefix → warm memory → hot window → current user turn, with cache breakpoints aligned to layer boundaries.
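The layout rule of thumb can be sketched as a small ordering routine. This is a minimal illustration, not any SDK's behavior: the `Segment` type, the layer names, and the `assemble` function are hypothetical, and breakpoints are simply placed at the last segment of every layer except the volatile user turn.

```python
from dataclasses import dataclass

# Layers ordered from most stable to most volatile (names are illustrative).
LAYER_ORDER = ("static_prefix", "warm_memory", "hot_window", "user_turn")


@dataclass
class Segment:
    layer: str  # one of LAYER_ORDER
    text: str


def assemble(segments):
    """Order segments by layer and return (ordered, breakpoint_indices).

    A breakpoint index i means "the prefix ending at segment i is a
    candidate for caching". The user turn never gets a breakpoint.
    """
    rank = {name: i for i, name in enumerate(LAYER_ORDER)}
    # sorted() is stable, so order within a layer is preserved.
    ordered = sorted(segments, key=lambda s: rank[s.layer])
    breakpoints = []
    for i, seg in enumerate(ordered):
        last_of_layer = i + 1 == len(ordered) or ordered[i + 1].layer != seg.layer
        if last_of_layer and seg.layer != "user_turn":
            breakpoints.append(i)
    return ordered, breakpoints
```

With three cached layers this yields at most three breakpoints, which stays inside a four-breakpoint limit.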

Prompt Caching¶

Anthropic prompt caching stores a token prefix at explicit cache_control breakpoints (up to four). The cache key is the exact byte prefix, so ordering is load-bearing: tools first, then system, then stable history, then the volatile tail. The minimum cacheable block is 1024 tokens (2048 for Haiku); writes cost roughly 25% more than base input tokens and reads roughly 10% of base; longer-TTL blocks must precede shorter-TTL blocks. Anthropic reports up to 85% latency reduction on long prompts.
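The write and read multipliers make the break-even point easy to work out. The sketch below uses only the approximate ratios quoted above; the function name and the relative price unit are assumptions, and it assumes every call after the first hits the cache within its TTL.

```python
WRITE_MULT = 1.25  # cache write: roughly 25% above base input price
READ_MULT = 0.10   # cache read: roughly 10% of base input price


def prefix_cost(prefix_tokens, calls, cached, base_price_per_token=1.0):
    """Relative input cost of sending the same prefix `calls` times."""
    if calls <= 0:
        return 0.0
    if not cached:
        return prefix_tokens * calls * base_price_per_token
    # The first call writes the cache; later calls are assumed to be hits.
    return prefix_tokens * base_price_per_token * (WRITE_MULT + READ_MULT * (calls - 1))
```

Under these assumptions a single call costs more with caching (1.25x versus 1x), and caching is already cheaper from the second call onward (about 1.35x versus 2x).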

Hierarchical / Tiered Context (MemGPT)¶

MemGPT frames the LLM as an OS kernel managing a two-tier memory: main context (system + working + FIFO message queue) and external context (archival + recall stores). The model calls functions to page data between tiers. Follow-ons include Letta, Mem0, and A-MEM.
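A toy version of the two tiers helps make the paging idea concrete. This is not the MemGPT or Letta API: the `TieredMemory` class, its method names, and the substring search are illustrative stand-ins, and a real system would bound the queue by tokens rather than message count.

```python
from collections import deque


class TieredMemory:
    """Main context (working dict + FIFO queue) backed by an external recall store."""

    def __init__(self, max_queue_messages=3):
        self.working = {}      # small, editable working context
        self.queue = deque()   # FIFO message queue kept in the main context
        self.recall = []       # external store holding evicted messages
        self.max_queue_messages = max_queue_messages

    def append(self, message):
        """Add a message; the oldest ones spill to the recall store when full."""
        self.queue.append(message)
        while len(self.queue) > self.max_queue_messages:
            self.recall.append(self.queue.popleft())

    def recall_search(self, query):
        """Function the model would call to page evicted messages back in."""
        q = query.lower()
        return [m for m in self.recall if q in m.lower()]
```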

Sliding Window + Attention Sinks (StreamingLLM)¶

StreamingLLM (Xiao et al., ICLR 2024) showed that naive sliding-window KV eviction destroys quality because the softmax depends on early-token "attention sinks." Keeping the first 4 tokens' KV plus a rolling recent window restores stability up to 4M tokens with up to 22× speedup over recompute. This is a serving-layer technique, not a prompt-engineering one. Source code: mit-han-lab/streaming-llm.
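The eviction policy itself is simple to state as index arithmetic. The sketch below only computes which token positions keep their KV entries; the function name and the default window size are assumptions, and it does not model the in-cache position reassignment that the actual implementation performs.

```python
def kept_positions(seq_len, n_sink=4, window=8):
    """Return the token positions whose KV entries stay in the cache.

    Keeps the first `n_sink` tokens (attention sinks) plus the most
    recent `window` tokens; everything in between is evicted.
    """
    if seq_len <= n_sink + window:
        return list(range(seq_len))
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```

The cache size is therefore constant at `n_sink + window` entries no matter how long the stream runs, which is why the technique gives no recall of evicted content.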

Summarization Compaction¶

Rolling and hierarchical summarization replace older turns with a <summary> block. Claude Code runs autocompact at roughly 95% capacity and recommends running /compact proactively at roughly 60% to keep the working set sharp. The Anthropic API exposes server-side compaction that monitors per-turn tokens, injects a summary prompt, and replaces stale tool I/O while preserving completed work, current state, files modified, in-progress work, next steps, and constraints.
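A threshold-triggered rolling compaction can be sketched in a few lines. Everything here is hypothetical: `count_tokens` is a crude whitespace proxy rather than a real tokenizer, the default `summarize` is a placeholder for an LLM call, and the code does not reproduce Claude Code's or the Anthropic API's actual behavior.

```python
def count_tokens(text):
    # Crude whitespace proxy; a real tokenizer would give different counts.
    return len(text.split())


def maybe_compact(turns, budget, threshold=0.95, keep_recent=2, summarize=None):
    """Replace older turns with one <summary> block once usage crosses the threshold."""
    if summarize is None:
        # Placeholder; in practice this would be a model-generated summary.
        summarize = lambda old: f"{len(old)} earlier turns elided"
    used = sum(count_tokens(t) for t in turns)
    if used < threshold * budget or len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [f"<summary>{summarize(old)}</summary>"] + recent
```

Lowering `threshold` to around 0.6 mirrors the proactive compaction point mentioned above.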

Hybrid Retrieval¶

2025 production stacks converge on a two-stage funnel: parallel BM25 + dense retrieval fused with Reciprocal Rank Fusion, followed by a cross-encoder or ColBERT late-interaction rerank of the top N. BM25 catches exact identifiers (error codes, SKUs, function names) that dense embeddings miss; dense catches paraphrase. Late chunking (Jina) embeds long documents first and pools per-chunk vectors after, preserving cross-chunk context in the embedding.
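The fusion step of the first stage can be illustrated with a small Reciprocal Rank Fusion function. The function name and the document IDs are made up for the example; `k=60` is the constant commonly used for RRF, and ties are broken by ID only to keep the output deterministic.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; score = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["E42", "a", "b"]    # lexical ranking (exact identifier first)
dense_hits = ["a", "c", "E42"]   # embedding ranking (paraphrase first)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["a", "E42", "c", "b"]
```

Documents that appear in both lists rise to the top, and the fused list is what would be passed to the reranker.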

Anthropic Contextual Retrieval¶

Per the Anthropic Contextual Retrieval post (September 2024), prepending a 50-100 token chunk-specific context preamble, generated by Claude, to each chunk before embedding reduces the top-20 retrieval failure rate by 35% (49% when combined with contextual BM25, 67% with a reranker added). The preamble generation is itself prompt-cacheable against the whole document.
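The indexing-time transformation can be sketched as follows. The `contextualize_chunks` function and the `generate_context` callback are hypothetical; in a real pipeline the callback would be a model call that sees the full document (cached) plus the chunk, and the result would then be embedded and indexed.

```python
def contextualize_chunks(document, chunks, generate_context):
    """Prepend a short, chunk-specific preamble to each chunk before embedding.

    `generate_context(document, chunk)` is an assumed callback that returns
    a brief description situating the chunk within the document.
    """
    contextualized = []
    for chunk in chunks:
        preamble = generate_context(document, chunk).strip()
        contextualized.append(f"{preamble}\n\n{chunk}")
    return contextualized
```

The same contextualized text would feed both the dense index and the BM25 index, which is where the combined gain comes from.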

Long Context (LongRoPE)¶

LongRoPE (Microsoft, ICML 2024) uses evolutionary search over per-dimension non-uniform rescaling to extend context beyond pretraining length, reaching 2M tokens with roughly 1k fine-tuning steps. LongRoPE2 (2025) is near-lossless. Needle-in-haystack (Kamradt) and successors (BABILong, Sequential-NIAH, NeedleBench v2) show that nominal context length overstates usable context; even frontier models drop below 65% on multi-needle sequential tasks at 128k+. Source code: microsoft/LongRoPE.

Context Isolation Via Subagents¶

Anthropic's multi-agent researcher and Claude Code subagents spawn workers with private context windows. Each subagent receives only its task brief and tools and returns a condensed summary; the orchestrator never sees the worker's verbose tool I/O. Cognition's "Don't Build Multi-Agents" (June 2025) counters that subagents can produce inconsistent decisions when the orchestrator cannot see their reasoning; the pragmatic middle ground used in Claude Code is to use subagents for read-only, bounded tasks (search, test runs, log scans) and single-thread for decision-making.
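The isolation boundary can be shown with a toy orchestrator. This is a sketch of the pattern only, not Claude Code's implementation: `run_subagent`, `orchestrate`, and the `worker` callback signature are all hypothetical.

```python
def run_subagent(brief, tools, worker):
    """Run one worker with a private transcript; only its summary escapes."""
    private_transcript = [brief]  # verbose tool I/O accumulates here only
    return worker(brief, tools, private_transcript)


def orchestrate(tasks, tools, worker):
    """Collect condensed summaries; the parent context never sees transcripts."""
    parent_context = []
    for brief in tasks:
        parent_context.append(run_subagent(brief, tools, worker))
    return parent_context
```

Because the parent sees only summaries, any decision reasoning inside the worker is lost, which is the concern behind restricting subagents to read-only, bounded tasks.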

State Checkpointing¶

LangGraph persistence provides a Checkpointer (SqliteSaver, PostgresSaver, DynamoDBSaver, Bedrock Session Service) that writes a state snapshot per super-step keyed by thread_id, supporting resume, replay, time-travel, and human-in-the-loop interrupts. Per-task writes provide pending-write recovery so successful nodes need not re-execute on resume. For crash-safe long-running workflows, pair checkpoints with an external workflow engine (Temporal, Dapr Workflows) for at-least-once tool invocation semantics.
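The core of a per-step checkpointer is a snapshot table keyed by thread and step. The sketch below uses only the standard-library `sqlite3` module and is not the LangGraph API; the class name, schema, and methods are assumptions chosen to show resume (`latest`) and time-travel (`at`).

```python
import json
import sqlite3


class SqliteCheckpointStore:
    """Stores one JSON state snapshot per (thread_id, step)."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT, step INTEGER, state TEXT, "
            "PRIMARY KEY (thread_id, step))"
        )

    def put(self, thread_id, step, state):
        with self.db:  # commits on success, rolls back on exception
            self.db.execute(
                "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                (thread_id, step, json.dumps(state)),
            )

    def latest(self, thread_id):
        """Resume point: the highest step recorded for this thread, or None."""
        row = self.db.execute(
            "SELECT step, state FROM checkpoints WHERE thread_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (thread_id,),
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))

    def at(self, thread_id, step):
        """Time-travel: the snapshot at a specific step, or None."""
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ? AND step = ?",
            (thread_id, step),
        ).fetchone()
        return None if row is None else json.loads(row[0])
```

A workflow would call `latest(thread_id)` on startup and continue from the returned step instead of re-running earlier nodes.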

Episode Boundaries¶

Three signals drive a rollover: token pressure (compact at the configured threshold), goal completion (commit a summary to long-term memory and start a fresh thread), and topic shift (the embedding distance between recent and prior turns exceeds the configured threshold). Anthropic's Claude memory feature (September 2025, extended to Pro / Max in October 2025) auto-synthesizes a memory summary roughly every 24 hours, which acts as an explicit session boundary.
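The three signals can be combined into a single decision function. This is a minimal sketch: the function names, the default thresholds, the priority order among signals, and the use of cosine distance over precomputed turn embeddings are all assumptions rather than the platform's configured behavior.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rollover_signal(tokens_used, token_budget, goal_done, recent_vec, prior_vec,
                    pressure_threshold=0.8, shift_threshold=0.5):
    """Return which signal (if any) should trigger an episode rollover."""
    if goal_done:
        return "goal_completion"   # commit summary, start a fresh thread
    if tokens_used >= pressure_threshold * token_budget:
        return "token_pressure"    # compact
    if cosine_distance(recent_vec, prior_vec) > shift_threshold:
        return "topic_shift"
    return None
```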

Context Engineering As A Discipline¶

Framed by Karpathy and formalized by Anthropic (Effective context engineering) and LangChain (Context Engineering for Agents, 2025). LangChain's taxonomy: write, select, compress, isolate. Anthropic's principle: "find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome," favoring just-in-time retrieval over upfront stuffing.

Selection Criteria¶

The platform selects context-management strategies per Unitt by reading the workload profile (context length, multi-session continuity, cost budget, latency target, fidelity sensitivity, complexity tolerance) and matching against the table below.

Pattern	Context Length	Multi-Session	Cost	Latency	Fidelity Loss	Complexity
Layered window + caching	Small-Med	No	Very Low (on hit)	Very Low	None	Low
Tiered memory (MemGPT)	Unbounded	Yes	Med (tool calls)	Med	Low-Med	High
StreamingLLM	Unbounded stream	No	Low	Very Low	High (no recall)	Med (serving)
Compaction	Large	Partial	Low	Low	Med (lossy)	Low
Hybrid RAG	Unbounded	Yes	Med (index + rerank)	Med	Low	Med
Contextual Retrieval	Unbounded	Yes	Med (index) / Low (query)	Med	Very Low	Med
Long-context (LongRoPE)	Very Large	No	High (tokens)	High	Med (lost-in-middle)	Low (model-side)
Subagent isolation	Large	No	Med (parallel)	Med	Low (with good summaries)	Med
Checkpointing	n/a	Yes	Low (DB)	Low	None	Med
Episode rollover	n/a	Yes	Low	Low	Med	Med

Picking Heuristic¶

Default starting point: layered window with prompt caching + lightweight compaction.
Add hybrid retrieval when corpora exceed the working window or exact-match recall is required.
Add contextual retrieval preamble when chunks lose meaning in isolation (legal, financial, code).
Add MemGPT-style tiering only when multi-session continuity is a strict requirement.
Add subunit isolation when individual workflow stages would otherwise pollute the parent context.
Add StreamingLLM for real-time streaming workloads (voice agents, log monitoring).
Add LongRoPE only after NIAH-style validation confirms the workload genuinely benefits.
Always pair with checkpointing for any session running longer than a minute.
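The heuristic above reads naturally as a rule list over a workload profile. The sketch below is illustrative only: the `WorkloadProfile` fields and the `pick_strategies` function are hypothetical names, not the platform's configuration schema, and each rule mirrors one line of the heuristic.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    corpus_exceeds_window: bool = False
    needs_exact_match: bool = False
    chunks_lose_meaning: bool = False
    strict_multi_session: bool = False
    stages_pollute_parent: bool = False
    realtime_stream: bool = False
    niah_validated_long_context: bool = False
    session_seconds: float = 0.0


def pick_strategies(p: WorkloadProfile) -> list[str]:
    """Apply the picking heuristic in order, starting from the default."""
    picks = ["layered window + prompt caching", "lightweight compaction"]
    if p.corpus_exceeds_window or p.needs_exact_match:
        picks.append("hybrid retrieval")
    if p.chunks_lose_meaning:
        picks.append("contextual retrieval preamble")
    if p.strict_multi_session:
        picks.append("MemGPT-style tiering")
    if p.stages_pollute_parent:
        picks.append("subunit isolation")
    if p.realtime_stream:
        picks.append("StreamingLLM")
    if p.niah_validated_long_context:
        picks.append("LongRoPE")
    if p.session_seconds > 60:
        picks.append("checkpointing")
    return picks
```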

Cross-References¶

Emergence › State: the developer-facing platform layer that consumes these strategies.
Emergence › Memory: the durable substrate behind retrieval and tiering.
Emergence › Subunits: context isolation across sub-agents.
Emergence › WorldSim: empirical validation of context strategies under replay.