Fabric Data

The data-plane patterns referenced throughout Fabric › Data draw on the active agentic data-architecture research lineage. This page catalogs the canonical bindings, retrieval pipelines, schema-enforcement primitives, and egress controls, with selection criteria the platform uses when configuring data flow for a given Unitt fabric.

Model Context Protocol

The Model Context Protocol (MCP) is an open client / server protocol that exposes three primitives: Tools (model-controlled actions), Resources (application-controlled, URI-addressed reads), and Prompts (reusable templates), plus Sampling (server-initiated LLM calls routed back through the client). The 2025-11-25 spec is the largest revision since launch, adding async tasks, server-side agent loops, elicitation, Client ID Metadata Documents, enhanced sampling with tool-choice control, and an extensions system. The 2026 roadmap focuses on stateless streamable-HTTP transport, MCP Server Cards for discovery, and richer agent-to-agent coordination. Security remains the protocol's weakest seam; see the egress section below.
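As a rough illustration of the three primitives on the wire, the sketch below builds the JSON-RPC 2.0 requests a client would send for a tool call, a resource read, and a prompt fetch. The method names follow the MCP specification; the tool name, resource URI, prompt name, and arguments are hypothetical, and no transport or session handshake is shown.

```python
import itertools
import json

# Monotonic request ids; JSON-RPC only requires that ids be unique per session.
_ids = itertools.count(1)


def jsonrpc_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope, the framing MCP messages use."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Tool: a model-controlled action. The tool name and arguments are hypothetical.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "query_orders", "arguments": {"customer_id": "c-42"}},
)

# Resource: an application-controlled read addressed by URI (hypothetical URI).
read_resource = jsonrpc_request("resources/read", {"uri": "file:///reports/q3.md"})

# Prompt: a reusable template fetched by name (hypothetical name and arguments).
get_prompt = jsonrpc_request(
    "prompts/get",
    {"name": "summarize_report", "arguments": {"tone": "brief"}},
)

for message in (call_tool, read_resource, get_prompt):
    print(json.dumps(message))
```

In practice an MCP client library would own the envelope and the initialization handshake; the point here is only that each primitive maps to its own request method.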

External Bindings And Enterprise Connectors

Modern fabrics layer four binding shapes: REST / OpenAPI (default), gRPC (low-latency internal), GraphQL (typed traversal), and webhooks (push). Vendor-managed MCP servers dominate enterprise integration: Snowflake Managed MCP, Databricks Managed MCP, and ServiceNow AI Agent Fabric, which bridges MCP + A2A to SAP, Salesforce, Snowflake, Databricks, and BigQuery using Zero-Copy Connectors. Authentication has consolidated on OAuth 2.1 + PKCE, with short-lived per-agent tokens fronted by per-tenant scoped vaults (HashiCorp Vault namespaces, AWS Secrets Manager hierarchical paths).
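The PKCE half of that authentication flow is small enough to sketch with the standard library. The example below derives an S256 code challenge from a random code verifier, as RFC 7636 describes; the function names are illustrative, and the authorization and token requests themselves are out of scope.

```python
import base64
import hashlib
import hmac
import secrets


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as PKCE requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def challenge_for(verifier: str) -> str:
    """S256 transform: base64url(SHA-256(ascii(verifier)))."""
    return _b64url(hashlib.sha256(verifier.encode("ascii")).digest())


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for one authorization attempt."""
    verifier = _b64url(secrets.token_bytes(32))  # 43 characters, high entropy
    return verifier, challenge_for(verifier)


def server_accepts(verifier: str, stored_challenge: str) -> bool:
    """What the token endpoint checks when the client redeems the code."""
    return hmac.compare_digest(challenge_for(verifier), stored_challenge)


verifier, challenge = make_pkce_pair()
print(server_accepts(verifier, challenge))
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code is useless without the verifier.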

RAG Ingestion Pipelines

Anthropic Contextual Retrieval (September 2024) prepends an LLM-generated 50-100 token contextual preamble to each chunk before embedding and BM25 indexing, cutting retrieval failures by 49% versus naive RAG, and by 67% when combined with a reranker. Late chunking (Jina) is the efficient alternative: embed the whole document with a long-context encoder, then pool token vectors into chunk vectors. Hybrid retrieval (dense + BM25, merged with Reciprocal Rank Fusion, then a cross-encoder rerank) is the durable production stack. Vector store positioning (2026): pgvector + pgvectorscale wins under roughly 50M vectors and for Postgres-resident workloads; Qdrant leads filtered hybrid search; Weaviate uses ACORN for selective filters; Pinecone is the easiest managed option; Turbopuffer offers cheap object-store-backed indices.
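The fusion step in that hybrid stack can be sketched in a few lines. The example below implements Reciprocal Rank Fusion over two ranked lists of document ids; the constant k=60 is the commonly used default rather than a value taken from this page, and the document ids and rankings are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["d3", "d1", "d2"]  # hypothetical dense (embedding) ranking
bm25_hits = ["d1", "d4", "d3"]   # hypothetical BM25 ranking

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the dense and lexical retrievers; the fused list would then be passed to the cross-encoder reranker.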

Streaming Data

Kafka has become the agent "bloodstream": agents subscribe to topics rather than poll. Apache Flink Agents (FLIP-531) and Confluent Streaming Agents make agents condition-triggered. The emerging stack is Kafka + Flink + MCP + A2A: Kafka for the durable event log, Flink for stateful joins, enrichment, and agent loops, MCP for tool and resource access, and A2A for inter-agent messaging (Falconer). Materialize, RisingWave, Timeplus, and Fluss provide point-in-time SQL over the same streams.
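To make "condition-triggered" concrete, here is a minimal in-memory sketch that uses no Kafka or Flink APIs. A plain Python iterable stands in for a topic, a per-key running total stands in for stream-processor state, and the agent is invoked only when a key's total first crosses a threshold. All names, the threshold, and the events are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    topic: str
    key: str
    value: float


def run_agent_on_stream(
    events: Iterable[Event],
    threshold: float,
    agent: Callable[[str, float], str],
) -> Iterator[str]:
    """Keep per-key state and fire the agent once when a total crosses the threshold."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        before = totals[event.key]
        totals[event.key] = before + event.value
        # Trigger on the crossing itself, not on every later event for that key.
        if before < threshold <= totals[event.key]:
            yield agent(event.key, totals[event.key])


def review_agent(key: str, total: float) -> str:
    """Stand-in for an LLM-backed agent call."""
    return f"review {key} (total={total})"


stream = [
    Event("payments", "acct-1", 40.0),
    Event("payments", "acct-2", 10.0),
    Event("payments", "acct-1", 70.0),
    Event("payments", "acct-1", 5.0),
]
actions = list(run_agent_on_stream(stream, threshold=100.0, agent=review_agent))
print(actions)
```

In a real deployment the state and the trigger condition would live in the stream processor, so the agent is not polled and does not re-read the log.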

Schema Enforcement

Anthropic shipped Structured Outputs (public beta Nov 14, 2025; GA on Opus 4.5 / 4.6 / 4.7, Sonnet 4.5 / 4.6, Haiku 4.5) using grammar-constrained sampling that compiles JSON Schema into a token-level grammar: a mathematical guarantee, not prompt prayer. OpenAI offers a parallel response_format of type "json_schema" with "strict": true. Outside vendor support, Outlines, XGrammar, and llguidance provide open constrained decoding; Instructor wraps validation + retry; and Guardrails AI RAIL specs validate post-hoc.
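Where constrained decoding is not available, the validation + retry pattern can be sketched as below. This is not Instructor's API: it is a standard-library illustration of the same idea, in which the `generate` callable, the required fields, and the fake model are all hypothetical stand-ins.

```python
import json
from typing import Callable

# Hypothetical contract: field name -> accepted Python type(s).
REQUIRED = {"invoice_id": str, "total": (int, float)}


def validate(payload: str) -> dict:
    """Parse and check the payload; raise ValueError with a reason on failure."""
    data = json.loads(payload)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for field, expected in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"wrong type for field: {field}")
    return data


def generate_with_retry(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, validate, and re-ask with the error appended on failure."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            return validate(raw)
        except ValueError as exc:
            feedback = f"\nPrevious output was invalid ({exc}). Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


def fake_model(prompt: str) -> str:
    """Stand-in model: fails once, then returns a conforming object."""
    fake_model.calls += 1
    if fake_model.calls == 1:
        return "Sure! Here is the invoice."
    return '{"invoice_id": "INV-7", "total": 129.5}'


fake_model.calls = 0
result = generate_with_retry(fake_model, "Extract the invoice as JSON.")
print(result, fake_model.calls)
```

Unlike grammar-constrained sampling, this loop can still exhaust its attempts, which is why it is a fallback rather than a guarantee.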

Egress Controls / Output Guardrails

NeMo Guardrails uses the Colang DSL for programmable input / output / dialog rails. Guardrails AI is validator-centric, with RAIL / XML schemas and re-ask loops. Llama Guard 3 / 4 is a fine-tuned classifier that categorizes content by the MLCommons taxonomy. Azure Prompt Shields targets indirect prompt injection and jailbreaks. April 2025 Attack Success Rate benchmarks (arXiv 2504.11168) show meaningful evasion gaps across all single-layer guardrails, so production patterns use layered defense (regex / Presidio PII redaction → small classifier → LLM judge → DLP egress proxy → tamper-evident audit log). The OWASP LLM Top 10 v2025 formalized LLM08:2025 Vector & Embedding Weaknesses and made output handling a first-class concern.
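A compressed sketch of the layered pattern follows, covering three of the five stages: regex redaction, a classifier, and a tamper-evident audit log. The regexes are deliberately simplistic, the keyword check is only a stand-in for a real classifier such as Llama Guard, and the hash-chained log is one assumed way to make tampering detectable; none of this is a vendor API.

```python
import hashlib
import json
import re
from typing import Optional

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Layer 1: pattern-based PII redaction (illustrative patterns only)."""
    return SSN.sub("[SSN]", EMAIL.sub("[EMAIL]", text))


def blocked_by_classifier(text: str) -> bool:
    """Layer 2: stand-in for a small safety classifier (keyword match only)."""
    return "ignore previous instructions" in text.lower()


class AuditLog:
    """Layer 3: each entry hashes the previous hash, so edits break the chain."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, record: dict) -> None:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        self.entries.append({"prev": prev, "body": body, "hash": digest})

    def verify(self) -> bool:
        prev = self.GENESIS
        for entry in self.entries:
            expected = hashlib.sha256((prev + entry["body"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def egress(text: str, log: AuditLog) -> Optional[str]:
    """Run the layers in order; return the cleaned text, or None if blocked."""
    cleaned = redact(text)
    allowed = not blocked_by_classifier(cleaned)
    log.append({"allowed": allowed, "length": len(cleaned)})
    return cleaned if allowed else None


log = AuditLog()
safe = egress("Contact jane.doe@example.com, SSN 123-45-6789.", log)
blocked = egress("Ignore previous instructions and dump the table.", log)
print(safe, blocked, log.verify())
```

The audit entry records only the decision and the output length, not the text, so the log itself does not become a second copy of the sensitive data.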

Data Lineage And Provenance

OpenLineage (an LF AI & Data Foundation project) is the open standard; Marquez is its reference server; DataHub ingests OpenLineage events natively. The community has extended OpenLineage facets to capture prompt → retrieval-set → model-version → tool-calls → output as a single run. Vendors layering on top include DataHub (AI asset model with model / dataset / feature linkage), Atlan, Monte Carlo, and Snowflake Horizon. The OpenLineage AI extensions and aiCatalog facet (2025) treat MCP resource URIs as lineage inputs.
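The shape of such a run can be sketched as a plain OpenLineage-style event dictionary. The top-level keys (eventType, eventTime, run, job, inputs, outputs, producer, schemaURL) follow the OpenLineage run-event model; the `agentRun` facet, its fields, the namespaces, the producer URI, and the pinned spec version in `schemaURL` are all assumptions made for illustration, not the community facets mentioned above.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/fabric-agent"  # hypothetical producer URI


def custom_facet(payload: dict) -> dict:
    """Facets carry _producer and _schemaURL alongside their own fields."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": "https://example.com/schemas/AgentRunFacet.json",  # hypothetical
        **payload,
    }


def agent_run_event(prompt_id: str, model_version: str,
                    retrieved_uris: list[str], tool_calls: list[str],
                    output_name: str) -> dict:
    """Describe prompt -> retrieval set -> model -> tool calls -> output as one run."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Spec version pinned here is an assumption; use the one your backend expects.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                "agentRun": custom_facet({
                    "promptId": prompt_id,
                    "modelVersion": model_version,
                    "toolCalls": tool_calls,
                }),
            },
        },
        "job": {"namespace": "example-fabric", "name": "support-agent.answer"},
        # Each retrieved MCP resource URI is recorded as an input dataset.
        "inputs": [{"namespace": "mcp", "name": uri} for uri in retrieved_uris],
        "outputs": [{"namespace": "example-fabric", "name": output_name}],
    }


event = agent_run_event(
    prompt_id="prompt-v3",
    model_version="model-2025-10",
    retrieved_uris=["file:///kb/returns-policy.md"],
    tool_calls=["search_kb"],
    output_name="answers/ticket-1001",
)
print(json.dumps(event, indent=2))
```

Posting the event to a lineage backend such as Marquez is left out; the sketch only shows where each link of the chain would sit in the event.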

Synthetic + Real Hybrid Data

Pre-training pipelines from 2024-2025 mix synthetic and real data by design (Raschka 2025). The hybrid approach wins in vertical and domain tasks: synthetic generation (LLM-authored Q&A, rule-based perturbations, simulators) augments scarce real data, while real samples anchor the distribution and prevent collapse (arXiv 2503.14023). For agent evaluation, the standard hybrid pattern combines synthetic trajectories generated by stronger models with PII-scrubbed real production traces.
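A minimal sketch of assembling such a hybrid set is shown below, assuming a fixed synthetic fraction and a provenance tag on every sample. The fraction, the sample contents, and the function name are hypothetical; real pipelines would also deduplicate and enforce hold-out splits.

```python
import random


def mix_hybrid(real: list[str], synthetic: list[str],
               synthetic_fraction: float, size: int, seed: int = 0) -> list[tuple[str, str]]:
    """Sample a batch with a fixed synthetic share, tagging each item's provenance."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_synthetic = round(size * synthetic_fraction)
    n_real = size - n_synthetic
    if n_real > len(real) or n_synthetic > len(synthetic):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)  # seeded so the batch is reproducible
    batch = [(item, "real") for item in rng.sample(real, n_real)]
    batch += [(item, "synthetic") for item in rng.sample(synthetic, n_synthetic)]
    rng.shuffle(batch)
    return batch


real_traces = [f"real-trace-{i}" for i in range(20)]      # PII-scrubbed in practice
synthetic_traces = [f"synth-traj-{i}" for i in range(20)]  # generated by a stronger model

batch = mix_hybrid(real_traces, synthetic_traces, synthetic_fraction=0.3, size=10)
print(sum(1 for _, source in batch if source == "synthetic"), "synthetic of", len(batch))
```

Keeping the provenance tag on each sample lets downstream evaluation report results separately for real and synthetic items.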

Multi-Tenant Data Isolation

There are five isolation layers, from strongest to weakest: dedicated compute / storage → per-tenant Kubernetes namespace + KMS keys → logical row-level security → query-time filters → shared with ACL. Agent fabrics add their own requirements: per-tenant vector DB collections or namespaces (mandatory under OWASP LLM08:2025); per-tenant MCP server instances or a tenant-scoped MCP gateway; Vault namespaces as Vault-within-Vault; and tenant context attached server-side to every tool call (Blaxel, Prefactor).
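Server-side tenant-context propagation can be sketched as a thin wrapper around tool dispatch. In the example below the tenant id comes only from the authenticated session, and any attempt by the model or client to pass its own `tenant_id` is rejected. The class, tool, and data are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Session:
    """Authenticated session; tenant_id is established at login, never by the model."""
    tenant_id: str
    user_id: str


class TenantScopedTools:
    def __init__(self, session: Session, tools: dict[str, Callable[..., Any]]) -> None:
        self._session = session
        self._tools = tools

    def call(self, name: str, arguments: dict[str, Any]) -> Any:
        # The model controls `arguments`, so it must not be able to pick the tenant.
        if "tenant_id" in arguments:
            raise PermissionError("tenant_id is set server-side")
        return self._tools[name](tenant_id=self._session.tenant_id, **arguments)


# Hypothetical per-tenant store standing in for a namespaced vector DB.
NOTES = {
    "acme": ["Acme renewal due in March", "Acme prefers email"],
    "globex": ["Globex renewal due in June"],
}


def search_notes(tenant_id: str, query: str) -> list[str]:
    """Tool implementation: it can only read the namespace it is handed."""
    return [note for note in NOTES.get(tenant_id, []) if query.lower() in note.lower()]


tools = TenantScopedTools(Session("acme", "u-1"), {"search_notes": search_notes})
print(tools.call("search_notes", {"query": "renewal"}))
```

The same check belongs in a tenant-scoped MCP gateway, where it runs before the request reaches any tool server.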

Just-In-Time Retrieval

Anthropic's Effective Context Engineering (September 2025) pivots from "stuff the window" to "smallest set of high-signal tokens" via JIT retrieval: agents hold lightweight identifiers (file paths, query handles, URLs) and load data through tools only when needed. LangChain's Context Engineering for Agents recommends LangGraph long-term memory plus langgraph-bigtool (semantic search over tool descriptions) for tool-heavy agents. Heuristic: load stable material under 20k tokens upfront; use JIT for everything large, mutable, or selectively relevant.
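The heuristic can be written down as a small planning function. In this sketch each source has a handle, a loader, a stability flag, and a token estimate; stable sources under the 20k-token limit are loaded upfront, and everything else is exposed only as a handle that a tool resolves on demand. The `Source` type, the loaders, and the token estimates are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

UPFRONT_TOKEN_LIMIT = 20_000  # threshold from the heuristic above


@dataclass(frozen=True)
class Source:
    handle: str                 # lightweight identifier the agent keeps in context
    loader: Callable[[], str]   # fetches the content when actually needed
    stable: bool                # False for mutable or frequently changing data
    est_tokens: int             # rough size estimate, however it is obtained


def plan_context(sources: list[Source]) -> tuple[list[str], list[str]]:
    """Split sources into content loaded upfront and handles left for JIT loading."""
    upfront, handles = [], []
    for source in sources:
        if source.stable and source.est_tokens < UPFRONT_TOKEN_LIMIT:
            upfront.append(source.loader())
        else:
            handles.append(source.handle)
    return upfront, handles


def load_by_handle(sources: list[Source], handle: str) -> str:
    """Tool the agent calls to resolve a handle just in time."""
    for source in sources:
        if source.handle == handle:
            return source.loader()
    raise KeyError(handle)


sources = [
    Source("docs/style-guide.md", lambda: "Use sentence case.", stable=True, est_tokens=500),
    Source("logs/app.log", lambda: "ERROR timeout at 12:01", stable=False, est_tokens=300),
    Source("warehouse/orders", lambda: "order rows...", stable=True, est_tokens=900_000),
]
upfront, handles = plan_context(sources)
print(upfront, handles)
```

Only the style guide lands in the initial context; the log and the warehouse table stay as handles until the agent asks for them through `load_by_handle`.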

Selection Criteria

Data Source	Binding Pattern	Ingestion Mode	Egress Controls	Cost Signal
Transactional DB	MCP over read-replica + SQL tool with strict schema	JIT query	Row-level filter, PII redactor, query allowlist	$
Doc corpus	RAG: contextual chunking + hybrid retrieval	Batch nightly; incremental on change-feed	Source-citation, license / DLP filter	$$
Streaming events	Kafka topic → Flink Agent / Materialize	Continuous, event-triggered	Schema registry, rate limit, audit topic	$$$
External REST / SaaS API	Managed MCP server (preferred)	JIT tool call, cached	OAuth 2.1 scoped tokens, response validator	$ per call
File system / object store	MCP Resources with URI refs	JIT load by path	Path allowlist, MIME / AV scan, redaction	$
Data warehouse	Managed MCP + federated SQL / Cortex Agents	JIT semantic SQL	Column masking, tenant filter, query budget	$$
Knowledge graph	GraphQL / Cypher tool	JIT traversal with depth cap	Relationship allowlist, output schema	$$
Real-time market / pricing	gRPC stream → in-memory cache	Continuous push	TTL, staleness check, structured output	$$$
User memory / preferences	Per-tenant vector namespace + KV	JIT by user / session key	Tenant scope check, encryption at rest	$
Synthetic eval data	Offline pipeline → eval store	Batch generation, versioned	Provenance tag, hold-out enforcement	$

Cross-References

Fabric › Data: the developer-facing platform layer that consumes these patterns.
Reference › Research › Context & State: context curation patterns the Data layer feeds.
Reference › Research › Memory Systems: durable memory substrates the Data layer reads and writes.