Fabric Data¶
The data-plane patterns referenced throughout Fabric › Data draw on the active agentic data-architecture research lineage. This page catalogs the canonical bindings, retrieval pipelines, schema-enforcement primitives, and egress controls, with selection criteria the platform uses when configuring data flow for a given Unitt fabric.
Model Context Protocol¶
The Model Context Protocol (MCP) is an open client / server protocol that exposes three server primitives: Tools (model-controlled actions), Resources (application-controlled, URI-addressed reads), and Prompts (reusable templates). It also defines Sampling, which lets a server request LLM completions back through the client. The 2025-11-25 spec is the largest revision since launch, adding async tasks, server-side agent loops, elicitation, Client ID Metadata Documents, enhanced sampling with tool-choice control, and an extensions system. The 2026 roadmap focuses on stateless streamable-HTTP transport, MCP Server Cards for discovery, and richer agent-to-agent coordination. Security remains the protocol's weakest seam; see the egress section below.
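As a rough illustration of how the three primitives look on the wire, the sketch below builds the JSON-RPC 2.0 requests a client would send for each one. The method names (`tools/call`, `resources/read`, `prompts/get`) come from the MCP specification; the tool name, resource URI, prompt name, and arguments are hypothetical placeholders, and transport, initialization, and capability negotiation are left out.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def mcp_request(method: str, params: dict) -> dict:
    """Build a JSON-RPC 2.0 request envelope for an MCP method."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}


# Tool: a model-controlled action (hypothetical tool name and arguments).
call_tool = mcp_request(
    "tools/call", {"name": "lookup_order", "arguments": {"order_id": "A-1001"}}
)

# Resource: an application-controlled read addressed by URI (hypothetical URI).
read_resource = mcp_request("resources/read", {"uri": "file:///reports/q3.md"})

# Prompt: a reusable template fetched by name (hypothetical prompt name).
get_prompt = mcp_request(
    "prompts/get", {"name": "summarize_ticket", "arguments": {"ticket_id": "T-42"}}
)

if __name__ == "__main__":
    for request in (call_tool, read_resource, get_prompt):
        print(json.dumps(request, indent=2))
```

In practice an MCP SDK produces these envelopes; the point here is only that each primitive maps to a distinct method namespace.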
External Bindings And Enterprise Connectors¶
Modern fabrics layer four binding shapes: REST / OpenAPI (the default), gRPC (low-latency internal calls), GraphQL (typed traversal), and webhooks (push). Vendor-managed MCP servers dominate enterprise integration: Snowflake Managed MCP, Databricks Managed MCP, and ServiceNow AI Agent Fabric, which uses MCP + A2A bridges and Zero-Copy Connectors to reach SAP, Salesforce, Snowflake, Databricks, and BigQuery. Authentication has consolidated on OAuth 2.1 + PKCE with short-lived per-agent tokens fronted by per-tenant scoped vaults (HashiCorp Vault namespaces, AWS Secrets Manager hierarchical paths).
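The PKCE half of that authentication flow can be sketched with the standard library alone. The example below derives an S256 code challenge from a random code verifier as described in RFC 7636; the authorization endpoint, client ID, redirect URI, and scope are hypothetical, and token exchange and vault storage are not shown.

```python
import base64
import hashlib
import secrets
from urllib.parse import urlencode


def challenge_for(verifier: str) -> str:
    """Return the S256 code challenge: base64url(SHA-256(verifier)) without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Generate a fresh (code_verifier, code_challenge) pair."""
    # 64 random bytes encode to 86 URL-safe characters, inside the 43-128 limit.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)


verifier, challenge = make_pkce_pair()

# Hypothetical authorization request for a per-agent, narrowly scoped token.
authorize_url = "https://auth.example.com/authorize?" + urlencode(
    {
        "response_type": "code",
        "client_id": "agent-billing-01",                # hypothetical
        "redirect_uri": "https://agent.example.com/cb",  # hypothetical
        "scope": "orders:read",                          # hypothetical
        "code_challenge": challenge,
        "code_challenge_method": "S256",
    }
)
```

The verifier stays with the agent and is sent only in the later token request, so an intercepted authorization code is useless on its own.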
RAG Ingestion Pipelines¶
Anthropic Contextual Retrieval (September 2024) prepends an LLM-generated 50-100 token contextual preamble to each chunk before embedding and BM25 indexing, cutting the retrieval failure rate by 49% versus naive RAG, and by 67% when a reranker is added. Late chunking (Jina) is the efficient alternative: embed the whole document with a long-context encoder, then pool token vectors into chunk vectors. Hybrid retrieval (dense + BM25 + Reciprocal Rank Fusion + cross-encoder rerank) is the durable production stack. Vector store positioning (2026): pgvector + pgvectorscale wins under roughly 50M vectors and for Postgres-resident workloads; Qdrant leads filtered hybrid search; Weaviate uses ACORN for selective filters; Pinecone is the easiest managed option; Turbopuffer offers cheap object-store-backed indices.
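The fusion step of the hybrid stack is small enough to show directly. The sketch below implements Reciprocal Rank Fusion over two ranked lists, one standing in for dense retrieval and one for BM25; the document IDs are made up, and the constant `k = 60` is the commonly used default rather than a value taken from this page. A cross-encoder reranker would then rescore the fused top results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever.
dense_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Because RRF uses only ranks, it needs no score normalization between the dense and lexical retrievers, which is a large part of why it is a common default.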
Streaming Data¶
Kafka has become the agent "bloodstream": agents subscribe to topics rather than poll. Apache Flink Agents (FLIP-531) and Confluent Streaming Agents make agents condition-triggered. The emerging stack is Kafka + Flink + MCP + A2A: Kafka for the durable event log, Flink for stateful joins / enrichment and agent loops, MCP for tool / resource access, and A2A for inter-agent messaging (Falconer). Materialize, RisingWave, Timeplus, and Fluss provide point-in-time SQL over the same streams.
Schema Enforcement¶
Anthropic shipped Structured Outputs (public beta Nov 14, 2025; GA on Opus 4.5 / 4.6 / 4.7, Sonnet 4.5 / 4.6, Haiku 4.5) using grammar-constrained sampling that compiles JSON Schema into a token-level grammar, so output conforms to the schema by construction rather than by prompt instruction. The guarantee covers structure, not the correctness of the values. OpenAI offers the parallel response_format={"type": "json_schema", "json_schema": {"name": ..., "strict": true, "schema": ...}}. Outside vendor support: Outlines, XGrammar, and llguidance provide open constrained decoding; Instructor wraps validation + retry; Guardrails AI RAIL specs validate post-hoc.
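Where constrained decoding is unavailable, the validate-and-retry pattern can be approximated as follows. This is a minimal sketch in the spirit of that approach, not Instructor's actual API: the flat type-map schema, the `generate` callable, and the canned replies are all assumptions made for illustration.

```python
import json
from typing import Callable

# Hypothetical flat schema: field name -> required Python type.
SCHEMA = {"invoice_id": str, "total": float, "currency": str}


def validate(payload: dict, schema: dict) -> list[str]:
    """Return human-readable schema violations; an empty list means valid."""
    errors = []
    for key, expected in schema.items():
        if key not in payload:
            errors.append(f"missing field: {key}")
        elif not isinstance(payload[key], expected):
            errors.append(f"{key}: expected {expected.__name__}")
    errors += [f"unexpected field: {key}" for key in sorted(set(payload) - set(schema))]
    return errors


def generate_validated(
    generate: Callable[[str], str], prompt: str, schema: dict, max_attempts: int = 3
) -> dict:
    """Call the model, validate its JSON, and re-ask with error feedback on failure."""
    feedback = ""
    for _ in range(max_attempts):
        raw = generate(prompt + feedback)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            feedback = f"\nPrevious output was not valid JSON: {exc}"
            continue
        if not isinstance(payload, dict):
            feedback = "\nPrevious output was not a JSON object."
            continue
        errors = validate(payload, schema)
        if not errors:
            return payload
        feedback = "\nFix these errors: " + "; ".join(errors)
    raise ValueError(f"no schema-valid output after {max_attempts} attempts")


# Stand-in for a model call: the first reply is invalid, the second is valid.
_replies = iter(
    [
        '{"invoice_id": "INV-7", "total": "12.50"}',
        '{"invoice_id": "INV-7", "total": 12.5, "currency": "EUR"}',
    ]
)


def fake_model(prompt: str) -> str:
    return next(_replies)


result = generate_validated(fake_model, "Extract the invoice.", SCHEMA)
```

Unlike grammar-constrained sampling, this approach can still exhaust its retries, so callers need a failure path.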
Egress Controls / Output Guardrails¶
NeMo Guardrails uses the Colang DSL for programmable input / output / dialog rails. Guardrails AI is validator-centric, with RAIL / XML schemas and re-ask loops. Llama Guard 3 / 4 is a fine-tuned classifier that categorizes content by the MLCommons taxonomy. Azure Prompt Shields targets indirect prompt injection and jailbreaks. April 2025 Attack Success Rate benchmarks (arXiv 2504.11168) show meaningful evasion gaps across all single-layer guardrails, so production patterns use layered defense (regex / Presidio PII redaction → small classifier → LLM judge → DLP egress proxy → tamper-evident audit log). The OWASP LLM Top 10 v2025 formalized LLM08:2025 Vector & Embedding Weaknesses and treats output handling as a first-class concern (LLM05:2025 Improper Output Handling).
Data Lineage And Provenance¶
OpenLineage (an LF AI & Data Foundation project) is the open standard; Marquez is its reference server; DataHub ingests OpenLineage events natively. The community has extended OpenLineage facets to capture prompt → retrieval-set → model-version → tool-calls → output as a single run. Vendors layering on top include DataHub (AI asset model with model / dataset / feature linkage), Atlan, Monte Carlo, and Snowflake Horizon. The OpenLineage AI extensions and aiCatalog facet (2025) plug in MCP resource URIs as lineage inputs.
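To make the single-run idea concrete, the sketch below assembles one OpenLineage-shaped run event in which retrieved MCP resource URIs appear as inputs and the prompt, model version, and tool calls ride in a custom run facet. The top-level field names follow the OpenLineage RunEvent structure; the `agentTrace` facet, its schema URL, the producer URI, the namespaces, and the pinned spec version in `schemaURL` are assumptions for illustration, not the community extension itself.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/fabric-agent"  # hypothetical producer URI


def agent_run_event(
    prompt_id: str,
    retrieved_uris: list[str],
    model_version: str,
    tool_calls: list[str],
    output_name: str,
) -> dict:
    """Build one COMPLETE run event describing a single agent invocation."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        # Assumed spec version; pin this to the version actually targeted.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {
            "runId": str(uuid.uuid4()),
            "facets": {
                # Hypothetical custom facet carrying the agent-specific trace.
                "agentTrace": {
                    "_producer": PRODUCER,
                    "_schemaURL": "https://example.com/schemas/AgentTraceRunFacet.json",
                    "promptId": prompt_id,
                    "modelVersion": model_version,
                    "toolCalls": tool_calls,
                }
            },
        },
        "job": {"namespace": "fabric-agents", "name": "support-agent.answer"},
        # Each retrieved MCP resource becomes a lineage input dataset.
        "inputs": [{"namespace": "mcp", "name": uri} for uri in retrieved_uris],
        "outputs": [{"namespace": "fabric-agents", "name": output_name}],
    }


event = agent_run_event(
    prompt_id="prompt-v12",
    retrieved_uris=["file:///kb/refunds.md", "file:///kb/shipping.md"],
    model_version="model-2026-01",
    tool_calls=["lookup_order"],
    output_name="answers/ticket-42",
)
payload = json.dumps(event)  # ready to POST to a lineage backend
```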
Synthetic + Real Hybrid Data¶
Pre-training pipelines from 2024-2025 mix synthetic and real data by design (Raschka 2025). Hybrid data wins in vertical / domain tasks: synthetic generation (LLM-authored Q&A, rule-based perturbations, simulators) augments scarce real data, while real samples anchor the distribution and prevent collapse (arXiv 2503.14023). For agent evaluation, the standard hybrid pattern combines synthetic trajectories generated by stronger models with PII-scrubbed real production traces.
Multi-Tenant Data Isolation¶
Five isolation layers (strongest to weakest): dedicated compute / storage → per-tenant Kubernetes namespace + KMS keys → logical row-level security → query-time filters → shared with ACL. For agent fabrics: vector DB collections / namespaces per tenant are mandatory under OWASP LLM08:2025; per-tenant MCP server instances or tenant-scoped MCP gateway; Vault namespaces as Vault-within-Vault; tenant-context propagation attached server-side to every tool call (Blaxel, Prefactor).
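Server-side tenant-context propagation can be illustrated with a small gateway wrapper. In the sketch below the tenant ID always comes from the authenticated session and overrides anything the model supplies in its tool arguments; the `Session` type, the in-memory `VECTORS` store, and the `search_memory` tool are hypothetical stand-ins for a real session layer and a namespaced vector database.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Session:
    """Authenticated caller identity, established outside the model's control."""
    tenant_id: str
    user_id: str


class TenantScopedGateway:
    """Routes tool calls and injects tenant_id from the session, never from the model."""

    def __init__(self, tools: dict[str, Callable]) -> None:
        self._tools = tools

    def call(self, session: Session, tool_name: str, model_args: dict):
        # Drop any tenant_id the model tried to supply, then attach the real one.
        args = {key: value for key, value in model_args.items() if key != "tenant_id"}
        args["tenant_id"] = session.tenant_id
        return self._tools[tool_name](**args)


# Hypothetical per-tenant namespaces in a vector store.
VECTORS = {"acme": ["acme doc 1"], "globex": ["globex doc 1"]}


def search_memory(tenant_id: str, query: str) -> list[str]:
    """Search only inside the caller's namespace."""
    return [doc for doc in VECTORS.get(tenant_id, []) if query.lower() in doc.lower()]


gateway = TenantScopedGateway({"search_memory": search_memory})
session = Session(tenant_id="acme", user_id="u1")

# The model attempts to read another tenant's namespace; the gateway overrides it.
hits = gateway.call(session, "search_memory", {"query": "doc", "tenant_id": "globex"})
```

Keeping the override in the gateway rather than in each tool means a prompt-injected argument cannot widen the scope, regardless of which tool is called.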
Just-In-Time Retrieval¶
Anthropic's Effective Context Engineering (September 2025) pivots from "stuff the window" to "smallest set of high-signal tokens" via JIT retrieval: agents hold lightweight identifiers (file paths, query handles, URLs) and load data through tools only when needed. LangChain's Context Engineering for Agents recommends LangGraph long-term memory + langgraph-bigtool (semantic search over tool descriptions) for tool-heavy agents. Heuristic: upfront stuff for stable sub-20k-token material; JIT for everything large, mutable, or selectively relevant.
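The heuristic and the identifier-holding pattern can be sketched together. The 20k-token threshold comes from the text above; the `Source` fields, the `LazyHandle` class, and the example sources are hypothetical, and token counts are assumed to be estimated elsewhere.

```python
from dataclasses import dataclass
from typing import Callable

UPFRONT_TOKEN_LIMIT = 20_000  # threshold taken from the heuristic above


@dataclass
class Source:
    name: str
    est_tokens: int
    mutable: bool           # does the content change between agent runs?
    always_relevant: bool   # is it needed on essentially every task?


def loading_strategy(src: Source) -> str:
    """Return 'upfront' for small, stable, always-needed material; otherwise 'jit'."""
    if src.est_tokens < UPFRONT_TOKEN_LIMIT and not src.mutable and src.always_relevant:
        return "upfront"
    return "jit"


class LazyHandle:
    """Lightweight identifier the agent keeps in context; data loads only on demand."""

    def __init__(self, identifier: str, loader: Callable[[str], str]) -> None:
        self.identifier = identifier
        self._loader = loader
        self.loads = 0

    def load(self) -> str:
        self.loads += 1
        return self._loader(self.identifier)


style_guide = Source("style-guide", est_tokens=4_000, mutable=False, always_relevant=True)
orders_table = Source("orders-db", est_tokens=2_000_000, mutable=True, always_relevant=False)

# Only the path sits in the context window until a tool call resolves it.
report = LazyHandle("reports/q3.md", loader=lambda path: f"<contents of {path}>")
```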
Selection Criteria¶
| Data Source | Binding Pattern | Ingestion Mode | Egress Controls | Cost Signal |
|---|---|---|---|---|
| Transactional DB | MCP over read-replica + SQL tool with strict schema | JIT query | Row-level filter, PII redactor, query allowlist | $ |
| Doc corpus | RAG: contextual chunking + hybrid retrieval | Batch nightly; incremental on change-feed | Source-citation, license / DLP filter | $$ |
| Streaming events | Kafka topic → Flink Agent / Materialize | Continuous, event-triggered | Schema registry, rate limit, audit topic | $$$ |
| External REST / SaaS API | Managed MCP server (preferred) | JIT tool call, cached | OAuth 2.1 scoped tokens, response validator | $ per call |
| File system / object store | MCP Resources with URI refs | JIT load by path | Path allowlist, MIME / AV scan, redaction | $ |
| Data warehouse | Managed MCP + federated SQL / Cortex Agents | JIT semantic SQL | Column masking, tenant filter, query budget | $$ |
| Knowledge graph | GraphQL / Cypher tool | JIT traversal with depth cap | Relationship allowlist, output schema | $$ |
| Real-time market / pricing | gRPC stream → in-memory cache | Continuous push | TTL, staleness check, structured output | $$$ |
| User memory / preferences | Per-tenant vector namespace + KV | JIT by user / session key | Tenant scope check, encryption at rest | $ |
| Synthetic eval data | Offline pipeline → eval store | Batch generation, versioned | Provenance tag, hold-out enforcement | $ |
Cross-References¶
- Fabric › Data: developer-facing platform layer that consumes these patterns.
- Reference › Research › Context & State: context curation patterns the Data layer feeds.
- Reference › Research › Memory Systems: durable memory substrates the Data layer reads and writes.