Fabric Data

The data-plane patterns referenced throughout Fabric › Data draw on the active agentic data-architecture research lineage. This page catalogs the canonical bindings, retrieval pipelines, schema-enforcement primitives, and egress controls, with selection criteria the platform uses when configuring data flow for a given Unitt fabric.

Model Context Protocol

The Model Context Protocol (MCP) is an open client / server protocol that exposes three server primitives: Tools (model-controlled actions), Resources (application-controlled, URI-addressed reads), and Prompts (reusable templates). A fourth capability, Sampling, lets a server initiate LLM calls back through the client. The 2025-11-25 spec is the largest revision since launch: async tasks, server-side agent loops, elicitation, Client ID Metadata Documents, enhanced sampling with tool-choice control, and an extensions system. The 2026 roadmap focuses on stateless streamable-HTTP transport, MCP Server Cards for discovery, and richer agent-to-agent coordination. Security remains the protocol's weakest seam; see the egress section below.
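As a rough illustration of how the Tools and Resources primitives look on the wire, the sketch below builds the JSON-RPC 2.0 requests a client would send. The method names `tools/call` and `resources/read` come from the MCP specification; the helper names, the tool name `query_orders`, and the example URI are hypothetical.

```python
import json
from itertools import count

# Monotonic request ids; JSON-RPC only requires that each id be unique per session.
_ids = count(1)


def tools_call_request(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 request that invokes an MCP tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


def resources_read_request(uri: str) -> str:
    """Serialize a JSON-RPC 2.0 request that reads a URI-addressed MCP resource."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "resources/read",
        "params": {"uri": uri},
    })


if __name__ == "__main__":
    # "query_orders" and the file URI are made-up examples.
    print(tools_call_request("query_orders", {"limit": 5}))
    print(resources_read_request("file:///reports/q3.md"))
```

Transport (stdio or streamable HTTP), capability negotiation, and response handling are left out; a real client would normally use an MCP SDK rather than hand-built messages.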

External Bindings And Enterprise Connectors

Modern fabrics layer four binding shapes: REST / OpenAPI (default), gRPC (low-latency internal), GraphQL (typed traversal), and webhooks (push). Vendor-managed MCP servers dominate enterprise integration. Examples include Snowflake Managed MCP, Databricks Managed MCP, and ServiceNow AI Agent Fabric, which bridges MCP + A2A to SAP, Salesforce, Snowflake, Databricks, and BigQuery using Zero-Copy Connectors. Authentication has consolidated on OAuth 2.1 + PKCE, with short-lived per-agent tokens fronted by per-tenant scoped vaults (HashiCorp Vault namespaces, AWS Secrets Manager hierarchical paths).
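The PKCE half of that authentication flow is small enough to sketch. The example below derives an S256 code challenge from a random code verifier as described in RFC 7636; the function names are illustrative, and the authorization request and token exchange that carry these values are omitted.

```python
import base64
import hashlib
import secrets


def challenge_for(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) with '=' padding removed."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def make_pkce_pair() -> tuple[str, str]:
    """Return (code_verifier, code_challenge) for one authorization attempt."""
    # 64 random bytes encode to 86 URL-safe characters, inside the 43-128 range
    # that RFC 7636 allows for a verifier.
    verifier = secrets.token_urlsafe(64)
    return verifier, challenge_for(verifier)


if __name__ == "__main__":
    verifier, challenge = make_pkce_pair()
    # The challenge goes in the authorization request with
    # code_challenge_method=S256; the verifier is sent later with the token request.
    print(len(verifier), challenge)
```

A fresh pair should be generated for every authorization attempt, and the verifier should never leave the agent process until the token exchange.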

RAG Ingestion Pipelines

Anthropic Contextual Retrieval (September 2024) prepends an LLM-generated 50-100 token contextual preamble to each chunk before embedding and BM25 indexing, cutting the retrieval failure rate by 49% versus naive RAG, and by 67% when combined with a reranker. Late chunking (Jina) is the efficient alternative: embed the whole document with a long-context encoder, then pool token vectors into chunk vectors. Hybrid retrieval (dense + BM25, fused with Reciprocal Rank Fusion, followed by a cross-encoder rerank) is the durable production stack. Vector store positioning (2026): pgvector + pgvectorscale wins under roughly 50M vectors and for Postgres-resident workloads; Qdrant leads filtered hybrid search; Weaviate uses ACORN for selective filters; Pinecone is the easiest managed option; Turbopuffer offers cheap object-store-backed indices.
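The fusion step in the hybrid stack can be sketched in a few lines. The example below implements Reciprocal Rank Fusion over two ranked lists of document ids; the constant `k = 60` is the value commonly used in the literature, and the document ids are made up. Reranking with a cross-encoder would run on the fused list and is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from a dense retriever and a BM25 index.
dense = ["d3", "d1", "d2"]
bm25 = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([dense, bm25])

if __name__ == "__main__":
    print(fused)
```

Because RRF uses only ranks, it needs no score normalization between the dense and lexical retrievers, which is a large part of why it is a common default.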

Streaming Data

Kafka has become the agent "bloodstream": agents subscribe to topics rather than poll. Apache Flink Agents (FLIP-531) and Confluent Streaming Agents make agents condition-triggered. The emerging stack is Kafka + Flink + MCP + A2A: Kafka for the durable event log, Flink for stateful joins / enrichment and agent loops, MCP for tool / resource access, and A2A for inter-agent messaging (Falconer). Materialize, RisingWave, Timeplus, and Fluss provide point-in-time SQL over the same streams.

Schema Enforcement

Anthropic shipped Structured Outputs (public beta Nov 14, 2025; GA on Opus 4.5 / 4.6 / 4.7, Sonnet 4.5 / 4.6, Haiku 4.5) using grammar-constrained sampling, which compiles a JSON Schema into a token-level grammar. Conformance to the schema is therefore guaranteed at decode time rather than requested in the prompt. OpenAI offers a comparable response_format of type json_schema with strict set to true. Outside vendor support, Outlines, XGrammar, and llguidance provide open constrained decoding; Instructor wraps validation + retry; Guardrails AI RAIL specs validate post-hoc.
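Where constrained decoding is not available, the validate-and-retry pattern mentioned above can be approximated as follows. This is a minimal standard-library sketch, not Instructor's actual API: `call_model` is a hypothetical callable that takes a prompt and returns text, and the two schema fields are invented for the example.

```python
import json
from typing import Callable

# Hypothetical target schema: field name -> required Python type.
SCHEMA = {"ticket_id": int, "severity": str}


def validate(raw: str) -> dict:
    """Parse model output and check required fields; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field {key!r}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} must be {expected_type.__name__}")
    return data


def generate_validated(call_model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, validate its output, and re-ask with the error on failure."""
    feedback = ""
    for _ in range(max_attempts):
        raw = call_model(prompt + feedback)
        try:
            return validate(raw)
        except ValueError as err:
            feedback = f"\nPrevious output was invalid: {err}. Return only valid JSON."
    raise RuntimeError(f"no valid output after {max_attempts} attempts")
```

Unlike grammar-constrained sampling, this approach only detects bad output after the fact, so it costs extra calls and can still fail; it is best seen as a fallback.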

Egress Controls / Output Guardrails

NeMo Guardrails uses the Colang DSL for programmable input / output / dialog rails. Guardrails AI is validator-centric, with RAIL / XML schemas and re-ask loops. Llama Guard 3 / 4 are fine-tuned classifiers that categorize content by the MLCommons taxonomy. Azure Prompt Shields targets indirect prompt injection and jailbreaks. April 2025 Attack Success Rate benchmarks (arXiv 2504.11168) show meaningful evasion gaps across all single-layer guardrails, so production patterns use layered defense (regex / Presidio PII redaction → small classifier → LLM judge → DLP egress proxy → tamper-evident audit log). The OWASP LLM Top 10 v2025 formalized LLM08:2025 Vector & Embedding Weaknesses and made output handling a first-class concern.
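Two of the layers in that chain, regex redaction and the tamper-evident audit log, can be sketched with the standard library. The patterns below are deliberately simplistic stand-ins for a tool such as Presidio, the classifier is a pluggable placeholder, and all names are hypothetical.

```python
import hashlib
import json
import re

# Simplified PII patterns for illustration; real deployments need broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Layer 1: replace e-mail addresses and US SSN-shaped numbers with tags."""
    return SSN.sub("[SSN]", EMAIL.sub("[EMAIL]", text))


class AuditLog:
    """Hash-chained log: each entry's hash covers the previous hash plus its payload."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._prev = self.GENESIS

    def append(self, record: dict) -> str:
        payload = json.dumps(record, sort_keys=True)
        digest = hashlib.sha256((self._prev + payload).encode("utf-8")).hexdigest()
        self.entries.append({"prev": self._prev, "payload": payload, "hash": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry breaks it."""
        prev = self.GENESIS
        for entry in self.entries:
            expected = hashlib.sha256((prev + entry["payload"]).encode("utf-8")).hexdigest()
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def egress(text: str, log: AuditLog, classifier=lambda t: False):
    """Redact, run the (placeholder) classifier, log the decision, then release or block."""
    cleaned = redact(text)
    blocked = bool(classifier(cleaned))
    log.append({"blocked": blocked, "chars": len(cleaned)})
    return None if blocked else cleaned
```

The hash chain only makes tampering detectable; it does not prevent it. In practice the latest hash would be anchored somewhere the writer cannot modify.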

Data Lineage And Provenance

OpenLineage (an LF AI & Data Foundation project) is the open standard; Marquez is its reference server, and DataHub ingests OpenLineage events natively. The community has extended OpenLineage facets to capture prompt → retrieval-set → model-version → tool-calls → output as a single run. Vendors layering on top include DataHub (AI asset model with model / dataset / feature linkage), Atlan, Monte Carlo, and Snowflake Horizon. The OpenLineage AI extensions and aiCatalog facet (2025) plug MCP resource URIs in as lineage inputs.
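To make the "MCP resource URIs as lineage inputs" idea concrete, the sketch below assembles a bare OpenLineage-style run event as a plain dictionary. The top-level fields (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`) follow the core OpenLineage event model; the namespaces, the producer URL, and the pinned schema version are assumptions, and no AI-specific facets are shown because their exact shape is not given here.

```python
import uuid
from datetime import datetime, timezone


def agent_run_event(job_name: str, mcp_resource_uris: list[str], output_name: str) -> dict:
    """Build a COMPLETE run event whose inputs are the MCP resources the agent read."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer identifier for the emitting agent runtime.
        "producer": "https://example.com/fabric-agent",
        # Assumed spec version; pin this to the OpenLineage release you target.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "fabric.agents", "name": job_name},
        # Each MCP resource URI becomes one input dataset ("mcp" namespace is made up).
        "inputs": [{"namespace": "mcp", "name": uri} for uri in mcp_resource_uris],
        "outputs": [{"namespace": "fabric.outputs", "name": output_name}],
    }


if __name__ == "__main__":
    event = agent_run_event("summarize_q3", ["file:///reports/q3.md"], "q3_summary")
    print(event["job"], event["inputs"])
```

Such an event would normally be posted to a lineage backend such as Marquez through an OpenLineage client rather than built by hand.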

Synthetic + Real Hybrid Data

Pre-training pipelines from 2024-2025 mix synthetic and real data by design (Raschka 2025). The hybrid approach wins in vertical / domain tasks: synthetic generation (LLM-authored Q&A, rule-based perturbations, simulators) augments scarce real data, while real samples anchor the distribution and prevent collapse (arXiv 2503.14023). For agent evaluation, the standard hybrid pattern combines synthetic trajectories generated by stronger models with PII-scrubbed real production traces.

Multi-Tenant Data Isolation

Five isolation layers (strongest to weakest): dedicated compute / storage → per-tenant Kubernetes namespace + KMS keys → logical row-level security → query-time filters → shared with ACL. For agent fabrics, four controls apply: per-tenant vector DB collections / namespaces, treated as mandatory under OWASP LLM08:2025; per-tenant MCP server instances or a tenant-scoped MCP gateway; Vault namespaces used as Vault-within-Vault; and tenant-context propagation attached server-side to every tool call (Blaxel, Prefactor).
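Server-side tenant-context propagation is the control most easily shown in code. In the sketch below, the dispatcher discards any `tenant_id` the model supplies and injects the one from the authenticated session. The `Session` type, the `search_docs` tool, and the in-memory document list are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Session:
    """Authenticated session established by the server, never by the model."""
    tenant_id: str
    user_id: str


# Toy shared store; a real system would use per-tenant collections or namespaces.
DOCS = [
    {"tenant_id": "acme", "text": "Acme invoice 42"},
    {"tenant_id": "globex", "text": "Globex invoice 7"},
]


def search_docs(query: str, *, tenant_id: str) -> list[str]:
    """Example tool: substring search restricted to a single tenant's documents."""
    return [d["text"] for d in DOCS
            if d["tenant_id"] == tenant_id and query.lower() in d["text"].lower()]


def call_tool(session: Session, tool: Callable[..., Any], model_args: dict) -> Any:
    """Dispatch a tool call with the tenant scope taken only from the session."""
    # Drop any tenant_id the model tried to pass; it is untrusted input.
    safe_args = {k: v for k, v in model_args.items() if k != "tenant_id"}
    return tool(**safe_args, tenant_id=session.tenant_id)


if __name__ == "__main__":
    session = Session(tenant_id="acme", user_id="u1")
    # The model attempts to cross tenants; the override is ignored.
    print(call_tool(session, search_docs, {"query": "invoice", "tenant_id": "globex"}))
```

Keeping the tenant scope out of the tool's model-visible arguments entirely is a further hardening step, since the model then has nothing to tamper with.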

Just-In-Time Retrieval

Anthropic's Effective Context Engineering (September 2025) pivots from "stuff the window" to "smallest set of high-signal tokens" via JIT retrieval: agents hold lightweight identifiers (file paths, query handles, URLs) and load data through tools only when needed. LangChain's Context Engineering for Agents recommends LangGraph long-term memory + langgraph-bigtool (semantic search over tool descriptions) for tool-heavy agents. Heuristic: load stable material under 20k tokens upfront; use JIT for everything large, mutable, or selectively relevant.
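The heuristic above can be expressed as a small planning function. This is a sketch under stated assumptions: token counts are pre-estimated, "selectively relevant" is not modeled, and the `Source` type and handles are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Threshold taken from the heuristic: stable material under 20k tokens loads upfront.
UPFRONT_TOKEN_LIMIT = 20_000


@dataclass
class Source:
    handle: str                 # lightweight identifier the agent can hold
    est_tokens: int             # pre-computed size estimate
    mutable: bool               # True if the content changes between runs
    loader: Callable[[], str]   # fetches the content when invoked


def plan_context(sources: list[Source]) -> tuple[list[str], dict[str, Callable[[], str]]]:
    """Split sources into content loaded now and handles resolved just in time."""
    upfront: list[str] = []
    deferred: dict[str, Callable[[], str]] = {}
    for source in sources:
        if not source.mutable and source.est_tokens < UPFRONT_TOKEN_LIMIT:
            upfront.append(source.loader())
        else:
            # Keep only the handle; the agent calls the loader through a tool later.
            deferred[source.handle] = source.loader
    return upfront, deferred
```

With this split, the prompt carries the small stable material plus a list of handles, and large or changing data enters the window only when a tool call asks for it.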

Selection Criteria

| Data Source | Binding Pattern | Ingestion Mode | Egress Controls | Cost Signal |
| --- | --- | --- | --- | --- |
| Transactional DB | MCP over read-replica + SQL tool with strict schema | JIT query | Row-level filter, PII redactor, query allowlist | $ |
| Doc corpus | RAG: contextual chunking + hybrid retrieval | Batch nightly; incremental on change-feed | Source-citation, license / DLP filter | $$ |
| Streaming events | Kafka topic → Flink Agent / Materialize | Continuous, event-triggered | Schema registry, rate limit, audit topic | $$$ |
| External REST / SaaS API | Managed MCP server (preferred) | JIT tool call, cached | OAuth 2.1 scoped tokens, response validator | $ per call |
| File system / object store | MCP Resources with URI refs | JIT load by path | Path allowlist, MIME / AV scan, redaction | $ |
| Data warehouse | Managed MCP + federated SQL / Cortex Agents | JIT semantic SQL | Column masking, tenant filter, query budget | $$ |
| Knowledge graph | GraphQL / Cypher tool | JIT traversal with depth cap | Relationship allowlist, output schema | $$ |
| Real-time market / pricing | gRPC stream → in-memory cache | Continuous push | TTL, staleness check, structured output | $$$ |
| User memory / preferences | Per-tenant vector namespace + KV | JIT by user / session key | Tenant scope check, encryption at rest | $ |
| Synthetic eval data | Offline pipeline → eval store | Batch generation, versioned | Provenance tag, hold-out enforcement | $ |

Cross-References