Data

Fabric Data defines how an agent fabric binds to external information: what data flows into the system, how it is curated and validated, what data flows out, and how every piece of data is governed, observed, and audited across the multi-agent runtime. Where Fabric › Setup commits the topology and identity of the fabric, the Data layer commits the data plane: the connectors, retrieval pipelines, streaming sources, schema enforcement, and egress controls that determine what the fabric is permitted to read, what it produces, and how that material is bound to lineage and provenance.

Fabric Data is informed by the active agentic data-plane research lineage, including the Model Context Protocol 2025-11-25 specification and the 2026 MCP roadmap, Anthropic Contextual Retrieval, Anthropic Effective Context Engineering just-in-time retrieval guidance, Anthropic Structured Outputs grammar-constrained sampling, NeMo Guardrails and Guardrails AI, Llama Guard and Azure Prompt Shield for egress, OpenLineage and Marquez for lineage, the OWASP LLM Top 10 2025 (notably LLM08:2025 Vector & Embedding Weaknesses), and emerging event-driven agent stacks built on Apache Kafka + Flink Agents and Confluent Streaming Agents. Selection criteria for data-binding patterns are documented in Reference › Research › Fabric Data.

What The Data Layer Configures

The Data layer is the contract between the fabric and the outside world. It commits four configuration surfaces.

flowchart LR
    DL[Data Layer] --> IN[Inbound Bindings]
    DL --> CT[Context Engine]
    DL --> SC[Schema Enforcement]
    DL --> OUT[Outbound Bindings]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class DL,IN,CT,SC,OUT stage

  • Inbound Bindings: connectors, MCP servers, RAG corpora, streaming sources, file systems, knowledge graphs.
  • Context Engine: what the fabric loads into the working window, how it is retrieved, how it is compacted.
  • Schema Enforcement: typed tool inputs and outputs, structured-output grammars, validator chains.
  • Outbound Bindings: egress connectors, output guardrails, side-effect approvals, audit and lineage emission.

Canonical Data-In / Data-Out Pipeline

A request enters the fabric at the Agent Gateway, which resolves tenant identity, mints a runtime audit ID, applies rate limits, and emits the first lineage event. The Context Engine assembles the working window from session memory plus just-in-time handles. The Orchestrator emits structured tool calls through the MCP Gateway, which fans out across four tenant-scoped lanes: RAG, streaming, enterprise, and tool / API. Every retrieved datum is tagged with an OpenLineage facet keyed to the run ID. The model emits a schema-constrained response that passes through a layered egress chain before returning to the caller.

flowchart LR
    REQ[Request] --> AG[Agent Gateway]
    AG --> CE[Context Engine]
    CE --> ORCH[Orchestrator]
    ORCH --> MCP[MCP Gateway]

    MCP --> R1[RAG Lane]
    MCP --> R2[Stream Lane]
    MCP --> R3[Enterprise Lane]
    MCP --> R4[Tool / API Lane]

    R1 --> CE
    R2 --> CE
    R3 --> CE
    R4 --> CE

    ORCH --> VAL[Schema + Structured Output]
    VAL --> GR[Egress Guardrails]
    GR --> DLP[PII / DLP Redactor]
    DLP --> AUD[Audit + Lineage Emitter]
    AUD --> RESP[Response]

    AG -. lineage .-> LIN[(OpenLineage)]
    AUD -. lineage .-> LIN

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class REQ,AG,CE,ORCH,MCP,R1,R2,R3,R4,VAL,GR,DLP,AUD,RESP,LIN stage

Inbound Bindings

The Data layer supports four canonical inbound binding shapes, each with its own setup parameters and governance posture. Bindings are declared per fabric and per agent, with the Setup layer authorizing which agent may use which binding under which scope.

MCP-Native Bindings

The platform's preferred binding shape is the Model Context Protocol. MCP servers expose three primitives: Tools (model-controlled actions), Resources (application-controlled, URI-addressed reads), and Prompts (reusable templates). Sampling additionally lets a server request LLM calls back through the client. The 2025-11-25 spec added async tasks, server-side agent loops, elicitation, Client ID Metadata Documents, enhanced sampling with tool-choice control, and an extensions system. The 2026 roadmap focuses on stateless streamable-HTTP transport, MCP Server Cards for discovery, and richer agent-to-agent coordination.

Enterprise vendors now ship managed MCP servers, including Snowflake Managed MCP, Databricks Managed MCP, and ServiceNow AI Agent Fabric, which provides MCP and A2A bridges to SAP, Salesforce, Snowflake, Databricks, and BigQuery using Zero-Copy Connectors. The platform delegates to a managed MCP server wherever one exists and ships its own only when none does.

REST / gRPC / GraphQL / Webhooks

When MCP is not available, the Data layer falls back to four binding shapes: REST / OpenAPI for default integrations, gRPC for low-latency internal services, GraphQL for typed graph traversal, and webhooks for push-based event ingest. Each binding is wrapped in an MCP-shaped tool declaration so the fabric sees a uniform tool surface regardless of underlying transport.
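
The wrapping step can be sketched as follows. This is a minimal illustration, assuming a hypothetical `get_order` REST endpoint at an invented internal URL; the declaration uses the `name`, `description`, and `inputSchema` fields of an MCP tool definition, while `build_request` and `call_tool` are illustrative helpers rather than platform APIs.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL and tool; both are assumptions made for illustration.
BASE_URL = "https://orders.example.internal/v1"

# MCP-shaped declaration: the fabric sees this regardless of the transport behind it.
GET_ORDER_TOOL = {
    "name": "get_order",
    "description": "Fetch a single order by ID from the orders REST API.",
    "inputSchema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
        "additionalProperties": False,
    },
}


def build_request(arguments: dict) -> urllib.request.Request:
    """Translate tool arguments into the underlying REST request."""
    # Percent-encode the ID so it cannot alter the URL path.
    order_id = urllib.parse.quote(arguments["order_id"], safe="")
    return urllib.request.Request(f"{BASE_URL}/orders/{order_id}", method="GET")


def call_tool(arguments: dict) -> dict:
    """Execute the REST call and return the decoded JSON body."""
    with urllib.request.urlopen(build_request(arguments), timeout=10) as resp:
        return json.load(resp)
```

A gRPC, GraphQL, or webhook binding would keep the same declaration shape and swap only the body of `call_tool`.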

RAG Corpora

For unstructured document corpora, the Data layer assembles a hybrid retrieval pipeline that combines Anthropic Contextual Retrieval with hybrid BM25 + dense retrieval, Reciprocal Rank Fusion, and a cross-encoder rerank. Contextual Retrieval prepends an LLM-generated 50-100 token contextual preamble to each chunk before embedding and BM25 indexing; Anthropic reports that this cuts the top-20 retrieval failure rate by 49% versus retrieval without contextualization, and by 67% when combined with a reranker. Late chunking (Jina) is supported when per-chunk LLM cost is prohibitive.

flowchart LR
    DOC[Document] --> CK[Chunking]
    CK --> PRE[Contextual Preamble]
    PRE --> EMB[Embed]
    PRE --> BM25[BM25 Index]
    EMB --> VDB[(Vector Store)]
    BM25 --> KW[(BM25 Index)]

    Q[Query] --> EMB2[Embed Query]
    EMB2 --> VDB
    Q --> BM25
    VDB --> RRF[Reciprocal Rank Fusion]
    KW --> RRF
    RRF --> RR[Cross-Encoder Rerank]
    RR --> TOP[Top-K Context]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class DOC,CK,PRE,EMB,BM25,VDB,KW,Q,EMB2,RRF,RR,TOP stage
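
The fusion step in the pipeline above can be illustrated with a short sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of document IDs and uses the commonly cited smoothing constant k = 60; the function name and the sample IDs are illustrative, not part of the platform.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["d1", "d2", "d3"]   # hypothetical vector-store ranking
bm25_hits = ["d3", "d1", "d4"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Because RRF uses only ranks, it needs no score normalization between the dense and BM25 retrievers; the fused list is what the cross-encoder reranker would then reorder.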

Vector-store selection is workload-dependent: pgvector + pgvectorscale for Postgres-resident workloads under roughly 50M vectors, Qdrant for filtered hybrid retrieval, Weaviate for selective-filter workloads with ACORN, Pinecone for fully-managed convenience, Turbopuffer for object-store-backed indices when cold latency is acceptable.

Streaming Sources

For event-driven workloads, the Data layer binds Kafka, Kinesis, or Pulsar topics directly to agents. Modern stacks combine Apache Kafka with Flink Agents (FLIP-531) and Confluent Streaming Agents so agents are condition-triggered; they fire on stream events rather than on prompts. Materialize, RisingWave, Timeplus, and Fluss provide point-in-time SQL over the same streams for in-flight retrieval.

Context Engine

The Context Engine sits between the fabric and the model and decides what enters the working window each turn. It enforces the just-in-time retrieval discipline recommended in Anthropic Effective Context Engineering and described in Emergence › State: the agent holds lightweight handles (file paths, query handles, URLs) and loads data through tools only when needed. The heuristic is to load upfront only stable, high-relevance material under roughly 20k tokens (style guides, schemas, identity, current task state) and to retrieve everything else just in time.
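
One way to picture the heuristic is as a simple partition of candidate material. The sketch below is an assumption-laden illustration: the `Candidate` fields and `plan_window` are hypothetical, and it treats the 20k-token figure as a total upfront budget, which the text does not specify.

```python
from dataclasses import dataclass

UPFRONT_TOKEN_BUDGET = 20_000  # assumption: 20k is a total budget, not per item


@dataclass
class Candidate:
    handle: str            # file path, query handle, or URL
    tokens: int            # estimated size if loaded into the window
    stable: bool           # rarely changes during a session
    high_relevance: bool   # needed on most turns


def plan_window(candidates):
    """Split candidates into upfront-loaded handles and just-in-time handles."""
    upfront, jit = [], []
    remaining = UPFRONT_TOKEN_BUDGET
    for c in candidates:
        if c.stable and c.high_relevance and c.tokens <= remaining:
            upfront.append(c.handle)
            remaining -= c.tokens
        else:
            # Kept as a handle only; loaded through a tool when needed.
            jit.append(c.handle)
    return upfront, jit


upfront, jit = plan_window([
    Candidate("style_guide.md", 3_000, stable=True, high_relevance=True),
    Candidate("orders_query", 500, stable=False, high_relevance=True),
    Candidate("archive.pdf", 50_000, stable=True, high_relevance=True),
])
```

Here only the style guide is loaded upfront; the volatile query result and the oversized archive stay as handles.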

Schema Enforcement

Every tool input and every tool output is schema-validated. The platform defaults to grammar-constrained sampling using Anthropic Structured Outputs (output_config.format for response shape and tools[].strict=true for guaranteed tool-input compliance) or OpenAI structured outputs (response_format={"type":"json_schema","json_schema":{"name":...,"schema":...,"strict":true}}). Both compile JSON Schema into a token-level grammar, so schema conformance is enforced during decoding rather than requested in the prompt. For workloads outside vendor coverage, the Data layer supports open constrained decoding via Outlines, XGrammar, or llguidance, plus retry-with-validation wrappers like Instructor or Guardrails AI RAIL specs.

flowchart LR
    TC[Tool Call] --> IS[Input Schema Validator]
    IS -->|pass| EX[Execute Tool]
    IS -->|fail| RE[Retry With Hint]
    EX --> OS[Output Schema Validator]
    OS -->|pass| R[Result to Agent]
    OS -->|fail| RE

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class TC,IS,EX,OS,RE,R stage
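
The validate, execute, validate, retry loop in the diagram can be sketched in a few lines. This is a simplified illustration: `validate` checks only required keys and primitive types, a small subset of JSON Schema chosen so the example has no dependencies, and `run_tool`, `propose_call`, and `execute` are hypothetical names. A real deployment would use a full JSON Schema validator.

```python
def validate(payload, schema):
    """Return a list of error strings; an empty list means the payload passes."""
    types = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    errors = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field '{key}'")
    for key, spec in schema.get("properties", {}).items():
        if key in payload and not isinstance(payload[key], types[spec["type"]]):
            errors.append(f"field '{key}' must be of type {spec['type']}")
    return errors


def run_tool(propose_call, execute, input_schema, output_schema, max_attempts=3):
    """Validate input, execute, validate output; on any failure retry with a hint."""
    hint = None
    for _ in range(max_attempts):
        args = propose_call(hint)          # the model proposes arguments, given the hint
        errors = validate(args, input_schema)
        if not errors:
            result = execute(args)
            errors = validate(result, output_schema)
            if not errors:
                return result
        hint = "; ".join(errors)           # fed back to the model on the next attempt
    raise ValueError(f"schema validation failed after {max_attempts} attempts: {hint}")
```

Bounding the retries keeps a persistently malformed call from looping forever; the final error carries the last validation hint for the audit trail.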

Outbound Bindings

Outbound bindings define what data the fabric is permitted to send out, whether to connectors, downstream agents, users, audit sinks, or lineage stores. The Data layer enforces a layered egress chain that combines structured-output validation, guardrails, PII / DLP redaction, side-effect approval, and audit emission.

Layered Egress

April 2025 benchmark research (arXiv:2504.11168) shows substantial Attack Success Rate gaps across single-layer guardrails; production-grade fabrics layer multiple defenses rather than relying on any one. The recommended stack is regex / Presidio PII redaction → small classifier → LLM judge → DLP egress proxy → tamper-evident audit log. Vendor-specific layers include NeMo Guardrails (Colang DSL programmable rails), Guardrails AI (validator-centric with re-ask), Llama Guard 3 / 4 (fine-tuned classifier under the MLCommons taxonomy), and Azure Prompt Shield (indirect prompt-injection and jailbreak).

flowchart LR
    MO[Model Output] --> SC[Structured Output Validator]
    SC --> CL[Classifier / Llama Guard]
    CL --> JG[LLM Judge]
    JG --> RD[PII / DLP Redactor]
    RD --> APP{Side Effect?}
    APP -->|yes| HR[Human Approval]
    APP -->|no| AU[Audit + Lineage Emit]
    HR --> AU
    AU --> OUT[Egress]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class MO,SC,CL,JG,RD,APP,HR,AU,OUT stage
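
A heavily reduced sketch of such a chain is shown below, following the recommended stack order. It is a toy under stated assumptions: a single e-mail regex stands in for Presidio-style redaction, a phrase blocklist stands in for a trained classifier, the LLM judge and DLP proxy are omitted, and a plain list stands in for a tamper-evident audit log. All names are hypothetical.

```python
import hashlib
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BLOCKLIST = ("ignore previous instructions",)  # stand-in for a real classifier


def classify(text):
    """Toy classifier: 'unsafe' if any blocklisted phrase appears."""
    lowered = text.lower()
    return "unsafe" if any(phrase in lowered for phrase in BLOCKLIST) else "safe"


def egress(text, audit_log):
    """Redact, classify, and audit; return the redacted text or None if blocked."""
    redacted, redactions = EMAIL.subn("[REDACTED_EMAIL]", text)
    verdict = classify(redacted)
    # Every egress attempt is recorded, including blocked ones.
    audit_log.append({
        "content_sha256": hashlib.sha256(redacted.encode("utf-8")).hexdigest(),
        "verdict": verdict,
        "redactions": redactions,
    })
    return redacted if verdict == "safe" else None


log = []
print(egress("Contact bob@example.com for access.", log))
# Contact [REDACTED_EMAIL] for access.
```

The point of the layering is that each stage catches what the previous one misses, and the audit record is written whether or not the output is released.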

Side-Effect Approval

State-mutating actions (writes, sends, payments, ticket creations) pass through an explicit approval gate as part of egress. The gate is configured per connector class in Setup and routed to human-in-the-loop, supervisor sub-unit, or rule-based auto-approval based on the workload's risk profile.
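
The routing decision can be sketched as a lookup keyed by connector class. The policy table below is hypothetical (real values would come from Setup), and falling back to human approval for unknown connector classes is an assumption made here, not something the platform is documented to do.

```python
from enum import Enum


class Route(Enum):
    AUTO = "rule_based_auto_approval"
    SUPERVISOR = "supervisor_sub_unit"
    HUMAN = "human_in_the_loop"


# Hypothetical per-connector-class policy, ordered by increasing risk.
POLICY = {
    "ticketing": Route.AUTO,
    "email": Route.SUPERVISOR,
    "payments": Route.HUMAN,
}


def route_side_effect(connector_class, mutates_state):
    """Return the approval route, or None when the action needs no approval."""
    if not mutates_state:
        return None  # read-only actions skip the gate
    # Assumption: unknown connector classes get the strictest route.
    return POLICY.get(connector_class, Route.HUMAN)
```

For example, `route_side_effect("payments", True)` yields `Route.HUMAN`, while a read-only call on the same connector returns `None`.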

Multi-Tenant Isolation

Agent fabrics frequently serve multiple tenants. The Data layer enforces five isolation layers in priority order: dedicated compute and storage for the most sensitive tenants; a per-tenant Kubernetes namespace plus KMS keys for the standard tier; logical row-level security on shared stores; query-time filters; and shared storage with ACLs, used only for low-sensitivity public data. The platform treats per-tenant vector-DB collections or namespaces as mandatory, in line with OWASP LLM08:2025 Vector & Embedding Weaknesses. Tenant context is attached server-side to every tool call and retrieval query; tenant IDs from the model are never trusted.

flowchart TD
    REQ[Tenant Request] --> AG[Agent Gateway]
    AG --> TID[Resolve Tenant Server-Side]
    TID --> VAULT[Tenant-Scoped Vault]
    TID --> NS[Tenant Vector Namespace]
    TID --> RLS[Row-Level Security]
    TID --> KMS[Tenant KMS Key]

    classDef stage fill:#ffd541,stroke:#222021,color:#222021
    class REQ,AG,TID,VAULT,NS,RLS,KMS stage
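
The rule that model-provided tenant IDs are never trusted can be sketched as a scoping step applied to every tool call before dispatch. The function name, the call shape, and the `tenant-<id>` namespace format are assumptions for illustration.

```python
def scope_tool_call(tool_call, session_tenant):
    """Return a copy of the call with tenant context set by the server.

    Any tenant_id the model placed in the arguments is dropped; the value
    resolved at the gateway for this session is attached instead.
    """
    args = {
        key: value
        for key, value in tool_call.get("arguments", {}).items()
        if key != "tenant_id"
    }
    return {
        "name": tool_call["name"],
        "arguments": args,
        "tenant_id": session_tenant,                      # server-resolved
        "vector_namespace": f"tenant-{session_tenant}",   # hypothetical naming scheme
    }


model_call = {"name": "search", "arguments": {"query": "invoices", "tenant_id": "globex"}}
scoped = scope_tool_call(model_call, session_tenant="acme")
```

Even though the model asked for tenant `globex`, the scoped call carries `acme`, so a prompt-injected tenant ID cannot redirect retrieval to another tenant's namespace.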

Data Lineage And Provenance

Every data event the fabric reads or writes is emitted as an OpenLineage event keyed to the runtime audit ID. Lineage facets capture prompt, retrieval set, model version, tool calls, and output as a single run, so operators can answer "which document influenced this decision?" and "which downstream system received this output?" without re-running. Marquez or DataHub consumes the events; OpenLineage AI extensions and the 2025 aiCatalog facet are the integration surface for MCP resource URIs as lineage inputs.
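
As a rough sketch, a lineage event for one run can be built as a plain dictionary following the core fields of an OpenLineage run event (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). The producer URI, the `fabric` namespace, the job name, and the dataset names are hypothetical, and the `schemaURL` value is an assumption that should be pinned to the spec version actually in use. Custom facets for prompt, model version, and tool calls are omitted.

```python
import json
import uuid
from datetime import datetime, timezone

PRODUCER = "https://example.com/fabric-data"  # hypothetical producer URI
# Assumption: adjust to the OpenLineage spec version deployed.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def lineage_event(audit_id, job_name, input_names, output_names, event_type="COMPLETE"):
    """Build a run event keyed to the runtime audit ID (expected to be a UUID)."""
    def datasets(names):
        return [{"namespace": "fabric", "name": name} for name in names]

    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": audit_id},
        "job": {"namespace": "fabric", "name": job_name},
        "inputs": datasets(input_names),     # e.g. retrieved documents or MCP resource URIs
        "outputs": datasets(output_names),   # e.g. downstream systems that received output
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL,
    }


audit_id = str(uuid.uuid4())
event = lineage_event(audit_id, "answer_ticket", ["kb/refund-policy.md"], ["ticketing/reply"])
payload = json.dumps(event)  # what would be sent to Marquez or DataHub
```

Listing retrieved documents as inputs and receiving systems as outputs is what lets an operator answer both provenance questions from the stored event alone.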

Selection Criteria

The platform selects a binding pattern per data source by reading the data shape (transactional, doc corpus, stream, API, file, graph) and matching against the table below. Detailed citations and tradeoffs are in Reference › Research › Fabric Data.

| Data Source | Binding Pattern | Ingestion Mode | Egress Controls |
| --- | --- | --- | --- |
| Transactional DB (Postgres / MySQL) | MCP over read-replica + SQL tool | JIT query, no preload | Row-level filter, PII redactor |
| Doc corpus (PDF / MD / HTML) | RAG: contextual chunking + hybrid retrieval | Batch + change-feed incremental | Source-citation enforcement, license / DLP filter |
| Streaming events | Kafka topic → Flink Agent / Materialize | Continuous, event-triggered | Schema registry, rate limit, audit topic |
| External REST / SaaS API | Managed MCP server (preferred) | JIT tool call, cached | OAuth 2.1 scoped tokens, response validator |
| File system / object store | MCP Resources with URI refs | JIT load by path | Path allowlist, MIME / AV scan, redaction |
| Data warehouse (Snowflake / Databricks / BQ) | Managed MCP + federated SQL | JIT semantic SQL | Column masking, tenant filter, query budget |
| Knowledge graph | GraphQL / Cypher tool | JIT traversal with depth cap | Relationship allowlist, output schema |
| Real-time market / pricing | gRPC stream → in-memory cache | Continuous push | TTL, staleness check, structured output |
| User memory / preferences | Per-tenant vector namespace + KV | JIT by user / session key | Tenant scope check, encryption at rest |
| Synthetic eval data | Offline pipeline → eval store | Batch generation, versioned | Provenance tag, hold-out enforcement |

Governance Of Data

Data is the surface where the fabric most directly contacts regulated information. Every binding records its scope, authorization, and lineage. Every retrieval records its query, result set, and ranking signals. Every egress records its content hash, classifier verdict, redactions applied, and approval chain.

Data Governance Requirements

  • Every inbound binding is scoped per agent and tenant; default-deny applies to anything not explicitly authorized.
  • Tenant identity is resolved server-side; model-provided tenant IDs are never trusted.
  • Every retrieval is logged with query, result set, and rerank scores.
  • Schema validation is mandatory on tool inputs and outputs.
  • Egress passes a layered guardrail chain; single-layer defenses are not sufficient.
  • State-mutating side effects pass an explicit approval gate.
  • Every data event emits an OpenLineage facet keyed to the runtime audit ID.

Cross-References