The Hidden Risk in AI Products: Fragmented Enterprise Knowledge


Enterprises everywhere are racing to adopt generative AI, from Microsoft 365 Copilot to ChatGPT Enterprise and beyond. These tools promise productivity gains, but quietly demand full access to your internal knowledge.

The takeaway: If leaders don’t influence how knowledge flows inside the enterprise, every AI tool will define it for them, and you’ll pay in risk, cost, and control.

The cost of doing nothing? Your data gets fragmented across tools you don’t control. Vendors gain leverage. Governance breaks. Engineering teams duplicate effort. And your AI becomes weaker, not smarter. Without a clear knowledge strategy, you’ll spend more, move slower, and lose control over the very context AI needs to work.

Status quo without strategy: each AI platform builds its own bridge to the same enterprise tools, resulting in redundant effort, fractured governance, and security headaches.

The Hidden Pattern: Parallel Embedding Silos

Here’s what’s happening behind the scenes:

  • AI tools rely on indexing your internal content - SharePoint docs, emails, Confluence pages, CRM records - and converting them into embeddings, mathematical representations capturing the meaning of information.
  • Each vendor does this separately, resulting in multiple parallel embedding silos across your enterprise.

This seemingly harmless duplication creates real problems:

  • Redundant ingestion: Multiple tools re-process identical content, wasting resources.
  • Security exposure: Sensitive information ends up duplicated across numerous external databases.
  • Inconsistency: Variations in freshness, scope, and versions of your knowledge emerge.
  • Loss of control: You lose visibility into where your enterprise knowledge resides.

Suddenly, the truth about your business isn’t just distributed, it’s fragmented.

Embeddings and Vector Search: Meaning over Keywords

To truly appreciate why AI-driven search is such a big deal, let’s first demystify how it fundamentally differs from traditional search.

Traditional search is straightforward: you type keywords, and the search engine matches those keywords directly with content. It’s like looking up terms in a book’s index. You find exact matches, but nothing deeper. AI-powered search, on the other hand, matches meaning.

Embeddings let AI understand the meaning behind content, not just the exact words used. They organize information based on similarity in meaning, so two pieces of content that say the same thing in different ways still end up connected. This is how AI finds what’s relevant even when the keywords don’t match.

Data needs to be converted into embeddings by a specialized embedding model to make it useful for generative AI.

When AI searches content, it first needs this map. To build it, the AI “catalogues” or indexes content by turning everything into embeddings ahead of time. Every document, email, webpage, or conversation that you’d want AI to consider is converted into these numerical representations. Later, when you ask a question, the AI converts your query into an embedding too, placing your question onto the same meaning map.

Here’s where the magic (more accurately, math) comes into play: AI uses mathematical techniques like calculating vector distances to find the closest matches between your query embedding and the pre-cataloged content embeddings. By not relying directly on keywords, AI can recognize that two documents are related in meaning even if they don’t share any keywords, or are completely different modalities (audio, images, etc).
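
To make the math concrete, here is a minimal sketch of vector-distance matching. The document names and the tiny 3-dimensional vectors are invented for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
from math import sqrt

def cosine_similarity(a, b):
    """1.0 = same direction (same meaning); near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm

# Toy content embeddings (hypothetical documents, hand-made vectors).
index = {
    "pto_policy":   [0.9, 0.1, 0.0],   # "paid time off policy"
    "vacation_faq": [0.8, 0.2, 0.1],   # "how to request vacation days"
    "build_logs":   [0.0, 0.1, 0.9],   # "CI build failure logs"
}

# Query: "how many days off do I get?" -- no keyword overlap with "pto_policy".
query = [0.85, 0.15, 0.05]

ranked = sorted(index, key=lambda doc: cosine_similarity(query, index[doc]),
                reverse=True)
print(ranked)  # the two leave-related docs outrank the unrelated build logs
```

Note that the query shares no keywords with “pto_policy”, yet it ranks first: the match happens in meaning space, not word space.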

The query is converted into an embedding and used to find relevant documents on the meaning map.

However, this powerful technique comes with complexities: The AI must catalog all potentially useful content in advance. This indexing is a heavy, ongoing process, as content continuously updates.

Embeddings are model-specific. GPT, Gemini, Claude, and others each generate embeddings differently, and even model updates can shift that behavior. This specificity directly impacts how well AI understands your content. Critically, your query embedding must match the embedding model used for cataloging. If your content was catalogued using Gemini, GPT models wouldn’t correctly interpret its embeddings, causing retrieval failures or inaccuracies.
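
One defensive pattern, sketched here with invented model identifiers rather than any vendor’s actual API, is to record the embedding model’s identity alongside the index and reject queries embedded with a different model:

```python
class VectorIndex:
    """Toy vector index that remembers which embedding model built it."""

    def __init__(self, model_id: str):
        self.model_id = model_id          # e.g. "gemini-embedding@2024-05"
        self.entries = {}                 # doc_id -> vector

    def add(self, doc_id: str, vector: list):
        self.entries[doc_id] = vector

    def search(self, query_vector: list, query_model_id: str):
        # Vectors from different models live in different "meaning maps";
        # comparing them silently produces garbage, so fail loudly instead.
        if query_model_id != self.model_id:
            raise ValueError(
                f"index built with {self.model_id!r}; "
                f"query embedded with {query_model_id!r}")
        # ... nearest-neighbour search over self.entries would go here ...
        return list(self.entries)

index = VectorIndex(model_id="gemini-embedding@2024-05")
index.add("doc-1", [0.1, 0.2, 0.3])

try:
    index.search([0.1, 0.2, 0.3], query_model_id="gpt-embedding@2024-05")
except ValueError as err:
    print("rejected:", err)
```

Failing loudly is the point: a silent cross-model comparison would return plausible-looking but meaningless matches.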

The Data Challenge: Why Indexing is Harder Than it Sounds

AI search requires upfront, consistent cataloging of your data into embeddings using a specific AI model. Every email, document, SharePoint file, or wiki page must be translated into vectors, even though these systems were never designed for AI consumption. Since embeddings are model-specific, queries must use the same embedding model used for indexing. This technical detail creates a strategic risk: every vendor is building its own meaning map, leading to fragmented silos and unreliable enterprise search.

Traditional enterprise data warehouses were optimized for human-driven reporting and analytics, summarizing and trimming data for easier consumption by people. However, this trimming removes the nuanced context and detailed meaning that AI relies on to be effective. In other words, data prepared for human eyes might be insufficient or even misleading for AI embedding.

Additionally, critical day-to-day data like emails, instant messages, and collaborative documents have rarely been considered strategic assets. They now form the backbone of many powerful AI-driven workflows. Yet, current AI models require extremely broad, nearly god-level access to your enterprise’s entire data landscape to catalog it effectively.

For example, for Microsoft 365 Copilot, admins must grant application-level consent to access files and email data, not just individual user accounts. Google Gemini Enterprise similarly requires enterprise-wide permissions.

Each AI vendor wants to independently perform this cataloguing and embedding, meaning that multiple external providers each hold a full copy of your internal knowledge.

Further complicating matters, these embeddings aren’t a one-time process. Whenever new content arrives or existing content changes - think daily emails, updated documents, or collaborative edits - the embedding catalogue must be continuously refreshed. Managing this ongoing ingestion across multiple siloed tools creates significant overhead, security risks, and governance challenges.
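
One common way to bound that refresh overhead is change detection: hash each document’s content, compare against the hashes recorded at the last indexing run, and re-embed only what is new or changed. A minimal sketch, with invented document names:

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(current_docs: dict, indexed_hashes: dict) -> list:
    """Return ids of documents that are new or changed since the last run."""
    return sorted(doc_id for doc_id, text in current_docs.items()
                  if indexed_hashes.get(doc_id) != content_hash(text))

# Hashes recorded at the previous indexing run (hypothetical documents).
indexed = {
    "handbook.docx": content_hash("Old vacation policy: 20 days."),
    "roadmap.md":    content_hash("Q3 roadmap draft."),
}

# Current state: handbook edited, roadmap untouched, one new email thread.
current = {
    "handbook.docx": "New vacation policy: 25 days.",
    "roadmap.md":    "Q3 roadmap draft.",
    "thread-42.eml": "Re: budget approval ...",
}

print(docs_to_reembed(current, indexed))  # ['handbook.docx', 'thread-42.eml']
```

Each AI vendor doing its own cataloguing must run an equivalent of this loop independently, which is exactly the duplicated ingestion cost described above.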

The Governance and Permissioning Minefield

Granting AI tools comprehensive, god-level access to your internal systems isn’t just technically demanding - it creates profound governance challenges. Every piece of enterprise data has unique access permissions, often highly nuanced and complex, based on user roles, team memberships, and organizational hierarchies.

For example, permissions in systems like Google Drive or SharePoint can vary down to individual files and folders. Replicating these permissions accurately across external AI embedding stores is both technically complex and prone to error. Currently, most AI systems either try to mirror these permissions asynchronously and imperfectly or perform permission checks at query time - again, often imperfectly.

Real-world incidents illustrate how significant this issue is: prompt injection vulnerabilities reported in Microsoft 365 Copilot and Slack AI show that current AI systems can inadvertently expose sensitive information due to poorly replicated access controls or flawed permission logic. These incidents typically result from:

  • Embedding sensitive data that should have been excluded.
  • Out-of-sync or overly broad permissions logic.
  • Vulnerabilities in the AI prompt logic that trick the system into exposing confidential data.

There is no robust, standardized approach today for AI tools to enforce access controls consistently across federated systems like SharePoint, Drive, or Slack. Each vendor maintains its own embedding store, attempting to mirror permissions and enforce ACLs with varying degrees of success and significant security risk.

Ultimately, this isn’t merely a series of isolated bugs - it’s a fundamental design flaw in how AI systems handle enterprise knowledge today. Enterprises must address these architectural gaps proactively to safely harness AI’s full potential.

A Practical Framework for Building Your Knowledge Strategy

To avoid the fragmentation and governance traps of today’s AI tools, enterprises must take ownership of their knowledge layer. Here’s a practical framework for those leading AI, product, or data strategy to guide their approach:

  • Inventory: Catalog all knowledge assets (emails, docs, tickets, contracts, chats, etc.), including where they live, who owns them, and how frequently they change.
  • Ownership: Define a clear ownership model. Will data domains be managed by distributed teams (à la data mesh)? Or will a centralized knowledge platform team maintain the core pipelines, versioning, and semantic models? Ownership clarity prevents entropy.
  • Integration: Many internal systems (especially legacy ones) won’t fit cleanly into modern paradigms. Create adapters, wrappers, or APIs that make them searchable and safe to expose to AI tools. If you don’t build the bridge, vendors will build and own the interface.
  • Infrastructure: Decide how AI agents and copilots will access knowledge: centralized store, federated protocol (like MCP), or hybrid. Build consistent interfaces and ensure permission logic travels with the query, not just the data.
  • Governance: Inject Responsible AI gates early. Bake in observability, auditability, ACL enforcement, and red-teaming at the retrieval layer. Security, fairness, and privacy must be native, not bolted on.

Navigating this high-stakes, rapidly evolving landscape requires more than adopting the latest AI tool; it demands architectural foresight. Enterprise vendors are racing to position themselves at the center of the AI ecosystem, and control over internal knowledge is becoming a key strategic lever. Platforms like Microsoft SharePoint, Google Workspace, and Salesforce are unlikely to fully embrace open integration in the near term. If a vendor can become the default gateway to an organization’s internal knowledge, they hold disproportionate power over retrieval, the layer on which copilots, agents, and workflows thrive. Opening up too soon means losing that competitive edge.

And yet, no single platform can deliver a complete AI experience in isolation. Most organizations rely on a constellation of tools, each housing fragments of institutional knowledge. As a result, enterprises will need to think carefully about how they expose and retrieve this knowledge in a way that balances interoperability, security, and control. While no standard architecture has emerged, most paths forward will likely resemble one of a few foundational patterns, each with distinct tradeoffs and long-term implications.

Centralized Knowledge Store

Centralized embeddings, federated flexibility: this architecture puts the enterprise back in charge. Index once, use anywhere.

A centralized embedding store aggregates all your enterprise data in a system owned and managed by the enterprise, alongside an embedding model that third-party tools can use to interact with the embedded data.

Advantages

  • Complete control and governance over indexing, versioning, permissions, and storage.
  • Reduced vendor lock-in and greater flexibility to experiment with new AI tools safely.
  • Minimized risk, as you avoid giving external vendors unrestricted access.

Disadvantages

  • Requires substantial initial investment and potentially high ongoing costs.
  • Vendors may resist adopting this architecture until market pressures force their hand, though managed services from hyperscalers (AWS, Google, etc.) using open protocols could eventually alleviate the lock-in.

Implementation

  • The enterprise stands up a central embedding pipeline: data from email, docs, chat, ERP, CRM, etc. is continuously ingested, chunked, embedded, and stored in a vector database.
  • This database is enriched with metadata and ACLs (user, department, clearance level), allowing permission-aware search.
  • The embedding model (e.g., OpenAI Ada, Cohere, or internal) is exposed via an API for query embedding, so tools like copilots or agents can embed the user’s prompt in the same space as the content.
  • External tools don’t index content themselves; they submit semantic queries to the centralized API (embedding + search call).
  • ACL rules from upstream systems are kept up to date in near real time, allowing query-time ACL enforcement.
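
The query-time ACL check in that last step could look roughly like this. It is a toy sketch with invented document names and group labels; real enforcement would also need deny rules, auditing, and synchronization with the upstream systems:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Centralized index: each entry carries its vector plus an ACL of groups.
index = [
    {"doc": "salaries.xlsx", "vec": [0.9, 0.1], "acl": {"hr", "finance"}},
    {"doc": "handbook.pdf",  "vec": [0.8, 0.2], "acl": {"all-staff"}},
    {"doc": "press-kit.zip", "vec": [0.1, 0.9], "acl": {"all-staff"}},
]

def permission_aware_search(query_vec, user_groups, top_k=2):
    """Filter by ACL *before* ranking, so unauthorized content never scores."""
    visible = [e for e in index if e["acl"] & user_groups]
    visible.sort(key=lambda e: cosine(query_vec, e["vec"]), reverse=True)
    return [e["doc"] for e in visible[:top_k]]

# An engineer asking about compensation gets the handbook, never the salaries
# sheet, because the filter runs before similarity ranking.
print(permission_aware_search([0.85, 0.15], {"engineering", "all-staff"}))
```

Filtering before ranking matters: if unauthorized chunks are scored first and filtered later, their presence can still leak through result counts or pagination behavior.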

Fully Federated Approach (MCP or similar protocol)

Standardized routing over centralization: a federated approach where systems remain the system of record, and AI interfaces query through a unified abstraction layer.


Another path involves adopting a standardized, API-driven protocol like MCP to act as a federated integration layer, allowing systems to expose their data securely while enforcing native permission controls.

Advantages

  • Facilitates bridging legacy and other systems not yet adapted to AI, since protocols like MCP make it easy to wrap existing APIs.
  • Simplified integration when leveraging protocols that are becoming industry standard, like MCP.
  • Can leverage resource-authorization logic already built into existing APIs.

Disadvantages

  • Search capabilities may be limited: queries typically remain plain text, with the source system controlling embedding and search algorithms.
  • Advanced AI search techniques available to dedicated embedding stores may not be feasible in this federated model.

Implementation

  • Each system (Salesforce, Jira, Workday, etc.) implements an MCP-compatible server allowing raw text queries or specific data requests.
  • Each request includes a token representing the user, which the source system uses to authorize the resources the user has access to and filter results with its built-in ACL logic.
  • The embedding, chunking, and retrieval logic stay inside the source system, rather than being centralized.
  • The calling AI agent never accesses raw documents - only what’s authorized via the exposed API.
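
In spirit, each source system’s handler behaves like the sketch below. This is not the actual MCP wire format; the token scheme, record shapes, and keyword search are all invented to show the control flow:

```python
# Toy federated handler: the source system resolves the caller's token,
# applies its own ACLs, runs its own search, and returns only snippets.
USERS_BY_TOKEN = {"tok-alice": "alice", "tok-bob": "bob"}   # invented tokens

RECORDS = [
    {"id": "TKT-1", "snippet": "Refund request for order 8812"},
    {"id": "TKT-2", "snippet": "Refund policy exception for VIP account"},
    {"id": "TKT-3", "snippet": "Password reset for bob"},
]

ACL = {  # record id -> users allowed to read it (the system's native permissions)
    "TKT-1": {"alice"},
    "TKT-2": {"alice", "bob"},
    "TKT-3": {"bob"},
}

def handle_query(query: str, user_token: str) -> list:
    user = USERS_BY_TOKEN.get(user_token)
    if user is None:
        raise PermissionError("unknown token")
    # The source system keeps control of search; here, a plain keyword match.
    return [r["snippet"] for r in RECORDS
            if user in ACL[r["id"]] and query.lower() in r["snippet"].lower()]

print(handle_query("refund", "tok-bob"))   # bob sees only the TKT-2 snippet
```

The design choice to highlight: authorization and search both stay inside the system of record, so the calling agent never needs the god-level access that centralized indexing demands.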

Conclusion

While centralized and MCP-style architectures offer clear starting points, most enterprises will likely evolve toward some form of hybrid model. Some systems may expose their own embedding stores and models via APIs, enabling external tools to retrieve semantically indexed content directly - a “bring-your-own-retriever” approach. This can unlock domain-specific optimization, while maintaining some decoupling.

We’re still early in the enterprise AI journey. The landscape is moving fast: tools, protocols, models, and standards are all in flux. This is not the time to rigidly commit to a single architecture. Instead, design for flexibility and adaptability. We don’t yet know how the AI-native enterprise stack will settle, much like no one could have sensibly planned a five-year e-commerce strategy in 1997.

But just because we can’t predict everything doesn’t mean we should build blindly. Some architectural truths are already clear and worth designing for today:

  • Embedding infrastructure is foundational. Even if you federate access, you’ll almost certainly need internal embedding pipelines to support internal development and use cases. Consider this an inevitable cost center, one that pays dividends when approached strategically.
  • Communication data will require its own strategy. Email, chat, and messaging data introduce unique challenges: they’re highly sensitive, context-specific, and often user-specific in terms of what should be retrieved and by whom. Traditional data governance patterns don’t apply cleanly here; you’ll need new approaches that respect personal boundaries, enforce strict access controls, and make retrieval aware of social context.
  • Platform decoupling will remain critical. AI tools, embedding infrastructure, and storage systems should evolve independently. Avoid tightly coupling retrieval logic to any one vendor.
  • Governance and security require upfront design. Access control, observability, and compliance shouldn’t be patched in later. They must be native to your retrieval architecture from day one.

The most resilient organizations will be those that invest in adaptable architecture today, respecting what’s likely to change but building intelligently around what won’t.
