The Diversity of Data Sources for RAG in the Enterprise

Dave Cliffe

Head of RAG (Rendering AI Guidance) at Atolio

Introduction

Today we continue with our series on the challenges of building an internal RAG system for leveraging your enterprise data.

As we noted before, a good search and answer system must draw on the right sources of data if there’s any hope of providing meaningful results. One luxury of modern business systems is that many provide an API for accessing all of this data. However, dealing with all these sources comes with tradeoffs.

Here we’ll focus on the challenges of acquiring data from these systems. Many questions will arise as your team works through these challenges. At each step your engineers and data scientists will find themselves further away from the central aim of leveraging your data for business value.

Source Diversity Means API Diversity

The existence of APIs for Microsoft SharePoint, Google Docs, GitHub, Slack, and others has unlocked new possibilities for leveraging company data. It’s been a real boost for the AI and enterprise search domains. However, it takes only a bit of research to begin seeing the diversity of these APIs.

The first thing you’ll notice is the variety of API approaches. Often you’ll find a traditional REST API. More recently we’ve seen GraphQL and even gRPC APIs for accessing these systems. We’re not going to mention SOAP, right? Each of these fundamental approaches requires a bit of software engineering expertise, and each presents a variety of gotchas when working at scale.

The diversity of fundamentals also leads to a diversity of underlying software tools. A software engineer must find, investigate, and apply a library to access each API. Sometimes a vendor supplies a solid library or SDK. If you’re lucky, they have one for your team’s preferred programming languages. Most of the time, there’s a pile of open source libraries and you just have to try the ones with the most momentum.

By the way, you know GitHub has both a REST API and a GraphQL API, right? Also, don’t forget how SharePoint is split between the legacy API and the newer Microsoft Graph API. Not only is the landscape diverse, it’s also ever-evolving. It takes real time, expertise, and effort to keep up.
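
To make the point concrete, here is a minimal Python sketch of the kind of wrapper a team ends up writing so that the rest of the pipeline doesn’t care which API style sits behind a source. Everything in it is an assumption for illustration: the names RawItem, RestConnector, GraphQLConnector, and collect are hypothetical, and the get_page and run_query callables stand in for whatever vendor SDK or HTTP client you actually use. Real APIs differ in their pagination details, so treat this as a shape, not a recipe.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, Optional


@dataclass
class RawItem:
    """One untouched object from a source system (hypothetical shape)."""
    source: str
    item_id: str
    payload: dict


class RestConnector:
    """Walks a page-numbered listing, as many REST-style APIs expose."""

    def __init__(self, source: str, get_page: Callable[[int], list]):
        self.source = source
        self.get_page = get_page  # assumed: page number -> list of dicts

    def fetch(self) -> Iterator[RawItem]:
        page = 1
        while True:
            items = self.get_page(page)
            if not items:  # an empty page means we are done
                break
            for item in items:
                yield RawItem(self.source, str(item["id"]), item)
            page += 1


class GraphQLConnector:
    """Walks a cursor-paginated result, as GraphQL-style APIs often expose."""

    def __init__(self, source: str, run_query: Callable[[Optional[str]], dict]):
        self.source = source
        # assumed: cursor -> {"nodes": [...], "endCursor": ..., "hasNextPage": bool}
        self.run_query = run_query

    def fetch(self) -> Iterator[RawItem]:
        cursor = None
        while True:
            result = self.run_query(cursor)
            for node in result["nodes"]:
                yield RawItem(self.source, str(node["id"]), node)
            if not result["hasNextPage"]:
                break
            cursor = result["endCursor"]


def collect(connectors) -> list:
    """Drain every connector through the same fetch() interface."""
    return [item for connector in connectors for item in connector.fetch()]
```

The two classes expose the same fetch() method, so downstream code can treat a page-numbered source and a cursor-based source identically. The cost is that every new source, and every API revision, means another class like these to write and maintain.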

Mapping API Objects to Your Schema

Once you have the software infrastructure tamed, it’s time to dig into the details of the data. The surface area of most APIs can be a bit overwhelming. There will be a variety of objects and endpoints.

The first step is usually taking an inventory. You’ll be looking for key text fields like title, description, body, notes, and so on. There’s often a plethora of metadata as well. Some of it’s valuable and some not so much.

With an inventory in hand, you can start mapping the API data to the schemas of your search engine. This is a tedious process of mapping, relating, and filtering. In our experience, it is also iterative as you work through different sources: you inventory an API, decide on core search schema fields, inventory another API, make schema adjustments, and so on. This mapping of API data to engine schema is crucial for downstream success in your system!
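
As a rough illustration of what that mapping can look like in code, here is a small Python sketch. The SearchDocument fields, the FIELD_MAPS table, and the "wiki" and "tickets" sources with their field names are all made up for the example; your engine’s schema and each API’s real field names will differ.

```python
from dataclasses import dataclass, field


@dataclass
class SearchDocument:
    """A hypothetical core schema shared by every source."""
    doc_id: str
    source: str
    title: str
    body: str
    metadata: dict = field(default_factory=dict)


# Per-source mapping: which raw field feeds which schema field, and
# which metadata keys are worth keeping. Everything else is dropped.
FIELD_MAPS = {
    "wiki": {"title": "title", "body": "body", "keep": ["space", "updated"]},
    "tickets": {"title": "summary", "body": "description", "keep": ["status"]},
}


def to_document(source: str, raw: dict) -> SearchDocument:
    """Map one raw API object onto the shared schema."""
    mapping = FIELD_MAPS[source]
    return SearchDocument(
        doc_id=f"{source}:{raw['id']}",  # prefix IDs so sources cannot collide
        source=source,
        title=(raw.get(mapping["title"]) or "").strip(),
        body=(raw.get(mapping["body"]) or "").strip(),
        metadata={k: raw[k] for k in mapping["keep"] if k in raw},
    )
```

For example, to_document("tickets", {"id": 7, "summary": "VPN down", "description": "Cannot connect", "status": "open", "internal_rank": 3}) keeps the summary as the title and the status as metadata, and silently drops internal_rank. Keeping the mapping in a table rather than in code makes the iterate-and-adjust loop described above a little less painful.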

Old Data, New Data, and Changing Data

With the data wrangled, it’s time to return to some of the mechanics of data acquisition. You’ve got old data, new data, and changing data. Each needs attention.

Ingesting the last few months of data (say, some Google Docs) is one place to start. It will give you a feel for the process of ingestion and what your data connector needs to support. If you just need to ingest data once, this can be pretty straightforward. Even ingesting the last year can be pretty quick with the right code incantations.

It doesn’t take long, though, to see that new data never stops being created. This is where things will start to get messy again. Each API will have a different approach. Before you know it, your connector has to support polling, streaming, and also webhooks. More software development time and more sunk costs.
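
Polling is usually the simplest of the three to reason about, so here is a minimal Python sketch of it. It assumes a hypothetical list_updated_since callable that returns items whose "updated" value is greater than a checkpoint, and it uses plain integers for timestamps to keep the example short. Real APIs vary widely here, and a production connector would also persist the checkpoint and handle rate limits.

```python
def poll_once(list_updated_since, checkpoint):
    """Fetch items newer than the checkpoint and advance it.

    list_updated_since is an assumed callable: checkpoint -> list of
    dicts, each carrying an "updated" value comparable to the checkpoint.
    """
    items = sorted(list_updated_since(checkpoint), key=lambda item: item["updated"])
    # Only move the checkpoint forward when something actually came back.
    new_checkpoint = items[-1]["updated"] if items else checkpoint
    return items, new_checkpoint


def run_polls(list_updated_since, handle, checkpoint, rounds):
    """Run a fixed number of polling rounds, passing each item to handle()."""
    for _ in range(rounds):
        items, checkpoint = poll_once(list_updated_since, checkpoint)
        for item in items:
            handle(item)
    return checkpoint
```

One caveat with this sketch: a strict "newer than the checkpoint" comparison can miss items that share a timestamp with the last item seen. A common workaround is to re-fetch a small overlapping window and de-duplicate by ID. Webhook and streaming sources can feed the same handle() function, which keeps the rest of the pipeline unchanged.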

Last but not least, don’t forget about changes! If HR staff just updated the knowledge base and your index hasn’t caught up, you’re still serving old content. So you’ll need to support the ability to identify, ingest, and update content that has changed. Also, don’t forget security: permissions change too, which means you need to hide that new Confluence space that was accidentally exposed to everyone when it was first created.
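
One way to think about this is as a diff between what your index believes and what the source says right now. The Python sketch below is only an illustration under several assumptions: the index_state and fetched dictionaries, the "allowed" field for permissions, and the plan_sync function are all hypothetical, and many APIs offer change feeds or version fields that would replace the content hash used here.

```python
import hashlib


def fingerprint(doc: dict) -> str:
    """Hash the searchable text so content edits can be detected cheaply."""
    digest = hashlib.sha256()
    digest.update(doc["title"].encode("utf-8"))
    digest.update(b"\0")  # separator so title/body boundaries stay distinct
    digest.update(doc["body"].encode("utf-8"))
    return digest.hexdigest()


def plan_sync(index_state: dict, fetched: dict) -> dict:
    """Compare indexed state with freshly fetched docs and plan the work.

    index_state: doc_id -> {"hash": str, "allowed": list of principals}
    fetched:     doc_id -> {"title": str, "body": str, "allowed": list}
    """
    plan = {"upsert": [], "update_acl": [], "delete": []}
    for doc_id, doc in fetched.items():
        known = index_state.get(doc_id)
        if known is None or known["hash"] != fingerprint(doc):
            # New or edited content: re-index (the upsert also writes the ACL).
            plan["upsert"].append(doc_id)
        elif set(known["allowed"]) != set(doc["allowed"]):
            # Same text, different audience: only permissions need fixing.
            plan["update_acl"].append(doc_id)
    # Anything indexed but no longer returned by the source should go.
    plan["delete"] = [doc_id for doc_id in index_state if doc_id not in fetched]
    return plan
```

In this sketch, the accidentally exposed space from the example above would show up under update_acl once its permissions are corrected at the source, even though none of its text changed. The delete list only makes sense if fetched is a complete listing of the source; with a partial fetch it would wrongly remove documents.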

Closing

Once again we see that building an internal search or RAG system is a deep well of complexity. The diversity of data sources is a key driver of this complexity. Much of this work is plumbing that will only distract your R&D team from the real value of leveraging your data.

If you’d like to outsource all this data ETL work, and gain an application and API for that data, we’re here to help! Apply your AI budget efficiently, in as little as a few weeks, with a low-risk trial of Atolio. Reach out today!

Dave Cliffe is the Head of RAG (Rendering AI Guidance) at Atolio. Atolio helps enterprises use Large Language Models (LLMs) to find answers privately and securely.
