Typical AI agent demos are awash with bots that answer simple questions. Applied in a customer support setting, these bots act as a first line of help for companies that are scaling: customers get quick, best-effort answers, and support teams are a little less inundated. In 2024, the approach in vogue for building this capability is Retrieval-Augmented Generation, or RAG, with large language models.
At Gradient Labs, we are building an operating system of AI agents that automate manual, repetitive work, starting with customer service. We therefore could not avoid investigating RAG in depth as we started out, and we are often asked whether it is the main technological approach we are working on.
RAG is a more general solution than existing methods
The history of the tech that underlies question-answering in customer service has been one of progressive automation. Fifteen years ago, "bots" were likely nothing more than manually curated flowcharts under the hood. Ten years ago, they were probably starting to be powered by basic, custom machine learning classifiers. Five years ago (when we built a bot called Monzo Helper), the latest systems were powered by the first wave of pre-trained models like BERT. And now, the headline-grabbing approach is RAG with LLMs.
The crux of the approach is to inject relevant search results (retrieval-augmented) as input context for LLMs to produce (generate) answers. A lot has been written about how RAG succeeds and fails, and there's a growing literature of practical resources on how to get more out of it. In effect, RAG promises to be a huge leap towards a general solution for question-answering: index documents into your vector store of choice, link it up with your favourite LLM, and you're off to the races.
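To make the shape of that pipeline concrete, here is a minimal sketch in Python. It is not our implementation: the bag-of-words "embedding", the in-memory "vector store", and the `call_llm` stub are stand-ins for whatever embedding model, vector database, and LLM client you would actually use.

```python
# Minimal retrieval-augmented generation sketch: retrieve the most similar
# documents, stuff them into the prompt, and ask the model to answer.
import math
from collections import Counter

def call_llm(prompt: str) -> str:
    # Placeholder: plug in your LLM client of choice here.
    raise NotImplementedError

def embed(text: str) -> Counter:
    # Stand-in embedding: a bag-of-words vector. In practice this would be a
    # call to an embedding model, with the vectors held in a vector store.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

def answer(query: str, corpus: list[str]) -> str:
    context = "\n\n".join(retrieve(query, corpus))
    prompt = (
        "Answer the customer's question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)
```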
RAG is a thin slice of the customer support problem space
We originally thought that RAG was the no-brainer starting point for working with our design partners. But, for some of them, RAG would not automate a meaningful amount of their work. We now broadly divide the companies we work with into several intersecting groups:
General information: companies with broad, diverse, complex, or multi-featured digital products tend to have customer demand that is dominated by questions seeking information ("how do I…?")
Personal information: there are companies with inbound demand that is dominated by requests about the customers' personal situation: their account, their booking, their transaction, their order ("what is the status of my…?")
Procedural: there is a large segment of companies where agents' work is dictated by procedures, characterised by a combination of investigative and action-taking work: refunds, upgrades, cancellations, modifications and more ("can you…?")
Many companies have a blend of all three, with the balance tipped in one direction or another by what the core business is. For example, tech companies that offer or mediate services in the real world tend to see more procedural inbound than information-seeking queries. For businesses that get little to no general information-seeking questions, standard RAG would have very little impact. For companies that are predominantly asked about personal information, standard RAG would lead to frustrating and long-winded customer experiences.
And so a wider set of capabilities needs to be built. Tool use and procedure orchestration and execution are the front runners here, alongside the meta-capability of knowing when to use which approach for a specific kind of query.
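As an illustration of that meta-capability (very much a sketch, not our architecture), a router might classify each query into one of the three buckets above and dispatch to a different capability. The handler names are hypothetical placeholders:

```python
# Illustrative query router: classify the query, then dispatch to the
# capability that fits it.
from enum import Enum

class QueryKind(Enum):
    GENERAL_INFORMATION = "general_information"    # "how do I…?"
    PERSONAL_INFORMATION = "personal_information"  # "what is the status of my…?"
    PROCEDURAL = "procedural"                      # "can you…?"

def classify(query: str) -> QueryKind:
    # In practice this would be an LLM or classifier call; the keyword
    # matching here only keeps the sketch self-contained.
    q = query.lower()
    if q.startswith("can you"):
        return QueryKind.PROCEDURAL
    if "status of my" in q or "my account" in q:
        return QueryKind.PERSONAL_INFORMATION
    return QueryKind.GENERAL_INFORMATION

def answer_with_rag(query: str) -> str: ...    # retrieval over the knowledge base
def answer_with_tools(query: str) -> str: ...  # look up the account, order, or booking
def run_procedure(query: str) -> str: ...      # orchestrate a multi-step procedure

def handle(query: str) -> str:
    kind = classify(query)
    if kind is QueryKind.GENERAL_INFORMATION:
        return answer_with_rag(query)
    if kind is QueryKind.PERSONAL_INFORMATION:
        return answer_with_tools(query)
    return run_procedure(query)
```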
RAG has dangerous, niche failure modes
Okay, so there's still a slice of the market where RAG could be useful. But how well is it working out? There's a growing list of public cases where bots (that might be using some form of RAG under the hood?) write answers that are wrong. Understanding, diagnosing, and mitigating these errors is a nuanced exercise that requires looking at the entire stack of RAG components.
At the highest level, it is immediately clear that the quality of the document corpus available to run RAG over is critical, since RAG aims to generate answers from that input. Mostly, however, documents are written for human consumption: either publicly, as articles that get published online, or privately, in internal company knowledge bases. Chunking and indexing documents in a vector store and hoping for the best readily produces answers that tell customers to get in touch (which is literally what they are already doing), or that disclose information meant to be internal-only, which could mean breaking the law in regulated environments. And this does not even touch on the problem that many company corpora are not only outdated but also largely incomplete.
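One plausible (and deliberately simplified) mitigation, not something the post above prescribes, is to attach audience and freshness metadata to every document and filter on it before anything is indexed or retrieved. The field names below are hypothetical:

```python
# Keep internal-only or stale content out of the retrievable corpus entirely,
# rather than hoping the model will decline to repeat it.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Document:
    text: str
    audience: str        # "public" or "internal"
    last_reviewed: date

def retrievable(doc: Document, max_age_days: int = 180) -> bool:
    fresh = (date.today() - doc.last_reviewed) <= timedelta(days=max_age_days)
    return doc.audience == "public" and fresh

corpus = [
    Document("To open an account, download the app and verify your identity.",
             audience="public", last_reviewed=date.today() - timedelta(days=30)),
    Document("Escalate chargebacks above £500 to the disputes team.",
             audience="internal", last_reviewed=date.today() - timedelta(days=400)),
]
safe_corpus = [d for d in corpus if retrievable(d)]  # only the public, recently reviewed doc remains
```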
Consider a more nuanced example: a customer writes into a multinational fintech asking "how do I open an account?" A RAG system with all of the usual bells and whistles may find a document on opening accounts and generate a reply enumerating the required steps. At first glance, this may look great! But, digging deeper, what if that customer was writing in from a country where the company does not operate? Suddenly the answer is not just incorrect, it is misleading. In this case, the "right" thing for the AI agent to do would have been to discover, diagnose, and reason about facts that are absent from the originating query, more akin to the design thinking applied in medical diagnosis, and a capability that goes well beyond out-of-the-box RAG.
The "right" answer, in other cases, may be no answer at all. Consider a customer who is asking an informational query but exhibiting the hallmark signs of financial distress or vulnerability. The right thing to do is to identify vulnerability as the overarching problem and redirect the customer to the right team, not answer their informational query.
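To make the account-opening and vulnerability examples concrete, here is a toy pre-answer gate. It is purely illustrative (the country list, context fields, and outcomes are made up); the point is that the agent decides whether to answer at all before any retrieval or generation happens:

```python
# Sketch of a pre-answer gate: decide what to do *before* generating anything.
from dataclasses import dataclass

SUPPORTED_COUNTRIES = {"GB", "IE", "FR"}  # hypothetical coverage, for illustration only

@dataclass
class CustomerContext:
    country: str | None          # None if the conversation has not established it yet
    vulnerability_flagged: bool  # set by a separate classifier, not shown here

def triage(ctx: CustomerContext) -> str:
    if ctx.vulnerability_flagged:
        return "handoff: route to the specialist support team"
    if ctx.country is None:
        return "clarify: ask which country the customer is in"
    if ctx.country not in SUPPORTED_COUNTRIES:
        return "reply: explain the product is not available there"
    return "answer: safe to proceed with retrieval and generation"
```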
Squaring the circle
There is no doubt that multi-turn question-answering is an important quality of a fully capable AI agent, albeit a smaller one in many cases than is commonly believed. RAG is also on its way towards becoming a commodity SaaS technology itself, but the experiments we ran here also urge caution against treating RAG's standard formulation as a magical solution. The research literature on this topic is growing, and we're following it as closely as we can, pulling promising ideas and testing them out as we iterate and refine.
But it is only one piece of the wider picture: AI agents that automate manual, repetitive work will need to go well beyond the scope of today's RAG. To follow along on our journey, subscribe below!