<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Gradient Labs Team]]></title><description><![CDATA[AI, Engineering, and Product at Gradient Labs]]></description><link>https://blog.gradient-labs.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png</url><title>Gradient Labs Team</title><link>https://blog.gradient-labs.ai</link></image><generator>Substack</generator><lastBuildDate>Fri, 24 Apr 2026 12:33:46 GMT</lastBuildDate><atom:link href="https://blog.gradient-labs.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Gradient Labs]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[gradientlabs@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[gradientlabs@substack.com]]></itunes:email><itunes:name><![CDATA[Neal Lathia]]></itunes:name></itunes:owner><itunes:author><![CDATA[Neal Lathia]]></itunes:author><googleplay:owner><![CDATA[gradientlabs@substack.com]]></googleplay:owner><googleplay:email><![CDATA[gradientlabs@substack.com]]></googleplay:email><googleplay:author><![CDATA[Neal Lathia]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Creating a Culture for AI Builders]]></title><description><![CDATA[The day-to-day at Gradient Labs]]></description><link>https://blog.gradient-labs.ai/p/creating-a-culture-for-ai-builders</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/creating-a-culture-for-ai-builders</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Mon, 02 Mar 2026 16:17:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you&#8217;ve ever worked in a startup, you&#8217;ve heard the common phrases&#8212; it&#8217;s fast paced, it&#8217;s high-autonomy, it&#8217;s impact-oriented. Anyone can tell you these buzzwords. This blog post sets out to go into actual detail about how we work at Gradient Labs.</p><h3>Momentum over detailed planning</h3><p>One of the earliest conversations we had when we were starting Gradient Labs was about <em>momentum</em>. In previous roles, we had felt that the real cost of large planning exercises like quarterly OKRs was a loss of velocity and an inability to quickly update plans when circumstances changed. We therefore started out with a simple mantra: &#8220;keep things moving.&#8221;</p><p>If you go into our archives, you&#8217;ll find a <em>#one-line-updates</em> slack channel which was bullet points of updates: &#8220;___ is now done.&#8221; It was early; majority of things were yet to be done. As we grew, and updates started spreading out, we switched over to a <code>:glabs-heartbeat:</code> emoji reaction, which automatically cross-posts the update to a channel. 
These posts celebrate <em>anything</em> that moves us forward: shipping new product features, winning bake-offs, launching a marketing campaign, or anything else.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9MFq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9MFq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 424w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 848w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 1272w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9MFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png" width="489" height="196" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:489,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:37290,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/189655756?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9MFq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 424w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 848w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 1272w, https://substackcdn.com/image/fetch/$s_!9MFq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F127239e3-7c94-40bd-ae19-b27e9ac679b9_489x196.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Momentum is everything</figcaption></figure></div><p>Our planning process is high level and is oriented towards driving focus and momentum. We decide on outcomes we are aiming for, identify the bottlenecks, and set intentional time limits. At the same time, we retain the ability to change course. Some of our most interesting product features have come from the team spotting patterns, having ideas, or taking a day to run an experiment. When outcomes need to be sequenced or mapped out in more detail, we use <em>proposals&#8212;</em> documents akin to RFCs or product specifications.</p><p>From the outside, some of this might appear to be a little bit chaotic. At any one time, we have a set of problems that we are tackling, and what we need is to have, validate, and then quickly exclude ideas that <em>don&#8217;t</em> work in order to move the needle forward. Making in-depth plans often stands in the way of fast iteration; deciding on the <em>next</em> step to take is easier than deciding on the next ten.</p><h3>Fluid teams</h3><p>Today, just over 50% of Gradient Labs is Engineering (AI, Backend, Product) and Design. We also have a very special team that we call AI Delivery, with Product &amp; Ops specialists who run the entire lifecycle of landing and expanding our AI agent&#8217;s capabilities with each of our partners. The rest of the company spans Sales, Marketing, and Ops; all of them work deeply with everyone in tech.</p><p>There&#8217;s a blueprint that has been synonymous with startups and scale-ups for a long time: people get organised into cross-functional teams (squads, pods) and teams are clustered thematically into groups (tribes, collectives). This model assumes that you can decide that structure <em>relatively</em> up-front, by knowing what needs to be done. Once you create teams, re-structuring them is not a frictionless exercise, and people start thinking about <em>their</em> team&#8217;s remit.</p><p>We&#8217;ve adopted a more fluid structure:</p><ul><li><p><strong>Strike teams</strong> are cross-functional groups who are working on a top-level company priority. The team&#8217;s <strong>continued existence means that the goal isn&#8217;t fully reached yet</strong>. Membership in a strike team is decided based on what the team currently needs to achieve that outcome&#8212;people <strong>are expected to drop out</strong> when no longer needed.</p></li><li><p><strong>Pair projects.</strong> In some cases, we have people huddle on a project to get it over the line. Usually these are discipline-specific: a change that is only AI, product, or backend-related.</p></li><li><p><strong>One-person quests.</strong> Every company has had that inevitable moment of receiving a customer request or idea, and then needing to figure out how to slot that into a wider set of priorities that a cross-functional squad is working on. We have countless examples of a single Engineer having an idea, building it, and then seeing it through all the way to customers.</p></li></ul><h3>On demand over recurring meetings</h3><p>When I asked the team for ideas for this blog post, &#8220;meeting-free calendar&#8221; was one of the first replies and is highly valued across the entire Engineering team.</p><p>There&#8217;s usually a natural cadence of meetings that tech teams fall into. Stand up, planning, cross-team syncs, retros. In these environments, builders might realistically only get a couple of hours a day of pure focus. 
We do not impose any required meetings. We&#8217;ve found that standing meetings quickly become formulaic and lose their meaning. In the spirit of fast collaboration, meetings happen on demand&#8212;when a thread is becoming too long, when someone is blocked, or when multiple people are running in parallel and want to check in. </p><p>The only standing meeting that we currently have is a weekly company-wide All Hands which ranges from 20 to 45 minutes, and spotlights recent progress, reinforces our current goals, or gives us a forum to share updates.</p><h3>Customers actively contribute to our roadmap</h3><p>When adopting an AI agent, the most natural questions from our partners are around understanding, steering, and extending the AI agent&#8217;s performance. A lot of amazing ideas come from their feedback and the constraints that they are working with.</p><p>It&#8217;s very common at Gradient Labs for customer meetings to have folks from AI Delivery and Engineering involved so that we can unblock them as quickly as possible. Given that we focus on financial services, it&#8217;s also common that new features we build from one customer&#8217;s feedback are immediately useful to others: this means that we can stage our feature rollouts by customer &amp; their interest rather than needing to trade off against or design around competing requests.</p><h3>Process &amp; tooling, insofar as it helps</h3><p>When joining a company, a natural question to ask is &#8220;what is the process for ___?&#8221; At an early-stage startup, the answer is usually &#8220;we don&#8217;t have one yet.&#8221; Having a process does help in some cases: it brings clarity and safety. Other times, it stands in the way.</p><p>At Gradient Labs, we&#8217;ve intentionally avoided adding process until the reason for adding one is to help us move faster.</p><ul><li><p>What is our process for raising an incident? If you are thinking about whether or not to raise an incident, just do it.</p></li><li><p>What is our process for tracking bugs? Whatever the current team moves faster on&#8212;sometimes Linear, sometimes a Notion page; sometimes, just squash the bug right now.</p></li><li><p>What is the process for onboarding a new customer? We have the milestones that need to be achieved, not the strict sequence of steps.</p></li></ul><p>All of these are ways-of-working processes; the process &amp; bar we set for the product itself remains very high.</p><h3>Owning the quality bar &amp; not hiding behind the data</h3><p>The truth we face today is that, over time, it is becoming easier to build an AI agent. What is <strong>not</strong> becoming easier is building a <em>great</em> one. Although the entire founding team has a background in Data Science, we are very intentional about not hiding behind &#8220;what the data says.&#8221; After all, an AI agent can deliver the same resolution in many ways: it might be okay, mediocre, frustrating, or amazing, even though the resolution was reached in all cases. The data is an important driver of where to look, but not the ultimate decider on whether things are good enough.</p><p>Coming to a shared understanding of &#8220;great&#8221; is not easy. When we embarked on building our voice agent, the first thing we did was scour the Internet to find any and all agents we could talk to, and ask questions like: &#8220;would I enjoy speaking to this?&#8221; and &#8220;would I buy this?&#8221; When we launched, we reviewed all of the phone calls and iterated heavily. 
And as we scale, we&#8217;re keeping that ethos, even though the number of conversations we&#8217;re having far exceeds our ability to review them all.</p><h3>Scaling quality</h3><p>We believe that when building AI agents (with AI), the bottleneck has largely moved away from the <em>time it takes to create something</em> and into the <em>time it takes to ensure it works well</em>. We know that every part of our stack, from the support platform connection through to the <a href="https://blog.gradient-labs.ai/p/anatomy-of-an-ai-agent-incident">Temporal cache</a>, has a meaningful impact on the outcomes we achieve. </p><p>And we have seen that nimble, fluid teams that live &amp; breathe a problem while driving it towards the best possible outcome can build products that outperform others.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Building resilient agentic systems]]></title><description><![CDATA[A guide to the foundations underneath our AI agent]]></description><link>https://blog.gradient-labs.ai/p/building-resilient-agentic-systems</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/building-resilient-agentic-systems</guid><dc:creator><![CDATA[Devan Kuleindiren]]></dc:creator><pubDate>Mon, 15 Sep 2025 16:33:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tUso!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Gradient Labs, our AI agent interacts with the customers of financial services companies: <em>high-reliability is non-negotiable</em>. Suppose you get in touch with your bank about an important problem related to your money &#8212; there&#8217;s no excuse for your bank&#8217;s AI agent not to be able to reply!</p><p>Under the hood, <a href="https://blog.gradient-labs.ai/p/llms-at-gradient-labs-the-perfect">we use a blend of different large language models (LLMs)</a> to construct the highest quality answers. LLMs are therefore a critical dependency for our agent, and we need to be resilient to all kinds of failures or constraints in order to provide a reliable experience. This blog post goes into some detail about how we achieve that.</p><h1>AI agents are a new paradigm</h1><p>When building a server which handles a request, such as a mobile phone app calling a bank, the request might return within a few hundred milliseconds. Anything which fails on the path of the request would likely result in a retry of the <em>entire</em> request.</p><p>In agentic systems, where a single reply may be the result of a chain of LLM calls, requests can span much longer durations. 
Each LLM call costs both user-facing latency and money, so you don&#8217;t want a single failure in that chain to result in the entire request being retried:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tUso!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tUso!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 424w, https://substackcdn.com/image/fetch/$s_!tUso!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 848w, https://substackcdn.com/image/fetch/$s_!tUso!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 1272w, https://substackcdn.com/image/fetch/$s_!tUso!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tUso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png" width="1456" height="632" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:632,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:381157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/173452007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tUso!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 424w, https://substackcdn.com/image/fetch/$s_!tUso!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 848w, https://substackcdn.com/image/fetch/$s_!tUso!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 1272w, https://substackcdn.com/image/fetch/$s_!tUso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c65413c-1d21-4667-b8de-6986f2c20c06_4598x1996.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">An illustration of how agentic systems often involve longer-running requests where you don&#8217;t want to retry the entire chain of LLM calls if only one of them fails.</figcaption></figure></div><p>One way to solve for this might be to manually write data to a database at the end of each call, to persist state that captures the progress of your agent. This way, you can always recover from a checkpoint upon failure&#8212;if the database write succeeded. Another way might be to implement <code>@retry</code> logic at every step of your agent, which means solving for retry logic while trying to implement the AI agent. At Gradient Labs, <a href="https://temporal.io/resources/case-studies/gradient-labs-uses-ai-agents-to-resolve-complex-customer-issues">we use Temporal</a>, a durable execution system that provides us with a way to effectively checkpoint progress out of the box.</p><h1>Failing over across providers</h1><p>The earliest design choices we made at Gradient Labs favoured an architecture that enables us to experiment with, evaluate, and adopt the best LLMs for each part of our agent. Right now, <a href="https://blog.gradient-labs.ai/p/llms-at-gradient-labs-the-perfect">we use three major groups of models</a> &#8212; from OpenAI, Anthropic and Google. 
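</p><p>To make this concrete before going into the details, here is a minimal Go sketch of the kind of provider failover this section describes: each request carries an ordered list of provider preferences, transient failures cause a failover to the next provider, and rate-limited providers are put into a short cool-off. The <code>Provider</code> interface, the sentinel errors, and the timings below are illustrative assumptions, not our actual internal API.</p><pre><code>// Illustrative sketch only, not Gradient Labs' actual implementation.
package llm

import (
    "context"
    "errors"
    "fmt"
    "sync"
    "time"
)

// Provider abstracts one hosted LLM API (for example, the same OpenAI model
// served by OpenAI directly or by Azure).
type Provider interface {
    Name() string
    Complete(ctx context.Context, prompt string) (string, error)
}

// Sentinel errors that a Provider implementation would map API responses onto.
var (
    ErrRateLimited = errors.New("rate limited")
    ErrServer      = errors.New("server error")
)

// Router holds the ordered provider preferences for one model, plus a
// short-lived record of providers that were recently rate limited.
type Router struct {
    Preferences    []Provider
    PerCallTimeout time.Duration

    mu       sync.Mutex
    downTill map[string]time.Time
}

func (r *Router) markDown(name string, d time.Duration) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if r.downTill == nil {
        r.downTill = map[string]time.Time{}
    }
    r.downTill[name] = time.Now().Add(d)
}

func (r *Router) isDown(name string) bool {
    r.mu.Lock()
    defer r.mu.Unlock()
    return time.Now().Before(r.downTill[name])
}

// Complete tries each provider in preference order, failing over on server
// errors, rate limits and per-call timeouts, and skipping providers that are
// still inside a rate-limit cool-off window.
func (r *Router) Complete(ctx context.Context, prompt string) (string, error) {
    var lastErr error
    for _, p := range r.Preferences {
        if r.isDown(p.Name()) {
            continue // recently rate limited: don't spend latency on it
        }
        callCtx, cancel := context.WithTimeout(ctx, r.PerCallTimeout)
        out, err := p.Complete(callCtx, prompt)
        cancel()
        switch {
        case err == nil:
            return out, nil
        case errors.Is(err, ErrRateLimited):
            r.markDown(p.Name(), 30*time.Second) // cool off, then try the next provider
            lastErr = err
        case errors.Is(err, ErrServer), errors.Is(err, context.DeadlineExceeded):
            lastErr = err // transient: fail over to the next provider
        default:
            return "", err // e.g. a malformed request: failing over will not help
        }
    }
    return "", fmt.Errorf("all providers failed: %w", lastErr)
}
</code></pre><p>In a sketch like this, a global preference ordering and a per-company one are just different <code>Preferences</code> slices over the same providers, and splitting traffic proportionally is a matter of choosing which ordering a given request starts from.</p><p>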
Each of these groups of models are hosted in multiple places, giving us the ability to fail over from one place to another:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TNvu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TNvu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 424w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 848w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 1272w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TNvu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png" width="1456" height="556" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:556,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:278942,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/173452007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TNvu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 424w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 848w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 1272w, https://substackcdn.com/image/fetch/$s_!TNvu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F835cee2e-960e-4e53-9fae-fc1cd083d14b_3622x1382.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">We use <strong>OpenAI</strong> models served by: OpenAI and Azure APIs; <strong>Anthropic</strong> models served by Anthropic, AWS and GCP APIs; and <strong>Google</strong> models served by GCP APIs in different regions.</figcaption></figure></div><p>We take advantage of this to both (a) spread traffic across providers, to increase utilisation of per-provider rate-limits, and (b) fail over from one provider to another when we encounter certain errors, rate limits or latency spikes.</p><p>In our platform, each completion request starts with an ordered list of API provider preferences. For example, if we&#8217;re making a request for GPT 4.1, we might have the preference ordering: (1) OpenAI, (2) Azure. We can configure these preferences on both a global and a per-company basis, and we can also assign them proportionally to how we want to split traffic across the two. If we encounter certain types of errors, then we failover to the next provider.</p><p>The nuance of a failover system is: <strong>when is it right to fail over?</strong> We currently handle  four broad categories:</p><ol><li><p><strong>Successful, invalid responses</strong> &#8212; for example, perhaps we asked the LLM to place its final decision inside <code>&lt;decision&gt;...&lt;/decision&gt;</code> tags, but the generated response doesn&#8217;t include these. We do not need to fail over for these.</p></li><li><p><strong>Errors</strong> &#8212; LLM APIs, much like any other software provider, can return all sorts of errors. We&#8217;ll failover when we hit most 5XX errors.</p></li><li><p><strong>Rate limits</strong> &#8212; LLM APIs impose different rate limits for each model. Increasing these limits, particularly for more recent or experimental models, is not an automated process. If we failover due to being rate limited, then we mark the API provider that we failed-over from as &#8220;unavailable&#8221; in our cache for a short while. This way, we don&#8217;t waste latency on subsequent requests on a resource that&#8217;s already over limits.</p></li><li><p><strong>Latency</strong> &#8212; LLMs can be slow. This is expected to be variable across calls. 
But we need to look out for scenarios where they&#8217;re globally slower than expected&#8212;this is a symptom that something is wrong. We currently failover if the request exceeds a timeout in the p99+ percentile of latency.</p></li></ol><h1>Failing over across models</h1><p>There are catastrophic, low-likelihood scenarios where the provider failover system is not enough. For example, in the extremely rare case that <a href="https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1SsW">Google is down</a>, our failovers for completions from Gemini would no longer work. </p><p>In these cases, we can also activate model failover: for each LLM API request, we can configure a different model to use in the event of failure.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y7C7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y7C7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 424w, https://substackcdn.com/image/fetch/$s_!y7C7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 848w, https://substackcdn.com/image/fetch/$s_!y7C7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 1272w, https://substackcdn.com/image/fetch/$s_!y7C7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y7C7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png" width="1456" height="746" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:746,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:634800,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/173452007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y7C7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 424w, 
https://substackcdn.com/image/fetch/$s_!y7C7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 848w, https://substackcdn.com/image/fetch/$s_!y7C7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 1272w, https://substackcdn.com/image/fetch/$s_!y7C7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d2294fa-f624-4c71-b815-c0817c515a2b_4871x2495.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The main challenge of a model failover is that prompts that work well with one model don&#8217;t necessarily work well with others. However, designing and evaluating multiple prompt-model pairs for components of our agent is already part of our development lifecycle. For several critical components of our system, we have tailored prompts for both the primary and backup models. </p><p>This has a two-fold benefit:</p><ol><li><p>In the event that the entire model group&#8217;s providers are down, our customers are protected, and their customers continue to receive replies from our AI agent.</p></li><li><p>In the event that we get rate-limited for newer and more experimental models, we can failover to older models for which we have higher rate limits.</p></li></ol><h1>Always improving</h1><p>We&#8217;re always thinking about ways in which we can improve our resiliency&#8212;here is a recent example. We&#8217;re already protected against scenarios where individual LLM API requests take too long, because we time out and failover to the next provider. This timeout is designed to catch abnormally long requests &#8212; typically in the p99 percentile of latency. What should happen if the <strong>entire</strong> latency distribution shifts? 
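</p><p>As a thought experiment, one minimal Go sketch of such a detector: track a long-lived baseline of per-provider latencies next to a short recent window, and flag a shift when the recent average drifts well above the baseline average, even though no individual request breaches the timeout. The window sizes and the 2x factor here are illustrative assumptions, not values we necessarily use.</p><pre><code>// Illustrative sketch only, not Gradient Labs' actual implementation.
package llm

import (
    "sync"
    "time"
)

// ShiftDetector compares a short window of recent request latencies against a
// longer-lived baseline for one provider.
type ShiftDetector struct {
    mu        sync.Mutex
    baseline  []time.Duration // long window, e.g. the last 1000 calls
    recent    []time.Duration // short window, e.g. the last 50 calls
    maxBase   int
    maxRecent int
    factor    float64 // recent average above factor x baseline counts as a shift
}

func NewShiftDetector() *ShiftDetector {
    return &amp;ShiftDetector{maxBase: 1000, maxRecent: 50, factor: 2.0}
}

// Observe records one request's latency and reports whether the recent window
// now looks abnormally slow relative to the baseline. A caller could use a
// true result to mark the provider down and fail over pre-emptively.
func (s *ShiftDetector) Observe(latency time.Duration) bool {
    s.mu.Lock()
    defer s.mu.Unlock()

    s.baseline = appendCapped(s.baseline, latency, s.maxBase)
    s.recent = appendCapped(s.recent, latency, s.maxRecent)

    // Only judge once the recent window is full and a baseline exists.
    if len(s.recent) != s.maxRecent {
        return false
    }
    base := average(s.baseline)
    if base == 0 {
        return false
    }
    return float64(average(s.recent)) > s.factor*float64(base)
}

func average(ds []time.Duration) time.Duration {
    if len(ds) == 0 {
        return 0
    }
    var sum time.Duration
    for _, d := range ds {
        sum += d
    }
    return sum / time.Duration(len(ds))
}

// appendCapped appends d and drops the oldest entries beyond limit.
func appendCapped(buf []time.Duration, d time.Duration, limit int) []time.Duration {
    buf = append(buf, d)
    if len(buf) > limit {
        buf = buf[1:]
    }
    return buf
}
</code></pre><p>In practice, any such auto-failover would have to be conservative, since reacting to noise would cause unnecessary churn across providers.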
</p><p>We observed this with one of our providers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rnFE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rnFE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 424w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 848w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 1272w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rnFE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png" width="1385" height="1145" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1145,&quot;width&quot;:1385,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:309263,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/173452007?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rnFE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 424w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 848w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 1272w, https://substackcdn.com/image/fetch/$s_!rnFE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7921bbe-fc5a-4f28-9f48-9b4273c56794_1385x1145.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button 
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">An example of an incident where the p75+ latency meaningfully increased, but stayed within our failover timeouts. This meant that the end-to-end latency of our agent increased, but failover didn&#8217;t kick in.</figcaption></figure></div><p>In this case, the mean latency of a few models spiked, and the p75+ latency jumped to well over 10s. This increased the overall latency of our agent, but didn&#8217;t initiate our failover mechanism. Thankfully, through our latency-based alerts, we were able to identify this fairly quickly and manually invoke the failover mechanism. However, this points to an interesting idea &#8212; can we auto-failover when we observe abnormal shifts in the latency distribution? </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"><em>If these sorts of problems excite you, please reach out. <a href="https://jobs.ashbyhq.com/gradient-labs">We&#8217;re hiring</a>! 
Subscribe below for future updates:</em></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Anatomy of an AI agent incident]]></title><description><![CDATA[A short post on a string of related incidents we had a few months ago, and how we discovered and resolved the issues]]></description><link>https://blog.gradient-labs.ai/p/anatomy-of-an-ai-agent-incident</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/anatomy-of-an-ai-agent-incident</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Mon, 18 Aug 2025 13:55:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8B6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As we launch <a href="https://jobs.ashbyhq.com/gradient-labs/4e398d0a-a718-4127-85dc-7ba25bdd3ca5">our first Platform &amp; Security Engineering role</a>, we thought to follow in the footsteps of many great tech companies that have come before and write about a string of incidents that we worked through about six months ago. These were a wild ride for us!</p><h2>It all started with a memory usage alert</h2><p>At Gradient Labs, our platform and AI agent are written in Go and deployed using Cloud Run. Each conversation that our agent participates in <a href="https://temporal.io/resources/case-studies/gradient-labs-uses-ai-agents-to-resolve-complex-customer-issues">is a long-running Temporal workflow</a> which manages the conversation&#8217;s state, timers, and runs child workflows to generate responses. We have alerts across many of our platform and agent metrics, some of which page an on-call Engineer.</p><p>Late one weekend evening, Google Cloud platform alerts fired: the memory usage across our agent&#8217;s containers was abnormally high. This is a somewhat nebulous alert: it doesn&#8217;t mean we&#8217;re down, but it does mean that something is not quite right.</p><p>Our top priority is always to ensure that no customers are left hanging: i.e., that our agent is replying in all its active conversations. We raised an incident. We looked to quickly pinpoint any changes that started this pattern: there was nothing that immediately stood out. Notably, this was difficult to narrow down because, during the day, each deployment that we had been making was zeroing out the problem. And we were also dealing with variable traffic, having multiple trials running at that time. So we bought ourselves some time by redeploying the agent with more memory. The alert was resolved within minutes, but we knew that a deep investigation was required to get to a root cause.</p><p>Diving into the metrics, this looked like a classic <a href="https://en.wikipedia.org/wiki/Memory_leak">memory leak</a> where our agent container kept restarting. 
The problem was that memory utilisation was growing much faster than before.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8B6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8B6Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 424w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 848w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 1272w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8B6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png" width="1456" height="440" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:440,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183011,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/171272223?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8B6Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 424w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 848w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 1272w, https://substackcdn.com/image/fetch/$s_!8B6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13d7106b-3e5f-44e4-bd17-83921483e217_1928x582.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Uh oh! This memory usage doesn&#8217;t look right</figcaption></figure></div><h2>Spoiler: it wasn&#8217;t a memory leak</h2><p>There are several different parts of our AI agent that we <em>expect</em> to be memory intensive, particularly parts that are operating with many (variably sized) documents. We also run different parts of the agent in parallel to speed it up. We revisited all of these to ensure that resources were not being left in memory when no longer in use. None of these turned up anything useful; it seemed that something below the surface was at play. Perhaps with Temporal?</p><p>When researching online, we found an odd <a href="https://community.temporal.io/t/we-now-have-a-memory-leak-problem/12526/13">post on the Temporal forum</a> about an unexplained memory leak, but there wasn&#8217;t a clear answer there either. It seemed likely that if there <em>was</em> a memory leak in the Go Temporal SDK, that we wouldn&#8217;t be the only ones impacted. So we put it to one side, and kept digging.</p><p>The Google Cloud Profiler <a href="https://cloud.google.com/profiler/docs/concepts-flame">flame graphs</a> deltas for our agent's memory usage finally shed some light. Temporal&#8217;s top-level execution functions had the biggest growth over time in exclusive memory: something they were doing was increasing memory usage, and it had nothing to do with any of the functions they called further down the call stack. Among other things, these top-level functions are responsible for adding items to the <a href="https://docs.temporal.io/develop/worker-performance#cache-options">Temporal Workflow cache</a>. This cache stores workflow execution histories so that they don't always need to be retrieved from Temporal Cloud when a workflow resumes. Could this be what is causing the problem? We set out to validate this by running some tests.</p><h2>Validating &amp; fixing the issue</h2><p>To validate whether our problem was indeed related to the workflow cache, we made a sequence of intentional changes:</p><ol><li><p>The first change was to boost the agent&#8217;s memory to 5x what it previously had. Memory usage continued to grow <em>but eventually plateaued</em>. 
There was some kind of limit that we had reached.</p></li><li><p>We then deployed a change to decrease the worker&#8217;s cache size down by 10x. The container&#8217;s memory continued to grow, but plateaued at a much lower value.</p></li></ol><p>It seemed that we had found our answer: our previous container memory limit was below the cache&#8217;s &#8220;plateau&#8221; value: containers were crashing as the cache was trying to fill up.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rmQf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rmQf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 424w, https://substackcdn.com/image/fetch/$s_!rmQf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 848w, https://substackcdn.com/image/fetch/$s_!rmQf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!rmQf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rmQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png" width="1456" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:375505,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/171272223?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rmQf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 424w, https://substackcdn.com/image/fetch/$s_!rmQf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 848w, 
https://substackcdn.com/image/fetch/$s_!rmQf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 1272w, https://substackcdn.com/image/fetch/$s_!rmQf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac7e8288-1bb5-4066-8166-0129bc485cb6_3398x1356.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The AI agent&#8217;s container memory usage while we were validating our hypothesis</figcaption></figure></div><p>We closed this out by tuning our worker cache size. This was a trade-off between: how much memory we need to provision for our instances (infrastructure cost) vs how large we could make the cache (which reduces network calls to Temporal Cloud, and thus overall latency). Back in business!</p><h2>But wait... why is the AI agent slowing down?</h2><p>A short while after, as we continued onboarding new customers (and increasing our overall volume of conversations) we spotted that our AI agent&#8217;s mean latency seemed to have increased. This measures the average time that it takes our AI agent to do all of the steps that it needs to do to generate and safeguard its replies&#8212;a slow down here makes for a slightly worse customer experience. </p><p>Usually, this is an early symptom that one of the many LLM model providers that we use might be about to declare an incident. On that day, that was indeed the case, and so we adjusted our LLM fail-over system and the mean latency started going back down.</p><p>The next day, the problem resurfaced. This time, however, there were no reported outages from LLM providers (and no other errors, like rate limits). Curiously, this apparent change in latency seemed to be across the board, spanning agent skills that used different model providers and prompts that were known to regularly run very quickly, for both simple and complex AI agent responses. The problem seemed to be proportional to the volume that we were currently handling, which does fluctuate over the course of the day. 
<p>Digging through our metrics, we identified that the issue was not with the LLM providers themselves: the problem was that the time between <em>scheduling</em> and <em>starting to execute</em> our Temporal activities had grown substantially (sometimes more than 10x), while our rate of activity executions had dropped. Something was bottlenecked.</p><h2>Fixing a side-effect of the previous fix</h2><p>Opening the Google Cloud dashboard, we quickly found the answer: our agent had scaled <em>itself</em> down to very few instances. By manually raising the minimum instance count, we saw an immediate uptick in activity executions, and everything returned to normal. It took us less than an hour to identify and fix the problem.</p><p>How did this happen? Effectively, by auto-scaling our container count down, we had reduced our ability to execute activities. Most of them were scheduled, but then stuck waiting for an available worker&#8212;and there were far too few workers to execute them swiftly.</p><p>Cloud Run <a href="https://cloud.google.com/run/docs/about-instance-autoscaling">auto-scales</a> in both directions, so originally we had relied on this and set a low, non-zero minimum instance count. Importantly, it auto-scales based on incoming HTTP requests, event consumption, and CPU utilisation. Since our agent is a Temporal workflow, it had none of these: it polls Temporal Cloud (rather than receiving HTTP requests), it does not consume events, and its CPU utilisation had been fairly low. However, Cloud Run also <a href="https://cloud.google.com/blog/topics/developers-practitioners/lifecycle-container-cloud-run">attempts to gracefully handle</a> cases where containers are crashing&#8212;which had stopped happening ever since we fixed the workflow cache problem &#129318;. In effect, Cloud Run had only been keeping our instance count up because our containers kept crashing. By tuning the cache size, we fixed one problem but inadvertently stopped Cloud Run from scaling us up: we had throttled ourselves. A quick fix, but an unintended problem!</p>
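<p>For anyone running Temporal workers on similar infrastructure, the tell-tale signal here is the SDK&#8217;s schedule-to-start latency for activities. A minimal sketch of wiring up worker metrics so that this number is visible, assuming the Go SDK&#8217;s tally contrib package; the scope, endpoint, and names below are illustrative:</p><pre><code>package main

import (
	"log"

	"github.com/uber-go/tally/v4"
	"go.temporal.io/sdk/client"
	sdktally "go.temporal.io/sdk/contrib/tally"
)

func main() {
	// In production this scope would be backed by a Prometheus (or similar)
	// reporter; a test scope keeps the sketch self-contained.
	scope := tally.NewTestScope("agent", nil)

	c, err := client.Dial(client.Options{
		HostPort:       "your-namespace.tmprl.cloud:7233", // illustrative
		MetricsHandler: sdktally.NewMetricsHandler(scope),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// With a metrics handler attached, the SDK reports activity
	// schedule-to-start latency (typically surfaced as
	// temporal_activity_schedule_to_start_latency). Sudden growth in that
	// timer, with no provider errors, points at too few workers polling.
}
</code></pre>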
<h2>Exceptional customer service, from the platform up</h2><p>Running an AI agent at scale means that we need to finely tune every single layer of our systems: the prompts, the LLM providers, the databases, and all the way through to the containers. Ultimately, the exceptional experience that we give each customer is the result of orchestrating all of these together, seamlessly.</p><p>In the incidents described above, we were all hands on deck to ensure that our AI agent was not abandoning any customers, and to get it back on its feet as quickly as possible. As we keep rapidly scaling, this is an area that needs full-time attention: if these kinds of problems excite you, <a href="https://jobs.ashbyhq.com/gradient-labs/4e398d0a-a718-4127-85dc-7ba25bdd3ca5">we are hiring</a> for a Platform &amp; Security Engineer.</p>]]></content:encoded></item><item><title><![CDATA[We've raised $13M to automate customer operations in financial services]]></title><description><![CDATA[We are thrilled to announce our $13M Series A led by Redpoint Ventures, with participation from Exceptional Capital, Liquid 2, LocalGlobe, Puzzle Ventures, and more]]></description><link>https://blog.gradient-labs.ai/p/weve-raised-13m-to-automate-customer</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/weve-raised-13m-to-automate-customer</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Tue, 08 Jul 2025 14:44:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!k_WJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You can find the full announcement on our home page:</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://gradient-labs.ai/blog/series-a-announcement&quot;,&quot;text&quot;:&quot;Read the full announcement&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://gradient-labs.ai/blog/series-a-announcement"><span>Read the full announcement</span></a></p><p>We're incredibly excited about the road ahead. If you want to deliver superb customer experiences that span the entire customer journey, we'd love to partner with you. We&#8217;re only an email away at <a href="mailto:hello@gradient-labs.ai">hello@gradient-labs.ai</a>.</p><p>If you want to be part of a passionate, smart team tackling some of the most complex automation challenges in financial services &#8211; <a href="https://gradient-labs.ai/careers">join us, we&#8217;re hiring</a>!
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k_WJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k_WJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k_WJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212171,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://blog.gradient-labs.ai/i/167806723?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k_WJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 424w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 848w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 1272w, https://substackcdn.com/image/fetch/$s_!k_WJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7cc4f55e-c635-4658-ab98-c9f4929c2215_2560x1440.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption"></p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[LLMs at Gradient Labs: the perfect blend]]></title><description><![CDATA[We don't pick one language model; we blend many together to get the best result.]]></description><link>https://blog.gradient-labs.ai/p/llms-at-gradient-labs-the-perfect</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/llms-at-gradient-labs-the-perfect</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Wed, 16 Apr 2025 13:08:30 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/637142f7-2d2f-49b1-9d5f-87cbde0dd6d3_2624x3936.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the last two months, there have been announcements for several landmark models: <a href="https://openai.com/index/gpt-4-1/">GPT-4.1</a> on April 14th, <a href="https://ai.meta.com/blog/llama-4-multimodal-intelligence/">Llama 4</a> on April 5th, <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025">Gemini 2.5</a> on March 25th, and <a href="https://www.anthropic.com/news/claude-3-7-sonnet">Claude 3.7</a> at the end of February. Undoubtedly, there&#8217;s a frenetic amount of work going into training the next generation of foundation models, and everything is changing fast.</p><p>This continuous change is a great reminder that <em>evaluating LLMs is hard</em>. Given <strong>one</strong> model, there&#8217;s a plethora of metrics reported on public evaluation datasets that aim to demonstrate its general performance. 
Looking across <strong>many</strong> models, there are often slight differences in which benchmarks are used and how they are compared against competitors. The ultimate question, for task-centric AI agents, is: if a model looks good on paper, will it be good for the <em>specific problems</em> that our agent works on?</p><p>At Gradient Labs, we&#8217;ve taken a slightly different path with respect to how we go about model selection&#8212;this post is an overview of what we do.</p><h2>The ultimate trifecta</h2><p>If you zoom out all the way, there are only three variables at play with all foundation models. Assuming a task that requires a single completion, the trade-off is between:</p><ol><li><p><strong>Quality</strong>: how &#8220;good&#8221; are the outcomes that the model achieves?</p></li><li><p><strong>Latency</strong>: how quickly can the model achieve a result?</p></li><li><p><strong>Cost</strong>: how many tokens does it need to generate to get to the result, and how are those tokens priced?</p></li></ol><p>Unfortunately, the range of models out there often only gives you two of these &#128542;&#8212;sometimes just one!</p><p>At Gradient Labs, we heavily anchor on the first: quality. Primarily, this is because the industry trend on the other dimensions, over time, has been progressively faster, cheaper models that are &#8220;just as smart.&#8221; End-to-end agent latency also has a strong engineering angle that is separate from the model choice&#8212;by parallelising different building blocks of the agent, by pre-emptively running some parts before they are needed, and more.</p>
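<p>To make the parallelisation point concrete, here is a minimal sketch of running two independent building blocks of an agent turn concurrently, so the slower of the two bounds the latency rather than their sum. The building-block functions are hypothetical stand-ins, not our actual skills:</p><pre><code>package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Hypothetical, self-contained stand-ins for two independent agent steps.
func retrieveKnowledge(ctx context.Context, query string) (string, error) {
	return "relevant documents...", nil
}

func classifyIntent(ctx context.Context, message string) (string, error) {
	return "login_issue", nil
}

func main() {
	g, ctx := errgroup.WithContext(context.Background())

	var docs, intent string
	// Both steps only need the customer's message, so they can run at the
	// same time instead of paying for each sequentially.
	g.Go(func() error {
		var err error
		docs, err = retrieveKnowledge(ctx, "I cannot log in to my account")
		return err
	})
	g.Go(func() error {
		var err error
		intent, err = classifyIntent(ctx, "I cannot log in to my account")
		return err
	})
	if err := g.Wait(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(intent, docs)
}
</code></pre>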
<p>However, inside the quality dimension there is a lot to unpack:</p><ul><li><p>An individual building block of an agent might be evaluated in completely different ways&#8212;with binary or multi-class classification metrics, ranking metrics, or response quality reviews. In this arena, the choice of metric can&#8217;t be divorced from what that part of the agent is trying to achieve.</p></li><li><p>The end-to-end customer experience when chatting with an AI agent is determined by the effect of <strong>combining</strong> all of the agent&#8217;s blocks together, so ensuring that rare upstream mistakes do not compound into low-quality downstream responses is critical. This does not require using the same model throughout the whole agent. We surface a suite of tools and product features for this, ranging from simulations all the way through to advanced customer conversation synthesis (more to come on this front soon!).</p></li></ul><h2>Going all-in &#10060; , maintaining optionality &#9989;</h2><p>At Gradient Labs, picking a <strong>single</strong> model to serve all of an AI agent&#8217;s needs, at a time when new models are being announced every month, felt overly constraining. It would mean adopting all of that model&#8217;s strengths and accepting all of its limitations. We avoided this because of:</p><ol><li><p><em>The rising tide</em>. Imagine building an agent using GPT-3 end-to-end. By 2025, no matter how good it was, its overall position would have been eroded, by virtue of being committed to a model that has largely been supplanted. We believe that the same will be true going forward.</p></li><li><p><em>The risk of migrations</em>. Imagine building end-to-end with Sonnet 3.7, and then waking up one morning to the announcement of Sonnet 4. Being committed to a single model would force us into large, uncertain, and risky upgrades where the <em>entire</em> AI agent might need to be migrated onto a new model.</p></li><li><p><em>The risk appetite of our partners</em>. Some of the companies we work with want the latest &amp; greatest live as quickly as possible; others care less about experimental opportunities and more about consistent outcomes. Being flexible enables us to cater to both!</p></li></ol><p>Ultimately, the flexibility that we wanted is for AI Engineers to pick the ideal model for the building block they are working on&#8212;they know best where it fits in the overall puzzle, and which trifecta variables they want to trade off.</p><h2>Uniform interface, reliable service &#10024;</h2><p>While AI Engineers are empowered to pick their model of choice, there are several separate problems that they shouldn&#8217;t need to care about:</p><ol><li><p>Rewriting code to use a different model. We have built an internal abstraction that enables changing models by editing one line, rather than needing to juggle different clients.</p></li><li><p>Observability. We log each completion request that is made, whether it succeeded or failed; this happens inside our internal abstraction and is invisible to AI Engineers.</p></li><li><p>Picking the model&#8217;s provider. While OpenAI and Anthropic models are available directly, many are also available via cloud providers (Azure, AWS, GCP). Retry and fallback behaviour when any of them experience a hiccup, or when we approach our rate limits, sits at the heart of our platform&#8212;far away from the daily work of AI folks.</p></li></ol>
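<p>As a rough illustration of what such an abstraction can look like (our internal client is not public, so every name and signature below is an assumption made for the sketch): a single completion interface, per-request logging inside it, and ordered fallback across providers so that hiccups and rate limits never reach the calling code.</p><pre><code>package main

import (
	"context"
	"errors"
	"log"
	"time"
)

// CompletionClient is a hypothetical uniform interface over model providers.
type CompletionClient interface {
	Complete(ctx context.Context, model, prompt string) (string, error)
}

// completeWithFallback tries each provider in order, logging every attempt,
// so a provider outage or rate limit is handled far away from agent code.
func completeWithFallback(ctx context.Context, providers []CompletionClient, model, prompt string) (string, error) {
	var lastErr error
	for _, p := range providers {
		start := time.Now()
		out, err := p.Complete(ctx, model, prompt)
		log.Printf("completion model=%s took=%s err=%v", model, time.Since(start), err)
		if err == nil {
			return out, nil
		}
		lastErr = err
	}
	return "", errors.Join(errors.New("all providers failed"), lastErr)
}

// fakeProvider simulates a provider so the example runs on its own.
type fakeProvider struct{ name string }

func (f fakeProvider) Complete(ctx context.Context, model, prompt string) (string, error) {
	if f.name == "primary" {
		return "", errors.New("rate limited") // simulate a hiccup
	}
	return "a reply from " + f.name, nil
}

func main() {
	providers := []CompletionClient{fakeProvider{name: "primary"}, fakeProvider{name: "fallback"}}
	out, err := completeWithFallback(context.Background(), providers, "some-model", "Summarise the customer's issue.")
	log.Println(out, err)
}
</code></pre>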
<h2>The perfect blend &#9749;&#65039;</h2><p>When people chat with our AI agent, their experience is driven by a blend of models that are each playing to their unique strengths. Today, that is a blend of Sonnet, Gemini, and GPT models. When a new model is released, we (like many!) very quickly evaluate different parts of the agent to see what can be improved.</p><p>This, combined with a design that focuses on a conversational &amp; diagnostic-oriented approach to resolution, is one of the many reasons why we have seen our agent outperform others when going head to head against them.</p><p>We know, however, that there is still much to build&#8212;subscribe to hear all of our updates!</p>]]></content:encoded></item><item><title><![CDATA[Safe AI agents in high-stakes industries]]></title><description><![CDATA[Our recent presentation at an AI Agents in Finance workshop]]></description><link>https://blog.gradient-labs.ai/p/safe-ai-agents-in-high-stakes-industries</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/safe-ai-agents-in-high-stakes-industries</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Mon, 03 Feb 2025 11:31:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/nLP358r2Ryc" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In December, the team at <a href="https://www.multiply.ai/">multiply.ai</a> invited us to the workshop they hosted on AI Agents at the Google London office.</p><p>The full presentation was recorded, and is embedded below. Key highlights:</p><ol><li><p>How we design our agentic systems at <a href="https://gradient-labs.ai/">Gradient Labs</a>, with our key principle: &#8220;what would a human do?&#8221; When a person is working on a task, there is a lot that we do not document &amp; take for granted. At Gradient Labs, we bake this implicit expertise into our agent so that it does not need to be instructed down to the last detail.</p></li><li><p>What is the most basic thing that you need to get right? Having an AI agent that knows what to do when it receives (or, critically, doesn&#8217;t receive) signals from the outside world. The Gradient Labs agent handles this seamlessly by being built as a <a href="https://en.wikipedia.org/wiki/Finite-state_machine">finite state machine</a> (see the sketch after this list).</p></li><li><p>Is retrieval-augmented generation enough? The promise of RAG hides two important problems: the majority of a company&#8217;s knowledge is undocumented, and it is rare for &#8220;search &amp; reply&#8221; to lead to good outcomes. At Gradient Labs, we&#8217;ve built a suite of agents that can learn from a company&#8217;s historical data.</p></li><li><p>What are AI agent tools? Empowering an AI agent to use tools (APIs) unlocks opportunities for end-to-end automation, but if you wouldn&#8217;t allow an employee to access all of the tools in your company, how should an AI agent do so safely? At Gradient Labs, our agent&#8217;s choices restrict which tools it can and cannot use.</p></li><li><p>How can you prevent both common AI Agent mistakes and industry-specific ones? At Gradient Labs, we guard against both&#8212;no outcome is safer than a wrong one.</p></li></ol>
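<p>To make the finite-state-machine point from the list above concrete, here is a minimal sketch of the idea: every (state, event) pair either has an explicit transition or is rejected, so the agent always knows what to do when a signal arrives, and fails loudly when one does not fit. The states, events, and transitions are illustrative, not the production agent&#8217;s.</p><pre><code>package main

import "fmt"

// Illustrative states and events; the production state machine is far richer.
type State string
type Event string

const (
	AwaitingCustomer State = "awaiting_customer"
	Responding       State = "responding"
	WaitingOnTool    State = "waiting_on_tool"
	HandedOff        State = "handed_off"
)

const (
	CustomerMessaged Event = "customer_messaged"
	ToolCalled       Event = "tool_called"
	ToolSucceeded    Event = "tool_succeeded"
	ToolTimedOut     Event = "tool_timed_out"
	ReplySent        Event = "reply_sent"
	EscalationNeeded Event = "escalation_needed"
)

// transitions makes the agent's behaviour explicit: unknown (state, event)
// pairs are rejected rather than guessed at.
var transitions = map[State]map[Event]State{
	AwaitingCustomer: {CustomerMessaged: Responding},
	Responding:       {ToolCalled: WaitingOnTool, ReplySent: AwaitingCustomer, EscalationNeeded: HandedOff},
	WaitingOnTool:    {ToolSucceeded: Responding, ToolTimedOut: HandedOff},
}

func next(s State, e Event) (State, error) {
	if to, ok := transitions[s][e]; ok {
		return to, nil
	}
	return s, fmt.Errorf("no transition from %q on %q", s, e)
}

func main() {
	s, err := next(AwaitingCustomer, CustomerMessaged)
	fmt.Println(s, err)
}
</code></pre>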
<div id="youtube2-nLP358r2Ryc" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;nLP358r2Ryc&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/nLP358r2Ryc?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>&#128674; If you would like a deeper dive, I gave a more technical version of this presentation at the <a href="https://home.mlops.community/public/videos/llms-and-the-rest-of-the-owl-neal-lathia-agents-in-production">MLOps Agents in Production</a> event.</p>]]></content:encoded></item><item><title><![CDATA[AI agents in 2025]]></title><description><![CDATA[AI agents are all the rage right now, but they&#8217;re really just an implementation detail. Chasing them is like skating to where the puck is instead of where it&#8217;s going.]]></description><link>https://blog.gradient-labs.ai/p/ai-agents-in-2025</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/ai-agents-in-2025</guid><dc:creator><![CDATA[Dimitri Masin]]></dc:creator><pubDate>Fri, 20 Dec 2024 20:25:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ac9b7b05-be78-4d4d-b75d-dc75a8a9a00c_820x467.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>The transitory nature of &#8220;Build-Your-Own&#8221; AI Agents</strong></p><p>We&#8217;re in the thick of the AI agent hype cycle. Companies are trying to build and deploy their own, and startups are promising platforms to do just that. But does anyone really want to manage an army of custom AI agents?</p><p>What companies truly want is an &#8220;API to a brain&#8221;&#8212;a digital employee they can direct. This &#8220;steerable, safe intelligence on tap&#8221; is the real prize.</p><p>From my experience with ML and MLOps, I understand why some think building agents in-house matters. There is a key difference, though: with ML, most models had to be company-specific and trained on unique data. Not so with today&#8217;s generative AI&#8212;it&#8217;s general purpose. As with human employees, the same digital employee would work for both company A and company B. Expecting to build and manage your own AI agents is like running a private university just to hire graduates.
No one would do that.</p><p><strong>Where Will the Puck Be Next December?</strong></p><p>Instead of DIY AI agents, we&#8217;ll see &#8220;API-to-a-brain&#8221; services&#8212;intelligent, contextual, secure endpoints. These will be domain-specific &#8220;brains&#8221; with superhuman skills in areas like engineering, operations, or customer service&#8212;endpoints that you steer rather than armies you must recruit, train, and manage.</p><p>To get there, innovators have discovered that LLMs and data/action layers alone are just two components. True &#8220;steerable intelligence&#8221; requires a lot more. Here are a few examples from the ops domain that we needed to build on our journey there:</p><ul><li><p><strong>Continuous learning algorithms:</strong> Systems must learn like humans, continuously assimilating new information, staying current, unlearning outdated or incorrect facts, and autonomously distilling knowledge from what they take in. This goes beyond traditional fine-tuning and requires robust, dynamic knowledge updating.</p></li><li><p><strong>Deeper understanding:</strong> Standard Retrieval-Augmented Generation (RAG) won&#8217;t cut it for genuine comprehension. We need proprietary knowledge graphs and richer context models that let these brains grasp nuanced user requests and handle intricate domain concepts.</p></li><li><p><strong>Reasoning for specialist skills:</strong> As humans do, intelligent systems need to be taught how to handle deception, manage objections gracefully, deal with vulnerable users, or perform domain-specific assessments&#8212;which is not trivial, given that systems with more reasoning are progressively harder to steer. This is, however, the most important property.</p></li></ul><p>And there are many more examples across other domains, like autonomous research systems or fully autonomous engineers, where action taking and raw LLM output are just the tip of the iceberg.</p><p>Conclusion: AI agents are more of an implementation detail, and highly transitory. Building your own agents is like running your own university just to hire its future graduates. Instead, by next December, the puck will have moved towards out-of-the-box, steerable &#8220;brains&#8221; that can deliver real intelligence on demand.</p>
]]></content:encoded></item><item><title><![CDATA[AI Support Agents - Build vs. Buy?]]></title><description><![CDATA[An evergreen question :-)]]></description><link>https://blog.gradient-labs.ai/p/ai-support-agents-build-vs-buy</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/ai-support-agents-build-vs-buy</guid><dc:creator><![CDATA[Dimitri Masin]]></dc:creator><pubDate>Wed, 27 Nov 2024 12:28:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, several prospective clients have asked for my thoughts on building versus buying AI support agents. My typical response is, "You probably shouldn't listen to anyone who is trying to sell it! However, I'm happy to share some objective thoughts that might help :-)"</p><p>Companies considering building their own AI support agents generally fall into two categories:</p><ol><li><p>Group A: Companies that have been working on AI support agents for over six months with a significant investment in personnel (at least 5-10 people) and have generally made substantial progress.</p></li><li><p>Group B: Companies that want to start building soon (or have just begun 1-2 months ago) or have invested in 2-3 people for the project.</p></li></ol><p><strong>Perhaps surprisingly, it's usually Group A that wants to work with us, even though they've already made good progress.</strong> They've had enough time to realize how challenging it is to do it well and that, over time, there's no strategic advantage to building it in-house as tools become readily available to solve the problem out of the box. As an analogy, I don't think anyone would try to replicate Intercom, Zendesk, or Salesforce in-house these days (it&#8217;s quite obvious in hindsight). It&#8217;s much better to focus on efforts that are specific to your company instead.</p><p>For Group B, I usually offer the following advice, based on the most frequent mistakes I&#8217;ve seen, to help them succeed:</p><ol><li><p>Treat it as a strategic priority and allocate a team of at least 5-10 people full-time to building it.
Don't treat it as a side project!</p></li><li><p>Progress will come as a step change. It takes only 1-2 months to get something that works 70-80% of the time (which isn't useful in production). It then takes 9-12 more months to achieve something that works 90% of the time. Depending on your quality standards, you might not see live results for a long time.</p></li><li><p>Give your team enough space to create that step change. You can't expect gradual progress every month. Also, there is no general blueprint for high-quality results that your team can just copy. Some failed attempts will inevitably happen.</p></li><li><p>While it's tempting to think of this as another platform-building project, you need more than just engineering expertise. You need someone who deeply understands customer support (a domain expert), is willing to read thousands of conversations to improve and test their work, and can instruct LLMs and create sophisticated chains of LLM calls. Let's call this person an "unfussy unicorn AI engineer". You can't approach this project with a pure engineering or classical machine learning mindset, as LLMs behave very differently.</p></li><li><p>Finally, set your team an informed target. To give you a benchmark: the best tools out there can achieve an 80% reduction in handling time for customer ops work&#8212;customer conversations plus the back-office processes that underpin them. Make sure you have a plan for how to get there; otherwise, you might spend a year building something trivial and delay the crucial business impact.</p></li></ol><p>If you're considering building an AI support agent in-house, please reach out. I'm happy to offer a free 30-minute advice session.</p>
]]></content:encoded></item><item><title><![CDATA[Getting to a ‘thank you’ with an AI agent]]></title><description><![CDATA[Creating a natural and helpful conversation with an AI agent requires more than just numbers.]]></description><link>https://blog.gradient-labs.ai/p/getting-to-a-thank-you-with-an-ai</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/getting-to-a-thank-you-with-an-ai</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Wed, 11 Sep 2024 13:34:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aa1194b5-689a-4d44-ba1b-5d8627eb74d0_618x206.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>The world is full of metrics</h2><p>Every launch of a next-generation foundation model is always accompanied by a plethora of metrics: how well the new LLM does, compared to its peers, on a series of tasks that, together, embody a notion of &#8216;performance.&#8217; Other than by direct comparisons (with leaderboards like LMSys), this is the best way that researchers currently have to calibrate how one LLM should rank compared to another.</p><p>The customer support automation arena is not dissimilar. Each automation service is often described in terms of volume, deflection, resolution, and the average customer satisfaction scores it achieves. Together, these capture support automation &#8216;performance&#8217; from a business-oriented perspective, and outside of a direct comparison they at least enable a high-level distinction of sorts across multiple vendors.</p><h2>Each support conversation is a unique moment</h2><p>High-level metrics will always hide the low-level, case-by-case experiences. One facet that remains elusive in how we measure things&#8212;either at the LLM or support automation level&#8212;is a concept of &#8220;human level&#8221; interaction. That is, given two completions, bots, or agents that spit out some text, both of which are <em>technically</em> correct or even semantically similar, how can we tell that one is a more natural, fluent, and seamless response to give than the other? Perhaps both of them would have led to a deflected customer, but (in light of how important customer support is to a company&#8217;s brand) how can we make it a better experience?</p><p>At Gradient Labs, we&#8217;ve been thinking deeply about these questions, less so from the perspective of creating new numbers and more from the point of view of shaping a better experience for our design partner&#8217;s customers. Primarily, we do so by reading a huge number of customer support chats&#8212;both between customers and human agents, and when customers talk to our AI agent. Our ongoing discussions about these currently bring us to three insights, which we describe below.</p><h2>Today&#8217;s chatbots often make customers act robotically</h2><p>A wide range of customer support automation could be described as &#8220;best effort.&#8221; These systems will answer every single customer query by throwing some kind of information at the customer and forcing them to wade through it.
They might have workflows that customers must follow to the letter in order to get anywhere, or require the customer to screw up many times before an escape hatch is offered. Some even encode what seems like a level of keyword matching (&#8220;you must type &#8216;talk to a human&#8217; to be transferred&#8221;).</p><p>Interestingly, <strong>customer</strong> replies in these settings strike us as very robotic. &#8220;Yes.&#8221; &#8220;No.&#8221; &#8220;Talk to a human.&#8221; It&#8217;s as if the customers are bending <em>their</em> communication style to try and speak in the limited language understood by the bot. It can work for simple cases, but as soon as any nuance appears, or even if the customer wants a clarification, these systems tend to fall over. Ultimately, having customers talk in a specific, unnatural way seems to be a symptom of people trying to find their way around poor automation.</p><p>Having seen these in action, we&#8217;d note that these systems will, by definition, have a high response rate and will likely have a high deflection rate&#8212;many customers don&#8217;t figure out how to speak robotically enough to resolve their own problem, and so they give up.</p><h2>Tools to speed up support staff make human agents sound robotic</h2><p>A method that we use frequently at Gradient Labs is to compare AI and human replies, side by side. Our initial opinion was that the human response should be considered a &#8220;gold standard&#8221; for what the AI could achieve. However, we ironically found that (in larger organisations) replies from human agents can also fail to produce a fluid, natural conversation.</p><p>One reason for this is that support staff performance is typically evaluated in ways that implicitly encourage transactional behaviour. If Mike and Alice must take on X support chats per hour, the easiest way to achieve their targets is to reach for the best-matching canned response in each turn. This can work in simpler cases&#8212;where pre-written responses are good enough&#8212;but as soon as there is any nuance or ambiguity in customers&#8217; intents, things go astray.</p><p>At its extreme, we know of cases where human agents have been accused by customers of being bots, when in practice nothing more than canned responses were being sent.</p><h2>Getting to a &#8220;thank you&#8221;</h2><p>There are a lot of qualities that we attend to when making our AI agent more natural-sounding. This goes beyond the standard expectations of AI agents, like making sure that informational replies come from documented sources. It includes more qualitative aspects: replies shouldn&#8217;t be unnecessarily verbose, shouldn&#8217;t ask too many questions at once, shouldn&#8217;t ask about things that have already been said (but sometimes should ask for confirmation about them), should only be apologetic when it&#8217;s relevant, and many more. And one way that we see our AI agent get it right is when customers go so far as saying &#8220;thank you&#8221; for the help they have received.</p><p>Importantly, this is not about obfuscating that the customer is talking to an AI agent or masquerading it as a human&#8212;this is about making the AI agent so easy to talk with that it feels natural to thank it. It&#8217;s somewhat tricky to describe without speaking about the details, but we&#8217;ll close with two anecdotes of this in action.</p><p>Our AI agent had a chat with a customer that was about a fairly complex topic&#8212;understanding which statement month a particular transaction should appear on.
While diagnosing the issue, the customer mentioned that it was their first month with the service, in the context of not having a billing statement history. Our AI agent&#8217;s next reply acknowledged that by welcoming them to the company as part of its answer to the customer&#8217;s actual question, and then worked with them to a resolution. And it ended in a thank you:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2ZPK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57f05e51-1735-4109-b0b2-a1c3b7d92905_618x206.png" alt=""></figure></div>
loading="lazy"></picture><div></div></div></a></figure></div><p>In a separate instance, a customer chatted to our AI agent about a delayed payment. The AI agent clarified details about the payment and then not only used relevant sources to create a reply, but intermingled the context given previously in the conversation with the right knowledge to give the customer a tailored response about about when that payment should be visible. It did not reach for a canned-style response &#8220;all payments take 1-2 days&#8221; or just a directly say &#8220;Thursday&#8221; without explanation, it said something akin to &#8220;our payments typically take 1-2 days to clear, which means Thursday in your case.&#8221; It invited the customer to reach back out if that didn&#8217;t come to fruition. And, again, it ended with a thank you:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kvqv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kvqv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 424w, https://substackcdn.com/image/fetch/$s_!Kvqv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 848w, https://substackcdn.com/image/fetch/$s_!Kvqv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 1272w, https://substackcdn.com/image/fetch/$s_!Kvqv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kvqv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png" width="740" height="242" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:242,&quot;width&quot;:740,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:18873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kvqv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 424w, https://substackcdn.com/image/fetch/$s_!Kvqv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5716cf39-4c6f-481f-a5a9-5bc4585a9d2d_740x242.png 848w, 
<h2>Meeting metrics with amazing experiences</h2><p>The discussion above does not mean that we&#8217;re doing away with metrics&#8212;after all, we&#8217;re a technical team and need to summarise our overall progress in a way we can share &#128200;. But every support conversation is a unique moment for a customer; their reach-out is much more than a number to them.
Creating a fluid, helpful, and easy conversation with an AI agent needs much more than numbers, and making it feel natural to thank the AI is one way that we&#8217;re seeing that come to fruition.</p>]]></content:encoded></item><item><title><![CDATA[Making customer support automation as simple as writing a document]]></title><description><![CDATA[Our vision on automation of procedural work in customer ops]]></description><link>https://blog.gradient-labs.ai/p/making-customer-support-automation</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/making-customer-support-automation</guid><dc:creator><![CDATA[Dimitri Masin]]></dc:creator><pubDate>Tue, 23 Jul 2024 06:45:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine automating a complex customer support process by simply describing it in plain English and running it as cheaply as machine code, without needing to hire people for months or roll out new processes for weeks. You might not have to wait much longer!</p><p>We believe that over the next 5 years, there will be the first bank, food delivery service, or travel agency with most of its customer ops work done fully autonomously. This will unlock an unprecedented ability to scale, and unprecedented agility. This new breed of companies will have a fundamental advantage when it comes to cost and quality.</p><p>Let's first consider why this was not possible until now and what has changed.</p><p></p><p><strong>Today, repetitive work gets done manually or automated with workflows</strong></p><p>There are two fundamental ways to complete repetitive customer support tasks in a company: through humans or machines.</p><ul><li><p>Humans need to be hired, trained, and given Standard Operating Procedures (SOPs) to follow.</p></li><li><p>Machines currently only execute deterministic tasks/flows that can be expressed as code.</p></li></ul><p>Workflow automation tooling has been widely adopted by many forward-thinking businesses in an attempt to make automation through code simpler and more accessible. Some of these tools have a graphical interface to enable non-technical staff to contribute to process automation.</p><p>However, this approach has largely failed to replace a huge proportion of the manual, repetitive labor in customer ops organisations.</p><p></p><p><strong>Let's consider a simple example to better understand the limits of current workflow automation approaches</strong></p><p>"I can't login" is a great example that applies to almost all businesses that offer a digital account&#8212;it is along the same lines as "I can't sign up," "Why was my transaction declined?" or "I can't find my deposit." Customers are reporting a problem which may have a variety of causes and a variety of solutions.
In the average enterprise setting, more than 50% of support queries are the types of issues that require a non-linear set of steps to resolve (aka &#8220;Troubleshooting issues&#8221;).</p><p>A human customer support agent would receive guidance that might look something like this (simplified version for illustration purposes):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!idZ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd17bf175-9e3c-41fc-95cf-757e9a81de6e_1172x482.png" alt=""><figcaption class="image-caption">Standard Operating Procedure that covers Login Issues</figcaption></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Standard Operating Procedure that covers Login Issues</figcaption></figure></div><p>It's fast to write, easy to understand, and simple to edit by a domain expert. Let's now look at the same procedure represented as a deterministic dialog workflow where a customer can interact with a bot:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CS46!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CS46!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 424w, https://substackcdn.com/image/fetch/$s_!CS46!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 848w, https://substackcdn.com/image/fetch/$s_!CS46!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 1272w, https://substackcdn.com/image/fetch/$s_!CS46!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CS46!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png" width="1456" height="624" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:624,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179510,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CS46!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 424w, https://substackcdn.com/image/fetch/$s_!CS46!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 848w, https://substackcdn.com/image/fetch/$s_!CS46!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 1272w, https://substackcdn.com/image/fetch/$s_!CS46!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F514e61e2-3d3d-4b17-9324-fc8096f57496_1828x784.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Simplified login troubleshooting workflow</figcaption></figure></div><p>For simplicity, we have skipped most of the paths in that tree, however, modelling all the relevant scenarios and edge cases mentioned in the SOP above would require 60-80 workflow elements in a carefully arranged order. 
And that's a trivial example.</p><p></p><p><strong>Procedural AI agents provide a vastly superior customer experience</strong></p><p>For workflows to help customers resolve their issues, customers need to correctly pick the triage tree entries that will lead them to the right resolution path. While this can work well in a simple linear workflow, it often leads to frustrating experiences where customers need to navigate the triage tree back and forth to find the relevant entries. I&#8217;m sure we've all been there and given up occasionally. &#128522;</p><p>Procedure-following AI agents provide a more fluid and enjoyable experience, similar to talking to a human. At the current rate of progress, it's also easy to imagine that experience becoming superhuman in 2-3 years. AI agents will be able to figure out the straightest possible path to the resolution, and everything will happen in real time.</p><p>Beyond the experience of executing a single workflow or SOP, what really matters for the customer experience is how those are triggered and the ability to narrow down to the genuinely relevant issue. To use the login issue example above, if the customer says, "I can't login and see a red banner in the app," should they really be shown the whole login issues workflow with all of its irrelevant choices, when an intelligent operator would jump straight to offering a solution? Unlike AI agents, workflows are generally not able to rely on conversational context.</p><p></p><p><strong>Workflows are complex to create, maintain, and test</strong></p><p>Anyone who has had to model even a moderately complex process through workflows knows how laborious and complicated it is. The writer needs to painstakingly enumerate all the possible paths in the tree and make sure there are no loops. Once created, the burden doesn't stop there: try keeping up with business changes when there are hundreds of connected boxes in front of you and you need to adjust exactly the right ones.</p><p>SOPs, on the other hand, are much more scalable because they are written with intelligent operators in mind. These operators can fill in the gaps, make common-sense assumptions, and hold relevant company and product context in their memory, which the SOP itself does not need to be explicit about. We have found that even smaller enterprises usually grow quickly to having <em>hundreds</em> of SOPs documenting different customer support procedures.</p><p></p><p><strong>Workflow tooling is too technical for domain experts (and too visual for engineers)</strong></p><p>Despite the promise of low- or no-code tools, ops staff often cannot operate workflows by themselves&#8212;they need to call in engineers to configure the workflow to do anything meaningful. Conversely, engineers regularly feed back that they prefer to use code rather than draw boxes on the screen or copy &amp; paste API endpoints into a form. Hence, the ideal user profile for workflow creation and maintenance is rare in organisations.</p><p></p><p><strong>The solution? Procedural AI agents</strong></p><p>At Gradient Labs, we&#8217;re building AI agents that automate manual, repetitive work. As part of this, we&#8217;re doing away with the concept of box &amp; arrow workflows altogether. Instead, we&#8217;re developing an engine for AI agents to safely follow SOPs that are written in plain English.
Drawing from the lessons above, we&#8217;re aiming for:</p><ul><li><p>Customers to get a faster, more fluid, and more enjoyable experience, similar to talking to a human;</p></li><li><p>Ops domain experts to be empowered to own the logic of automation in an accessible way, similar to instructing a colleague; and</p></li><li><p>Engineers to be able to give the AI agent relevant data and actions in the same way that they ship code to production.</p></li></ul><p>Put together, AI agents that can follow procedures (SOPs) have the potential to unlock meaningful automation that was never feasible before. Based on our estimates, it's possible to automate 70-80% of today's manual customer ops work, and most likely even more as the technology advances over the years to come. The key is to combine the simple yet expressive authoring of SOPs with the ability to execute them fully autonomously through AI agents.</p><p></p><p>If you are equally excited about defining the future of automation, either by becoming a part of our team or as a potential customer, please don't hesitate to reach out. &#128522;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://gradient-labs.notion.site/Careers-Gradient-Labs-e2b70580480e4e6dbe0895873bc2fb00&quot;,&quot;text&quot;:&quot;Join us&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://gradient-labs.notion.site/Careers-Gradient-Labs-e2b70580480e4e6dbe0895873bc2fb00"><span>Join us</span></a></p>]]></content:encoded></item><item><title><![CDATA[Building agentic workflows]]></title><description><![CDATA[How we're navigating the abstract process of designing and building AI agents]]></description><link>https://blog.gradient-labs.ai/p/building-agentic-workflows</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/building-agentic-workflows</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Mon, 24 Jun 2024 15:01:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bupe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d9ff06-4a3b-47ed-a38b-dcd1db1eddbb_1268x642.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At Gradient Labs, we&#8217;re building a suite of AI agents that automate manual, repetitive work&#8212;starting with customer service. There are two major areas to this: the platform that enables AI agents to operate (which we &#129417; <a href="https://blog.gradient-labs.ai/p/drawing-the-rest-of-the-owl">previously wrote about</a>), and the AI agents themselves.</p><p>We&#8217;re always soaking up as many blog posts about AI agents as we can find: recently, we&#8217;ve seen that a lot of them have been focusing on the criteria to <em>qualify software as an AI agent</em>&#8212;is it just code that makes more than one LLM call, or perhaps something else?
As Andrew Ng <a href="https://info.deeplearning.ai/apples-gen-ai-strategy-stabilitys-copyright-clear-audio-generator-international-safety-agreements-llms-play-doctor-1">wrote</a>, perhaps the technical definition of &#8220;agent&#8221; doesn&#8217;t matter as much as the capabilities and impact that can be achieved via agentic systems, and so instead of thinking about <em>what</em> agents are, we&#8217;re back to the age-old question of <em>organising a group of people</em> to build a novel type of system in a safe and scalable way&#8212;starting from its design and going all the way through to where we write what code.</p><h2>&#8220;What would a human do?&#8221;</h2><p>Humans are really good at reasoning <em>by instinct</em> when working on abstract tasks. For example, if you were a support agent and a customer asked &#8220;hey, can you change my flight?&#8221; then you may <em>instinctively</em> redirect them elsewhere if you work at a food delivery company&#8212;you know, without having been told, that the question is out of bounds. Or, if you did happen to work at a travel company, you might <em>instinctively</em> know to ask the customer clarifying questions if they have no bookings in the system, instead of saying yes but then not knowing what to change.</p><p>LLMs, on the other hand, while increasingly powerful (as <a href="https://arxiv.org/abs/2312.01044">zero-shot classifiers</a>, as <a href="https://arxiv.org/abs/2402.02716">planners</a>, as summarisers, and <a href="https://arxiv.org/abs/2304.03442">beyond</a>), tend to do better when given <a href="https://arxiv.org/abs/2401.14423">clear,</a> <a href="https://trigaten.github.io/Prompt_Survey_Site/">explicit</a> instructions rather than overly abstract tasks. It&#8217;s currently infeasible to expect LLMs to &#8220;instinctively&#8221; do the right thing within a realistic setting. That means that a starting point when designing an agentic system is to try to map out, in broad brushstrokes, what these implicit instincts that humans have might be. For example, given a question from a customer, humans might implicitly answer a range of questions: is it clear? Is it relevant? Is it ambiguous? Is it threatening? Is it a greeting or a farewell? Does it sound vulnerable? Each one of those questions can be encoded as a discrete, explicit task for an LLM to tackle.</p><p>From the perspective of someone <em>building</em> an agentic system, it&#8217;s then much easier to evaluate and reason about individual components, rather than treating the &#8220;AI&#8221; as an impenetrable black box. For us, from a technical perspective, each of these skills is a <a href="https://docs.temporal.io/activities">Temporal Activity</a> so that it can be automatically retried when it fails.</p><p>After building a family of skills, the exercise becomes one of composing skills together and getting them to be executed effectively, where they can be interleaved with business logic to suit different companies and conversational scenarios. To do that, we build at two levels of abstraction: the task level (as a state machine), and the LLM level (as chains of skills).</p>
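<p>To make that concrete, here is a rough sketch, in Go, of what one such skill can look like when written as a plain function and registered as a Temporal Activity. The names and the prompt are illustrative, not our production code:</p><pre><code>package skills

import (
	"context"
	"fmt"
)

// LLMClient is a minimal interface over whichever language model API the
// skill uses; a concrete client is injected when the Temporal worker starts.
type LLMClient interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// ClarityInput is the customer message we want to assess.
type ClarityInput struct {
	CustomerMessage string
}

// ClarityResult is the discrete label this skill produces.
type ClarityResult struct {
	IsClear bool
	Answer  string
}

// ClaritySkill answers one narrow, explicit question: is this message clear
// enough to act on? Registered as a Temporal Activity, it is retried
// automatically when the underlying LLM call fails or is rate limited.
type ClaritySkill struct {
	LLM LLMClient
}

func (s *ClaritySkill) ClassifyClarity(ctx context.Context, in ClarityInput) (ClarityResult, error) {
	prompt := fmt.Sprintf(
		"Decide whether the following customer message is clear enough to act on. Answer YES or NO.\n\n%s",
		in.CustomerMessage,
	)
	answer, err := s.LLM.Complete(ctx, prompt)
	if err != nil {
		// Returning the error lets Temporal's retry policy handle transient failures.
		return ClarityResult{}, err
	}
	return ClarityResult{IsClear: answer == "YES", Answer: answer}, nil
}
</code></pre><p>The skill itself stays oblivious to retries, timeouts, and observability: those come from registering it on a Temporal worker and invoking it from a workflow, rather than from anything inside the function.</p>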
<h2>Tasks as state machines</h2><p>Imagine an extremely simplistic representation of a conversation between me and you. There could be three states:</p><ul><li><p><em>I&#8217;m listening &#8212;</em> a state where you speak, and all I need to do is to keep up with what you&#8217;re saying (and not get distracted &#128563;&nbsp;)</p></li><li><p><em>I&#8217;m talking &#8212;</em> a state where I need to figure out what to say, and then say it (hoping to say something relevant &#128519;&nbsp;)</p></li><li><p>The end state, where the conversation has finished after you say goodbye &#128075;&#127997;</p></li></ul><p>The states and a minimal set of transitions could look like this:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bupe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d9ff06-4a3b-47ed-a38b-dcd1db1eddbb_1268x642.png"><img src="https://substackcdn.com/image/fetch/$s_!bupe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F12d9ff06-4a3b-47ed-a38b-dcd1db1eddbb_1268x642.png" width="1268" height="642" alt="A minimal conversation state machine"></a></figure></div>
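<p>Sketched in Go, purely as an illustration (the state and event names are made up for this example), that minimal machine is little more than an enum and a transition function:</p><pre><code>package conversation

// State captures where the agent is in the dialogue.
type State int

const (
	Listening State = iota // you speak; all we do is keep up with what you're saying
	Talking                // we need to figure out what to say, and then say it
	Ended                  // the conversation has finished
)

// Event is something that happens in the outside world.
type Event int

const (
	CustomerFinishedSpeaking Event = iota
	ReplySent
	CustomerSaidGoodbye
)

// Next encodes the minimal set of transitions from the diagram above.
func Next(s State, e Event) State {
	switch {
	case s == Listening && e == CustomerFinishedSpeaking:
		return Talking
	case s == Talking && e == ReplySent:
		return Listening
	case e == CustomerSaidGoodbye:
		return Ended
	default:
		return s // ignore events that do not apply in the current state
	}
}
</code></pre>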
<p></p><p>Agentic workflows can also be formulated as (much more) complex state machines. This enables us to reason about the messy world of dialog, catering for things like stopping any work if a human agent steps in, reaching back out to customers who seem to have stopped responding, invoking tools, waiting for a human-in-the-loop to approve an action, and beyond.</p><p>From our point of view as builders, state machines are a well-established pattern that facilitates reasoning about the high-level behaviour of our agent while staying away from the non-deterministic LLMs that sit under the hood. We represent each conversation in our platform as a <a href="https://docs.temporal.io/workflows">Temporal Workflow</a>, where states can receive <a href="https://docs.temporal.io/encyclopedia/application-message-passing#signals">signals</a> when external events occur that require the workflow to interrupt whatever it is currently doing and move into a different state.</p>
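<p>As a heavily simplified sketch of that idea, building on the state and event types above (the signal name and the activity are illustrative, not our actual schema), a conversation workflow can wait on both a human hand-off signal and the next conversation event:</p><pre><code>package conversation

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// HumanTakeoverSignal is the (illustrative) signal sent when a human agent steps in.
const HumanTakeoverSignal = "human-takeover"

func ConversationWorkflow(ctx workflow.Context, conversationID string) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Hour,
	})

	takeovers := workflow.GetSignalChannel(ctx, HumanTakeoverSignal)
	state := Listening

	for state != Ended {
		selector := workflow.NewSelector(ctx)

		// If a human agent steps in, stop whatever we are doing.
		selector.AddReceive(takeovers, func(c workflow.ReceiveChannel, _ bool) {
			c.Receive(ctx, nil)
			state = Ended
		})

		// Otherwise, wait for the next conversation event (an illustrative activity
		// that blocks until the customer sends a message, goes quiet, and so on).
		var event Event
		selector.AddFuture(workflow.ExecuteActivity(ctx, "AwaitNextEvent", conversationID), func(f workflow.Future) {
			if err := f.Get(ctx, &event); err == nil {
				state = Next(state, event)
			}
		})

		selector.Select(ctx)
	}
	return nil
}
</code></pre>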
<h3>We&#8217;ve stayed away from multi-agent collaboration, for now</h3><p>A popular topic that we&#8217;ve noted in papers and blog posts is <a href="https://github.com/geekan/MetaGPT">multi-agent systems</a>. One way these are built is by decomposing the work required to perform a task into different personas. For example, generating software (which is one of the most popular topics in the literature these days!) could be broken down into the &#8220;product manager&#8221; persona, the &#8220;software engineer&#8221; persona, the &#8220;quality tester&#8221; persona, and so on.</p><p>At the end of the day, these are still LLM calls stacked into a hierarchy&#8212;just organised in a slightly different way. However, from a builder&#8217;s perspective, mapping agentic workflows <em>back</em> into the roles that humans developed to do the same tasks (as opposed to the skills that are used to accomplish those tasks) <em>feels</em> like an anti-pattern&#8212;but we might be wrong. Is it the product manager or the software engineer who is in charge of clarifying ambiguous requirements?</p><p>At this early stage, this is a level of ambiguity and complexity that we have naturally veered away from, and our agentic workflows are assumed to be able to <a href="https://openreview.net/forum?id=CbsJ53LdKc">impersonate</a> all of the skills they need to automate the entirety of the task they are designed to handle. That means we spin up a new agent when we have a new high-level task to automate (say, a content editor).</p><h2>The agent: planning, reasoning, executing</h2><p>Looking back at that simplistic state machine, it might still look pretty vague. What does the &#8220;I&#8217;m talking&#8221; state actually translate into? This is the part of our system that most closely resembles the emerging arena of <a href="https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance/">agentic design patterns</a>:</p><ul><li><p>The planner inspects the current conversation and the available chains to pick the best chain and back out of ineligible conversations;</p></li><li><p>The best chain is executed first (falling back to other options);</p></li><li><p>Each chain might have a series of classification, search, reasoning, and completion steps&#8212;and emits one or more <em>events</em> (such as &#8220;I decided X&#8221; or &#8220;I classified this as Y&#8221;) and <em>operations</em> (e.g. &#8220;say X&#8221;, &#8220;call tool Y&#8221;, or &#8220;hand this conversation off to Z&#8221;);</p></li><li><p>Before these operations are acted on, they are checked&#8212;with guardrails and other consistency checks to make sure the agent rarely does contradictory or undesired things.</p></li></ul><p>From a builder&#8217;s perspective, this ends up as coding up a large, non-deterministic <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>. By starting at the top (the planner) and inspecting the events, we can trace all of the decisions that the workflow has made to reach its conclusion.</p><p>One of the neat things about this structure is that we don&#8217;t need to reason about higher-level questions (such as &#8220;what if the customer writes in again midway through planning?&#8221;) while we implement these skills. Equally, we don&#8217;t need to reason about lower-level questions (such as &#8220;what if this LLM call is rate limited?&#8221;) because we can just error and let Temporal take care of retries. Most importantly: we can run this agent as both a live agent, responding to customers, and a simulated agent&#8212;just asking it &#8220;what would you do next here?&#8221;&#8212;which unlocks our entire approach to evaluation.</p>
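<p>Loosely sketched in Go (the types here are illustrative, not our real ones), the events, operations, and guardrail checks described above might take a shape like this:</p><pre><code>package agent

import "context"

// Event records a decision the agent made ("I decided X", "I classified this as Y").
type Event struct {
	Kind   string
	Detail string
}

// Operation is something the agent proposes to do ("say X", "call tool Y", "hand off to Z").
type Operation struct {
	Kind    string
	Payload string
}

// Guardrail vetoes operations that are contradictory or undesired.
type Guardrail interface {
	Allow(ctx context.Context, op Operation, history []Event) bool
}

// Check runs every proposed operation through every guardrail and returns
// only the ones that survive, so the caller can act on them.
func Check(ctx context.Context, ops []Operation, history []Event, guards []Guardrail) []Operation {
	var allowed []Operation
	for _, op := range ops {
		ok := true
		for _, g := range guards {
			if !g.Allow(ctx, op, history) {
				ok = false
				break
			}
		}
		if ok {
			allowed = append(allowed, op)
		}
	}
	return allowed
}
</code></pre>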
<h2>Putting it together</h2><p>As we continue to grow, finding a way to structure our agentic workflows is enabling us to map the parts of the agents that we build back onto the disciplines in our team, such that each person plays to their strengths, the level of abstraction they want to work with, and their domain expertise. Having this structure has&#8212;to date, one year in&#8212;really helped to cement everything from our high-level design approach (&#8220;what would a human do?&#8221;) through to the lowest-level details (how we structure our code bases).</p><p>Have you taken a similar approach, or have you found a different path that works for you? We&#8217;d love to chat about it!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Going beyond RAG for customer support conversations]]></title><description><![CDATA[Chasing our ambition for superhuman quality support]]></description><link>https://blog.gradient-labs.ai/p/going-beyond-rag-for-customer-support</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/going-beyond-rag-for-customer-support</guid><dc:creator><![CDATA[Dimitri Masin]]></dc:creator><pubDate>Mon, 10 Jun 2024 13:24:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cba38be6-3e65-4d6e-9e4d-bd1bc160da0f_4800x2512.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In a <a href="https://blog.gradient-labs.ai/p/are-ai-agents-just-rag-in-disguise">recent post</a>, we discussed how Retrieval Augmented Generation (RAG) is just one piece of the puzzle in building customer support AI agents. Today, we'll dive a bit deeper into these issues&#8212;at Gradient Labs, we want to empower our customers to deliver <em>superhuman</em> AI support experiences to their end users.</p><p>The key insight? Real-world support conversations are a lot messier than what RAG can usefully handle.</p><h3><strong>Human support agents seek to understand intent</strong></h3><p>RAG agents rely on semantic similarity search to ground their answers in truth. While this helps to avoid hallucinations, they often provide only "best effort" replies as long as enough semantically similar content is found. In effect, a lot of RAG is tuned for a <em>single</em> question and reply rather than a long-form conversation.</p><p>Human agents, however, are expected to <em>understand</em> the true, specific intent behind a customer&#8217;s query. They have a non-fuzzy world model of the company&#8217;s products, processes, and related user activities, which then allows them to pattern-match a customer&#8217;s question to that world model in a more structured and precise way. For instance, if a customer says, &#8220;Why can&#8217;t I pay?&#8221;, it&#8217;s easy for a human to reason about what&#8217;s missing in that query and to instinctively ask for clarification about the missing, implied information in order to help effectively.</p><h3><strong>Taking customer queries at face value is inaccurate</strong></h3><p>Customers can make statements that are technically inaccurate, like saying, &#8220;My card is broken,&#8221; when they mean a transaction was declined. Or they might say, &#8220;Somebody took my money,&#8221; when they simply forgot about a prior transaction.
In extreme cases, customers might try to be deceitful, particularly in fraud scenarios.</p><p>RAG approaches take these statements at face value, often resulting in irrelevant or harmful responses.</p><h3><strong>Implicit knowledge from human agents&#8217; experience is invaluable</strong></h3><p>Even the best-maintained knowledge bases have gaps and can become outdated&#8212;permanently, when a product changes, or temporarily, when a marketing campaign is being run for a day.</p><p>Human agents rely heavily on their shared experience from handling numerous cases, from their training, and from company-wide announcements that may not be documented. Experienced human agents can quickly deduce common patterns, like symptom X usually leading to outcome Y, even if X can theoretically arise from other causes. Standard RAG agents lack this practical shortcutting ability and tend to perform poorly in troubleshooting scenarios.</p><p>While the underlying documents could be updated with a lot of effort to include such information, there are also other approaches, as outlined in a <a href="https://research.google/blog/amie-a-research-ai-system-for-diagnostic-medical-reasoning-and-conversations/">research paper</a> from Google on medical diagnosis. We&#8217;ve found that AI agents can extract a lot of value from reading historical conversations and building their own facts. One needs to be particularly careful with this approach, however, in order not to extract wrong or irrelevant information.</p><h3><strong>Complex scenarios</strong></h3><p>Customers often describe complex situations with partially irrelevant details, where human agents first identify &#8220;the crux of the issue&#8221; before responding. For example: &#8220;I was travelling abroad last week and paid my hotel with my card. They charged &#163;1000 and said &#163;300 would be refunded. I had an additional restaurant bill of &#163;100, but they refunded only &#163;150.&#8221;</p><p>We have seen first-hand that standard RAG agent approaches do not reply with anything helpful in such situations and can cause confusion and frustration. While the retrieved documents might provide the relevant context on what to expect with hotel card reservations, RAG agents will not try to apply appropriate reasoning techniques by default. To solve these cases, they need to recognise them explicitly and switch to a different &#8220;reasoning mode&#8221; internally. To a human agent, on the other hand, it&#8217;s immediately obvious that the customer was expecting a refund of &#163;200 but has received only &#163;150.</p><h3><strong>Putting it all together</strong></h3><p>Real-world customer support involves multi-turn conversations, unlike the single-turn question-answer pairs typical of RAG. These conversations progress through multiple phases. They usually begin by understanding the &#8220;real&#8221; customer intent, which might involve asking clarifying questions to grasp the situation correctly before providing an answer. Even after providing an answer, the conversation often continues, with customers seeking further clarification or adding more details. The multi-turn agent must navigate these phases fluently to deliver a magical support experience, akin to the best human agents.</p><p>Until we address these challenges, AI-driven conversations will feel <em>deceptively</em> good but will be riddled with issues due to a lack of real context understanding.</p><p>We envision a future with superhuman quality support experiences.
If you&#8217;re excited about that vision and the challenges ahead that we&#8217;ve outlined, we&#8217;d <a href="https://gradient-labs.notion.site/Careers-Gradient-Labs-e2b70580480e4e6dbe0895873bc2fb00">love to hear from you</a>!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Are AI agents just RAG in disguise? 🙈]]></title><description><![CDATA[Spoiler alert: no.]]></description><link>https://blog.gradient-labs.ai/p/are-ai-agents-just-rag-in-disguise</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/are-ai-agents-just-rag-in-disguise</guid><dc:creator><![CDATA[Danai Antoniou]]></dc:creator><pubDate>Mon, 13 May 2024 15:07:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Typical AI agent demos are awash with examples of bots that answer simple questions. These types of bots, when applied in a customer support setting, act as a first line of help for companies that are scaling: customers can get quick <em>best-effort</em> answers, and companies can be a little less inundated. In 2024, the main approach that is in vogue to build this type of capability is <em>Retrieval Augmented Generation</em>, or <a href="https://arxiv.org/abs/2005.11401">RAG</a>, with large language models.</p><p>At <a href="https://gradient-labs.notion.site/Careers-Gradient-Labs-e2b70580480e4e6dbe0895873bc2fb00">Gradient Labs</a>, we are building an operating system of AI agents that automate manual, repetitive work&#8212;starting with customer service. We therefore could not escape investigating RAG in depth as we started out, and we are often asked whether this is the main technological approach that we are working on.</p><h3><strong>RAG is a more general solution than existing methods</strong></h3><p>The history of the tech that underlies question-answering in customer service has been one of progressive automation. Fifteen years ago, &#8220;bots&#8221; were likely nothing more than manually curated flowcharts under the hood. Ten years ago, they were probably starting to be powered by basic, custom machine learning classifiers. Five years ago (when we <a href="https://www.slideshare.net/neal.lathia/using-language-models-to-supercharge-monzos-customer-support">built a bot called Monzo Helper</a>), the latest systems were powered by the first wave of pre-trained models like <a href="https://arxiv.org/abs/1810.04805">BERT</a>. And now, the headline-grabbing approach is RAG with LLMs.</p><p>The crux of the approach is to inject relevant search results (<em>retrieval-augmented</em>) as input context for LLMs to produce (<em>generate</em>) answers. <a href="https://arxiv.org/abs/2312.10997v3">A lot has been written</a> about how RAG succeeds and fails, and there&#8217;s a <a href="https://datamachina.substack.com/p/data-machina-235">growing literature</a> of practical resources on how to get more out of it.
In effect, RAG promises to be a huge leap towards a <em>general</em> solution for question-answering: index documents into your vector store of choice, link it up with your favourite LLM, and you&#8217;re off to the races.</p><h3><strong>RAG is a thin slice of the customer support problem space</strong></h3><p>We originally thought that RAG was the no-brainer starting point for working with our design partners. But, for some of them, RAG would not automate a meaningful amount of their work. We&#8217;re now, broadly, dividing the companies we work with across several intersecting groups:</p><ul><li><p><strong>General information</strong>: companies with broad, diverse, complex, or multi-featured digital products tend to have customer demand that is dominated by questions seeking information (&#8220;how do I&#8230;?&#8221;)</p></li><li><p><strong>Personal information: </strong>there are companies with inbound demand that is dominated by requests about the customers&#8217; personal situation&#8212;their account, their booking, their transaction, their order (&#8220;what is the status of my&#8230;?&#8221;)</p></li><li><p><strong>Procedural:</strong> there is a large segment of companies where agents&#8217; work is dictated by procedures, which are characterised by a combination of investigative and action-taking work&#8212;refunds, upgrades, cancellations, modifications and more (&#8220;can you&#8230;?&#8221;)</p></li></ul><p>Many companies have a blend of all three, with the balance tipped one direction or another based on what the core business is. For example, tech companies that offer or mediate services in the real world are often more characterised by procedural inbound demand than by information-seeking queries. For the category of businesses that get little to no <em>general</em> information-seeking questions, standard RAG would have very little impact. For companies that are predominantly asked about personal information, standard RAG would lead to frustrating and long-winded customer experiences.</p><p>And so a wider set of capabilities needs to be built. Tool use and procedure orchestration and execution are front runners here, as well as the meta-capabilities of knowing <em>when</em> to use which approach to solve a specific kind of query.</p><h3><strong>RAG has dangerous, niche failure modes</strong></h3><p>Okay, so there&#8217;s still a slice of the market where RAG could be useful. But how well is it working out? There&#8217;s a growing list of public cases where bots (that <em>might</em> be using some form of RAG under the hood?) <a href="https://www.bbc.com/travel/article/20240222-air-canada-chatbot-misinformation-what-travellers-should-know">write answers that are wrong</a>. Understanding, diagnosing, and mitigating these errors is a nuanced exercise that requires looking at the entire stack of RAG components.</p><p>At the highest level, it is immediately clear that the quality of the document corpus that is available to run RAG over is critical, since RAG aims to generate answers from that input. Mostly, however, documents are written <em>for human consumption</em>&#8212;either publicly, as articles that get published online, or privately, in internal company knowledge bases.
Chunking and indexing documents in a vector store and hoping for the best readily results in generating answers where customers are told to get in touch (which is <em>literally</em> what they are already doing) or disclosing information that is meant to be internal-only, which could lead to <a href="https://en.wikipedia.org/wiki/Proceeds_of_Crime_Act_2002#Part_7">breaking the law</a> in regulated environments. And this does not even touch on the problem that a lot of company corpora are not only outdated but also largely incomplete.</p><p>Consider a more nuanced example: a customer writes into a multinational fintech asking &#8220;how do I open an account?&#8221; A RAG system with all of the usual bells and whistles may find a document on opening accounts, and generate a reply enumerating the required steps. At first glance, this may look great! But, digging deeper, what if that customer was writing in from a country <em>where that company does not operate</em>? Suddenly the answer is not just incorrect, it is misleading. In this case, the &#8216;right&#8217; thing for the AI agent to do would have been to discover, diagnose, and reason about facts that are <em>absent</em> in the originating query&#8212;perhaps more akin to the design thinking that is applied in the <a href="https://research.google/blog/amie-a-research-ai-system-for-diagnostic-medical-reasoning-and-conversations/">context of medical diagnosis</a>&#8212;a capability that goes well beyond out-of-the-box RAG.</p><p>The &#8216;right&#8217; answer, in other cases, may be no answer at all. Consider a customer who is asking an informational query but exhibiting the hallmark signs of financial distress or <a href="https://www.fca.org.uk/publications/finalised-guidance/guidance-firms-fair-treatment-vulnerable-customers">vulnerability</a>. The right thing to do is to identify vulnerability as the <em>overarching</em> problem and redirect the customer to the right team, not answer their informational query.</p><h3><strong>Squaring the circle</strong></h3><p>There is no doubt that multi-turn question-answering is an important quality of a <em>fully capable AI agent</em>, albeit a smaller one in many cases than is commonly believed. RAG is also on its way towards becoming a commodity SaaS technology itself&#8212;but the experiments we ran here also urge caution against treating RAG&#8217;s standard formulation as a magical solution. The research literature on this topic is growing, and we&#8217;re following it as closely as we can&#8212;pulling promising ideas and testing them out as we iterate and refine.</p><p>But it is only one piece of the wider picture: AI agents that automate manual, repetitive work will need to go well beyond the scope of today&#8217;s RAG.
To follow along on our journey, subscribe below!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[Drawing the Rest of the Owl 🦉]]></title><description><![CDATA[Backend Engineering at Gradient Labs &#129404;]]></description><link>https://blog.gradient-labs.ai/p/drawing-the-rest-of-the-owl</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/drawing-the-rest-of-the-owl</guid><dc:creator><![CDATA[Dan Upton]]></dc:creator><pubDate>Fri, 03 May 2024 09:24:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>We are building an operating system of AI agents that automate manual, repetitive work&#8212;starting with customer service.</em></p><p>A natural starting point for building AI agents is to think about prompting large language models (LLMs). But what else needs to happen? The software engineering practices of taking agents into production and turning their output into automated work are complex and not well trodden. And handling LLMs is only <em>one slice</em> of what otherwise needs to become an end-to-end system that integrates with companies that operate a host of diverse systems.</p><p>The entire arena beyond LLM prompting is what we, at Gradient Labs, are affectionately calling&nbsp;<a href="https://knowyourmeme.com/memes/how-to-draw-an-owl">&#8220;the rest of the owl.&#8221;</a> Or, more simply: <em>our backend platform</em>.</p><p><strong>The core of our backend platform now has five areas</strong></p><p>The crux of our backend platform is similar to what we have used in the context of building banks and infrastructure automation software: Go services.</p><ul><li><p><strong>External-facing</strong> <strong>services.</strong> AI agents need to respond to and interact with the outside world. These services bridge between it and our platform. They connect us to a growing range of support platforms and enable companies to integrate directly <a href="https://api-docs.gradient-labs.ai/">with our API</a>.</p></li><li><p>There is a growing range of <strong>resources</strong> (conversations, documents, procedures, tasks) that form the core of our platform.
Each one lives in its own service, with a&nbsp;<a href="https://martinfowler.com/bliki/BoundedContext.html">bounded context</a>.</p></li><li><p><strong>A <a href="https://en.wikipedia.org/wiki/Finite-state_machine">finite-state machine</a></strong> that models conversations and is responsible for triggering our first AI agent, dispatching actions, and handling failures.</p></li><li><p>The <strong>agents</strong> themselves, which we currently deploy separately in order to enable more rapid experimentation, and</p></li><li><p>Finally, an <strong>orchestrator</strong> over many of today&#8217;s popular language model APIs, like OpenAI&#8217;s GPTs and Anthropic&#8217;s Claude(s).</p></li></ul><p><strong>Our stack is both familiar and new</strong></p><p>Deciding on which technology to use is an exercise in budgeting&nbsp;<a href="https://mcfunley.com/choose-boring-technology">innovation tokens</a>&#8212;we love to try new tools, but we're building AI agents, not infrastructure, so it's important to pick the ones that give us the greatest leverage. There are two that now have a cornerstone role in our backend platform:</p><p><a href="http://Encore.dev">Encore.dev</a> is the backend engine that we use to ship our Go services, backed by Postgres databases and Pub/Sub, to our own cloud provider account. Its code-first approach means we don&#8217;t need to think about provisioning and maintaining anything under the hood; Encore manages everything from environments through to deployments. We even do our similarity search using Postgres (which is natively supported by Encore) and <a href="https://github.com/pgvector/pgvector">pgvector</a>. Most serendipitously, Encore gave us a lot of <em>convention and structure</em> out of the box&#8212;which we would otherwise have had to create.</p><p><strong><a href="http://Temporal.io">Temporal.io</a></strong> is our choice of toolkit for tackling a range of issues that plague distributed systems. Requests partially fail or time out, providers get overwhelmed and fall over, autoscalers abruptly terminate instances, and&#8212;especially for companies like us&#8212;LLMs get rate-limited or return garbage completions. We are now crafting our way of intersecting Encore APIs with Temporal workflows, activities, and signals to structure our long-running, highly parallel processes for resilience and fault-tolerance.</p>
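<p>As a flavour of what that intersection can look like, here is a minimal sketch with made-up names rather than our actual service code: an Encore endpoint accepts a webhook from a support platform and signals the conversation&#8217;s long-running Temporal workflow:</p><pre><code>package conversations

import (
	"context"

	"go.temporal.io/sdk/client"
)

// temporalClient is created once at service start-up, e.g. with client.Dial(client.Options{}).
var temporalClient client.Client

// MessageEvent is the (illustrative) payload a support platform sends us
// when a customer writes a new message.
type MessageEvent struct {
	ConversationID string `json:"conversation_id"`
	Body           string `json:"body"`
}

// ReceiveMessage is an Encore API endpoint; the directive below is how Encore
// exposes a plain Go function over HTTP.
//
//encore:api public method=POST path=/conversations/message
func ReceiveMessage(ctx context.Context, ev *MessageEvent) error {
	// Conversation workflows are keyed by conversation ID, so the signal reaches
	// the single long-running workflow that owns this dialogue.
	return temporalClient.SignalWorkflow(ctx, ev.ConversationID, "", "customer-message", ev)
}
</code></pre>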
<p>Beyond these, we&#8217;ve also adopted <a href="https://incident.io/">Incident.io</a>, <a href="https://vercel.com/">Vercel</a>, Google&#8217;s BigQuery, and more as we expand our platform. But this is just the start! We are 2,831 pull requests into this journey; this post offered just coarse brush strokes of what we&#8217;re building, and there is much more to come.</p><p>To hear from us again, please make sure that you&#8217;ve subscribed below!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[👋 We’re Gradient Labs]]></title><description><![CDATA[And here's what we're up to &#10024;]]></description><link>https://blog.gradient-labs.ai/p/were-gradient-labs</link><guid isPermaLink="false">https://blog.gradient-labs.ai/p/were-gradient-labs</guid><dc:creator><![CDATA[Neal Lathia]]></dc:creator><pubDate>Fri, 26 Apr 2024 12:54:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!s7rb!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93fcda01-b003-463e-8fe0-7d3470c79452_800x800.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello! We are <a href="https://gradient-labs.ai/">Gradient Labs</a>, a London-based AI startup founded in mid-2023.</p><p><em>We are building an operating system of AI agents that automate manual, repetitive work.</em></p><p>We are starting with customer support. Having worked within Ops at companies experiencing exponential growth, this is an area that is close to our hearts. It is painful to scale while ensuring that quality remains high; it is challenging to automate&#8212;particularly in regulated, risk-averse environments&#8212;because question-answering bots only skim the surface of the problem.</p><p>We do not believe in AI as a co-pilot: true, scalable, high-quality automation will come from spending time supervising, evaluating, and course-correcting the work of AI agents, rather than being nudged by AI recommendations. So the practices that are common today need to be redefined: giving companies deep insight into and control of the quality, observability, and safety of the work that is automated is at the core of what we&#8217;re building, alongside the AI agents themselves.</p><p>As of April 2024, we&#8217;re working with eight design partners that span fintech, insurtech, online marketplaces, food delivery, travel, and crypto. Many of them are household names in their home countries. We&#8217;re also grateful to be funded by <a href="https://localglobe.vc/">Local Globe</a> and many wonderful angel investors who are former colleagues, friends, and family.</p><p>The current idea for this blog is to publish short, specific posts about the AI, Engineering, and Product problems and ideas that we are grappling with. Think of it as a small window into what is happening behind the scenes of a sub-10-person startup that is on the verge of going live for the first time. Or, think of it as what you&#8217;d hear us talk about if you randomly joined us for a 30-minute coffee.
If that sounds like your thing, subscribe to receive our updates below.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://blog.gradient-labs.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://blog.gradient-labs.ai/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item></channel></rss>