Building agentic workflows
How we're navigating the abstract process of designing and building AI agents
At Gradient Labs, we’re building a suite of AI agents that automate manual, repetitive work—starting with customer service. There are two major areas to this: the platform that enables AI agents to operate (which we 🦉 previously wrote about), and the AI agents themselves.
We’re always soaking up as many blog posts about AI agents as we can find. Recently, many of them have focused on the criteria for qualifying software as an AI agent: is it just code that makes more than one LLM call, or perhaps something else? As Andrew Ng wrote, the technical definition of “agent” perhaps matters less than the capabilities and impact that can be achieved via agentic systems. So instead of thinking about what agents are, we’re back to the age-old questions of organising a group of people to build a novel type of system in a safe and scalable way, starting from its design and going all the way through to what code we write, and where.
“What would a human do?”
Humans are really good at reasoning by instinct when working on abstract tasks. For example, if you were a support agent and a customer asked “hey, can you change my flight?”, you might instinctively redirect them elsewhere if you worked at a food delivery company: you know, without having been told, that the question is out of bounds. Or, if you did happen to work at a travel company, you might instinctively ask clarifying questions if the customer has no bookings in the system, instead of saying yes and then not knowing what to change.
LLMs, on the other hand, while increasingly powerful (as zero-shot classifiers, planners, summarisers, and beyond), tend to do better when given clear, explicit instructions rather than overly abstract tasks. It’s currently infeasible to expect LLMs to “instinctively” do the right thing in a realistic setting. That means a starting point when designing an agentic system is to try to map out, in broad brushstrokes, what these implicit human instincts might be. For example, given a question from a customer, humans might implicitly answer a range of questions: is it clear? Is it relevant? Is it ambiguous? Is it threatening? Is it a greeting or a farewell? Does it sound vulnerable? Each of those questions can be encoded as a discrete, explicit task for an LLM to tackle.
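As a rough sketch (in Go purely for concreteness; the LLM interface, prompts, and names below are illustrative rather than our actual code), each of those questions becomes its own small, testable task:

```go
package skills

import "context"

// LLM is a stand-in for whichever completion client is in use.
type LLM interface {
	Complete(ctx context.Context, prompt string) (string, error)
}

// Classification is the answer to one narrow, explicit question about a message.
type Classification struct {
	Question string
	Answer   string // e.g. "yes", "no", "unclear"
}

// questions spells out the instincts a human support agent applies implicitly.
var questions = []string{
	"Is the customer's message clear?",
	"Is it relevant to this company's product?",
	"Is it ambiguous and in need of a clarifying question?",
	"Is it threatening or abusive?",
	"Is it just a greeting or a farewell?",
	"Does the customer sound vulnerable?",
}

// ClassifyMessage asks each question as its own discrete, explicit LLM call,
// so that every component can be evaluated and reasoned about on its own.
func ClassifyMessage(ctx context.Context, llm LLM, message string) ([]Classification, error) {
	var out []Classification
	for _, q := range questions {
		prompt := "Customer message: " + message + "\n\nQuestion: " + q + "\nAnswer yes, no, or unclear."
		answer, err := llm.Complete(ctx, prompt)
		if err != nil {
			return nil, err // let the caller decide how to retry
		}
		out = append(out, Classification{Question: q, Answer: answer})
	}
	return out, nil
}
```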
From the perspective of someone building an agentic system, it’s then much easier to evaluate and reason about individual components, rather than treating the “AI” as an impenetrable black box. On the technical side, each of these skills is a Temporal Activity, so that it can be retried automatically when it fails.
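For illustration, invoking a single skill as an activity with a retry policy might look roughly like this (the activity name, timeout, and retry numbers are placeholders, not our production configuration):

```go
package skills

import (
	"time"

	"go.temporal.io/sdk/temporal"
	"go.temporal.io/sdk/workflow"
)

// RunSkill calls one skill, registered elsewhere as the "ClassifyRelevance"
// activity (an illustrative name), from inside a workflow. The retry policy
// means transient failures, such as LLM rate limits, are retried for us.
func RunSkill(ctx workflow.Context, message string) (string, error) {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 30 * time.Second,
		RetryPolicy: &temporal.RetryPolicy{
			InitialInterval:    time.Second,
			BackoffCoefficient: 2.0,
			MaximumAttempts:    5, // placeholder numbers
		},
	})

	var answer string
	err := workflow.ExecuteActivity(ctx, "ClassifyRelevance", message).Get(ctx, &answer)
	return answer, err
}
```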
After building a family of skills, the exercise becomes one of composing them and executing them effectively, interleaved with business logic to suit different companies and conversational scenarios. To do that, we build at two levels of abstraction: the task level (as a state machine) and the LLM level (as chains of skills).
Tasks as state machines
Imagine an extremely simplistic representation of a conversation between me and you. There could be three states:
I’m listening — a state where you speak, and all I need to do is to keep up with what you’re saying (and not get distracted 😳 )
I’m talking — a state where I need to figure out what to say, and then say it (hoping to say something relevant 😇 )
The end state, where the conversation has finished after you say goodbye 👋🏽
The states and a minimal set of transitions could look like this:
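In code, a minimal sketch of those states and transitions (in Go for concreteness, with illustrative names) might be:

```go
package conversation

// State is where the conversation currently sits.
type State string

const (
	Listening State = "listening" // you speak; all I need to do is keep up
	Talking   State = "talking"   // I figure out what to say, then say it
	Ended     State = "ended"     // the conversation finished after a goodbye
)

// Event is something that happened that may move us to another state.
type Event string

const (
	CustomerSpoke       Event = "customer_spoke"
	AgentReplied        Event = "agent_replied"
	CustomerSaidGoodbye Event = "customer_said_goodbye"
)

// transitions is the minimal set: take turns, and stop at a goodbye.
var transitions = map[State]map[Event]State{
	Listening: {CustomerSpoke: Talking, CustomerSaidGoodbye: Ended},
	Talking:   {AgentReplied: Listening, CustomerSaidGoodbye: Ended},
}

// Next returns the next state, or stays put if the event is not allowed here.
func Next(current State, ev Event) State {
	if next, ok := transitions[current][ev]; ok {
		return next
	}
	return current
}
```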
Agentic workflows can also be formulated as (much more) complex state machines. This enables us to reason about the messy world of dialog, catering for things like stopping any work if a human agent steps in, reaching back out to customers who seem to have stopped responding, invoking tools, waiting for a human-in-the-loop to approve an action, and beyond.
From our point of view as builders, state machines are a well-established pattern that makes it easier to reason about the high-level behaviour of our agent while staying away from the non-deterministic LLMs that sit under the hood. We represent each conversation in our platform as a Temporal Workflow, whose states can receive signals when external events occur that require the workflow to interrupt whatever it is currently doing and move into a different state.
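Building on the state sketch above, here is a rough illustration of how signals could interrupt a conversation workflow; the signal names, timer, and nudge behaviour are all placeholders rather than how our platform actually does it:

```go
package conversation

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// ConversationWorkflow drives one conversation as a long-lived workflow that
// reacts to external events delivered as signals.
func ConversationWorkflow(ctx workflow.Context) error {
	humanStepIn := workflow.GetSignalChannel(ctx, "human-agent-stepped-in")
	customerMsg := workflow.GetSignalChannel(ctx, "customer-message")

	state := Listening
	for state != Ended {
		selector := workflow.NewSelector(ctx)

		// A human agent taking over means we stop whatever we are doing.
		selector.AddReceive(humanStepIn, func(c workflow.ReceiveChannel, more bool) {
			c.Receive(ctx, nil)
			state = Ended
		})

		// A new customer message means it is our turn to figure out what to say.
		selector.AddReceive(customerMsg, func(c workflow.ReceiveChannel, more bool) {
			var msg string
			c.Receive(ctx, &msg)
			state = Next(state, CustomerSpoke)
		})

		// If the customer seems to have stopped responding, reach back out.
		selector.AddFuture(workflow.NewTimer(ctx, 10*time.Minute), func(f workflow.Future) {
			state = Talking
		})

		selector.Select(ctx)

		if state == Talking {
			// This is where the planner and chains (described below) would run
			// as activities to decide what to say and say it.
			state = Next(state, AgentReplied)
		}
	}
	return nil
}
```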
We’ve stayed away from multi-agent collaboration, for now
A popular topic that we’ve noted in papers and blog posts is multi-agent systems. One way these are built is by decomposing the work required to perform a task into different personas. For example, generating software (one of the most popular topics in the literature these days!) could be broken down into the “product manager” persona, the “software engineer” persona, the “quality tester” persona, and so on.
At the end of the day, these are still LLM calls stacked into a hierarchy, just organised in a slightly different way. However, from a builder’s perspective, mapping agentic workflows back onto the roles that humans developed to do the same task (as opposed to the skills used to accomplish it) feels like an anti-pattern, though we might be wrong. Is it the product manager or the software engineer who is in charge of clarifying ambiguous requirements?
At this early stage, this is a level of ambiguity and complexity that we have naturally veered away from: our agentic workflows are assumed to embody all of the skills they need to automate the entirety of the task they are designed to handle. That means we spin up a new agent when we have a new high-level task to automate (say, a content editor).
The agent: planning, reasoning, executing
Looking back at that simplistic state machine, it might still look pretty vague. What does the “I’m talking” state actually translate into? This is the part of our system that most closely resembles the emerging arena of agentic design patterns:
The planner inspects the current conversation and the available chains to pick the best chain and back out of ineligible conversations;
The best chain is executed first (falling back to other options);
Each chain might have a series of classification, search, reasoning, and completion steps, and emits one or more events (such as “I decided X” or “I classified this as Y”) and operations (e.g. “say X”, “call tool Y”, or “hand this conversation off to Z”);
Before these operations are acted on, they are checked with guardrails and other consistency checks to make sure the agent rarely does contradictory or undesired things (a rough sketch of this flow follows below).
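Stripped of the real prompts and business logic, the shape of that flow is roughly the following, with the planner’s ranking folded into a best-first ordering of chains; every type and name here is illustrative rather than our actual interfaces:

```go
package agent

import "context"

// Operation is something the agent wants to do: "say X", "call tool Y",
// "hand this conversation off to Z".
type Operation struct {
	Kind    string // "say", "call_tool", "hand_off"
	Payload string
}

// Event records a decision the agent made, e.g. "I classified this as Y".
type Event struct {
	Description string
}

// Chain is a series of classification, search, reasoning and completion steps.
type Chain interface {
	Eligible(ctx context.Context, conversation []string) bool
	Run(ctx context.Context, conversation []string) ([]Event, []Operation, error)
}

// Guardrail vetoes operations that would be contradictory or undesired.
type Guardrail interface {
	Allows(op Operation) bool
}

// Respond picks the best eligible chain, runs it (falling back to the next one
// on failure), and checks the proposed operations before anything is acted on.
func Respond(ctx context.Context, conversation []string, chains []Chain, guards []Guardrail) ([]Event, []Operation, error) {
	var events []Event
	var approved []Operation

	for _, chain := range chains { // chains are assumed to be ordered best-first
		if !chain.Eligible(ctx, conversation) {
			continue // the planner backs out of ineligible conversations
		}
		evs, ops, err := chain.Run(ctx, conversation)
		if err != nil {
			continue // fall back to the next option
		}
		events = append(events, evs...)
		for _, op := range ops {
			allowed := true
			for _, g := range guards {
				if !g.Allows(op) {
					allowed = false
					break
				}
			}
			if allowed {
				approved = append(approved, op)
			}
		}
		break
	}
	return events, approved, nil
}
```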
From a builder’s perspective, this ends up being a large, non-deterministic directed acyclic graph in code. By starting at the top (the planner) and inspecting the events, we can trace all of the decisions the workflow made to reach its conclusion.
One of the neat things about this structure is that we don’t need to reason about higher-level questions (such as “what if the customer writes in again midway through planning?”) while we implement these skills. Equally, we don’t need to reason about lower-level questions (such as “what if this LLM call is rate limited?”) because we can simply return an error and let Temporal take care of retries. Most importantly, we can run this agent both as a live agent, responding to customers, and as a simulated agent, just asking it “what would you do next here?”, which unlocks our entire approach to evaluation.
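Building on the previous sketch, the difference between the live and simulated modes can be as small as whether the approved operations are acted on or simply returned (again, the names are illustrative):

```go
package agent

import "context"

// Executor acts on approved operations: sending messages, calling tools,
// handing conversations off, and so on.
type Executor interface {
	Execute(ctx context.Context, op Operation) error
}

// NextStep answers "what would you do next here?" without acting on anything,
// which is the question an offline evaluation harness wants to ask.
func NextStep(ctx context.Context, conversation []string, chains []Chain, guards []Guardrail) ([]Operation, error) {
	_, ops, err := Respond(ctx, conversation, chains, guards)
	return ops, err
}

// Live makes the same decision and then acts on it for a real customer.
func Live(ctx context.Context, conversation []string, chains []Chain, guards []Guardrail, exec Executor) error {
	ops, err := NextStep(ctx, conversation, chains, guards)
	if err != nil {
		return err
	}
	for _, op := range ops {
		if err := exec.Execute(ctx, op); err != nil {
			return err
		}
	}
	return nil
}
```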
Putting it together
As we continue to grow, finding a way to structure our agentic workflows is enabling us to map the parts of the agents we build back onto the disciplines in our team, so that each person plays to their strengths, works at the level of abstraction they prefer, and applies their domain expertise. Having the structure above has, to date, one year in, really helped to cement everything from our high-level design approach (“what would a human do?”) through to the lowest-level details (how we structure our code bases).
Have you taken a similar approach, or have you found a different path that works for you? We’d love to chat about it!