LLMs at Gradient Labs: the perfect blend
We don't pick one language model; we blend many together to get the best result.
In the last two months, there have been announcements for several landmark models: GPT-4.1 on April 14th, Llama 4 on April 5th, Gemini 2.5 on March 25th, and Claude 3.7 at the end of February. Undoubtedly, there’s a frenetic amount of work going into training the next generation of foundation models, and everything is changing fast.
This continuous change is a great reminder that evaluating LLMs is hard. Given one model, there’s a plethora of metrics reported on public evaluation datasets that aim to demonstrate its general performance. Looking across many models, there are often slight differences in what benchmarks are used and how they are compared to competitors. The ultimate question, for task-centric AI agents, is: if a model looks good on paper, will it be good for the specific problems that our agent works on?
At Gradient Labs, we’ve taken a slightly different path with respect to how we go about model selection—this post is an overview of what we do.
The ultimate trifecta
If you zoom out all the way, there are only three variables at play across all foundation models. Assuming a task that requires a single completion, the trade-off is between:
Quality: how “good” are the outcomes that the model achieves?
Latency: how quickly can the model achieve a result?
Cost: how many tokens does it need to generate to get to the result, and how are those tokens priced?
Unfortunately, the range of models out there often only gives you two of these 😞—sometimes just one!
At Gradient Labs, we heavily anchor on the first: quality. Primarily, this is because the industry trend on the other two dimensions has been towards progressively faster, cheaper models that are “just as smart.” End-to-end agent latency also has a strong engineering angle that is separate from the model choice: parallelising different building blocks of the agent, pre-emptively running some parts before they are needed, and more.
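As a rough illustration of that latency point (this is not our production code; the language and block names are invented for the example), two independent building blocks can run concurrently rather than one after the other:

```python
import asyncio

# Illustrative stand-ins for independent agent building blocks
# (e.g. retrieval and intent classification), each wrapping a model call.
async def retrieve_knowledge(query: str) -> list[str]:
    await asyncio.sleep(0.8)  # simulate a slower retrieval / LLM call
    return ["refund policy: ..."]

async def classify_intent(query: str) -> str:
    await asyncio.sleep(0.3)  # simulate a faster, cheaper classification call
    return "refund_request"

async def handle_message(query: str) -> tuple[str, list[str]]:
    # Running independent blocks concurrently means end-to-end latency is
    # bounded by the slowest block, not by the sum of all of them.
    knowledge, intent = await asyncio.gather(
        retrieve_knowledge(query),
        classify_intent(query),
    )
    return intent, knowledge

print(asyncio.run(handle_message("I was charged twice, can I get a refund?")))
```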
However, inside the quality dimension there is a lot to unpack:
An individual building block of an agent might be evaluated in completely different ways: with binary or multi-class classification metrics, ranking metrics, or response quality reviews (a toy example follows this list). In this arena, the choice of metric can’t be divorced from what that part of the agent is trying to achieve.
The end-to-end customer experience when chatting with an AI agent is determined by the effect of combining all of the agent’s blocks together, so ensuring that rare upstream mistakes do not compound into low-quality downstream responses is critical. None of this is contingent on using the same model throughout the whole agent. We surface a suite of tools and product features for this, ranging from simulations all the way through to advanced customer conversation synthesis (more to come on this front soon!).
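As a toy illustration of the first point, a classification-style building block (say, one that decides which procedure a conversation should be routed to) can be scored against human-labelled conversations with ordinary multi-class metrics. The labels and predictions below are invented for the example:

```python
from sklearn.metrics import classification_report

# Invented example data: the intent a human reviewer assigned to each
# conversation vs. what a candidate model predicted for the routing block.
human_labels = ["refund", "refund", "delivery", "cancel", "delivery", "refund"]
model_output = ["refund", "cancel", "delivery", "cancel", "refund", "refund"]

# Per-class precision/recall/F1 shows where a candidate model is weak,
# which matters more than a single headline accuracy number.
print(classification_report(human_labels, model_output, zero_division=0))
```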
Going all-in ❌ , maintaining optionality ✅
At Gradient Labs, picking a single model to serve all of an AI agent’s needs felt overly constraining at a time when new models are announced every month. It would mean adopting all of that model’s strengths and accepting all of its limitations. We avoided this because of:
The rising tide. Imagine building an agent using GPT-3 end-to-end. By 2025, no matter how good it was, its overall position would have been eroded by virtue of being committed to a model that has largely been supplanted. We believe the same will be true going forward.
The risk of migrations. Imagine building end-to-end with Sonnet 3.7, and then waking up one morning to the announcement of Sonnet 4. Being committed to a single model would force us to think about large, uncertain, and risky upgrades where the entire AI agent might need to be migrated onto a new model.
The risk appetite of our partners. Some of the companies we work with want the latest & greatest live as quickly as possible, others care less about experimental opportunities and more about consistent outcomes. Being flexible enables us to cater for both!
Ultimately, the flexibility we wanted is for AI Engineers to pick the ideal model for the building block they are working on: they know best where it fits in the overall puzzle, and which of the trifecta’s variables they are willing to trade off.
Uniform interface, reliable service ✨
While AI Engineers are empowered to pick their choice of model, there are several separate problems that they shouldn’t need to care about:
Rewriting code to use a different model. We have built an internal abstraction that enables changing models by editing one line, rather than needing to juggle different clients (a simplified sketch follows this list).
Observability. We log each completion request that is made, whether it succeeded or failed; this happens inside our internal abstraction and is invisible to AI Engineers.
Picking the model’s provider. While OpenAI and Anthropic models are available directly, many are also available via cloud service providers (Azure, AWS, GCP). Retry and fallback behaviour when any of them experiences a hiccup, or when we approach our rate limits, sits at the heart of our platform, far away from the daily work of AI folks.
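Our internal abstraction isn’t something we can share here, but a heavily simplified sketch of its general shape, with invented names and a hypothetical provider interface, might look something like this:

```python
import logging
from dataclasses import dataclass
from typing import Protocol

logger = logging.getLogger("llm")

class Provider(Protocol):
    """Anything that can serve a completion: a direct API, Azure, AWS, GCP, ..."""
    def complete(self, model: str, prompt: str) -> str: ...

@dataclass
class LLMClient:
    """One uniform interface: callers pick a model name and nothing else."""
    providers: list[Provider]  # ordered by preference, e.g. direct API first, then a cloud provider

    def complete(self, model: str, prompt: str) -> str:
        last_error: Exception | None = None
        for provider in self.providers:
            try:
                response = provider.complete(model, prompt)
                # Observability lives here, invisible to the caller.
                logger.info("completion ok model=%s provider=%s", model, type(provider).__name__)
                return response
            except Exception as err:  # rate limits, outages, ...
                logger.warning("completion failed model=%s provider=%s err=%s",
                               model, type(provider).__name__, err)
                last_error = err
        raise RuntimeError("all providers failed") from last_error

# Swapping the model behind a building block is then a one-line change, e.g.
# client.complete(model="sonnet", prompt=...) -> client.complete(model="gemini", prompt=...)
```

The specifics above are illustrative; the point is that model choice, logging, and provider retry/fallback all live in one place, so changing the model behind a building block really is a one-line edit.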
The perfect blend ☕️
When people chat with our AI agent, their experience is driven by a blend of models that are each playing to their unique strengths. Today, that is a blend of Sonnet, Gemini, and GPT models. When a new model is released, we (like many!) quickly evaluate different parts of the agent to see what can be improved.
This, combined with a design that focuses on a conversational and diagnostic-oriented approach to resolution, is one of the many reasons why we have seen our agent outperform others when going head-to-head against them.
We know, however, that there is still much to build. Subscribe below to hear all of our updates!