Multi-agent systems in the context of enterprise tool use
In recent months, there has been increasing interest in the area of multi-agent systems and how they can be used to solve more complex tasks than a single agent could accomplish on its own. The topic is particularly interesting and raises several questions and ideas to consider:
- How can agents act as a cohesive and cooperative population, similar to how humans function together?
- When each agent is primarily trained within its own silo, does this present a significant limitation to the effectiveness of multi-agent systems?
- How well does a multi-agent system really work? And do we need multiple agents in today’s era of long-context (>1M token) models?
- What causes a multi-agent system to work better than a single-agent system?
Anthropic’s blog post about how they architected a multi-agent deep research system is an excellent source for understanding the nuances of the latter two questions:
“Our internal evaluations show that multi-agent research systems excel especially for breadth-first queries that involve pursuing multiple independent directions simultaneously…
“Multi-agent systems work mainly because they help spend enough tokens to solve the problem. In our analysis, three factors explained 95% of the performance variance in the BrowseComp evaluation (which tests the ability of browsing agents to locate hard-to-find information). We found that token usage by itself explains 80% of the variance, with the number of tool calls and the model choice as the two other explanatory factors…
“Further, some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today. For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time. We’ve found that multi-agent systems excel at valuable tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools.”
Based on these insights, there are two key areas where current multi-agent architectures outperform single-agent architectures:
- Systems where token usage (input & output) is high enough to exceed the context window of a single agent/model.
- Systems where on-the-fly reasoning over a large volume of tokens is an essential part of the problem.
Background
The discussion around splitting information/input tokens/reasoning among multiple agents often ties back to the research on evaluating long-context model performance and the reasoning benchmarks around it. Long-context evaluations are often based on the “needle in a haystack” analogy – the challenge of identifying a specific group of tokens within a much larger context of many similar groups of tokens.
OpenAI’s MRCR benchmark is often used to measure model performance in a long-context environment. In the MRCR benchmark, the model has to disambiguate between multiple needles in a haystack environment, and select a particular needle based on the user request (“ask”). The setup simulates multi-turn conversations where identical user requests appear several times within a long context—for example, multiple prompts like “write a poem about tapirs” scattered throughout. The model is then asked to retrieve a specific instance (e.g., “give me the third poem about tapirs”), testing how well it can track and recall the right information across extended inputs.
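To make the setup concrete, here is a minimal sketch of how an MRCR-style multi-needle instance could be assembled. The function and prompt wording below are illustrative assumptions for this blog, not OpenAI’s actual benchmark format.

```python
import random

def build_mrcr_style_instance(needle_count: int, filler_turns: int, seed: int = 0):
    """Assemble a synthetic multi-needle conversation (illustrative only;
    the real MRCR prompt format may differ)."""
    rng = random.Random(seed)
    filler_topics = ["penguins", "volcanoes", "sailboats", "clocks"]
    conversation = []
    for i in range(needle_count):
        # The needle: an identical ask repeated throughout the context.
        conversation.append({"role": "user", "content": "write a poem about tapirs"})
        conversation.append({"role": "assistant", "content": f"(tapir poem #{i + 1})"})
        # Filler turns that pad out the context between needles.
        for _ in range(filler_turns):
            topic = rng.choice(filler_topics)
            conversation.append({"role": "user", "content": f"write a poem about {topic}"})
            conversation.append({"role": "assistant", "content": f"(poem about {topic})"})
    # The final ask targets one specific needle instance.
    target = rng.randint(1, needle_count)
    conversation.append(
        {"role": "user", "content": f"Give me tapir poem #{target}, exactly as written earlier."}
    )
    return conversation, target

messages, target_index = build_mrcr_style_instance(needle_count=4, filler_turns=3)
print(f"{len(messages)} turns; the model must recall tapir poem #{target_index}")
```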
OpenAI’s evaluation shows decreases in model performance when:
- The input context length is increased.
- The number of needles is increased.
The findings provide more clarity around the suggestion to split reasoning or task effort across multiple agents/models, since models reason best when context is limited. We decided to investigate further and show a real-world use case (outside of deep research) for a multi-agent architecture. Our work specifically demonstrates the performance differences in an enterprise/multi-tool use case and addresses specific details often overlooked in other discussions.
Our past work on long-context evaluation enabled us to create evaluation methods that mirror real-world use cases.
A multi-tool ecosystem
Given the widespread adoption of MCP (Model Context Protocol), connecting an agent to tools is straightforward, enabling it to manage diverse user tasks or solve complex enterprise challenges. We aim to show how single-agent and multi-agent performance differ within such an ecosystem, and how model performance varies with:
- Choice of agents
- Number of tools (context length)
- Number of distractors (reasoning ability)
Varying these factors lets us evaluate agentic systems as a whole, with complexities that mirror a real-world ecosystem, and measure the performance differences each factor introduces.
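As a concrete illustration, the sketch below parameterizes this experimental grid. The class name and the specific values are illustrative placeholders rather than our exact harness configuration.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class ExperimentConfig:
    model: str              # which agent/model answers the user ask
    num_tools: int | None   # tool schemas placed in context; None = all tools in the dataset
    num_distractors: int    # distractors per label tool (0 = low-reasoning setting)

# Illustrative grid; the exact values in our runs may differ.
models = ["gpt-5", "gpt-5-mini", "gpt-5-nano"]
tool_counts = [3, 100, None]
distractor_counts = [0, 4]

grid = [
    ExperimentConfig(m, n, d)
    for m, n, d in product(models, tool_counts, distractor_counts)
]
print(f"{len(grid)} configurations to evaluate")  # 18
```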
To create our multi-tool ecosystem, we used ToolACE (ICLR 2025). We reworked their open source dataset, which consists of 10K+ tools in 30 domains, to create task/tool groups with a high number of tools per task.
The following are definitions that will be used throughout the blog:
- User ask: A natural-language request from the user that may require one or more tools to satisfy, e.g.,
Run a tool to get the weather for me in Celsius!
- Domain: A broad category in which each tool is classified, for example, Science, Entertainment, Commerce, Data, Communication, AI, Art, Engineering, etc.
- Label tool: The tool whose invocation correctly and completely satisfies the user ask, e.g.,
get_weather_celsius(Toronto)
- Distractor tools: Tools that closely resemble the label tool but cannot fully satisfy the user ask, e.g.,
get_weather_fahrenheit(Toronto)
- Unrelated tools: Tools that are irrelevant to the user ask, e.g.,
order_ubereats
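The sketch below shows how these definitions could fit together in a single evaluation record. The class and field names are our own shorthand for illustration, not the ToolACE schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolSpec:
    name: str
    description: str
    parameters: dict  # JSON-Schema-style parameter definition

@dataclass
class TaskInstance:
    user_ask: str                  # the natural-language request
    domain: str                    # broad category of the label tool
    label_tool: ToolSpec           # the one tool that fully satisfies the ask
    distractor_tools: list[ToolSpec] = field(default_factory=list)  # near-misses
    unrelated_tools: list[ToolSpec] = field(default_factory=list)   # irrelevant padding

example = TaskInstance(
    user_ask="Run a tool to get the weather for me in Celsius!",
    domain="Science",
    label_tool=ToolSpec("get_weather_celsius", "Weather in Celsius", {"type": "object"}),
    distractor_tools=[ToolSpec("get_weather_fahrenheit", "Weather in Fahrenheit", {"type": "object"})],
    unrelated_tools=[ToolSpec("order_ubereats", "Order food delivery", {"type": "object"})],
)
```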
Data generation & synthesis process
To test the performance of agents, we created tasks that span the variety of domains mentioned above. These tasks were created by:
- Assigning a domain to each tool via an LLM judge.
- Creating a tool complexity rubric and assigning a complexity score to each tool in our dataset.
- Filtering out tools that are too simple.
- Filtering out tasks that cannot be completed by a unique tool call (i.e., tasks where multiple tools could satisfy the user ask).
(Note: For a deep dive on the role of rubrics in data evaluation and curation, see our blog post series on rubrics.)
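The sketch below outlines what this curation loop might look like, assuming a generic `judge` callable that wraps an LLM API; the prompts and the complexity threshold are illustrative rather than our exact pipeline.

```python
from typing import Callable

COMPLEXITY_THRESHOLD = 3  # illustrative cutoff on a 1-5 rubric

def curate_tools(tools: list[dict], judge: Callable[[str], str]) -> list[dict]:
    """Assign a domain and a complexity score to each tool via an LLM judge,
    then keep only tools above a complexity threshold."""
    curated = []
    for tool in tools:
        domain = judge(
            "Classify this tool into one broad domain "
            f"(Science, Entertainment, Commerce, ...):\n{tool}"
        ).strip()
        score = int(judge(
            "Using the complexity rubric, rate this tool from 1 (trivial) "
            f"to 5 (complex). Reply with a single digit:\n{tool}"
        ).strip())
        if score >= COMPLEXITY_THRESHOLD:  # drop tools that are too simple
            curated.append({**tool, "domain": domain, "complexity": score})
    return curated
```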
To mirror real-world complexities, we created distractor tools by feeding the task and the label tool schema to another LLM. The LLM was instructed to take the label tool schema and create a distractor using one of the following methods:
- Removing arguments that are needed to carry out the user task (e.g., remove the parameter ‘region’ if the user is required to specify a region).
- Making tool descriptions + parameters nonsensical (e.g., change the description to indicate that the tool is relevant to Mars plants instead of Earth plants).
- Creating legacy versions of tools, where the model has to reason that the label tool should be used.
- Changing enumerated values for tools so that they no longer meet user query requirements (e.g., User says “yellow” → enumerated list has [‘crimson’, ‘magenta’, ‘turquoise’] (excludes yellow)).
- Tightening constraints on tools so they can no longer fulfill the user query (e.g., setting maxItems below what the user needs).
This process allowed us to create synthetic distractors that force the model to reason carefully in order to call the one tool that completely satisfies the user’s request.
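For concreteness, here is a sketch of one of these perturbations (the enum-swap case) written as a deterministic helper. In our pipeline the perturbation was produced by an LLM given the task and label tool schema, so the function and example schema below are purely illustrative.

```python
import copy

def make_enum_distractor(label_tool: dict, param: str, required_value: str,
                         replacement_values: list[str]) -> dict:
    """Illustrative enum-swap perturbation: clone the label tool schema and
    remove the value the user needs from an enumerated parameter, so the
    distractor can no longer satisfy the user ask."""
    distractor = copy.deepcopy(label_tool)
    distractor["parameters"]["properties"][param]["enum"] = [
        v for v in replacement_values if v != required_value
    ]
    distractor["name"] = label_tool["name"] + "_v2"  # superficially similar name
    return distractor

label = {
    "name": "pick_color",
    "description": "Select a color for the user's design.",
    "parameters": {
        "type": "object",
        "properties": {"color": {"type": "string", "enum": ["yellow", "crimson", "magenta"]}},
        "required": ["color"],
    },
}
# User says "yellow" -> the distractor's enum excludes yellow.
distractor = make_enum_distractor(label, "color", "yellow",
                                  ["crimson", "magenta", "turquoise"])
print(distractor["parameters"]["properties"]["color"]["enum"])  # ['crimson', 'magenta', 'turquoise']
```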
Distractor tool examples
User Query
"Can you transcribe the following English sentences into IPA within 24 hours: 'Hello, how are you?', 'Good morning.', 'Thank you very much.' and 'See you later.'?"
The user requests IPA (International Phonetic Alphabet) transcription of four English sentences with a 24-hour deadline.
Label Tool (Correct Answer)
PhoneticTranscription_transcribeText
Converts input text into its phonetic transcription using specified phonetic alphabet and language settings.
Required Parameters:
• text (string) – The text to be transcribed phonetically
• settings (object)
- alphabet (enum: "IPA", "X-SAMPA")
- language (enum: "English", "French", "Spanish")
- timeOptions.deadline (enum: "1 hour", "12 hours", "24 hours")
This tool supports IPA transcription for English text with a 24-hour deadline option, matching all query requirements.
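In a function-calling/MCP-style JSON schema, the label tool might look roughly like the Python dict below. This is a reconstruction from the description above; the exact field names and option lists in the dataset may differ.

```python
# Approximate reconstruction of the label tool in a function-calling-style JSON schema.
# Field names and option lists may differ from the actual dataset entry.
phonetic_transcription_tool = {
    "name": "PhoneticTranscription_transcribeText",
    "description": (
        "Converts input text into its phonetic transcription using specified "
        "phonetic alphabet and language settings."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "The text to be transcribed phonetically",
            },
            "settings": {
                "type": "object",
                "properties": {
                    "alphabet": {"type": "string", "enum": ["IPA", "X-SAMPA"]},
                    "language": {"type": "string", "enum": ["English", "French", "Spanish"]},
                    "timeOptions": {
                        "type": "object",
                        "properties": {
                            "deadline": {
                                "type": "string",
                                "enum": ["1 hour", "12 hours", "24 hours"],
                            },
                        },
                    },
                },
            },
        },
        "required": ["text", "settings"],
    },
}
```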
Distractor Tool #1
QuickPhoneticTranscriber_convert
Converts short input text into its phonetic transcription. Optimized for brief phrases and single words.
Key Difference: Has a maxLength: 15 constraint on the text parameter. The user's sentences are longer than 15 characters, making this tool unsuitable.
Distractor Tool #2
PhoneticTranscriber_convertToPhonemes
Converts input text into its phonetic transcription using specified phonetic alphabet and language settings.
Key Difference: Missing the required "text" parameter. Only requires "settings", making it impossible to pass the actual sentences to transcribe.
Distractor Tool #3
AdvancedPhoneticEncoder_process
Converts input text into its phonetic transcription using advanced computational phonetic alphabets.
Key Difference: Does not support IPA in its alphabet enum (only "X-SAMPA", "SAMPA", "Kirshenbaum", "Arpabet"). The user specifically requested IPA transcription.
Distractor Tool #4
TextToPhonetics_transcribe
Converts input text into its phonetic transcription using specified phonetic alphabet and language settings.
Key Difference: Language enum only supports "Mandarin", "Japanese", "Korean", "Arabic" – does not include "English", which is required for the query.
Summary
All distractor tools are semantically similar to the correct tool but have subtle parameter mismatches:
• Text length constraints
• Missing required parameters
• Unsupported alphabet formats
• Incompatible language options
These variations test whether models can carefully match query requirements against tool specifications when multiple plausible options are available.
Single-agent evaluation
We started by evaluating the performance of a single agent across two distinct settings to address the following questions:
- How does a single agent perform when the label tool call (the correct tool call) is interspersed among varying quantities of irrelevant tools?
- How does a single agent perform when the label tool call is interspersed among varying quantities of irrelevant tools, including four distractor tools per label tool call?
The first scenario evaluates the effect of context length on model performance in a lower-reasoning setting, while the second scenario evaluates the effect of context length on higher-level reasoning performance within a tool-calling environment.
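The sketch below shows a minimal version of this single-agent evaluation loop, reusing the TaskInstance structure sketched earlier and assuming a generic call_agent function that returns the name of the tool the model selected; both are illustrative stand-ins for our actual harness.

```python
import random
from typing import Callable

def build_toolset(task, num_unrelated: int, include_distractors: bool,
                  unrelated_pool: list, seed: int = 0) -> list:
    """Assemble the tool list shown to the agent: the label tool, optional
    distractors, and a varying number of irrelevant tools."""
    rng = random.Random(seed)
    tools = [task.label_tool]
    if include_distractors:
        tools += task.distractor_tools          # e.g., 4 distractors per label tool
    tools += rng.sample(unrelated_pool, num_unrelated)
    rng.shuffle(tools)                          # avoid position bias
    return tools

def evaluate_single_agent(tasks: list, unrelated_pool: list,
                          call_agent: Callable[[str, list], str],
                          num_unrelated: int, include_distractors: bool) -> float:
    """Fraction of tasks where the agent selects the label tool by name."""
    correct = 0
    for task in tasks:
        tools = build_toolset(task, num_unrelated, include_distractors, unrelated_pool)
        predicted_tool_name = call_agent(task.user_ask, tools)
        correct += int(predicted_tool_name == task.label_tool.name)
    return correct / len(tasks)
```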
We selected OpenAI’s GPT-5 family because it provides a healthy mix of frontier models of different sizes.
We compared the 100-tool and all-tool scenarios against the 3-tool scenario and noted the differences across the two experiments.
A clear pattern emerged across all evaluations based on model size. The smaller models (GPT-5-nano, GPT-5-mini) start to struggle as context size and the number of tools increase—their performance drops as the environment gets more complex. In contrast, GPT-5 stays remarkably steady, maintaining accuracy even with longer contexts and larger tool sets.
This points to a clear differential effect: smaller and mid-sized models lose reasoning depth as complexity grows, while larger models can keep up without much degradation. When the tool environment becomes more intricate—with overlapping functions, distractors, and higher reasoning demands—those differences become even more pronounced.
Given these findings, we investigated whether a multi-agent system could mitigate the performance impacts observed in these environments for the mini and nano models, and narrow the gap between mini/nano & GPT-5.
Multi-agent system construction and performance
Our architecture splits tools across executors, with each executor tied to a unique domain; the planner selects the appropriate domain-specific executor based on the user query. This design maps cleanly to real-world setups where multiple specialized agents collaborate within defined boundaries.
In an enterprise setting, executors can represent different domains—HR, finance, engineering, and so on—with a planner agent routing refined user queries to the right executor and aggregating their outputs into a unified response.
In a user ecosystem, executors can represent individual devices or applications, each operating within its own silo to maintain privacy while still contributing to a shared task flow.
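A minimal sketch of this planner/executor split is shown below, assuming generic planner and executor callables; the routing interface is an illustrative simplification of our actual system.

```python
from typing import Callable

def multi_agent_answer(user_ask: str,
                       tools_by_domain: dict[str, list],
                       planner: Callable[[str, list[str]], str],
                       executor: Callable[[str, list], str]) -> str:
    """Planner routes the ask to one domain-specific executor; that executor
    only sees the tools registered for its domain, keeping its context small."""
    # 1. Planner picks a domain from the available executor domains.
    domain = planner(user_ask, sorted(tools_by_domain))
    # 2. The chosen executor selects a tool using only its own tool subset.
    domain_tools = tools_by_domain.get(domain, [])
    return executor(user_ask, domain_tools)  # returns the chosen tool name
```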
We observed a consistent pattern across these results. At lower tool counts, performance in the multi-agent setup either holds steady or dips slightly, largely due to planner overhead—the routing layer becomes the bottleneck. As tool count and context size grow, that trend reverses: multi-agent systems begin to outperform single-agent baselines, showing clear gains even beyond the 30K token range.
GPT-5-mini, in particular, shows strong improvements in the full tool setting, demonstrating that multi-agent coordination helps smaller/medium-sized models recover much of the accuracy lost in single-agent, long-context scenarios.
Cost implications
We measured costs across the different experiments and found the multi-agent setups to be much more cost-effective in long-context scenarios.
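To see why, consider a rough token-accounting sketch: a single agent must carry every tool schema in its input context on every call, while the planner sees only compact domain summaries and each executor sees only its own domain’s tools. The helper functions and numbers below are illustrative, not measured values from our experiments.

```python
def single_agent_input_tokens(tokens_per_tool: int, total_tools: int,
                              ask_tokens: int) -> int:
    # Single agent: every tool schema sits in one context window on every call.
    return ask_tokens + tokens_per_tool * total_tools

def multi_agent_input_tokens(tokens_per_tool: int, tools_in_routed_domain: int,
                             ask_tokens: int, planner_overhead_tokens: int) -> int:
    # Planner sees only compact domain summaries (overhead);
    # the executor sees one domain's tools.
    planner_call = ask_tokens + planner_overhead_tokens
    executor_call = ask_tokens + tokens_per_tool * tools_in_routed_domain
    return planner_call + executor_call

# Illustrative numbers only, not measured values from our experiments:
print(single_agent_input_tokens(tokens_per_tool=150, total_tools=1000, ask_tokens=100))   # 150100
print(multi_agent_input_tokens(tokens_per_tool=150, tools_in_routed_domain=40,
                               ask_tokens=100, planner_overhead_tokens=500))              # 6700
```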
Summary and key takeaways
In ecosystems that require reasoning over long contexts, multi-agent systems can deliver meaningful gains by reducing cost and improving accuracy. While our experiments centered on relatively straightforward reasoning tasks, real-world challenges often involve far more complex reasoning, where the decline in accuracy as a function of increased context length is even steeper. In such settings, dividing context across specialized agents becomes a necessity.
The following key takeaways summarize the main insights from our analysis:
- Larger, state-of-the-art models excel at long-context reasoning, whereas smaller models tend to lag behind in these scenarios.
- Multi-agent systems introduce an additional failure mode of routing errors through the planner agent, which can impact accuracy. Notably, larger models may experience performance drops in multi-agent setups due to these routing issues.
- Multi-agent architectures offer substantial optimization for long-context (LC) reasoning tasks, enabling dramatic reductions in cost while allowing smaller models to achieve accuracy levels closer to those of larger models.
* Throughout our experiments, we encountered API errors with the GPT-5 family. To ensure experiment quality, we only kept data points where no API errors were observed for any model.
* While the sample size was limited, the trends we observed were clear and directionally consistent.
Bhavishya Pohani is a Senior Research Scientist at Snorkel AI, focusing on large language model fine-tuning & agentic systems. Before Snorkel, he worked on building deep learning systems at Chubb Insurance.