VAKRA: What Happens When Enterprise AI Agents Actually Have to Use Tools

IBM Research just dropped VAKRA, and honestly, it’s the kind of benchmark that makes you wince if you’ve been hyping AI agents for enterprise use.

VAKRA stands for something I’m not going to pretend to remember, but what it does is simple: it drops an AI agent into a simulated enterprise environment with over 8,000 locally hosted APIs across 62 domains, gives it a task that requires 3 to 7 steps of reasoning, and watches whether it can actually finish the job.

Spoiler: most models can’t.

What VAKRA Actually Measures

The benchmark breaks down into four capabilities, each designed to expose a different weak spot in current agent architectures.

Capability 1: API Chaining using Business Intelligence APIs

This one has 2,077 test instances across 54 domains. The agent needs to chain 1 to 12 tool calls to answer a question. Here’s a concrete example from the dataset:

Query: "Which football team has a build-up play speed of 31, build-up plan dribbling of 53, and build-up play passing of 32?"

The agent has to call get_data first to initialize the data source (this returns a lightweight preview, not the full dataset), then chain select_data_equal_to calls for each filter, and finally call get_team_name to extract the answer. The answer is FC Barcelona.

What I find interesting here is the design choice to return only a preview of the data rather than the full dataset. That’s smart – it keeps large data transfers from bogging down MCP traffic, but it also means the agent has to reason about what data exists before it can query it. This is closer to how a human analyst would work: skim the schema, then drill down.

The tool set is split into two collections. SLOT-BIRD gives you 7 generic data manipulation tools (filter, sort, etc.), reminiscent of Tableau or Google Analytics. SEL-BIRD extends this with more specialized tools, including flattening categorical arguments into separate functions. So instead of sort_data(ascending: bool), you get sort_data_ascending and sort_data_descending. This is a trade-off: it reduces ambiguity for the agent but blows up the tool space.
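
The flattening idea is easy to show in a few lines of Python. This is only a sketch of the pattern, assuming a toy sort_data function and a hypothetical helper called flatten_argument; it says nothing about how the benchmark actually generates its SEL-BIRD tools.

```python
from functools import partial
from typing import Any, Callable

Row = dict[str, Any]


def sort_data(rows: list[Row], key: str, ascending: bool) -> list[Row]:
    """Generic tool: one function, with the direction passed as an argument."""
    return sorted(rows, key=lambda row: row[key], reverse=not ascending)


def flatten_argument(
    tool: Callable[..., Any], name: str, arg: str, variants: dict[str, Any]
) -> dict[str, Callable[..., Any]]:
    """Turn one tool with a categorical argument into one tool per value.

    Hypothetical helper: `variants` maps a name suffix to the value it pins.
    """
    return {
        f"{name}_{suffix}": partial(tool, **{arg: value})
        for suffix, value in variants.items()
    }


generic_tools = {"sort_data": sort_data}
flattened_tools = flatten_argument(
    sort_data, "sort_data", "ascending", {"ascending": True, "descending": False}
)

rows = [{"x": 2}, {"x": 1}, {"x": 3}]
print(sorted(flattened_tools))
print(flattened_tools["sort_data_descending"](rows, "x"))
```

One boolean doubles the tool count for that function. Apply the same treatment to every categorical argument and the growth in the tool space becomes obvious.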

Capability 2: Tool Selection using Dashboard APIs

This capability has 1,597 instances across 17 domains, using REST-style APIs wrapped in MCP servers. The twist is that each domain exposes between 6 and 328 tools (116 on average), so the agent has to pick the right one from a potentially massive list.

This is where things get ugly. The OpenAI API spec limits the tool list to 128 tools. That means if your domain has more than 128 tools – and many do – you can’t just dump them all into a single request. You need a retrieval mechanism or a hierarchical selection strategy. Most current agent frameworks don’t handle this gracefully.
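
As a rough illustration of the retrieval option, the Python sketch below shortlists tools by word overlap between the query and each tool’s name and description, then truncates to the cap. The scoring scheme and the tool names are assumptions chosen for brevity. A real system would more likely use embeddings or a learned ranker.

```python
import re

MAX_TOOLS = 128  # the per-request cap mentioned above


def tokenize(text: str) -> set[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def shortlist_tools(query: str, tools: dict[str, str], limit: int = MAX_TOOLS) -> list[str]:
    """Rank tool names by word overlap with the query and keep the top `limit`.

    `tools` maps a tool name to its description. Ties break alphabetically
    so the result is deterministic.
    """
    query_words = tokenize(query)

    def score(name: str) -> int:
        return len(query_words & tokenize(f"{name} {tools[name]}"))

    ranked = sorted(tools, key=lambda name: (-score(name), name))
    return ranked[:limit]


# Hypothetical domain with more tools than fit in one request.
catalog = {f"tool_{i}": "unrelated filler endpoint" for i in range(300)}
catalog["get_monthly_revenue"] = "Return revenue totals grouped by month"

picked = shortlist_tools("What was revenue by month last year?", catalog)
print(len(picked), picked[0])
```

Lexical overlap is crude and will miss paraphrases, but the structure carries over: rank first, cut to the limit, and only then hand the list to the model.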

The Failure Modes Are Familiar But Painful

The VAKRA paper goes into detail on what breaks, and none of it is surprising if you’ve been watching this space:

Tool hallucination: Agents call tools that don’t exist, or pass arguments that don’t match the API spec. This happens even with well-documented APIs. The model “remembers” a function signature from training data and tries to use it even though the actual API is different.

Chain breakage: A single wrong intermediate result cascades. The agent filters on the wrong column, gets an empty result, and then either hallucinates an answer or starts a completely unrelated chain of tool calls. This is the agent equivalent of a spreadsheet error that nobody catches until the board meeting.

Context window saturation: Multi-step workflows generate a lot of intermediate state. Tool call traces, data previews, error messages – it all piles up. By step 5 or 6, the model starts losing track of what it was trying to do. I’ve seen this in production systems too; it’s not just a benchmark artifact.

Over-reliance on the first tool call: Agents tend to commit to a strategy based on the first tool they call, even if the data preview suggests a different approach would work better. This is the agent version of anchoring bias.
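
The first two failure modes can at least be caught early with cheap guards around each call. The Python sketch below is one way to do that, under stated assumptions: ToolSpec, validate_call, run_chain, and the one-tool registry are all hypothetical names for this example and are not part of VAKRA or any agent framework.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


class ToolCallError(Exception):
    """Raised when a call should not run or its result should not be trusted."""


@dataclass
class ToolSpec:
    func: Callable[..., Any]
    required: set[str]
    optional: set[str] = field(default_factory=set)


def validate_call(registry: dict[str, ToolSpec], name: str, args: dict[str, Any]) -> None:
    """Reject hallucinated tool names and argument names before anything executes."""
    if name not in registry:
        raise ToolCallError(f"unknown tool: {name!r}")
    spec = registry[name]
    missing = spec.required - set(args)
    unexpected = set(args) - spec.required - spec.optional
    if missing or unexpected:
        raise ToolCallError(
            f"{name}: missing={sorted(missing)} unexpected={sorted(unexpected)}"
        )


def run_chain(
    registry: dict[str, ToolSpec], calls: list[tuple[str, dict[str, Any]]]
) -> list[Any]:
    """Run calls in order, halting on an invalid call or an empty result."""
    results = []
    for name, args in calls:
        validate_call(registry, name, args)
        result = registry[name].func(**args)
        is_empty = result is None or (hasattr(result, "__len__") and len(result) == 0)
        if is_empty:
            # Stop here instead of letting the agent guess from nothing.
            raise ToolCallError(f"{name} returned nothing; halting the chain")
        results.append(result)
    return results


# Hypothetical one-tool registry for the demo.
ROWS = [{"city": "Oslo", "sales": 10}, {"city": "Lima", "sales": 7}]
REGISTRY = {
    "filter_rows": ToolSpec(
        func=lambda column, value: [r for r in ROWS if r.get(column) == value],
        required={"column", "value"},
    )
}

print(run_chain(REGISTRY, [("filter_rows", {"column": "city", "value": "Oslo"})]))
```

Neither guard makes the agent smarter. They just turn silent failures into explicit errors that the agent, or a human, can react to.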

Why This Matters More Than Another LLM Leaderboard

Most benchmarks test isolated skills. MMLU tests knowledge. HumanEval tests code generation. GSM8K tests math. But enterprise workflows are none of these things in isolation. They’re a messy combination of API calls, document retrieval, data manipulation, and decision-making under uncertainty.

VAKRA is harder than any of those benchmarks because it’s compositional. The agent can’t just generate text; it has to act, observe the results of its actions, and adjust its plan. That’s a fundamentally different capability, and the low scores suggest we’re still early in the curve.

I also appreciate that VAKRA provides full execution traces, not just final answer accuracy. This is crucial for debugging. If an agent gets the right answer but took a weird path, that’s still a problem for reliability. And if it gets the wrong answer, you can see exactly where it went off the rails.
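
Here is a small Python sketch of the kind of debugging a trace makes possible: find the first step where an agent’s calls depart from a reference trajectory. The trace format, a list of (tool name, arguments) pairs, is an assumption for this example and not VAKRA’s actual trace schema.

```python
from typing import Any, Optional

Step = tuple[str, dict[str, Any]]


def first_divergence(reference: list[Step], actual: list[Step]) -> Optional[int]:
    """Return the index of the first step that differs, or None if the traces match."""
    for index, (expected, observed) in enumerate(zip(reference, actual)):
        if expected != observed:
            return index
    if len(reference) != len(actual):
        # One trace is a strict prefix of the other.
        return min(len(reference), len(actual))
    return None


reference_trace: list[Step] = [
    ("get_data", {}),
    ("select_data_equal_to", {"column": "play_speed", "value": 31}),
    ("get_team_name", {}),
]
agent_trace: list[Step] = [
    ("get_data", {}),
    ("select_data_equal_to", {"column": "play_passing", "value": 31}),
    ("get_team_name", {}),
]

print(first_divergence(reference_trace, agent_trace))
```

A divergence is not automatically a mistake, since many tasks have more than one valid path. It does tell you where to start reading.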

The Elephant in the Room: Cost and Latency

VAKRA doesn’t explicitly measure this, but anyone running agents in production knows that multi-step tool use is expensive. Each tool call is an API round trip plus LLM inference time. A 7-step workflow with retries can easily cost $0.50-$2.00 per task with current models. For enterprise scale, that adds up fast.
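
The arithmetic behind a number like that is simple enough to sketch in Python. Every input below is a placeholder: the token counts, the per-million-token prices, and the retry rate are made up, so plug in your own before drawing conclusions. The model assumes the prompt grows by a fixed amount each step as tool output accumulates.

```python
def estimate_task_cost(
    steps: int,
    base_input_tokens: int,
    added_tokens_per_step: int,
    output_tokens_per_step: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    retry_rate: float = 0.0,
) -> float:
    """Estimate LLM spend in USD for one multi-step task.

    Step k (counting from 0) re-sends the base prompt plus everything
    accumulated so far, so input tokens grow linearly with k.
    """
    total_input = sum(
        base_input_tokens + k * added_tokens_per_step for k in range(steps)
    )
    total_output = steps * output_tokens_per_step
    cost = (
        total_input * usd_per_million_input
        + total_output * usd_per_million_output
    ) / 1_000_000
    # Treat retries as a flat percentage of extra work.
    return cost * (1 + retry_rate)


# Placeholder inputs, not measurements.
example = estimate_task_cost(
    steps=7,
    base_input_tokens=4_000,
    added_tokens_per_step=1_500,
    output_tokens_per_step=300,
    usd_per_million_input=10.0,
    usd_per_million_output=30.0,
    retry_rate=0.2,
)
print(f"${example:.2f} per task")
```

Because the prompt is re-sent at every step, input tokens dominate the bill, and the total grows faster than the step count.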

The benchmark also doesn’t address the cold start problem: how do you configure the tool descriptions, example queries, and error handling for a new domain? That’s still largely manual work, and it’s where most of the engineering effort goes in real deployments.

What I’d Like to See Next

VAKRA is a solid contribution, but I’d love to see a few extensions:

Multi-agent variants: Can a system with a planner agent, a tool-calling agent, and a verification agent outperform a single monolithic agent?
Error recovery metrics: How well do agents recover from a failed tool call? Current benchmarks mostly measure success on the first attempt.
Latency and cost baselines: Because enterprise buyers care about both accuracy and operational cost, not just leaderboard position.

For now, VAKRA is a useful reality check. If you’re building an enterprise agent and it scores well here, you’re doing something right. If it doesn’t, you’re in good company – nobody’s agent does yet.