VAKRA benchmark reveals where agents actually fail: tool selection, arguments, multi-step planning | Radar

IBM Research published VAKRA (Verifiable Agent Knowledge Retrieval and Action), a benchmark for evaluating agents in an enterprise-like environment. It goes beyond an accuracy table: VAKRA tests full execution trajectories across 8,000+ locally hosted APIs in 62 domains, validating every step.

Benchmarks finally measure where agents bleed, not just where they smile

Standard agent benchmarks typically evaluate the final answer. VAKRA adds a layer: it verifies whether the agent selected the correct tool, specified arguments correctly, and whether the final answer is grounded in tool output rather than hallucination. The four tested capabilities are API chaining, tool selection from large sets (6 to 328 tools per domain), multi-hop reasoning, and multi-source queries (APIs plus documents) with policy constraints.

The results are specific: performance degrades as the number of reasoning steps grows (tasks with 3+ hops perform significantly worse), models fail on argument specification even when they select the correct tool, and policy constraints cause accuracy drops of 30 to 50%. These are concrete numbers for things agent developers sense intuitively but have not seen systematically measured until now.

For teams deploying agents, this changes how to think about failure

Until now, typical agent debugging looked like this: the agent returned a wrong result, and you asked why. You tried a better prompt, a better model, a different framework. VAKRA offers a different frame: it breaks failure into stages (tool selection, argument specification, argument values, answer synthesis) and measures precisely where these failures occur for each model.

That is the difference between "the agent is weak on multi-step tasks" and "this model fails specifically on argument specification in 2+ hop planning." The second description points in an actionable direction.
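To make the stage breakdown concrete, here is a minimal sketch of how a team might attribute a failed trajectory to its earliest failing stage. The trajectory format, the `ToolCall` type, and the `first_failure_stage` helper are all hypothetical and invented for illustration; this is not VAKRA's schema or scoring code, and the exact-match comparisons are a deliberate simplification.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ToolCall:
    """One step of a trajectory (hypothetical format, not VAKRA's schema)."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


def first_failure_stage(
    expected: list[ToolCall],
    actual: list[ToolCall],
    expected_answer: str,
    actual_answer: str,
) -> Optional[str]:
    """Return the earliest stage where the run diverges, or None if it matches."""
    for exp, act in zip(expected, actual):
        if act.tool != exp.tool:
            return "tool_selection"
        if set(act.args) != set(exp.args):
            # Right tool, wrong parameter names (missing or extra arguments).
            return "argument_specification"
        if act.args != exp.args:
            # Right tool and parameter names, wrong values.
            return "argument_values"
    if len(actual) != len(expected):
        # Assumption: a missing or extra call is counted as a tool selection error.
        return "tool_selection"
    if actual_answer.strip().lower() != expected_answer.strip().lower():
        return "answer_synthesis"
    return None


# Made-up example: the agent picks the right tools but misnames a parameter.
reference = [
    ToolCall("search_orders", {"customer_id": 7}),
    ToolCall("get_invoice", {"order_id": 42}),
]
observed = [
    ToolCall("search_orders", {"customer_id": 7}),
    ToolCall("get_invoice", {"id": 42}),
]
stage = first_failure_stage(reference, observed, "paid", "paid")
print(stage)
```

Tallying these labels across many runs gives a per-stage failure count for each model, which is the kind of breakdown the paragraph above describes.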

VAKRA measures under ideal conditions, and correlation with production depends on your use case

VAKRA is an academic benchmark, not a production test. Locally hosted APIs are controlled environments without the noise of real traffic. Leaderboard results say what a model can do under ideal conditions with precisely defined tools. How well those results correlate with production agents in an enterprise depends on how representative the benchmark domains are of your specific use case.

At the same time, the dataset is publicly available and the environment is reproducible. That is a standard many commercially motivated benchmarks do not meet.

Value will come when results begin correlating with production outputs

Worth watching: whether VAKRA results correlate with real agent behavior outside the test environment, and whether the error types the benchmark identifies match what operations teams see in practice. If yes, VAKRA becomes a diagnostic tool. If not, it remains an academic paper.
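One rough way to check whether benchmark error types match what operations teams see is to compare the two distributions of failure stages. The sketch below uses total variation distance for this; the choice of metric, the helper names, and the sample labels are assumptions for illustration, and the data is made up rather than taken from VAKRA results.

```python
from collections import Counter


def stage_distribution(stages: list[str]) -> dict[str, float]:
    """Turn a list of failure-stage labels into relative frequencies."""
    if not stages:
        raise ValueError("need at least one failure label")
    counts = Counter(stages)
    total = sum(counts.values())
    return {stage: n / total for stage, n in counts.items()}


def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    """Total variation distance: 0.0 means identical, 1.0 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Made-up labels: failures seen on the benchmark vs. in production logs.
benchmark_failures = [
    "argument_specification", "argument_specification",
    "tool_selection", "tool_selection",
]
production_failures = [
    "argument_specification",
    "tool_selection", "tool_selection", "tool_selection",
]

distance = total_variation(
    stage_distribution(benchmark_failures),
    stage_distribution(production_failures),
)
print(f"total variation distance: {distance:.2f}")
```

A small distance would suggest the benchmark surfaces the same kinds of failures as production; a large one would suggest the benchmark domains are not representative for that deployment. What counts as small or large is a judgment call that depends on sample size.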

Lilith's verdict

Finally a benchmark that measures agent failures where they actually happen: not in the final answer, but at every intermediate step. If the results correlate with production behavior, VAKRA becomes the diagnostic tool agent developers need.