Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
IBM Research published a deep technical analysis of VAKRA, a benchmark that tests AI agents across more than 8,000 APIs in 62 domains to find exactly where and how they break. The findings reveal that cascading errors, not just isolated mistakes, are the real threat to deploying ...
Ankita Naik, writing for the Hugging Face Blog alongside co-authors from IBM Research including Danish, Ben, and Anupama Murthi, published on April 15, 2026 what may be the most systematic look yet at how AI agents fail in enterprise settings. The team introduced VAKRA, a tool-grounded, executable benchmark built to stress-test agent reasoning and tool use under conditions that resemble real business workflows, not sanitized academic tasks.
Why This Matters
The AI agent space has been moving fast on capability claims while largely ignoring the reliability problem, and VAKRA is a direct challenge to that pattern. Gartner has projected that roughly half of all AI agent implementations currently in development will fail to meet the reliability thresholds required for autonomous production use, which should alarm any enterprise CTO who has already green-lit an agent deployment roadmap. IBM Research is not just building a leaderboard here. They are building a failure taxonomy that the whole industry needs but few have had the patience to produce. This kind of empirical, unglamorous work is what actually moves a technology from demo to deployment.
The Full Story
The VAKRA benchmark was built around a core frustration: existing AI agent evaluations test isolated skills rather than the messy, multi-step workflows that agents face in real enterprise environments. IBM Research designed the system with over 8,000 locally hosted APIs backed by real databases spanning 62 domains, paired with domain-aligned document collections. Tasks within the benchmark require agents to complete 3- to 7-step reasoning chains that combine structured API interaction with unstructured document retrieval, a much harder challenge than answering a single question or running a single tool call.
What the team found when they ran agents through this environment was not encouraging, but it was clarifying. The primary reliability threat is not the sheer variety of failure modes an agent encounters. It is error propagation, where a single early mistake in reasoning or tool selection compounds through every subsequent decision. One wrong API parameter at step 2 of a six-step chain can make every action after it irrelevant or actively counterproductive. This cascading dynamic is harder to catch and harder to fix than isolated errors, because the agent often continues operating without any explicit signal that something has gone wrong.
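To make the cascading dynamic concrete, here is a minimal sketch (not taken from the VAKRA paper) of a three-step tool chain in which one early parameter mistake propagates silently. The mock order and customer "APIs" are hypothetical stand-ins for real enterprise tools.

```python
# Two real orders exist; the agent means ORD-1001 but transposes a digit.
ORDERS = {
    "ORD-1001": {"customer_id": "C-7", "total": 120.0},  # the order the user asked about
    "ORD-1002": {"customer_id": "C-9", "total": 15.0},   # a different, equally real order
}
CUSTOMERS = {"C-7": {"email": "pat@example.com"}, "C-9": {"email": "sam@example.com"}}

def lookup_order(order_id):
    return ORDERS[order_id]

def lookup_customer(customer_id):
    return CUSTOMERS[customer_id]

def draft_refund(email, amount):
    return f"To: {email} | Refunding ${amount:.2f}"

# Step 1 goes wrong, but because ORD-1002 is a valid ID, no step raises an
# error. Steps 2 and 3 execute flawlessly on the wrong data.
order = lookup_order("ORD-1002")
customer = lookup_customer(order["customer_id"])
print(draft_refund(customer["email"], order["total"]))
# A syntactically perfect refund, addressed to the wrong customer.
```

The point is that every individual call succeeds, which is exactly why no explicit failure signal ever reaches the agent.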
The research documents four major categories of failure. Tool hallucination occurs when agents fabricate API parameters or specify tool interfaces that do not exist. Parameter misspecification happens when agents provide syntactically correct but semantically wrong values to real tools. Context loss describes situations where agents fail to carry task-relevant information across turns, leading to contradictory behavior. Planning failures emerge when agents cannot decompose a complex goal into the right sequence of tool calls. Each category requires a different intervention strategy, which is part of why cataloging them separately matters.
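The four categories lend themselves to concrete checks. The sketch below is an illustration of how a team might turn the taxonomy into per-call trace inspection; the trace schema and tool names are invented for this example and are not VAKRA's actual format.

```python
# What the agent is allowed to call, and with which parameter names.
TOOL_SCHEMAS = {
    "get_invoice": {"invoice_id"},
    "send_reminder": {"invoice_id", "channel"},
}

def classify_failure(call, known_facts, goal_reached):
    """Return the first matching failure category for one tool call, or None."""
    # 1. Tool hallucination: the tool or a parameter name does not exist.
    if call["tool"] not in TOOL_SCHEMAS:
        return "tool_hallucination"
    if not set(call["args"]) <= TOOL_SCHEMAS[call["tool"]]:
        return "tool_hallucination"
    # 2. Parameter misspecification: real parameter, semantically wrong value.
    if call["args"].get("channel") not in (None, "email", "sms"):
        return "parameter_misspecification"
    # 3. Context loss: the call contradicts a fact established earlier.
    for key, value in call["args"].items():
        if key in known_facts and known_facts[key] != value:
            return "context_loss"
    # 4. Planning failure: the chain completed without reaching the goal.
    if not goal_reached:
        return "planning_failure"
    return None

bad_call = {"tool": "send_reminderz", "args": {"invoice_id": "INV-1"}}
print(classify_failure(bad_call, known_facts={}, goal_reached=True))
# -> tool_hallucination
```

Each branch corresponds to a different intervention: schema validation for hallucination, value-level checks for misspecification, state tracking for context loss, and goal verification for planning.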
The team also found that agents gravitate toward frequently used tools even when better options exist for a given task. This suggests that agents are not truly mapping task requirements to tool capabilities in a principled way. They are pattern-matching to familiar tools, which works until it does not. For enterprise deployments where tool libraries may include dozens of specialized APIs, this is a meaningful design constraint that architects need to account for.
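The contrast between pattern-matching and principled selection can be sketched directly. In this hypothetical example (the tool names and call counts are invented), frequency-based selection picks the familiar generic tool, while capability matching picks the specialized one that actually covers the task.

```python
TOOLS = {
    "generic_search":     {"capabilities": {"search"}, "call_count": 9120},
    "contract_clause_db": {"capabilities": {"search", "legal_clauses"}, "call_count": 47},
}

def pick_by_frequency(tools, _required):
    # Pattern-matching behavior: the most-used tool wins regardless of fit.
    return max(tools, key=lambda name: tools[name]["call_count"])

def pick_by_capability(tools, required):
    # Principled behavior: only tools covering every requirement qualify;
    # among those, prefer the most specialized (fewest extra capabilities).
    fits = [n for n, t in tools.items() if required <= t["capabilities"]]
    return min(fits, key=lambda n: len(tools[n]["capabilities"])) if fits else None

need = {"search", "legal_clauses"}
print(pick_by_frequency(TOOLS, need))   # generic_search (wrong for the task)
print(pick_by_capability(TOOLS, need))  # contract_clause_db
```

For tool libraries with dozens of specialized APIs, this suggests that explicit capability metadata on each tool is worth maintaining, since the model cannot be trusted to infer fit from familiarity.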
Multi-agent orchestration received pointed scrutiny in the analysis. The intuition that specialized agents working in parallel should outperform a single agent is not well supported by the VAKRA findings. Coordination failures, misaligned task state between agents, and conflicting actions introduce new failure modes that offset potential gains. The research suggests that simpler single-agent architectures may prove more reliable than elaborate multi-agent setups at the current maturity level of the technology.
Key Details
- VAKRA includes over 8,000 locally hosted APIs backed by real databases across 62 distinct domains.
- Tasks require 3- to 7-step reasoning chains combining API calls with document retrieval.
- The benchmark was published on April 15, 2026 on the Hugging Face Blog by Ankita Naik and the IBM Research team.
- Gartner projections cited in related industry analysis estimate that approximately 50 percent of current AI agent implementations will fail reliability thresholds for autonomous production use.
- The VAKRA dataset and leaderboard are publicly available on Hugging Face and GitHub for external submissions.
- Four primary failure categories are documented: tool hallucination, parameter misspecification, context loss, and planning failures.
What's Next
The VAKRA leaderboard is live and open for external submissions on GitHub, meaning the research community can now benchmark their own agents against the same environment IBM used and track performance across the 62 domains. Expect the benchmark to become a standard reference point in enterprise agent evaluation conversations over the next 6 to 12 months as more organizations confront deployment reliability problems that VAKRA explicitly names. Teams building production agents should treat the 4-category failure taxonomy as a checklist for their monitoring and testing infrastructure, not an abstract research curiosity.
How This Compares
OpenAI's Agents SDK, which the company has been actively advancing in 2025 and 2026, takes a different approach to the reliability problem. OpenAI focuses primarily on tooling that helps developers build safer agents through better orchestration primitives, while VAKRA focuses on empirical measurement of where agents break across a standardized environment. Both approaches are necessary, but IBM's contribution fills a gap that OpenAI's SDK alone cannot address: you need a benchmark before you can know whether your safety tooling is actually working. For teams that want to evaluate AI tools and platforms against a rigorous standard, VAKRA now provides that baseline.
The MAST taxonomy, developed in parallel by the broader research community, classifies agent failures into actionable categories that engineers can design around. VAKRA's framework shares that categorical ambition but goes further by grounding the analysis in a fully executable environment with real databases rather than static test cases. MAST is a useful conceptual map. VAKRA is a proving ground you can actually run your agent through and get a score.
Compared to general-purpose benchmarks like GAIA or WebArena, VAKRA is deliberately enterprise-oriented, which makes it less headline-grabbing but considerably more useful for the organizations where agent failures carry real financial and operational risk. The AI news cycle tends to reward benchmarks that show impressive top-end scores. VAKRA is more interested in exposing the floor, and that is the more honest contribution.
FAQ
Q: What is the VAKRA benchmark and who made it? A: VAKRA is a benchmark created by IBM Research to test how well AI agents reason and use tools in enterprise-like environments. It was published on April 15, 2026 by Ankita Naik and colleagues on the Hugging Face Blog. The benchmark includes over 8,000 APIs across 62 domains and is freely available on Hugging Face and GitHub.
Q: What does AI agent tool hallucination mean in practice? A: Tool hallucination happens when an AI agent invents API parameters or calls tools that do not actually exist in its environment. Instead of selecting a real, available tool correctly, the agent confidently generates a fake or malformed tool call, which causes the workflow to fail silently or produce wrong results downstream.
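One practical mitigation is to validate every proposed tool call against a registry of real tools before execution, so a hallucinated call fails loudly instead of silently. This guard is an assumption of this article, not part of VAKRA itself, and the tool names are hypothetical.

```python
# Registry of real tools with their real parameter names.
REGISTRY = {
    "create_ticket": {"required": {"title"}, "optional": {"priority"}},
}

class HallucinatedToolCall(Exception):
    """Raised when an agent proposes a tool or parameter that does not exist."""

def validate_call(tool, args):
    spec = REGISTRY.get(tool)
    if spec is None:
        raise HallucinatedToolCall(f"no such tool: {tool}")
    allowed = spec["required"] | spec["optional"]
    unknown = set(args) - allowed
    missing = spec["required"] - set(args)
    if unknown or missing:
        raise HallucinatedToolCall(
            f"{tool}: unknown params {sorted(unknown)}, missing {sorted(missing)}")
    return True

validate_call("create_ticket", {"title": "VPN down", "priority": "high"})  # passes
try:
    # The agent invents a parameter the tool never had.
    validate_call("create_ticket", {"title": "VPN down", "urgency_level": 1})
except HallucinatedToolCall as err:
    print(err)
```

The same check catches entirely fabricated tools and malformed parameter sets with one mechanism, which makes the failure explicit at the step where it originates rather than letting it cascade.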
Q: How can developers use VAKRA to improve their agents? A: Developers can submit their agents to the live VAKRA leaderboard on GitHub to measure performance across the benchmark's 62 domains. Reviewing execution traces against the four documented failure categories (tool hallucination, parameter misspecification, context loss, and planning failures) points directly to which parts of an agent's architecture need the most work. Practical guides on agent evaluation can help teams interpret those results.
IBM Research's VAKRA work arrives at exactly the right moment, when the industry's enthusiasm for autonomous agents is running well ahead of its understanding of how those agents actually fail. The benchmark's open leaderboard and clearly documented failure taxonomy give the engineering community a shared vocabulary and a shared testing environment for building more honest assessments of agent reliability. Subscribe to the AI Agents Daily newsletter for daily updates on AI agents, tools, and automation.
Get stories like this daily
Free briefing. Curated from 50+ sources. 5-minute read every morning.


