AI scientists produce results without reasoning scientifically
A new study posted to arXiv by a team of eight researchers found that AI scientific agents complete research tasks while skipping the core reasoning that makes science trustworthy. The systems ignore evidence in 68% of cases and almost never revise their beliefs when confronted with contradicting evidence.
Martiño Ríos-García, Kevin Maik Jablonka, and six colleagues submitted a paper to arXiv on April 20, 2026, with a title that doubles as a verdict: "AI scientists produce results without reasoning scientifically." The team ran more than 25,000 agent trials across eight scientific domains, from automated workflow execution to open-ended hypothesis testing, and what they found should give serious pause to every lab director currently handing research tasks to an LLM-based agent.
Why This Matters
The scientific method is not just a set of steps; it is a self-correcting system. When AI agents bypass that system while still producing plausible-looking outputs, the entire integrity of the scientific record is at risk. The study's finding that evidence is ignored in 68% of agent reasoning traces is not a minor calibration problem. It means that nearly 7 in 10 reasoning chains produced by these systems are epistemically hollow, and standard outcome-based evaluations, the kind most research teams actually run, cannot catch that failure at all.
The Full Story
The research team designed a study specifically to test whether LLM-based agents reason in ways that align with the epistemic norms of science, not just whether they produce correct answers. Those norms include things like updating beliefs when new evidence arrives, seeking out contradicting data, and applying consistent logic across multiple tests. The distinction matters because a system can get the right answer by pattern-matching against its training data without ever engaging in genuine empirical reasoning.
To stress-test that distinction, the researchers ran their experiments through two complementary analytical lenses. The first was a performance analysis that separated how much of an agent's success came from the underlying base model versus the scaffolding built around it. The second was a behavioral analysis that looked at the epistemological structure of how agents actually reasoned through problems, not just whether they got the answer right.
The performance results are striking on their own. The base model accounted for 41.4% of explained variance in outcomes, while the scaffold, the engineering work that surrounds and directs the model, accounted for only 1.5%. That is a gap of nearly 28 to 1. If you are a developer spending time tuning your agent scaffold in hopes of improving scientific reasoning quality, this paper is essentially telling you that effort is largely wasted.
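To make that variance split concrete, here is one common way such a decomposition is computed: fit a linear model with the base model and the scaffold as factors, run an ANOVA, and report each factor's share of the total sum of squares (eta-squared). The sketch below uses invented data, column names, and factor levels purely for illustration; it is not the authors' analysis code.

```python
# Illustrative sketch only: partition explained variance in task scores
# between the base model and the scaffold using a two-factor ANOVA.
# The dataset, column names, and factor levels are invented for the example.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "base_model": rng.choice(["model_a", "model_b", "model_c"], size=n),
    "scaffold":   rng.choice(["react", "plan_execute", "none"], size=n),
})

# Synthetic scores in which the base model matters far more than the scaffold.
model_effect = df["base_model"].map({"model_a": 0.0, "model_b": 0.6, "model_c": 1.2})
scaffold_effect = df["scaffold"].map({"react": 0.05, "plan_execute": 0.10, "none": 0.0})
df["score"] = model_effect + scaffold_effect + rng.normal(0, 0.5, size=n)

# Fit a linear model with both factors and run a type-II ANOVA.
fit = smf.ols("score ~ C(base_model) + C(scaffold)", data=df).fit()
anova = sm.stats.anova_lm(fit, typ=2)

# Eta-squared: each factor's sum of squares as a share of the total,
# analogous to the 41.4% vs 1.5% split reported in the paper.
eta_sq = anova["sum_sq"] / anova["sum_sq"].sum()
print(eta_sq.round(3))
```

On the paper's real trial data, that same kind of share came out to 41.4% for the base model and 1.5% for the scaffold.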
The behavioral findings are where the story gets genuinely alarming. Across all tested configurations, agents ignored available evidence in 68% of reasoning traces. Refutation-driven belief revision, the core of how science corrects itself, occurred in only 26% of cases. Convergent multi-test evidence gathering, where an agent runs multiple independent tests and synthesizes the results, was rare across the board. Most troubling of all: these patterns did not change when agents were given nearly complete successful reasoning trajectories as context. Even when shown what good scientific reasoning looks like, the agents defaulted to their same shortcuts.
The researchers also found that this unreliability compounds over repeated trials in domains that require genuine epistemic rigor. A single flawed reasoning trace might slip past reviewers. A system that produces flawed reasoning 68% of the time, across hundreds of automated runs, generates a body of work that is structurally unsound at scale. The paper concludes that until reasoning itself becomes an explicit training target, the scientific knowledge produced by these agents cannot be justified by the process that generated it.
Key Details
- The study was submitted to arXiv on April 20, 2026, under identifier arXiv:2604.18805.
- The research team includes 8 authors, led by Martiño Ríos-García and Kevin Maik Jablonka.
- The study covered 8 scientific domains and more than 25,000 agent runs.
- The base model explained 41.4% of outcome variance; the scaffold explained only 1.5%.
- Agents ignored available evidence in 68% of reasoning traces across all configurations.
- Refutation-driven belief revision occurred in just 26% of cases.
- The behavioral patterns persisted even when agents received near-complete successful reasoning trajectories as context.
What's Next
Research teams building autonomous science pipelines need to shift evaluation frameworks away from outcome accuracy and toward process auditing, specifically measuring whether agents revise beliefs, seek contradicting evidence, and apply consistent logic. The authors argue that reasoning must become a training target, not an afterthought, which means the AI labs building foundation models carry as much responsibility here as the researchers deploying them. Watch for follow-up work that attempts to benchmark epistemological behavior directly, which would give the field a concrete metric to optimize against.
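No off-the-shelf tool does that kind of process auditing today. As a rough illustration of what it might look like, the sketch below flags whether a single reasoning trace ever revises a belief after encountering contradicting evidence, and whether it gathers more than one piece of evidence. The trace schema, field names, and heuristics are hypothetical assumptions, not the study's instrumentation.

```python
# Hypothetical sketch: audit an agent's reasoning trace for epistemic behavior
# instead of scoring only the final answer. The trace schema (a list of steps
# with a "kind" and free-text "content") is an assumption for illustration.
from dataclasses import dataclass

@dataclass
class Step:
    kind: str      # e.g. "hypothesis", "evidence", "revision", "conclusion"
    content: str
    contradicts_prior: bool = False  # would be set by a separate judge/classifier

def audit_trace(steps: list[Step]) -> dict:
    """Return simple process-level flags for one reasoning trace."""
    saw_contradiction = any(s.kind == "evidence" and s.contradicts_prior for s in steps)
    revised_after_contradiction = False
    contradiction_seen = False
    for s in steps:
        if s.kind == "evidence" and s.contradicts_prior:
            contradiction_seen = True
        if s.kind == "revision" and contradiction_seen:
            revised_after_contradiction = True
    n_evidence_steps = sum(1 for s in steps if s.kind == "evidence")
    return {
        "ignored_contradicting_evidence": saw_contradiction and not revised_after_contradiction,
        "refutation_driven_revision": revised_after_contradiction,
        "multi_test_evidence": n_evidence_steps >= 2,
    }

# Example: a trace that gathers contradicting evidence but never revises.
trace = [
    Step("hypothesis", "Compound X increases yield."),
    Step("evidence", "Run 1: yield dropped 12%.", contradicts_prior=True),
    Step("conclusion", "Compound X increases yield."),
]
print(audit_trace(trace))
# {'ignored_contradicting_evidence': True, 'refutation_driven_revision': False, 'multi_test_evidence': False}
```

In practice the contradicts_prior flag would itself have to come from a classifier or an LLM judge, which is exactly the kind of benchmark the follow-up work described above would need to standardize.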
How This Compares
This paper lands in a crowded but anxious conversation about what AI is actually doing to science. A January 2026 analysis published in Science magazine, examining 41 million research papers, found that AI tools supercharge individual researchers while narrowing collective scientific exploration. The Ríos-García study adds a sharper edge to that finding: it is not just that AI narrows what scientists explore, it is that the reasoning behind what gets explored may be structurally broken from the start.
Compare this to the Scientific American report from March 27, 2026, which noted that an AI-authored paper had passed peer review for the first time. That milestone was framed as either an acceleration of discovery or a threat to overwhelm the peer-review system with automated submissions. The new arXiv paper essentially argues that the peer-review framing misses a deeper problem. Peer review checks outputs. It does not check reasoning traces. If agents are producing plausible outputs through epistemically hollow processes, peer review as currently practiced will not catch it.

There is also the matter of a finding reported in Nature showing that AI language models can predict the results of studies they have never read with notable accuracy. Physics communicator Sabine Hossenfelder covered this in a December 2024 video titled "AIs Predict Research Results Without Doing Research," which drew over 206,000 views. The Ríos-García paper gives that concern a formal empirical foundation. Prediction accuracy and scientific reasoning are not the same thing, and the field has been conflating them. That conflation is now documented at the scale of 25,000 agent runs across eight domains. You can browse related AI tools being deployed in research workflows and judge for yourself how many of them address epistemological auditing at all.
FAQ
Q: What does it mean for an AI agent to ignore evidence? A: In the context of this study, it means the agent proceeds through a reasoning task without updating its conclusions when new data contradicts its earlier assumptions. Instead of revising its beliefs based on what the evidence shows, the agent continues toward its original answer. This happened in 68% of the reasoning traces the researchers analyzed across more than 25,000 runs.
Q: Why can't better scaffolding fix this problem? A: Scaffolding is the engineering layer built around an AI model to guide how it behaves during a task. The study found that scaffolding accounted for only 1.5% of explained variance in agent behavior, compared to 41.4% for the base model itself. That means the reasoning problem is baked into the model, and no amount of prompt engineering or workflow design can reliably compensate for it.
Q: How do you evaluate whether an AI is reasoning scientifically? A: The researchers used a behavioral analysis that examined the epistemological structure of each reasoning trace, not just whether the final answer was correct. They looked for specific patterns like belief revision after encountering contradicting evidence and convergent reasoning across multiple independent tests. Standard outcome-based evaluation, which checks only whether the answer matches a known result, cannot detect these failures at all.
The Ríos-García paper is a clear-eyed intervention in a space that has been moving too fast to ask foundational questions. The productive path forward requires treating scientific reasoning as a measurable, trainable capability rather than an assumed byproduct of task completion. For more AI news on autonomous agents and their real-world limitations, and for guides on evaluating AI tools in research workflows, subscribe to the AI Agents Daily newsletter for daily updates on AI agents, tools, and automation.