Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
Researchers from a Chinese university have found a way to make AI-powered math theorem provers dramatically more efficient by borrowing a trick from compiler technology. Instead of burning through thousands of failed proof attempts, their system uses compiler error outputs to compress that failure information into a small set of reusable categories.
Guchan Li, Rui Tian, and Hongning Wang submitted their work to arXiv (paper identifier 2604.18587) on March 13, 2026, describing a framework they call "Compile to Compress." The core claim is that compiler outputs, specifically the structured error messages a formal verification system generates when a proof attempt fails, can serve as a compression signal that makes AI-assisted theorem proving far cheaper to run without sacrificing accuracy.
Why This Matters
Formal theorem proving is the unglamorous backbone of verified AI reasoning, and right now it is prohibitively expensive to run at scale. The dominant approach forces systems to generate thousands of proof candidates and hope one sticks, which is wasteful and brittle. Li and colleagues are pointing at a structural property of formal systems that everyone else has been ignoring: compilers collapse a huge space of wrong answers into a small, interpretable set of failure categories, and that compression is free information you can actually learn from. If this approach holds up under scrutiny, it could cut the compute bill for formal reasoning systems by orders of magnitude and finally make verified AI reasoning deployable in real-world applications.
The Full Story
The problem the paper attacks is straightforward to describe but genuinely hard to solve. When you ask an AI model to formally prove a mathematical theorem, you are not asking it to reason in plain English. You are asking it to produce a machine-checkable proof in a formal language like Lean 4, where every logical step must satisfy a strict verifier. The verifier either accepts the proof or rejects it, and rejection is the overwhelmingly common outcome.
Current state-of-the-art systems deal with this by brute force. They either generate massive numbers of candidate proofs in parallel and check them until something passes, or they expand the context window so the model can keep a long memory of prior failed attempts. Both strategies are expensive: parallel rollouts burn GPU cycles linearly with the number of attempts, while long context windows are slow and memory-hungry. Neither approach teaches the system anything durable about why its proofs keep failing.
The central insight of the paper is that the formal verification compiler is not just a pass-fail oracle. It is a structured feedback machine. When a proof step fails, the compiler produces an error message that belongs to a recognizable category of failure modes. Critically, a vast number of superficially different wrong proofs all map to the same small set of compiler error types. That is the compression the title refers to: much as a conventional compiler collapses many syntactically different broken programs into a handful of diagnostic classes, the verifier collapses a huge, diverse space of failed proofs into structure you can learn from.
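To make the compression idea concrete, here is a minimal sketch in Python. The error patterns and category names below are illustrative only; the paper's actual failure taxonomy is not specified in this coverage. The point is simply that many distinct failed attempts bucket into far fewer categories.

```python
import re

# Hypothetical buckets for Lean-style error messages. These patterns are
# illustrative, not the paper's actual taxonomy.
ERROR_CATEGORIES = [
    ("unknown_identifier", re.compile(r"unknown (identifier|constant)")),
    ("type_mismatch",      re.compile(r"type mismatch")),
    ("unsolved_goals",     re.compile(r"unsolved goals")),
    ("tactic_failed",      re.compile(r"tactic .* failed")),
]

def classify(error_message: str) -> str:
    """Map a raw compiler error message onto a small category set."""
    for name, pattern in ERROR_CATEGORIES:
        if pattern.search(error_message):
            return name
    return "other"

# Many superficially different failed attempts collapse to few categories.
failures = [
    "unknown identifier 'Nat.add_comm'",
    "unknown constant 'Real.sqrt_two'",
    "type mismatch: expected Nat, got Int",
    "tactic 'ring' failed",
]
categories = {classify(msg) for msg in failures}
print(len(failures), "failures ->", len(categories), "categories")
```

Four raw error strings here reduce to three categories; at the scale of thousands of rollouts, that many-to-few mapping is the "free information" the authors exploit.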
Li and colleagues build what they call a learning-to-refine framework on top of this observation. Rather than discarding failed proof attempts and starting over, their system performs a tree search that corrects errors locally, using the explicit verifier feedback as a conditioning signal. When the compiler says a proof step fails for a specific reason, the model refines only that local portion of the proof tree rather than regenerating everything from scratch. This avoids accumulating a long, expensive history of prior attempts in the context window.
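The local-refinement loop described above can be sketched as follows. This is a toy reconstruction under stated assumptions, not the paper's implementation: `verify`, `refine`, and the list-of-tactics proof representation are all hypothetical stand-ins for the real verifier and model.

```python
from typing import Callable, Optional

# A proof is modeled as a list of tactic strings. verify returns
# (failing_step_index, error_message) on failure or None on success.
Proof = list
VerifyResult = Optional[tuple]

def refine_locally(proof: Proof,
                   verify: Callable[[Proof], VerifyResult],
                   refine: Callable[[str, str], str],
                   max_rounds: int = 8) -> Optional[Proof]:
    """Repair only the first failing step, conditioned on the compiler's
    error, instead of regenerating the whole proof from scratch."""
    for _ in range(max_rounds):
        result = verify(proof)
        if result is None:
            return proof                    # every step checks: accepted
        idx, error = result
        proof = proof.copy()
        proof[idx] = refine(proof[idx], error)  # local, error-guided edit
    return None                             # refinement budget exhausted

# Toy verifier: any step equal to "sorry" fails with a recognizable error.
def toy_verify(p: Proof) -> VerifyResult:
    for i, step in enumerate(p):
        if step == "sorry":
            return (i, "unsolved goals")
    return None

fixed = refine_locally(["intro n", "sorry", "ring"],
                       toy_verify,
                       lambda step, err: "simp")  # toy refiner model
print(fixed)  # ['intro n', 'simp', 'ring']
```

Because each round touches one step and carries only the latest error message, the context the model must hold stays small, which is exactly the cost structure the paper contrasts with long-history approaches.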
The team tested their approach on PutnamBench, a benchmark derived from the William Lowell Putnam Mathematical Competition, which is widely regarded as one of the harder evaluations for AI theorem provers. Their method achieved state-of-the-art performance among all publicly reported models in the approximately 8 billion parameter and approximately 32 billion parameter categories, under comparable test-time compute budgets. That last qualifier matters enormously. Hitting top scores while spending no more compute than the competition is the actual proof of concept here.
Key Details
- Paper arXiv:2604.18587 was submitted on March 13, 2026, by Guchan Li, Rui Tian, and Hongning Wang.
- The method achieves state-of-the-art results on PutnamBench for models in both the roughly 8 billion and roughly 32 billion parameter ranges.
- The framework uses tree search with local error correction, conditioned on compiler feedback, rather than full proof regeneration.
- The paper spans 838 KB and is cross-listed under Machine Learning, Artificial Intelligence, Logic in Computer Science, and Programming Languages on arXiv.
- The core claim is that compilers map a large, diverse space of proof attempts to a compact, structured set of failure modes, making that failure information reusable for learning.
What's Next
The immediate test for this work is independent replication on benchmarks beyond PutnamBench, particularly MiniF2F and the International Mathematical Olympiad formalization tasks that other groups have been using as evaluation standards. If the efficiency gains hold across those benchmarks and across different formal systems beyond Lean 4, the compiler-compression idea will likely be absorbed into the standard toolkit for formal reasoning agents. Watch for follow-up work exploring whether the same compression principle applies to verified code generation, where compiler feedback is already a standard part of the development loop.
How This Compares
The NeurIPS 2025 paper APOLLO, from Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh, tackled a closely related problem by automating the collaboration between LLMs and the Lean proof assistant. APOLLO's approach still depended on prompting LLMs thousands of times until a correct proof emerged, which the authors themselves acknowledged as a core limitation. The "Compile to Compress" work is a direct response to exactly that limitation, proposing a principled alternative rather than a workaround.
The Nature paper on Olympiad-level formal reasoning published in 2025, which demonstrated reinforcement learning for proof generation, showed that formal systems can reach impressive heights given enough training signal. However, RL-based approaches require substantial offline training and do not necessarily help with test-time efficiency for new problems. Li and colleagues are working at inference time, which means their gains are orthogonal to training-side improvements and could stack on top of a well-trained RL base model.
Compared to the broader position paper from Kaiyu Yang at Meta FAIR, Gabriel Poesia at Stanford, and collaborators from UC Berkeley and the University of Edinburgh, which argued in arXiv:2412.16075v1 that formal mathematical reasoning is indispensable for next-generation AI, the "Compile to Compress" work is the kind of concrete efficiency result that turns a research direction into a deployable technology. The field has been building the theoretical case for formal reasoning for years. Papers like this one start building the engineering case. For AI tools and platforms targeting math and code verification, that engineering case is what actually moves adoption.
FAQ
Q: What is formal theorem proving and why does it matter for AI? A: Formal theorem proving means writing mathematical proofs that a computer can automatically verify as correct, with zero ambiguity. For AI, it matters because it offers a way to guarantee that a system's reasoning is actually valid rather than just plausible-sounding. Systems like Lean 4 are the proof checkers, and getting AI to use them reliably is considered a key step toward trustworthy AI reasoning.
Q: How does using compiler output actually speed up theorem proving? A: When an AI's proof attempt fails, the compiler generates a structured error message explaining exactly what went wrong. Because many different wrong proofs produce the same few error types, the system can learn from those patterns and fix only the broken part of a proof rather than starting from scratch every time, which cuts the total number of attempts needed.
Q: What is PutnamBench and is it a reliable test for AI math ability? A: PutnamBench is a formalized collection of problems from the William Lowell Putnam Mathematical Competition, an annual undergraduate math contest known for its difficulty. It is considered a credible benchmark for AI theorem provers because the problems require genuine mathematical creativity, not just pattern matching, and the formal format means scoring is objective and not subject to interpretation.
The "Compile to Compress" framework is a smart piece of engineering that borrows from compiler design, a field machine learning researchers rarely consult. If the results replicate broadly, this approach could make formally verified AI reasoning accessible well outside the handful of well-funded labs that can currently afford the compute, and independent evaluations over the coming months will show whether it does.