The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification
A team of four researchers has published a formal study quantifying exactly how much error creeps into neural network verification systems when they take computational shortcuts. The findings show the error grows exponentially with network depth, which is a serious problem for any modern deep architecture.
Merkouris Papamichail, Konstantinos Varsos, Giorgos Flouris, and Joao Marques-Silva submitted their paper "The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification" to arXiv on April 20, 2026, under identifier arXiv:2604.18728. The work sits at the intersection of formal methods and machine learning safety, two fields that desperately need to talk to each other more often. This paper forces that conversation by putting hard numbers on a problem the verification community has been quietly aware of but has not thoroughly measured.
Why This Matters
Neural network verification is not an academic curiosity anymore. Autonomous vehicles, medical diagnostic systems, and aircraft collision avoidance software all run on neural networks, and regulators are starting to ask hard questions about what "safe" actually means in that context. The verification tools most people use today sacrifice mathematical soundness for speed, and until this paper, nobody had formally bounded how bad that sacrifice gets. The answer, it turns out, is exponentially bad as networks get deeper, which describes nearly every modern architecture worth deploying.
The Full Story
To understand what these researchers did, you need to know how neural network verification works at a basic level. When a verification system checks whether a neural network will always behave correctly given a range of inputs, it has to represent the network's computations as a mathematical problem a solver can check. The honest, complete way to do this uses mixed-integer constraints, because the activation functions inside neural networks behave like switches that are either on or off, and each switch needs its own binary variable. Mixed-integer problems are, however, notoriously hard to solve at scale.
The shortcut that much of the research community has gravitated toward is called convex relaxation. Instead of modeling those switches precisely with integer variables, you replace them with continuous approximations whose feasible set is convex, a property that makes optimization far more tractable. The problem is that convex shapes can include points the real network could never actually reach. You end up checking safety for a slightly different, slightly larger network than the one you deployed, and without a careful analysis you do not know by how much.
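To make this concrete, here is a minimal sketch of one standard convex relaxation of a ReLU switch, the so-called "triangle" relaxation (a common textbook instance, not necessarily the exact construction the paper analyzes). The exact ReLU output at a point is a single value, but the relaxation admits a whole triangle of (input, output) pairs, including points the real network can never produce:

```python
# Illustrative sketch: the "triangle" convex relaxation of ReLU on a
# pre-activation interval [l, u] with l < 0 < u. The exact activation is
# y = max(0, x); the relaxation keeps only the linear constraints
# y >= 0, y >= x, and y <= u * (x - l) / (u - l) (the chord from
# (l, 0) to (u, u)), so it admits "phantom" points above the true graph.

def relu(x):
    return max(0.0, x)

def triangle_upper(x, l, u):
    """The relaxation's upper bound on ReLU(x): the chord from (l, 0) to (u, u)."""
    return u * (x - l) / (u - l)

l, u = -1.0, 1.0
x = 0.0
exact = relu(x)                        # true activation at x = 0
relaxed_max = triangle_upper(x, l, u)  # the relaxation also admits this value

# The vertical gap is slack the verifier must account for: behaviour that
# exists only in the relaxed network, never in the deployed one.
gap = relaxed_max - exact
print(f"exact={exact}, relaxed upper bound={relaxed_max}, gap={gap}")
```

At x = 0 the true output is 0, but the relaxed feasible set extends up to 0.5; it is exactly this kind of per-neuron slack that the paper tracks as it compounds through the layers.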
Papamichail and his co-authors decided to measure that gap rigorously. They framed the set of all possible relaxations as a mathematical lattice, a structured hierarchy with two extremes. At the bottom sits the original network with no relaxation at all, described by its exact integer constraints. At the top sits the fully relaxed version, where every single neuron has been linearized. Between those poles lie all the partial relaxations that real-world verification tools use.
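The lattice structure can be pictured with a toy encoding of our own (a schematic, not the paper's formalism): identify each partial relaxation with the set of neurons whose integer constraints have been replaced by convex ones. Set inclusion then orders the relaxations, and union and intersection play the roles of lattice join and meet:

```python
# Schematic sketch of the relaxation lattice: a partial relaxation is the
# set of neurons that have been linearized. Bottom = original network
# (nothing relaxed); top = fully relaxed network (every neuron linearized).

from itertools import combinations

neurons = frozenset(range(4))   # a toy network with 4 ReLU neurons

bottom = frozenset()            # original network: exact integer constraints
top = neurons                   # fully relaxed: every neuron linearized

def join(r1, r2):
    """Least upper bound: relax whatever either relaxation relaxes."""
    return r1 | r2

def meet(r1, r2):
    """Greatest lower bound: relax only what both relaxations relax."""
    return r1 & r2

# Every partial relaxation between the two extremes:
all_relaxations = [frozenset(c) for k in range(len(neurons) + 1)
                   for c in combinations(neurons, k)]
print(len(all_relaxations))     # 2^4 = 16 lattice elements for 4 neurons
```

Real verification tools live strictly between the two poles, and the lattice view lets the authors compare the error of any such intermediate choice against the extremes.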
The team then derived analytical upper and lower bounds for the distance between what the fully relaxed network outputs and what the original network outputs. They measured this distance using the infinity norm, which captures the worst-case difference across all output dimensions. Their core finding is stark: that distance grows exponentially with respect to the network's depth, and linearly with respect to the radius of the input region being checked. A deeper network does not just have a slightly larger verification error; it has a catastrophically larger one.
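The depth dependence can be felt even in a toy experiment of our own (a hypothetical random network, not the paper's setup): push an input box of radius eps through a deep random ReLU network using naive interval arithmetic, an especially coarse convex relaxation, and compare the certified output range against the spread actually reached by sampled inputs. The certified range is sound, in that it always contains every sample, but its slack compounds layer by layer:

```python
# Toy sketch: interval-bound propagation (a coarse convex relaxation)
# through a random ReLU network, versus the output spread of sampled
# inputs. The certified width tends to blow up with depth while the
# empirically reached spread does not keep pace.

import random

random.seed(1)
WIDTH = 8

def make_layer():
    # Random weight matrix, no biases, entries in [-1, 1].
    return [[random.uniform(-1, 1) for _ in range(WIDTH)] for _ in range(WIDTH)]

def interval_forward(layers, lo, hi):
    """Sound per-neuron bounds: interval linear map, then ReLU on both ends."""
    for W in layers:
        pre_lo = [sum(min(w * a, w * b) for w, a, b in zip(row, lo, hi)) for row in W]
        pre_hi = [sum(max(w * a, w * b) for w, a, b in zip(row, lo, hi)) for row in W]
        lo = [max(0.0, v) for v in pre_lo]
        hi = [max(0.0, v) for v in pre_hi]
    return lo, hi

def exact_forward(layers, x):
    for W in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W]
    return x

eps = 1.0
layers = [make_layer() for _ in range(6)]
certified, empirical = {}, {}

for depth in (2, 4, 6):
    lo, hi = interval_forward(layers[:depth], [-eps] * WIDTH, [eps] * WIDTH)
    certified[depth] = max(h - l for h, l in zip(hi, lo))
    samples = [exact_forward(layers[:depth],
                             [random.uniform(-eps, eps) for _ in range(WIDTH)])
               for _ in range(200)]
    empirical[depth] = max(max(s[i] for s in samples) - min(s[i] for s in samples)
                           for i in range(WIDTH))
    print(depth, round(certified[depth], 2), round(empirical[depth], 2))
```

Interval propagation is far cruder than the relaxations the paper studies, but it makes the qualitative point visible in a few lines: soundness is preserved at every depth, while the gap between certified and actual behaviour widens as layers stack up.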
The implications for misclassification are equally striking. When the researchers examined how often a relaxed verification system would incorrectly predict that a network had mislabeled an input, the probability did not grow smoothly as the input radius increased. Instead it jumped in sharp steps, a behavior that makes calibrating trust in relaxation-based tools particularly difficult because errors do not appear gradually. They validated all of this experimentally on the MNIST handwritten digit dataset, the Fashion MNIST dataset, and on randomly generated networks to confirm the results were not artifacts of a specific architecture.
Key Details
- Paper submitted April 20, 2026, by 4 authors, listed under the cs.LG and cs.AI categories on arXiv.
- ArXiv identifier is arXiv:2604.18728, with a file size of 677 KB.
- Relaxation error grows exponentially with network depth and linearly with input radius.
- The relaxation space is formalized as a lattice with exactly 2 defined boundary elements: the original network and the fully relaxed network.
- Experiments were run on 3 datasets: MNIST, Fashion MNIST, and randomly generated networks.
- Misclassification probability shows step-like, not gradual, growth as input radius increases.
- The paper is licensed under Creative Commons BY-NC-SA 4.0.
What's Next
The natural follow-up to this work is using these bounds to build smarter hybrid verification pipelines, ones that apply cheap relaxation-based checks first and invoke sound integer-constraint solvers only when a relaxed check flags a potential violation. Regulatory bodies drafting AI safety certification standards, particularly in aviation and automotive sectors where formal verification requirements are already taking shape, will need to grapple with these findings as they define what counts as an acceptable verification method. Watch for citations of this paper in safety-critical AI governance discussions through the remainder of 2026.
How This Compares
The most direct point of comparison is work by Hong-Ming Chiu and Richard Y. Zhang, whose November 2022 arXiv paper "Tight Certification of Adversarially Trained Neural Networks via Nonconvex Low-Rank Semidefinite Relaxations" (arXiv:2211.17244) named the same fundamental problem and called it the "convex relaxation barrier." Chiu and Zhang's response was to propose nonconvex semidefinite relaxations as a workaround. Papamichail and colleagues take a different and arguably more foundational approach: instead of immediately proposing a fix, they quantify the damage precisely. Knowing the exact shape of a problem is often more valuable than a partial fix that has not been characterized.
Going further back, the Neurify system developed at Columbia University by Shiqi Wang, Kexin Pei, Justin Whitehouse, Junfeng Yang, and Suman Jana was among the first scalable formal safety analysis tools for neural networks, and it already had to navigate this soundness-versus-speed tension. The field has been aware of the trade-off since at least that generation of work. What has been missing is the kind of closed-form, analytical characterization of worst-case error that this new paper provides.
For practitioners using any of the AI tools in the neural network verification space today, the takeaway is uncomfortable but important. The faster your verification tool runs, the more likely it is using some form of convex relaxation, and now there is a formal framework telling you that the error introduced by that relaxation compounds exponentially as your network gets deeper. That is not a reason to abandon these tools, but it is a reason to read their documentation very carefully and understand what guarantees they actually provide versus what they imply.
FAQ
Q: What is convex relaxation in neural network verification? A: Convex relaxation replaces the exact mathematical rules that govern a neural network's internal switches with simpler, smoother approximations. This makes the verification problem much faster to solve, but it means the system is technically checking a slightly different network than the one you built, which can produce incorrect safety guarantees.
Q: Why does verification error growing exponentially with depth matter? A: Modern neural networks used in serious applications are often dozens or hundreds of layers deep. If verification error doubles or triples with each added layer, a network that is 50 layers deep could have errors that are astronomically larger than a 10-layer network, making any safety certificate based on a relaxed method essentially meaningless for deep architectures.
Q: How do MNIST results relate to real-world AI safety? A: MNIST is a standard benchmark dataset used to test ideas before applying them to harder problems. The researchers used it alongside Fashion MNIST and random networks to show their mathematical bounds hold in practice, not just in theory. The implication is that the same error dynamics apply to more complex, real-world networks used in safety-critical systems.
The research from Papamichail, Varsos, Flouris, and Marques-Silva does not close the book on convex relaxation as a technique, but it does force the verification community to stop treating soundness loss as an acceptable footnote and start treating it as a measurable, quantifiable risk. As AI safety certification moves from voluntary best practice to regulatory requirement, papers like this one will define the technical vocabulary that regulators and engineers use to argue about what "verified" actually means.