Open Source · Friday, April 10, 2026 · 8 min read

[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090

Curated by AI Agents Daily team · Source: Reddit ML

Dmitry Trifonov, writing for Medium, surfaced a critical kernel selection bug in NVIDIA's cuBLAS library that causes batched single-precision floating point matrix multiplication to run at roughly 40% GPU utilization across a wide range of matrix sizes. The finding, which also circulated widely on Reddit's r/MachineLearning community and drew significant attention on Hacker News, points to a systemic failure in cuBLAS's internal routing logic that affects the RTX 5090 and is likely present across all consumer RTX GPUs not in the professional product line.

Why This Matters

A 60% compute loss on matrix multiplication is not a minor regression. Matrix multiplication is the single most dominant operation in every modern neural network, from transformers to diffusion models, so this bug effectively degrades every training run and every inference call on affected hardware. NVIDIA's RTX 5090 retails at the top of the consumer GPU market, and buyers purchasing that card for machine learning work are operating a machine that its own vendor's library cannot fully exploit. This is not a niche edge case affecting a peculiar workload at an unusual size; Trifonov tested matrix dimensions from 256 by 256 all the way up to 8192 by 8192, at a batch size of 8, which covers the overwhelming majority of real-world deep learning batch operations.
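To put the headline numbers in concrete terms, here is a back-of-envelope sketch of the arithmetic. The peak-throughput constant is an assumed round figure for illustration, not a spec from the article; only the 40% utilization and the matrix sizes come from Trifonov's findings.

```python
# Back-of-envelope arithmetic for the headline numbers. PEAK_TFLOPS is an
# assumed round figure for illustration, not a spec from the article.

def batched_matmul_flops(m: int, n: int, k: int, batch: int) -> int:
    """FLOPs for a batched GEMM: each of the m*n output elements costs
    k multiplies and k adds, so 2*m*n*k per matrix, times the batch."""
    return 2 * m * n * k * batch

PEAK_TFLOPS = 100.0   # assumed FP32 peak, illustration only
UTILIZATION = 0.40    # the utilization Trifonov measured

# Largest configuration in the tested range: 8192 x 8192, batch of 8.
flops = batched_matmul_flops(8192, 8192, 8192, batch=8)
ideal_s = flops / (PEAK_TFLOPS * 1e12)
actual_s = flops / (PEAK_TFLOPS * UTILIZATION * 1e12)

print(f"{flops / 1e12:.1f} TFLOP per call")
print(f"ideal {ideal_s * 1e3:.0f} ms vs. {actual_s * 1e3:.0f} ms at 40% utilization")
```

At 40% utilization, every such call takes 2.5 times longer than the hardware would allow, regardless of what the true peak figure is.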

The Full Story

Trifonov's investigation centers on how cuBLAS decides which internal kernel to run when a program requests a batched FP32 matrix multiplication. In a correctly tuned library, the kernel selection process should analyze the matrix dimensions and batch size, then dispatch the code path that maximizes GPU throughput for that configuration. What Trifonov found is that cuBLAS consistently chooses a suboptimal path for these common workloads, and the RTX 5090 ends up idling on roughly 60% of its available compute resources as a result.
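The dispatch behavior described above can be sketched with a toy routine. Everything here is invented for illustration (the kernel names, the threshold, and both selectors); cuBLAS's real heuristics are proprietary and far more involved, but this shows the shape of the failure: a mis-tuned selector routing common batched sizes down a slow generic path.

```python
# Toy model of cuBLAS-style kernel dispatch. The kernel names, threshold,
# and both selectors are invented for illustration; the real heuristics
# are proprietary and far more involved.

def select_kernel(m: int, n: int, k: int, batch: int) -> str:
    """A sensible heuristic: large batched problems get the tiled fast path."""
    if batch > 1 and min(m, n, k) >= 256:
        return "sgemm_batched_tiled"      # hypothetical high-throughput kernel
    return "sgemm_batched_generic"        # hypothetical slow fallback

def buggy_select_kernel(m: int, n: int, k: int, batch: int) -> str:
    """The failure mode described above: common batched shapes are routed
    to the generic path regardless of size, leaving compute idle."""
    return "sgemm_batched_generic"

shape = (4096, 4096, 4096, 8)             # squarely inside the affected range
print("tuned:", select_kernel(*shape))
print("buggy:", buggy_select_kernel(*shape))
```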

The testing environment Trifonov used was specific and documented: CUDA 13.2.51, cuBLAS version 13.3.0, and driver version 595.58.03. He also noted that reverting to earlier versions of CUDA and cuBLAS produced even worse results, which means this problem has existed for multiple release cycles without NVIDIA's quality assurance catching it. The bug is not something introduced in a recent experimental build; it is baked into the current shipping version of cuBLAS.

For context, cuBLAS is NVIDIA's official implementation of the Basic Linear Algebra Subprograms specification, and it has been the foundational library for GPU-accelerated machine learning for well over a decade. Frameworks like PyTorch and TensorFlow route matrix operations through cuBLAS by default, which means this inefficiency propagates directly into the training and inference pipelines of a huge share of production machine learning systems. Most users would never know about it because GPU utilization figures are not always monitored at the kernel level during routine workloads.

This is not the first documented instance of cuBLAS kernel selection going wrong. NVIDIA's own developer forums contain a thread from July 2020 discussing performance degradation on very small matrix multiplications, which shows that the internal routing logic has had correctness issues across different problem sizes for years. The difference this time is that the affected size range covers a much larger portion of practical use cases, making the impact far broader.

What makes Trifonov's documentation particularly useful is that it is reproducible and specific. The size range he tested, from 256 by 256 through 8192 by 8192 with a batch size of 8, corresponds to operations that appear constantly in transformer attention layers, feed-forward blocks, and convolutional operations. This is not a synthetic benchmark designed to make a library look bad; it reflects workloads that real models execute thousands of times per training step.
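A minimal harness in the spirit of that size sweep might look like the following sketch. The `matmul` argument is a stand-in for whatever GPU call you want to measure (for example, a framework op backed by cuBLAS); keeping it a plain callable keeps the harness itself library-free.

```python
import time

# Harness sketch for reproducing the size sweep. `matmul` is a stand-in
# for the real GPU call (e.g. a framework op backed by cuBLAS).

SIZES = [256, 512, 1024, 2048, 4096, 8192]   # square sizes from the article
BATCH = 8

def achieved_tflops(seconds: float, n: int, batch: int = BATCH) -> float:
    """Convert a measured runtime into achieved TFLOP/s for an
    n x n x n batched FP32 GEMM."""
    return (2 * n ** 3 * batch) / seconds / 1e12

def bench(matmul, n: int, iters: int = 10) -> float:
    """Average seconds per call of matmul(n) over iters runs."""
    start = time.perf_counter()
    for _ in range(iters):
        matmul(n)
    return (time.perf_counter() - start) / iters
```

Dividing the achieved figure by the card's peak FP32 throughput gives the utilization fraction; the inefficient code path shows up as that fraction sitting near 0.4 across the whole size range.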

Key Details

  • Researcher Dmitry Trifonov published findings on Medium documenting the bug in detail.
  • Affected hardware confirmed as RTX 5090, with all consumer RTX non-professional GPUs suspected.
  • GPU compute utilization measured at approximately 40% for batched FP32 matrix multiplication.
  • Matrix size range affected spans 256 by 256 up to 8192 by 8192 with a batch size of 8.
  • Tested on CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03.
  • Prior cuBLAS versions exhibited worse performance than the current release.
  • NVIDIA's developer forums documented a related cuBLAS performance issue as far back as July 2020.

What's Next

NVIDIA has not publicly acknowledged the bug or committed to a fix as of the time of this writing, so the most important near-term signal to watch is whether a cuBLAS patch appears in a driver or CUDA toolkit update. Users running batched inference workloads on consumer RTX hardware should monitor the cuBLAS release notes and consider benchmarking their own pipelines against alternative matrix multiplication backends. If NVIDIA does not respond quickly, community pressure through NVIDIA's developer forums and GitHub issue trackers is likely to accelerate the timeline.
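For readers taking the benchmarking advice above, an A/B harness could look like this sketch. The two stand-in backends are placeholders; in practice they would be the cuBLAS-backed default versus an alternative such as a Triton or hand-written kernel.

```python
import time

# A/B sketch for the "benchmark against alternative backends" advice.
# The stand-in backends below are placeholders; in practice they would
# be the cuBLAS-backed default versus e.g. a Triton kernel.

def compare_backends(backends, workload, iters=5):
    """Return {name: average seconds per call} for each backend."""
    results = {}
    for name, fn in backends.items():
        start = time.perf_counter()
        for _ in range(iters):
            fn(workload)
        results[name] = (time.perf_counter() - start) / iters
    return results

timings = compare_backends(
    {"default_backend": lambda w: sum(w),       # placeholder
     "alternative_backend": lambda w: sum(w)},  # placeholder
    workload=list(range(1000)),
)
fastest = min(timings, key=timings.get)
print(fastest, timings)
```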

How This Compares

This situation sits uncomfortably close to the CUDA-L2 project, a reinforcement learning-based effort to automatically discover GPU kernels that outperform cuBLAS at matrix multiplication. CUDA-L2, which appeared on GitHub under the deepreinforce-ai organization roughly four months before Trifonov's findings gained wide attention, was already framing itself as a response to cuBLAS's limitations. The discovery of a 60% compute loss in the shipping library gives that project a much stronger case. If a vendor library leaves 60% of compute idle, the argument for third-party kernel search tools becomes essentially self-evident.

Compare this to the longer-running story of Triton, OpenAI's open-source GPU kernel language, which gained traction precisely because researchers found that hand-tuned or learned kernels could exceed cuBLAS performance on specific workloads. Triton is now integrated into PyTorch's compilation stack, and frameworks increasingly give developers the option to bypass cuBLAS entirely for certain operations. The cuBLAS bug Trifonov documented adds another data point to the argument that vendor-supplied libraries cannot be treated as a performance ceiling; they can, under the wrong conditions, be a performance floor that sits well below what the hardware can actually deliver.

The broader pattern here is worth naming directly. NVIDIA's competitive moat has always rested partly on the quality of its software stack, not just its silicon. A 60% compute loss in the flagship library, present across multiple driver versions and undetected through normal release testing, is a crack in that reputation. AMD and Intel, both of which are investing heavily in AI tooling and GPU software ecosystems, will not miss the opportunity to point out that open or alternative libraries can sometimes outperform NVIDIA's own code on NVIDIA's own hardware.

FAQ

Q: Does this bug affect PyTorch or TensorFlow training? A: Yes, it almost certainly does. Both PyTorch and TensorFlow route batched matrix multiplication through cuBLAS by default on NVIDIA GPUs, which means any training run using these frameworks on a consumer RTX GPU is likely experiencing the same 40% utilization problem that Trifonov measured directly. Switching to Triton-backed kernels or using torch.compile may help, but users should benchmark their specific workloads to confirm.

Q: How do I know if my GPU is affected by this cuBLAS bug? A: The bug affects all consumer RTX non-professional GPUs, according to Trifonov's testing on the RTX 5090. If you are running batched FP32 matrix multiplication on any consumer RTX card using cuBLAS 13.3.0 or earlier, you are likely affected. Tools like NVIDIA Nsight Systems can show kernel-level GPU utilization and help you verify whether your workloads are hitting the inefficient code path.

Q: Is there a fix available right now for this cuBLAS issue? A: As of the documentation reviewed for this article, NVIDIA has not released a patch. Trifonov's testing showed that the current release, cuBLAS 13.3.0 with CUDA 13.2.51, is the best available version, with earlier releases performing even worse. Users can watch the official CUDA toolkit and cuBLAS release notes for any announced fixes.

The cuBLAS bug Trifonov documented is an uncomfortable reminder that even foundational, battle-tested software libraries can silently underperform for years before someone runs the right benchmark. Developers building inference systems or training pipelines on consumer RTX hardware should run their own utilization audits now rather than waiting for an official response.

