Saturday, April 11, 2026 · 8 min read

I built a pure WGSL LLM engine to run Llama on my Snapdragon laptop GPU

AI Agents Daily
Curated by AI Agents Daily team · Source: HN LLM

According to the GitHub repository at github.com/Beledarian/wgpu-llm, a developer known as Beledarian published a working LLM inference engine written entirely in WebGPU Shading Language, targeting the Snapdragon X Elite and X Plus processors found in the latest wave of ARM-based Windows laptops. The project, which reached 58 commits as of April 11, 2026, takes a bottom-up approach to GPU-accelerated inference that sidesteps the usual CUDA and Metal dependencies that dominate the AI tooling world.

Why This Matters

The Snapdragon X Elite is a serious piece of hardware that the developer community has left underserved for too long. CPU-only inference on these chips maxes out around 20 tokens per second for a quantized Llama 7B model, a number that leaves real GPU compute sitting completely idle. Qualcomm shipped the Snapdragon X Elite in mid-2024 promising AI-capable hardware, and it took until early 2026 for a developer to build an engine that actually speaks the Adreno GPU's native language. That gap is embarrassing for the ecosystem, and this project is the first serious attempt to close it.


The Full Story

Qualcomm launched the Snapdragon X Elite and X Plus processors in mid-2024, positioning them as ARM-based competitors to Intel and AMD in the premium laptop segment. The chips include an integrated Adreno GPU built on a tile-based rendering architecture, which is meaningfully different from the discrete NVIDIA GPUs that most LLM frameworks were designed around. Early adopters quickly ran into a ceiling: the GPU was largely being ignored by the inference stack.

The developer community's initial response leaned heavily on llama.cpp, the open-source inference framework maintained by ggml-org, which now carries 103,000 GitHub stars and has become the default tool for running quantized models on consumer hardware. CPU-based benchmarks documented in llama.cpp GitHub discussions starting July 3, 2024, showed Llama 7B Q4 models running at roughly 20 tokens per second on Snapdragon X Elite hardware using tools like LM Studio. That is functional, but it is not using the GPU at all.

Qualcomm itself acknowledged the gap in February 2025 when developers Devang Aggarwal and Dileep Karpur published an official tutorial on the Qualcomm Developer Blog covering how to run DeepSeek models on Windows on Snapdragon. That guide covered both llama.cpp and MLC-LLM, but neither framework had built native, optimized compute kernel support for Adreno. The situation was workable but clearly suboptimal.

Beledarian's answer was to stop waiting and write the engine in WGSL from scratch. WGSL is the shading language that powers WebGPU, the cross-platform GPU compute API standardized by the W3C's GPU for the Web working group. The implementation lives inside a Rust project built on wgpu, the Rust implementation of the WebGPU standard that also powers WebGPU support in Firefox. That stack gives the developer direct access to compute shaders on the Adreno GPU, with none of the abstraction overhead that platform-agnostic frameworks carry. The repository is organized into two crates, wgpu-llm-core and wgpu-llm-cli, and a CI workflow publishes both to crates.io automatically on versioned tags.

The technical documentation included in the repository covers sampling, INT8 GEMM (general matrix multiply), KV cache spilling, and dry-run capabilities, which are not beginner features. This is not a weekend toy. INT8 GEMM in particular is a meaningful optimization, since it reduces memory bandwidth pressure during the matrix multiplications that dominate transformer inference. Getting that right on Adreno, whose memory hierarchy differs sharply from a discrete GPU's, requires real understanding of the hardware. AI tooling that targets ARM-based hardware is a small but growing category worth watching.
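
To see why INT8 GEMM helps, consider the core trick: weights are quantized to signed 8-bit integers with a per-row scale, dot products accumulate in 32-bit integers, and the result is scaled back to floating point, so the kernel reads a quarter of the bytes an f32 kernel would. Here is a minimal CPU-side sketch in Rust (illustrative only; the project's actual kernels run as WGSL compute shaders on the GPU):

```rust
/// Quantize a row of f32 weights to i8 with a symmetric per-row scale.
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = row.iter().map(|&x| (x / scale).round() as i8).collect();
    (q, scale)
}

/// INT8 matrix-vector product: accumulate in i32, dequantize at the end.
fn int8_matvec(rows: &[(Vec<i8>, f32)], x_q: &[i8], x_scale: f32) -> Vec<f32> {
    rows.iter()
        .map(|(w, w_scale)| {
            let acc: i32 = w.iter().zip(x_q).map(|(&a, &b)| a as i32 * b as i32).sum();
            acc as f32 * w_scale * x_scale
        })
        .collect()
}

fn main() {
    let w = vec![quantize_row(&[1.0, -2.0, 0.5]), quantize_row(&[0.0, 3.0, 1.0])];
    let (x_q, x_scale) = quantize_row(&[2.0, 1.0, -1.0]);
    let y = int8_matvec(&w, &x_q, x_scale);
    // The exact f32 result would be [-0.5, 2.0]; quantization adds small error.
    println!("{:?}", y);
}
```

The same structure carries over to a GPU kernel, where the i32 accumulation loop becomes the per-invocation work inside a compute shader.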

The project surfaced on Hacker News under submission ID 47729917 with modest initial engagement. That is not surprising. The Snapdragon X developer community is still small compared to the NVIDIA CUDA ecosystem, and pure WGSL inference is a niche inside a niche. But the work itself is substantive regardless of the star count.

Key Details

  • The repository at github.com/Beledarian/wgpu-llm reached 58 commits as of April 11, 2026.
  • Qualcomm released the Snapdragon X Elite and X Plus processors in mid-2024 for the ARM laptop market.
  • CPU-based Llama 7B Q4 inference on Snapdragon X Elite achieved approximately 20 tokens per second using LM Studio.
  • llama.cpp, the dominant alternative, holds 103,000 GitHub stars as of the latest count.
  • Qualcomm published its official Snapdragon LLM tutorial in February 2025, authored by Devang Aggarwal and Dileep Karpur.
  • The project ships as two Rust crates, wgpu-llm-core and wgpu-llm-cli, both publishable to crates.io.
  • The engine supports INT8 GEMM, KV cache spilling, and configurable sampling parameters.
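
The KV cache spilling listed above generally means evicting the oldest attention key/value blocks from the fast GPU memory budget into host RAM once a limit is hit, rather than failing on long contexts. A hypothetical sketch of the bookkeeping in Rust (this is an assumed design for illustration, not the project's implementation):

```rust
use std::collections::VecDeque;

/// A toy KV cache that keeps at most `gpu_budget` blocks "on the GPU"
/// and spills the oldest blocks to a host-side buffer when full.
struct SpillingKvCache {
    gpu_budget: usize,
    gpu_blocks: VecDeque<Vec<f32>>, // resident blocks, newest at the back
    host_blocks: Vec<Vec<f32>>,     // spilled blocks, oldest first
}

impl SpillingKvCache {
    fn new(gpu_budget: usize) -> Self {
        Self { gpu_budget, gpu_blocks: VecDeque::new(), host_blocks: Vec::new() }
    }

    /// Append one block of keys/values for a new token position,
    /// evicting the oldest resident block if over budget.
    fn push(&mut self, block: Vec<f32>) {
        self.gpu_blocks.push_back(block);
        while self.gpu_blocks.len() > self.gpu_budget {
            let evicted = self.gpu_blocks.pop_front().unwrap();
            self.host_blocks.push(evicted);
        }
    }

    fn resident(&self) -> usize { self.gpu_blocks.len() }
    fn spilled(&self) -> usize { self.host_blocks.len() }
}

fn main() {
    let mut cache = SpillingKvCache::new(4);
    for t in 0..10 {
        cache.push(vec![t as f32; 8]);
    }
    // After 10 pushes with a budget of 4: 4 resident, 6 spilled.
    println!("resident={} spilled={}", cache.resident(), cache.spilled());
}
```

On an integrated GPU with unified memory the "spill" is cheaper than a PCIe copy on a discrete card, which is part of what makes the technique attractive on Snapdragon hardware.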

What's Next

GPU-accelerated inference on Adreno could realistically push Llama 7B throughput from 20 tokens per second to somewhere in the 60 to 100 tokens per second range, based on comparable GPU acceleration gains documented on other integrated GPU architectures. Watch for benchmark reports from early adopters running the tool on Snapdragon X Elite devices over the coming weeks, since that data will determine whether the WGSL kernel implementations are actually competitive. If performance numbers look good, this could pressure llama.cpp maintainers to prioritize proper Adreno GPU support rather than leaving it to side projects.

How This Compares

The closest direct comparison is llama.cpp's experimental GPU backend work. Llama.cpp supports CUDA, Metal, and Vulkan backends, but Vulkan coverage for Adreno has lagged behind Apple Silicon and NVIDIA in terms of optimization depth. Apple's approach with Metal on M-series chips is the gold standard for integrated GPU inference, and Apple achieved it through years of co-designing the framework around their own hardware. Beledarian is attempting to replicate that kind of tight hardware-software integration for Adreno by going even lower in the stack, writing directly in the GPU's native shading language.

MLC-LLM, the machine learning compilation framework, represents a different philosophy. It uses compiler technology to auto-generate efficient kernels for different hardware targets rather than hand-writing them. That approach scales better across many architectures, but it can leave performance on the table for specific hardware that the compiler has not been tuned against. A hand-written WGSL engine for Adreno might outperform MLC-LLM on Snapdragon specifically, even if MLC-LLM wins on breadth.

The broader context here is important. Microsoft's Copilot Plus PC initiative, which launched with Snapdragon X hardware in mid-2024, has pushed millions of Adreno-equipped laptops into developer hands. That install base is growing, and the software ecosystem is still catching up. Projects like wgpu-llm are exactly the kind of grassroots infrastructure work that typically precedes larger framework investment, and the on-device inference space is moving faster than most people realize.

FAQ

Q: What is WGSL and why does it matter for running AI models? A: WGSL stands for WebGPU Shading Language, which is the programming language used to write compute programs that run directly on a GPU through the WebGPU API. For AI inference, writing kernels in WGSL means you can target a wide range of GPUs, including Qualcomm's Adreno, without depending on CUDA or other vendor-specific libraries. It gives developers direct control over how the GPU processes the math inside a language model.
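
For a sense of what WGSL looks like, here is a trivial compute kernel of the kind wgpu consumes as plain source text, embedded as a Rust string (an illustrative kernel, not code from the wgpu-llm repository):

```rust
// A minimal WGSL compute kernel that doubles every element of a buffer.
// In a wgpu program, this string is passed to the device when the
// shader module and compute pipeline are created.
const DOUBLE_KERNEL: &str = r#"
@group(0) @binding(0) var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) id: vec3<u32>) {
    if (id.x < arrayLength(&data)) {
        data[id.x] = data[id.x] * 2.0;
    }
}
"#;

fn main() {
    // Here we only print the source; dispatching it requires a wgpu device.
    println!("{}", DOUBLE_KERNEL.trim());
}
```

Real inference kernels such as an INT8 GEMM follow the same shape: a storage buffer binding per tensor and a `@compute` entry point that each GPU invocation runs over its slice of the data.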

Q: Can I run this on my Snapdragon X laptop right now? A: The project is available publicly at github.com/Beledarian/wgpu-llm and is written in Rust, so you can clone and build it if you have a Rust toolchain installed. It targets Snapdragon X Elite and X Plus hardware specifically. Performance benchmarks have not been widely published yet, so treat it as early-stage software that works but has not been validated at scale.

Q: How is this different from just using llama.cpp on a Snapdragon laptop? A: Llama.cpp can run on Snapdragon, but its GPU support for Qualcomm's Adreno architecture is limited, meaning most inference happens on the CPU. The wgpu-llm engine is written specifically to run compute workloads on the Adreno GPU using WGSL shaders, which theoretically allows it to tap into GPU memory bandwidth and parallelism that llama.cpp leaves unused on this particular hardware.

Beledarian's project is exactly the kind of low-level, hardware-first engineering that the ARM laptop ecosystem needs more of if local AI inference is going to become genuinely fast on non-Apple silicon. The Snapdragon X platform has the hardware. The software stack is finally starting to catch up.

