Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4
A developer spent multiple nights getting Intel's Arc Pro B70 GPU to run a 27-billion parameter AI model locally, finally hitting 12 tokens per second. That result matters because it gives buyers real-world data on whether Intel can compete with NVIDIA in the growing local AI inference market.
A member of the LocalLLaMA subreddit community, posting under their Reddit handle after first sharing initial impressions on r/IntelArc, has published detailed performance results for the Intel Arc Pro B70 32GB running Alibaba's Qwen3.5-27B model at 4-bit quantization. According to the post on LocalLLaMA, the user fought through multiple nights of configuration headaches to get vLLM operational on Intel's Arc Pro hardware before eventually documenting a consistent throughput of approximately 12 tokens per second. The data covers two separate inference frameworks, giving the results more credibility than a single-tool test would.
Why This Matters
Real-world benchmark data from actual users running real models on Intel Arc hardware is genuinely scarce, and that scarcity has been a quiet advantage for NVIDIA. This test closes part of that information gap. At 12 tokens per second on a 27-billion parameter model, the Arc Pro B70 is a functional local inference card, not a science experiment. The 32GB VRAM spec puts it ahead of consumer NVIDIA cards like the RTX 4090 with 24GB, and for developers who need to run mid-size models offline, that memory headroom is the whole argument for choosing Intel.
The Full Story
The Intel Arc Pro B70 is Intel's professional-grade answer to the local inference market, built around 32 gigabytes of GDDR6 memory and Intel's Xe-HPG architecture, which includes tensor operation support designed for AI workloads. Intel positioned the B70 to go after enterprise and developer buyers who want memory capacity and cost efficiency rather than gaming performance. The card targets data center deployments, local development environments, and research labs where cloud dependency is either too expensive or not allowed.
The user's testing approach was straightforward in theory and painful in practice. They loaded Qwen3.5-27B at Q4 quantization, a method that compresses a model's numerical precision from 16-bit floating point down to 4-bit integers, shrinking memory requirements dramatically while keeping output quality at an acceptable level for most use cases. At 4-bit quantization, a 27-billion parameter model fits inside 32GB of VRAM with enough room left for system overhead, which is exactly the scenario Arc Pro B70 buyers hope for when they choose a card with that memory capacity.
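The memory arithmetic behind that fit is worth making explicit. A minimal back-of-the-envelope sketch (the 1.2x overhead factor for KV cache and runtime buffers is an assumption for illustration, not a measured figure):

```python
def vram_estimate_gb(params_billions: float, bits_per_weight: int,
                     overhead_factor: float = 1.2) -> float:
    """Rough VRAM need in GB: weights plus an assumed overhead multiplier.

    overhead_factor stands in for KV cache, activations, and runtime
    buffers; real usage varies with context length and inference engine.
    """
    weights_gb = params_billions * 1e9 * bits_per_weight / 8 / 1e9
    return weights_gb * overhead_factor

# 27B parameters at 4 bits: ~13.5 GB of weights, ~16 GB with overhead,
# comfortably inside the B70's 32 GB. At 16-bit precision the same model
# would need ~65 GB, far beyond any single consumer card.
print(f"Q4:   {vram_estimate_gb(27, 4):.1f} GB")
print(f"FP16: {vram_estimate_gb(27, 16):.1f} GB")
```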
Getting vLLM to work was the real obstacle. The vLLM project is an open-source inference engine built for speed, using techniques like continuous batching and paged memory management to push throughput beyond what naive implementations achieve. Intel GPU support in vLLM is functional but not as mature as CUDA support for NVIDIA cards, and the user's multi-night troubleshooting experience reflects that gap honestly. The framework eventually worked through what the poster described as extended trial and error with drivers and library configurations.
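Once an engine is running, measuring throughput the way the poster did is framework-agnostic: time a generation call and divide tokens by seconds. A sketch with a stub in place of a real engine (the `generate` callable is a stand-in for a llama.cpp binding or vLLM client, not either project's actual API):

```python
import time
from typing import Callable, List

def measure_tokens_per_second(generate: Callable[[str, int], List[str]],
                              prompt: str, max_tokens: int = 100) -> float:
    """Time one generation call and return tokens per second."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed

# Stub engine for illustration only: sleeps briefly and emits dummy tokens.
# A real engine call would replace this entirely.
def stub_engine(prompt: str, max_tokens: int) -> List[str]:
    time.sleep(0.01)
    return ["tok"] * max_tokens

rate = measure_tokens_per_second(stub_engine, "Hello", 100)
print(f"{rate:.0f} tok/s")  # meaningless for the stub; ~12 on the B70
```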
Once operational, both vLLM and llama.cpp, the widely used C++ inference library that powers a huge portion of local AI deployments, produced approximately 12 tokens per second on token generation. The fact that two completely different software stacks landed at the same number is meaningful: it strongly suggests the 12 tokens per second figure reflects a hardware ceiling rather than a software limitation, which makes the benchmark more trustworthy as a purchasing reference than a single-framework result would be.

At 12 tokens per second, a 100-token response takes roughly 8.3 seconds to generate. That is adequate for offline document processing, research pipelines, local development testing, and scenarios where the user is not expecting an instant conversational response. It is not fast enough for real-time chat applications or high-throughput production serving, and the user did not claim otherwise. The post's honesty about practical use cases is one of the things that makes it useful to the LocalLLaMA community.
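The latency arithmetic is simple but worth spelling out; a minimal sketch (the 25 tok/s comparison figure is the midpoint of the 20 to 30 tok/s RTX 4090 range cited later in the article):

```python
def generation_time_s(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds to generate num_tokens at a steady rate."""
    return num_tokens / tokens_per_second

# 100 tokens at the B70's measured ~12 tok/s:
print(f"{generation_time_s(100, 12):.1f} s")  # prints "8.3 s"
# The same response at an assumed ~25 tok/s on an RTX 4090:
print(f"{generation_time_s(100, 25):.1f} s")  # prints "4.0 s"
```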
Key Details
- The Intel Arc Pro B70 ships with 32GB of GDDR6 VRAM, built on Intel's Xe-HPG architecture.
- Testing used Qwen3.5-27B at Q4 quantization, a model developed by Alibaba.
- Both llama.cpp and llm-scaler-vllm produced approximately 12 tokens per second on token generation.
- A 100-token response at that rate requires approximately 8.3 seconds of compute time.
- The user spent multiple nights troubleshooting vLLM compatibility before achieving a working configuration.
- The original Reddit post appeared first on r/IntelArc before the LocalLLaMA follow-up with full performance data.
What's Next
Intel needs to close the software gap. The multi-night setup struggle documented here will discourage adoption among developers who do not want to fight their tools before getting to actual work. Intel's driver and inference framework support for Arc Pro needs to reach the plug-and-play reliability that NVIDIA's CUDA ecosystem has built over a decade, and that work is measurable in quarterly software releases. Watch for vLLM and llama.cpp compatibility improvements specifically, since those two tools cover the majority of local inference deployments.
How This Compares
NVIDIA's RTX 4090 with 24GB of VRAM remains the consumer benchmark for local inference, and it consistently posts 20 to 30 tokens per second on similarly quantized models at comparable parameter counts, depending on the framework. The Arc Pro B70's 12 tokens per second is slower, but the 32GB VRAM advantage is real and not trivial. Developers running Qwen3.5-27B at Q4 are using roughly 18 to 22GB, leaving headroom on the B70 that disappears entirely on a 24GB card when system memory pressure increases. For anyone who regularly pushes model sizes upward or runs multiple quantization levels in sequence, that buffer matters.
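The headroom argument reduces to a subtraction; a sketch using the article's 18 to 22 GB usage range (the 20 GB figure below is its midpoint, chosen here for illustration):

```python
def headroom_gb(card_vram_gb: float, model_usage_gb: float) -> float:
    """VRAM left over for longer contexts, batching, or a second model."""
    return card_vram_gb - model_usage_gb

# Qwen3.5-27B@Q4 at an assumed ~20 GB (midpoint of the 18-22 GB range):
usage = 20.0
print(f"Arc Pro B70 (32 GB): {headroom_gb(32, usage):.0f} GB free")
print(f"RTX 4090   (24 GB): {headroom_gb(24, usage):.0f} GB free")
```

Twelve gigabytes of slack versus four is the difference between comfortably raising context length or loading a draft model alongside, and hitting out-of-memory errors under pressure.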
AMD's competitive position here comes from the MI300 series on the high end and RDNA-based consumer cards on the low end, with the ROCm software stack sitting in between. AMD's ROCm support for frameworks like PyTorch and vLLM has improved noticeably through 2024 and into 2025, and AMD has been more transparent about inference benchmarks than Intel has. The Arc Pro B70 is playing catch-up to AMD on software maturity, not just NVIDIA, which is a tougher position than Intel's marketing suggests.
What separates this test from manufacturer benchmarks or YouTube reviews is that it came from a developer who actually wanted to use the card for work, struggled through realistic setup friction, and reported honest numbers without promotional framing. Phoronix reviewed the Arc Pro B70 on Linux and documented similar software configuration challenges, and Level1Techs covered the hardware launch with comparable observations about ecosystem readiness. The picture across independent sources is consistent: the hardware is capable, the software ecosystem needs more runway. For AI tools coverage and local inference guides, that distinction between hardware capability and deployment readiness is the story developers actually need to act on.
FAQ
Q: Can the Intel Arc Pro B70 run large AI models locally? A: Yes, the Arc Pro B70 can run large models like Qwen3.5-27B at 4-bit quantization within its 32GB of VRAM. Real-world testing shows approximately 12 tokens per second on token generation using either llama.cpp or vLLM. That is sufficient for offline document work and local development, though not ideal for fast conversational applications.
Q: How does Intel Arc compare to NVIDIA for local AI inference? A: NVIDIA still leads on raw token generation speed and software ecosystem maturity. An RTX 4090 typically generates 20 to 30 tokens per second on similar tasks, compared to the Arc Pro B70's 12. However, the Arc Pro B70's 32GB of VRAM gives it a memory capacity edge over the 4090's 24GB, which matters when running larger models.
Q: What is vLLM and why was it hard to set up on Intel? A: vLLM is an open-source inference engine that uses continuous batching and optimized memory management to serve large language models faster than standard implementations. Intel GPU support in vLLM is less mature than NVIDIA's CUDA support, meaning configuration requires more manual troubleshooting with drivers and libraries before the framework runs correctly.
The LocalLLaMA post represents exactly the kind of unglamorous, honest hardware testing that actually helps developers make purchasing decisions. As Intel continues to mature its Arc Pro software stack, the gap between hardware potential and deployment friction should narrow, and that narrowing will determine whether Arc Pro cards become a genuine alternative in local inference setups or remain a niche option for patient tinkerers.




