Research · Sunday, April 12, 2026 · 8 min read

Liquid AI Releases LFM2.5-VL-450M: a 450M-Parameter Vision-Language Model with Bounding Box Prediction, Multilingual Support, and Sub-250ms Edge Inference

AI Agents Daily
Curated by AI Agents Daily team · Source: MarkTechPost

According to MarkTechPost, Liquid AI has dropped a meaningful upgrade to its edge-focused vision-language lineup with the LFM2.5-VL-450M, an updated version of the earlier LFM2-VL-450M model that shipped several months prior. The new release packs bounding box prediction, multilingual vision understanding across eight languages, improved instruction following, and function calling support into the same 450-million-parameter footprint as its predecessor. The model is available now on Hugging Face, through Liquid AI's LEAP managed inference service, and on the company's Playground interface.

Why This Matters

A 450-million-parameter model that can predict bounding boxes, understand eight languages, and still hit sub-250-millisecond inference on embedded hardware is not a compromise product. It is a direct challenge to the prevailing assumption that real vision-language capability requires billions of parameters running on cloud GPUs. Most enterprises deploying AI at the edge today are stuck choosing between small-but-dumb models or powerful-but-expensive cloud APIs. LFM2.5-VL-450M narrows that gap more aggressively than anything else currently shipping at this parameter count. For robotics, industrial inspection, and autonomous systems teams, this release deserves serious attention.


The Full Story

Liquid AI built LFM2.5-VL-450M on a substantially expanded training foundation compared to the model it replaces. The original LFM2-VL-450M was pre-trained on 10 trillion tokens. The new version was trained on 28 trillion tokens, nearly three times as much data, which is notably large even by the standards of many open-source models that typically train on 2 to 7 trillion tokens. After that pre-training phase, the team applied post-training optimization techniques including preference optimization and reinforcement learning, both tuned specifically for multimodal behavior in production settings rather than just benchmark performance.

The most striking new capability is bounding box prediction. The previous model had zero ability in this area. LFM2.5-VL-450M scores 81.28 on the relevant grounding evaluation metric, going from a blank slate to a genuinely functional object localization system. That means developers can now use this model for object detection tasks where the model must not only identify what is in an image but also specify exactly where the object sits within the frame. That capability matters enormously for robotics, augmented reality, and industrial computer vision pipelines that need spatial context, not just labels.
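The article does not specify the exact format of LFM2.5-VL-450M's bounding box output. Assuming the common convention of normalized (x_min, y_min, x_max, y_max) coordinates in [0, 1], mapping a model-reported box back to pixel space is a few lines:

```python
def to_pixel_box(norm_box, width, height):
    """Convert a normalized (x_min, y_min, x_max, y_max) box in [0, 1]
    to integer pixel coordinates for a width x height image."""
    x_min, y_min, x_max, y_max = norm_box
    return (
        round(x_min * width),
        round(y_min * height),
        round(x_max * width),
        round(y_max * height),
    )

# Example: a box covering the right half of a 640x480 frame.
box = to_pixel_box((0.5, 0.0, 1.0, 1.0), 640, 480)
print(box)  # (320, 0, 640, 480)
```

If the model instead emits pixel coordinates or a corner-plus-size format, only this conversion step changes; the downstream detection pipeline stays the same.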

The multilingual expansion is equally practical. Liquid AI built vision-language understanding into the model for Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish. This is not simple text translation layered on top of an English-only vision model. The multilingual capability runs through the full vision-language pipeline, meaning the model can answer questions about an image in German or describe a scene in Japanese without routing through an English intermediary. That changes the deployment math for companies building products in non-English markets.

Liquid AI also added function calling support, which lets the model handle text-only inputs in structured agent workflows even when no image is present. That might sound minor, but it matters for developers building AI tools where the vision component is optional depending on the user's request. The model sits on top of an updated backbone called LFM2.5-350M, which represents architectural improvements over the original LFM2 design the company introduced with the first generation.
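The article does not document LFM2.5-VL-450M's function-calling schema, so the sketch below assumes a generic JSON tool call (name plus arguments); the dispatch pattern on the application side is the same regardless of the exact wire format, and `get_temperature` is a hypothetical tool for illustration:

```python
import json

# Hypothetical tool registry. In a real agent workflow, each entry
# would wrap an actual API or device call.
TOOLS = {
    "get_temperature": lambda location: {"location": location, "celsius": 21.5},
}

def dispatch_tool_call(raw_call: str):
    """Parse a JSON tool call emitted by the model and invoke the
    matching registered tool with the model-supplied arguments."""
    call = json.loads(raw_call)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

result = dispatch_tool_call(
    '{"name": "get_temperature", "arguments": {"location": "Berlin"}}'
)
print(result)  # {'location': 'Berlin', 'celsius': 21.5}
```

The point of text-only function calling in a vision model is exactly this: the same loop can serve requests with or without an image attached.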

Deployment options are broad. The Hugging Face release lets teams run and fine-tune the model locally. The LEAP API gives access to managed inference for those who prefer hosted infrastructure. Liquid AI also ships a WebGPU demo that runs real-time video stream captioning directly in a browser with no external GPU required, which is a compelling showcase of just how lean this model actually is.
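One way to sanity-check the sub-250 ms claim on your own hardware is a simple per-frame timing harness. This is a generic sketch, not Liquid AI tooling; `infer` here is a trivial stand-in where you would wrap the real model forward pass:

```python
import time

def measure_latency(infer, frames, budget_ms=250.0):
    """Run `infer` on each frame and report worst-case per-frame
    latency against an edge budget (250 ms, matching the figure
    Liquid AI advertises)."""
    timings = []
    for frame in frames:
        start = time.perf_counter()
        infer(frame)
        timings.append((time.perf_counter() - start) * 1000.0)
    worst = max(timings)
    return {"worst_ms": worst, "within_budget": worst <= budget_ms}

# Stand-in workload; on a Jetson Orin you would pass the model call.
report = measure_latency(lambda frame: sum(frame), [[1, 2, 3]] * 10)
print(report["within_budget"])  # True for this trivial stand-in
```

Worst-case latency, not the average, is what matters for real-time pipelines, which is why the harness reports the maximum.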

Key Details

  • Released April 8, 2026, on Hugging Face, LEAP, and Liquid AI Playground
  • 450 million parameters total, built on the LFM2.5-350M backbone architecture
  • Pre-trained on 28 trillion tokens, up from 10 trillion in LFM2-VL-450M
  • Bounding box prediction score of 81.28, compared to zero in the previous version
  • Multilingual support covers 8 languages: Arabic, Chinese, French, German, Japanese, Korean, Portuguese, and Spanish
  • Inference time stays under 250 milliseconds on edge hardware including NVIDIA Jetson Orin
  • Post-training optimization includes preference optimization and reinforcement learning for multimodal production use

What's Next

The immediate test for LFM2.5-VL-450M is independent validation. Liquid AI's internal benchmarks are promising, but the robotics and industrial CV communities will run their own evaluations over the next several weeks, and those real-world results will determine whether the bounding box and multilingual claims hold up under production conditions. If they do, expect Liquid AI to push the model into more embedded hardware partnerships, particularly with semiconductor companies building inference accelerators for IoT and automotive applications. Fine-tuning documentation is already live, so domain-specific variants targeting medical imaging, retail analytics, or security monitoring could appear within months.

How This Compares

The obvious comparison point is with Google's Gemma 4 family and Meta's smaller Llama vision variants, both of which also target efficient deployment. But those models are still considerably larger than 450 million parameters when vision capabilities are included, and neither has demonstrated sub-250-millisecond inference on an embedded module like the Jetson Orin with bounding box output. Gemma 4's smallest multimodal configuration requires substantially more compute than what Liquid AI is advertising here.

OpenAI's GPT-4V and Anthropic's Claude Vision models are in a different conversation entirely. They are cloud-only, multi-billion-parameter systems that charge per API call. They are not competing for the same deployment scenarios. The real competitive pressure on Liquid AI comes from companies like Qualcomm, which is pushing AI inference onto Snapdragon hardware at the silicon level, and from open-source efforts like MobileVLM and moondream, which have also targeted sub-one-billion-parameter vision-language performance. Moondream 2 specifically has drawn attention for edge inference, but it has not shipped native bounding box prediction at this accuracy level within a 450-million-parameter envelope.

What makes Liquid AI's position interesting is that it competes on model design rather than on chip design. While Qualcomm and MediaTek optimize the hardware side, Liquid AI is making the model itself efficient enough to run on whatever hardware already exists in the field. That is a more portable and arguably more scalable strategy for reaching embedded deployments at scale.

FAQ

Q: What is a vision-language model and what can it do? A: A vision-language model is an AI system that understands both images and text together. You can show it a photo and ask a question about it in plain language, and it will respond with a relevant answer. More advanced versions like LFM2.5-VL-450M can also locate specific objects within the image and describe them spatially, which is useful for robotics and automated inspection systems.

Q: What does bounding box prediction mean in practical terms? A: Bounding box prediction means the model can draw a rectangle around a specific object in an image, not just name what it sees. If you ask the model to find a stop sign in a street photo, it returns the coordinates of a box around that sign. This is critical for any application where a system needs to know exactly where an object is, not just whether it exists.

Q: Can I run LFM2.5-VL-450M on my own hardware today? A: Yes. The model is available on Hugging Face right now, and Liquid AI provides documentation for local execution and fine-tuning. If you have an NVIDIA Jetson Orin or similar embedded AI module, the model is designed to run directly on that hardware with inference times under 250 milliseconds. There is also a browser-based WebGPU demo that requires no external GPU at all.

The LFM2.5-VL-450M release is a concrete data point in an ongoing argument about where AI inference should actually happen, and Liquid AI's answer is clearly "as close to the sensor as possible." Whether the broader industry follows this design philosophy or continues betting on ever-larger cloud models will shape enterprise AI architecture decisions throughout 2026.

Our Take

A 450-million-parameter model that combines bounding box grounding, eight-language vision understanding, and function calling in a sub-250-millisecond edge envelope weakens the case that agentic vision systems must run in the cloud. If independent evaluations bear out Liquid AI's numbers, small on-device vision-language models become a credible default starting point for robotics and industrial deployments rather than a fallback.

