Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers
Tom Aarsen, writing for the Hugging Face Blog, published a comprehensive technical walkthrough on April 16, 2026, detailing how developers can train and finetune multimodal embedding and reranker models using the Sentence Transformers library. The post builds directly on Aarsen's earlier work introducing multimodal capabilities to the framework and moves from "here is what this can do" to "here is exactly how you build it yourself." For developers working on retrieval systems that need to handle both text and images, this is the missing manual they have been waiting for.
Why This Matters
Finetuning beats general-purpose models in specialized domains, and Aarsen's numbers prove it decisively. The finetuned model he built, tomaarsen/Qwen3-VL-Embedding-2B-vdr, scores an NDCG@10 of 0.947 compared to the base model's 0.888, a meaningful jump that translates directly to better search results in production systems. More importantly, that 2-billion-parameter finetuned model outperforms every competing Visual Document Retrieval model Aarsen tested, including models with up to 8 billion parameters. That performance-per-parameter ratio should make any engineering team rethink the instinct to always reach for the biggest available model.
The Full Story
Sentence Transformers is an open-source Python library built on top of Hugging Face's Transformers, and it has become one of the standard tools for building semantic search, retrieval-augmented generation pipelines, and similarity ranking systems. Until recently, it worked primarily with text. The multimodal extension, which Aarsen covered in a previous post, opened the library up to images, audio, and video as input types. This new post tackles the harder question: how do you actually teach one of these models to be good at your specific task?
The central example Aarsen walks through is Visual Document Retrieval, or VDR. The task is straightforward to describe but hard to solve well: given a text query, retrieve the most relevant document pages, where those pages are stored as images with all their charts, tables, and visual layout preserved rather than as extracted plain text. This matters enormously in enterprise settings where documents are scanned PDFs or complex spreadsheets that lose critical information when converted to raw text.
Aarsen finetuned Qwen/Qwen3-VL-Embedding-2B, a vision-language model from Alibaba's Qwen team, on a Visual Document Retrieval dataset. The result, published to the Hugging Face Hub as tomaarsen/Qwen3-VL-Embedding-2B-vdr, demonstrates what targeted training can accomplish. The base model already performs reasonably well at 0.888 NDCG@10, but the finetuned version pushes that to 0.947, outperforming all other VDR models Aarsen benchmarked, including models roughly four times larger by parameter count.
What makes the post especially useful is that Aarsen breaks down the training components in a way that generalizes beyond this single example. The framework requires three things: a model, a dataset in the right format, and a loss function matched to the task. For retrieval tasks, the loss function needs to capture relevance relationships between queries and candidate documents, which is different from what you would use for clustering or classification. Getting that choice right is where most practitioners stumble, and Aarsen explains the reasoning clearly enough that readers can apply it to their own domains.
The collaborative nature of the post is worth noting. While Aarsen is the featured author, the byline includes at least 16 contributors, including stefan-jo, kalyan-ks, ahujachirag, wissamantoun, radames, and marco, among others. That depth of contribution signals that this is not a quick tutorial thrown together by one person but a considered, community-reviewed piece of documentation that reflects real-world implementation experience across multiple teams.
Key Details
- Published April 16, 2026, by Tom Aarsen on the Hugging Face Blog, with more than 16 listed contributors.
- The finetuned model, tomaarsen/Qwen3-VL-Embedding-2B-vdr, achieves an NDCG@10 of 0.947 versus the base model's 0.888.
- The 2-billion-parameter finetuned model outperforms competing VDR models up to 4 times its size.
- The base model used for finetuning is Qwen/Qwen3-VL-Embedding-2B from Alibaba's Qwen team.
- Sentence Transformers supports multimodal inputs including text, images, audio, and video.
- The post builds on a March 2025 Aarsen post covering reranker model training for Sentence Transformers v4.
- The earlier multimodal introduction post received 43 upvotes on Hugging Face, while the reranker training post received 190 upvotes, indicating strong community interest.
What's Next
Developers who have been waiting for clear documentation to start building domain-specific multimodal retrieval systems now have a working blueprint. The next logical step for the community is extending these finetuning techniques to audio and video retrieval tasks, which Sentence Transformers supports in inference but where training guides are still sparse. Teams building enterprise document search products should treat this post as a starting gun, because the performance gap between general-purpose and domain-finetuned models is now too large to ignore.
How This Compares
Research presented at ICLR 2026, from teams at Harbin Institute of Technology and The Hong Kong Polytechnic University, examined whether contrastive learning or supervised fine-tuning works better for multimodal reranking. They found the answer depends heavily on architecture: BERT-style encoders favor contrastive loss, while large language models favor supervised fine-tuning on yes/no token prediction. That conclusion lines up with the flexible, loss-function-aware approach that Sentence Transformers promotes. Aarsen's framework does not impose one training strategy. It lets you match the approach to the model architecture, which is exactly what that ICLR research recommends.
Compare this to the release of the Lychee reranker model at vec-ai/lychee-rerank-mm on Hugging Face, which shows a deployable multimodal reranker built from similar foundations. Lychee demonstrates that the theoretical framework translates into production-ready models, but it does not give you the tools to build your own. Aarsen's post does both: it gives you the conceptual grounding and the actual code path. That combination is what separates a useful research artifact from something engineering teams can actually adopt.
The broader trend here is that the AI industry is moving away from the assumption that one giant general-purpose model handles everything adequately. Check any recent AI news and you will find story after story about domain-specific models punching above their weight class. Aarsen's benchmark result, where a 2-billion-parameter finetuned model beats 8-billion-parameter competitors, is one more data point confirming that targeted training on the right data beats raw scale. That is a message with real budget implications for any team running inference at scale.
FAQ
Q: What is Sentence Transformers and who uses it? A: Sentence Transformers is an open-source Python library that makes it easier to train and use embedding models for tasks like semantic search, document retrieval, and retrieval-augmented generation. It is widely used by developers and researchers who need to convert text, images, or other inputs into numerical representations that capture meaning and similarity relationships.
Q: What does NDCG@10 mean and why should I care? A: NDCG@10 stands for Normalized Discounted Cumulative Gain at 10 results. It measures how well a retrieval model ranks relevant documents within the top 10 results it returns, with higher scores meaning better ranking quality. A score of 0.947 versus 0.888 is a real gap in practice, especially when users rarely look past the first page of results.
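For readers who want the metric made concrete, here is a short self-contained computation of NDCG@k from a list of per-rank relevance labels (standard log2 discounting; the relevance lists are illustrative, not from the post's benchmark):

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG normalized by the ideal (perfectly sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance over 10 returned results: the single relevant page
# ranked first scores 1.0; pushed down to rank 3, the score halves.
print(ndcg_at_k([1, 0, 0, 0, 0, 0, 0, 0, 0, 0]))  # 1.0
print(ndcg_at_k([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]))  # 0.5
```

The logarithmic discount is why small NDCG gaps matter: moving the relevant document from rank 3 to rank 1 doubles its contribution.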
Q: Do I need a massive GPU cluster to finetune these multimodal models? A: Not necessarily. The model Aarsen finetuned has 2 billion parameters, which is large but manageable on modern single-GPU setups with techniques like gradient checkpointing and mixed-precision training. The Sentence Transformers guides and tutorials can help you understand the hardware requirements for your specific use case before committing resources.
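Those memory-saving techniques are exposed as training arguments. A config-only sketch with illustrative values (the output directory, batch sizes, and precision flag are assumptions, not settings from the post, and bf16 requires a GPU that supports it):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="qwen3-vl-vdr-finetune",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,       # effective batch size of 32
    gradient_checkpointing=True,         # trade recompute for memory
    bf16=True,                           # mixed precision on supported GPUs
)
```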
The publication of this training guide marks a meaningful step toward making multimodal retrieval a practical option for teams outside of large research labs. With a clear methodology, open-source tooling, and benchmark results that hold up against much larger models, the barrier to building specialized visual document search is now primarily about having the right training data, not the right budget.