Finetuning Multimodal Embedding Models with Sentence Transformers: A Practical Walkthrough

Tom Aarsen is back with a follow-up to his earlier post on multimodal embeddings in Sentence Transformers. If you missed that one, go read it first—this assumes you know the basics. What I want to talk about today is the part that actually matters: training and finetuning these models on your own data.

Why Bother Finetuning?

General-purpose multimodal embedding models like Qwen/Qwen3-VL-Embedding-2B are trained on everything under the sun—image-text pairs, visual QA, document understanding, you name it. That breadth is impressive, but it comes at a cost: the model is rarely the best at any single task.

Take Visual Document Retrieval (VDR). You give it a query like “What was the company’s Q3 revenue?” and it needs to find the right document screenshot from thousands of pages. That means understanding layouts, charts, tables, and dense text—not matching cat pictures to captions. These are fundamentally different skills.

Finetuning on domain-specific data lets the model learn those patterns. In Aarsen’s experiment, NDCG@10 rose from 0.888 to 0.947, an absolute gain of 0.059 (about 6.6% relative), and the finetuned model beat every other model he tested, including ones four times larger.
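Since NDCG@10 is the headline metric here, it helps to see how it is computed. The following is a minimal sketch, assuming binary relevance labels and assuming every relevant page appears somewhere in the ranked list; the function names and toy rankings are illustrative and not taken from Aarsen's code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked results."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical rankings for one query: 1 = relevant page, 0 = irrelevant page.
perfect = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # relevant page ranked first
second = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]   # relevant page ranked second

print(ndcg_at_k(perfect))           # 1.0
print(round(ndcg_at_k(second), 3))  # 0.631
```

Dropping the right page from first to second place already costs more than a third of the score, which is why a few hundredths of NDCG@10 averaged over many queries is a meaningful difference.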

The Training Pipeline: Same Parts, New Modalities

The good news is that training multimodal models in Sentence Transformers uses the same components as text-only training:

Model: The multimodal backbone you want to finetune.
Dataset: Your training and evaluation data.
Loss function: Drives the optimization.
Training arguments: Control hyperparameters, precision, and logging.
Evaluator: Checks how you’re doing during training.
Trainer: Ties it all together.

It’s the same SentenceTransformerTrainer you’d use for text. The main difference is that your datasets now include images (or audio, or video) alongside text, and the model’s processor handles preprocessing automatically.

Let’s walk through each piece, using VDR as the running example.

Picking a Model

You have two paths. First, finetune an existing multimodal embedding model—one that already has a modules.json file. You pass processor_kwargs and model_kwargs to control things like image resolution and precision:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

Second, start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers tries to detect the architecture and set up the right forward method and pooling automatically:

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

If auto-detection fails, you can manually edit the saved sentence_bert_config.json to fix modality settings, forward methods, or output handling. The Transformer module inspects the processor to figure out which modalities are available, and Pooling is added automatically if needed. You can always check:

print(model.modalities)
print(model.supports("image"))

There’s also a third option: composing separate encoders for different modalities using a Router module. This is more flexible if you want, say, a CLIP vision encoder paired with a BERT text encoder. But for most people, finetuning a single VLM backbone is simpler and works well.

Building the Dataset

For VDR, you need pairs of text queries and document page images. The dataset format is straightforward: each entry has a text query and a list of positive and negative document images. Sentence Transformers expects a dataset with query, positive, and optionally negative columns.

Here’s a minimal example using Hugging Face Datasets:

from datasets import Dataset

data = {
    "query": ["What was Q3 revenue?", "Who is the CEO?"],
    "positive": [["doc_page_1.png"], ["doc_page_5.png"]],
    "negative": [["doc_page_2.png", "doc_page_3.png"], ["doc_page_4.png"]]
}
dataset = Dataset.from_dict(data)

Images are loaded automatically if you provide paths or URLs. You can also use Image columns directly. The processor handles resizing, normalization, and padding.
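If your loss expects exactly one positive and one negative per row (an assumption worth checking against the documentation of the loss you pick), per-query lists like the ones above can be flattened first. Here is a minimal sketch; to_triplet_rows is a hypothetical helper, not part of Sentence Transformers.

```python
def to_triplet_rows(queries, positives, negatives):
    """Expand per-query lists into flat (query, positive, negative) rows."""
    rows = {"query": [], "positive": [], "negative": []}
    for query, pos_list, neg_list in zip(queries, positives, negatives):
        for pos in pos_list:
            for neg in neg_list:
                rows["query"].append(query)
                rows["positive"].append(pos)
                rows["negative"].append(neg)
    return rows

rows = to_triplet_rows(
    ["What was Q3 revenue?", "Who is the CEO?"],
    [["doc_page_1.png"], ["doc_page_5.png"]],
    [["doc_page_2.png", "doc_page_3.png"], ["doc_page_4.png"]],
)
print(len(rows["query"]))  # 3
```

The resulting dictionary has equal-length columns, so it can be passed to Dataset.from_dict in the same way as the example above.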

Choosing a Loss Function

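For embedding models, the go-to is CachedMultipleNegativesRankingLoss. It uses gradient caching: a large batch is split into smaller mini-batches that are embedded one at a time, so you can train with a large effective batch size without running out of memory, which matters when you’re dealing with large image encoders. The loss pulls each query toward its positive and pushes it away from its negatives, including the other documents in the same batch, which is why larger batches tend to help.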
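The ranking objective itself can be sketched without any library. The following is a simplified illustration, assuming cosine similarity and an illustrative scale of 20; it treats document i as the positive for query i and every other document in the batch as a negative. It is not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_batch_negatives_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_negatives_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_negatives_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is near zero when each query sits next to its own document and large when the pairing is swapped. The sketch leaves out the memory-saving machinery; the objective is the same kind that CachedMultipleNegativesRankingLoss optimizes.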

If you want a model that supports Matryoshka embeddings (where you can truncate the embedding dimension at inference time for speed), wrap it in MatryoshkaLoss:

from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss

base_loss = CachedMultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, [64, 128, 256, 512, 768])

This is useful if you want to trade off accuracy for speed at inference time without retraining.
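At inference time, using a Matryoshka dimension amounts to keeping the first values of each embedding and rescaling the result. A minimal sketch of that step, using a toy 4-dimensional vector and a hypothetical helper name:

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

full = [3.0, 4.0, 12.0, 0.0]  # toy embedding, not real model output
short = truncate_and_normalize(full, 2)
print(short)  # [0.6, 0.8]
```

In practice you would not do this by hand: SentenceTransformer accepts a truncate_dim argument that truncates embeddings for you. Renormalizing matters if you compare embeddings with a dot product rather than cosine similarity.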

Training Arguments

You configure training with SentenceTransformerTrainingArguments. Key parameters include learning rate, batch size, number of epochs, and evaluation strategy. For multimodal models, batch size is often constrained by GPU memory, because each image is converted into many visual tokens, usually far more than a short text query needs.

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="finetuned-vdr-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    bf16=True,  # matches the bfloat16 dtype the model was loaded with
    eval_strategy="steps",  # called evaluation_strategy in older transformers releases
    eval_steps=500,
    save_steps=1000,
    logging_steps=100,
)
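To sanity-check a configuration like this before launching a run, it can help to work out how many optimizer steps and evaluations it implies. The following is a rough sketch, assuming a hypothetical dataset of 10,000 training rows on a single GPU; the real trainer may differ slightly, for example when the last batch is dropped.

```python
import math

def training_schedule(num_examples, per_device_batch_size, num_epochs,
                      eval_steps, num_devices=1, gradient_accumulation_steps=1):
    """Rough step counts for a run; ignores dropped last batches and resumed runs."""
    effective_batch = per_device_batch_size * num_devices * gradient_accumulation_steps
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "num_evaluations": total_steps // eval_steps,
    }

# Hypothetical dataset size; the other values mirror the arguments above.
plan = training_schedule(num_examples=10_000, per_device_batch_size=4,
                         num_epochs=3, eval_steps=500)
print(plan["total_steps"])      # 7500
print(plan["num_evaluations"])  # 15
```

If the evaluation count looks too high for a slow evaluator, raising eval_steps is usually the cheapest fix.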

Evaluating During Training

You can add an evaluator to track performance. InformationRetrievalEvaluator is perfect for VDR—it computes NDCG, MRR, and recall at various k values:

from sentence_transformers.evaluation import InformationRetrievalEvaluator

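# eval_queries: {query_id: query}, eval_corpus: {doc_id: document},
# eval_relevant_docs: {query_id: set of relevant doc_ids}
evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,
    corpus=eval_corpus,
    relevant_docs=eval_relevant_docs,
    name="vdr-eval",
)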
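The evaluator takes its inputs as dictionaries keyed by IDs: queries maps query IDs to queries, corpus maps document IDs to documents, and relevant_docs maps each query ID to the set of relevant document IDs. Here is a minimal sketch of building the query side from (query, page ID) pairs; build_ir_inputs and the page IDs are hypothetical, and how page images are best supplied as corpus values is something to confirm for your library version.

```python
def build_ir_inputs(rows):
    """Turn (query_text, relevant_page_id) pairs into evaluator-style mappings."""
    queries, relevant_docs, query_ids = {}, {}, {}
    for query_text, page_id in rows:
        # Reuse the same ID when a query has several relevant pages.
        qid = query_ids.setdefault(query_text, f"q{len(query_ids)}")
        queries[qid] = query_text
        relevant_docs.setdefault(qid, set()).add(page_id)
    return queries, relevant_docs

rows = [
    ("What was Q3 revenue?", "page_1"),
    ("What was Q3 revenue?", "page_7"),
    ("Who is the CEO?", "page_5"),
]
eval_queries, eval_relevant_docs = build_ir_inputs(rows)
print(eval_queries)  # {'q0': 'What was Q3 revenue?', 'q1': 'Who is the CEO?'}
print(eval_relevant_docs["q0"] == {"page_1", "page_7"})  # True
```

The corpus dictionary would then map each page ID, including pages that are relevant to no query, to the corresponding document.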

Running the Trainer

Everything comes together in SentenceTransformerTrainer:

from sentence_transformers import SentenceTransformerTrainer

# train_dataset and eval_dataset are the training and evaluation splits of your data
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=loss,
    evaluator=evaluator,
)
trainer.train()

That’s it. The trainer handles batching, gradient accumulation, logging, and checkpointing.

Results: What You Get

Aarsen’s finetuned model (tomaarsen/Qwen3-VL-Embedding-2B-vdr) hit NDCG@10 of 0.947—beating the base model’s 0.888 and outperforming every existing VDR model he tested, including ones four times larger. That’s a real-world win: you don’t need a bigger model, you need one that’s trained on the right data.

He also tested Matryoshka dimensions. At the smallest dimension (64), NDCG@10 dropped to around 0.85—still competitive with many full-size models. At 768 dimensions, it matched the full model. So if you need speed, you can truncate aggressively and still get usable results.

Training Multimodal Reranker Models

Reranker models work differently: they take a query and a candidate document (each either text or an image) together and output a single relevance score. The training process is similar, but you use a CrossEncoder model instead of SentenceTransformer, train it with CrossEncoderTrainer, and pick a cross-encoder loss such as BinaryCrossEntropyLoss.

The dataset format changes too: each entry has a query, a document, and a label (1 for relevant, 0 for not). The trainer handles the rest.
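If you already have queries with positive and negative documents, converting them to labeled pairs is mechanical. A minimal sketch, with to_labeled_pairs as a hypothetical helper and the column names chosen for illustration:

```python
def to_labeled_pairs(query, positives, negatives):
    """Build one (query, document, label) row per candidate document."""
    rows = [{"query": query, "document": doc, "label": 1} for doc in positives]
    rows += [{"query": query, "document": doc, "label": 0} for doc in negatives]
    return rows

pairs = to_labeled_pairs(
    "What was Q3 revenue?",
    positives=["doc_page_1.png"],
    negatives=["doc_page_2.png", "doc_page_3.png"],
)
print([row["label"] for row in pairs])  # [1, 0, 0]
```

Keeping some negatives per query matters here: a reranker trained only on positives has nothing to learn to score low.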

Wrapping Up

Finetuning multimodal embedding models isn’t magic: it’s the same pipeline you already know, with images added to the inputs. The key takeaway is that domain-specific finetuning is worth the effort. A 2B model trained on your data can beat an 8B model trained on everything.

If you want to dive deeper, check out the Sentence Transformers documentation and the training examples. The code from this post is available on GitHub.

Now go train something that actually works on your data.