Multimodal Embedding and Reranker Models in Sentence Transformers: A Practical Walkthrough

Sentence Transformers has been my go-to library for embedding and reranker models for years. It’s simple, well-maintained, and just works. The v5.4 update adds something I’ve been waiting for: native support for multimodal inputs. You can now encode and compare text, images, audio, and video using the same API you already know. No more hacking together separate pipelines for different modalities.

What’s the big deal?

Traditional embedding models are text-only. You feed in sentences, you get vectors. Multimodal embedding models map inputs from different modalities—text, images, audio, video—into a shared embedding space. That means you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that work across modalities. The same goes for reranker models: multimodal rerankers can score relevance between mixed-modality pairs, like an image and a text description.

This opens up use cases like visual document retrieval, cross-modal search, and multimodal RAG. If you’ve ever tried to build a system that searches images by natural language descriptions, you know how painful it used to be. Sentence Transformers just made it a lot easier.

Installation

You’ll need some extra dependencies depending on which modalities you want to use. The installation is straightforward:

pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers"
pip install -U "sentence-transformers[image,video,train]"

A word of caution: VLM-based models like Qwen3-VL-2B need a GPU with at least ~8 GB of VRAM. For the 8B variants, expect ~20 GB. If you don’t have a local GPU, consider using a cloud GPU service or Google Colab. On CPU, these models will be painfully slow. Stick to text-only or CLIP models for CPU inference.
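As a rough back-of-the-envelope check on those numbers, here is a minimal sketch that estimates the memory taken by the weights alone. It assumes 16-bit weights (2 bytes per parameter) and nominal parameter counts of 2e9 and 8e9; both are my assumptions, and the helper name is made up for illustration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed for the model weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: weights are held in 16-bit precision (2 bytes per parameter).
for name, params in [("2B", 2e9), ("8B", 8e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB for weights")
```

Under these assumptions the weights come to roughly 4 GB and 16 GB. Activations, image tokens, and framework overhead come on top of that, which is consistent with the ~8 GB and ~20 GB figures above.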

Multimodal Embedding Models

Loading a Model

Loading a multimodal embedding model is identical to loading a text-only model. The library handles all the modality detection internally:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

Some models might require a revision argument if the integration pull request hasn’t been merged yet; once it has, you can load them as shown above. The model automatically detects which modalities it supports, so there’s nothing extra to configure. To control image resolution or model precision, see the library’s documentation on processor and model kwargs.

Encoding Images

model.encode() now accepts images alongside text. Images can be URLs, local file paths, or PIL Image objects:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)

The API is clean and intuitive. No separate encode_image() method, no boilerplate.

Cross-Modal Similarity

Here’s where it gets interesting. You can compute similarities between text embeddings and image embeddings directly, since the model maps both into the same space:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])

text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)

As expected, “A green car parked in front of a yellow building” is most similar to the car image (0.51), and “A bee on a pink flower” is most similar to the bee image (0.67). The hard negatives (“A red car driving on a highway”, “A wasp on a wooden table”) correctly receive lower scores.
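To make the similarity matrix less of a black box, here is a minimal pure-Python sketch of what such a text-by-image score matrix looks like. It assumes cosine similarity is the scoring function (the library's usual default, but a model can configure another), and the 3-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(rows, cols):
    """One row per text vector, one column per image vector."""
    return [[cosine(r, c) for c in cols] for r in rows]


# Toy vectors with made-up values, standing in for real embeddings.
text_vecs = [[0.9, 0.1, 0.2], [0.1, 0.8, 0.3]]
image_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.2]]

scores = similarity_matrix(text_vecs, image_vecs)

# For each text, the index of the best-matching image.
best = [max(range(len(row)), key=row.__getitem__) for row in scores]
print(best)  # [0, 1]
```

Reading the matrix row by row gives the best image for each text; reading it column by column gives the best text for each image.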

You might notice that even the best-matching scores (0.51, 0.67) aren’t very close to 1.0. This is due to the modality gap: embeddings from different modalities tend to cluster in separate regions of the space. Cross-modal similarities are therefore typically lower than within-modal ones (e.g., text-to-text), but the relative ordering is preserved, so retrieval still works well. I’ve seen the same behavior in other multimodal models; it’s an expected property of the embedding space, not a bug.
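A toy model can make the modality gap concrete. The sketch below is purely synthetic: I assume each embedding is a "content" vector plus one extra coordinate that is positive for text and negative for images. Real models are not built this way; the point is only that a constant offset between modalities lowers cross-modal scores without changing their order.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


GAP = 0.5  # made-up offset that separates the two modalities


def text_vec(content):
    """Toy text embedding: content plus a positive modality coordinate."""
    return content + [GAP]


def image_vec(content):
    """Toy image embedding: content plus a negative modality coordinate."""
    return content + [-GAP]


car, bee = [1.0, 0.0], [0.0, 1.0]

within = cosine(text_vec(car), text_vec(car))        # same modality, same content
cross_match = cosine(text_vec(car), image_vec(car))  # different modality, same content
cross_miss = cosine(text_vec(car), image_vec(bee))   # different modality, different content

print(round(within, 2), round(cross_match, 2), round(cross_miss, 2))
```

In this toy setup the matching cross-modal pair scores well below the within-modal pair, yet still clearly above the mismatched pair, which is all that ranking needs.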

Encoding Queries and Documents

For retrieval tasks, encode_query() and encode_document() are the recommended methods. Many retrieval models prepend different instruction prompts depending on whether the input is a query or a document. Using these methods ensures the correct prompt is applied automatically:

# Reuses the SentenceTransformer model loaded above; the image URL is a placeholder.
query_emb = model.encode_query("A green car")
doc_emb = model.encode_document("https://example.com/car.jpg")

This is a small detail, but a mismatched prompt can noticeably hurt retrieval quality.
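Conceptually, role-specific prompting for text inputs boils down to something like the sketch below. The prompt strings and the helper are hypothetical, written only to illustrate the idea; real models ship their own prompts in their configuration, and non-text inputs may be handled differently.

```python
# Hypothetical prompt strings for illustration; real models define their own.
PROMPTS = {
    "query": "Instruct: Retrieve images or documents relevant to the query.\nQuery: ",
    "document": "",
}


def apply_prompt(text: str, role: str) -> str:
    """Prepend the prompt registered for this role ('query' or 'document')."""
    return PROMPTS[role] + text


print(repr(apply_prompt("A green car", "query")))
print(repr(apply_prompt("A green car", "document")))
```

The same string ends up as two different model inputs depending on its role, which is why encoding a query with the document method (or the reverse) can shift its embedding.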

Multimodal Reranker Models

Reranker models (Cross Encoders) compute relevance scores between pairs of inputs. Multimodal rerankers extend this to mixed-modality pairs. The API is just as simple:

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")

scores = model.predict([
    ("A green car", "https://example.com/car.jpg"),
    ("A bee", "https://example.com/bee.jpg"),
])
print(scores)

You can also rank a list of mixed-modality documents against a query:

results = model.rank(
    "A green car",
    [
        "https://example.com/car.jpg",
        "https://example.com/bee.jpg",
        "A text document about cars",
    ]
)
print(results)

The rank method returns a list of dictionaries with corpus_id and score keys, sorted by relevance. This makes it easy to integrate into a retrieve-and-rerank pipeline.
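Here is a minimal sketch of how a retrieve-and-rerank pipeline fits together. The two scoring callables are stand-ins I made up: in practice the first would come from embedding similarity and the second from a reranker. The function name and the toy scores are assumptions; only the output shape (dictionaries with corpus_id and score) mirrors what is described above.

```python
from typing import Callable, List


def retrieve_and_rerank(
    query: str,
    corpus: List[str],
    retrieve_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    top_k: int = 2,
) -> List[dict]:
    """Two-stage search: cheap scoring over everything, precise scoring on the top_k."""
    # Stage 1: score the whole corpus cheaply and keep the top_k candidate indices.
    candidates = sorted(
        range(len(corpus)),
        key=lambda i: retrieve_score(query, corpus[i]),
        reverse=True,
    )[:top_k]
    # Stage 2: score only the candidates with the more expensive function.
    reranked = [
        {"corpus_id": i, "score": rerank_score(query, corpus[i])} for i in candidates
    ]
    return sorted(reranked, key=lambda r: r["score"], reverse=True)


corpus = ["car.jpg", "bee.jpg", "A text document about cars"]
# Made-up scores standing in for an embedding model and a reranker.
cheap = {"car.jpg": 0.4, "bee.jpg": 0.1, "A text document about cars": 0.5}
precise = {"car.jpg": 0.9, "bee.jpg": 0.05, "A text document about cars": 0.6}

results = retrieve_and_rerank(
    "A green car",
    corpus,
    retrieve_score=lambda q, d: cheap[d],
    rerank_score=lambda q, d: precise[d],
)
for r in results:
    print(r["corpus_id"], corpus[r["corpus_id"]], r["score"])
```

The corpus_id values index into the original corpus list, so mapping results back to documents is a single lookup. The expensive scorer only ever sees top_k items, which is the whole point of the two-stage design.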

Supported Input Types

Sentence Transformers supports a variety of input formats for each modality:

Text: strings
Images: URLs (strings starting with http:// or https://), local file paths, PIL.Image.Image objects, torch.Tensor or numpy.ndarray with shape (H, W, C) or (C, H, W)
Audio: URLs, local file paths, numpy.ndarray with shape (num_samples,) or (num_channels, num_samples)
Video: URLs, local file paths, numpy.ndarray with shape (num_frames, H, W, C) or (num_frames, C, H, W)
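As a rough illustration of how the string inputs in the list above could be told apart, here is a small hypothetical helper. It is not the library's actual detection logic, which may apply further checks; it only encodes the rules stated here (URLs start with http:// or https://, paths point at existing files, everything else is text).

```python
import os


def classify_string_input(value: str) -> str:
    """Guess whether a string is a URL, a local file path, or plain text."""
    if value.startswith(("http://", "https://")):
        return "url"
    if os.path.isfile(value):
        return "path"
    return "text"


print(classify_string_input("https://example.com/car.jpg"))
print(classify_string_input("A green car"))
```

One consequence of rules like these: a string that happens to name an existing file would be treated as a path rather than as text.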

The library handles loading and preprocessing automatically. You can check which modalities a model supports using:

model.supported_modalities  # Returns a set like {"text", "image"}
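Given a set of modalities like the one above, a small guard can fail early instead of partway through a long encoding run. This is a hypothetical helper of my own, operating on a plain Python set rather than on a model object.

```python
def check_modalities(supported: set, required: set) -> None:
    """Raise ValueError if any required modality is not in the supported set."""
    missing = required - supported
    if missing:
        raise ValueError(f"Model does not support: {sorted(missing)}")


supported = {"text", "image"}  # example set, as in the comment above

check_modalities(supported, {"text", "image"})  # passes silently

try:
    check_modalities(supported, {"audio"})
except ValueError as err:
    print(err)
```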

Supported Models

At launch, Sentence Transformers supports several multimodal models:

Qwen3-VL-Embedding-2B: A 2B parameter embedding model supporting text and images
Qwen3-VL-Reranker-2B: A 2B parameter reranker model supporting text and images
Qwen3-VL-Embedding-8B: An 8B parameter embedding model (needs ~20 GB VRAM)
Qwen3-VL-Reranker-8B: An 8B parameter reranker model
CLIP-based models: For lighter-weight multimodal embeddings

More models will likely be added as the community contributes integrations.

Final Thoughts

Sentence Transformers v5.4 is a solid update. The multimodal support feels natural—it doesn’t force you to learn new APIs or change your workflow. If you’re already using Sentence Transformers for text, adding images, audio, or video is just a matter of passing different inputs.

There are some limitations worth mentioning. The VLM-based models are resource-hungry, and the modality gap means cross-modal similarity scores are lower than within-modal ones. But for retrieval tasks, relative ordering is what matters, and these models deliver.

If you want to train your own multimodal models, the companion blog post on training and finetuning is worth a read. For now, I’m just happy to have a clean, unified API for multimodal embeddings in Python.