Building Multimodal Search Systems — Delivering Search That Understands Text, Images, and Beyond

An e-commerce AI agency in Seattle was hired by a home furnishings marketplace with 2.8 million products to build a search system that accepted any combination of text and images as input. Customers wanted to search by uploading a photo from Pinterest and asking for "something like this but in blue," or by describing a product they saw in a friend's house, or by combining a screenshot of a competitor's product with text specifying a different size. The existing keyword search could not handle any of these query types. The agency built a multimodal search system using CLIP-based embeddings that unified text and image representations in a shared vector space. Customers could search with text, images, or any combination. The system processed 840,000 queries per day with sub-second latency and increased product discovery metrics by 47% — customers were finding and purchasing products they never would have discovered through keyword search alone. Average order value increased by 12% because customers found higher-quality matches for their aesthetic preferences.

Multimodal search enables users to search across and between different data types — finding images with text descriptions, finding text with image queries, or combining multiple modalities in a single search. For AI agencies, multimodal search is an increasingly critical capability as businesses recognize that their content is inherently multimodal and their users want to search naturally across all of it.

Understanding Multimodal Search

Search Modality Combinations

Text-to-text: Traditional semantic search. Covered extensively in the vector search and semantic search guides.

Text-to-image: Find images matching a text description. "A modern minimalist living room with exposed brick" returns relevant product images.

Image-to-image: Find images similar to a query image. Upload a photo of a chair and find similar chairs in the catalog.

Image-to-text: Find text descriptions or documents matching a query image. Upload a photo of a plant disease and find relevant agricultural guides.

Multimodal-to-multimodal: Search with a combination of text and images, and return results that can be text, images, or both. This is the most flexible capability and the one clients ask for most often.

Audio-to-text/image: Search using audio queries (voice descriptions) or find content matching audio characteristics.

Video search: Find specific moments in videos matching text or image queries.

When Multimodal Search Adds Value

High value:

E-commerce product search (customers often cannot describe what they want in words but can show an image)
Design and creative tools (searching for visual inspiration across large asset libraries)
Real estate (searching for properties that match a visual style)
Fashion (outfit matching, style search)
Medical imaging (finding similar cases in clinical image databases)
Manufacturing (finding similar defects or components from visual reference)

Moderate value:

Document search with embedded images (finding documents based on their visual content)
Knowledge bases with mixed media (text, images, diagrams, charts)
Social media content search

Architecture

Unified Embedding Space

The foundation of multimodal search is a shared embedding space where different modalities are represented as vectors that can be directly compared.

How it works:

Text and images are encoded into vectors in the same high-dimensional space
Semantically similar content is close together, regardless of modality
A text description of a "red leather sofa" and an image of a red leather sofa should have similar embeddings
Search is performed by encoding the query (text, image, or both) and finding the nearest vectors in the index
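
As a minimal sketch of the lookup step described above, the snippet below ranks a toy index by cosine similarity. The three-dimensional vectors, the item ids, and the `nearest` helper are illustrative assumptions; real embeddings would come from a multimodal encoder and have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, index, k=2):
    """Return the ids of the k entries most similar to the query."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in index.items()]
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored[:k]]

# Toy vectors standing in for encoder output (assumption, not real embeddings).
index = {
    "img:red_leather_sofa": [0.9, 0.1, 0.0],
    "txt:red leather sofa": [0.8, 0.2, 0.1],
    "img:oak_dining_table": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of a sofa query
top = nearest(query, index, k=2)
```

Both sofa entries rank above the table regardless of whether they came from the image or the text encoder, which is the property the shared space is meant to provide. A production system would swap the linear scan for an approximate nearest neighbor index.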

Multimodal Embedding Models

CLIP (OpenAI): The foundational model for multimodal embeddings. Trained on 400 million image-text pairs to align image and text representations. Available in multiple sizes (ViT-B/32, ViT-L/14). The model code and weights are openly released, although the training dataset is not.

OpenCLIP (LAION): Open-source implementation of CLIP, with checkpoints trained on larger public datasets such as LAION-2B. Several variants are available, and some outperform the original CLIP.

SigLIP (Google): Sigmoid loss-based image-language pre-training. More efficient training than CLIP with competitive or better performance.

BLIP-2 (Salesforce): A vision-language model that can generate text descriptions of images as well as match images to text. It is heavier than CLIP-style encoders, so it is most useful when you need both search and description capabilities.

Cohere Embed v3: Commercial API that supports multimodal embeddings (text + images in the same space). Strong quality with minimal integration effort.

Jina CLIP: Open-source multimodal embedding model designed specifically for search applications. Strong retrieval performance.

Model Selection Criteria

Quality: Evaluate on your specific domain data. CLIP variants perform differently across domains — fashion, medical, industrial, general.
Dimensionality: CLIP ViT-B/32 produces 512-dimensional embeddings; ViT-L/14 produces 768-dimensional. Higher dimensions capture more information but increase storage and search costs.
Inference speed: For real-time search, budget roughly 50ms or less for embedding the query. Larger models are slower, so measure latency on your target hardware before committing to one.
Domain adaptation: Fine-tuning the embedding model on domain-specific data typically improves retrieval quality by 15-30%. Prioritize models that are easy to fine-tune.
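
To make the dimensionality trade-off concrete, here is a rough storage estimate for the 2.8 million product catalog from the opening example. It assumes one float32 vector per product and ignores index overhead (graph links, metadata), both of which are assumptions rather than figures from the project.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage in gigabytes (10^9 bytes), excluding index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

products = 2_800_000
size_b32 = index_size_gb(products, 512)  # ViT-B/32, 512 dimensions
size_l14 = index_size_gb(products, 768)  # ViT-L/14, 768 dimensions

print(f"ViT-B/32: {size_b32:.2f} GB, ViT-L/14: {size_l14:.2f} GB")
```

Under these assumptions the larger model needs about 8.6 GB of raw vectors against about 5.7 GB for the smaller one, a 1.5x increase that carries over to memory and search cost.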

Building the Multimodal Index

Image Processing Pipeline

Image preprocessing:

Resize images to the model's expected input resolution (224x224 for most CLIP variants; higher-resolution checkpoints such as ViT-L/14@336px expect 336x336, and some SigLIP checkpoints expect 384x384)
Apply center crop or padding to handle varying aspect ratios
Normalize pixel values according to the model's training statistics
Handle diverse input formats (JPEG, PNG, WebP, TIFF)
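
The resize, crop, and normalize steps above can be sketched without any imaging library by working on dimensions and single pixels. The helper names are hypothetical, and the mean and standard deviation are the values published with the original OpenAI CLIP preprocessing; other checkpoints may use different statistics, so check the model card.

```python
# Per-channel statistics from the original OpenAI CLIP preprocessing.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def resize_then_center_crop(width, height, target=224):
    """Scale the shorter side to `target`, then take a centered square crop.

    Returns the resized (width, height) and the crop box
    (left, top, right, bottom) in resized-image coordinates.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map an 8-bit RGB pixel to the normalized range the model was trained on."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))

resized, box = resize_then_center_crop(1600, 1200)
white = normalize_pixel((255, 255, 255))
```

Center cropping discards the left and right edges of a landscape photo; padding keeps the whole image at the cost of blank borders, which may matter for wide products such as sectional sofas.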

Image quality filtering:

Filter out corrupted images, blank images, and images below minimum resolution
Detect and filter duplicate or near-duplicate images
For product catalogs: filter out non-product images (lifestyle shots, logos, text-only images) or tag them appropriately

Image augmentation for indexing (optional):

Generate embeddings for multiple crops of each image (full image, center crop, quadrant crops) to capture different aspects
This increases recall for partial-match queries but multiplies index size: with the full image, a center crop, and four quadrant crops, each image produces six vectors instead of one
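
One way to generate the crops listed above is to compute their bounding boxes up front and embed each region separately. This is a sketch under the assumption that quadrants are simple halves of each axis; the function name and box layout are illustrative.

```python
def crop_boxes(width, height):
    """Return named (left, top, right, bottom) boxes for multi-crop indexing."""
    half_w, half_h = width // 2, height // 2
    side = min(width, height)  # side length of the centered square crop
    cx, cy = (width - side) // 2, (height - side) // 2
    return {
        "full": (0, 0, width, height),
        "center": (cx, cy, cx + side, cy + side),
        "top_left": (0, 0, half_w, half_h),
        "top_right": (half_w, 0, width, half_h),
        "bottom_left": (0, half_h, half_w, height),
        "bottom_right": (half_w, half_h, width, height),
    }

boxes = crop_boxes(800, 600)
```

Storing the crop name alongside each vector makes it possible to collapse several hits on the same image into a single result at query time.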

Text Processing Pipeline

Text preprocessing for multimodal indexing:

Clean and normalize text content
For product catalogs: combine product title, description, category, and key attributes into a single text for embedding
For documents: chunk text and embed each chunk alongside any associated images
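
A possible shape for the catalog step is shown below. The field names (`title`, `category`, `attributes`, `description`) and the character cap are assumptions for illustration. Title and attributes go first because CLIP-style text encoders truncate input at a short context window (77 tokens in the original CLIP), so text near the end may be dropped.

```python
def product_text(product, max_chars=300):
    """Flatten catalog fields into one string, most important fields first."""
    parts = [product.get("title", ""), product.get("category", "")]
    attrs = product.get("attributes", {})
    parts.extend(f"{key}: {value}" for key, value in sorted(attrs.items()))
    parts.append(product.get("description", ""))
    # Collapse internal whitespace and skip empty fields.
    cleaned = [" ".join(p.split()) for p in parts if p and p.strip()]
    # max_chars is a crude stand-in for the encoder's real token limit.
    return ". ".join(cleaned)[:max_chars]

example = {
    "title": "Harlow  Sofa",
    "category": "Living Room",
    "attributes": {"material": "leather", "color": "red"},
    "description": "Three-seat sofa\nwith oak legs.",
}
text = product_text(example)
```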

Text enrichment:

Generate text descriptions of images using a vision-language model (BLIP-2, LLaVA)
Index these generated descriptions alongside the images to improve text-to-image retrieval
This bridges the modality gap — even if the embedding model imperfectly aligns text and image, the generated text descriptions provide an additional retrieval path

Indexing Architecture

Dual-index approach:

Maintain a text embedding index and an image embedding index in the same vector database
All embeddings are in the same vector space (thanks to the multimodal model)
Search queries are embedded once and compared against both indices simultaneously
Results from both indices are merged and ranked by similarity

Unified index approach:

Store all embeddings (text and image) in a single index with a modality tag
Use the modality tag for filtering (search only images, search only text, search all)
Simpler architecture but less flexibility for modality-specific optimization
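
The unified approach can be sketched as a single list of entries carrying a modality tag. The `Entry` type and `search` function are hypothetical stand-ins for a vector database collection and its query call, and the two-dimensional unit vectors are toy data.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    item_id: str
    modality: str  # "text" or "image"
    vector: list

def dot(a, b):
    # Equals cosine similarity when both vectors are unit length.
    return sum(x * y for x, y in zip(a, b))

def search(entries, query, k=3, modality=None):
    """Rank entries by similarity, optionally restricted to one modality."""
    pool = [e for e in entries if modality is None or e.modality == modality]
    ranked = sorted(pool, key=lambda e: dot(query, e.vector), reverse=True)
    return [(e.item_id, e.modality) for e in ranked[:k]]

entries = [
    Entry("sofa-1", "image", [1.0, 0.0]),
    Entry("sofa-1", "text", [0.8, 0.6]),
    Entry("lamp-7", "image", [0.0, 1.0]),
]
query = [1.0, 0.0]
all_hits = search(entries, query, k=2)
image_hits = search(entries, query, k=2, modality="image")
```

Note that the unfiltered search returns the same product twice, once per modality. Grouping hits by `item_id` before display is usually necessary with either indexing approach.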

Metadata and Filtering

Essential metadata for multimodal search:

Modality (text, image, video, audio)
Source document or product ID
Content type (product photo, lifestyle image, diagram, screenshot)
Date created/modified
Category/tag
Quality score

Metadata filtering in search:

Allow users to filter by modality ("show me only images")
Allow users to filter by category, date, or other attributes
Apply filters before or during vector search rather than after it; filtering a fixed top-K list afterwards can leave fewer than K results
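
The difference between filtering before and after the vector search shows up even on a four-item catalog. The data and function names below are illustrative; real vector databases expose filtering through their own query options.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(items, query, k):
    return sorted(items, key=lambda i: dot(query, i["vec"]), reverse=True)[:k]

def post_filter(items, query, k, category):
    """Search first, then drop non-matching hits (may return fewer than k)."""
    return [i["id"] for i in top_k(items, query, k) if i["category"] == category]

def pre_filter(items, query, k, category):
    """Restrict the candidate set first, then search within it."""
    candidates = [i for i in items if i["category"] == category]
    return [i["id"] for i in top_k(candidates, query, k)]

catalog = [
    {"id": "a", "category": "sofa", "vec": [1.0, 0.0]},
    {"id": "b", "category": "sofa", "vec": [0.6, 0.8]},
    {"id": "c", "category": "chair", "vec": [0.96, 0.28]},
    {"id": "d", "category": "chair", "vec": [0.8, 0.6]},
]
query = [1.0, 0.0]
post = post_filter(catalog, query, 2, "sofa")
pre = pre_filter(catalog, query, 2, "sofa")
```

Here post-filtering returns a single sofa because a chair occupied the second slot, while pre-filtering returns both sofas.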

Query Processing

Multimodal Query Handling

Text-only queries:

Embed the text query using the text encoder
Search the unified index
Return results ranked by embedding similarity

Image-only queries:

Embed the image query using the image encoder
Search the unified index
Return results ranked by embedding similarity

Combined text + image queries:

This is the most powerful and most complex query type.

Approaches for combining modalities:

Weighted average: Embed the text and image separately, normalize both vectors to unit length, then compute a weighted average of the two embeddings and re-normalize the result. The weight sets the relative importance of each modality. Simple and often effective as a first baseline.
Late fusion: Retrieve results separately for the text query and the image query, then combine the result lists using reciprocal rank fusion or a learned combination model.
Conditional text modification: Use the text to modify the image embedding — "like this image but in blue" shifts the image embedding toward the "blue" direction in the embedding space. Models like Pic2Word or SEARLE specifically support this.
Vision-language model: Pass the image and text to a vision-language model that generates a unified query embedding capturing both modalities. This is the most sophisticated approach and handles complex combined queries best.
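
The first two approaches are simple enough to sketch directly. The code below assumes embeddings are plain lists of floats; `text_weight` and the item names are hypothetical, and `k=60` in the fusion function is a conventional default for reciprocal rank fusion rather than a value taken from this project.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def weighted_query(text_vec, image_vec, text_weight=0.5):
    """Weighted average of unit-length text and image embeddings."""
    t, i = normalize(text_vec), normalize(image_vec)
    mixed = [text_weight * a + (1 - text_weight) * b for a, b in zip(t, i)]
    return normalize(mixed)

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Late fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by score, breaking ties by id for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))

combined = weighted_query([1.0, 0.0], [0.0, 1.0], text_weight=0.5)

text_hits = ["blue_sofa", "blue_rug", "red_sofa"]     # from the text query
image_hits = ["red_sofa", "blue_sofa", "grey_sofa"]   # from the image query
fused = reciprocal_rank_fusion([text_hits, image_hits])
```

In the fused list the blue sofa ranks first because it sits near the top of both result lists. Rank fusion ignores raw similarity scores, which helps when text and image similarities are on different scales.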

Query Expansion and Refinement

Image-based query expansion:

When the user searches with an image, generate a text description of the image using a vision-language model
Use the generated text as an additional search signal
This helps when the image query is ambiguous (a photo of a room — is the user searching for the sofa, the lamp, or the rug?)

Interactive search refinement:

After showing initial results, allow the user to provide feedback ("more like this," "less like this")
Adjust the query embedding based on feedback signals
This iterative refinement can converge on the user's intent faster than asking the user to reformulate the query
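
One common way to adjust a query embedding from feedback is a Rocchio-style update, which moves the query toward liked items and away from disliked ones. The article does not name a specific method, so treat this as one possible choice; the `alpha`, `beta`, and `gamma` weights are illustrative defaults.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def centroid(vectors, dims):
    """Mean of a list of vectors, or a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dims
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refine_query(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.25):
    """Shift the query toward liked results and away from disliked ones."""
    pos = centroid(liked, len(query))
    neg = centroid(disliked, len(query))
    moved = [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]
    return normalize(moved)

original = [1.0, 0.0]
refined = refine_query(original, liked=[[0.0, 1.0]], disliked=[])
```

With a single "more like this" click, the query rotates from `[1.0, 0.0]` to `[0.8, 0.6]`, partway toward the liked item while keeping most of the original intent.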

Fine-Tuning for Domain Performance

Contrastive Fine-Tuning

Fine-tune the multimodal embedding model on domain-specific image-text pairs to improve retrieval quality.

Training data:

Image-text pairs from the client's catalog (product images with their descriptions)
Minimum 10,000 pairs for noticeable improvement, 50,000+ for substantial improvement
Include hard negatives — pairs that are similar but not matches (two different red sofas)

Fine-tuning approach:

Use contrastive learning (same loss as CLIP training) on the domain-specific pairs
Freeze the lower layers of the model and fine-tune the upper layers to preserve general knowledge while adapting to the domain
Monitor retrieval quality on a held-out evaluation set during training
Fine-tuning typically improves domain-specific retrieval by 15-30%
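
For intuition, the symmetric contrastive objective can be written in a few lines of plain Python. This is a sketch for understanding rather than a training loop: it assumes unit-length embeddings where row i of each list is a matched pair, and it uses a fixed temperature of 0.07 (CLIP's initial value) where real training would learn it.

```python
import math

def log_softmax(row):
    m = max(row)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy over a batch."""
    n = len(image_vecs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_vecs]
        for img in image_vecs
    ]
    # Each image should pick out its own text among all texts in the batch.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Each text should pick out its own image (columns of the same matrix).
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Every other pair in the batch acts as a negative, which is why hard negatives matter: placing two different red sofas in the same batch forces the model to separate them, whereas a sofa and a lamp are already easy to tell apart.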

Evaluation

Evaluation metrics for multimodal search:

Recall@K (separately for text-to-image, image-to-text, and image-to-image queries)
MRR (mean reciprocal rank of the first relevant result)
NDCG@10 (quality of the top 10 results ranking)
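
The three metrics can be computed as follows. This sketch uses the linear-gain form of NDCG (some tools use an exponential gain instead), and the item ids and relevance grades are made-up example data.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked, grades, k=10):
    """NDCG with linear gain; `grades` maps item id to graded relevance."""
    def dcg(gains):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))
    best = dcg(sorted(grades.values(), reverse=True)[:k])
    return dcg([grades.get(item, 0) for item in ranked[:k]]) / best if best else 0.0

ranked = ["a", "x", "b", "y"]          # system output for one query
grades = {"a": 3, "b": 2, "c": 1}      # expert relevance judgments
recall = recall_at_k(ranked, grades.keys(), k=3)
rr = reciprocal_rank(ranked, grades.keys())
ndcg = ndcg_at_k(ranked, grades, k=10)
```

Computing these per modality combination, as the list above suggests, means tagging each evaluation query with its type and averaging within each group.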

Evaluation dataset:

Create 200+ query-result pairs covering all modality combinations
Include queries of varying difficulty
Have domain experts annotate relevance on a graded scale
Include queries where the user's intent requires understanding both modalities

Production Deployment

Serving Architecture

Embedding service:

Deploy the multimodal embedding model behind a low-latency API
Cache embeddings for frequently queried images
Use GPU inference for image embedding (CPU inference is usually too slow for real-time search, especially with larger models)
Text embedding can run on CPU if latency requirements allow

Search service:

Vector database serving the multimodal index
Support for hybrid queries (vector similarity + metadata filtering)
Horizontal scaling for high query volumes

API design:

Accept text, image URL, base64-encoded image, or any combination as query input
Return results with relevance scores, thumbnails, and metadata
Support pagination and filtering parameters
Provide feedback endpoints for search refinement
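
A request type along these lines captures the input rules above. All field names, defaults, and limits are hypothetical; an actual API would follow the client's conventions and web framework.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchRequest:
    text: Optional[str] = None
    image_url: Optional[str] = None
    image_base64: Optional[str] = None
    filters: dict = field(default_factory=dict)
    page: int = 1
    page_size: int = 20

    def validate(self):
        """Reject empty queries and out-of-range pagination."""
        if not (self.text or self.image_url or self.image_base64):
            raise ValueError("provide text, an image, or both")
        if self.page < 1 or not 1 <= self.page_size <= 100:
            raise ValueError("page must be >= 1 and page_size within 1..100")
        return self

    def modalities(self):
        """List which encoders this query needs."""
        found = []
        if self.text:
            found.append("text")
        if self.image_url or self.image_base64:
            found.append("image")
        return found

request = SearchRequest(
    text="something like this but in blue",
    image_url="https://example.com/chair.jpg",  # placeholder URL
    filters={"category": "chairs"},
).validate()
needed = request.modalities()
```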

Performance Optimization

Latency budget:

Image embedding: 20-50ms on GPU
Text embedding: 10-30ms on GPU, 30-100ms on CPU
Vector search: 10-50ms
Re-ranking (optional): 50-200ms
Total target: under 500ms for end-to-end search
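
Adding up the stage budgets shows how much headroom the 500ms target leaves. The sketch assumes a combined text and image query whose stages run one after another on GPU; running the two encoders in parallel would lower the totals.

```python
# (low, high) budget per stage in milliseconds, taken from the list above.
BUDGET_MS = {
    "image_embedding_gpu": (20, 50),
    "text_embedding_gpu": (10, 30),
    "vector_search": (10, 50),
    "reranking": (50, 200),
}
TARGET_MS = 500

best_case = sum(low for low, _ in BUDGET_MS.values())
worst_case = sum(high for _, high in BUDGET_MS.values())
headroom = TARGET_MS - worst_case  # left for network, image download, queuing

print(f"best {best_case}ms, worst {worst_case}ms, headroom {headroom}ms")
```

Under these assumptions the model and search stages use between 90ms and 330ms, leaving about 170ms in the worst case for everything else. Re-ranking dominates that total, so it is the first stage to reconsider if latency slips.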

Throughput optimization:

Batch concurrent queries for GPU inference
Cache frequent query embeddings
Use approximate nearest neighbor search (HNSW) for fast retrieval
Pre-compute embeddings for catalog items during ingestion, not at query time
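
Caching frequent query embeddings can be as simple as memoizing the encoder call. In this sketch `embed_text` is a placeholder that returns a fake vector so the example runs without a model, and the lowercasing step assumes the real encoder is case-insensitive; drop it otherwise.

```python
from functools import lru_cache

encoder_calls = {"count": 0}

def embed_text(query):
    """Placeholder for a real encoder call; returns a fake 2-d vector."""
    encoder_calls["count"] += 1
    return (float(len(query)), float(query.count(" ")))

@lru_cache(maxsize=10_000)
def cached_embed_text(normalized_query):
    # Returns a tuple so the cached value cannot be mutated by callers.
    return embed_text(normalized_query)

def embed_query(query):
    """Normalize case and whitespace so near-identical queries share an entry."""
    return cached_embed_text(" ".join(query.lower().split()))

first = embed_query("Blue velvet sofa")
second = embed_query("  blue  velvet sofa ")
```

The second call is served from the cache, so the encoder runs once. An in-process cache like this is per worker; a shared cache would be needed to get the same benefit across several embedding service replicas.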

Monitoring

Search quality metrics:

Click-through rate on search results (by query modality)
Search abandonment rate
Average position of clicked results
User feedback signals (if available)

System metrics:

Embedding latency, search latency, end-to-end latency
Throughput (queries per second)
Index size and growth rate
GPU utilization for embedding service

Your Next Step

Collect 100 representative queries from your client's users — including text queries, image queries, and ideally some combined queries. For each query, identify the 5 most relevant items in the catalog. Embed the catalog using an off-the-shelf CLIP model (OpenCLIP ViT-L/14 is a strong default). Run the 100 queries and compute Recall@10. This baseline measurement takes 1-2 days and tells you three things: whether multimodal search is feasible for this domain, which query types work well and which need improvement, and how much room there is for improvement through fine-tuning. Present the results to the client with side-by-side examples of multimodal search results versus their current keyword search. The visual comparison is the most powerful demonstration of multimodal search value.

Agency Script Editorial