An e-commerce AI agency in Seattle was hired by a home furnishings marketplace with 2.8 million products to build a search system that accepted any combination of text and images as input. Customers wanted to search by uploading a photo from Pinterest and asking for "something like this but in blue," or by describing a product they saw in a friend's house, or by combining a screenshot of a competitor's product with text specifying a different size. The existing keyword search could not handle any of these query types. The agency built a multimodal search system using CLIP-based embeddings that unified text and image representations in a shared vector space. Customers could search with text, images, or any combination. The system processed 840,000 queries per day with sub-second latency and increased product discovery metrics by 47%: customers were finding and purchasing products they never would have discovered through keyword search alone. Average order value increased by 12% because customers found higher-quality matches for their aesthetic preferences.
Multimodal search enables users to search across and between different data types: finding images with text descriptions, finding text with image queries, or combining multiple modalities in a single search. For AI agencies, multimodal search is an increasingly critical capability as businesses recognize that their content is inherently multimodal and their users want to search naturally across all of it.
Understanding Multimodal Search
Search Modality Combinations
Text-to-text: Traditional semantic search. Covered extensively in the vector search and semantic search guides.
Text-to-image: Find images matching a text description. "A modern minimalist living room with exposed brick" returns relevant product images.
Image-to-image: Find images similar to a query image. Upload a photo of a chair and find similar chairs in the catalog.
Image-to-text: Find text descriptions or documents matching a query image. Upload a photo of a plant disease and find relevant agricultural guides.
Multimodal-to-multimodal: Search with a combination of text and images, and return results that can be text, images, or both. This is the most flexible and most demanded capability.
Audio-to-text/image: Search using audio queries (voice descriptions) or find content matching audio characteristics.
Video search: Find specific moments in videos matching text or image queries.
When Multimodal Search Adds Value
High value:
- E-commerce product search (customers often cannot describe what they want in words but can show an image)
- Design and creative tools (searching for visual inspiration across large asset libraries)
- Real estate (searching for properties that match a visual style)
- Fashion (outfit matching, style search)
- Medical imaging (finding similar cases in clinical image databases)
- Manufacturing (finding similar defects or components from visual reference)
Moderate value:
- Document search with embedded images (finding documents based on their visual content)
- Knowledge bases with mixed media (text, images, diagrams, charts)
- Social media content search
Architecture
Unified Embedding Space
The foundation of multimodal search is a shared embedding space where different modalities are represented as vectors that can be directly compared.
How it works:
- Text and images are encoded into vectors in the same high-dimensional space
- Semantically similar content is close together, regardless of modality
- A text description of a "red leather sofa" and an image of a red leather sofa should have similar embeddings
- Search is performed by encoding the query (text, image, or both) and finding the nearest vectors in the index
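The idea can be sketched in a few lines. This is a toy illustration, not a production search: the three-dimensional vectors are made-up stand-ins for real encoder output, and the `image:` / `text:` keys are hypothetical labels. It shows only that a text query vector is compared against image and text vectors with the same similarity function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-d vectors standing in for real text/image encoder output.
index = {
    "image:red_leather_sofa": [0.9, 0.1, 0.0],
    "image:blue_fabric_chair": [0.1, 0.9, 0.1],
    "text:oak dining table": [0.0, 0.2, 0.9],
}

def search(query_vec, index, k=2):
    """Brute-force nearest neighbours, highest similarity first."""
    scored = sorted(
        ((cosine(query_vec, vec), key) for key, vec in index.items()),
        reverse=True,
    )
    return [key for _, key in scored[:k]]

# Pretend this is the text encoder's output for "red leather sofa".
query = [0.8, 0.2, 0.1]
results = search(query, index)
print(results)
```

In a real system the brute-force loop would be replaced by an approximate nearest neighbor index, but the comparison itself stays the same.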
Multimodal Embedding Models
CLIP (OpenAI): The foundational model for multimodal embeddings. Trained on 400 million image-text pairs to align image and text representations. Available in multiple sizes (ViT-B/32, ViT-L/14). Open-source.
OpenCLIP (LAION): Open-source implementation of CLIP trained on larger datasets. Several variants available, some outperforming the original CLIP.
SigLIP (Google): Sigmoid loss-based image-language pre-training. More efficient training than CLIP with competitive or better performance.
BLIP-2 (Salesforce): Stronger image understanding than CLIP, with the ability to generate text descriptions of images. Useful when you need both search and description capabilities.
Cohere Embed v3: Commercial API that supports multimodal embeddings (text + images in the same space). Strong quality with minimal integration effort.
Jina CLIP: Open-source multimodal embedding model designed specifically for search applications. Strong retrieval performance.
Model Selection Criteria
- Quality: Evaluate on your specific domain data. CLIP variants perform differently across domains (fashion, medical, industrial, general).
- Dimensionality: CLIP ViT-B/32 produces 512-dimensional embeddings; ViT-L/14 produces 768-dimensional. Higher dimensions capture more information but increase storage and search costs.
- Inference speed: For real-time search, aim to keep embedding latency under 50ms per query. Larger models are slower.
- Domain adaptation: Fine-tuning the embedding model on domain-specific data typically improves retrieval quality by 15-30%. Prioritize models that are easy to fine-tune.
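To make the dimensionality trade-off concrete, here is a rough storage estimate for the 2.8 million product catalog from the opening example. It assumes one float32 vector per product and ignores index overhead (graph links, metadata), so treat the numbers as a lower bound rather than a sizing guide.

```python
def raw_vector_storage_gb(num_vectors, dims, bytes_per_value=4):
    """Raw embedding storage in GB (10^9 bytes), assuming float32 values."""
    return num_vectors * dims * bytes_per_value / 1e9

products = 2_800_000  # catalog size from the opening example

vit_b32_gb = raw_vector_storage_gb(products, 512)  # ViT-B/32 dimensionality
vit_l14_gb = raw_vector_storage_gb(products, 768)  # ViT-L/14 dimensionality

print(f"ViT-B/32: {vit_b32_gb:.2f} GB")
print(f"ViT-L/14: {vit_l14_gb:.2f} GB")
print(f"ratio:    {vit_l14_gb / vit_b32_gb:.2f}x")
```

Moving from 512 to 768 dimensions costs 1.5x the raw storage, and brute-force distance computation scales the same way.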
Building the Multimodal Index
Image Processing Pipeline
Image preprocessing:
- Resize images to the model's expected input resolution (224x224 for most CLIP variants; some higher-resolution variants expect 336x336 or 384x384)
- Apply center crop or padding to handle varying aspect ratios
- Normalize pixel values according to the model's training statistics
- Handle diverse input formats (JPEG, PNG, WebP, TIFF)
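The resize, crop, and normalize steps reduce to simple arithmetic. The sketch below computes the geometry and the per-channel normalization without an imaging library, so it runs anywhere. The mean and standard deviation values are the ones used by the original OpenAI CLIP preprocessing; other models may use different statistics, so check the model's own preprocessing config. The function names are illustrative.

```python
# Per-channel RGB statistics from the original OpenAI CLIP preprocessing.
# Other models may differ; verify against the model you deploy.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def resize_then_center_crop(width, height, target=224):
    """Scale the short side to `target`, then take a centered square crop.

    Returns the resized (width, height) and the crop box
    (left, top, right, bottom) within the resized image.
    """
    scale = target / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map an 8-bit RGB pixel to the normalized range the model expects."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))

size, box = resize_then_center_crop(1600, 1200)
print(size, box)
print(normalize_pixel((255, 255, 255)))
```

Center cropping discards the edges of wide or tall images. If products often sit off-center, padding to a square before resizing may preserve more of the item.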
Image quality filtering:
- Filter out corrupted images, blank images, and images below minimum resolution
- Detect and filter duplicate or near-duplicate images
- For product catalogs: filter out non-product images (lifestyle shots, logos, text-only images) or tag them appropriately
Image augmentation for indexing (optional):
- Generate embeddings for multiple crops of each image (full image, center crop, quadrant crops) to capture different aspects
- This increases recall for partial-match queries but increases index size proportionally
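One possible way to generate the crop regions, assuming a full image, a centered square, and four quadrants. The crop names are hypothetical labels, and each box would then be cropped and embedded separately.

```python
def crop_boxes(width, height):
    """Return named crop boxes as (left, top, right, bottom) tuples."""
    half_w, half_h = width // 2, height // 2
    side = min(width, height)
    cx, cy = (width - side) // 2, (height - side) // 2
    return {
        "full": (0, 0, width, height),
        "center": (cx, cy, cx + side, cy + side),
        "top_left": (0, 0, half_w, half_h),
        "top_right": (half_w, 0, width, half_h),
        "bottom_left": (0, half_h, half_w, height),
        "bottom_right": (half_w, half_h, width, height),
    }

boxes = crop_boxes(800, 600)
for name, box in boxes.items():
    print(name, box)
```

With this scheme each image produces six vectors, so the image index grows sixfold. Storing the crop name as metadata makes it possible to deduplicate results back to one entry per image.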
Text Processing Pipeline
Text preprocessing for multimodal indexing:
- Clean and normalize text content
- For product catalogs: combine product title, description, category, and key attributes into a single text for embedding
- For documents: chunk text and embed each chunk alongside any associated images
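A minimal sketch of combining catalog fields into one string for embedding. The field names (`title`, `category`, `attributes`, `description`) and the `max_chars` cutoff are assumptions for illustration. The ordering matters because the original CLIP text encoder truncates input at 77 tokens, so the most discriminating fields go first.

```python
def build_product_text(product, max_chars=300):
    """Join catalog fields into a single string for the text encoder.

    Title, category, and attributes come before the free-text description
    so they survive truncation. `max_chars` is a crude stand-in for a
    token limit; use the model's tokenizer for an exact cut.
    """
    parts = [product.get("title", ""), product.get("category", "")]
    parts.extend(f"{key}: {value}" for key, value in product.get("attributes", {}).items())
    parts.append(product.get("description", ""))
    # Collapse internal whitespace and drop empty fields.
    cleaned = [" ".join(p.split()) for p in parts if p and p.strip()]
    return ". ".join(cleaned)[:max_chars]

product = {
    "title": "Oslo  Sofa",
    "category": "Living Room",
    "attributes": {"color": "blue", "material": "velvet"},
    "description": "A compact three-seat sofa.",
}
text = build_product_text(product)
print(text)
```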
Text enrichment:
- Generate text descriptions of images using a vision-language model (BLIP-2, LLaVA)
- Index these generated descriptions alongside the images to improve text-to-image retrieval
- This bridges the modality gap: even if the embedding model imperfectly aligns text and image, the generated text descriptions provide an additional retrieval path
Indexing Architecture
Dual-index approach:
- Maintain a text embedding index and an image embedding index in the same vector database
- All embeddings are in the same vector space (thanks to the multimodal model)
- Search queries are embedded once and compared against both indices simultaneously
- Results from both indices are merged and ranked by similarity
Unified index approach:
- Store all embeddings (text and image) in a single index with a modality tag
- Use the modality tag for filtering (search only images, search only text, search all)
- Simpler architecture but less flexibility for modality-specific optimization
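The unified approach can be illustrated with a small in-memory index. This is a brute-force sketch under stated assumptions: the `UnifiedIndex` class, its method names, and the two-dimensional vectors are all hypothetical, and a real deployment would delegate both storage and filtering to a vector database.

```python
import math
from dataclasses import dataclass

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class Entry:
    item_id: str
    modality: str  # "text" or "image"
    vector: list

class UnifiedIndex:
    """Single list of entries; the modality tag is used only as a filter."""

    def __init__(self):
        self.entries = []

    def add(self, item_id, modality, vector):
        self.entries.append(Entry(item_id, modality, vector))

    def search(self, query, k=3, modality=None):
        # Filter first, then rank the remaining candidates by similarity.
        candidates = [e for e in self.entries if modality is None or e.modality == modality]
        ranked = sorted(candidates, key=lambda e: cosine(query, e.vector), reverse=True)
        return [(e.item_id, e.modality) for e in ranked[:k]]

idx = UnifiedIndex()
idx.add("sofa-1", "image", [0.9, 0.1])
idx.add("sofa-1", "text", [0.8, 0.3])
idx.add("lamp-7", "image", [0.1, 0.9])

all_hits = idx.search([1.0, 0.0], k=2)
image_hits = idx.search([1.0, 0.0], k=2, modality="image")
print(all_hits)
print(image_hits)
```

Note that the unfiltered search returns the same product twice, once per modality. Grouping results by source ID before display avoids showing duplicates.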
Metadata and Filtering
Essential metadata for multimodal search:
- Modality (text, image, video, audio)
- Source document or product ID
- Content type (product photo, lifestyle image, diagram, screenshot)
- Date created/modified
- Category/tag
- Quality score
Metadata filtering in search:
- Allow users to filter by modality ("show me only images")
- Allow users to filter by category, date, or other attributes
- Apply filters before or during vector search for efficient retrieval
Query Processing
Multimodal Query Handling
Text-only queries:
- Embed the text query using the text encoder
- Search the unified index
- Return results ranked by embedding similarity
Image-only queries:
- Embed the image query using the image encoder
- Search the unified index
- Return results ranked by embedding similarity
Combined text + image queries:
This is the most powerful and most complex query type.
Approaches for combining modalities:
- Weighted average: Embed the text and image separately, then compute a weighted average of the two embeddings. The weight determines the relative importance of each modality. Simple and effective.
- Late fusion: Retrieve results separately for the text query and the image query, then combine the result lists using reciprocal rank fusion or a learned combination model.
- Conditional text modification: Use the text to modify the image embedding. For example, "like this image but in blue" shifts the image embedding toward the "blue" direction in the embedding space. Models like Pic2Word or SEARLE specifically support this.
- Vision-language model: Pass the image and text to a vision-language model that generates a unified query embedding capturing both modalities. This is the most sophisticated approach and handles complex combined queries best.
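The first two approaches are simple enough to sketch directly. The code below shows a weighted average of normalized embeddings and reciprocal rank fusion (RRF) over two result lists. The vectors and item names are made up, and `k=60` is the constant conventionally used for RRF, not a tuned value.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def weighted_query(text_vec, image_vec, text_weight=0.5):
    """Weighted average of the two normalized embeddings, re-normalized."""
    t, i = l2_normalize(text_vec), l2_normalize(image_vec)
    mixed = [text_weight * a + (1 - text_weight) * b for a, b in zip(t, i)]
    return l2_normalize(mixed)

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Late fusion: each list contributes 1 / (k + rank) per item."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

balanced = weighted_query([1.0, 0.0], [0.0, 1.0], text_weight=0.5)
text_heavy = weighted_query([1.0, 0.0], [0.0, 1.0], text_weight=0.8)

text_hits = ["blue_sofa", "blue_chair", "red_sofa"]
image_hits = ["red_sofa", "blue_sofa", "green_sofa"]
fused = reciprocal_rank_fusion([text_hits, image_hits])
print(balanced, text_heavy)
print(fused)
```

The weighted average needs only one vector search per query, while late fusion needs one per modality but never mixes the embeddings, which can help when the two encoders are poorly aligned.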
Query Expansion and Refinement
Image-based query expansion:
- When the user searches with an image, generate a text description of the image using a vision-language model
- Use the generated text as an additional search signal
- This helps when the image query is ambiguous (a photo of a room: is the user searching for the sofa, the lamp, or the rug?)
Interactive search refinement:
- After showing initial results, allow the user to provide feedback ("more like this," "less like this")
- Adjust the query embedding based on feedback signals
- This iterative refinement converges on the user's intent faster than reformulating queries
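One common way to adjust a query embedding from feedback is a Rocchio-style update, a classic relevance feedback technique from information retrieval. The source does not specify a method, so this is a sketch of one option; the `alpha`, `beta`, and `gamma` values are conventional starting points to tune, not recommendations.

```python
def centroid(vectors, dims):
    """Mean vector, or a zero vector if no feedback was given."""
    if not vectors:
        return [0.0] * dims
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def refine_query(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style update: move toward liked items, away from disliked ones."""
    pos = centroid(liked, len(query))
    neg = centroid(disliked, len(query))
    return [alpha * q + beta * p - gamma * n for q, p, n in zip(query, pos, neg)]

query = [1.0, 0.0]
liked = [[0.0, 1.0], [0.2, 0.8]]   # embeddings of "more like this" results
disliked = [[1.0, 0.0]]            # embeddings of "less like this" results

refined = refine_query(query, liked, disliked)
print(refined)
```

If the index uses cosine similarity or expects unit vectors, re-normalize the refined vector before searching again.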
Fine-Tuning for Domain Performance
Contrastive Fine-Tuning
Fine-tune the multimodal embedding model on domain-specific image-text pairs to improve retrieval quality.
Training data:
- Image-text pairs from the client's catalog (product images with their descriptions)
- Minimum 10,000 pairs for noticeable improvement, 50,000+ for substantial improvement
- Include hard negatives: pairs that are similar but not matches (two different red sofas)
Fine-tuning approach:
- Use contrastive learning (same loss as CLIP training) on the domain-specific pairs
- Freeze the lower layers of the model and fine-tune the upper layers to preserve general knowledge while adapting to the domain
- Monitor retrieval quality on a held-out evaluation set during training
- Fine-tuning typically improves domain-specific retrieval by 15-30%
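For readers who want to see what the contrastive objective computes, here is a dependency-free sketch of the symmetric loss over a batch of image-text pairs. It assumes the input vectors are already L2-normalized and uses a fixed temperature of 0.07, which is the value CLIP starts from before learning it. Real fine-tuning would use a deep learning framework with gradients; this only evaluates the loss.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric image-to-text and text-to-image loss over one batch.

    Every other item in the batch acts as a negative for pair i.
    """
    logits = [[dot(img, txt) / temperature for txt in text_vecs] for img in image_vecs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))

images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]

aligned = contrastive_loss(images, texts)          # each image matches its text
shuffled = contrastive_loss(images, texts[::-1])   # pairs deliberately mismatched
print(aligned, shuffled)
```

Because negatives come from the rest of the batch, placing hard negatives such as two different red sofas in the same batch is what forces the model to learn fine-grained distinctions.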
Evaluation
Evaluation metrics for multimodal search:
- Recall@K (separately for text-to-image, image-to-text, and image-to-image queries)
- MRR (mean reciprocal rank of the first relevant result)
- NDCG@10 (quality of the top 10 results ranking)
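These three metrics are short enough to implement directly. The sketch below uses a linear-gain form of NDCG (some implementations use an exponential gain instead) and made-up item IDs and relevance grades. MRR is the mean of `reciprocal_rank` across all queries in the evaluation set.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, grades, k=10):
    """NDCG with linear gain; `grades` maps item -> graded relevance."""
    def dcg(gains):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))
    gains = [grades.get(item, 0) for item in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

ranked = ["a", "b", "c", "d"]            # system output for one query
grades = {"b": 2, "d": 1, "z": 3}        # annotator grades; "z" was never retrieved
relevant = set(grades)

recall = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, grades, k=10)
print(recall, rr, ndcg)
```

Computing each metric separately per query modality, as the list above suggests, shows which direction (text-to-image, image-to-text, image-to-image) is weakest.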
Evaluation dataset:
- Create 200+ query-result pairs covering all modality combinations
- Include queries of varying difficulty
- Have domain experts annotate relevance on a graded scale
- Include queries where the user's intent requires understanding both modalities
Production Deployment
Serving Architecture
Embedding service:
- Deploy the multimodal embedding model behind a low-latency API
- Cache embeddings for frequently queried images
- Use GPU inference for image embedding (CPU inference is usually too slow for real-time image search at this query volume)
- Text embedding can run on CPU if latency requirements allow
Search service:
- Vector database serving the multimodal index
- Support for hybrid queries (vector similarity + metadata filtering)
- Horizontal scaling for high query volumes
API design:
- Accept text, image URL, base64-encoded image, or any combination as query input
- Return results with relevance scores, thumbnails, and metadata
- Support pagination and filtering parameters
- Provide feedback endpoints for search refinement
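A request model for such an API might look like the following. Every field name, default, and limit here is a hypothetical choice for illustration, not a prescribed schema. The sketch validates only the request shape and leaves fetching and decoding the image to the embedding service.

```python
import base64
import binascii
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchRequest:
    """Hypothetical request body: text, image, or both, plus filters and paging."""
    text: Optional[str] = None
    image_url: Optional[str] = None
    image_base64: Optional[str] = None
    filters: dict = field(default_factory=dict)
    page: int = 1
    page_size: int = 24

    def validate(self):
        if not (self.text or self.image_url or self.image_base64):
            raise ValueError("provide text, an image, or both")
        if self.image_base64:
            try:
                # validate=True rejects characters outside the base64 alphabet
                base64.b64decode(self.image_base64, validate=True)
            except binascii.Error as exc:
                raise ValueError("image_base64 is not valid base64") from exc
        if self.page < 1 or not 1 <= self.page_size <= 100:
            raise ValueError("page must be >= 1 and page_size between 1 and 100")
        return self

request = SearchRequest(
    text="something like this but in blue",
    image_base64=base64.b64encode(b"fake image bytes").decode("ascii"),
    filters={"category": "sofas"},
).validate()
print(request.page, request.page_size, request.filters)
```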
Performance Optimization
Latency budget:
- Image embedding: 20-50ms on GPU
- Text embedding: 10-30ms on GPU, 30-100ms on CPU
- Vector search: 10-50ms
- Re-ranking (optional): 50-200ms
- Total target: under 500ms for end-to-end search
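Adding up the budget shows how much headroom remains. This sketch assumes a combined text-plus-image query where all stages run one after another on GPU, with re-ranking enabled. Running the two embedding calls in parallel would lower the totals.

```python
# (best case, worst case) in milliseconds, taken from the budget above.
BUDGET_MS = {
    "image_embedding_gpu": (20, 50),
    "text_embedding_gpu": (10, 30),
    "vector_search": (10, 50),
    "reranking": (50, 200),
}
TARGET_MS = 500

def total_latency(budget, skip=()):
    """Sum best and worst cases across stages, optionally skipping some."""
    stages = [rng for name, rng in budget.items() if name not in skip]
    return sum(lo for lo, _ in stages), sum(hi for _, hi in stages)

best, worst = total_latency(BUDGET_MS)
headroom = TARGET_MS - worst
no_rerank = total_latency(BUDGET_MS, skip=("reranking",))

print(f"with re-ranking:    {best}-{worst} ms, headroom {headroom} ms")
print(f"without re-ranking: {no_rerank[0]}-{no_rerank[1]} ms")
```

The remaining headroom has to cover everything the budget does not list, such as network transfer, image download and decoding, and result serialization.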
Throughput optimization:
- Batch concurrent queries for GPU inference
- Cache frequent query embeddings
- Use approximate nearest neighbor search (HNSW) for fast retrieval
- Pre-compute embeddings for catalog items during ingestion, not at query time
Monitoring
Search quality metrics:
- Click-through rate on search results (by query modality)
- Search abandonment rate
- Average position of clicked results
- User feedback signals (if available)
System metrics:
- Embedding latency, search latency, end-to-end latency
- Throughput (queries per second)
- Index size and growth rate
- GPU utilization for embedding service
Your Next Step
Collect 100 representative queries from your client's users, including text queries, image queries, and ideally some combined queries. For each query, identify the 5 most relevant items in the catalog. Embed the catalog using an off-the-shelf CLIP model (OpenCLIP ViT-L/14 is a strong default). Run the 100 queries and compute Recall@10. This baseline measurement takes 1-2 days and tells you three things: whether multimodal search is feasible for this domain, which query types work well and which need improvement, and how much room there is for improvement through fine-tuning. Present the results to the client with side-by-side examples of multimodal search results versus their current keyword search. The visual comparison is the most powerful demonstration of multimodal search value.