Building Multimodal Search Systems: Delivering Search That Understands Text, Images, and Beyond

Agency Script Editorial

Editorial Team

March 20, 2026 · 11 min read
Tags: multimodal search, vision language models, information retrieval, cross-modal search

An e-commerce AI agency in Seattle was hired by a home furnishings marketplace with 2.8 million products to build a search system that accepted any combination of text and images as input. Customers wanted to search by uploading a photo from Pinterest and asking for "something like this but in blue," or by describing a product they saw in a friend's house, or by combining a screenshot of a competitor's product with text specifying a different size. The existing keyword search could not handle any of these query types. The agency built a multimodal search system using CLIP-based embeddings that unified text and image representations in a shared vector space. Customers could search with text, images, or any combination. The system processed 840,000 queries per day with sub-second latency and increased product discovery metrics by 47%: customers were finding and purchasing products they never would have discovered through keyword search alone. Average order value increased by 12% because customers found higher-quality matches for their aesthetic preferences.

Multimodal search enables users to search across and between different data types: finding images with text descriptions, finding text with image queries, or combining multiple modalities in a single search. For AI agencies, multimodal search is an increasingly critical capability as businesses recognize that their content is inherently multimodal and their users want to search naturally across all of it.

Understanding Multimodal Search

Search Modality Combinations

Text-to-text: Traditional semantic search. Covered extensively in the vector search and semantic search guides.

Text-to-image: Find images matching a text description. "A modern minimalist living room with exposed brick" returns relevant product images.

Image-to-image: Find images similar to a query image. Upload a photo of a chair and find similar chairs in the catalog.

Image-to-text: Find text descriptions or documents matching a query image. Upload a photo of a plant disease and find relevant agricultural guides.

Multimodal-to-multimodal: Search with a combination of text and images, and return results that can be text, images, or both. This is the most flexible and most demanded capability.

Audio-to-text/image: Search using audio queries (voice descriptions) or find content matching audio characteristics.

Video search: Find specific moments in videos matching text or image queries.

When Multimodal Search Adds Value

High value:

  • E-commerce product search (customers often cannot describe what they want in words but can show an image)
  • Design and creative tools (searching for visual inspiration across large asset libraries)
  • Real estate (searching for properties that match a visual style)
  • Fashion (outfit matching, style search)
  • Medical imaging (finding similar cases in clinical image databases)
  • Manufacturing (finding similar defects or components from visual reference)

Moderate value:

  • Document search with embedded images (finding documents based on their visual content)
  • Knowledge bases with mixed media (text, images, diagrams, charts)
  • Social media content search

Architecture

Unified Embedding Space

The foundation of multimodal search is a shared embedding space where different modalities are represented as vectors that can be directly compared.

How it works:

  • Text and images are encoded into vectors in the same high-dimensional space
  • Semantically similar content is close together, regardless of modality
  • A text description of a "red leather sofa" and an image of a red leather sofa should have similar embeddings
  • Search is performed by encoding the query (text, image, or both) and finding the nearest vectors in the index
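
The lookup described above can be sketched in a few lines. This is a minimal illustration, assuming tiny hand-written 3-dimensional vectors in place of real encoder output; the item IDs and vector values are invented, and a production system would get its vectors from a multimodal model and its neighbors from a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, index, k=2):
    """Return the k (item_id, score) pairs most similar to the query."""
    scored = [(item_id, cosine_similarity(query_vec, vec))
              for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy index: text and image items live in one vector space.
index = {
    "image:red_leather_sofa": [0.9, 0.1, 0.0],
    "text:red leather sofa": [0.8, 0.2, 0.1],
    "image:oak_dining_table": [0.0, 0.2, 0.9],
}

query = [0.85, 0.15, 0.05]  # stand-in for an encoded text or image query
results = nearest(query, index, k=2)
```

Both sofa entries rank above the table regardless of whether they came from the text or the image encoder, which is the property the shared space is meant to provide.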

Multimodal Embedding Models

CLIP (OpenAI): The foundational model for multimodal embeddings. Trained on 400 million image-text pairs to align image and text representations. Available in multiple sizes (ViT-B/32, ViT-L/14). Open-source.

OpenCLIP (LAION): Open-source implementation of CLIP trained on larger datasets. Several variants available, some outperforming the original CLIP.

SigLIP (Google): Sigmoid loss-based image-language pre-training. More efficient training than CLIP with competitive or better performance.

BLIP-2 (Salesforce): Stronger image understanding than CLIP, with the ability to generate text descriptions of images. Useful when you need both search and description capabilities.

Cohere Embed v3: Commercial API that supports multimodal embeddings (text + images in the same space). Strong quality with minimal integration effort.

Jina CLIP: Open-source multimodal embedding model designed specifically for search applications. Strong retrieval performance.

Model Selection Criteria

  • Quality: Evaluate on your specific domain data. CLIP variants perform differently across domains (fashion, medical, industrial, general).
  • Dimensionality: CLIP ViT-B/32 produces 512-dimensional embeddings; ViT-L/14 produces 768-dimensional. Higher dimensions capture more information but increase storage and search costs.
  • Inference speed: For real-time search, aim to keep embedding latency under about 50ms per query. Larger models are slower.
  • Domain adaptation: Fine-tuning the embedding model on domain-specific data typically improves retrieval quality by 15-30%. Prioritize models that are easy to fine-tune.
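
To make the dimensionality trade-off concrete, here is a rough storage estimate. It assumes one float32 vector per product for the 2.8 million item catalog from the opening example and ignores index overhead, text vectors, and multi-crop vectors, so real numbers will be higher.

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors, excluding any index overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

catalog_size = 2_800_000  # product count from the opening example

size_512 = index_size_gb(catalog_size, 512)  # ViT-B/32 sized vectors
size_768 = index_size_gb(catalog_size, 768)  # ViT-L/14 sized vectors
growth = size_768 / size_512                 # relative storage cost
```

Under these assumptions the larger model needs 1.5 times the raw vector storage, and distance computations scale with dimension in the same way.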

Building the Multimodal Index

Image Processing Pipeline

Image preprocessing:

  • Resize images to the model's expected input resolution (224x224 for most CLIP variants; some higher-resolution variants expect 336x336 or 384x384)
  • Apply center crop or padding to handle varying aspect ratios
  • Normalize pixel values according to the model's training statistics
  • Handle diverse input formats (JPEG, PNG, WebP, TIFF)
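
The geometry and normalization steps can be sketched without an image library. This example assumes a 224x224 target and uses the per-channel mean and standard deviation published with the original CLIP release; other models ship their own statistics, and a real pipeline would apply these operations with an imaging library rather than on single pixels.

```python
# Per-channel statistics published with the original OpenAI CLIP release.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def resize_shorter_side(width, height, target=224):
    """New (width, height) after scaling so the shorter side equals target."""
    scale = target / min(width, height)
    return max(target, round(width * scale)), max(target, round(height * scale))

def center_crop_box(width, height, target=224):
    """(left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target

def normalize_pixel(rgb, mean=CLIP_MEAN, std=CLIP_STD):
    """Map one 8-bit RGB pixel to the scale the model was trained on."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))

new_w, new_h = resize_shorter_side(1600, 1200)  # a 4:3 product photo
box = center_crop_box(new_w, new_h)
black = normalize_pixel((0, 0, 0))
```

Center cropping discards the left and right edges of a wide photo; padding instead keeps the whole product in frame at the cost of introducing blank borders.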

Image quality filtering:

  • Filter out corrupted images, blank images, and images below minimum resolution
  • Detect and filter duplicate or near-duplicate images
  • For product catalogs: filter out non-product images (lifestyle shots, logos, text-only images) or tag them appropriately

Image augmentation for indexing (optional):

  • Generate embeddings for multiple crops of each image (full image, center crop, quadrant crops) to capture different aspects
  • This increases recall for partial-match queries but increases index size proportionally
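
A sketch of the crop set described above. The center crop is assumed here to cover the middle half of each dimension; that proportion is an illustrative choice, not something the approach prescribes.

```python
def crop_boxes(width, height):
    """Named (left, top, right, bottom) boxes: full, center, four quadrants."""
    half_w, half_h = width // 2, height // 2
    quarter_w, quarter_h = width // 4, height // 4
    return {
        "full": (0, 0, width, height),
        "center": (quarter_w, quarter_h, quarter_w + half_w, quarter_h + half_h),
        "top_left": (0, 0, half_w, half_h),
        "top_right": (half_w, 0, width, half_h),
        "bottom_left": (0, half_h, half_w, height),
        "bottom_right": (half_w, half_h, width, height),
    }

boxes = crop_boxes(800, 600)
index_multiplier = len(boxes)  # vectors stored per image
```

With this crop set every image contributes six vectors, so the image index grows sixfold. Each vector should carry the parent image ID so results can be de-duplicated before display.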

Text Processing Pipeline

Text preprocessing for multimodal indexing:

  • Clean and normalize text content
  • For product catalogs: combine product title, description, category, and key attributes into a single text for embedding
  • For documents: chunk text and embed each chunk alongside any associated images
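
One way to combine catalog fields into a single string for embedding is sketched below. The field names (`title`, `category`, `attributes`, `description`) and the character cap are assumptions for illustration. The most distinctive fields go first because CLIP-style text encoders truncate long input (the original CLIP uses a 77-token context), so trailing description text may be cut off.

```python
def build_product_text(product, max_chars=300):
    """Join title, category, attributes, and description into one string."""
    parts = [product.get("title", ""), product.get("category", "")]
    attributes = product.get("attributes", {})
    parts.extend(f"{name}: {value}" for name, value in sorted(attributes.items()))
    parts.append(product.get("description", ""))
    # Collapse internal whitespace and skip empty fields.
    cleaned = [" ".join(p.split()) for p in parts if p and p.strip()]
    # max_chars is a rough stand-in for the encoder's token limit.
    return ". ".join(cleaned)[:max_chars]

product = {
    "title": "Harlow  Sofa",
    "category": "Living Room",
    "attributes": {"material": "leather", "color": "red"},
    "description": "Three-seat sofa\nwith tapered legs.",
}
text = build_product_text(product)
```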

Text enrichment:

  • Generate text descriptions of images using a vision-language model (BLIP-2, LLaVA)
  • Index these generated descriptions alongside the images to improve text-to-image retrieval
  • This bridges the modality gap: even if the embedding model imperfectly aligns text and image, the generated text descriptions provide an additional retrieval path

Indexing Architecture

Dual-index approach:

  • Maintain a text embedding index and an image embedding index in the same vector database
  • All embeddings are in the same vector space (thanks to the multimodal model)
  • Search queries are embedded once and compared against both indices simultaneously
  • Results from both indices are merged and ranked by similarity

Unified index approach:

  • Store all embeddings (text and image) in a single index with a modality tag
  • Use the modality tag for filtering (search only images, search only text, search all)
  • Simpler architecture but less flexibility for modality-specific optimization
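
A minimal in-memory sketch of the unified index with a modality tag. The `Entry` class, the brute-force scan, and the toy vectors are all illustrative; a vector database would store the tag as filterable metadata and use an approximate index instead of sorting every entry.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    item_id: str
    modality: str  # "text" or "image"
    vector: list

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(entries, query_vec, modality=None, k=3):
    """Rank entries by dot product, optionally restricted to one modality."""
    candidates = [e for e in entries if modality is None or e.modality == modality]
    ranked = sorted(candidates, key=lambda e: dot(query_vec, e.vector), reverse=True)
    return [e.item_id for e in ranked[:k]]

entries = [
    Entry("sofa-photo", "image", [0.9, 0.1]),
    Entry("sofa-description", "text", [0.8, 0.2]),
    Entry("table-photo", "image", [0.1, 0.9]),
]

all_hits = search(entries, [1.0, 0.0])
image_hits = search(entries, [1.0, 0.0], modality="image")
```

The dual-index variant would run the same query against two such collections and merge the two result lists by score.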

Metadata and Filtering

Essential metadata for multimodal search:

  • Modality (text, image, video, audio)
  • Source document or product ID
  • Content type (product photo, lifestyle image, diagram, screenshot)
  • Date created/modified
  • Category/tag
  • Quality score

Metadata filtering in search:

  • Allow users to filter by modality ("show me only images")
  • Allow users to filter by category, date, or other attributes
  • Apply filters before or during vector search for efficient retrieval

Query Processing

Multimodal Query Handling

Text-only queries:

  • Embed the text query using the text encoder
  • Search the unified index
  • Return results ranked by embedding similarity

Image-only queries:

  • Embed the image query using the image encoder
  • Search the unified index
  • Return results ranked by embedding similarity

Combined text + image queries:

This is the most powerful and most complex query type.

Approaches for combining modalities:

  • Weighted average: Embed the text and image separately, then compute a weighted average of the two embeddings. The weight determines the relative importance of each modality. Simple and effective.
  • Late fusion: Retrieve results separately for the text query and the image query, then combine the result lists using reciprocal rank fusion or a learned combination model.
  • Conditional text modification: Use the text to modify the image embedding. For example, "like this image but in blue" shifts the image embedding toward the "blue" direction in the embedding space. Models like Pic2Word or SEARLE specifically support this.
  • Vision-language model: Pass the image and text to a vision-language model that generates a unified query embedding capturing both modalities. This is the most sophisticated approach and handles complex combined queries best.
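
The first two approaches are simple enough to sketch directly. The example below assumes 2-dimensional toy vectors and short ranked ID lists; `text_weight` and the fusion constant `k=60` (a commonly used default for reciprocal rank fusion) are tunable assumptions rather than recommended values.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def weighted_query(text_vec, image_vec, text_weight=0.5):
    """Weighted average: blend two unit vectors, then re-normalize."""
    t, i = l2_normalize(text_vec), l2_normalize(image_vec)
    mixed = [text_weight * a + (1.0 - text_weight) * b for a, b in zip(t, i)]
    return l2_normalize(mixed)

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Late fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Weighted average of a text vector and an image vector.
blended = weighted_query([1.0, 0.0], [0.0, 1.0], text_weight=0.5)

# Late fusion of a text-query ranking and an image-query ranking.
fused = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "d"]])
```

Items that appear in both rankings ("b" and "c") rise above items that appear in only one, even when a single-list item held the top position. The weighted average needs only one index lookup, while late fusion needs one per modality.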

Query Expansion and Refinement

Image-based query expansion:

  • When the user searches with an image, generate a text description of the image using a vision-language model
  • Use the generated text as an additional search signal
  • This helps when the image query is ambiguous (with a photo of a room, is the user searching for the sofa, the lamp, or the rug?)

Interactive search refinement:

  • After showing initial results, allow the user to provide feedback ("more like this," "less like this")
  • Adjust the query embedding based on feedback signals
  • This iterative refinement converges on the user's intent faster than reformulating queries
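
The text above does not name a specific update rule. One classic option is a Rocchio-style update, sketched here as an assumption: the query moves toward the centroid of liked results and away from the centroid of disliked ones. The `alpha`, `beta`, and `gamma` weights are illustrative defaults.

```python
import math

def refine_query(query, liked, disliked, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: move toward liked vectors, away from disliked."""
    dims = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

    pos, neg = centroid(liked), centroid(disliked)
    updated = [alpha * q + beta * p - gamma * n
               for q, p, n in zip(query, pos, neg)]
    norm = math.sqrt(sum(x * x for x in updated))
    # Fall back to the original query if the update cancels out entirely.
    return [x / norm for x in updated] if norm else query

original = [1.0, 0.0]
refined = refine_query(original, liked=[[0.0, 1.0]], disliked=[[1.0, 0.0]])
```

The refined vector is re-normalized so it can be sent to the same index as the original query.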

Fine-Tuning for Domain Performance

Contrastive Fine-Tuning

Fine-tune the multimodal embedding model on domain-specific image-text pairs to improve retrieval quality.

Training data:

  • Image-text pairs from the client's catalog (product images with their descriptions)
  • Minimum 10,000 pairs for noticeable improvement, 50,000+ for substantial improvement
  • Include hard negatives: pairs that are similar but not matches (two different red sofas)

Fine-tuning approach:

  • Use contrastive learning (same loss as CLIP training) on the domain-specific pairs
  • Freeze the lower layers of the model and fine-tune the upper layers to preserve general knowledge while adapting to the domain
  • Monitor retrieval quality on a held-out evaluation set during training
  • Fine-tuning typically improves domain-specific retrieval by 15-30%
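
For reference, the symmetric contrastive objective can be written out in plain Python. This is a sketch of the loss computation only, assuming unit-length toy vectors and a fixed temperature of 0.07 (the initial value used by CLIP, where it is a learned parameter). Actual fine-tuning would use a deep learning framework with gradients and batching.

```python
import math

def cross_entropy(logits_row, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits_row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits_row))
    return log_sum - logits_row[target_index]

def clip_style_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; row i of each list is a matched pair."""
    n = len(image_vecs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_vecs]
        for img in image_vecs
    ]
    # Image-to-text direction: each image should pick out its own caption.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: same matrix read column by column.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[i], i) for i in range(n)) / n
    return (image_to_text + text_to_image) / 2.0

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

Every other item in the batch acts as a negative for a given pair, so placing hard negatives in the same batch is what makes them contribute to the loss.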

Evaluation

Evaluation metrics for multimodal search:

  • Recall@K (separately for text-to-image, image-to-text, and image-to-image queries)
  • MRR (mean reciprocal rank of the first relevant result)
  • NDCG@10 (quality of the top 10 results ranking)
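
The three metrics can be computed per query as follows. The item IDs and graded relevance values are invented for illustration, and NDCG here uses the linear-gain form (some implementations use an exponential gain instead). MRR is the mean of `reciprocal_rank` over all evaluation queries.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """NDCG with graded gains; items missing from `gains` count as 0."""
    def dcg(values):
        return sum(g / math.log2(pos + 1) for pos, g in enumerate(values, start=1))
    actual = dcg([gains.get(i, 0) for i in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

ranked = ["p3", "p1", "p9", "p2"]   # system output for one query
gains = {"p1": 3, "p2": 1}          # expert-annotated graded relevance

recall = recall_at_k(ranked, set(gains), k=3)
rr = reciprocal_rank(ranked, set(gains))
ndcg = ndcg_at_k(ranked, gains, k=10)
```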

Evaluation dataset:

  • Create 200+ query-result pairs covering all modality combinations
  • Include queries of varying difficulty
  • Have domain experts annotate relevance on a graded scale
  • Include queries where the user's intent requires understanding both modalities

Production Deployment

Serving Architecture

Embedding service:

  • Deploy the multimodal embedding model behind a low-latency API
  • Cache embeddings for frequently queried images
  • Use GPU inference for image embedding (CPU inference is usually too slow for real-time search)
  • Text embedding can run on CPU if latency requirements allow

Search service:

  • Vector database serving the multimodal index
  • Support for hybrid queries (vector similarity + metadata filtering)
  • Horizontal scaling for high query volumes

API design:

  • Accept text, image URL, base64-encoded image, or any combination as query input
  • Return results with relevance scores, thumbnails, and metadata
  • Support pagination and filtering parameters
  • Provide feedback endpoints for search refinement
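
A possible request shape for such an API, sketched as a plain dataclass. Every field name and limit here (`image_base64`, `page_size` capped at 100, and so on) is hypothetical; the point is only that the request must carry at least one query modality and that encoded images should be checked before they reach the embedding service.

```python
import base64
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SearchRequest:
    text: Optional[str] = None
    image_url: Optional[str] = None
    image_base64: Optional[str] = None
    filters: dict = field(default_factory=dict)
    page: int = 1
    page_size: int = 20

    def validate(self):
        """Raise ValueError if the request cannot be served."""
        if not (self.text or self.image_url or self.image_base64):
            raise ValueError("provide text, image_url, or image_base64")
        if self.image_base64:
            try:
                base64.b64decode(self.image_base64, validate=True)
            except ValueError as exc:  # binascii.Error subclasses ValueError
                raise ValueError("image_base64 is not valid base64") from exc
        if self.page < 1 or not 1 <= self.page_size <= 100:
            raise ValueError("page must be >= 1 and page_size between 1 and 100")
        return self

ok = SearchRequest(
    text="like this but in blue",
    image_url="https://example.com/sofa.jpg",  # placeholder URL
    filters={"modality": "image"},
).validate()
```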

Performance Optimization

Latency budget:

  • Image embedding: 20-50ms on GPU
  • Text embedding: 10-30ms on GPU, 30-100ms on CPU
  • Vector search: 10-50ms
  • Re-ranking (optional): 50-200ms
  • Total target: under 500ms for end-to-end search
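
Adding up the upper bounds shows how much of the 500ms target the listed stages consume. The stage names are labels chosen for this sketch, and the option to overlap the two embedding calls assumes they can run concurrently on separate workers.

```python
# Upper bounds in milliseconds, taken from the budget above.
BUDGET_MS = {
    "image_embedding_gpu": 50,
    "text_embedding_gpu": 30,
    "text_embedding_cpu": 100,
    "vector_search": 50,
    "reranking": 200,
}

def worst_case_ms(stages, parallel_embedding=False):
    """Sum stage upper bounds; optionally overlap the embedding calls."""
    embed = [BUDGET_MS[s] for s in stages if "embedding" in s]
    rest = [BUDGET_MS[s] for s in stages if "embedding" not in s]
    embed_cost = max(embed, default=0) if parallel_embedding else sum(embed)
    return embed_cost + sum(rest)

# A combined text + image query with CPU text embedding and re-ranking.
combined = ["image_embedding_gpu", "text_embedding_cpu", "vector_search", "reranking"]

sequential = worst_case_ms(combined)
overlapped = worst_case_ms(combined, parallel_embedding=True)
headroom = 500 - sequential
```

Even the slowest listed configuration leaves about 100ms within the target, which has to cover everything the budget does not itemize, such as network transfer and image download or decoding.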

Throughput optimization:

  • Batch concurrent queries for GPU inference
  • Cache frequent query embeddings
  • Use approximate nearest neighbor search (HNSW) for fast retrieval
  • Pre-compute embeddings for catalog items during ingestion, not at query time
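
Caching frequent text query embeddings can be as simple as an in-process LRU cache. In this sketch `embed_text` is a hypothetical stand-in for the real encoder call, and the cache size is arbitrary; a multi-instance deployment would more likely use a shared cache service.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the (fake) encoder actually runs

def embed_text(text):
    """Hypothetical stand-in for a real text encoder call."""
    calls["count"] += 1
    return tuple(float(ord(ch) % 7) for ch in text[:4])

@lru_cache(maxsize=10_000)
def _cached_embedding(normalized_text):
    return embed_text(normalized_text)

def cached_embed_text(text):
    """Normalize case and whitespace so near-identical queries share an entry."""
    return _cached_embedding(" ".join(text.lower().split()))

first = cached_embed_text("Blue  Sofa")
second = cached_embed_text("blue sofa")
stats = _cached_embedding.cache_info()
```

Normalizing before the lookup raises the hit rate, since users type the same popular queries with different casing and spacing.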

Monitoring

Search quality metrics:

  • Click-through rate on search results (by query modality)
  • Search abandonment rate
  • Average position of clicked results
  • User feedback signals (if available)

System metrics:

  • Embedding latency, search latency, end-to-end latency
  • Throughput (queries per second)
  • Index size and growth rate
  • GPU utilization for embedding service

Your Next Step

Collect 100 representative queries from your client's users, including text queries, image queries, and ideally some combined queries. For each query, identify the 5 most relevant items in the catalog. Embed the catalog using an off-the-shelf CLIP model (OpenCLIP ViT-L/14 is a strong default). Run the 100 queries and compute Recall@10. This baseline measurement takes 1-2 days and tells you three things: whether multimodal search is feasible for this domain, which query types work well and which need improvement, and how much room there is for improvement through fine-tuning. Present the results to the client with side-by-side examples of multimodal search results versus their current keyword search. The visual comparison is the most powerful demonstration of multimodal search value.
