A conversational AI agency in New York was operating a customer support system for an e-commerce client that processed 180,000 queries per day through GPT-4. Monthly inference costs were $54,000, and p95 latency was 4.2 seconds, both exceeding the client's targets. The agency analyzed the query patterns and discovered that 34% of queries were semantically identical to previous queries (customers asking the same questions with slight wording variations), 22% of queries could be answered from cached responses to similar questions, and 18% of queries triggered the same downstream API calls repeatedly. By implementing a three-layer caching strategy (exact match caching for identical queries, semantic caching for similar queries, and result caching for repeated API operations), the agency reduced monthly costs to $21,000 (61% reduction), cut p95 latency to 1.1 seconds (74% reduction), and maintained answer quality within 2% of the uncached system. The caching infrastructure cost $2,800 per month, delivering a net savings of $30,200 monthly.
Caching for AI inference is the practice of storing and reusing the results of expensive AI computations (model predictions, embeddings, API responses, and intermediate results) to reduce latency, cost, and compute load. For AI agencies, caching is one of the highest-leverage optimizations because it addresses the two biggest operational pain points simultaneously: cost and speed. But caching AI systems is fundamentally different from caching traditional web applications because AI inputs are often high-dimensional, similarity matters more than exact matching, and cache invalidation is driven by model updates rather than data updates.
Why AI Systems Need Different Caching
The AI Caching Challenge
Traditional web caching stores exact URL-to-response mappings. A request for /products/123 always returns the same product page, and the cache key is the URL. AI caching is harder because:
Inputs are high-dimensional: An LLM prompt, an image for classification, or a feature vector for prediction contains far more information than a URL. Two nearly identical prompts may produce the same response, but their cache keys are different strings.
Similarity matters more than identity: "What is your return policy?" and "How do I return an item?" should map to the same cached response, but they are different strings. Traditional exact-match caching misses these opportunities.
Outputs may be non-deterministic: LLMs with temperature greater than zero produce different outputs for identical inputs. Caching must account for acceptable response variation.
Freshness is model-dependent: A cached response becomes stale not when the underlying data changes (as in web caching) but when the model is updated, the prompt is changed, or the knowledge base is modified.
The Business Case for AI Caching
Cost reduction: LLM API calls cost $0.01-0.10 per query. At 100,000 queries per day, a 40% cache hit rate avoids 40,000 calls per day, which saves roughly $146,000-1,460,000 annually.
Latency reduction: An LLM call takes 1-10 seconds. A cache lookup takes 1-50 milliseconds. Cached responses are typically 100-1,000x faster.
Throughput increase: Caching reduces the load on GPU inference servers, allowing the same infrastructure to handle more unique queries.
Consistency: Cached responses are deterministic: the same question always gets the same answer. This is desirable for many enterprise applications where response consistency matters.
Caching Layer Architecture
Layer 1: Exact Match Cache
The simplest and most effective cache layer. Store the exact input-output mapping and return the cached output when the exact same input is seen again.
Implementation:
- Compute a hash (SHA-256) of the complete input (prompt, parameters, model version)
- Use the hash as the cache key
- Store the output alongside the key in a fast key-value store (Redis, Memcached, DynamoDB)
- Set a TTL (time-to-live) based on the expected staleness rate
When exact match caching works well:
- Automated systems that generate the same queries repeatedly (monitoring dashboards, reporting systems, scheduled analyses)
- User queries with limited variation (FAQ-style questions, form-based inputs)
- Embedding computations (the same text always produces the same embedding)
- Classification of recurring inputs (same product descriptions, same customer profiles)
Expected hit rate: 10-40% for customer-facing applications, 40-80% for automated systems with repetitive queries.
Cache key design:
The cache key must include everything that affects the output:
- The complete input text or data
- The model version or identifier
- The model parameters (temperature, max tokens, system prompt)
- Any context or retrieval results that influence the output
If any of these change, the cache key must change. A common bug is caching LLM responses without including the system prompt in the cache key: changing the system prompt then returns stale responses from the previous prompt.
Layer 2: Semantic Cache
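The key-design rules above can be sketched as a small helper. This is a minimal illustration, not a definitive implementation: the function name `make_cache_key`, its argument names, and the example model identifier are assumptions made for this sketch.

```python
import hashlib
import json


def make_cache_key(prompt, model_version, system_prompt, params, context=()):
    """Build a SHA-256 cache key from everything that affects the output."""
    payload = {
        "prompt": prompt,
        "model_version": model_version,   # hypothetical identifier, e.g. "model-v1"
        "system_prompt": system_prompt,
        "params": params,                 # e.g. temperature, max_tokens
        "context": list(context),         # retrieved documents, if any
    }
    # sort_keys gives a stable serialization, so dict ordering never changes the key.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


params = {"temperature": 0.0, "max_tokens": 256}
key_a = make_cache_key("What is your return policy?", "model-v1",
                       "You are a support agent.", params)
key_b = make_cache_key("What is your return policy?", "model-v1",
                       "You are a sales agent.", params)
print(key_a != key_b)  # prints True: a new system prompt yields a new key
```

Serializing the whole payload before hashing means a change to any single field produces a different key, which is what guards against the stale-system-prompt bug described above.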
Semantic caching returns cached responses for inputs that are semantically similar to previously seen inputs, even if the exact wording differs.
Implementation:
- When a query arrives, embed it using a text embedding model
- Search the cache for embeddings within a similarity threshold of the query embedding
- If a match is found, return the cached response
- If no match is found, compute the fresh response and store it in the cache with its embedding
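The lookup flow above can be sketched in a few lines. This is an illustration under stated assumptions: `toy_embed` is a deliberately crude keyword-count stand-in for a real embedding model, the `SemanticCache` class name is hypothetical, and the linear scan would be replaced by a vector index in practice.

```python
import math
import re


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # function: text -> list of floats
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response)

    def lookup(self, query):
        query_vec = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:  # linear scan, fine for a sketch
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def get_or_compute(self, query, compute):
        cached = self.lookup(query)
        if cached is not None:
            return cached, True
        response = compute(query)  # the expensive model call
        self.entries.append((self.embed(query), response))
        return response, False


VOCAB = ["return", "refund", "policy", "shipping", "track", "order"]


def toy_embed(text):
    """Toy stand-in for an embedding model: counts of a few keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


cache = SemanticCache(toy_embed, threshold=0.95)
first = cache.get_or_compute("What is your return policy?", lambda q: "30 days")
second = cache.get_or_compute("Tell me about the return policy", lambda q: "unused")
```

In this sketch the second call is served from the cache because both queries map to the same toy vector; with a real embedding model the similarity would be below 1.0 and the threshold would decide the outcome.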
Semantic cache components:
- Embedding model: A fast, lightweight embedding model (all-MiniLM-L6-v2 or similar). The embedding computation should be much faster than the full inference; otherwise the caching overhead negates the benefit.
- Vector index: A small vector index (HNSW in Redis, Qdrant, or an in-memory index) for fast similarity search against cached embeddings.
- Similarity threshold: The minimum cosine similarity required to consider a cached response as a match. Typical values: 0.92-0.97. Higher thresholds are more conservative (fewer false matches) but yield lower hit rates.
Threshold calibration:
The similarity threshold determines the tradeoff between hit rate and response quality:
- Too low (below 0.90): High hit rate but returns incorrect cached responses for queries that are superficially similar but semantically different
- Too high (above 0.97): Very few cache hits because even slight wording differences push the similarity score below the threshold
- Calibration method: Collect 500 query pairs with human similarity judgments. Plot cache hit rate and response appropriateness against threshold values. Choose the threshold that maximizes hit rate while maintaining response appropriateness above 95%.
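The calibration method above amounts to a threshold sweep over labeled pairs. The sketch below assumes each labeled pair has already been reduced to a `(similarity, is_appropriate)` tuple; the function name and the sample data are illustrative, not measured results.

```python
def calibrate_threshold(pairs, thresholds, min_appropriateness=0.95):
    """Sweep thresholds over human-labeled pairs.

    pairs: list of (cosine_similarity, is_appropriate) tuples.
    Returns (rows, best), where each row is
    (threshold, hit_rate, appropriateness) and best is the row with the
    highest hit rate that still meets the appropriateness bar.
    """
    rows = []
    for threshold in thresholds:
        hits = [ok for similarity, ok in pairs if similarity >= threshold]
        hit_rate = len(hits) / len(pairs)
        # With no hits nothing is served from cache, so treat quality as perfect.
        appropriateness = sum(hits) / len(hits) if hits else 1.0
        rows.append((threshold, hit_rate, appropriateness))
    eligible = [row for row in rows if row[2] >= min_appropriateness]
    best = max(eligible, key=lambda row: row[1]) if eligible else None
    return rows, best


# Made-up labels for illustration only.
labeled_pairs = [
    (0.99, True), (0.97, True), (0.96, True), (0.94, True),
    (0.93, False), (0.91, False), (0.88, False), (0.85, False),
]
rows, best = calibrate_threshold(labeled_pairs, [0.90, 0.92, 0.94, 0.96])
```

On this sample data the sweep selects 0.94: lower thresholds admit inappropriate matches, and higher ones give up hits without improving quality.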
Expected hit rate: 15-35% additional hits on top of exact match caching, for a combined hit rate of 30-60%.
Layer 3: Computation Cache
Cache intermediate computation results that are reused across multiple queries.
Embedding cache: Cache the embedding vectors for text inputs. If the same text appears in multiple queries (as part of a prompt, as a retrieved document, or as a classification input), reuse the cached embedding rather than recomputing it.
Retrieval cache: In RAG systems, cache the retrieval results for similar queries. If two queries retrieve the same documents, the retrieval step can be cached.
Feature cache: For ML prediction pipelines, cache computed features. If the same entity (customer, product, transaction) appears in multiple prediction requests within a short window, reuse the cached features.
API response cache: Cache the responses from external API calls (weather data, stock prices, customer records) that are used as inputs to AI models. These often change slowly (hourly or daily) and can be cached with appropriate TTLs.
Layer 4: Response Composition Cache
For systems that compose responses from multiple components, cache the components individually and compose cached components into new responses.
Example (RAG system):
- Cache the retrieval results for common query patterns
- Cache the generated answers for common retrieval result sets
- When a new query retrieves the same documents as a previous query, skip directly to the cached answer
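One way to sketch the RAG example is to key generated answers by the set of retrieved document IDs. The names `doc_set_key` and `answer_with_cache`, and the `retrieve` and `generate` callables, are hypothetical. The sketch also assumes that two queries retrieving the same documents want the same answer, which should be validated before relying on it.

```python
import hashlib


def doc_set_key(doc_ids, model_version):
    """Key an answer by the set of retrieved documents, ignoring their order."""
    joined = "|".join(sorted(doc_ids))
    return hashlib.sha256(f"{model_version}::{joined}".encode("utf-8")).hexdigest()


def answer_with_cache(query, retrieve, generate, cache, model_version="model-v1"):
    """Return (answer, was_cached). retrieve and generate are supplied by the caller."""
    doc_ids = retrieve(query)
    key = doc_set_key(doc_ids, model_version)
    if key in cache:
        return cache[key], True            # same documents seen before: skip generation
    result = generate(query, doc_ids)      # the expensive LLM call
    cache[key] = result
    return result, False
```

Sorting the IDs before hashing makes the key independent of retrieval order, so a re-ranked but otherwise identical result set still hits.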
Example (multi-step agent):
- Cache the results of individual agent steps (tool calls, API responses, intermediate reasoning)
- When a new query triggers the same sequence of steps as a previous query, reuse cached step results
Cache Infrastructure
Storage Options
Redis: The default choice for AI caching. In-memory storage provides sub-millisecond read latency. Supports key-value storage for exact match caching and can be paired with RediSearch for vector similarity search. Handles the volume and latency requirements of most AI applications.
Memcached: Simple, fast, distributed cache. Good for exact match caching when you do not need vector search. Less feature-rich than Redis but slightly faster for pure key-value operations.
DynamoDB or similar managed databases: For caches that need persistence, durability, and automatic scaling. Slightly higher latency than Redis (single-digit milliseconds) but minimal operational overhead.
Local in-memory cache: For single-instance deployments, a process-local cache (Python dictionary, LRU cache, or TTLCache) provides the fastest possible access. Limited by the instance's memory and not shared across instances.
Tiered caching:
Use multiple storage tiers for optimal cost-performance balance:
- Local in-memory cache (fastest, smallest: cache the most frequent items)
- Redis (fast, medium size: cache the broader working set)
- Persistent storage (slower, largest: cache historical results for reuse after Redis eviction)
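A read-through lookup across the tiers might look like the sketch below. Plain dictionaries stand in for the local cache, Redis, and persistent storage, and the `TieredCache` class is a hypothetical name, so treat this as an outline of the control flow rather than production code.

```python
class TieredCache:
    def __init__(self, tiers):
        # Tiers are ordered fastest first; each is any dict-like store.
        self.tiers = tiers

    def get(self, key):
        """Return (value, tier_index), or (None, -1) on a miss in every tier."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier[key]
                # Promote the entry into every faster tier for next time.
                for faster in self.tiers[:level]:
                    faster[key] = value
                return value, level
        return None, -1

    def put(self, key, value):
        for tier in self.tiers:
            tier[key] = value


local, shared, persistent = {}, {}, {"faq:returns": "30 days"}
cache = TieredCache([local, shared, persistent])
value, level = cache.get("faq:returns")  # found in the slowest tier, then promoted
```

Promotion on read is one design choice among several; it keeps hot items in the fastest tier without a separate background job.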
Cache Sizing
Memory estimation:
- Exact match cache: Average response size x number of unique queries x desired retention period
- Semantic cache: (Embedding dimension x 4 bytes + average response size) x number of unique queries
- Feature cache: Feature vector size x number of entities x update frequency
Example calculation for an LLM application:
- 100,000 unique queries per day
- Average response: 500 tokens = approximately 2KB
- 7-day retention: 100,000 x 7 x 2KB = 1.4GB
- Semantic embeddings: 100,000 x 7 x (384 x 4 bytes) = 1.1GB
- Total cache size: approximately 2.5GB, which fits comfortably in a small Redis instance
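The same arithmetic can be wrapped in a small helper for trying other workloads. The function name and parameters are assumptions for this sketch, and the estimate ignores key and metadata overhead in the store, so real memory use will be somewhat higher.

```python
def estimate_cache_bytes(unique_queries_per_day, retention_days,
                         avg_response_bytes, embedding_dim, bytes_per_float=4):
    """Return (response_bytes, embedding_bytes, total_bytes) for the cache."""
    entries = unique_queries_per_day * retention_days
    response_bytes = entries * avg_response_bytes
    embedding_bytes = entries * embedding_dim * bytes_per_float
    return response_bytes, embedding_bytes, response_bytes + embedding_bytes


# The worked example above: 100,000 queries/day, 7 days, 2KB responses, 384-dim embeddings.
responses, embeddings, total = estimate_cache_bytes(100_000, 7, 2_000, 384)
print(f"{responses / 1e9:.1f} GB + {embeddings / 1e9:.1f} GB = {total / 1e9:.1f} GB")
```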
Cache Eviction Policies
LRU (Least Recently Used): Evict the least recently accessed items when the cache is full. The default choice for most AI caching scenarios because it naturally retains frequently accessed items.
TTL (Time-To-Live): Automatically expire items after a fixed duration. Essential for AI caching to ensure responses do not become stale when models or knowledge bases are updated.
Frequency-based: Evict the least frequently accessed items. Better than LRU when there is a mix of one-time queries and recurring queries, because LRU can evict a frequently accessed item if it has not been accessed recently.
Recommended approach: Combine LRU eviction with TTL expiration. Set TTL based on the model update frequency:
- Models updated daily: TTL = 24 hours
- Models updated weekly: TTL = 7 days
- Models updated monthly: TTL = 30 days
- Embedding models (rarely updated): TTL = 90 days
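Combining LRU eviction with TTL expiration can be sketched with the standard library alone. The `LruTtlCache` class below is a hypothetical in-process illustration; a managed store such as Redis provides its own eviction and expiry settings, which this sketch does not model.

```python
import time
from collections import OrderedDict


class LruTtlCache:
    def __init__(self, max_items, ttl_seconds, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so expiry can be tested
        self._items = OrderedDict()   # key -> (expires_at, value), oldest first

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._items[key]      # expired: treat as a miss
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        self._items[key] = (self.clock() + self.ttl_seconds, value)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used


# Example: a model updated weekly maps to a 7-day TTL.
weekly_cache = LruTtlCache(max_items=10_000, ttl_seconds=7 * 24 * 3600)
weekly_cache.put("key", "cached response")
```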
Cache Invalidation
Model-Driven Invalidation
When the AI model is updated, some or all cached responses become stale.
Invalidation strategies:
- Full invalidation: Clear the entire cache when the model is updated. Simple but wasteful, since many cached responses may still be valid with the new model.
- Versioned caching: Include the model version in the cache key. When the model is updated, all cache keys include the new version, automatically avoiding stale responses. Old-version entries expire naturally through TTL.
- Selective invalidation: After a model update, identify which types of queries are likely to produce different responses with the new model. Invalidate only those cache entries. This requires understanding what changed in the model update.
Recommended approach: Versioned caching is the safest and simplest. Include the model version (or a hash of the model configuration including the system prompt) in every cache key.
Knowledge Base-Driven Invalidation
For RAG systems, the cache must be invalidated when the knowledge base changes.
Document-level invalidation:
- When a document is added, updated, or deleted from the knowledge base, invalidate cache entries that referenced that document
- Maintain a mapping from document IDs to cache keys
- When a document changes, look up and invalidate all cache keys that used that document
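The document-to-key mapping can be sketched as a reverse index kept alongside the cache. The `DocumentIndexedCache` class and its method names are hypothetical, and in-memory dictionaries stand in for whatever store holds the cache.

```python
from collections import defaultdict


class DocumentIndexedCache:
    def __init__(self):
        self.responses = {}                  # cache key -> response
        self.keys_by_doc = defaultdict(set)  # document ID -> cache keys that used it

    def put(self, key, response, doc_ids):
        self.responses[key] = response
        for doc_id in doc_ids:
            self.keys_by_doc[doc_id].add(key)

    def get(self, key):
        return self.responses.get(key)

    def invalidate_document(self, doc_id):
        """Drop every cached response that used doc_id; return how many were removed."""
        keys = self.keys_by_doc.pop(doc_id, set())
        removed = 0
        for key in keys:
            if self.responses.pop(key, None) is not None:
                removed += 1
        return removed


cache = DocumentIndexedCache()
cache.put("k1", "answer one", ["doc-1", "doc-2"])
cache.put("k2", "answer two", ["doc-2"])
cache.put("k3", "answer three", ["doc-3"])
removed = cache.invalidate_document("doc-2")  # removes k1 and k2, keeps k3
```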
Time-based invalidation:
- For knowledge bases that change frequently, set short TTLs on cached RAG responses
- For knowledge bases that change rarely, longer TTLs are acceptable
Proactive Cache Warming
Pre-populate the cache with responses to anticipated queries before they arrive.
Cache warming strategies:
- Analyze historical query logs to identify the most common queries
- Generate and cache responses for the top 1,000-5,000 queries during off-peak hours
- Update the warm cache after model updates to ensure fresh responses are ready
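A warming job built from these steps might look like the following sketch. The function names are hypothetical, the cache is keyed by normalized query text for brevity (a real key would include the model version and parameters), and `generate` stands in for the model call.

```python
from collections import Counter


def top_queries(query_log, n):
    """Return the n most frequent queries after trivial normalization."""
    normalized = (query.strip().lower() for query in query_log)
    return [query for query, _count in Counter(normalized).most_common(n)]


def warm_cache(cache, query_log, generate, n=1000):
    """Pre-compute responses for the top-n queries; return how many were added."""
    warmed = 0
    for query in top_queries(query_log, n):
        if query not in cache:
            cache[query] = generate(query)  # run during off-peak hours
            warmed += 1
    return warmed


log = ["Return policy", "return policy ", "shipping time",
       "return policy", "track order", "shipping time"]
warm = {}
added = warm_cache(warm, log, lambda q: f"answer for {q}", n=2)
```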
Cache warming benefits:
- Eliminates cold-start latency after model updates or cache flushes
- Ensures the highest-traffic queries are always served from cache
- Smooths out traffic spikes by pre-computing responses for predictable query patterns
Monitoring and Quality Assurance
Cache Performance Metrics
Hit rate: The percentage of requests served from cache. Track separately for each cache layer (exact match, semantic, computation).
Miss rate: The percentage of requests that require fresh computation. This is 1 minus the hit rate.
Latency by cache status: Compare latency for cache hits vs. cache misses. Cache hits should be 100-1,000x faster.
Cache size and growth rate: Monitor memory usage to ensure the cache does not exceed infrastructure capacity.
Eviction rate: The rate at which items are evicted from the cache. High eviction rates indicate the cache is too small or the TTL is too short.
Response Quality Monitoring
Caching introduces a quality risk: cached responses may not be as appropriate as freshly computed responses.
Quality monitoring for semantic caching:
- Sample 1-2% of semantically cached responses for human review
- Compare the cached response to what a fresh computation would have produced
- Track the "cache appropriateness rate": the percentage of cached responses that are appropriate for the query
- If the appropriateness rate drops below 95%, tighten the similarity threshold
A/B testing cache quality:
- Route 5% of traffic to an uncached path (always compute fresh)
- Compare user satisfaction metrics (click-through, follow-up queries, explicit feedback) between cached and uncached responses
- The cached path should perform within 2-3% of the uncached path on quality metrics
Cost Tracking
Savings calculation:
- Track the number of cache hits per day
- Multiply by the per-query cost of the AI inference that was avoided
- Subtract the cost of cache infrastructure (Redis instance, embedding computation for semantic caching)
- Report net savings monthly
Cost per query by path:
- Cache hit: Infrastructure cost / total hits (typically $0.0001-0.001 per query)
- Cache miss: Full inference cost + cache storage cost (typically $0.01-0.10 per query)
- Blended cost: Weighted average based on hit rate
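The savings and blended-cost calculations reduce to two short formulas. The sketch below uses illustrative inputs chosen from within the ranges quoted above; the function names and the 30-day month are assumptions.

```python
def blended_cost_per_query(hit_rate, hit_cost, miss_cost):
    """Weighted average cost per query across cache hits and misses."""
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost


def net_monthly_savings(hits_per_day, inference_cost_per_query,
                        infra_cost_per_month, days=30):
    """Avoided inference spend minus cache infrastructure cost."""
    return hits_per_day * days * inference_cost_per_query - infra_cost_per_month


# Illustrative inputs: 40% hit rate, $0.001 per hit, $0.05 per miss.
blended = blended_cost_per_query(0.40, 0.001, 0.05)
# Illustrative inputs: 40,000 hits/day, $0.05 avoided per hit, $2,800/month infrastructure.
savings = net_monthly_savings(40_000, 0.05, 2_800)
print(f"blended: ${blended:.4f} per query, net savings: ${savings:,.0f} per month")
```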
Advanced Caching Patterns
Streaming Cache
For LLM applications that stream responses token by token, caching requires special handling.
Streaming cache implementation:
- Store the complete response in the cache after streaming completes
- When a cache hit occurs, stream the cached response to the client at a natural pace (not all at once) to maintain the user experience
- Include a "streaming signature" that marks the response as cached so the UI can handle it appropriately
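Replaying a cached response as a paced stream can be sketched with a generator. The chunk size, the delay, and the `cached` flag (standing in for the "streaming signature") are assumptions for illustration; a real service would emit chunks in whatever wire format its streaming API uses.

```python
import time


def replay_cached_stream(cached_text, chunk_chars=20, delay_seconds=0.02,
                         sleep=time.sleep):
    """Yield a cached response in small chunks, pausing between them."""
    for start in range(0, len(cached_text), chunk_chars):
        # The "cached" flag lets the UI tell replayed output from live generation.
        yield {"text": cached_text[start:start + chunk_chars], "cached": True}
        sleep(delay_seconds)


chunks = list(replay_cached_stream("Returns are accepted within 30 days.",
                                   chunk_chars=10, delay_seconds=0.0))
```

Passing `sleep` as a parameter keeps the pacing testable and lets an async service substitute its own delay mechanism.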
Multi-Model Cache Sharing
For agencies running multiple models or model versions, some cache entries may be shareable.
Shared embedding cache:
- If multiple models use the same embedding model, they can share the embedding cache
- Tag cached embeddings with the embedding model version to ensure compatibility
Cross-model response cache:
- If two model versions produce similar outputs for the same inputs, responses cached from one model can potentially serve requests for the other
- This requires validation that the cross-model responses are sufficiently similar
Hierarchical Caching for Multi-Step Pipelines
For multi-step AI pipelines (RAG, agent systems, multi-model pipelines), cache at each stage independently.
Stage-level caching:
- Cache retrieval results separately from generation results
- Cache embedding computations separately from similarity search results
- Cache tool call results separately from agent reasoning
Benefits:
- Each stage can have its own TTL and invalidation strategy
- A change in one stage (e.g., new generation model) only invalidates that stage's cache, not the entire pipeline
- Intermediate caches provide value even when the end-to-end cache does not hit
Your Next Step
Analyze one week of your production AI system's query logs. Compute the exact duplicate rate: what percentage of queries are character-for-character identical to a previous query? Then compute the near-duplicate rate: what percentage of queries would match a previous query if you normalized whitespace, capitalization, and punctuation? The exact duplicate rate tells you the minimum savings from exact match caching. The near-duplicate rate tells you the additional savings from simple normalization-based caching. If these two rates together exceed 20%, implementing a basic two-layer cache (exact match plus normalized match) will produce measurable cost and latency improvements within a week. Build that first, measure the impact, and then evaluate whether semantic caching is worth the additional complexity. Start with the simplest cache that delivers measurable value, then layer on sophistication as the data justifies it.
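The log analysis described above can be sketched as a single pass over the queries. The function names are hypothetical, and the near-duplicate rate here counts only queries that match after normalization but were not already exact duplicates, so the two rates can be added together.

```python
import re


def normalize(query):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", query.lower())
    return " ".join(text.split())


def duplicate_rates(queries):
    """Return (exact_duplicate_rate, additional_near_duplicate_rate)."""
    if not queries:
        return 0.0, 0.0
    seen_exact, seen_normalized = set(), set()
    exact_dupes = near_dupes = 0
    for query in queries:
        norm = normalize(query)
        if query in seen_exact:
            exact_dupes += 1
        elif norm in seen_normalized:
            near_dupes += 1  # matches an earlier query only after normalization
        seen_exact.add(query)
        seen_normalized.add(norm)
    return exact_dupes / len(queries), near_dupes / len(queries)


sample_log = [
    "What is your return policy?",
    "what is your return policy",
    "What is your return policy?",
    "Where is my order?",
    "WHERE IS MY ORDER",
]
exact_rate, near_rate = duplicate_rates(sample_log)
```

On this five-query sample the exact rate is 20% and the additional near-duplicate rate is 40%; a real log will give very different numbers.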