Reducing LLM Bills: Architecting High-Performance Semantic Caching
The Inefficiency of Exact-Match Caching
Traditional caching strategies (like key-value stores matching exact strings) work poorly for AI applications. In natural language, two prompts can have the exact same meaning while using slightly different wording (e.g., "What is the capital of France?" vs. "Tell me France's capital city").
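The miss on a paraphrase is easy to reproduce. The sketch below is illustrative only: `exactCache` and `normalize` are hypothetical names, and the cached answer is hard-coded rather than produced by a model.

```javascript
// Exact-match cache: the key is the (lightly normalized) prompt string.
const exactCache = new Map();

function normalize(prompt) {
  // Trimming and lowercasing only absorbs trivial differences in wording.
  return prompt.trim().toLowerCase();
}

exactCache.set(normalize("What is the capital of France?"), "Paris");

// Same wording with different casing and whitespace: found.
const sameWording = exactCache.get(normalize("  what is the capital of France? "));
// Same meaning with different wording: not found.
const paraphrase = exactCache.get(normalize("Tell me France's capital city"));

console.log(sameWording); // "Paris"
console.log(paraphrase);  // undefined
```

String normalization cannot close this gap, because the two prompts share meaning rather than characters.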
To cache these requests effectively, we must implement Semantic Caching.
How Semantic Caching Works
Instead of matching strings, semantic caching converts prompts into vector embeddings. When a request comes in, the gateway embeds the prompt and runs a vector similarity search (usually cosine similarity or L2 distance) against the embeddings of previously cached prompts.
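For reference, the two metrics can be written in a few lines. This is a plain-array sketch for illustration; a vector database would normally compute these internally, and the function names here are not taken from any library.

```javascript
// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// L2 (Euclidean) distance: 0 means identical vectors, larger means further apart.
function l2Distance(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

console.log(cosineSimilarity([1, 0], [2, 0])); // 1 (same direction, different length)
console.log(cosineSimilarity([1, 0], [0, 3])); // 0 (orthogonal)
console.log(l2Distance([0, 0], [3, 4]));       // 5
```

Note the opposite orientation: a cache hit means a high cosine similarity but a low L2 distance. For unit-length vectors the two are interchangeable, since the squared L2 distance equals 2 - 2 * cosine similarity.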
- Embedding Generation: The prompt is converted into a vector (e.g., using a fast local embedding model or OpenAI's text-embedding-3-small).
- Similarity Search: The gateway queries a vector database or Redis Stack for cached vectors within a threshold (e.g., cosine similarity > 0.95).
- Cache Hit: If a close match is found, the cached response is returned without calling the LLM, so the request costs only an embedding and a vector lookup instead of a full generation.
- Cache Miss: If no match is found, the request is sent to the LLM, and the response is saved in the vector index for future matches.
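// Pseudocode for a semantic cache lookup.
// generateEmbedding and vectorDb stand in for your embedding model and vector store.
async function getSemanticCache(prompt) {
  // Embed the incoming prompt so it can be compared with cached prompts.
  const promptVector = await generateEmbedding(prompt);
  // Ask the store for the closest cached prompt at or above the threshold.
  const match = await vectorDb.querySimilarity(promptVector, { threshold: 0.96 });
  if (match) {
    return match.response; // Cache hit
  }
  return null; // Cache miss
}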
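The pseudocode above covers only the lookup. A minimal in-memory sketch of the whole hit/miss flow might look like the following. Everything here is an assumption made for illustration: `InMemorySemanticCache`, `cachedCompletion`, `fakeEmbed`, and `fakeLlm` are hypothetical names, the vectors are hand-picked rather than real embeddings, and a linear scan replaces the vector index a real gateway would use.

```javascript
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemorySemanticCache {
  constructor(embed, threshold = 0.95) {
    this.embed = embed;         // async (prompt) => number[]
    this.threshold = threshold; // minimum cosine similarity for a hit
    this.entries = [];          // [{ vector, response }]
  }

  // Returns the response of the most similar cached prompt, or null on a miss.
  async get(prompt) {
    const vector = await this.embed(prompt);
    let best = null;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best !== null && bestScore >= this.threshold ? best.response : null;
  }

  async set(prompt, response) {
    this.entries.push({ vector: await this.embed(prompt), response });
  }
}

// Try the cache first, fall back to the LLM on a miss, then store the result.
async function cachedCompletion(cache, callLlm, prompt) {
  const hit = await cache.get(prompt);
  if (hit !== null) return { response: hit, cached: true };
  const response = await callLlm(prompt);
  await cache.set(prompt, response);
  return { response, cached: false };
}

// Hand-picked stand-in vectors (NOT real embeddings) so the demo is deterministic.
const fakeVectors = {
  "What is the capital of France?": [0.9, 0.1, 0.0],
  "Tell me France's capital city": [0.88, 0.12, 0.02],
  "How do I reverse a list in Python?": [0.0, 0.2, 0.9],
};
const fakeEmbed = async (prompt) => fakeVectors[prompt];
const fakeLlm = async (prompt) => `answer to: ${prompt}`;

async function demo() {
  const cache = new InMemorySemanticCache(fakeEmbed, 0.95);
  for (const prompt of Object.keys(fakeVectors)) {
    const { cached } = await cachedCompletion(cache, fakeLlm, prompt);
    console.log(`${prompt} -> cached: ${cached}`);
  }
}
demo().catch(console.error);
```

With these stand-in vectors the first prompt is a miss, the paraphrase is a hit, and the unrelated Python question is a miss. The sketch embeds the prompt twice on a miss (once in `get`, once in `set`); a production version would likely reuse the vector and add eviction or a TTL.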
Balancing Accuracy and Cost
Semantic caching is a trade-off between hit rate and answer accuracy. Setting the similarity threshold too high turns genuine paraphrases into cache misses and erodes the cost savings, while setting it too low risks returning a cached response to a prompt that only looks similar. Platform teams should tune thresholds to the specificity of the task, using high thresholds (0.97+) for code generation or factual data and lower thresholds (0.92+) for creative writing or conversational chatbots.
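One way to make this tuning concrete is to keep a per-task threshold table and check candidate values against a hand-labelled sample of prompt pairs. The sketch below assumes hypothetical names (`THRESHOLDS`, `thresholdFor`, `evaluateThreshold`), and the similarity scores are made up for illustration, not measured.

```javascript
// Starting points taken from the ranges above; tune them per workload.
const THRESHOLDS = new Map([
  ["code", 0.97],
  ["factual", 0.97],
  ["creative", 0.92],
  ["chat", 0.92],
]);

function thresholdFor(taskType) {
  // Unknown task types fall back to the strictest value (an assumption of this sketch).
  return THRESHOLDS.get(taskType) ?? 0.97;
}

// pairs: [{ score, sameMeaning }], where score is the similarity between a new
// prompt and its nearest cached prompt, and sameMeaning is a human label.
function evaluateThreshold(pairs, threshold) {
  let hits = 0, falseHits = 0, missedHits = 0;
  for (const { score, sameMeaning } of pairs) {
    const isHit = score >= threshold;
    if (isHit) hits++;
    if (isHit && !sameMeaning) falseHits++;   // wrong answer served from cache
    if (!isHit && sameMeaning) missedHits++;  // avoidable LLM call
  }
  return { hits, falseHits, missedHits };
}

// Made-up scores for illustration only.
const labelledPairs = [
  { score: 0.99, sameMeaning: true },
  { score: 0.96, sameMeaning: true },
  { score: 0.94, sameMeaning: true },
  { score: 0.93, sameMeaning: false },
  { score: 0.80, sameMeaning: false },
];

console.log(evaluateThreshold(labelledPairs, thresholdFor("code")));
// { hits: 1, falseHits: 0, missedHits: 2 }
console.log(evaluateThreshold(labelledPairs, thresholdFor("chat")));
// { hits: 4, falseHits: 1, missedHits: 0 }
```

On this toy sample the stricter threshold serves no wrong answers but pays for two avoidable LLM calls, while the looser one catches every paraphrase at the cost of one false hit.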
See It in Action
Selixes implements everything described in this article — circuit breaking, session budgets, local edge fallback, and private VPC deployment.