
Reducing LLM Bills: Architecting High-Performance Semantic Caching

June 15, 2026·8 min read·Selixes Engineering

The Inefficiency of Exact-Match Caching

Traditional caches, such as key-value stores keyed on the exact request string, work poorly for LLM applications. In natural language, two prompts can mean the same thing while using different wording (e.g., "What is the capital of France?" vs. "Tell me France's capital city"). An exact-match cache treats them as unrelated requests, so both go to the model and both are billed.

To cache these requests effectively, we must implement Semantic Caching.

How Semantic Caching Works

Instead of matching strings, semantic caching converts each prompt into a vector embedding. When a request comes in, the gateway runs a vector similarity search (usually cosine similarity or L2 distance) against the embeddings of previously cached prompts.
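To make the two metrics concrete, here is a minimal sketch in plain JavaScript. The function names (dot, cosineSimilarity, l2Distance, normalize) are illustrative and not taken from any particular library; a real deployment would normally let the vector store compute these.

```javascript
// Dot product of two equal-length numeric arrays.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite.
function cosineSimilarity(a, b) {
  const denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
  if (denom === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot(a, b) / denom;
}

// Euclidean (L2) distance: 0 = identical, larger = further apart.
function l2Distance(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return Math.sqrt(sum);
}

// Scale a vector to unit length.
function normalize(v) {
  const norm = Math.sqrt(dot(v, v));
  if (norm === 0) throw new Error("Cannot normalize a zero vector");
  return v.map((x) => x / norm);
}
```

For unit-length vectors the two metrics rank neighbors identically, because the squared L2 distance equals 2 - 2 * (cosine similarity). Under that assumption, a cosine threshold of 0.95 corresponds to an L2 distance of about 0.316.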

Embedding Generation: The prompt is converted into a vector (e.g., using a fast local embedding model or OpenAI's text-embedding-3-small).
Similarity Search: The gateway queries a vector database or Redis Stack for cached vectors that meet a similarity threshold (e.g., cosine similarity > 0.95).
Cache Hit: If a close enough match is found, the cached response is returned without calling the LLM, so the request costs only an embedding and a vector lookup.
Cache Miss: If no match is found, the request is sent to the LLM, and the prompt vector and response are saved in the vector index for future matches.

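// Pseudocode for a semantic cache lookup.
// generateEmbedding and vectorDb are placeholders for your embedding client
// and vector store; they are not defined here.
async function getSemanticCache(prompt) {
  // Embed the incoming prompt so it can be compared by meaning, not by string.
  const promptVector = await generateEmbedding(prompt);
  // Ask the store for the nearest cached prompt that meets the similarity threshold.
  const match = await vectorDb.querySimilarity(promptVector, { threshold: 0.96 });
  if (match) {
    return match.response; // Cache hit: skip the LLM call
  }
  return null; // Cache miss: the caller should call the LLM and store the result
}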
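The lookup above leaves generateEmbedding and vectorDb abstract. As a rough, self-contained sketch of the full hit-or-miss flow, the following keeps cached vectors in an in-memory array and scans them linearly. The class name InMemorySemanticCache and the injected embed and callLlm functions are hypothetical, introduced only for illustration; they stand in for whatever embedding client and LLM client you actually use.

```javascript
// Cosine similarity between two equal-length numeric arrays.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector yields NaN here, which never passes the threshold check below.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory cache.
// embed: async (text) => number[]   (assumed embedding function)
// threshold: minimum cosine similarity that counts as a hit
class InMemorySemanticCache {
  constructor({ embed, threshold = 0.95 }) {
    this.embed = embed;
    this.threshold = threshold;
    this.entries = []; // each entry: { vector, response }
  }

  // Linear scan for the most similar cached prompt at or above the threshold.
  findBest(vector) {
    let best = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score >= this.threshold && (best === null || score > best.score)) {
        best = { score, response: entry.response };
      }
    }
    return best;
  }

  // Return a cached response on a hit; otherwise call the LLM and store the result.
  // callLlm: async (prompt) => string   (assumed LLM client function)
  async getOrGenerate(prompt, callLlm) {
    const vector = await this.embed(prompt);
    const best = this.findBest(vector);
    if (best) {
      return { response: best.response, hit: true, score: best.score };
    }
    const response = await callLlm(prompt);
    this.entries.push({ vector, response });
    return { response, hit: false, score: null };
  }
}
```

The linear scan costs time proportional to the number of cached prompts, which is fine for a demo but not for production traffic. A vector database or Redis Stack, as mentioned earlier, would replace findBest with an indexed nearest-neighbor query.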

Balancing Accuracy and Cost

Semantic caching is a trade-off. Setting the similarity threshold too high causes paraphrased prompts to miss the cache, which reduces the savings. Setting it too low risks returning a cached answer to a prompt that only looks similar, which produces inaccurate responses. Platform teams should tune thresholds to the specificity of each task: high thresholds (0.97+) for code generation or factual data, and lower thresholds (0.92+) for creative writing or conversational chatbots.
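One way to express per-task thresholds is a small lookup table. The sketch below is illustrative only: the task type names and the fallback rule are assumptions, and the numbers are the starting points quoted above, to be tuned against real traffic.

```javascript
// Starting thresholds per task type (hypothetical task names).
const SIMILARITY_THRESHOLDS = Object.freeze({
  code_generation: 0.97,
  factual: 0.97,
  creative_writing: 0.92,
  conversational: 0.92,
});

// Assumption: unknown task types fall back to the strictest configured
// threshold, preferring an extra LLM call over a wrong cached answer.
const DEFAULT_THRESHOLD = Math.max(...Object.values(SIMILARITY_THRESHOLDS));

// Look up the threshold for a task type.
function thresholdFor(taskType) {
  return Object.prototype.hasOwnProperty.call(SIMILARITY_THRESHOLDS, taskType)
    ? SIMILARITY_THRESHOLDS[taskType]
    : DEFAULT_THRESHOLD;
}

// Decide whether a similarity score counts as a cache hit for this task type.
function isCacheHit(similarity, taskType) {
  return similarity >= thresholdFor(taskType);
}
```

With these values, a match scoring 0.95 is served from cache for a conversational request but goes to the LLM for a code generation request.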

See It in Action

Selixes implements everything described in this article — circuit breaking, session budgets, local edge fallback, and private VPC deployment.
