
Semantic Caching for LLM APIs with Redis — Cut Costs Without Cutting Quality

Exact-match caching misses most repeat LLM queries because users phrase things differently. Semantic caching with embeddings + Redis catches near-duplicate questions and can cut your LLM API bill significantly.

DevOpsBoys · Jun 15, 2026 · 4 min read

A standard cache keys on exact string match. For LLM APIs, that is nearly useless: "How do I reset my password?" and "I forgot my password, how do I change it?" are the same question to a human, but two different keys to a dictionary. Semantic caching solves this by keying on meaning rather than exact text.
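To make the miss concrete, here is a minimal sketch of an exact-match cache. The `exact_key` helper and the cached answer text are hypothetical and exist only for illustration; even with case and whitespace normalized, the paraphrase lands on a different key.

```python
import hashlib

def exact_key(query: str) -> str:
    """Hypothetical exact-match key: hash of the lowercased, whitespace-collapsed query."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

exact_cache: dict[str, str] = {}
exact_cache[exact_key("How do I reset my password?")] = "Go to Settings > Security > Reset password."

# The same text (modulo case and spacing) hits; a paraphrase does not.
same_text_hit = exact_key("how do I reset my  password?") in exact_cache
paraphrase_hit = exact_key("I forgot my password, how do I change it?") in exact_cache
print(same_text_hit, paraphrase_hit)
```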

How It Works

User query → Embed query → Search Redis vector index for similar past queries
                                    ↓
                    Similarity > threshold? 
                        ↓ Yes                    ↓ No
                Return cached response      Call LLM, cache the new response

The key design decision is the similarity threshold — too loose and you return wrong answers for genuinely different questions; too strict and you barely cache anything.
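The flow above can be sketched in memory, without Redis, to show what the threshold actually decides. This is an illustrative sketch only: the 3-dimensional vectors stand in for real embeddings, and `lookup` and `cosine_similarity` are hypothetical helper names, not part of any library.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec, entries, threshold):
    """Return the response of the most similar entry, or None if it scores below threshold."""
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for embeddings of two cached questions (made-up values).
entries = [
    (np.array([1.0, 0.0, 0.0]), "password reset answer"),
    (np.array([0.0, 1.0, 0.0]), "pricing answer"),
]
near_duplicate = np.array([0.95, 0.05, 0.0])  # points almost the same way as the first entry
different = np.array([0.6, 0.6, 0.5])         # not close to either entry

print(lookup(near_duplicate, entries, threshold=0.92))  # hit
print(lookup(different, entries, threshold=0.92))       # miss, falls through to the LLM
print(lookup(different, entries, threshold=0.5))        # loose threshold: wrong answer served
```

The last call is the failure mode to watch for: with the threshold at 0.5, a query that is only about 0.61 similar to a cached question still gets that question's answer.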

Setup

bash
pip install redis redisvl anthropic sentence-transformers

You need Redis with the RediSearch module (or Redis Stack, which includes it by default).

bash
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest

Step 1: Build the Semantic Cache

python
import hashlib
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.schema import IndexSchema
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # fast, 384-dim, good enough for caching

schema = IndexSchema.from_dict({
    "index": {"name": "llm_cache", "prefix": "cache"},
    "fields": [
        {"name": "query_text", "type": "text"},
        {"name": "response", "type": "text"},
        {"name": "created_at", "type": "numeric"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {"dims": 384, "distance_metric": "cosine", "algorithm": "hnsw"}
        }
    ]
})

redis_client = Redis(host="localhost", port=6379)
index = SearchIndex(schema, redis_client)
index.create(overwrite=False)  # keep an existing index and its data

class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl_seconds: int = 86400):
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds

    def get(self, query: str) -> str | None:
        query_embedding = embedder.encode(query).tolist()

        # index.query() takes a query object rather than raw keyword arguments
        results = index.query(VectorQuery(
            vector=query_embedding,
            vector_field_name="embedding",
            num_results=1,
            return_fields=["response", "query_text"]
        ))

        if not results:
            return None

        top = results[0]
        similarity = 1 - float(top["vector_distance"])  # cosine distance to similarity

        if similarity >= self.threshold:
            return top["response"]

        return None

    def set(self, query: str, response: str):
        # Hash storage (the redisvl default) expects the vector as raw float32 bytes
        query_embedding = np.asarray(embedder.encode(query), dtype=np.float32).tobytes()
        key = hashlib.sha256(query.encode()).hexdigest()[:16]

        # id_field makes the Redis key "cache:<key>"; ttl sets the expiry at load time
        index.load(
            [{
                "id": key,
                "query_text": query,
                "response": response,
                "created_at": time.time(),
                "embedding": query_embedding
            }],
            id_field="id",
            ttl=self.ttl
        )

Step 2: Wrap Your LLM Calls

python
from anthropic import Anthropic
 
client = Anthropic()
cache = SemanticCache(similarity_threshold=0.92, ttl_seconds=86400)
 
def ask_with_cache(query: str) -> dict:
    cached_response = cache.get(query)
    if cached_response is not None:
        # Same keys as the miss path below, so callers can read "tokens" either way
        return {"response": cached_response, "cache_hit": True, "tokens": 0}
    
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    
    answer = response.content[0].text
    cache.set(query, answer)
    
    return {
        "response": answer,
        "cache_hit": False,
        "tokens": response.usage.input_tokens + response.usage.output_tokens
    }

Picking the Right Threshold

This is the part teams get wrong. Set the threshold too low and the cache serves answers to questions that only look similar; set it too high and almost nothing hits, which defeats the purpose.

python
import numpy as np

# Test your threshold against known query pairs before deploying
test_pairs = [
    ("How do I reset my password?", "I forgot my password, how do I change it?", True),   # should match
    ("How do I reset my password?", "How do I delete my account?", False),                  # should NOT match
    ("What's the pricing for the pro plan?", "How much does pro cost?", True),               # should match
]

THRESHOLD = 0.92  # candidate value under test

for q1, q2, should_match in test_pairs:
    e1, e2 = embedder.encode(q1), embedder.encode(q2)
    similarity = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    # Flag pairs where the candidate threshold disagrees with the expected outcome
    verdict = "OK" if (similarity >= THRESHOLD) == should_match else "WRONG"
    print(f"{q1[:30]}... vs {q2[:30]}...: {similarity:.3f} (expected match: {should_match}) {verdict}")

Run this against 30-50 real query pairs from your actual traffic before picking a production threshold. 0.90-0.93 is a reasonable starting range for support/FAQ-style queries, but the right value is domain-dependent. For queries where a small wording difference changes the answer meaningfully (legal, medical, financial), push the threshold higher or skip semantic caching entirely for that use case.
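Once you have similarity scores for labeled pairs, comparing candidate thresholds is a matter of counting the two kinds of error. The sketch below assumes you have already collected `(similarity, should_match)` tuples; the scores shown are made up for illustration, and `evaluate_threshold` is a hypothetical helper.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Count false hits and missed hits for one candidate threshold.

    scored_pairs: list of (similarity, should_match) tuples.
    """
    # False hit: the cache would answer although the questions differ (wrong answer served)
    false_hits = sum(1 for sim, should in scored_pairs if sim >= threshold and not should)
    # Missed hit: the cache would miss although the questions match (extra LLM call)
    missed_hits = sum(1 for sim, should in scored_pairs if sim < threshold and should)
    return {"threshold": threshold, "false_hits": false_hits, "missed_hits": missed_hits}

# Made-up scores; replace with similarities measured on your own query pairs.
scored_pairs = [(0.95, True), (0.91, True), (0.93, False), (0.70, False), (0.88, True)]

for t in (0.85, 0.90, 0.92, 0.94):
    print(evaluate_threshold(scored_pairs, t))
```

A false hit returns a wrong answer to a user, while a missed hit only costs one extra LLM call, so it is usually reasonable to prefer the threshold with fewer false hits when two candidates are otherwise close.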

What NOT to Cache

python
def is_cacheable(query: str) -> bool:
    # Never semantically cache queries with user-specific or time-sensitive context
    non_cacheable_patterns = [
        "my account", "my order", "today", "right now", 
        "current", "this week", "my balance"
    ]
    return not any(p in query.lower() for p in non_cacheable_patterns)

Personalized or time-sensitive queries will match semantically similar past queries but return stale or wrong information. Filter these out before even checking the cache, and do not store their responses either, or one user's answer can be served to another.
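One caveat with plain substring matching is that "current" also matches inside "concurrent". A possible refinement, sketched here under the assumption that the same phrase list applies, uses word boundaries and puts the check in front of the cache lookup. `is_cacheable_strict` and `guarded_lookup` are hypothetical names introduced for this example.

```python
import re
from typing import Callable, Optional

# Same phrases as the filter above, compiled with word boundaries so that
# "current" no longer matches inside "concurrent".
NON_CACHEABLE = re.compile(
    r"\b(my account|my order|today|right now|current|this week|my balance)\b",
    re.IGNORECASE,
)

def is_cacheable_strict(query: str) -> bool:
    return NON_CACHEABLE.search(query) is None

def guarded_lookup(query: str, cache_get: Callable[[str], Optional[str]]) -> Optional[str]:
    """Consult the cache only for queries that pass the filter."""
    if not is_cacheable_strict(query):
        return None  # treated as a miss; the caller goes straight to the LLM
    return cache_get(query)
```

In the wrapper from Step 2 this would be called as `guarded_lookup(query, cache.get)`, with the same `is_cacheable_strict(query)` check guarding `cache.set` so filtered queries are never stored.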

Measuring the Impact

python
class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.tokens_saved = 0
    
    def record(self, result: dict, estimated_tokens_if_uncached: int = 500):
        if result["cache_hit"]:
            self.hits += 1
            self.tokens_saved += estimated_tokens_if_uncached
        else:
            self.misses += 1
    
    def report(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total else 0
        print(f"Cache hit rate: {hit_rate:.1%}")
        print(f"Estimated tokens saved: {self.tokens_saved:,}")

For high-traffic, repetitive query patterns (support bots, internal documentation Q&A, FAQ assistants), semantic caching commonly achieves 30-50% hit rates. Each hit skips one LLM call, so API spend for that traffic falls roughly in proportion to the hit rate, minus whatever it costs to compute embeddings and run Redis.
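To turn a measured hit rate into a budget figure, a back-of-the-envelope estimate is usually enough. The sketch below assumes every hit avoids exactly one average-priced LLM call and every request pays a fixed embedding overhead; `estimate_monthly_savings` is a hypothetical helper and all numbers are placeholders, not measurements.

```python
def estimate_monthly_savings(
    requests_per_month: int,
    hit_rate: float,
    avg_llm_cost_per_request: float,
    embedding_cost_per_request: float = 0.0,
) -> dict:
    """Rough estimate: each hit skips one LLM call; each request pays for one embedding."""
    baseline = requests_per_month * avg_llm_cost_per_request
    llm_spend = requests_per_month * (1 - hit_rate) * avg_llm_cost_per_request
    embedding_spend = requests_per_month * embedding_cost_per_request
    net_savings = baseline - llm_spend - embedding_spend
    return {
        "baseline": round(baseline, 2),
        "with_cache": round(llm_spend + embedding_spend, 2),
        "net_savings": round(net_savings, 2),
        "net_savings_pct": round(100 * net_savings / baseline, 1) if baseline else 0.0,
    }

# Placeholder inputs: 100k requests, 40% hit rate, $0.01 per uncached request.
print(estimate_monthly_savings(100_000, hit_rate=0.40, avg_llm_cost_per_request=0.01))
```

With a local embedding model, `embedding_cost_per_request` stands in for compute and infrastructure cost rather than an API charge; setting it above zero shows how overhead eats into the headline hit rate.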

