Semantic Caching for LLM APIs with Redis — Cut Costs Without Cutting Quality
Exact-match caching misses most repeat LLM queries because users phrase things differently. Semantic caching with embeddings + Redis catches near-duplicate questions and can cut your LLM API bill significantly.
A standard cache keys on exact string match. For LLM APIs, that's nearly useless: "How do I reset my password?" and "I forgot my password, how do I change it?" are the same question to a human, but two different keys to an exact-match cache. Semantic caching solves this by keying on meaning rather than exact text.
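To make the failure concrete, here is a minimal sketch of an exact-match cache keyed on a hash of the normalized query. The `exact_key` helper, the `exact_cache` dict, and the cached answer text are hypothetical names and values used only for illustration.

```python
import hashlib

def exact_key(query: str) -> str:
    # Normalize case and whitespace, then hash: about the best an exact-match cache can do
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

exact_cache = {}
exact_cache[exact_key("How do I reset my password?")] = "Use the reset link on the login page."

# Same wording with different case and spacing still hits
hit = exact_cache.get(exact_key("how do I reset  my password?"))
# A paraphrase of the same question misses
miss = exact_cache.get(exact_key("I forgot my password, how do I change it?"))

print("normalized repeat hit:", hit is not None)
print("paraphrase hit:", miss is not None)
```

Normalization rescues trivial variations, but any real rephrasing produces a new key and a fresh LLM call.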
How It Works
User query → Embed query → Search Redis vector index for similar past queries
    ↓
Similarity ≥ threshold?
    Yes → Return cached response
    No  → Call LLM, cache the new response
The key design decision is the similarity threshold — too loose and you return wrong answers for genuinely different questions; too strict and you barely cache anything.
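The flow and the threshold trade-off can be sketched without Redis at all. The block below is a brute-force, in-memory stand-in, not the Redis-backed cache: `InMemorySemanticCache`, `toy_embed`, and the two-dimensional vectors in `TOY_VECTORS` are made up for illustration and are not real model embeddings.

```python
import math
from typing import Callable, List, Optional, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class InMemorySemanticCache:
    """Brute-force stand-in for a vector index (illustration only)."""

    def __init__(self, embed: Callable[[str], List[float]], threshold: float = 0.92):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[List[float], str]] = []

    def get(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        vector = self.embed(query)
        # Nearest neighbour by cosine similarity, then apply the threshold
        best_vec, best_response = max(
            self.entries, key=lambda entry: cosine_similarity(vector, entry[0])
        )
        if cosine_similarity(vector, best_vec) >= self.threshold:
            return best_response
        return None

    def set(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

# Hand-picked toy vectors, NOT output from a real embedding model
TOY_VECTORS = {
    "How do I reset my password?": [1.0, 0.0],
    "I forgot my password, how do I change it?": [0.98, 0.2],
    "How do I delete my account?": [0.6, 0.8],
}

def toy_embed(text: str) -> List[float]:
    return TOY_VECTORS[text]

strict = InMemorySemanticCache(toy_embed, threshold=0.92)
strict.set("How do I reset my password?", "Use the reset link on the login page.")
print(strict.get("I forgot my password, how do I change it?"))  # paraphrase
print(strict.get("How do I delete my account?"))                # different question

# Same data with a loose threshold: the unrelated question now gets the wrong answer
loose = InMemorySemanticCache(toy_embed, threshold=0.5)
loose.set("How do I reset my password?", "Use the reset link on the login page.")
print(loose.get("How do I delete my account?"))
```

With these toy vectors the paraphrase clears a 0.92 threshold and the account-deletion question does not, while a 0.5 threshold serves the password answer to both.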
Setup
pip install redis redisvl anthropic sentence-transformers
You need Redis with the RediSearch module (or Redis Stack, which includes it by default).
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest
Step 1: Build the Semantic Cache
import hashlib
import time

import numpy as np
from redis import Redis
from redisvl.index import SearchIndex
from redisvl.query import VectorQuery
from redisvl.schema import IndexSchema
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # fast, 384-dim, good enough for caching

schema = IndexSchema.from_dict({
    "index": {"name": "llm_cache", "prefix": "cache"},
    "fields": [
        {"name": "query_text", "type": "text"},
        {"name": "response", "type": "text"},
        {"name": "created_at", "type": "numeric"},
        {
            "name": "embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "hnsw",
                "datatype": "float32",
            },
        },
    ],
})

redis_client = Redis(host="localhost", port=6379)
index = SearchIndex(schema, redis_client)
index.create(overwrite=False)


class SemanticCache:
    def __init__(self, similarity_threshold: float = 0.92, ttl_seconds: int = 86400):
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds

    def get(self, query: str) -> str | None:
        # redisvl runs searches through a query object, not keyword arguments
        vector_query = VectorQuery(
            vector=embedder.encode(query).tolist(),
            vector_field_name="embedding",
            num_results=1,
            return_fields=["response", "query_text"],
        )
        results = index.query(vector_query)
        if not results:
            return None
        top = results[0]
        similarity = 1 - float(top["vector_distance"])  # cosine distance to similarity
        if similarity >= self.threshold:
            return top["response"]
        return None

    def set(self, query: str, response: str) -> None:
        # Hash storage expects the vector as raw float32 bytes
        embedding = embedder.encode(query).astype(np.float32).tobytes()
        key = f"cache:{hashlib.sha256(query.encode()).hexdigest()[:16]}"
        # Pass the key explicitly so the TTL below targets the entry we just wrote
        index.load([{
            "query_text": query,
            "response": response,
            "created_at": time.time(),
            "embedding": embedding,
        }], keys=[key])
        redis_client.expire(key, self.ttl)
Step 2: Wrap Your LLM Calls
from anthropic import Anthropic

client = Anthropic()
cache = SemanticCache(similarity_threshold=0.92, ttl_seconds=86400)


def ask_with_cache(query: str) -> dict:
    cached_response = cache.get(query)
    if cached_response:
        return {"response": cached_response, "cache_hit": True, "cost": 0}

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": query}]
    )
    answer = response.content[0].text
    cache.set(query, answer)
    return {
        "response": answer,
        "cache_hit": False,
        "tokens": response.usage.input_tokens + response.usage.output_tokens
    }
Picking the Right Threshold
This is the part teams get wrong. A threshold that is too low returns cached answers to questions that only look similar; one that is too high rarely hits and defeats the purpose.
import numpy as np

# Test your threshold against known query pairs before deploying
test_pairs = [
    ("How do I reset my password?", "I forgot my password, how do I change it?", True),  # should match
    ("How do I reset my password?", "How do I delete my account?", False),  # should NOT match
    ("What's the pricing for the pro plan?", "How much does pro cost?", True),  # should match
]

for q1, q2, should_match in test_pairs:
    e1, e2 = embedder.encode(q1), embedder.encode(q2)
    similarity = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    print(f"{q1[:30]}... vs {q2[:30]}...: {similarity:.3f} (expected match: {should_match})")
Run this against 30-50 real query pairs from your actual traffic before picking a production threshold. 0.90-0.93 is a reasonable starting range for support/FAQ-style queries, but it's domain dependent. For queries where a small wording difference changes the answer meaningfully (legal, medical, financial), push the threshold higher or skip semantic caching entirely for that use case.
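Once you have similarity scores for labeled pairs, a small sweep shows what each candidate threshold would cost you. This is a rough sketch: `sweep_thresholds` is a hypothetical helper, and the scores in `scored_pairs` are placeholders, not measured values. Replace them with the numbers printed by the loop above.

```python
from typing import List, Tuple

def sweep_thresholds(
    scored_pairs: List[Tuple[float, bool]], thresholds: List[float]
) -> List[Tuple[float, int, int]]:
    """For each threshold, count wrong cache hits and missed cache hits.

    scored_pairs holds (similarity, should_match) tuples.
    """
    rows = []
    for t in thresholds:
        # Pairs that would be served from cache but ask different questions
        false_hits = sum(1 for sim, should in scored_pairs if sim >= t and not should)
        # Pairs that ask the same question but would still call the LLM
        missed_hits = sum(1 for sim, should in scored_pairs if sim < t and should)
        rows.append((t, false_hits, missed_hits))
    return rows

# Placeholder scores for illustration only, NOT real embedding output
scored_pairs = [(0.95, True), (0.91, True), (0.88, False), (0.70, False)]

for t, false_hits, missed_hits in sweep_thresholds(scored_pairs, [0.85, 0.90, 0.93]):
    print(f"threshold {t:.2f}: {false_hits} wrong hits, {missed_hits} missed hits")
```

Wrong hits are usually the more expensive error, since they return an incorrect answer rather than just costing one extra API call, so it is reasonable to pick the lowest threshold that keeps that count at zero on your sample.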
What NOT to Cache
def is_cacheable(query: str) -> bool:
    # Never semantically cache queries with user-specific or time-sensitive context
    non_cacheable_patterns = [
        "my account", "my order", "today", "right now",
        "current", "this week", "my balance"
    ]
    return not any(p in query.lower() for p in non_cacheable_patterns)
Personalized or time-sensitive queries will match semantically similar past queries but return stale or wrong information. Filter these out before even checking the cache.
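One way to wire that filter in front of the cache is sketched below. `gated_ask` is a hypothetical wrapper, and the cache and LLM are passed in as plain callables so the example runs without Redis or an API key; `is_cacheable` is repeated only so the block is self-contained.

```python
from typing import Callable, Optional

NON_CACHEABLE_PATTERNS = [
    "my account", "my order", "today", "right now",
    "current", "this week", "my balance",
]

def is_cacheable(query: str) -> bool:
    return not any(p in query.lower() for p in NON_CACHEABLE_PATTERNS)

def gated_ask(
    query: str,
    cache_get: Callable[[str], Optional[str]],
    cache_set: Callable[[str, str], None],
    call_llm: Callable[[str], str],
) -> dict:
    # Personalized or time-sensitive queries bypass the cache in both directions
    if not is_cacheable(query):
        return {"response": call_llm(query), "cache_hit": False, "cached": False}
    cached = cache_get(query)
    if cached is not None:
        return {"response": cached, "cache_hit": True, "cached": True}
    response = call_llm(query)
    cache_set(query, response)
    return {"response": response, "cache_hit": False, "cached": True}

# Stand-ins for the real cache and LLM client (assumptions for the demo)
store = {}
llm_calls = []

def fake_llm(query: str) -> str:
    llm_calls.append(query)
    return f"answer to: {query}"

r1 = gated_ask("What is my balance today?", store.get, store.__setitem__, fake_llm)
r2 = gated_ask("How do I reset my password?", store.get, store.__setitem__, fake_llm)
r3 = gated_ask("How do I reset my password?", store.get, store.__setitem__, fake_llm)
print(r1["cached"], r2["cache_hit"], r3["cache_hit"], len(llm_calls))
```

Skipping the write as well as the read matters: if a personalized answer is ever stored, a later generic query could match it.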
Measuring the Impact
class CacheMetrics:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.tokens_saved = 0

    def record(self, result: dict, estimated_tokens_if_uncached: int = 500):
        if result["cache_hit"]:
            self.hits += 1
            self.tokens_saved += estimated_tokens_if_uncached
        else:
            self.misses += 1

    def report(self):
        total = self.hits + self.misses
        hit_rate = self.hits / total if total else 0
        print(f"Cache hit rate: {hit_rate:.1%}")
        print(f"Estimated tokens saved: {self.tokens_saved:,}")
For high-traffic, repetitive query patterns (support bots, internal documentation Q&A, FAQ assistants), semantic caching commonly achieves 30-50% hit rates, which translates to a roughly comparable reduction in LLM API spend for that traffic. The exact savings depend on how many tokens the cached calls would have used.
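To turn a hit rate into a dollar figure, a back-of-the-envelope estimate like the following can help. It is a sketch under simplifying assumptions: a single blended price for input and output tokens, and cached calls that would have used the average token count. `estimate_monthly_savings` is a hypothetical helper, and every number in the example call is a placeholder, not any provider's real price or traffic.

```python
def estimate_monthly_savings(
    requests_per_month: int,
    hit_rate: float,
    avg_tokens_per_call: int,
    price_per_million_tokens: float,
) -> float:
    """Rough LLM spend avoided per month, ignoring embedding and Redis costs."""
    tokens_avoided = requests_per_month * hit_rate * avg_tokens_per_call
    return tokens_avoided / 1_000_000 * price_per_million_tokens

# Placeholder inputs: substitute your own traffic, measured hit rate, and pricing
savings = estimate_monthly_savings(
    requests_per_month=100_000,
    hit_rate=0.40,
    avg_tokens_per_call=500,
    price_per_million_tokens=5.0,  # assumed blended price, not a real quote
)
print(f"Estimated monthly savings: ${savings:,.2f}")
```

For a tighter estimate, store the actual token count alongside each cached response and add it up on every hit instead of relying on a fixed average.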