LLM Conversation Memory with Redis and Vector Stores in Production
Stateless LLMs forget everything between turns. Here's how to implement persistent conversation memory, using Redis for short-term memory and a vector database for long-term memory.
LLMs are stateless — every API call starts fresh. Building a useful chatbot or assistant requires memory: remembering what the user said earlier, their preferences, and context from previous sessions.
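To make the statelessness concrete, here is a minimal sketch that builds the request payloads for two consecutive turns without calling any API. `build_request` is a hypothetical helper, not part of any SDK. Turn 2 only "remembers" turn 1 because the caller resends it.

```python
def build_request(history: list[dict], user_message: str) -> dict:
    """Return the payload for one call: prior turns plus the new user message."""
    return {"messages": history + [{"role": "user", "content": user_message}]}


history: list[dict] = []

# Turn 1: nothing to resend yet
first = build_request(history, "My cluster runs on EKS")

# The caller, not the model, records both sides of the turn
history += first["messages"][-1:] + [{"role": "assistant", "content": "Noted."}]

# Turn 2: the payload carries turn 1 along with the new question
second = build_request(history, "Which cloud am I on?")
```

Every pattern in this article is a different answer to the same question: what goes into `history` before the next call.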
Types of Memory
Working memory (in-context): Include recent messages in every API call. Simple, but limited by context window.
Short-term memory (Redis): Store recent conversation history in Redis. Fast, but expires.
Long-term memory (vector store): Store important facts as embeddings. Semantically searchable. Persists across sessions.
Episodic memory: Store and retrieve complete past conversations by similarity.
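The context-window limit on working memory can be handled by trimming history to a budget before each call. The sketch below is one possible approach, not a definitive implementation: `trim_to_budget` is a hypothetical helper, and the four-characters-per-token estimate is a rough assumption standing in for a real tokenizer.

```python
def trim_to_budget(messages: list[dict], max_tokens: int, chars_per_token: int = 4) -> list[dict]:
    """Keep the newest messages whose estimated token total fits max_tokens."""
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):
        # Crude size estimate (assumption): ~4 characters per token, plus 1 for overhead
        cost = len(msg["content"]) // chars_per_token + 1
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()
    # Drop leading assistant turns so the window starts on a user message
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Walking from the newest message backwards means the most recent context always survives, and older turns fall off first.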
Pattern 1: Simple Redis Buffer
The simplest approach — store the last N messages in Redis, include them in every prompt:
```python
import json

import redis
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
redis_client = redis.Redis(host="redis", port=6379)


class RedisConversationMemory:
    """Rolling buffer of the last N messages for one session, stored as a Redis list."""

    def __init__(self, session_id: str, max_messages: int = 20, ttl_seconds: int = 3600):
        self.session_id = session_id
        self.max_messages = max_messages
        self.ttl = ttl_seconds
        self.key = f"conversation:{session_id}"

    def add_message(self, role: str, content: str):
        message = json.dumps({"role": role, "content": content})
        redis_client.rpush(self.key, message)
        redis_client.ltrim(self.key, -self.max_messages, -1)  # keep last N
        redis_client.expire(self.key, self.ttl)  # TTL is reset on every write

    def get_messages(self) -> list[dict]:
        raw = redis_client.lrange(self.key, 0, -1)
        return [json.loads(m) for m in raw]

    def clear(self):
        redis_client.delete(self.key)
```
```python
def chat(session_id: str, user_message: str) -> str:
    memory = RedisConversationMemory(session_id)

    # Get history and append the new user message
    messages = memory.get_messages()
    messages.append({"role": "user", "content": user_message})

    # Call Claude with full history
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        system="You are a helpful DevOps assistant.",
        messages=messages,
    )
    assistant_message = response.content[0].text

    # Save both turns to memory
    memory.add_message("user", user_message)
    memory.add_message("assistant", assistant_message)
    return assistant_message
```

Usage:
```python
# Turn 1
reply = chat("user-123", "I'm having trouble with my Kubernetes pods")
# "I can help with that. What specifically is happening with your pods?"

# Turn 2
reply = chat("user-123", "They're in CrashLoopBackOff")
# "I see. CrashLoopBackOff usually means the container is starting but failing.
#  Let's check the logs. What namespace are the pods in?"
# ↑ Remembers the previous context about Kubernetes pods
```

Pattern 2: Summary Memory (Compressing Long Conversations)
For long conversations, replace older messages with a model-generated summary and keep only the most recent turns verbatim:
```python
class SummaryConversationMemory:
    """Keeps recent messages verbatim and folds older ones into a running summary."""

    def __init__(self, session_id: str, max_recent: int = 6):
        self.session_id = session_id
        self.max_recent = max_recent
        self.ttl = 7200  # seconds
        self.summary_key = f"summary:{session_id}"
        self.messages_key = f"messages:{session_id}"

    def add_message(self, role: str, content: str):
        message = json.dumps({"role": role, "content": content})
        redis_client.rpush(self.messages_key, message)
        redis_client.expire(self.messages_key, self.ttl)

        # Summarize once the list grows past twice the recent window
        total = redis_client.llen(self.messages_key)
        if total > self.max_recent * 2:
            self._compress()

    def _compress(self):
        """Summarize old messages, keep recent ones."""
        all_messages = [json.loads(m) for m in redis_client.lrange(self.messages_key, 0, -1)]
        old_messages = all_messages[:-self.max_recent]
        recent_messages = all_messages[-self.max_recent:]

        # Keep the retained window starting on a user turn, since the
        # Messages API expects the first message to come from the user
        while recent_messages and recent_messages[0]["role"] != "user":
            old_messages.append(recent_messages.pop(0))
        if not old_messages:
            return

        summary_prompt = "Summarize this conversation, capturing key facts, decisions, and context:\n\n"
        # Fold the previous summary in so earlier context is not overwritten
        previous = redis_client.get(self.summary_key)
        if previous:
            summary_prompt += f"EARLIER SUMMARY: {previous.decode()}\n\n"
        for msg in old_messages:
            summary_prompt += f"{msg['role'].upper()}: {msg['content']}\n\n"

        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=500,
            messages=[{"role": "user", "content": summary_prompt}],
        )
        summary = response.content[0].text

        # Store summary, replace messages with recent only
        redis_client.set(self.summary_key, summary, ex=self.ttl)
        redis_client.delete(self.messages_key)
        for msg in recent_messages:
            redis_client.rpush(self.messages_key, json.dumps(msg))
        # The recreated list has no TTL until it is set again
        redis_client.expire(self.messages_key, self.ttl)

    def get_context(self) -> tuple[str, list[dict]]:
        summary = redis_client.get(self.summary_key)
        summary_text = summary.decode() if summary else ""
        messages = [json.loads(m) for m in redis_client.lrange(self.messages_key, 0, -1)]
        return summary_text, messages


def chat_with_summary(session_id: str, user_message: str) -> str:
    memory = SummaryConversationMemory(session_id)
    summary, messages = memory.get_context()

    system = "You are a helpful DevOps assistant."
    if summary:
        system += f"\n\nConversation history summary:\n{summary}"
    messages.append({"role": "user", "content": user_message})

    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        system=system,
        messages=messages,
    )
    assistant_message = response.content[0].text

    memory.add_message("user", user_message)
    memory.add_message("assistant", assistant_message)
    return assistant_message
```

Pattern 3: Vector Store Long-Term Memory
For remembering user preferences and facts across sessions:
```python
import json
import time

import chromadb
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
chroma = chromadb.HttpClient(host="chromadb", port=8000)
memory_collection = chroma.get_or_create_collection("user_memory")


def extract_and_store_facts(session_id: str, user_id: str, conversation: list[dict]):
    """Extract important facts from conversation and store as embeddings."""
    conversation_text = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    extract_prompt = f"""From this conversation, extract 3-5 important facts about the user
that would be useful to remember in future conversations.
Focus on: preferences, their environment, tech stack, ongoing projects, past issues.

Conversation:
{conversation_text}

Format as a JSON list of strings:
["fact 1", "fact 2", "fact 3"]"""

    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{"role": "user", "content": extract_prompt}],
    )
    try:
        facts = json.loads(response.content[0].text)
    except json.JSONDecodeError:
        return  # reply was not valid JSON; skip this batch
    if not isinstance(facts, list):
        return

    # Store facts as embeddings
    now = int(time.time())
    for i, fact in enumerate(facts):
        if not isinstance(fact, str):
            continue  # ignore anything that is not a plain string
        fact_id = f"{user_id}:{session_id}:{i}:{now}"
        embedding = embedding_model.encode(fact).tolist()
        memory_collection.add(
            ids=[fact_id],
            embeddings=[embedding],
            documents=[fact],
            metadatas=[{"user_id": user_id, "session_id": session_id, "timestamp": now}],
        )
```
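The `json.loads` call above fails whenever the model wraps its answer in prose or a code fence, in which case the whole batch of facts is silently dropped. One possible mitigation, sketched here under the assumption that the reply contains a single JSON array, is to pull the array out before parsing. `parse_fact_list` is a hypothetical helper, not something the pipeline above already uses.

```python
import json
import re


def parse_fact_list(reply: str) -> list[str]:
    """Pull a JSON array of strings out of a model reply; return [] on failure."""
    # Greedy match from the first "[" to the last "]" in the reply
    match = re.search(r"\[.*\]", reply, flags=re.DOTALL)
    if not match:
        return []
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only non-empty strings; anything else is discarded
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]
```

Returning an empty list on any failure keeps the caller simple: it can store whatever comes back without a separate error path.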
```python
def retrieve_relevant_memories(user_id: str, current_message: str, top_k: int = 5) -> list[str]:
    """Retrieve relevant memories for the current conversation."""
    query_embedding = embedding_model.encode(current_message).tolist()
    results = memory_collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where={"user_id": user_id},  # only this user's facts
    )
    if not results["documents"] or not results["documents"][0]:
        return []
    return results["documents"][0]
```
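Conceptually, the query above ranks stored facts by how close their embeddings are to the query embedding. The toy sketch below shows that ranking with cosine similarity over hand-made three-dimensional vectors. The vectors and the `top_k` helper are illustrative assumptions, and the distance metric a real collection uses depends on how it is configured.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda doc: cosine_similarity(query, store[doc]), reverse=True)
    return ranked[:k]


# Made-up embeddings: real ones from all-MiniLM-L6-v2 have 384 dimensions
store = {
    "uses EKS": [0.9, 0.1, 0.0],
    "prefers Helm": [0.1, 0.9, 0.0],
    "likes tea": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.2, 0.0], store, k=2)
```

A vector database does the same ranking with an approximate index, so it stays fast when the store holds far more than three facts.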
```python
def chat_with_long_term_memory(user_id: str, session_id: str, user_message: str) -> str:
    # Get short-term history
    memory = RedisConversationMemory(session_id)
    short_term_messages = memory.get_messages()

    # Get relevant long-term memories
    memories = retrieve_relevant_memories(user_id, user_message)
    system = "You are a helpful DevOps assistant with memory of past conversations."
    if memories:
        memory_text = "\n".join(f"- {m}" for m in memories)
        system += f"\n\nWhat you remember about this user:\n{memory_text}"

    messages = short_term_messages + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-opus-4-8",
        max_tokens=1000,
        system=system,
        messages=messages,
    )
    assistant_message = response.content[0].text

    # Update short-term memory
    memory.add_message("user", user_message)
    memory.add_message("assistant", assistant_message)

    # Extract long-term facts every 5 turns (10 messages). A Redis counter is
    # used because len(messages) is always odd here and stops growing once the
    # buffer hits max_messages. The "turns:" key name is an illustrative choice.
    turns_key = f"turns:{session_id}"
    turn_count = redis_client.incr(turns_key)
    redis_client.expire(turns_key, memory.ttl)
    if turn_count % 5 == 0:
        extract_and_store_facts(session_id, user_id, memory.get_messages()[-10:])
    return assistant_message
```

Kubernetes Deployment
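```yaml
# Services named "redis" and "chromadb" are assumed to exist for the hostnames below.
# selector/labels were added so the manifests validate; the label values are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-chat-service
spec:
  selector:
    matchLabels:
      app: llm-chat-service
  template:
    metadata:
      labels:
        app: llm-chat-service
    spec:
      containers:
        - name: api
          image: llm-chat:latest
          env:
            - name: REDIS_URL
              value: "redis://redis:6379"
            - name: CHROMADB_URL
              value: "http://chromadb:8000"
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ai-secrets
                  key: anthropic_api_key
---
# Redis for short-term memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args: ["--appendonly", "yes", "--maxmemory", "2gb", "--maxmemory-policy", "allkeys-lru"]
---
# ChromaDB for vector memory
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chromadb
spec:
  selector:
    matchLabels:
      app: chromadb
  template:
    metadata:
      labels:
        app: chromadb
    spec:
      containers:
        - name: chromadb
          image: chromadb/chroma:latest
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: chroma-data
              mountPath: /chroma/chroma
      volumes:
        - name: chroma-data
          persistentVolumeClaim:
            claimName: chroma-data  # assumed PVC name; create it separately
```

Memory Architecture Decision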
| Scenario | Best Memory Pattern |
|---|---|
| Simple chatbot, single session | Redis buffer |
| Long conversations (> 20 turns) | Redis + summary compression |
| Multi-session user history | Vector store (ChromaDB/Qdrant) |
| Enterprise: compliance, audit trails | Relational DB (PostgreSQL) |
| Highest quality recall | All three (buffer + summary + vector) |
Start with the Redis buffer — it handles 90% of use cases. Add vector memory when users complain "you forgot what I told you last week."
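When all three patterns are combined, as in the last row of the table, the recent buffer goes into `messages` while the summary and retrieved facts go into the system prompt. The helper below is a minimal sketch of that assembly step. `build_system_prompt` is a hypothetical name, and the section labels simply reuse the strings from the earlier patterns.

```python
from typing import Optional


def build_system_prompt(base: str, summary: str = "", memories: Optional[list[str]] = None) -> str:
    """Append the optional summary and long-term memories to the base system prompt."""
    parts = [base]
    if summary:
        parts.append(f"Conversation history summary:\n{summary}")
    if memories:
        bullet_list = "\n".join(f"- {m}" for m in memories)
        parts.append(f"What you remember about this user:\n{bullet_list}")
    return "\n\n".join(parts)
```

Keeping the assembly in one function makes it easy to start with only the base prompt and switch on the summary or vector memories later without touching the chat loop.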
Resources: ChromaDB | Sentence Transformers | Anthropic API