Production LLM Observability — Traces, Costs, Latency with OpenTelemetry
Running LLMs in production without observability is flying blind. Here's how to instrument your LLM calls with OpenTelemetry to track traces, costs, latency, and quality metrics.
You've deployed an LLM feature. Now you need to answer: Which model is slowest? Which prompts are most expensive? What's the error rate? Where are the latency spikes?
Without observability, you're guessing.
What to Measure for LLMs
Standard web metrics aren't enough. LLM apps need additional dimensions:
Standard:
├── Latency (time to first token, total response time)
├── Error rate (API failures, timeouts)
└── Throughput (requests/second)
LLM-specific:
├── Token counts (input, output, cached)
├── Cost per request ($)
├── Model used (claude-sonnet, gpt-4o, llama-3)
├── Prompt version (to A/B test prompts)
├── Cache hit rate (prompt caching effectiveness)
└── Quality scores (if you have human feedback)
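To make a few of these dimensions concrete, here is a minimal sketch of how cost per request, cache hit rate, and time to first token could be computed for a single call. The `CallRecord` class, the `time_stream` helper, and the price constants are hypothetical names and values chosen for illustration. The stream is simulated with a plain list of text chunks rather than a real provider SDK.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Hypothetical per-token prices in USD; replace with your provider's rates.
PRICE_INPUT = 3.0 / 1_000_000
PRICE_OUTPUT = 15.0 / 1_000_000
PRICE_CACHE_READ = 0.30 / 1_000_000


@dataclass
class CallRecord:
    """One LLM request, described by the dimensions listed above."""
    model: str
    prompt_version: str
    input_tokens: int        # uncached input tokens
    output_tokens: int
    cached_tokens: int       # input tokens served from the prompt cache
    ttft_s: Optional[float]  # time to first token; None if nothing streamed
    total_s: float           # total response time

    @property
    def cost_usd(self) -> float:
        return (
            self.input_tokens * PRICE_INPUT
            + self.output_tokens * PRICE_OUTPUT
            + self.cached_tokens * PRICE_CACHE_READ
        )

    @property
    def cache_hit_rate(self) -> float:
        # Share of all input tokens that came from the cache
        total_input = self.input_tokens + self.cached_tokens
        return self.cached_tokens / total_input if total_input else 0.0


def time_stream(chunks: Iterable[str]) -> Tuple[str, Optional[float], float]:
    """Consume a stream of text chunks; return (text, ttft_s, total_s)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk arrived
        parts.append(chunk)
    return "".join(parts), ttft, time.perf_counter() - start


# Simulated stream standing in for a real streaming response
text, ttft, total = time_stream(iter(["Hello", ", ", "world"]))
record = CallRecord("claude-sonnet", "v1", 245, 189, 200, ttft, total)
print(f"cost=${record.cost_usd:.6f} cache_hit_rate={record.cache_hit_rate:.1%}")
```

Keeping cached tokens separate from uncached input tokens matters because the two are usually billed at different rates, and the hit rate only makes sense as a share of their sum.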
OpenTelemetry for LLMs
The OpenTelemetry GenAI semantic conventions define standard attributes for LLM traces:
# Standard OTel GenAI attributes
gen_ai.system = "anthropic" # Provider
gen_ai.request.model = "claude-sonnet-4-6"
gen_ai.request.max_tokens = 1000
gen_ai.response.finish_reason = "end_turn"
gen_ai.usage.input_tokens = 245
gen_ai.usage.output_tokens = 189
gen_ai.usage.input_tokens_cached = 200 # Prompt cache hits (custom attribute used in this article)
Setup: Full Observability Stack
pip install opentelemetry-sdk opentelemetry-api \
  opentelemetry-exporter-otlp-proto-grpc \
  anthropic prometheus-client
Core Instrumentation
# llm_tracer.py
import time
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import anthropic
# Setup tracing
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("llm-service")

# Setup metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317"),
    export_interval_millis=10000,  # export every 10 seconds
)
meter_provider = MeterProvider(metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)  # register globally, mirroring the tracer setup
meter = meter_provider.get_meter("llm-service")

# Define metrics
llm_request_duration = meter.create_histogram(
    "gen_ai.client.operation.duration",
    unit="s",
    description="Duration of LLM API calls"
)
llm_input_tokens = meter.create_counter(
    "gen_ai.client.token.usage",
    unit="{token}",
    description="Number of tokens used"
)
llm_cost_counter = meter.create_counter(
    "gen_ai.client.cost",
    unit="$",
    description="Estimated cost in USD"
)

# Token costs in USD per token (update as pricing changes, and verify
# against the provider's current price list)
TOKEN_COSTS = {
    "claude-sonnet-4-6": {
        "input": 3.0 / 1_000_000,
        "output": 15.0 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
    },
    "claude-haiku-4-5-20251001": {
        "input": 1.0 / 1_000_000,
        "output": 5.0 / 1_000_000,
        "cache_read": 0.10 / 1_000_000,
    },
}

# Request and error counters, used by the error-rate queries and alerts
llm_request_counter = meter.create_counter(
    "gen_ai.client.requests",
    unit="{request}",
    description="Number of LLM API calls attempted"
)
llm_error_counter = meter.create_counter(
    "gen_ai.client.errors",
    unit="{error}",
    description="Number of failed LLM API calls"
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def instrumented_llm_call(
    messages: list,
    model: str = "claude-sonnet-4-6",
    system: str = "",
    max_tokens: int = 1024,
    feature: str = "unknown",
    prompt_version: str = "v1"
) -> str:
    """Make an LLM call with full observability."""
    # Attributes shared by every metric recorded for this call
    attrs = {
        "gen_ai.system": "anthropic",
        "gen_ai.request.model": model,
        "feature": feature,
    }
    with tracer.start_as_current_span(f"gen_ai.{model}") as span:
        # Set span attributes (OTel GenAI conventions)
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        span.set_attribute("feature", feature)
        span.set_attribute("prompt_version", prompt_version)
        llm_request_counter.add(1, attrs)
        start_time = time.perf_counter()  # monotonic clock, safe for durations
        try:
            response = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                system=system,
                messages=messages
            )
            duration = time.perf_counter() - start_time
            # Token usage (the cache field may be missing or None)
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens
            cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
            # Calculate cost; unknown models fall back to Sonnet pricing
            costs = TOKEN_COSTS.get(model, TOKEN_COSTS["claude-sonnet-4-6"])
            cost = (
                input_tokens * costs["input"] +
                output_tokens * costs["output"] +
                cache_read * costs["cache_read"]
            )
            # Set span attributes
            span.set_attribute("gen_ai.response.finish_reason", response.stop_reason or "")
            span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
            span.set_attribute("gen_ai.usage.input_tokens_cached", cache_read)
            span.set_attribute("gen_ai.cost_usd", cost)
            # Record metrics
            llm_request_duration.record(duration, attrs)
            llm_input_tokens.add(input_tokens, {**attrs, "gen_ai.token.type": "input"})
            llm_input_tokens.add(output_tokens, {**attrs, "gen_ai.token.type": "output"})
            # Cached tokens get their own type so the cache hit rate can be queried
            llm_input_tokens.add(cache_read, {**attrs, "gen_ai.token.type": "cached"})
            llm_cost_counter.add(cost, attrs)
            return response.content[0].text
        except Exception as e:
            llm_error_counter.add(1, attrs)
            span.set_attribute("error", True)
            span.set_attribute("error.message", str(e))
            # Re-raising lets the span context manager record the exception
            # and mark the span status as ERROR.
            raise
Grafana Dashboard Queries
With metrics flowing to Prometheus (exact metric names depend on how your collector translates OTel names and units, so check them before copying these queries):
# P99 latency by model
histogram_quantile(0.99,
  sum by (le, gen_ai_request_model) (
    rate(gen_ai_client_operation_duration_bucket[5m])
  )
)
# Cost per hour by feature
sum(rate(gen_ai_client_cost_total[1h])) by (feature) * 3600
# Cache hit rate (cached tokens as a share of all input tokens)
sum(rate(gen_ai_client_token_usage_total{gen_ai_token_type="cached"}[5m]))
/
sum(rate(gen_ai_client_token_usage_total{gen_ai_token_type=~"input|cached"}[5m]))
# Error rate
sum(rate(gen_ai_client_errors_total[5m])) / sum(rate(gen_ai_client_requests_total[5m]))
Alerts
# Prometheus alerting rules
groups:
  - name: llm_alerts
    rules:
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(gen_ai_client_operation_duration_bucket[5m]))) > 10
        annotations:
          summary: "LLM P95 latency > 10s"
      - alert: LLMHighCost
        expr: sum(increase(gen_ai_client_cost_total[1h])) > 50
        annotations:
          summary: "LLM spending > $50/hour"
      - alert: LLMHighErrorRate
        expr: sum(rate(gen_ai_client_errors_total[5m])) / sum(rate(gen_ai_client_requests_total[5m])) > 0.05
        annotations:
          summary: "LLM error rate > 5%"
Add Langfuse for Quality Tracking
For LLM-specific quality metrics (not just infra):
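# This follows the trace()/generation() style of the Langfuse Python SDK v2;
# newer SDK versions may expose a different interface, so check your installed version.
import os
from langfuse import Langfuse
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
)
# Create trace (user_id, model, prompt, response_text, and the token counts
# come from the surrounding request handler). The variable is named lf_trace
# so it does not shadow the OpenTelemetry trace module.
lf_trace = langfuse.trace(name="code-review", user_id=user_id)
generation = lf_trace.generation(
    name="review-generation",
    model=model,
    input=prompt,
    output=response_text,
    usage={"input": input_tokens, "output": output_tokens},
    metadata={"prompt_version": "v2", "feature": "code-review"}
)
# Add human feedback when available
generation.score(name="quality", value=4, comment="Good explanation")
Langfuse gives you: prompt performance over time, A/B test results, human feedback scores, and cost per trace.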
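As a rough illustration of what comparing prompt versions by feedback score involves, the sketch below groups scores by prompt version and reports the count and mean for each. It is plain Python over hypothetical data, not a Langfuse API; the `summarize_scores` function and the sample `feedback` list are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def summarize_scores(
    scores: Iterable[Tuple[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Group (prompt_version, score) pairs; return count and mean per version."""
    by_version = defaultdict(list)
    for version, score in scores:
        by_version[version].append(score)
    return {
        version: {"count": len(values), "mean": mean(values)}
        for version, values in by_version.items()
    }


# Hypothetical feedback: (prompt_version, quality score on a 1-5 scale)
feedback = [("v1", 3), ("v1", 4), ("v2", 4), ("v2", 5), ("v2", 5)]
summary = summarize_scores(feedback)
for version in sorted(summary):
    print(version, summary[version])
```

With only a handful of scores per version the means are noisy, so it is worth reporting the count next to each mean before declaring a winner.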
Instrumented LLM calls give you answers to the questions that matter: which prompts cost too much, which models are slow, where errors come from. Build this from day one — retrofitting observability is always harder.
For managed observability infrastructure, Grafana Cloud has a generous free tier for metrics, logs, and traces.