Production LLM Observability — Traces, Costs, Latency with OpenTelemetry

Running LLMs in production without observability is flying blind. Here's how to instrument your LLM calls with OpenTelemetry to track traces, costs, latency, and quality metrics.

You've deployed an LLM feature. Now you need to answer: Which model is slowest? Which prompts are most expensive? What's the error rate? Where are the latency spikes?

Without observability, you're guessing.

What to Measure for LLMs

Standard web metrics aren't enough. LLM apps need additional dimensions:

Standard:
├── Latency (time to first token, total response time)
├── Error rate (API failures, timeouts)
└── Throughput (requests/second)

LLM-specific:
├── Token counts (input, output, cached)
├── Cost per request ($)
├── Model used (claude-sonnet, gpt-4o, llama-3)
├── Prompt version (to A/B test prompts)
├── Cache hit rate (prompt caching effectiveness)
└── Quality scores (if you have human feedback)
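
Time to first token only exists for streaming responses, so it has to be measured around the stream itself rather than around a single blocking call. The sketch below times any iterable of text chunks; `measure_stream` and `fake_stream` are hypothetical names for illustration, and a real streaming client would need a thin wrapper that yields its text deltas.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[str, float, float]:
    """Consume a stream of text chunks and time it.

    Returns (full_text, time_to_first_token_s, total_time_s).
    """
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for chunk in chunks:
        # The first non-empty chunk marks the time to first token.
        if first_token_at is None and chunk:
            first_token_at = time.perf_counter()
        parts.append(chunk)
    end = time.perf_counter()
    # If the stream produced no text, fall back to the total time.
    ttft = (first_token_at if first_token_at is not None else end) - start
    return "".join(parts), ttft, end - start


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(0.05)  # simulated delay before the first token
    yield "Hello"
    time.sleep(0.02)  # simulated delay between chunks
    yield ", world"


text, ttft, total = measure_stream(fake_stream())
print(f"text={text!r} ttft={ttft:.3f}s total={total:.3f}s")
```

Recording both numbers matters because they fail differently: a slow first token usually points at queueing or long prompts, while a slow total with a fast first token points at long outputs.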

OpenTelemetry for LLMs

The OpenTelemetry GenAI semantic conventions define standard attributes for LLM traces:

python

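# OTel GenAI span attributes, shown as a plain dict.
# The conventions are still evolving, so check names against the version you target.
genai_attributes = {
    "gen_ai.system": "anthropic",                  # Provider
    "gen_ai.request.model": "claude-sonnet-4-6",
    "gen_ai.request.max_tokens": 1000,
    "gen_ai.response.finish_reasons": ["end_turn"],  # Array, one entry per choice
    "gen_ai.usage.input_tokens": 245,
    "gen_ai.usage.output_tokens": 189,
    # Custom attribute used in this article for prompt cache hits; check the
    # current conventions for a standardized cache-token attribute.
    "gen_ai.usage.input_tokens_cached": 200,
}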

Setup: Full Observability Stack

bash

pip install opentelemetry-sdk opentelemetry-api \
  opentelemetry-exporter-otlp-proto-grpc \
  anthropic prometheus-client

Core Instrumentation

python

# llm_tracer.py
import time
import os
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
import anthropic
 
# Setup tracing
trace_provider = TracerProvider()
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(trace_provider)
tracer = trace.get_tracer("llm-service")
 
# Setup metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317"),
    export_interval_millis=10000
)
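meter_provider = MeterProvider(metric_readers=[metric_reader])
# Register globally so other modules can obtain meters from the same provider
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("llm-service")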
 
# Define metrics
llm_request_duration = meter.create_histogram(
    "gen_ai.client.operation.duration",
    unit="s",
    description="Duration of LLM API calls"
)
 
llm_input_tokens = meter.create_counter(
    "gen_ai.client.token.usage",
    unit="{token}",
    description="Number of tokens used"
)
 
llm_cost_counter = meter.create_counter(
    "gen_ai.client.cost",
    unit="$",
    description="Estimated cost in USD"
)
 
# Token costs in USD per token, derived from list prices per million tokens.
# Verify against the provider's current pricing page and update as pricing changes.
TOKEN_COSTS = {
    "claude-sonnet-4-6": {
        "input": 3.0 / 1_000_000,
        "output": 15.0 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
    },
    "claude-haiku-4-5-20251001": {
        "input": 1.0 / 1_000_000,
        "output": 5.0 / 1_000_000,
        "cache_read": 0.10 / 1_000_000,
    },
}
 
client = anthropic.Anthropic()
 
def instrumented_llm_call(
    messages: list,
    model: str = "claude-sonnet-4-6",
    system: str = "",
    max_tokens: int = 1024,
    feature: str = "unknown",
    prompt_version: str = "v1"
) -> str:
    """Make an LLM call with full observability."""
    
    # Span name follows the GenAI convention "{operation} {model}"
    with tracer.start_as_current_span(f"chat {model}") as span:
        # Set span attributes (OTel GenAI conventions)
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        # Application-specific attributes (not part of the conventions)
        span.set_attribute("feature", feature)
        span.set_attribute("prompt_version", prompt_version)
        
        start_time = time.time()
        
        try:
            response = client.messages.create(
                model=model,
                max_tokens=max_tokens,
                system=system,
                messages=messages
            )
            
            duration = time.time() - start_time
            
            # Token usage
            input_tokens = response.usage.input_tokens
            output_tokens = response.usage.output_tokens
            # The field can be present but None, so coerce to 0 before doing arithmetic
            cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
            
            # Calculate cost
            costs = TOKEN_COSTS.get(model, TOKEN_COSTS["claude-sonnet-4-6"])
            cost = (
                input_tokens * costs["input"] +
                output_tokens * costs["output"] +
                cache_read * costs["cache_read"]
            )
            
            # Set span attributes (finish_reasons is an array in the GenAI conventions)
            if response.stop_reason:
                span.set_attribute("gen_ai.response.finish_reasons", [response.stop_reason])
            span.set_attribute("gen_ai.usage.input_tokens", input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", output_tokens)
            span.set_attribute("gen_ai.usage.input_tokens_cached", cache_read)
            span.set_attribute("gen_ai.cost_usd", cost)
            
            # Record metrics
            attrs = {
                "gen_ai.system": "anthropic",
                "gen_ai.request.model": model,
                "feature": feature,
            }
            
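            llm_request_duration.record(duration, attrs)
            llm_input_tokens.add(input_tokens, {**attrs, "gen_ai.token.type": "input"})
            llm_input_tokens.add(output_tokens, {**attrs, "gen_ai.token.type": "output"})
            # "cached" is a custom token type, recorded so cache hit rate can be queried
            llm_input_tokens.add(cache_read, {**attrs, "gen_ai.token.type": "cached"})
            llm_cost_counter.add(cost, attrs)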
            
            return response.content[0].text
            
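        except Exception as e:
            # start_as_current_span records the exception and sets ERROR status
            # when it propagates; add error.type and count the failed call too.
            error_type = type(e).__name__
            span.set_attribute("error.type", error_type)
            llm_request_duration.record(
                time.time() - start_time,
                {"gen_ai.system": "anthropic", "gen_ai.request.model": model,
                 "feature": feature, "error.type": error_type},
            )
            raise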
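
To sanity-check the cost arithmetic outside the tracer, here is a minimal standalone sketch using the token counts from the attribute example earlier (245 input, 189 output, 200 cache reads). `PRICES_PER_MTOK` and `estimate_cost_usd` are hypothetical names, and the prices are illustrative values in USD per million tokens, not authoritative pricing.

```python
# Illustrative prices in USD per million tokens (assumption; check current pricing).
PRICES_PER_MTOK = {"input": 3.00, "output": 15.00, "cache_read": 0.30}


def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      cache_read_tokens: int = 0,
                      prices: dict = PRICES_PER_MTOK) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        input_tokens * prices["input"]
        + output_tokens * prices["output"]
        + cache_read_tokens * prices["cache_read"]
    ) / 1_000_000


# Request where 200 prompt tokens were served from the prompt cache.
cost = estimate_cost_usd(input_tokens=245, output_tokens=189, cache_read_tokens=200)

# The same request if those 200 tokens had been billed as regular input.
uncached = estimate_cost_usd(input_tokens=245 + 200, output_tokens=189)

print(f"with cache: ${cost:.6f}  without cache: ${uncached:.6f}")
```

Per-request costs are tiny, which is why the counter is worth aggregating by feature and by hour rather than reading span by span.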

Grafana Dashboard Queries

With metrics flowing to Prometheus:

promql

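# Metric names assume the exporter does not append unit suffixes. With default
# Collector settings the histogram may instead be exposed as
# gen_ai_client_operation_duration_seconds_bucket.

# P99 latency by model
histogram_quantile(0.99,
  sum by (le, gen_ai_request_model) (
    rate(gen_ai_client_operation_duration_bucket[5m])
  )
)
 
# Cost per hour by feature
sum(rate(gen_ai_client_cost_total[1h])) by (feature) * 3600
 
# Cache hit rate: cached tokens as a share of all prompt tokens.
# Assumes cache reads are recorded with gen_ai_token_type="cached" and that
# the "input" count excludes them, as in Anthropic's usage reporting.
sum(rate(gen_ai_client_token_usage_total{gen_ai_token_type="cached"}[5m]))
/
sum(rate(gen_ai_client_token_usage_total{gen_ai_token_type=~"input|cached"}[5m]))
 
# Error rate. Assumes failed calls record the duration histogram with an
# error_type label; no separate error or request counters are needed.
sum(rate(gen_ai_client_operation_duration_count{error_type!=""}[5m]))
/
sum(rate(gen_ai_client_operation_duration_count[5m]))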
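
Latency percentiles from a histogram are estimates interpolated from bucket counts, so their precision depends on where the bucket boundaries sit. The sketch below is a simplified version of that linear interpolation, not Prometheus's actual implementation; the function and the sample buckets are hypothetical and edge cases such as negative bounds are ignored.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative latency buckets in seconds: (le, count).
latency_buckets = [(1.0, 50), (2.0, 80), (5.0, 95), (10.0, 99), (math.inf, 100)]

p90 = histogram_quantile(0.90, latency_buckets)
p995 = histogram_quantile(0.995, latency_buckets)
print(f"p90={p90:.2f}s p99.5={p995:.2f}s")
```

Note what happens at the top end: once a percentile lands in the `+Inf` bucket, the estimate is capped at the largest finite boundary. LLM calls routinely take tens of seconds, so it is worth checking that your histogram buckets extend that far.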

Alerts

yaml

# Prometheus alerting rules (Alertmanager handles routing and notification)
groups:
  - name: llm_alerts
    rules:
      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, rate(gen_ai_client_operation_duration_bucket[5m])) > 10
        annotations:
          summary: "LLM P95 latency > 10s"
          
      - alert: LLMHighCost
        expr: increase(gen_ai_client_cost_total[1h]) > 50
        annotations:
          summary: "LLM spending > $50/hour"
          
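      - alert: LLMHighErrorRate
        # Assumes failed calls carry an error_type label on the duration histogram
        expr: sum(rate(gen_ai_client_operation_duration_count{error_type!=""}[5m])) / sum(rate(gen_ai_client_operation_duration_count[5m])) > 0.05
        annotations:
          summary: "LLM error rate > 5%"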

Add Langfuse for Quality Tracking

For LLM-specific quality metrics (not just infra):

python

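# Uses the v2-style Langfuse SDK calls (langfuse.trace / trace.generation).
# Newer SDK versions expose a different interface, so pin or adapt accordingly.
import os
from langfuse import Langfuse
 
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY"),
)
 
# Create trace (named lf_trace to avoid shadowing OpenTelemetry's `trace` module).
# user_id, model, prompt, response_text, and the token counts come from your request handler.
lf_trace = langfuse.trace(name="code-review", user_id=user_id)
 
generation = lf_trace.generation(
    name="review-generation",
    model=model,
    input=prompt,
    output=response_text,
    usage={"input": input_tokens, "output": output_tokens},
    metadata={"prompt_version": "v2", "feature": "code-review"}
)
 
# Add human feedback when available
generation.score(name="quality", value=4, comment="Good explanation")
 
# Events are sent in the background; flush before a short-lived process exits
langfuse.flush()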

Langfuse gives you: prompt performance over time, A/B test results, human feedback scores, and cost per trace.
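
The A/B comparison behind those results is simple arithmetic: group feedback scores by prompt version and compare. Here is a minimal sketch with made-up data; the `feedback` records and `mean_score_by_version` are hypothetical and do not reflect how Langfuse computes or stores scores.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical feedback records: (prompt_version, quality score on a 1-5 scale).
feedback = [("v1", 3), ("v1", 4), ("v1", 2), ("v2", 4), ("v2", 5), ("v2", 4)]


def mean_score_by_version(records):
    """Group scores by prompt version and return the mean score for each."""
    scores = defaultdict(list)
    for version, score in records:
        scores[version].append(score)
    return {version: mean(values) for version, values in scores.items()}


summary = mean_score_by_version(feedback)
for version, avg in sorted(summary.items()):
    print(f"{version}: mean quality {avg:.2f}")
```

With only a handful of scores per version the difference is not meaningful, so treat a comparison like this as a trend to watch until enough feedback has accumulated.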

Instrumented LLM calls give you answers to the questions that matter: which prompts cost too much, which models are slow, where errors come from. Build this from day one — retrofitting observability is always harder.

For managed observability infrastructure, Grafana Cloud has a generous free tier for metrics, logs, and traces.
