
Victoria Metrics vs Thanos vs Cortex: Which Long-Term Prometheus Storage?

Prometheus doesn't scale to multi-cluster, long-term storage on its own. Compare Victoria Metrics, Thanos, and Cortex to pick the right solution for your scale.

By DevOpsBoys | 4 min read

Prometheus is great for single-cluster metrics, but it runs into three problems at scale: retention (15 days by default), querying metrics across clusters, and high availability. Victoria Metrics, Thanos, and Cortex all address these, in very different ways.

The Problem They're All Solving

Problem 1: Long-term retention
- Prometheus stores data on local disk (default retention: 15 days; there is no size cap unless you set one, so capacity is bounded by the volume)
- Production needs 90 days, sometimes 1+ year for compliance
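
As a rough illustration, retention in Prometheus is controlled by two flags. The fragment below shows them as container args; the `90d` and `200GB` values and the file paths are illustrative assumptions, not defaults (only the 15-day time retention is a default).

```yaml
# Container args for a Prometheus pod (fragment; values are illustrative)
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  # Time-based retention; Prometheus uses 15d when this flag is unset
  - --storage.tsdb.retention.time=90d
  # Optional size cap, disabled by default; oldest blocks are removed first
  - --storage.tsdb.retention.size=200GB
```

Whichever limit is reached first triggers deletion. Everything still has to fit on one local volume, which is why long retention on a single Prometheus gets expensive.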

Problem 2: Multi-cluster
- 10 clusters = 10 Prometheus servers
- No unified query layer
- Can't correlate metrics across clusters

Problem 3: High availability
- Single Prometheus = single point of failure
- Deduplication needed when running Prometheus in pairs
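
A common way to make deduplication possible is to give each replica of an HA pair identical scrape config but a distinct replica label. A minimal sketch, assuming the label is called `prometheus_replica` and the cluster name is `cluster1` (both are placeholders):

```yaml
# prometheus.yml fragment for replica "a" of an HA pair
global:
  scrape_interval: 30s
  external_labels:
    cluster: cluster1              # hypothetical cluster name
    prometheus_replica: replica-a  # the second replica would use replica-b
```

The query layer can then treat series that differ only in `prometheus_replica` as duplicates and merge them.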

Victoria Metrics

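VictoriaMetrics is a Prometheus-compatible time series database that is typically far more resource-efficient than Prometheus. It accepts Prometheus remote write and answers PromQL queries, so it can act as a drop-in long-term store. For distributed mode, use the cluster version (deployed below through the operator's VMCluster resource).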

Architecture:

Prometheus (scraper) → vminsert → vmstorage (replicated) → vmbackup → S3
                                      ↑
                 Grafana → vmselect ──┘

Key advantages:

  • Often cited as 5-10x more efficient than Prometheus in RAM and disk usage (actual savings depend on the workload)
  • Handles ingestion of 1M+ samples/second with vmcluster
  • Prometheus-compatible query language (MetricsQL extends PromQL)
  • Simple horizontal scaling (add more vminsert/vmstorage nodes)

Setup:

yaml
# victoria-metrics-cluster.yaml
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMCluster
metadata:
  name: production
spec:
  retentionPeriod: "90d"  # 90 days; a bare number would be interpreted as months
  replicationFactor: 2
  vmstorage:
    replicaCount: 3
    resources:
      requests:
        memory: "8Gi"
        cpu: "2"
    storage:
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 500Gi
  vmselect:
    replicaCount: 2
    resources:
      requests:
        memory: "2Gi"
  vminsert:
    replicaCount: 2
    resources:
      requests:
        memory: "1Gi"

Remote write from Prometheus:

yaml
# prometheus.yml
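remote_write:
  # Service name assumes the operator's vminsert-<VMCluster name> convention
  # and the "monitoring" namespace; "0" in the path is the tenant (account) ID
  - url: http://vminsert-production.monitoring:8480/insert/0/prometheus/api/v1/write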
    queue_config:
      max_samples_per_send: 10000
      capacity: 30000
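
To read the data back, Grafana can point a regular Prometheus datasource at vmselect. The provisioning sketch below assumes the service is named `vmselect-production` in the `monitoring` namespace (an assumption based on the VMCluster name above) and reuses tenant `0` from the write URL:

```yaml
# grafana/provisioning/datasources/victoriametrics.yaml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus      # vmselect speaks the Prometheus query API
    access: proxy
    # 8481 is vmselect's default port; "0" is the tenant (account) ID
    url: http://vmselect-production.monitoring:8481/select/0/prometheus
    isDefault: true
```

The tenant ID in the read path has to match the one used in the write path, otherwise queries return no data.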

Best for:

  • Single cluster or multi-cluster with a central store
  • Teams that want simplicity over Thanos/Cortex's complexity
  • Anyone hitting Prometheus memory limits

Thanos

Thanos adds a sidecar to each Prometheus instance, uploads TSDB blocks to object storage (S3 in the examples here), and provides a unified query layer.

Architecture:

Prometheus + Thanos Sidecar → S3 (long-term storage)
                   ↓
Thanos Store → Thanos Query → Grafana
Thanos Compact (compaction, downsampling, retention)

Setup with Helm:

bash
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install thanos bitnami/thanos \
  --set query.enabled=true \
  --set storegateway.enabled=true \
  --set compactor.enabled=true \
  --set bucketweb.enabled=false \
  --set storegateway.persistence.size=20Gi \
  --set minio.enabled=false \
  --set objstoreConfig="type: S3
config:
  bucket: my-thanos-bucket
  region: us-east-1
  endpoint: s3.amazonaws.com"

Multi-cluster setup:

yaml
# Each cluster runs Prometheus + sidecar
# thanos-sidecar.yaml
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - sidecar
      - --tsdb.path=/data
      - --prometheus.url=http://localhost:9090
      - --grpc-address=0.0.0.0:10901
      - --http-address=0.0.0.0:10902
      - --objstore.config-file=/etc/thanos/objstore.yaml

yaml
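# Central Thanos Query fans out to every sidecar and the store gateway
# (endpoints are listed statically here)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
spec:
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
        - name: thanos-query
          image: quay.io/thanos/thanos:v0.35.0
          args:
            - query
            # --endpoint replaces the deprecated --store flag
            - --endpoint=thanos-sidecar-cluster1:10901
            - --endpoint=thanos-sidecar-cluster2:10901
            - --endpoint=thanos-store:10901
            # Series that differ only in this label are deduplicated
            - --query.replica-label=prometheus_replica
            - --query.auto-downsampling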
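
For the sidecar in each cluster to upload blocks safely, the Prometheus next to it is normally run with local compaction disabled, which means setting the minimum and maximum block duration to the same value. A sketch of the Prometheus container, assuming it shares the `/data` volume with the sidecar; the image tag and the short local retention are illustrative choices:

```yaml
# Prometheus container running alongside the Thanos sidecar (fragment)
containers:
  - name: prometheus
    image: prom/prometheus:v2.53.0   # tag is illustrative
    args:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/data            # must match the sidecar's --tsdb.path
      - --storage.tsdb.retention.time=2d     # short local retention; history lives in S3
      # Equal min/max block duration disables local compaction so the
      # sidecar can upload completed 2h blocks as-is
      - --storage.tsdb.min-block-duration=2h
      - --storage.tsdb.max-block-duration=2h
```

Each Prometheus also needs unique `external_labels` so that Thanos can tell the blocks from different clusters and replicas apart.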

Best for:

  • Multi-cluster environments where you keep Prometheus as the scraper
  • Teams already using Prometheus and want minimal disruption
  • When you need unlimited retention via S3

Downsides:

  • More components to manage (sidecar, store, query, compactor)
  • Compactor must run as a single instance per bucket (or per shard of blocks), so it cannot be scaled by simply adding replicas
  • Query performance can be slow for large time ranges
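
Downsampling is the main lever against slow long-range queries, and it is the compactor that produces the downsampled blocks. A sketch of a single-replica compactor container; the retention values and the data directory are illustrative assumptions:

```yaml
# Thanos compactor container (fragment); run exactly one per bucket
containers:
  - name: thanos-compact
    image: quay.io/thanos/thanos:v0.35.0
    args:
      - compact
      - --wait                                # keep running instead of exiting after one pass
      - --data-dir=/var/thanos/compact
      - --objstore.config-file=/etc/thanos/objstore.yaml
      # Per-resolution retention: keep raw data briefly, downsampled data longer
      - --retention.resolution-raw=30d
      - --retention.resolution-5m=180d
      - --retention.resolution-1h=2y
```

With `--query.auto-downsampling` enabled on Thanos Query, wide time ranges can be served from the 5m and 1h blocks instead of raw samples.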

Cortex

Cortex is a horizontally scalable, multi-tenant Prometheus backend. Grafana Cloud originally ran on it; Grafana Labs has since forked Cortex into Mimir, which now powers Grafana Cloud, while Cortex continues as a CNCF project.

Architecture:

Prometheus → Cortex Distributor → Cortex Ingester (WAL) → S3
                                      ↓
Cortex Store Gateway → Cortex Querier → Grafana

Key advantages:

  • True multi-tenancy (different teams get isolated metric namespaces)
  • Horizontal scaling for every component
  • Per-tenant rate limiting and quotas
  • Active-active HA by default
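
Per-tenant limits are typically set in a runtime config file that Cortex reloads periodically. A minimal sketch, assuming two tenants called `team-a` and `team-b` (hypothetical names) and limit values picked only for illustration:

```yaml
# runtime-config.yaml, referenced from the main config via runtime_config.file
overrides:
  team-a:
    ingestion_rate: 50000                 # samples per second
    ingestion_burst_size: 100000
    max_global_series_per_user: 1500000   # active series across the cluster
  team-b:
    ingestion_rate: 10000
    ingestion_burst_size: 20000
    max_global_series_per_user: 300000
```

Tenants without an entry fall back to the defaults in the `limits` block of the main config.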

When to use Cortex:

  • You're building an internal metrics platform for multiple teams/products
  • You need per-tenant isolation (team A can't see team B's metrics)
  • You're at SaaS scale (millions of active series)

Setup (simplified):

yaml
# cortex minimal config
target: all  # or run each component separately at scale
auth_enabled: false  # true for multi-tenant
 
distributor:
  ring:
    kvstore:
      store: consul
 
ingester:
  lifecycler:
    ring:
      kvstore:
        store: consul
      replication_factor: 3
 
storage:
  engine: blocks
 
blocks_storage:
  s3:
    bucket_name: my-cortex-blocks
    region: us-east-1
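
Once `auth_enabled` is set to `true`, every write and query must carry a tenant ID in the `X-Scope-OrgID` header. A sketch of the Prometheus side, assuming a distributor service at `cortex-distributor.monitoring:8080` and a tenant named `team-a` (both hypothetical):

```yaml
# prometheus.yml fragment: remote write into a multi-tenant Cortex
remote_write:
  - url: http://cortex-distributor.monitoring:8080/api/v1/push
    headers:
      X-Scope-OrgID: team-a   # tenant ID; isolates this team's series
```

With `auth_enabled: false`, as in the minimal config above, Cortex ignores the header and stores everything under a single built-in tenant.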

Best for:

  • Platform engineering teams building internal monitoring-as-a-service
  • Multi-tenant environments (SaaS products, enterprises with many teams)
  • When you need Kubernetes-native horizontal scaling for everything

Comparison Table

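| | Victoria Metrics | Thanos | Cortex |
|---|---|---|---|
| Complexity | Low | Medium | High |
| Multi-tenancy | Yes in cluster mode (tenant ID in the URL path) | Limited (via Thanos Receive) | Yes (natively) |
| Multi-cluster | Yes (remote write to a central store) | Yes (native) | Yes (native) |
| Prometheus-compatible | Yes | Yes | Yes |
| Long-term storage | Built-in (local disk; S3 for backups via vmbackup) | Object storage (e.g. S3) required | Object storage (e.g. S3) required |
| Resource efficiency | Excellent | Good | Good |
| Best at | Single/small multi-cluster | Multi-cluster, existing Prometheus | Large scale, multi-tenant |
| Managed option | VictoriaMetrics Cloud | Typically self-hosted | Amazon Managed Service for Prometheus (Cortex-based); Grafana Cloud runs Mimir |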

What Should You Choose?

Choose Victoria Metrics if:

  • You're scaling a single cluster or a handful of clusters
  • You're hitting Prometheus memory limits
  • You want the simplest possible setup with long retention

Choose Thanos if:

  • You have 5+ Prometheus instances across clusters/regions
  • You want to keep Prometheus as the scraper unchanged
  • You're okay managing more components

Choose Cortex/Mimir if:

  • You're building a shared metrics platform for multiple teams
  • Multi-tenancy is a requirement
  • You're at 1M+ samples/second scale

For most DevOps teams: Victoria Metrics gets you 90% of what Thanos offers with half the complexity. Start there.

Resources: VictoriaMetrics docs | Thanos | Grafana Mimir (Cortex fork, actively maintained)
