How to Run MLflow on Kubernetes for ML Experiment Tracking (2026)
MLflow tracks your ML experiments, models, and metrics. Here's how to deploy a production MLflow tracking server on Kubernetes with PostgreSQL and S3 artifact storage.
Every ML team runs experiments — different models, hyperparameters, datasets. Without tracking, you lose track of which configuration produced the best results. MLflow is the de facto open-source standard for experiment tracking. Here's how to run it on Kubernetes properly.
Architecture
ML Training Job (local/K8s)
↓
MLflow Client
↓
MLflow Tracking Server (K8s Deployment)
↓
┌──────────────────────────┐
│ PostgreSQL (metadata) │ ← runs, params, metrics
│ S3 / MinIO (artifacts) │ ← models, plots, files
└──────────────────────────┘
What MLflow Tracks
- Runs — each training run with its config
- Parameters — learning rate, epochs, batch size
- Metrics — accuracy, loss, F1 over time
- Artifacts — saved models, plots, feature importances
- Tags — environment, git commit, team
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    # ... train model ...
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("loss", 0.12)
    mlflow.log_artifact("model.pkl")

Step 1: PostgreSQL for Metadata
# postgres-deployment.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: mlflow
type: Opaque
stringData:
  POSTGRES_USER: mlflow
  POSTGRES_PASSWORD: mlflow-secret-password
  POSTGRES_DB: mlflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          envFrom:
            - secretRef:
                name: postgres-secret
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: mlflow
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: mlflow
spec:
  selector:
    app: postgres
  ports:
    - port: 5432

Step 2: S3 Bucket for Artifacts
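S3 bucket names are globally unique and must be 3 to 63 characters of lowercase letters, digits, dots, and hyphens. If you script bucket creation, a quick pre-check avoids a failed `aws s3 mb` call. A small validator sketch (the regex is a simplification of the full AWS rules):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check against S3 naming rules: 3-63 chars,
    lowercase letters/digits/dots/hyphens, starting and
    ending with a letter or digit. (Simplified; see AWS docs
    for the full rule set, e.g. no IP-address-like names.)"""
    if not 3 <= len(name) <= 63:
        return False
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name) is not None

print(is_valid_bucket_name("my-mlflow-artifacts"))  # True
print(is_valid_bucket_name("My_MLflow_Bucket"))     # False: uppercase and underscores
```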
# Create S3 bucket
aws s3 mb s3://my-mlflow-artifacts --region us-east-1

# Create IAM policy
cat > mlflow-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-mlflow-artifacts",
      "arn:aws:s3:::my-mlflow-artifacts/*"
    ]
  }]
}
EOF

aws iam create-policy \
  --policy-name MLflowS3Policy \
  --policy-document file://mlflow-s3-policy.json

If using EKS, attach the policy via IRSA (IAM Roles for Service Accounts):
# Create service account with IRSA
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace mlflow \
  --name mlflow \
  --attach-policy-arn arn:aws:iam::123456789:policy/MLflowS3Policy \
  --approve

Step 3: MLflow Tracking Server Deployment
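The server's --backend-store-uri flag takes a SQLAlchemy-style database URI. If your Postgres password contains special characters such as @ or /, it must be URL-encoded before going into the URI or parsing breaks. A small sketch using the values from the Step 1 secret:

```python
from urllib.parse import quote_plus

def backend_store_uri(user: str, password: str, host: str,
                      port: int, db: str) -> str:
    """Build a SQLAlchemy-style Postgres URI, URL-encoding the
    password so special characters don't break URI parsing."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

print(backend_store_uri("mlflow", "mlflow-secret-password", "postgres", 5432, "mlflow"))
# postgresql://mlflow:mlflow-secret-password@postgres:5432/mlflow
```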
# mlflow-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlflow
---
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: mlflow
type: Opaque
stringData:
  BACKEND_STORE_URI: "postgresql://mlflow:mlflow-secret-password@postgres:5432/mlflow"
  ARTIFACT_ROOT: "s3://my-mlflow-artifacts"
  AWS_DEFAULT_REGION: "us-east-1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlflow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow  # for IRSA
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.11.0
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            - --backend-store-uri=$(BACKEND_STORE_URI)
            - --default-artifact-root=$(ARTIFACT_ROOT)
            - --workers=2
          envFrom:
            - secretRef:
                name: mlflow-secret
          ports:
            - containerPort: 5000
          resources:
            requests:
              cpu: "200m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
    - port: 5000
      targetPort: 5000

Note: depending on the tag, the stock MLflow image may not bundle psycopg2 or boto3. If the server crashes with import errors, build a thin custom image on top of it that pip installs psycopg2-binary and boto3.

Step 4: Ingress for UI Access
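The ingress below is for browser and UI access from outside the cluster. Training jobs running inside the cluster can skip it and reach the Service directly via Kubernetes cluster DNS (service name, then namespace). A tiny helper sketch:

```python
def in_cluster_uri(service: str, namespace: str, port: int) -> str:
    """Kubernetes Service DNS: <service>.<namespace>.svc.cluster.local.
    Pass the result to mlflow.set_tracking_uri() in in-cluster jobs."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

print(in_cluster_uri("mlflow", "mlflow", 5000))
# http://mlflow.mlflow.svc.cluster.local:5000
```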
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
  namespace: mlflow
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "MLflow — authenticate"
spec:
  ingressClassName: nginx
  rules:
    - host: mlflow.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mlflow
                port:
                  number: 5000

Add basic auth (never expose MLflow publicly without auth):

# Create htpasswd
htpasswd -c auth admin
kubectl create secret generic mlflow-basic-auth \
  --from-file=auth \
  -n mlflow

Step 5: Deploy Everything
# Create the namespace first: it's defined in mlflow-deployment.yaml,
# but postgres-deployment.yaml assumes it already exists
kubectl create namespace mlflow

kubectl apply -f postgres-deployment.yaml
kubectl apply -f mlflow-deployment.yaml

# Wait for pods
kubectl get pods -n mlflow -w

# Check MLflow is up
kubectl port-forward svc/mlflow 5000:5000 -n mlflow
# Open http://localhost:5000

Step 6: Connect Your Training Script
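If the server sits behind the basic auth from Step 4, set credentials before the script connects. The MLflow client reads MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD and sends them as HTTP basic auth; the values are whatever you put in the htpasswd file:

```python
import os

# Credentials for the nginx basic auth in front of the ingress.
# In real jobs, inject these from a secret manager rather than hardcoding.
os.environ["MLFLOW_TRACKING_USERNAME"] = "admin"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "your-htpasswd-password"
```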
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Point to your K8s MLflow server
mlflow.set_tracking_uri("http://mlflow.yourdomain.com")
mlflow.set_experiment("iris-classification")

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 100
    max_depth = 5
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    print(f"Accuracy: {accuracy:.3f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Model Registry
MLflow also has a model registry for promoting models to staging/production:
# Register the model from a finished run
mlflow.register_model(
    f"runs:/{run_id}/model",
    "iris-classifier"
)

# Transition version 1 to Production (Python API; there is no
# CLI subcommand for this). Note that stages are deprecated in
# newer MLflow releases in favor of model version aliases.
from mlflow import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="iris-classifier",
    version=1,
    stage="Production"
)

MinIO Alternative (No AWS Needed)
For local/on-prem without S3:
# Install MinIO via Helm
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace mlflow \
  --set rootUser=admin \
  --set rootPassword=adminpassword \
  --set persistence.size=50Gi

Update the MLflow server's environment to point at MinIO:

ARTIFACT_ROOT=s3://mlflow-bucket
AWS_ACCESS_KEY_ID=admin
AWS_SECRET_ACCESS_KEY=adminpassword
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
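Two caveats: MinIO does not create mlflow-bucket for you, and any training client that logs artifacts needs the same three variables, because with --default-artifact-root the MLflow client uploads artifacts directly to the store. A sketch using the Helm values above (the bucket-creation call is guarded so the snippet stays import-safe without a reachable MinIO):

```python
import os

# Client-side settings matching the Helm install above
MINIO_ENV = {
    "AWS_ACCESS_KEY_ID": "admin",
    "AWS_SECRET_ACCESS_KEY": "adminpassword",
    "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
}

def configure_minio_client() -> None:
    """Export MinIO credentials so mlflow/boto3 pick them up."""
    os.environ.update(MINIO_ENV)

if __name__ == "__main__":
    configure_minio_client()
    # One-time bucket creation (requires boto3 and a reachable MinIO)
    import boto3
    s3 = boto3.client("s3", endpoint_url=MINIO_ENV["MLFLOW_S3_ENDPOINT_URL"])
    s3.create_bucket(Bucket="mlflow-bucket")
```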
Resources
- MLflow Docs — complete reference
- KodeKloud MLOps Path — hands-on MLOps labs
- Udemy MLOps Course — MLflow + Kubernetes projects