
How to Run MLflow on Kubernetes for ML Experiment Tracking (2026)

MLflow tracks your ML experiments, models, and metrics. Here's how to deploy a production MLflow tracking server on Kubernetes with PostgreSQL and S3 artifact storage.

DevOpsBoys · Apr 14, 2026 · 4 min read

Every ML team runs experiments: different models, hyperparameters, datasets. Without tracking, you lose track of which configuration produced the best accuracy. MLflow is a widely used open-source tool for this. Here's how to run it on Kubernetes with persistent metadata and artifact storage.

Architecture

ML Training Job (local/K8s)
         ↓
    MLflow Client
         ↓
  MLflow Tracking Server (K8s Deployment)
         ↓
  ┌──────────────────────────┐
  │  PostgreSQL (metadata)   │  ← runs, params, metrics
  │  S3 / MinIO (artifacts)  │  ← models, plots, files
  └──────────────────────────┘

What MLflow Tracks

  • Runs — each training run with its config
  • Parameters — learning rate, epochs, batch size
  • Metrics — accuracy, loss, F1 over time
  • Artifacts — saved models, plots, feature importances
  • Tags — environment, git commit, team

A minimal run that logs parameters, metrics, and an artifact:

python
import mlflow
 
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    
    # ... train model ...
    
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("loss", 0.12)
    mlflow.log_artifact("model.pkl")  # the file must already exist locally

Step 1: PostgreSQL for Metadata

yaml
# postgres-deployment.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: mlflow
type: Opaque
stringData:
  POSTGRES_USER: mlflow
  POSTGRES_PASSWORD: mlflow-secret-password
  POSTGRES_DB: mlflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: mlflow
spec:
  replicas: 1
  strategy:
    type: Recreate   # stop the old pod before a new one mounts the ReadWriteOnce volume
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
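      - name: postgres
        image: postgres:15-alpine
        envFrom:
        - secretRef:
            name: postgres-secret
        env:
        # Use a subdirectory so initdb does not trip over lost+found
        # on a freshly formatted volume.
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data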
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: postgres-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: mlflow
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: mlflow
spec:
  selector:
    app: postgres
  ports:
  - port: 5432

Step 2: S3 Bucket for Artifacts

bash
# Create S3 bucket
aws s3 mb s3://my-mlflow-artifacts --region us-east-1
 
# Create IAM policy
cat > mlflow-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-mlflow-artifacts",
      "arn:aws:s3:::my-mlflow-artifacts/*"
    ]
  }]
}
EOF
 
aws iam create-policy \
  --policy-name MLflowS3Policy \
  --policy-document file://mlflow-s3-policy.json
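
The policy above lists two resources because `s3:ListBucket` applies to the bucket itself while the object actions apply to the keys inside it. As an illustration only, the same document can be generated in Python with each action scoped to the resource type it needs. This is a slightly tighter variant of the policy above, not a required change, and the function name `build_mlflow_s3_policy` is hypothetical.

```python
import json


def build_mlflow_s3_policy(bucket: str) -> dict:
    """Return an IAM policy document granting MLflow access to one bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions target the keys under the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# Write the result to mlflow-s3-policy.json, or inspect it first:
print(json.dumps(build_mlflow_s3_policy("my-mlflow-artifacts"), indent=2))
```

Generating the document from the bucket name keeps the two ARNs in sync if you later rename the bucket or add one per environment.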

If using EKS, attach via IRSA (IAM Roles for Service Accounts):

bash
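# Create service account with IRSA
# (the cluster needs an IAM OIDC provider: eksctl utils associate-iam-oidc-provider)
# Replace 123456789012 with your 12-digit AWS account ID
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace mlflow \
  --name mlflow \
  --attach-policy-arn arn:aws:iam::123456789012:policy/MLflowS3Policy \
  --approve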

Step 3: MLflow Tracking Server Deployment

yaml
# mlflow-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlflow
---
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: mlflow
type: Opaque
stringData:
  BACKEND_STORE_URI: "postgresql://mlflow:mlflow-secret-password@postgres:5432/mlflow"
  ARTIFACT_ROOT: "s3://my-mlflow-artifacts"
  AWS_DEFAULT_REGION: "us-east-1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlflow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow   # for IRSA
      containers:
      - name: mlflow
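        # Note: if the server fails at startup with a missing psycopg2 or boto3
        # module, build a derived image that installs psycopg2-binary and boto3.
        image: ghcr.io/mlflow/mlflow:v2.11.0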
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
        - --backend-store-uri=$(BACKEND_STORE_URI)
        - --default-artifact-root=$(ARTIFACT_ROOT)
        - --workers=2
        envFrom:
        - secretRef:
            name: mlflow-secret
        ports:
        - containerPort: 5000
        resources:
          requests:
            cpu: "200m"
            memory: "512Mi"
          limits:
            cpu: "500m"
            memory: "1Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 15
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 5000
          initialDelaySeconds: 30
          periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
  - port: 5000
    targetPort: 5000

Step 4: Ingress for UI Access

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
  namespace: mlflow
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "MLflow — authenticate"
spec:
  ingressClassName: nginx
  rules:
  - host: mlflow.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: mlflow
            port:
              number: 5000

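Add basic auth (never expose MLflow publicly without auth). Serve the host over TLS as well, because basic auth sends credentials base64-encoded, not encrypted: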

bash
# Create htpasswd
htpasswd -c auth admin
kubectl create secret generic mlflow-basic-auth \
  --from-file=auth \
  -n mlflow
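
Once the Ingress is protected, every client request has to carry an `Authorization` header. As a rough sketch of what that header contains, the snippet below builds it by hand and attaches it to a request for the server's `/health` endpoint. The helper names are hypothetical, and the hostname is the placeholder used in the Ingress above.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_health_request(base_url: str, username: str, password: str):
    """Build (but do not send) an authenticated request to /health."""
    request = urllib.request.Request(f"{base_url.rstrip('/')}/health")
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


request = build_health_request("http://mlflow.yourdomain.com", "admin", "change-me")
print(request.full_url)
# To actually send it: urllib.request.urlopen(request, timeout=5)
```

A 401 response to the sent request means the Ingress rejected the credentials; a 200 means the request reached the MLflow server.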

Step 5: Deploy Everything

bash
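# The Step 1 manifests reference the mlflow namespace, so make sure it exists first
kubectl create namespace mlflow --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -f postgres-deployment.yaml
kubectl apply -f mlflow-deployment.yaml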
 
# Wait for pods
kubectl get pods -n mlflow -w
 
# Check MLflow is up
kubectl port-forward svc/mlflow 5000:5000 -n mlflow
# Open http://localhost:5000
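
If you want to script the check instead of opening a browser, you can poll the same `/health` endpoint that the readiness probe uses. This is a minimal sketch, assuming the port-forward above is running in another terminal. The function name `wait_for_health` and its default timings are arbitrary choices.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=60.0, interval_s=2.0):
    """Poll url until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status:
            # the server is not ready yet, so retry.
            pass
        time.sleep(interval_s)
    return False


# Example (with the port-forward active):
#   wait_for_health("http://localhost:5000/health")
```

The function returns `False` rather than raising, so a CI job can decide for itself whether a slow start is fatal.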

Step 6: Connect Your Training Script

python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
 
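# Point to your K8s MLflow server. With the basic-auth Ingress from Step 4,
# set MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD in the environment.
mlflow.set_tracking_uri("http://mlflow.yourdomain.com")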
mlflow.set_experiment("iris-classification")
 
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 100
    max_depth = 5
    
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)
    
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    
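    # Log the model. With --default-artifact-root, the client uploads straight
    # to S3, so this machine needs AWS credentials for the bucket.
    mlflow.sklearn.log_model(model, "model")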
    
    print(f"Accuracy: {accuracy:.3f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Model Registry

MLflow also has a model registry for promoting models to staging/production:

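python
import mlflow
from mlflow.tracking import MlflowClient

# Use the run ID printed at the end of the training script in Step 6
run_id = "<run-id-from-step-6>"

# Register model
mlflow.register_model(
    f"runs:/{run_id}/model",
    "iris-classifier"
)

# Transition to production. The mlflow CLI has no command for this,
# so use the client API. Note that stages are deprecated in recent
# MLflow releases in favor of model version aliases.
client = MlflowClient()
client.transition_model_version_stage(
    name="iris-classifier",
    version="1",
    stage="Production",
)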

MinIO Alternative (No AWS Needed)

For local/on-prem without S3:

bash
# Install MinIO via Helm
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace mlflow \
  --set rootUser=admin \
  --set rootPassword=adminpassword \
  --set persistence.size=50Gi

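Create the bucket in MinIO, then point MLflow at it by setting these environment variables on the tracking server (and on any client that uploads artifacts). The credentials shown are demo values: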

ARTIFACT_ROOT=s3://mlflow-bucket
AWS_ACCESS_KEY_ID=admin
AWS_SECRET_ACCESS_KEY=adminpassword
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
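
One way to get the MinIO credentials and endpoint into the cluster is a separate Secret. The sketch below renders one as JSON, which `kubectl apply -f -` accepts just like YAML. The Secret name `mlflow-minio` and the function name are hypothetical. `ARTIFACT_ROOT` is deliberately left out, on the assumption that you change it in the existing `mlflow-secret` instead.

```python
import json


def minio_secret_manifest(access_key, secret_key, endpoint_url,
                          name="mlflow-minio", namespace="mlflow"):
    """Return a Kubernetes Secret manifest holding MinIO client settings."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "stringData": {
            "AWS_ACCESS_KEY_ID": access_key,
            "AWS_SECRET_ACCESS_KEY": secret_key,
            "MLFLOW_S3_ENDPOINT_URL": endpoint_url,
        },
    }


# Demo credentials from the Helm command above; replace them for real use.
manifest = minio_secret_manifest("admin", "adminpassword", "http://minio:9000")
print(json.dumps(manifest, indent=2))
```

Pipe the output to `kubectl apply -f -`, then add a second `secretRef` entry for this Secret under `envFrom` in the MLflow Deployment.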
