How to Run MLflow on Kubernetes for ML Experiment Tracking (2026)
MLflow tracks your ML experiments, models, and metrics. Here's how to deploy a production MLflow tracking server on Kubernetes with PostgreSQL and S3 artifact storage.
Every ML team runs experiments — different models, hyperparameters, datasets. Without tracking, you lose track of which configuration produced the best results. MLflow is the de facto open-source standard for experiment tracking. Here's how to run it on Kubernetes properly.
Architecture
ML Training Job (local/K8s)
↓
MLflow Client
↓
MLflow Tracking Server (K8s Deployment)
↓
┌──────────────────────────┐
│ PostgreSQL (metadata) │ ← runs, params, metrics
│ S3 / MinIO (artifacts) │ ← models, plots, files
└──────────────────────────┘
What MLflow Tracks
- Runs — each training run with its config
- Parameters — learning rate, epochs, batch size
- Metrics — accuracy, loss, F1 over time
- Artifacts — saved models, plots, feature importances
- Tags — environment, git commit, team
import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("epochs", 100)
    # ... train model ...
    mlflow.log_metric("accuracy", 0.94)
    mlflow.log_metric("loss", 0.12)
    mlflow.log_artifact("model.pkl")

Step 1: PostgreSQL for Metadata
# postgres-deployment.yaml
apiVersion: v1
kind: Secret
metadata:
  name: postgres-secret
  namespace: mlflow
type: Opaque
stringData:
  POSTGRES_USER: mlflow
  POSTGRES_PASSWORD: mlflow-secret-password
  POSTGRES_DB: mlflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  namespace: mlflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15-alpine
          envFrom:
            - secretRef:
                name: postgres-secret
          ports:
            - containerPort: 5432
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
  namespace: mlflow
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  namespace: mlflow
spec:
  selector:
    app: postgres
  ports:
    - port: 5432

Step 2: S3 Bucket for Artifacts
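S3 bucket names are globally unique and must be 3 to 63 characters of lowercase letters, digits, dots, and hyphens. If you script bucket creation, a quick pre-check avoids a failed `aws s3 mb` call. A small validator sketch (the regex is a simplification of the full AWS rules):

```python
import re

def is_valid_bucket_name(name: str) -> bool:
    """Rough check against S3 naming rules: 3-63 chars,
    lowercase letters/digits/dots/hyphens, starting and
    ending with a letter or digit. (Simplified; see AWS docs
    for the full rule set, e.g. no IP-address-like names.)"""
    if not 3 <= len(name) <= 63:
        return False
    return re.fullmatch(r"[a-z0-9][a-z0-9.-]*[a-z0-9]", name) is not None

print(is_valid_bucket_name("my-mlflow-artifacts"))  # True
print(is_valid_bucket_name("My_MLflow_Bucket"))     # False: uppercase and underscores
```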
# Create S3 bucket
aws s3 mb s3://my-mlflow-artifacts --region us-east-1

# Create IAM policy
cat > mlflow-s3-policy.json << 'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-mlflow-artifacts",
      "arn:aws:s3:::my-mlflow-artifacts/*"
    ]
  }]
}
EOF

aws iam create-policy \
  --policy-name MLflowS3Policy \
  --policy-document file://mlflow-s3-policy.json

If using EKS, attach the policy via IRSA (IAM Roles for Service Accounts):
# Create service account with IRSA
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace mlflow \
  --name mlflow \
  --attach-policy-arn arn:aws:iam::123456789:policy/MLflowS3Policy \
  --approve

Step 3: MLflow Tracking Server Deployment
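The server's --backend-store-uri flag takes a SQLAlchemy-style database URI. If your Postgres password contains special characters such as @ or /, it must be URL-encoded before going into the URI or parsing breaks. A small sketch using the values from the Step 1 secret:

```python
from urllib.parse import quote_plus

def backend_store_uri(user: str, password: str, host: str,
                      port: int, db: str) -> str:
    """Build a SQLAlchemy-style Postgres URI, URL-encoding the
    password so special characters don't break URI parsing."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

print(backend_store_uri("mlflow", "mlflow-secret-password", "postgres", 5432, "mlflow"))
# postgresql://mlflow:mlflow-secret-password@postgres:5432/mlflow
```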
# mlflow-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: mlflow
---
apiVersion: v1
kind: Secret
metadata:
  name: mlflow-secret
  namespace: mlflow
type: Opaque
stringData:
  BACKEND_STORE_URI: "postgresql://mlflow:mlflow-secret-password@postgres:5432/mlflow"
  ARTIFACT_ROOT: "s3://my-mlflow-artifacts"
  AWS_DEFAULT_REGION: "us-east-1"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  namespace: mlflow
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      serviceAccountName: mlflow  # for IRSA
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.11.0
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            - --backend-store-uri=$(BACKEND_STORE_URI)
            - --default-artifact-root=$(ARTIFACT_ROOT)
            - --workers=2
          envFrom:
            - secretRef:
                name: mlflow-secret
          ports:
            - containerPort: 5000
          resources:
            requests:
              cpu: "200m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 30
            periodSeconds: 30
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow
  namespace: mlflow
spec:
  selector:
    app: mlflow
  ports:
    - port: 5000
      targetPort: 5000

Note: depending on the tag, the stock MLflow image may not bundle psycopg2 or boto3. If the server crashes with import errors, build a thin custom image on top of it that pip installs psycopg2-binary and boto3.

Step 4: Ingress for UI Access
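The ingress below is for browser and UI access from outside the cluster. Training jobs running inside the cluster can skip it and reach the Service directly via Kubernetes cluster DNS (service name, then namespace). A tiny helper sketch:

```python
def in_cluster_uri(service: str, namespace: str, port: int) -> str:
    """Kubernetes Service DNS: <service>.<namespace>.svc.cluster.local.
    Pass the result to mlflow.set_tracking_uri() in in-cluster jobs."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"

print(in_cluster_uri("mlflow", "mlflow", 5000))
# http://mlflow.mlflow.svc.cluster.local:5000
```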
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mlflow-ingress
  namespace: mlflow
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: mlflow-basic-auth
    nginx.ingress.kubernetes.io/auth-realm: "MLflow — authenticate"
spec:
  ingressClassName: nginx
  rules:
    - host: mlflow.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mlflow
                port:
                  number: 5000

Add basic auth (never expose MLflow publicly without auth):

# Create htpasswd
htpasswd -c auth admin
kubectl create secret generic mlflow-basic-auth \
  --from-file=auth \
  -n mlflow

Step 5: Deploy Everything
# Create the namespace first: it's defined in mlflow-deployment.yaml,
# but postgres-deployment.yaml assumes it already exists
kubectl create namespace mlflow

kubectl apply -f postgres-deployment.yaml
kubectl apply -f mlflow-deployment.yaml

# Wait for pods
kubectl get pods -n mlflow -w

# Check MLflow is up
kubectl port-forward svc/mlflow 5000:5000 -n mlflow
# Open http://localhost:5000

Step 6: Connect Your Training Script
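If the server sits behind the basic auth from Step 4, set credentials before the script connects. The MLflow client reads MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD and sends them as HTTP basic auth; the values are whatever you put in the htpasswd file:

```python
import os

# Credentials for the nginx basic auth in front of the ingress.
# In real jobs, inject these from a secret manager rather than hardcoding.
os.environ["MLFLOW_TRACKING_USERNAME"] = "admin"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "your-htpasswd-password"
```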
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Point to your K8s MLflow server
mlflow.set_tracking_uri("http://mlflow.yourdomain.com")
mlflow.set_experiment("iris-classification")

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 100
    max_depth = 5
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    model.fit(X_train, y_train)

    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

    # Log the model
    mlflow.sklearn.log_model(model, "model")

    print(f"Accuracy: {accuracy:.3f}")
    print(f"Run ID: {mlflow.active_run().info.run_id}")

Model Registry
MLflow also has a model registry for promoting models to staging/production:
# Register the model from a finished run
mlflow.register_model(
    f"runs:/{run_id}/model",
    "iris-classifier"
)

# Transition version 1 to Production (Python API; there is no
# CLI subcommand for this). Note that stages are deprecated in
# newer MLflow releases in favor of model version aliases.
from mlflow import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="iris-classifier",
    version=1,
    stage="Production"
)

MinIO Alternative (No AWS Needed)
For local/on-prem without S3:
# Install MinIO via Helm
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace mlflow \
  --set rootUser=admin \
  --set rootPassword=adminpassword \
  --set persistence.size=50Gi

Update the MLflow server's environment to point at MinIO:

ARTIFACT_ROOT=s3://mlflow-bucket
AWS_ACCESS_KEY_ID=admin
AWS_SECRET_ACCESS_KEY=adminpassword
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
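Two caveats: MinIO does not create mlflow-bucket for you, and any training client that logs artifacts needs the same three variables, because with --default-artifact-root the MLflow client uploads artifacts directly to the store. A sketch using the Helm values above (the bucket-creation call is guarded so the snippet stays import-safe without a reachable MinIO):

```python
import os

# Client-side settings matching the Helm install above
MINIO_ENV = {
    "AWS_ACCESS_KEY_ID": "admin",
    "AWS_SECRET_ACCESS_KEY": "adminpassword",
    "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
}

def configure_minio_client() -> None:
    """Export MinIO credentials so mlflow/boto3 pick them up."""
    os.environ.update(MINIO_ENV)

if __name__ == "__main__":
    configure_minio_client()
    # One-time bucket creation (requires boto3 and a reachable MinIO)
    import boto3
    s3 = boto3.client("s3", endpoint_url=MINIO_ENV["MLFLOW_S3_ENDPOINT_URL"])
    s3.create_bucket(Bucket="mlflow-bucket")
```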
Resources
- MLflow Docs — complete reference
- KodeKloud MLOps Path — hands-on MLOps labs
- Udemy MLOps Course — MLflow + Kubernetes projects