Set Up MLflow + Airflow MLOps Pipeline on Kubernetes (2026)
Build a production MLOps pipeline on Kubernetes using MLflow for experiment tracking and model registry, and Apache Airflow for pipeline orchestration. Full setup guide.
MLflow tracks experiments and manages models. Airflow orchestrates the training pipeline. Together on Kubernetes, they give you a production-grade MLOps platform. Here's the full setup.
Architecture Overview
[Data Sources] → [Airflow DAG] → [Training Job] → [MLflow Tracking]
                                                          ↓
                                               [MLflow Model Registry]
                                                          ↓
                                             [Model Serving / Inference]
- Airflow orchestrates: data ingestion → preprocessing → training → evaluation → registration
- MLflow tracks: parameters, metrics, artifacts, models
- Kubernetes runs it all with GPU support and auto-scaling
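As a rough mental model of the orchestration order listed above, the sketch below runs named stages in sequence and stops at the first failure, so later stages never start. It is plain Python for illustration only, not Airflow code; `STAGES`, `run_pipeline`, and the stage callables are hypothetical names that do not appear elsewhere in this guide.

```python
from typing import Callable, Dict, List

# Stage order taken from the architecture list above.
STAGES: List[str] = [
    "ingestion",
    "preprocessing",
    "training",
    "evaluation",
    "registration",
]


def run_pipeline(steps: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each stage in order and return the names of the stages that completed.

    If a stage raises, the remaining stages are not executed. This loosely
    mirrors how an orchestrator does not run tasks downstream of a failed one.
    """
    completed: List[str] = []
    for name in STAGES:
        try:
            steps[name]()
        except Exception as exc:
            print(f"stage {name!r} failed: {exc}")
            break
        completed.append(name)
    return completed
```

If the evaluation stage raises, `run_pipeline` returns only the first three stage names and registration never runs, which is the behavior the quality gate later in this guide relies on.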
Step 1: Install MLflow on Kubernetes
Create namespace and dependencies:
kubectl create namespace mlops
# PostgreSQL for MLflow backend (tracking server metadata)
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install mlflow-postgres bitnami/postgresql \
  --namespace mlops \
  --set auth.database=mlflow \
  --set auth.username=mlflow \
  --set auth.password=mlflowpassword \
  --set primary.persistence.size=20Gi
MLflow deployment with S3 artifact store:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-server
  template:
    metadata:
      labels:
        app: mlflow-server
    spec:
      serviceAccountName: mlflow-sa  # needs S3 access via IRSA
      containers:
        - name: mlflow
          # Pin a specific version tag in production instead of latest.
          # The stock image may not include the PostgreSQL and S3 client
          # libraries (psycopg2, boto3); if the server fails at startup,
          # build a derived image that installs them.
          image: ghcr.io/mlflow/mlflow:latest
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            # The Bitnami chart names its Service <release>-postgresql;
            # confirm with: kubectl get svc -n mlops
            - --backend-store-uri=postgresql://mlflow:mlflowpassword@mlflow-postgres-postgresql:5432/mlflow
            - --default-artifact-root=s3://my-mlflow-bucket/artifacts
            - --serve-artifacts
          ports:
            - containerPort: 5000
          env:
            - name: AWS_DEFAULT_REGION
              value: us-east-1
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  selector:
    app: mlflow-server
  ports:
    - port: 5000
      targetPort: 5000
S3 bucket for artifacts:
aws s3 mb s3://my-mlflow-bucket
# IRSA for MLflow pod to access S3
eksctl create iamserviceaccount \
  --name mlflow-sa \
  --namespace mlops \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
Step 2: Install Apache Airflow with Helm
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace mlops \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo=https://github.com/myorg/ml-dags \
  --set dags.gitSync.branch=main \
  --set dags.gitSync.subPath=dags \
  --set webserver.service.type=ClusterIP
KubernetesExecutor matters for MLOps because each task runs in its own pod with its own resource requests: GPUs for training, small pods for preprocessing.
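To make the per-task sizing concrete, here is a minimal sketch of resource profiles as plain dictionaries. The helper `resource_spec` and the `PROFILES` mapping are hypothetical names introduced for illustration; the CPU, memory, and GPU values are the ones used by the preprocessing and training tasks in Step 3. For extended resources such as `nvidia.com/gpu`, Kubernetes expects the request to equal the limit, so the helper sets both from one argument.

```python
from typing import Dict


def resource_spec(
    cpu: str,
    memory: str,
    cpu_limit: str,
    memory_limit: str,
    gpus: int = 0,
) -> Dict[str, Dict[str, str]]:
    """Build a requests/limits mapping for one task's pod."""
    requests = {"cpu": cpu, "memory": memory}
    limits = {"cpu": cpu_limit, "memory": memory_limit}
    if gpus > 0:
        # GPUs are an extended resource: request and limit must match.
        requests["nvidia.com/gpu"] = str(gpus)
        limits["nvidia.com/gpu"] = str(gpus)
    return {"requests": requests, "limits": limits}


# Hypothetical profiles: a small pod for preprocessing, a GPU pod for training.
PROFILES = {
    "preprocess": resource_spec("500m", "2Gi", "1", "4Gi"),
    "train": resource_spec("2", "8Gi", "4", "16Gi", gpus=1),
}
```

A profile built this way could be unpacked into the Kubernetes client model, for example `k8s.V1ResourceRequirements(**PROFILES["train"])`, which keeps the sizing in one place instead of repeating it in every task definition.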
Step 3: Write the Training DAG
# dags/train_model.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s
default_args = {
    "owner": "mlops-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
with DAG(
    dag_id="train_model_pipeline",
    default_args=default_args,
    # daily at 2am; `schedule` replaces the deprecated `schedule_interval` (Airflow 2.4+)
    schedule="0 2 * * *",
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # Step 1: Data preprocessing (small pod)
    preprocess = KubernetesPodOperator(
        task_id="preprocess_data",
        name="preprocess-data",
        namespace="mlops",
        image="myregistry.com/ml-preprocess:latest",
        cmds=["python", "preprocess.py"],
        env_vars={"S3_BUCKET": "my-ml-data"},
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
            limits={"cpu": "1", "memory": "4Gi"},
        ),
    )
    # Step 2: Training (GPU pod)
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="mlops",
        image="myregistry.com/ml-train:latest",
        cmds=["python", "train.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",
            "S3_BUCKET": "my-ml-data",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        ),
        # Assumes GPU nodes carry this label and taint; adjust to your cluster.
        node_selector={"nvidia.com/gpu": "true"},
        tolerations=[
            k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    )
    # Step 3: Evaluate and register model
    evaluate_and_register = KubernetesPodOperator(
        task_id="evaluate_and_register",
        name="evaluate-register",
        namespace="mlops",
        image="myregistry.com/ml-evaluate:latest",
        cmds=["python", "evaluate_and_register.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            # The evaluation script looks up runs by experiment name.
            "EXPERIMENT_NAME": "production-model",
            "MIN_ACCURACY": "0.85",  # only promote if accuracy is at least 0.85
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
        ),
    )
    preprocess >> train >> evaluate_and_register
Step 4: Training Script with MLflow Tracking
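# train.py
import os
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(os.environ["EXPERIMENT_NAME"])
# Placeholder data so the script runs end to end; replace this with
# loading the preprocessed dataset (for example from S3_BUCKET).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2.3")
    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
    # Log the model and register it as a new version of production-classifier
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="production-classifier",
        input_example=X_test[:5],
    )
    print(f"Accuracy: {accuracy:.4f}")
Step 5: Promote Model to Production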
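# evaluate_and_register.py
import os
from mlflow.tracking import MlflowClient
MODEL_NAME = "production-classifier"
client = MlflowClient()  # reads MLFLOW_TRACKING_URI from the environment
min_accuracy = float(os.environ["MIN_ACCURACY"])
# Falls back to the experiment name used by the training task if the variable is unset
experiment_name = os.environ.get("EXPERIMENT_NAME", "production-model")
# Get the latest run
experiment = client.get_experiment_by_name(experiment_name)
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1,
)
latest_run = runs[0]
accuracy = latest_run.data.metrics["accuracy"]
if accuracy >= min_accuracy:
    # Find the model version that train.py registered from this run
    versions = client.search_model_versions(f"run_id='{latest_run.info.run_id}'")
    version = next(v.version for v in versions if v.name == MODEL_NAME)
    # Transition to Production. Stages are deprecated in recent MLflow
    # releases in favor of model version aliases; this call still works
    # there but emits a deprecation warning.
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    print(f"Model promoted to Production (accuracy: {accuracy:.4f})")
else:
    print(f"Model NOT promoted: accuracy {accuracy:.4f} is below threshold {min_accuracy}")
    # Raising fails the Airflow task, so the failed gate is visible in the DAG run
    raise ValueError("Model quality gate failed")
Access MLflow UI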
kubectl port-forward svc/mlflow-server -n mlops 5000:5000
# Open http://localhost:5000
You'll see all experiments, runs, parameters, metrics, and registered models.
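Before opening the browser, a short script can confirm that the port-forward is up. The sketch below is one way to do it with the standard library, assuming the tracking server answers on its `/health` endpoint; the function name `mlflow_is_healthy` and the 5 second timeout are illustrative choices, not part of MLflow.

```python
import urllib.error
import urllib.request


def mlflow_is_healthy(base_url: str = "http://localhost:5000", timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # which is usually what you want for a localhost port-forward.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx response.
        return False
```

Calling `mlflow_is_healthy()` while the port-forward is running should return `True`; a `False` result usually means the port-forward has exited or the server pod is not ready.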
What You Get
- Automated daily retraining via Airflow DAG
- Full experiment lineage — every run tracked in MLflow
- Quality gates — only models above accuracy threshold go to Production
- GPU training on Kubernetes with proper resource isolation
- S3-backed artifact storage — models versioned and stored durably
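The quality gate in the list above comes down to one comparison, and it can help to keep that comparison in a small function that is testable without an MLflow server. This is a minimal sketch; `passes_quality_gate` is a hypothetical helper, and treating a missing metric as a failure is an assumption about the desired behavior rather than something the promotion script in Step 5 does.

```python
from typing import Mapping


def passes_quality_gate(metrics: Mapping[str, float], min_accuracy: float) -> bool:
    """Return True when the run's accuracy meets the threshold.

    A missing accuracy metric counts as a failure instead of raising,
    so a run that never logged the metric cannot be promoted.
    """
    accuracy = metrics.get("accuracy")
    return accuracy is not None and accuracy >= min_accuracy
```

The promotion script could call this with `latest_run.data.metrics` and the `MIN_ACCURACY` value, and raise when it returns `False` so the Airflow task fails visibly.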