Set Up MLflow + Airflow MLOps Pipeline on Kubernetes (2026)
Build a production MLOps pipeline on Kubernetes using MLflow for experiment tracking and model registry, and Apache Airflow for pipeline orchestration. Full setup guide.
MLflow tracks experiments and manages models. Airflow orchestrates the training pipeline. Together on Kubernetes, they give you a production-grade MLOps platform. Here's the full setup.
Architecture Overview
[Data Sources] → [Airflow DAG] → [Training Job] → [MLflow Tracking]
                                                         ↓
                                              [MLflow Model Registry]
                                                         ↓
                                            [Model Serving / Inference]
- Airflow orchestrates: data ingestion → preprocessing → training → evaluation → registration
- MLflow tracks: parameters, metrics, artifacts, models
- Kubernetes runs it all with GPU support and auto-scaling; a consumer-side sketch follows below
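Once these pieces are wired together, any downstream service can load the currently promoted model straight from the registry. A minimal sketch, assuming the model name and tracking URI configured later in this guide:

import mlflow

mlflow.set_tracking_uri("http://mlflow-server.mlops.svc.cluster.local:5000")
# "models:/<name>/<stage>" resolves to whichever version currently holds that stage
model = mlflow.pyfunc.load_model("models:/production-classifier/Production")
predictions = model.predict(features_df)  # features_df: your inference DataFrame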
Step 1: Install MLflow on Kubernetes
Create the namespace and install the dependencies:
kubectl create namespace mlops
# PostgreSQL for MLflow backend (tracking server metadata)
helm repo add bitnami https://charts.bitnami.com/bitnami
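helm repo update  # refresh the chart index (same as the Airflow step later)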
helm install mlflow-postgres bitnami/postgresql \
  --namespace mlops \
  --set auth.database=mlflow \
  --set auth.username=mlflow \
  --set auth.password=mlflowpassword \
  --set primary.persistence.size=20Gi

MLflow deployment with S3 artifact store:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-server
  template:
    metadata:
      labels:
        app: mlflow-server
    spec:
      serviceAccountName: mlflow-sa  # needs S3 access via IRSA
      containers:
        - name: mlflow
          # the stock MLflow image ships without psycopg2 or boto3;
          # build a small custom image that pip-installs both for this setup
          image: ghcr.io/mlflow/mlflow:latest
          command:
            - mlflow
            - server
            - --host=0.0.0.0
            - --port=5000
            # bitnami appends the chart name to the release, hence "mlflow-postgres-postgresql"
            - --backend-store-uri=postgresql://mlflow:mlflowpassword@mlflow-postgres-postgresql:5432/mlflow
            - --default-artifact-root=s3://my-mlflow-bucket/artifacts
            - --serve-artifacts
          ports:
            - containerPort: 5000
          env:
            - name: AWS_DEFAULT_REGION
              value: us-east-1
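            # In production you'd likely avoid baking the DB password into the args.
            # One option: store it in a Secret and splice it in with Kubernetes
            # $(VAR) expansion, e.g.:
            # - name: DB_PASSWORD
            #   valueFrom:
            #     secretKeyRef:
            #       name: mlflow-db   # hypothetical Secret holding the password
            #       key: password
            # then: --backend-store-uri=postgresql://mlflow:$(DB_PASSWORD)@mlflow-postgres-postgresql:5432/mlflow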
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  selector:
    app: mlflow-server
  ports:
    - port: 5000
      targetPort: 5000

Create the S3 bucket for artifacts:
aws s3 mb s3://my-mlflow-bucket
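# optional but recommended: version the bucket so overwritten artifacts stay recoverable
aws s3api put-bucket-versioning \
  --bucket my-mlflow-bucket \
  --versioning-configuration Status=Enabled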
# IRSA for MLflow pod to access S3
eksctl create iamserviceaccount \
  --name mlflow-sa \
  --namespace mlops \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
# in production, scope the policy down to the artifact bucket instead of S3FullAccess

Step 2: Install Apache Airflow with Helm
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace mlops \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo=https://github.com/myorg/ml-dags \
  --set dags.gitSync.branch=main \
  --set dags.gitSync.subPath=dags \
  --set webserver.service.type=ClusterIP

KubernetesExecutor is important for MLOps — each task runs in its own pod with custom resources (GPUs for training, small pods for preprocessing).
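To confirm the install, port-forward the web UI. A quick check, assuming chart-default names (the Service name can differ between chart versions):

kubectl port-forward svc/airflow-webserver -n mlops 8080:8080
# open http://localhost:8080; the chart creates an admin/admin user unless you override it

The train_model_pipeline DAG from the next step will appear here once git-sync pulls it.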
Step 3: Write the Training DAG
# dags/train_model.py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from kubernetes.client import models as k8s

default_args = {
    "owner": "mlops-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="train_model_pipeline",
    default_args=default_args,
    schedule="0 2 * * *",  # daily at 2am ("schedule" replaces the deprecated "schedule_interval")
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
    # Step 1: Data preprocessing (small pod)
    preprocess = KubernetesPodOperator(
        task_id="preprocess_data",
        name="preprocess-data",
        namespace="mlops",
        image="myregistry.com/ml-preprocess:latest",
        cmds=["python", "preprocess.py"],
        env_vars={"S3_BUCKET": "my-ml-data"},
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
            limits={"cpu": "1", "memory": "4Gi"},
        ),
    )

    # Step 2: Training (GPU pod)
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="mlops",
        image="myregistry.com/ml-train:latest",
        cmds=["python", "train.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",
            "S3_BUCKET": "my-ml-data",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        ),
        # match whatever label your GPU nodes actually carry
        node_selector={"nvidia.com/gpu": "true"},
        tolerations=[
            k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    )

    # Step 3: Evaluate and register model
    evaluate_and_register = KubernetesPodOperator(
        task_id="evaluate_and_register",
        name="evaluate-register",
        namespace="mlops",
        image="myregistry.com/ml-evaluate:latest",
        cmds=["python", "evaluate_and_register.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",
            "MIN_ACCURACY": "0.85",  # only promote if accuracy >= 85%
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
        ),
    )

    preprocess >> train >> evaluate_and_register

Step 4: Training Script with MLflow Tracking
# train.py
import os

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(os.environ["EXPERIMENT_NAME"])

# X_train, X_test, y_train, y_test are assumed to be loaded from the
# preprocessed splits the preprocess task wrote to S3 (loading omitted)

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2.3")

    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="weighted")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)

    # Log model (also creates a new version of the registered model)
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="production-classifier",
        input_example=X_test[:5],
    )
    print(f"Accuracy: {accuracy:.4f}")

Step 5: Promote Model to Production
# evaluate_and_register.py
import os

from mlflow.tracking import MlflowClient

# MlflowClient picks up MLFLOW_TRACKING_URI from the environment
client = MlflowClient()
min_accuracy = float(os.environ["MIN_ACCURACY"])

# Get the latest run
experiment = client.get_experiment_by_name(os.environ["EXPERIMENT_NAME"])
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["attributes.start_time DESC"],
    max_results=1,
)
latest_run = runs[0]
accuracy = latest_run.data.metrics["accuracy"]

if accuracy >= min_accuracy:
    # Find the model version that train.py registered from this run
    version = client.search_model_versions(
        f"run_id = '{latest_run.info.run_id}'"
    )[0].version
    # Transition it to Production (newer MLflow favors registry aliases,
    # but stage transitions still work)
    client.transition_model_version_stage(
        name="production-classifier",
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    print(f"Model promoted to Production (accuracy: {accuracy:.4f})")
else:
    print(f"Model NOT promoted — accuracy {accuracy:.4f} below threshold {min_accuracy}")
    raise ValueError("Model quality gate failed")

Access MLflow UI
kubectl port-forward svc/mlflow-server -n mlops 5000:5000
# Open http://localhost:5000

You'll see all experiments, runs, parameters, metrics, and registered models.
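With the port-forward still active, you can also query the registry from a local Python shell. A quick sanity check, assuming the mlflow client is installed locally:

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
# list every registered model and its latest version numbers
for rm in mlflow.search_registered_models():
    print(rm.name, [v.version for v in rm.latest_versions])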
What You Get
- Automated daily retraining via Airflow DAG
- Full experiment lineage — every run tracked in MLflow
- Quality gates — only models above accuracy threshold go to Production
- GPU training on Kubernetes with proper resource isolation
- S3-backed artifact storage — models versioned and stored durably
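Because the registry is the source of truth, serving the gated model is a one-liner with the MLflow CLI. A minimal sketch, assuming the model name from the steps above and a machine that can reach the tracking server:

export MLFLOW_TRACKING_URI=http://mlflow-server.mlops.svc.cluster.local:5000
# serves the current Production version over a local REST endpoint
mlflow models serve -m "models:/production-classifier/Production" -p 8080 --env-manager local

For real traffic you'd wrap this in a Deployment or a dedicated serving layer, but it's a quick way to smoke-test the promoted model.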