
Set Up MLflow + Airflow MLOps Pipeline on Kubernetes (2026)

Build a production MLOps pipeline on Kubernetes using MLflow for experiment tracking and model registry, and Apache Airflow for pipeline orchestration. Full setup guide.

DevOpsBoys · Apr 30, 2026 · 4 min read

MLflow tracks experiments and manages models. Airflow orchestrates the training pipeline. Together on Kubernetes, they give you a production-grade MLOps platform. Here's the full setup.


Architecture Overview

[Data Sources] → [Airflow DAG] → [Training Job] → [MLflow Tracking]
                                       ↓
                              [MLflow Model Registry]
                                       ↓
                              [Model Serving / Inference]
  • Airflow orchestrates: data ingestion → preprocessing → training → evaluation → registration
  • MLflow tracks: parameters, metrics, artifacts, models
  • Kubernetes runs it all with GPU support and auto-scaling

Step 1: Install MLflow on Kubernetes

Create namespace and dependencies:

bash
kubectl create namespace mlops
 
# PostgreSQL for MLflow backend (tracking server metadata)
helm repo add bitnami https://charts.bitnami.com/bitnami
 
helm install mlflow-postgres bitnami/postgresql \
  --namespace mlops \
  --set auth.database=mlflow \
  --set auth.username=mlflow \
  --set auth.password=mlflowpassword \
  --set primary.persistence.size=20Gi
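
One gotcha before wiring this password into MLflow's `--backend-store-uri`: the URI is parsed by SQLAlchemy, so any special characters in the password must be URL-encoded or the tracking server fails to connect. A small stdlib helper (host and database names match the chart values above):

```python
from urllib.parse import quote_plus

def backend_store_uri(user: str, password: str, host: str, db: str, port: int = 5432) -> str:
    """Build a SQLAlchemy-style Postgres URI, URL-encoding the password."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

# Matches the Helm values used above
print(backend_store_uri("mlflow", "mlflowpassword", "mlflow-postgres", "mlflow"))
# postgresql://mlflow:mlflowpassword@mlflow-postgres:5432/mlflow

# A password with special characters gets encoded safely
print(backend_store_uri("mlflow", "p@ss/word", "mlflow-postgres", "mlflow"))
# postgresql://mlflow:p%40ss%2Fword@mlflow-postgres:5432/mlflow
```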

MLflow deployment with S3 artifact store:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-server
  template:
    metadata:
      labels:
        app: mlflow-server
    spec:
      serviceAccountName: mlflow-sa  # needs S3 access via IRSA
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest  # base image lacks psycopg2/boto3 — build a custom image adding psycopg2-binary + boto3, and pin a version
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
        - --backend-store-uri=postgresql://mlflow:mlflowpassword@mlflow-postgres:5432/mlflow
        - --default-artifact-root=s3://my-mlflow-bucket/artifacts
        - --serve-artifacts
        ports:
        - containerPort: 5000
        env:
        - name: AWS_DEFAULT_REGION
          value: us-east-1
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  selector:
    app: mlflow-server
  ports:
  - port: 5000
    targetPort: 5000

S3 bucket for artifacts:

bash
aws s3 mb s3://my-mlflow-bucket
 
# IRSA for MLflow pod to access S3
eksctl create iamserviceaccount \
  --name mlflow-sa \
  --namespace mlops \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
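
For reference, with `--default-artifact-root=s3://my-mlflow-bucket/artifacts`, MLflow's default layout puts each run's artifacts under `<root>/<experiment_id>/<run_id>/artifacts/`. Knowing that path shape is useful if you want to scope the IAM policy tighter than `AmazonS3FullAccess`. A sketch (the experiment and run IDs here are hypothetical):

```python
def artifact_path(root: str, experiment_id: str, run_id: str, rel: str = "model") -> str:
    """Predict the S3 location MLflow uses for a run's artifacts (default layout)."""
    return f"{root}/{experiment_id}/{run_id}/artifacts/{rel}"

print(artifact_path("s3://my-mlflow-bucket/artifacts", "1", "abc123"))
# s3://my-mlflow-bucket/artifacts/1/abc123/artifacts/model
```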

Step 2: Install Apache Airflow with Helm

bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
 
helm install airflow apache-airflow/airflow \
  --namespace mlops \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo=https://github.com/myorg/ml-dags \
  --set dags.gitSync.branch=main \
  --set dags.gitSync.subPath=dags \
  --set webserver.service.type=ClusterIP

KubernetesExecutor is important for MLOps — each task runs in its own pod with custom resources (GPUs for training, small pods for preprocessing).
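
Since every KubernetesPodOperator task declares its own `container_resources`, one pattern is to centralize the sizing in a small tier map and reference it from each task instead of repeating the numbers. A minimal sketch with plain dicts — the tier names are made up, and the values mirror the DAG in the next step:

```python
# Hypothetical resource tiers shared across pipeline tasks.
RESOURCE_TIERS = {
    "small": {"requests": {"cpu": "500m", "memory": "2Gi"},
              "limits":   {"cpu": "1",    "memory": "4Gi"}},
    "gpu":   {"requests": {"cpu": "2", "memory": "8Gi",  "nvidia.com/gpu": "1"},
              "limits":   {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"}},
}

def tier(name: str) -> dict:
    """Look up a tier; fail fast on typos rather than scheduling a wrong-sized pod."""
    if name not in RESOURCE_TIERS:
        raise KeyError(f"unknown resource tier: {name}")
    return RESOURCE_TIERS[name]
```

In the DAG you would then pass e.g. `k8s.V1ResourceRequirements(**tier("gpu"))` to the training task.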


Step 3: Write the Training DAG

python
# dags/train_model.py
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime, timedelta
from kubernetes.client import models as k8s
 
default_args = {
    "owner": "mlops-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
 
with DAG(
    dag_id="train_model_pipeline",
    default_args=default_args,
    schedule="0 2 * * *",  # daily at 2am (older Airflow versions use schedule_interval=)
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
 
    # Step 1: Data preprocessing (small pod)
    preprocess = KubernetesPodOperator(
        task_id="preprocess_data",
        name="preprocess-data",
        namespace="mlops",
        image="myregistry.com/ml-preprocess:latest",
        cmds=["python", "preprocess.py"],
        env_vars={"S3_BUCKET": "my-ml-data"},
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
            limits={"cpu": "1", "memory": "4Gi"},
        ),
    )
 
    # Step 2: Training (GPU pod)
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="mlops",
        image="myregistry.com/ml-train:latest",
        cmds=["python", "train.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",
            "S3_BUCKET": "my-ml-data",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        ),
        node_selector={"nvidia.com/gpu": "true"},
        tolerations=[
            k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    )
 
    # Step 3: Evaluate and register model
    evaluate_and_register = KubernetesPodOperator(
        task_id="evaluate_and_register",
        name="evaluate-register",
        namespace="mlops",
        image="myregistry.com/ml-evaluate:latest",
        cmds=["python", "evaluate_and_register.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",  # read by evaluate_and_register.py
            "MIN_ACCURACY": "0.85",  # only promote if accuracy >= 85%
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
        ),
    )
 
    preprocess >> train >> evaluate_and_register
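
The pipeline fires on the cron line `0 2 * * *`. As a quick sanity check of what that means, the next 2 a.m. firing time can be computed with the stdlib alone — a hypothetical helper, ignoring Airflow's data-interval semantics:

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int = 2) -> datetime:
    """Next occurrence of `hour`:00 after `now` — matches a '0 2 * * *' cron line."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2026, 1, 1, 14, 30)))  # 2026-01-02 02:00:00
```

Note that Airflow itself triggers a run at the *end* of each schedule interval, so the first run lands after one full interval has passed.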

Step 4: Training Script with MLflow Tracking

python
# train.py
import os

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(os.environ["EXPERIMENT_NAME"])

# Load the train/test splits written by the preprocessing task.
# load_splits() is a placeholder — read the data however preprocess.py stored it.
X_train, X_test, y_train, y_test = load_splits(os.environ["S3_BUCKET"])

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2.3")

    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="weighted")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
 
    # Log model
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="production-classifier",
        input_example=X_test[:5],
    )
 
    print(f"Accuracy: {accuracy:.4f}")
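
Because `log_model` passed `registered_model_name`, downstream code can load the model by a `models:/` URI instead of a raw S3 path. A tiny helper to build those URIs (the version number below is illustrative):

```python
def model_uri(name: str, version_or_stage) -> str:
    """Build an MLflow 'models:/' URI, e.g. for mlflow.pyfunc.load_model()."""
    return f"models:/{name}/{version_or_stage}"

print(model_uri("production-classifier", "Production"))  # models:/production-classifier/Production
print(model_uri("production-classifier", 3))             # models:/production-classifier/3
```

An inference service would then call something like `mlflow.pyfunc.load_model(model_uri("production-classifier", "Production"))` to pull whatever version is currently promoted.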

Step 5: Promote Model to Production

python
# evaluate_and_register.py
import os

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
client = MlflowClient()
min_accuracy = float(os.environ["MIN_ACCURACY"])

# Get the latest run
experiment = client.get_experiment_by_name(os.environ["EXPERIMENT_NAME"])
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1,
)

latest_run = runs[0]
accuracy = latest_run.data.metrics["accuracy"]

# Find the model version that log_model registered from this run
versions = client.search_model_versions("name='production-classifier'")
version = next(v.version for v in versions if v.run_id == latest_run.info.run_id)

if accuracy >= min_accuracy:
    # Transition to Production (newer MLflow prefers aliases over stages)
    client.transition_model_version_stage(
        name="production-classifier",
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    print(f"Model promoted to Production (accuracy: {accuracy:.4f})")
else:
    print(f"Model NOT promoted — accuracy {accuracy:.4f} below threshold {min_accuracy}")
    raise ValueError("Model quality gate failed")
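
The promotion gate itself boils down to a pure comparison, so it's worth keeping it as a small function you can unit-test without a tracking server. A minimal sketch mirroring the script above, including the string-to-float parsing of the env var:

```python
import os

def passes_gate(accuracy: float, min_accuracy: float) -> bool:
    """Quality gate: promote only when accuracy meets the threshold."""
    return accuracy >= min_accuracy

# The threshold arrives as a string env var, exactly as set in the DAG
os.environ.setdefault("MIN_ACCURACY", "0.85")
threshold = float(os.environ["MIN_ACCURACY"])
print(passes_gate(0.91, threshold))  # True
print(passes_gate(0.80, threshold))  # False
```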

Access MLflow UI

bash
kubectl port-forward svc/mlflow-server -n mlops 5000:5000
# Open http://localhost:5000

You'll see all experiments, runs, parameters, metrics, and registered models.


What You Get

  • Automated daily retraining via Airflow DAG
  • Full experiment lineage — every run tracked in MLflow
  • Quality gates — only models above accuracy threshold go to Production
  • GPU training on Kubernetes with proper resource isolation
  • S3-backed artifact storage — models versioned and stored durably