
Set Up MLflow + Airflow MLOps Pipeline on Kubernetes (2026)

Build a production MLOps pipeline on Kubernetes using MLflow for experiment tracking and model registry, and Apache Airflow for pipeline orchestration. Full setup guide.

DevOpsBoys · Apr 30, 2026 · 4 min read

MLflow tracks experiments and manages models. Airflow orchestrates the training pipeline. Together on Kubernetes, they give you a production-grade MLOps platform. Here's the full setup.


Architecture Overview

[Data Sources] → [Airflow DAG] → [Training Job] → [MLflow Tracking]
                                       ↓
                              [MLflow Model Registry]
                                       ↓
                              [Model Serving / Inference]
  • Airflow orchestrates: data ingestion → preprocessing → training → evaluation → registration
  • MLflow tracks: parameters, metrics, artifacts, models
  • Kubernetes runs it all with GPU support and auto-scaling

Step 1: Install MLflow on Kubernetes

Create namespace and dependencies:

bash
kubectl create namespace mlops
 
# PostgreSQL for MLflow backend (tracking server metadata)
helm repo add bitnami https://charts.bitnami.com/bitnami
 
helm install mlflow-postgres bitnami/postgresql \
  --namespace mlops \
  --set auth.database=mlflow \
  --set auth.username=mlflow \
  --set auth.password=mlflowpassword \
  --set primary.persistence.size=20Gi
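
One gotcha before wiring this password into MLflow's `--backend-store-uri`: the URI is parsed by SQLAlchemy, so any special characters in the password must be URL-encoded or the tracking server fails to connect. A small stdlib helper (host and database names match the chart values above):

```python
from urllib.parse import quote_plus

def backend_store_uri(user: str, password: str, host: str, db: str, port: int = 5432) -> str:
    """Build a SQLAlchemy-style Postgres URI, URL-encoding the password."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{db}"

# Matches the Helm values used above
print(backend_store_uri("mlflow", "mlflowpassword", "mlflow-postgres", "mlflow"))
# postgresql://mlflow:mlflowpassword@mlflow-postgres:5432/mlflow

# A password with special characters gets encoded safely
print(backend_store_uri("mlflow", "p@ss/word", "mlflow-postgres", "mlflow"))
# postgresql://mlflow:p%40ss%2Fword@mlflow-postgres:5432/mlflow
```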

MLflow deployment with S3 artifact store:

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow-server
  template:
    metadata:
      labels:
        app: mlflow-server
    spec:
      serviceAccountName: mlflow-sa  # needs S3 access via IRSA
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest  # base image lacks psycopg2/boto3 — build a custom image adding psycopg2-binary + boto3, and pin a version
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
        - --backend-store-uri=postgresql://mlflow:mlflowpassword@mlflow-postgres:5432/mlflow
        - --default-artifact-root=s3://my-mlflow-bucket/artifacts
        - --serve-artifacts
        ports:
        - containerPort: 5000
        env:
        - name: AWS_DEFAULT_REGION
          value: us-east-1
---
apiVersion: v1
kind: Service
metadata:
  name: mlflow-server
  namespace: mlops
spec:
  selector:
    app: mlflow-server
  ports:
  - port: 5000
    targetPort: 5000

S3 bucket for artifacts:

bash
aws s3 mb s3://my-mlflow-bucket
 
# IRSA for MLflow pod to access S3
eksctl create iamserviceaccount \
  --name mlflow-sa \
  --namespace mlops \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess \
  --approve
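
For reference, with `--default-artifact-root=s3://my-mlflow-bucket/artifacts`, MLflow's default layout puts each run's artifacts under `<root>/<experiment_id>/<run_id>/artifacts/`. Knowing that path shape is useful if you want to scope the IAM policy tighter than `AmazonS3FullAccess`. A sketch (the experiment and run IDs here are hypothetical):

```python
def artifact_path(root: str, experiment_id: str, run_id: str, rel: str = "model") -> str:
    """Predict the S3 location MLflow uses for a run's artifacts (default layout)."""
    return f"{root}/{experiment_id}/{run_id}/artifacts/{rel}"

print(artifact_path("s3://my-mlflow-bucket/artifacts", "1", "abc123"))
# s3://my-mlflow-bucket/artifacts/1/abc123/artifacts/model
```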

Step 2: Install Apache Airflow with Helm

bash
helm repo add apache-airflow https://airflow.apache.org
helm repo update
 
helm install airflow apache-airflow/airflow \
  --namespace mlops \
  --set executor=KubernetesExecutor \
  --set dags.gitSync.enabled=true \
  --set dags.gitSync.repo=https://github.com/myorg/ml-dags \
  --set dags.gitSync.branch=main \
  --set dags.gitSync.subPath=dags \
  --set webserver.service.type=ClusterIP

KubernetesExecutor is important for MLOps — each task runs in its own pod with custom resources (GPUs for training, small pods for preprocessing).
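
Since every KubernetesPodOperator task declares its own `container_resources`, one pattern is to centralize the sizing in a small tier map and reference it from each task instead of repeating the numbers. A minimal sketch with plain dicts — the tier names are made up, and the values mirror the DAG in the next step:

```python
# Hypothetical resource tiers shared across pipeline tasks.
RESOURCE_TIERS = {
    "small": {"requests": {"cpu": "500m", "memory": "2Gi"},
              "limits":   {"cpu": "1",    "memory": "4Gi"}},
    "gpu":   {"requests": {"cpu": "2", "memory": "8Gi",  "nvidia.com/gpu": "1"},
              "limits":   {"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"}},
}

def tier(name: str) -> dict:
    """Look up a tier; fail fast on typos rather than scheduling a wrong-sized pod."""
    if name not in RESOURCE_TIERS:
        raise KeyError(f"unknown resource tier: {name}")
    return RESOURCE_TIERS[name]
```

In the DAG you would then pass e.g. `k8s.V1ResourceRequirements(**tier("gpu"))` to the training task.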


Step 3: Write the Training DAG

python
# dags/train_model.py
from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
from datetime import datetime, timedelta
from kubernetes.client import models as k8s
 
default_args = {
    "owner": "mlops-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}
 
with DAG(
    dag_id="train_model_pipeline",
    default_args=default_args,
    schedule="0 2 * * *",  # daily at 2am (older Airflow versions use schedule_interval=)
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:
 
    # Step 1: Data preprocessing (small pod)
    preprocess = KubernetesPodOperator(
        task_id="preprocess_data",
        name="preprocess-data",
        namespace="mlops",
        image="myregistry.com/ml-preprocess:latest",
        cmds=["python", "preprocess.py"],
        env_vars={"S3_BUCKET": "my-ml-data"},
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
            limits={"cpu": "1", "memory": "4Gi"},
        ),
    )
 
    # Step 2: Training (GPU pod)
    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="mlops",
        image="myregistry.com/ml-train:latest",
        cmds=["python", "train.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",
            "S3_BUCKET": "my-ml-data",
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "2", "memory": "8Gi", "nvidia.com/gpu": "1"},
            limits={"cpu": "4", "memory": "16Gi", "nvidia.com/gpu": "1"},
        ),
        node_selector={"nvidia.com/gpu": "true"},
        tolerations=[
            k8s.V1Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
        ],
    )
 
    # Step 3: Evaluate and register model
    evaluate_and_register = KubernetesPodOperator(
        task_id="evaluate_and_register",
        name="evaluate-register",
        namespace="mlops",
        image="myregistry.com/ml-evaluate:latest",
        cmds=["python", "evaluate_and_register.py"],
        env_vars={
            "MLFLOW_TRACKING_URI": "http://mlflow-server.mlops.svc.cluster.local:5000",
            "EXPERIMENT_NAME": "production-model",  # read by evaluate_and_register.py
            "MIN_ACCURACY": "0.85",  # only promote if accuracy >= 85%
        },
        container_resources=k8s.V1ResourceRequirements(
            requests={"cpu": "500m", "memory": "2Gi"},
        ),
    )
 
    preprocess >> train >> evaluate_and_register
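
The pipeline fires on the cron line `0 2 * * *`. As a quick sanity check of what that means, the next 2 a.m. firing time can be computed with the stdlib alone — a hypothetical helper, ignoring Airflow's data-interval semantics:

```python
from datetime import datetime, timedelta

def next_daily_run(now: datetime, hour: int = 2) -> datetime:
    """Next occurrence of `hour`:00 after `now` — matches a '0 2 * * *' cron line."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

print(next_daily_run(datetime(2026, 1, 1, 14, 30)))  # 2026-01-02 02:00:00
```

Note that Airflow itself triggers a run at the *end* of each schedule interval, so the first run lands after one full interval has passed.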

Step 4: Training Script with MLflow Tracking

python
# train.py
import os

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(os.environ["EXPERIMENT_NAME"])

# Load the train/test splits written by the preprocessing task.
# load_splits() is a placeholder — read the data however preprocess.py stored it.
X_train, X_test, y_train, y_test = load_splits(os.environ["S3_BUCKET"])

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2.3")

    # Train
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Evaluate
    preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds, average="weighted")
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("f1_score", f1)
 
    # Log model
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="production-classifier",
        input_example=X_test[:5],
    )
 
    print(f"Accuracy: {accuracy:.4f}")
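
Because `log_model` passed `registered_model_name`, downstream code can load the model by a `models:/` URI instead of a raw S3 path. A tiny helper to build those URIs (the version number below is illustrative):

```python
def model_uri(name: str, version_or_stage) -> str:
    """Build an MLflow 'models:/' URI, e.g. for mlflow.pyfunc.load_model()."""
    return f"models:/{name}/{version_or_stage}"

print(model_uri("production-classifier", "Production"))  # models:/production-classifier/Production
print(model_uri("production-classifier", 3))             # models:/production-classifier/3
```

An inference service would then call something like `mlflow.pyfunc.load_model(model_uri("production-classifier", "Production"))` to pull whatever version is currently promoted.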

Step 5: Promote Model to Production

python
# evaluate_and_register.py
import os

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
client = MlflowClient()
min_accuracy = float(os.environ["MIN_ACCURACY"])

# Get the latest run
experiment = client.get_experiment_by_name(os.environ["EXPERIMENT_NAME"])
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    order_by=["start_time DESC"],
    max_results=1,
)

latest_run = runs[0]
accuracy = latest_run.data.metrics["accuracy"]

# Find the model version that log_model registered from this run
versions = client.search_model_versions("name='production-classifier'")
version = next(v.version for v in versions if v.run_id == latest_run.info.run_id)

if accuracy >= min_accuracy:
    # Transition to Production (newer MLflow prefers aliases over stages)
    client.transition_model_version_stage(
        name="production-classifier",
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    print(f"Model promoted to Production (accuracy: {accuracy:.4f})")
else:
    print(f"Model NOT promoted — accuracy {accuracy:.4f} below threshold {min_accuracy}")
    raise ValueError("Model quality gate failed")
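
The promotion gate itself boils down to a pure comparison, so it's worth keeping it as a small function you can unit-test without a tracking server. A minimal sketch mirroring the script above, including the string-to-float parsing of the env var:

```python
import os

def passes_gate(accuracy: float, min_accuracy: float) -> bool:
    """Quality gate: promote only when accuracy meets the threshold."""
    return accuracy >= min_accuracy

# The threshold arrives as a string env var, exactly as set in the DAG
os.environ.setdefault("MIN_ACCURACY", "0.85")
threshold = float(os.environ["MIN_ACCURACY"])
print(passes_gate(0.91, threshold))  # True
print(passes_gate(0.80, threshold))  # False
```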

Access MLflow UI

bash
kubectl port-forward svc/mlflow-server -n mlops 5000:5000
# Open http://localhost:5000

You'll see all experiments, runs, parameters, metrics, and registered models.


What You Get

  • Automated daily retraining via Airflow DAG
  • Full experiment lineage — every run tracked in MLflow
  • Quality gates — only models above accuracy threshold go to Production
  • GPU training on Kubernetes with proper resource isolation
  • S3-backed artifact storage — models versioned and stored durably