
How to Set Up KubeFlow ML Pipeline on Kubernetes (2026)

Step-by-step guide to installing KubeFlow on Kubernetes and building your first ML pipeline — from cluster setup to a working training + serving workflow.

DevOpsBoys · Apr 20, 2026 · 6 min read

KubeFlow is the most complete MLOps platform for Kubernetes. It packages the entire ML workflow — experiment tracking, pipeline orchestration, model training, hyperparameter tuning, and model serving — into a single platform that runs on any Kubernetes cluster.

This guide walks through setting up KubeFlow on EKS and building a real ML pipeline from scratch.

What KubeFlow Gives You

  • Pipelines — DAG-based ML workflow orchestration (think Airflow for ML)
  • Notebooks — JupyterHub integrated with K8s
  • Katib — Hyperparameter tuning (AutoML)
  • KServe — Model serving with autoscaling
  • Training Operator — Distributed training (TensorFlow, PyTorch, XGBoost)
  • Central Dashboard — Web UI for all components

Prerequisites

  • Kubernetes cluster >= 1.28 (EKS, GKE, or on-prem)
  • kubectl >= 1.28, configured against the cluster (keep it within one minor version of the control plane)
  • At least 4 CPU cores and 12GB RAM available
  • kustomize >= 5.0 installed

Step 1 — Create EKS Cluster for KubeFlow

```bash
eksctl create cluster \
  --name kubeflow-cluster \
  --region ap-south-1 \
  --version 1.30 \
  --nodegroup-name standard-workers \
  --node-type m5.xlarge \
  --nodes 3 \
  --nodes-min 2 \
  --nodes-max 5 \
  --managed

# Verify
kubectl get nodes
# NAME                                          STATUS   ROLES    AGE   VERSION
# ip-192-168-1-100.ap-south-1.compute.internal  Ready    <none>   2m    v1.30.x
```

For GPU training later, add a GPU node group:

```bash
eksctl create nodegroup \
  --cluster kubeflow-cluster \
  --name gpu-workers \
  --node-type g4dn.xlarge \
  --nodes 0 \
  --nodes-min 0 \
  --nodes-max 2
```

Step 2 — Install KubeFlow

KubeFlow uses kustomize for installation. The full manifests are in the official kubeflow/manifests repo.

```bash
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/

# Clone KubeFlow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.0   # Pin a stable release tag
```

Install All KubeFlow Components

```bash
# Install everything (takes 10-15 minutes)
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying..."
  sleep 20
done

# Watch pods come up
kubectl get pods -n kubeflow --watch
```

The retry loop is needed because some resources have ordering dependencies — CRDs and webhooks must exist before the objects that reference them can be applied. The loop typically succeeds after 2-3 passes.

Verify Installation

```bash
kubectl get pods -n kubeflow | grep -v Running
# All pods should be Running after 10-15 minutes,
# so this command should return only the header line
```

Step 3 — Access the KubeFlow Dashboard

```bash
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Open: http://localhost:8080
# Default credentials:
# Email: user@example.com
# Password: 12341234
```

You'll see the KubeFlow Central Dashboard with all components in the left sidebar.

For production, set up proper authentication (Dex + OIDC) and an Ingress with TLS.


Step 4 — Build Your First KubeFlow Pipeline

A KubeFlow Pipeline is a DAG (directed acyclic graph) of ML steps. Each step runs as a Kubernetes pod.

Install KFP SDK

```bash
pip install kfp==2.7.0
```

Build a Simple Training Pipeline

```python
import kfp
from kfp import compiler, dsl
from kfp.dsl import Dataset, Model, Input, Output

# Step 1: Load and preprocess data
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas", "scikit-learn"]
)
def preprocess_data(
    raw_data_path: str,
    processed_dataset: Output[Dataset]
):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    import pickle

    df = pd.read_csv(raw_data_path)
    X = df.drop("target", axis=1)
    y = df["target"]

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    with open(processed_dataset.path, 'wb') as f:
        pickle.dump({"X": X_scaled, "y": y.values}, f)

    print(f"Preprocessed {len(df)} samples")


# Step 2: Train the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def train_model(
    processed_dataset: Input[Dataset],
    trained_model: Output[Model],
    n_estimators: int = 100,
    max_depth: int = 5
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)

    X, y = data["X"], data["y"]

    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X, y)

    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")

    with open(trained_model.path, 'wb') as f:
        pickle.dump(model, f)

    trained_model.metadata["accuracy"] = float(np.mean(cv_scores))


# Step 3: Evaluate the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def evaluate_model(
    trained_model: Input[Model],
    processed_dataset: Input[Dataset]
) -> float:
    import pickle
    from sklearn.metrics import accuracy_score

    with open(trained_model.path, 'rb') as f:
        model = pickle.load(f)

    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)

    X, y = data["X"], data["y"]
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)

    print(f"Final Accuracy: {accuracy:.3f}")
    return accuracy


# Define the pipeline
@dsl.pipeline(
    name="ml-training-pipeline",
    description="Preprocessing → Training → Evaluation"
)
def ml_pipeline(
    data_path: str = "s3://my-bucket/data/train.csv",
    n_estimators: int = 100,
    max_depth: int = 5
):
    preprocess_task = preprocess_data(raw_data_path=data_path)

    train_task = train_model(
        processed_dataset=preprocess_task.outputs["processed_dataset"],
        n_estimators=n_estimators,
        max_depth=max_depth
    )

    evaluate_task = evaluate_model(
        trained_model=train_task.outputs["trained_model"],
        processed_dataset=preprocess_task.outputs["processed_dataset"]
    )


# Compile and submit the pipeline
if __name__ == "__main__":
    # Compile to YAML
    compiler.Compiler().compile(
        pipeline_func=ml_pipeline,
        package_path="ml_pipeline.yaml"
    )

    # Submit to KubeFlow (assumes the port-forward from Step 3;
    # clusters behind Dex auth also need a session cookie/token)
    client = kfp.Client(host="http://localhost:8080")
    run = client.create_run_from_pipeline_func(
        ml_pipeline,
        arguments={
            "data_path": "s3://my-bucket/data/train.csv",
            "n_estimators": 200,
            "max_depth": 8
        }
    )
    print(f"Pipeline run ID: {run.run_id}")
```

Run the pipeline:

```bash
python ml_pipeline.py
```

You'll see the pipeline appear in the KubeFlow UI under Runs.


Step 5 — Serve the Model with KServe

Once your model is trained, deploy it for inference using KServe (part of KubeFlow):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
```

```bash
kubectl apply -f inference-service.yaml

# Check status
kubectl get inferenceservice -n kubeflow-user-example-com

# Test inference (the .svc.cluster.local address only resolves from
# inside the cluster; from outside, go through the ingress gateway)
curl -X POST \
  http://sklearn-model.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.2, 3.4, 5.6, 7.8]]}'
```
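If you'd rather call the endpoint from Python than curl, here is a minimal stdlib-only sketch of a client for the KServe v1 inference protocol. The URL is an assumption (it presumes you've port-forwarded the service or ingress to localhost:8080); the `{"instances": [...]}` / `{"predictions": [...]}` shapes follow the v1 protocol used above.

```python
import json
import urllib.request

def build_payload(rows):
    """KServe v1 protocol request body: {"instances": [...]}."""
    return json.dumps({"instances": rows}).encode()

def extract_predictions(body: dict):
    """KServe v1 protocol responses wrap results in a 'predictions' key."""
    return body["predictions"]

def predict(url: str, rows):
    """POST a batch of feature rows and return the model's predictions."""
    req = urllib.request.Request(
        url,
        data=build_payload(rows),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_predictions(json.load(resp))

# Example usage (requires a live endpoint, e.g. via kubectl port-forward):
#   predict("http://localhost:8080/v1/models/sklearn-model:predict",
#           [[1.2, 3.4, 5.6, 7.8]])
```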

Step 6 — Hyperparameter Tuning with Katib

Katib automates the search for the best hyperparameters:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-forest-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
  - name: n_estimators
    parameterType: int
    feasibleSpace:
      min: "50"
      max: "300"
  - name: max_depth
    parameterType: int
    feasibleSpace:
      min: "3"
      max: "15"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
    - name: n_estimators
      description: Number of estimators
      reference: n_estimators
    - name: max_depth
      description: Max depth
      reference: max_depth
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never   # required for a batch/v1 Job pod template
            containers:
            - name: training-container
              image: my-training-image:latest
              command:
              - python
              - train.py
              - --n_estimators=${trialParameters.n_estimators}
              - --max_depth=${trialParameters.max_depth}
```

```bash
kubectl apply -f katib-experiment.yaml
kubectl get experiment random-forest-tuning -n kubeflow-user-example-com -w
```

Katib runs up to 12 trials, 3 at a time, and uses Bayesian optimization to converge on the best hyperparameter combination.
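The trialSpec above assumes `my-training-image` contains a `train.py` that accepts those two flags and prints the objective metric in the `metric=value` form that Katib's default stdout metrics collector parses. A minimal sketch, with synthetic data standing in for a real training set (the script name, flags, and dataset are illustrative):

```python
import argparse

def parse_args(argv=None):
    # Flag names must match the trialParameters Katib substitutes in
    p = argparse.ArgumentParser()
    p.add_argument("--n_estimators", type=int, default=100)
    p.add_argument("--max_depth", type=int, default=5)
    return p.parse_args(argv)

def main():
    args = parse_args()

    # Synthetic classification data for illustration only
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=42)
    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        random_state=42,
    )
    accuracy = cross_val_score(model, X, y, cv=5).mean()

    # Katib's stdout metrics collector looks for "<objectiveMetricName>=<value>"
    print(f"accuracy={accuracy:.4f}")

# In the real train.py, finish with:
#     if __name__ == "__main__":
#         main()
```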


Cost Optimization

Use Spot instances for training workloads — Training jobs tolerate interruption better than serving does. EKS Spot nodes can cut training compute costs by as much as 70%.

Scale to zero when not in use — The control plane stays running, but training pods are ephemeral. You only pay for compute when pipelines run.

Scale KServe pods to zero — In KServe's serverless mode (backed by Knative), an InferenceService can scale down to zero replicas when there's no inference traffic; set minReplicas: 0 on the predictor.
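As a sketch, the predictor section of the Step 5 InferenceService with scale-to-zero enabled (field names assume KServe's serverless/Knative deployment mode; the concurrency target is an illustrative value):

```yaml
spec:
  predictor:
    minReplicas: 0            # allow scale-to-zero when idle
    scaleMetric: concurrency  # autoscale on in-flight requests
    scaleTarget: 1            # target concurrent requests per replica
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
```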


Summary

  • KubeFlow Pipelines — Orchestrate ML steps as K8s pods
  • Notebooks — JupyterHub for interactive work
  • Katib — Hyperparameter search
  • KServe — Model inference serving
  • Training Operator — Distributed PyTorch/TensorFlow

KubeFlow is the most complete open-source MLOps stack. The setup is non-trivial (10-15 mins, many components) but once running, it handles the entire ML lifecycle from data prep to production serving.

Deploy KubeFlow on DigitalOcean Kubernetes — $200 free credit for new accounts, enough to run a 3-node cluster for 2-3 weeks. For structured MLOps learning, KodeKloud has ML + Kubernetes labs.
