How to Set Up KubeFlow ML Pipeline on Kubernetes (2026)
Step-by-step guide to installing KubeFlow on Kubernetes and building your first ML pipeline — from cluster setup to a working training + serving workflow.
KubeFlow is one of the most complete open-source MLOps platforms for Kubernetes. It packages the entire ML workflow — experiment tracking, pipeline orchestration, model training, hyperparameter tuning, and model serving — into a single platform that runs on any Kubernetes cluster.
This guide walks through setting up KubeFlow on EKS and building a real ML pipeline from scratch.
What KubeFlow Gives You
- Pipelines — DAG-based ML workflow orchestration (think Airflow for ML)
- Notebooks — JupyterHub integrated with K8s
- Katib — Hyperparameter tuning (AutoML)
- KServe — Model serving with autoscaling
- Training Operator — Distributed training (TensorFlow, PyTorch, XGBoost)
- Central Dashboard — Web UI for all components
Prerequisites
- Kubernetes cluster >= 1.28 (EKS, GKE, or on-prem)
- kubectl configured
- At least 4 CPU cores, 12GB RAM available
- kustomize >= 5.0 installed
- kubectl >= 1.21
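Before proceeding, a quick sanity check of tool versions saves debugging later. A minimal sketch — `ver_ge` is a hypothetical helper (not part of any tool) that compares version strings with GNU `sort -V`; feed it the output of `kubectl version --client` and `kustomize version` from your machine:

```shell
# Hypothetical helper: ver_ge A B succeeds (exit 0) when version A >= version B
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Check against the minimums in the prerequisites list above
ver_ge "1.30.0" "1.28.0" && echo "cluster version OK"
ver_ge "5.4.1" "5.0.0" && echo "kustomize version OK"
```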
Step 1 — Create EKS Cluster for KubeFlow
eksctl create cluster \
--name kubeflow-cluster \
--region ap-south-1 \
--version 1.30 \
--nodegroup-name standard-workers \
--node-type m5.xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 5 \
--managed
# Verify
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# ip-192-168-1-100.ap-south-1.compute.internal   Ready   <none>   2m   v1.30.x
For GPU training later, add a GPU node group:
eksctl create nodegroup \
--cluster kubeflow-cluster \
--name gpu-workers \
--node-type g4dn.xlarge \
--nodes 0 \
--nodes-min 0 \
--nodes-max 2
Step 2 — Install KubeFlow
KubeFlow uses kustomize for installation. The full manifests are in the official kubeflow/manifests repo.
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
# Clone KubeFlow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.0 # Pin a release tag (check the repo for the latest stable)
Install All KubeFlow Components
# Install everything (takes 10-15 minutes)
while ! kustomize build example | kubectl apply -f -; do
echo "Retrying..."
sleep 20
done
# Watch pods come up
kubectl get pods -n kubeflow --watch
The retry loop is needed because some resources have ordering dependencies — CRDs must register before the objects that use them can apply. Two or three passes before everything succeeds is normal.
Verify Installation
kubectl get pods -n kubeflow | grep -v Running
# All pods should be Running after 10-15 minutes
Step 3 — Access the KubeFlow Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open: http://localhost:8080
# Default credentials:
# Email: user@example.com
# Password: 12341234
You'll see the KubeFlow Central Dashboard with all components in the left sidebar.
For production, set up proper authentication (Dex + OIDC) and an Ingress with TLS.
Step 4 — Build Your First KubeFlow Pipeline
A KubeFlow Pipeline is a DAG (directed acyclic graph) of ML steps. Each step runs as a Kubernetes pod.
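To make the DAG idea concrete, here is a stdlib-only sketch (Python 3.9+) of the dependency order the pipeline engine will follow — the step names match the three components built in this guide:

```python
# Sketch: the execution order a DAG engine derives from step dependencies.
# Each key maps a step to the steps it depends on.
from graphlib import TopologicalSorter

dag = {
    "train_model": {"preprocess_data"},
    "evaluate_model": {"preprocess_data", "train_model"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(dag).static_order())
print(order)  # preprocess first, evaluate last
```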
Install KFP SDK
pip install kfp==2.7.0
Build a Simple Training Pipeline
import kfp
from kfp import compiler  # makes kfp.compiler available for Compiler() below
from kfp import dsl
from kfp.dsl import Dataset, Model, Input, Output

# Step 1: Load and preprocess data
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas", "scikit-learn"]
)
def preprocess_data(
    raw_data_path: str,
    processed_dataset: Output[Dataset]
):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    import pickle

    df = pd.read_csv(raw_data_path)
    X = df.drop("target", axis=1)
    y = df["target"]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    with open(processed_dataset.path, 'wb') as f:
        pickle.dump({"X": X_scaled, "y": y.values}, f)
    print(f"Preprocessed {len(df)} samples")

# Step 2: Train the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def train_model(
    processed_dataset: Input[Dataset],
    trained_model: Output[Model],
    n_estimators: int = 100,
    max_depth: int = 5
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)
    X, y = data["X"], data["y"]
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X, y)
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
    with open(trained_model.path, 'wb') as f:
        pickle.dump(model, f)
    trained_model.metadata["accuracy"] = float(np.mean(cv_scores))

# Step 3: Evaluate the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def evaluate_model(
    trained_model: Input[Model],
    processed_dataset: Input[Dataset]
) -> float:
    import pickle
    from sklearn.metrics import accuracy_score

    with open(trained_model.path, 'rb') as f:
        model = pickle.load(f)
    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)
    X, y = data["X"], data["y"]
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    print(f"Final Accuracy: {accuracy:.3f}")
    return accuracy

# Define the pipeline
@dsl.pipeline(
    name="ml-training-pipeline",
    description="Preprocessing → Training → Evaluation"
)
def ml_pipeline(
    data_path: str = "s3://my-bucket/data/train.csv",
    n_estimators: int = 100,
    max_depth: int = 5
):
    preprocess_task = preprocess_data(raw_data_path=data_path)
    train_task = train_model(
        processed_dataset=preprocess_task.outputs["processed_dataset"],
        n_estimators=n_estimators,
        max_depth=max_depth
    )
    evaluate_task = evaluate_model(
        trained_model=train_task.outputs["trained_model"],
        processed_dataset=preprocess_task.outputs["processed_dataset"]
    )

# Compile and submit the pipeline
if __name__ == "__main__":
    # Compile to YAML
    kfp.compiler.Compiler().compile(
        pipeline_func=ml_pipeline,
        package_path="ml_pipeline.yaml"
    )
    # Submit to KubeFlow
    client = kfp.Client(host="http://localhost:8080")
    run = client.create_run_from_pipeline_func(
        ml_pipeline,
        arguments={
            "data_path": "s3://my-bucket/data/train.csv",
            "n_estimators": 200,
            "max_depth": 8
        }
    )
    print(f"Pipeline run ID: {run.run_id}")
Run the pipeline:
python ml_pipeline.py
You'll see the pipeline appear in the KubeFlow UI under Runs.
Step 5 — Serve the Model with KServe
Once your model is trained, deploy it for inference using KServe (part of KubeFlow):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
Save this as inference-service.yaml and apply:
kubectl apply -f inference-service.yaml
# Check status
kubectl get inferenceservice -n kubeflow-user-example-com
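Once the service reports Ready, the endpoint speaks the KServe v1 REST protocol. As a sketch, here is the same request the curl test below sends, built with Python's standard library — the URL matches the InferenceService above and resolves only from inside the cluster:

```python
# Sketch: build a KServe v1-protocol predict request with the stdlib.
import json
import urllib.request

def build_predict_request(url: str, instances: list) -> urllib.request.Request:
    """KServe v1 protocol: POST a JSON body of the form {"instances": [...]}."""
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_predict_request(
    "http://sklearn-model.kubeflow-user-example-com.svc.cluster.local"
    "/v1/models/sklearn-model:predict",
    [[1.2, 3.4, 5.6, 7.8]],
)
print(req.get_method())  # urllib infers POST because a body is set
# From a pod inside the cluster: json.loads(urllib.request.urlopen(req).read())
```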
# Test inference (the cluster-local URL resolves only from inside the cluster,
# e.g. from a debug pod; externally, go through the ingress gateway)
curl -X POST \
  http://sklearn-model.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.2, 3.4, 5.6, 7.8]]}'
Step 6 — Hyperparameter Tuning with Katib
Katib automates the search for the best hyperparameters:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-forest-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "300"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "15"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: n_estimators
        description: Number of estimators
        reference: n_estimators
      - name: max_depth
        description: Max depth
        reference: max_depth
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never  # Jobs require Never or OnFailure
            containers:
              - name: training-container
                image: my-training-image:latest
                command:
                  - python
                  - train.py
                  - --n_estimators=${trialParameters.n_estimators}
                  - --max_depth=${trialParameters.max_depth}
Save this as katib-experiment.yaml and apply:
kubectl apply -f katib-experiment.yaml
kubectl get experiment random-forest-tuning -n kubeflow-user-example-com -w
Katib runs up to 12 trials, 3 at a time, and uses Bayesian optimization to converge on the best hyperparameter combination.
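To build intuition for what Katib automates, here is the same search space explored with plain random search in pure Python — a toy sketch where a made-up `objective` function stands in for the `accuracy` metric a real trial would report:

```python
# Toy illustration of hyperparameter search over the Katib space above.
# Random search instead of Bayesian optimization; the objective is invented.
import random

random.seed(0)

def objective(n_estimators: int, max_depth: int) -> float:
    # Hypothetical score surface; a real trial trains and evaluates a model
    return 0.80 + 0.0004 * min(n_estimators, 200) + 0.005 * min(max_depth, 10)

trials = [
    {"n_estimators": random.randint(50, 300), "max_depth": random.randint(3, 15)}
    for _ in range(12)  # maxTrialCount: 12
]
best = max(trials, key=lambda p: objective(**p))
print(f"best params: {best}, score: {objective(**best):.3f}")
```

Katib replaces the sampler with Bayesian optimization, runs each trial as a Kubernetes Job, and by default parses metrics such as `accuracy=0.93` from the trial's stdout.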
Cost Optimization
Use Spot instances for training workloads — Training jobs tolerate interruption better than serving. EKS Spot nodes can cut training costs by 70%.
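As a sketch, a Spot node group in eksctl's config-file form — the group name and instance types here are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kubeflow-cluster
  region: ap-south-1
managedNodeGroups:
  - name: spot-training        # illustrative name
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true                 # request Spot capacity
    minSize: 0
    maxSize: 5
    desiredCapacity: 0         # scale from zero only when training runs
```

Create it with `eksctl create nodegroup -f <file>.yaml` and steer training pods onto it with a nodeSelector or taint/toleration.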
Scale to zero when not in use — The control plane stays running, but training pods are ephemeral. You only pay for compute when pipelines run.
Scale KServe pods to zero when idle — In KServe's serverless (Knative) mode, models can scale down to zero when there's no inference traffic; in raw Deployment mode, KEDA can provide the same behavior.
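In serverless mode this is a spec field on the predictor rather than a separate autoscaling object — a sketch reusing the InferenceService from Step 5:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    minReplicas: 0   # serverless mode: Knative scales to zero when idle
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
```

The first request after an idle period pays a cold-start penalty while the pod spins back up, so keep `minReplicas: 1` for latency-sensitive endpoints.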
Summary
| Component | What it does |
|---|---|
| KubeFlow Pipelines | Orchestrate ML steps as K8s pods |
| Notebooks | JupyterHub for interactive work |
| Katib | Hyperparameter search |
| KServe | Model inference serving |
| Training Operator | Distributed PyTorch/TensorFlow |
KubeFlow is one of the most complete open-source MLOps stacks. Setup is non-trivial — the install alone takes 10-15 minutes and brings up dozens of pods across several namespaces — but once running, it handles the entire ML lifecycle from data prep to production serving.
Deploy KubeFlow on DigitalOcean Kubernetes — $200 free credit for new accounts, enough to run a 3-node cluster for 2-3 weeks. For structured MLOps learning, KodeKloud has ML + Kubernetes labs.