How to Set Up KubeFlow ML Pipeline on Kubernetes (2026)
Step-by-step guide to installing KubeFlow on Kubernetes and building your first ML pipeline — from cluster setup to a working training + serving workflow.
KubeFlow is one of the most complete open-source MLOps platforms for Kubernetes. It packages the entire ML workflow — experiment tracking, pipeline orchestration, model training, hyperparameter tuning, and model serving — into a single platform that runs on any Kubernetes cluster.
This guide walks through setting up KubeFlow on EKS and building a real ML pipeline from scratch.
What KubeFlow Gives You
- Pipelines — DAG-based ML workflow orchestration (think Airflow for ML)
- Notebooks — JupyterHub integrated with K8s
- Katib — Hyperparameter tuning (AutoML)
- KServe — Model serving with autoscaling
- Training Operator — Distributed training (TensorFlow, PyTorch, XGBoost)
- Central Dashboard — Web UI for all components
Prerequisites
- Kubernetes cluster >= 1.28 (EKS, GKE, or on-prem)
- kubectl configured
- At least 4 CPU cores, 12GB RAM available
- kustomize >= 5.0 installed
- kubectl >= 1.21
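Before proceeding, a quick sanity check of tool versions saves debugging later. A minimal sketch — `ver_ge` is a hypothetical helper (not part of any tool) that compares version strings with GNU `sort -V`; feed it the output of `kubectl version --client` and `kustomize version` from your machine:

```shell
# Hypothetical helper: ver_ge A B succeeds (exit 0) when version A >= version B
ver_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Check against the minimums in the prerequisites list above
ver_ge "1.30.0" "1.28.0" && echo "cluster version OK"
ver_ge "5.4.1" "5.0.0" && echo "kustomize version OK"
```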
Step 1 — Create EKS Cluster for KubeFlow
eksctl create cluster \
--name kubeflow-cluster \
--region ap-south-1 \
--version 1.30 \
--nodegroup-name standard-workers \
--node-type m5.xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 5 \
--managed
# Verify
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# ip-192-168-1-100.ap-south-1.compute.internal   Ready   <none>   2m   v1.30.x
For GPU training later, add a GPU node group:
eksctl create nodegroup \
--cluster kubeflow-cluster \
--name gpu-workers \
--node-type g4dn.xlarge \
--nodes 0 \
--nodes-min 0 \
--nodes-max 2
Step 2 — Install KubeFlow
KubeFlow uses kustomize for installation. The full manifests are in the official kubeflow/manifests repo.
# Install kustomize
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
# Clone KubeFlow manifests
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.9.0 # Pin a release tag (check the repo for the latest stable)
Install All KubeFlow Components
# Install everything (takes 10-15 minutes)
while ! kustomize build example | kubectl apply -f -; do
echo "Retrying..."
sleep 20
done
# Watch pods come up
kubectl get pods -n kubeflow --watch
The retry loop is needed because some resources have ordering dependencies — CRDs must register before the objects that use them can apply. Two or three passes before everything succeeds is normal.
Verify Installation
kubectl get pods -n kubeflow | grep -v Running
# All pods should be Running after 10-15 minutes
Step 3 — Access the KubeFlow Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
# Open: http://localhost:8080
# Default credentials:
# Email: user@example.com
# Password: 12341234
You'll see the KubeFlow Central Dashboard with all components in the left sidebar.
For production, set up proper authentication (Dex + OIDC) and an Ingress with TLS.
Step 4 — Build Your First KubeFlow Pipeline
A KubeFlow Pipeline is a DAG (directed acyclic graph) of ML steps. Each step runs as a Kubernetes pod.
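To make the DAG idea concrete, here is a stdlib-only sketch (Python 3.9+) of the dependency order the pipeline engine will follow — the step names match the three components built in this guide:

```python
# Sketch: the execution order a DAG engine derives from step dependencies.
# Each key maps a step to the steps it depends on.
from graphlib import TopologicalSorter

dag = {
    "train_model": {"preprocess_data"},
    "evaluate_model": {"preprocess_data", "train_model"},
}

# static_order() yields every step after all of its dependencies
order = list(TopologicalSorter(dag).static_order())
print(order)  # preprocess first, evaluate last
```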
Install KFP SDK
pip install kfp==2.7.0
Build a Simple Training Pipeline
import kfp
from kfp import compiler  # makes kfp.compiler available for Compiler() below
from kfp import dsl
from kfp.dsl import Dataset, Model, Input, Output

# Step 1: Load and preprocess data
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["pandas", "scikit-learn"]
)
def preprocess_data(
    raw_data_path: str,
    processed_dataset: Output[Dataset]
):
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    import pickle

    df = pd.read_csv(raw_data_path)
    X = df.drop("target", axis=1)
    y = df["target"]
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    with open(processed_dataset.path, 'wb') as f:
        pickle.dump({"X": X_scaled, "y": y.values}, f)
    print(f"Preprocessed {len(df)} samples")

# Step 2: Train the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def train_model(
    processed_dataset: Input[Dataset],
    trained_model: Output[Model],
    n_estimators: int = 100,
    max_depth: int = 5
):
    import pickle
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)
    X, y = data["X"], data["y"]
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X, y)
    cv_scores = cross_val_score(model, X, y, cv=5)
    print(f"CV Accuracy: {np.mean(cv_scores):.3f} ± {np.std(cv_scores):.3f}")
    with open(trained_model.path, 'wb') as f:
        pickle.dump(model, f)
    trained_model.metadata["accuracy"] = float(np.mean(cv_scores))

# Step 3: Evaluate the model
@dsl.component(
    base_image="python:3.11",
    packages_to_install=["scikit-learn"]
)
def evaluate_model(
    trained_model: Input[Model],
    processed_dataset: Input[Dataset]
) -> float:
    import pickle
    from sklearn.metrics import accuracy_score

    with open(trained_model.path, 'rb') as f:
        model = pickle.load(f)
    with open(processed_dataset.path, 'rb') as f:
        data = pickle.load(f)
    X, y = data["X"], data["y"]
    predictions = model.predict(X)
    accuracy = accuracy_score(y, predictions)
    print(f"Final Accuracy: {accuracy:.3f}")
    return accuracy

# Define the pipeline
@dsl.pipeline(
    name="ml-training-pipeline",
    description="Preprocessing → Training → Evaluation"
)
def ml_pipeline(
    data_path: str = "s3://my-bucket/data/train.csv",
    n_estimators: int = 100,
    max_depth: int = 5
):
    preprocess_task = preprocess_data(raw_data_path=data_path)
    train_task = train_model(
        processed_dataset=preprocess_task.outputs["processed_dataset"],
        n_estimators=n_estimators,
        max_depth=max_depth
    )
    evaluate_task = evaluate_model(
        trained_model=train_task.outputs["trained_model"],
        processed_dataset=preprocess_task.outputs["processed_dataset"]
    )

# Compile and submit the pipeline
if __name__ == "__main__":
    # Compile to YAML
    kfp.compiler.Compiler().compile(
        pipeline_func=ml_pipeline,
        package_path="ml_pipeline.yaml"
    )
    # Submit to KubeFlow
    client = kfp.Client(host="http://localhost:8080")
    run = client.create_run_from_pipeline_func(
        ml_pipeline,
        arguments={
            "data_path": "s3://my-bucket/data/train.csv",
            "n_estimators": 200,
            "max_depth": 8
        }
    )
    print(f"Pipeline run ID: {run.run_id}")
Run the pipeline:
python ml_pipeline.py
You'll see the pipeline appear in the KubeFlow UI under Runs.
Step 5 — Serve the Model with KServe
Once your model is trained, deploy it for inference using KServe (part of KubeFlow):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
      resources:
        requests:
          cpu: "500m"
          memory: "512Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
Save this as inference-service.yaml and apply:
kubectl apply -f inference-service.yaml
# Check status
kubectl get inferenceservice -n kubeflow-user-example-com
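Once the service reports Ready, the endpoint speaks the KServe v1 REST protocol. As a sketch, here is the same request the curl test below sends, built with Python's standard library — the URL matches the InferenceService above and resolves only from inside the cluster:

```python
# Sketch: build a KServe v1-protocol predict request with the stdlib.
import json
import urllib.request

def build_predict_request(url: str, instances: list) -> urllib.request.Request:
    """KServe v1 protocol: POST a JSON body of the form {"instances": [...]}."""
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )

req = build_predict_request(
    "http://sklearn-model.kubeflow-user-example-com.svc.cluster.local"
    "/v1/models/sklearn-model:predict",
    [[1.2, 3.4, 5.6, 7.8]],
)
print(req.get_method())  # urllib infers POST because a body is set
# From a pod inside the cluster: json.loads(urllib.request.urlopen(req).read())
```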
# Test inference (the cluster-local URL resolves only from inside the cluster,
# e.g. from a debug pod; externally, go through the ingress gateway)
curl -X POST \
  http://sklearn-model.kubeflow-user-example-com.svc.cluster.local/v1/models/sklearn-model:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[1.2, 3.4, 5.6, 7.8]]}'
Step 6 — Hyperparameter Tuning with Katib
Katib automates the search for the best hyperparameters:
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-forest-tuning
  namespace: kubeflow-user-example-com
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: bayesianoptimization
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: n_estimators
      parameterType: int
      feasibleSpace:
        min: "50"
        max: "300"
    - name: max_depth
      parameterType: int
      feasibleSpace:
        min: "3"
        max: "15"
  trialTemplate:
    primaryContainerName: training-container
    trialParameters:
      - name: n_estimators
        description: Number of estimators
        reference: n_estimators
      - name: max_depth
        description: Max depth
        reference: max_depth
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            restartPolicy: Never  # Jobs require Never or OnFailure
            containers:
              - name: training-container
                image: my-training-image:latest
                command:
                  - python
                  - train.py
                  - --n_estimators=${trialParameters.n_estimators}
                  - --max_depth=${trialParameters.max_depth}
Save this as katib-experiment.yaml and apply:
kubectl apply -f katib-experiment.yaml
kubectl get experiment random-forest-tuning -n kubeflow-user-example-com -w
Katib runs up to 12 trials, 3 at a time, and uses Bayesian optimization to converge on the best hyperparameter combination.
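To build intuition for what Katib automates, here is the same search space explored with plain random search in pure Python — a toy sketch where a made-up `objective` function stands in for the `accuracy` metric a real trial would report:

```python
# Toy illustration of hyperparameter search over the Katib space above.
# Random search instead of Bayesian optimization; the objective is invented.
import random

random.seed(0)

def objective(n_estimators: int, max_depth: int) -> float:
    # Hypothetical score surface; a real trial trains and evaluates a model
    return 0.80 + 0.0004 * min(n_estimators, 200) + 0.005 * min(max_depth, 10)

trials = [
    {"n_estimators": random.randint(50, 300), "max_depth": random.randint(3, 15)}
    for _ in range(12)  # maxTrialCount: 12
]
best = max(trials, key=lambda p: objective(**p))
print(f"best params: {best}, score: {objective(**best):.3f}")
```

Katib replaces the sampler with Bayesian optimization, runs each trial as a Kubernetes Job, and by default parses metrics such as `accuracy=0.93` from the trial's stdout.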
Cost Optimization
Use Spot instances for training workloads — Training jobs tolerate interruption better than serving. EKS Spot nodes can cut training costs by 70%.
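As a sketch, a Spot node group in eksctl's config-file form — the group name and instance types here are illustrative:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: kubeflow-cluster
  region: ap-south-1
managedNodeGroups:
  - name: spot-training        # illustrative name
    instanceTypes: ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    spot: true                 # request Spot capacity
    minSize: 0
    maxSize: 5
    desiredCapacity: 0         # scale from zero only when training runs
```

Create it with `eksctl create nodegroup -f <file>.yaml` and steer training pods onto it with a nodeSelector or taint/toleration.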
Scale to zero when not in use — The control plane stays running, but training pods are ephemeral. You only pay for compute when pipelines run.
Scale KServe pods to zero when idle — In KServe's serverless (Knative) mode, models can scale down to zero when there's no inference traffic; in raw Deployment mode, KEDA can provide the same behavior.
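In serverless mode this is a spec field on the predictor rather than a separate autoscaling object — a sketch reusing the InferenceService from Step 5:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-model
  namespace: kubeflow-user-example-com
spec:
  predictor:
    minReplicas: 0   # serverless mode: Knative scales to zero when idle
    sklearn:
      storageUri: "s3://my-bucket/models/random-forest-v1"
```

The first request after an idle period pays a cold-start penalty while the pod spins back up, so keep `minReplicas: 1` for latency-sensitive endpoints.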
Summary
| Component | What it does |
|---|---|
| KubeFlow Pipelines | Orchestrate ML steps as K8s pods |
| Notebooks | JupyterHub for interactive work |
| Katib | Hyperparameter search |
| KServe | Model inference serving |
| Training Operator | Distributed PyTorch/TensorFlow |
KubeFlow is one of the most complete open-source MLOps stacks. Setup is non-trivial — the install alone takes 10-15 minutes and brings up dozens of pods across several namespaces — but once running, it handles the entire ML lifecycle from data prep to production serving.
Deploy KubeFlow on DigitalOcean Kubernetes — $200 free credit for new accounts, enough to run a 3-node cluster for 2-3 weeks. For structured MLOps learning, KodeKloud has ML + Kubernetes labs.