Build an LLM Fine-Tuning Pipeline on Kubernetes (2026)
Fine-tune a small LLM on domain-specific DevOps data using QLoRA, orchestrate the pipeline on Kubernetes, and serve the result with vLLM. Complete guide with code.
Fine-tuning an LLM on your own data gives you a model that understands your domain — your runbooks, your codebase, your incident patterns. Here's how to build the full pipeline on Kubernetes.
What We're Building
A Kubernetes-based fine-tuning pipeline:
- Data preparation — format your training data
- Fine-tuning job — QLoRA fine-tune on a GPU node
- Model storage — save adapter weights to S3
- Serving — load base model + adapter with vLLM
- Pipeline orchestration — Argo Workflows to tie it together
We'll fine-tune Llama 3.2 3B on DevOps Q&A data using QLoRA (quantized low-rank adaptation) — fits on a single A10G GPU in ~2 hours.
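A rough back-of-envelope check on why this fits on a 24 GB A10G (the numbers below are approximations for intuition, not measured values): the 4-bit quantized base weights take roughly 1.5-2 GB, and the LoRA adapter plus its gradients and optimizer state add well under 1 GB, leaving the rest for activations during training.

# rough QLoRA memory estimate (approximate, for intuition only)
params = 3.2e9                  # Llama 3.2 3B parameter count (approx.)
base_gb = params * 0.5 / 1e9    # 4-bit NF4 weights: ~0.5 bytes per parameter
lora_params = 13e6              # ~13M trainable LoRA parameters
# LoRA weights + grads in bf16 (2+2 bytes) plus AdamW moments in fp32 (8 bytes)
lora_gb = lora_params * 12 / 1e9
print(f"base ~{base_gb:.1f} GB, LoRA side ~{lora_gb:.2f} GB, vs 24 GB on an A10G")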
Prerequisites
- EKS cluster with at least one GPU node (g5.xlarge)
- NVIDIA GPU Operator installed
- Hugging Face account (for model downloads)
- S3 bucket for model artifacts
- Argo Workflows installed
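Before submitting anything, it's worth confirming the GPU node actually advertises the nvidia.com/gpu resource, and creating the namespace and Hugging Face token secret that the manifests below reference (the namespace and secret names are simply the ones used throughout this guide):

# GPU node is registered and the GPU Operator exposes the GPU resource
kubectl get nodes -L node.kubernetes.io/instance-type
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
# namespace and HF token secret used by the manifests below
kubectl create namespace ml-workloads
kubectl create secret generic hf-secret -n ml-workloads --from-literal=token=<your-hf-token>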
Step 1: Prepare Training Data
Fine-tuning requires formatted instruction data:
# prepare_data.py
import json

# DevOps Q&A training examples
training_data = [
    {
        "instruction": "A Kubernetes pod is showing OOMKilled status. What are the steps to diagnose and fix this?",
        "output": "OOMKilled means the container exceeded its memory limit. Steps:\n1. Check current limits: kubectl describe pod <name>\n2. Check memory usage: kubectl top pod <name>\n3. Review app logs before the kill: kubectl logs <name> --previous\n4. If the limit is too low, increase it in the Deployment spec under resources.limits.memory\n5. If the app has a leak, profile the application..."
    },
    {
        "instruction": "Explain the difference between a Kubernetes Deployment and a StatefulSet.",
        "output": "Deployments are for stateless applications — each pod is interchangeable, gets a random name, and shares storage. StatefulSets are for stateful applications (databases, queues) — each pod has a stable identity (mysql-0, mysql-1), persistent storage that follows the pod, and pods start/stop in order..."
    },
    # Add hundreds more examples
]

# Format for Llama instruction tuning
def format_example(item):
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a senior DevOps engineer. Answer questions about Kubernetes, AWS, CI/CD, and infrastructure.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{item['instruction']}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{item['output']}<|eot_id|>"""

formatted = [{"text": format_example(item)} for item in training_data]

with open("train.jsonl", "w") as f:
    for item in formatted:
        f.write(json.dumps(item) + "\n")

print(f"Prepared {len(formatted)} training examples")

Upload the training data to S3:

aws s3 cp train.jsonl s3://my-ml-bucket/fine-tuning/devops-qa/train.jsonl

Step 2: Fine-Tuning Job
# finetune.py
import os
import subprocess

import boto3
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Config
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
OUTPUT_DIR = "/tmp/fine-tuned-model"
S3_BUCKET = os.environ["S3_BUCKET"]
S3_OUTPUT_PREFIX = os.environ.get("S3_OUTPUT_PREFIX", "fine-tuned/devops-qa")

# Load dataset from S3
s3 = boto3.client('s3')
s3.download_file(S3_BUCKET, "fine-tuning/devops-qa/train.jsonl", "/tmp/train.jsonl")
dataset = load_dataset("json", data_files="/tmp/train.jsonl", split="train")

# Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token=os.environ["HF_TOKEN"],
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=os.environ["HF_TOKEN"])
tokenizer.pad_token = tokenizer.eos_token

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # rank — higher = more capacity but more parameters
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~13M || all params: ~3.2B || ~0.4% trainable

# Training
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)

# Upload adapter weights to S3
subprocess.run(
    ["aws", "s3", "sync", OUTPUT_DIR, f"s3://{S3_BUCKET}/{S3_OUTPUT_PREFIX}/"],
    check=True,
)
print(f"Model uploaded to s3://{S3_BUCKET}/{S3_OUTPUT_PREFIX}/")

Step 3: Kubernetes Job
# fine-tune-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-fine-tune-devops
  namespace: ml-workloads
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: fine-tuner
          image: your-registry/fine-tuner:latest
          command: ["python", "finetune.py"]
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
            - name: S3_BUCKET
              value: "my-ml-bucket"
            - name: S3_OUTPUT_PREFIX
              value: "fine-tuned/devops-qa-v1"
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "24Gi"
              cpu: "4"
            requests:
              nvidia.com/gpu: "1"
              memory: "20Gi"
              cpu: "2"
          volumeMounts:
            - name: hf-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: hf-model-cache
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        node.kubernetes.io/instance-type: g5.xlarge

kubectl apply -f fine-tune-job.yaml
kubectl logs -n ml-workloads -l job-name=llm-fine-tune-devops -f
# Training takes ~90-120 minutes on g5.xlarge

Step 4: Serve with vLLM + LoRA Adapter
vLLM supports loading LoRA adapters at runtime:
# serve-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-llm
  namespace: ml-workloads
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devops-llm
  template:
    metadata:
      labels:
        app: devops-llm
    spec:
      # vLLM expects LoRA adapters on a local path (or a Hugging Face repo ID),
      # so an init container syncs the adapter down from S3 first.
      # S3 access is assumed to come from the node role or IRSA.
      initContainers:
        - name: fetch-adapter
          image: amazon/aws-cli:latest
          command: ["aws", "s3", "sync", "s3://my-ml-bucket/fine-tuned/devops-qa-v1", "/adapters/devops-qa"]
          env:
            - name: AWS_DEFAULT_REGION
              value: us-east-1
          volumeMounts:
            - name: adapters
              mountPath: /adapters
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.2-3B-Instruct"
            - "--enable-lora"
            - "--lora-modules"
            - "devops-qa=/adapters/devops-qa"
            - "--dtype"
            - "bfloat16"
          ports:
            - containerPort: 8000
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-secret
                  key: token
          volumeMounts:
            - name: adapters
              mountPath: /adapters
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "16Gi"
      volumes:
        - name: adapters
          emptyDir: {}
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

Expose port 8000 with a ClusterIP Service (sketch below), then query the fine-tuned adapter by name:
curl http://devops-llm-svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devops-qa",
    "messages": [{"role": "user", "content": "Pod is OOMKilled, how do I fix it?"}]
  }'

Step 5: Argo Workflow to Orchestrate
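The workflow below chains data prep, training, evaluation, and a gated deploy, and only deploys when the evaluation score clears a threshold. The eval and deploy-model templates are referenced but not spelled out in this guide; as a rough sketch, an evaluation step could score answers on held-out Q&A pairs and write the result to a file that Argo exposes as an output parameter via valueFrom.path. The file paths, overlap-based scoring, and the 0.8 threshold are all assumptions:

# evaluate.py — minimal sketch of the "eval" step (paths and scoring are assumptions)
import json

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER_DIR = "/tmp/fine-tuned-model"   # adapter synced from S3 beforehand
HOLDOUT_PATH = "/tmp/holdout.jsonl"     # held-out {"instruction", "output"} pairs

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_DIR)

def overlap_score(generated: str, reference: str) -> float:
    # crude word-overlap recall against the reference answer
    ref_words = set(reference.lower().split())
    gen_words = set(generated.lower().split())
    return len(ref_words & gen_words) / max(len(ref_words), 1)

scores = []
with open(HOLDOUT_PATH) as f:
    for line in f:
        example = json.loads(line)
        inputs = tokenizer(example["instruction"], return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=256)
        answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
        scores.append(overlap_score(answer, example["output"]))

score = sum(scores) / max(len(scores), 1)
print(f"eval score: {score:.3f}")

# Argo reads this file as the step's output parameter (valueFrom.path)
with open("/tmp/score.txt", "w") as f:
    f.write(f"{score:.3f}")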
# pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: llm-fine-tune-pipeline
  namespace: ml-workloads
spec:
  entrypoint: fine-tune-pipeline
  templates:
    - name: fine-tune-pipeline
      steps:
        - - name: prepare-data
            template: data-prep
        - - name: fine-tune
            template: train
            arguments:
              parameters:
                - name: data-version
                  value: "{{steps.prepare-data.outputs.parameters.version}}"
        - - name: evaluate
            template: eval
        - - name: deploy
            template: deploy-model
            when: "{{steps.evaluate.outputs.parameters.score}} > 0.8"

    - name: data-prep
      container:
        image: your-registry/data-prep:latest
        command: ["python", "prepare_data.py"]
      # the version output parameter referenced above would be declared here

    - name: train
      resource:
        action: create
        manifest: |
          # fine-tune-job.yaml contents here
        successCondition: status.succeeded == 1
        failureCondition: status.failed > 0

    # eval and deploy-model templates omitted for brevity

Cost Estimate
| Phase | Instance | Duration | Cost |
|---|---|---|---|
| Fine-tuning | g5.xlarge spot (~$0.35/hr) | ~2 hours | ~$0.70 |
| Serving | g5.xlarge spot (~$0.35/hr) | ongoing | ~$0.35/hr |
| S3 storage (adapter, ~200 MB) | n/a | ongoing | ~$0.005/month |
The entire fine-tuning run costs under $1. The resulting model is specific to your domain and faster to serve than a 70B model.
Fine-tuning isn't just for big tech companies anymore. With QLoRA, you can fine-tune a production-quality model on a single GPU for under $5, deploy it in Kubernetes, and have an AI assistant that genuinely understands your infrastructure.