
Build an LLM Fine-Tuning Pipeline on Kubernetes (2026)

Fine-tune a small LLM on domain-specific DevOps data using QLoRA, orchestrate the pipeline on Kubernetes, and serve the result with vLLM. Complete guide with code.

DevOpsBoys · May 8, 2026 · 5 min read

Fine-tuning an LLM on your own data gives you a model that understands your domain — your runbooks, your codebase, your incident patterns. Here's how to build the full pipeline on Kubernetes.


What We're Building

A Kubernetes-based fine-tuning pipeline:

  1. Data preparation — format your training data
  2. Fine-tuning job — QLoRA fine-tune on a GPU node
  3. Model storage — save adapter weights to S3
  4. Serving — load base model + adapter with vLLM
  5. Pipeline orchestration — Argo Workflows to tie it together

We'll fine-tune Llama 3.2 3B on DevOps Q&A data using QLoRA (quantized low-rank adaptation), which fits on a single A10G GPU and trains in roughly 2 hours.
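
A rough back-of-envelope on why it fits (illustrative estimates, not measurements):

python
# Approximate QLoRA training memory for a 3B model
base_weights_gb = 3.2e9 * 0.5 / 1e9   # 4-bit quantized base weights: ~1.6 GB
adapter_gb      = 13e6 * 2 / 1e9      # bf16 LoRA adapter: ~0.03 GB
optimizer_gb    = 13e6 * 8 / 1e9      # AdamW states for the adapter only: ~0.10 GB
print(f"~{base_weights_gb + adapter_gb + optimizer_gb:.1f} GB before activations")
# Activations, gradients, and CUDA overhead add a few GB; comfortably
# inside the A10G's 24 GB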


Prerequisites

  • EKS cluster with at least one GPU node (g5.xlarge)
  • NVIDIA GPU Operator installed
  • Hugging Face account (for model downloads)
  • S3 bucket for model artifacts
  • Argo Workflows installed
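
Before kicking off any jobs, confirm the cluster actually exposes a schedulable GPU (the namespace below assumes a default GPU Operator install):

bash
# GPU Operator pods should all be Running
kubectl get pods -n gpu-operator
# The GPU node should advertise nvidia.com/gpu under Allocatable
kubectl describe nodes -l node.kubernetes.io/instance-type=g5.xlarge | grep nvidia.com/gpu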

Step 1: Prepare Training Data

Fine-tuning requires formatted instruction data:

python
# prepare_data.py
import json
 
# DevOps Q&A training examples
training_data = [
    {
        "instruction": "A Kubernetes pod is showing OOMKilled status. What are the steps to diagnose and fix this?",
        "output": "OOMKilled means the container exceeded its memory limit. Steps:\n1. Check current limits: kubectl describe pod <name>\n2. Check memory usage: kubectl top pod <name>\n3. Review app logs before the kill: kubectl logs <name> --previous\n4. If the limit is too low, increase it in the Deployment spec under resources.limits.memory\n5. If the app has a leak, profile the application..."
    },
    {
        "instruction": "Explain the difference between a Kubernetes Deployment and a StatefulSet.",
        "output": "Deployments are for stateless applications — each pod is interchangeable, gets a random name, and shares storage. StatefulSets are for stateful applications (databases, queues) — each pod has a stable identity (mysql-0, mysql-1), persistent storage that follows the pod, and pods start/stop in order..."
    },
    # Add hundreds more examples
]
 
# Format with Llama 3's chat template. Building the string manually keeps the
# example transparent; tokenizer.apply_chat_template is the safer option in
# production since it matches the official template exactly.
def format_example(item):
    return f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a senior DevOps engineer. Answer questions about Kubernetes, AWS, CI/CD, and infrastructure.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
{item['instruction']}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
{item['output']}<|eot_id|>"""
 
formatted = [{"text": format_example(item)} for item in training_data]
 
with open("train.jsonl", "w") as f:
    for item in formatted:
        f.write(json.dumps(item) + "\n")
 
print(f"Prepared {len(formatted)} training examples")

Upload training data to S3:

bash
aws s3 cp train.jsonl s3://my-ml-bucket/fine-tuning/devops-qa/train.jsonl

Step 2: Fine-Tuning Job

python
# finetune.py
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
from datasets import load_dataset
import boto3
 
# Config
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
OUTPUT_DIR = "/tmp/fine-tuned-model"
S3_BUCKET = os.environ["S3_BUCKET"]
S3_OUTPUT_PREFIX = os.environ.get("S3_OUTPUT_PREFIX", "fine-tuned/devops-qa")
 
# Load dataset from S3
s3 = boto3.client('s3')
s3.download_file(S3_BUCKET, "fine-tuning/devops-qa/train.jsonl", "/tmp/train.jsonl")
dataset = load_dataset("json", data_files="/tmp/train.jsonl", split="train")
 
# Load model in 4-bit (QLoRA)
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
 
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    token=os.environ["HF_TOKEN"],
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, token=os.environ["HF_TOKEN"])
tokenizer.pad_token = tokenizer.eos_token
 
# LoRA configuration
lora_config = LoraConfig(
    r=16,           # rank — higher = more capacity but more parameters
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
 
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~13M || all params: ~3.2B || ~0.4% trainable
 
# Training
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    warmup_steps=100,
    learning_rate=2e-4,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    save_total_limit=2,
    report_to="none",
)
 
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    # Note: newer trl releases move these two arguments into SFTConfig;
    # pin your trl version or adjust accordingly
    dataset_text_field="text",
    max_seq_length=2048,
)
 
trainer.train()
trainer.save_model(OUTPUT_DIR)
 
# Upload the adapter to S3 (assumes the AWS CLI is in the image; a boto3
# upload loop works just as well)
import subprocess
subprocess.run([
    "aws", "s3", "sync", OUTPUT_DIR,
    f"s3://{S3_BUCKET}/{S3_OUTPUT_PREFIX}/"
], check=True)
 
print(f"Model uploaded to s3://{S3_BUCKET}/{S3_OUTPUT_PREFIX}/")

Step 3: Kubernetes Job

yaml
# fine-tune-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: llm-fine-tune-devops
  namespace: ml-workloads
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: fine-tuner
        image: your-registry/fine-tuner:latest
        command: ["python", "finetune.py"]
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        - name: S3_BUCKET
          value: "my-ml-bucket"
        - name: S3_OUTPUT_PREFIX
          value: "fine-tuned/devops-qa-v1"
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "4"
          requests:
            nvidia.com/gpu: "1"
            memory: "20Gi"
            cpu: "2"
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        persistentVolumeClaim:
          claimName: hf-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      nodeSelector:
        node.kubernetes.io/instance-type: g5.xlarge

Submit the job and tail the logs:

bash
kubectl apply -f fine-tune-job.yaml
kubectl logs -n ml-workloads -l job-name=llm-fine-tune-devops -f
# Training takes ~90-120 minutes on g5.xlarge
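
Once the job completes, confirm the adapter landed in S3 (the path matches the Job's S3_OUTPUT_PREFIX):

bash
aws s3 ls s3://my-ml-bucket/fine-tuned/devops-qa-v1/
# Expect adapter_config.json and adapter_model.safetensors among the files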

Step 4: Serve with vLLM + LoRA Adapter

vLLM supports loading LoRA adapters at runtime:

yaml
# serve-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: devops-llm
  namespace: ml-workloads
spec:
  replicas: 1
  selector:
    matchLabels:
      app: devops-llm
  template:
    metadata:
      labels:
        app: devops-llm
    spec:
      # vLLM loads LoRA adapters from a local path or the HF Hub, not from an
      # s3:// URI, so an init container syncs the adapter to a shared volume
      # first (the pod needs S3 read access, e.g. via IRSA)
      initContainers:
      - name: fetch-adapter
        image: amazon/aws-cli:latest
        command: ["aws", "s3", "sync", "s3://my-ml-bucket/fine-tuned/devops-qa-v1", "/adapters/devops-qa"]
        env:
        - name: AWS_DEFAULT_REGION
          value: us-east-1
        volumeMounts:
        - name: adapters
          mountPath: /adapters
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.2-3B-Instruct"
        - "--enable-lora"
        - "--lora-modules"
        - "devops-qa=/adapters/devops-qa"
        - "--dtype"
        - "bfloat16"
        ports:
        - containerPort: 8000
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: token
        volumeMounts:
        - name: adapters
          mountPath: /adapters
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "16Gi"
      volumes:
      - name: adapters
        emptyDir: {}
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

Query the fine-tuned model:

bash
curl http://devops-llm-svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "devops-qa",
    "messages": [{"role": "user", "content": "Pod is OOMKilled, how do I fix it?"}]
  }'
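
The "model" field selects the adapter: devops-qa is the name registered via --lora-modules, while the unmodified base model stays available under its original ID for side-by-side comparison:

bash
curl http://devops-llm-svc:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.2-3B-Instruct",
    "messages": [{"role": "user", "content": "Pod is OOMKilled, how do I fix it?"}]
  }'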

Step 5: Argo Workflow to Orchestrate

yaml
# pipeline.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: llm-fine-tune-pipeline
  namespace: ml-workloads
spec:
  entrypoint: fine-tune-pipeline
  templates:
  - name: fine-tune-pipeline
    steps:
    - - name: prepare-data
        template: data-prep
    - - name: fine-tune
        template: train
        arguments:
          parameters:
          - name: data-version
            value: "{{steps.prepare-data.outputs.parameters.version}}"
    - - name: evaluate
        template: eval
    - - name: deploy
        template: deploy-model
        when: "{{steps.evaluate.outputs.parameters.score}} > 0.8"
 
  - name: data-prep
    container:
      image: your-registry/data-prep:latest
      command: ["python", "prepare_data.py"]
    outputs:
      parameters:
      - name: version        # the fine-tune step reads this; prepare_data.py
        valueFrom:           # is expected to write it to /tmp/version.txt
          path: /tmp/version.txt
 
  - name: train
    resource:
      action: create
      manifest: |
        # fine-tune-job.yaml contents here
      successCondition: status.succeeded == 1
      failureCondition: status.failed > 0

  # The eval and deploy-model templates are omitted for brevity; they follow
  # the same patterns, and eval must emit an output parameter named "score"
  # for the deploy gate above to work
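
Submit and watch the pipeline (assuming the Argo CLI is installed):

bash
argo submit pipeline.yaml -n ml-workloads --watch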

Cost Estimate

| Phase | Instance | Time | Cost |
| --- | --- | --- | --- |
| Fine-tuning | g5.xlarge spot (~$0.35/hr) | 2 hours | ~$0.70 |
| Serving | g5.xlarge spot | ongoing | ~$0.35/hr |
| S3 storage (adapter, ~200MB) | | | ~$0.005/month |

The entire fine-tuning run costs under $1. The resulting model is specific to your domain and faster to serve than a 70B model.


Fine-tuning isn't just for big tech companies anymore. With QLoRA, you can fine-tune a production-quality model on a single GPU for under $5, deploy it in Kubernetes, and have an AI assistant that genuinely understands your infrastructure.
