
Deploy HuggingFace Models on Kubernetes with GPU Nodes (2026)

Step-by-step guide to deploying HuggingFace transformer models on Kubernetes using GPU nodes — from cluster setup to inference API in production.

DevOpsBoys · Apr 19, 2026 · 5 min read

Running HuggingFace models locally is easy. Running them reliably in production — with GPU scheduling, autoscaling, health checks, and a proper API — is a different challenge.

This guide walks through deploying a HuggingFace model as an inference service on Kubernetes with GPU nodes.

What We're Building

  • EKS cluster (or any Kubernetes cluster) with GPU node group
  • NVIDIA device plugin for GPU scheduling
  • A FastAPI inference server wrapping a HuggingFace model
  • Kubernetes Deployment + Service + HPA
  • Health checks and readiness probes

We'll deploy microsoft/phi-2 (a small but capable LLM) as an example, but the pattern works for any HuggingFace model.


Step 1 — Set Up GPU Nodes on EKS

Create EKS Cluster with GPU Node Group

bash
eksctl create cluster \
  --name ml-cluster \
  --region ap-south-1 \
  --nodegroup-name cpu-workers \
  --node-type t3.medium \
  --nodes 2
 
# Add GPU node group (g4dn.xlarge: 1 NVIDIA T4 GPU, 16 GB VRAM)
# Note: no trailing comments inside the command — they would break the \ continuations
eksctl create nodegroup \
  --cluster ml-cluster \
  --name gpu-workers \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 3 \
  --node-labels "role=gpu-inference" \
  --asg-access

g4dn.xlarge costs ~$0.53/hour on-demand or ~$0.16/hour Spot. For development, 1 node is enough.
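To put those hourly rates in perspective, here is a quick back-of-envelope calculation (assuming an average ~730-hour month and one always-on node):

```python
# Rough monthly cost of one g4dn.xlarge node at the rates above.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float) -> float:
    """Monthly USD cost for one node running 24/7."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

print(monthly_cost(0.53))  # on-demand: ~387 USD/month
print(monthly_cost(0.16))  # Spot:     ~117 USD/month
```

Spot pricing works out to roughly a 70% saving — worth keeping in mind for the cost-optimization tips later in this guide.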

Install NVIDIA Device Plugin

The device plugin lets Kubernetes see and allocate GPUs as resources:

bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Verify GPUs are visible:

bash
kubectl get nodes -l role=gpu-inference -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# "1"

Step 2 — Build the Inference Server

Create a FastAPI application that loads and serves a HuggingFace model.

app.py

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
 
app = FastAPI(title="HuggingFace Inference API")
 
MODEL_NAME = os.getenv("MODEL_NAME", "microsoft/phi-2")
device = "cuda" if torch.cuda.is_available() else "cpu"
 
print(f"Loading {MODEL_NAME} on {device}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
    device_map="auto"
)
 
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200
    temperature: float = 0.7
 
class GenerateResponse(BaseModel):
    generated_text: str
    model: str
    device: str
 
@app.get("/health")
def health():
    return {"status": "healthy", "device": device, "model": MODEL_NAME}
 
@app.post("/generate", response_model=GenerateResponse)
def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=request.max_new_tokens,
            temperature=request.temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    return GenerateResponse(
        generated_text=generated[len(request.prompt):].strip(),
        model=MODEL_NAME,
        device=device
    )

requirements.txt

fastapi==0.110.0
uvicorn==0.27.0
transformers==4.39.0
torch==2.2.1
accelerate==0.28.0
pydantic==2.6.0

Dockerfile

dockerfile
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
 
WORKDIR /app
 
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
COPY app.py .
 
ENV MODEL_NAME=microsoft/phi-2
ENV HF_HOME=/model-cache
 
EXPOSE 8080
 
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Build and push to ECR:

bash
# Create ECR repo
aws ecr create-repository --repository-name hf-inference --region ap-south-1
 
# Build and push
ECR_URI=$(aws ecr describe-repositories --repository-names hf-inference \
  --query 'repositories[0].repositoryUri' --output text)
 
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin $ECR_URI
 
docker build -t hf-inference .
docker tag hf-inference:latest $ECR_URI:latest
docker push $ECR_URI:latest

Step 3 — Create a PersistentVolume for Model Cache

Models are large (phi-2 is ~5GB). Download once, cache on a PVC:

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ml-inference
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi

Step 4 — Deploy the Inference Server

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hf-inference
  template:
    metadata:
      labels:
        app: hf-inference
    spec:
      nodeSelector:
        role: gpu-inference          # Schedule only on GPU nodes
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: inference
        image: <ECR_URI>:latest
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: "microsoft/phi-2"
        - name: HF_HOME
          value: "/model-cache"
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
            nvidia.com/gpu: "1"      # Request 1 GPU
          limits:
            memory: "12Gi"
            cpu: "4"
            nvidia.com/gpu: "1"      # Limit to 1 GPU
        volumeMounts:
        - name: model-cache
          mountPath: /model-cache
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120   # Model loading takes time
          periodSeconds: 10
          failureThreshold: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 180
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  selector:
    app: hf-inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Apply:

bash
kubectl create namespace ml-inference
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
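The overview at the top also lists an HPA. A minimal sketch is below — it scales on CPU utilization, which is only a rough proxy for GPU inference load, and the names are assumed to match the Deployment above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hf-inference
  minReplicas: 1
  maxReplicas: 3          # each replica needs its own GPU
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Keep maxReplicas at or below the GPU node group's --nodes-max, since every replica requests a full GPU. Also note the gp3 PVC above is ReadWriteOnce, so replicas scheduled on other nodes will download their own copy of the model on first start.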

Step 5 — Watch Model Loading

The first startup takes 3–5 minutes (downloading the model). Watch the logs:

bash
kubectl logs -f deployment/hf-inference -n ml-inference
# Loading microsoft/phi-2 on cuda...
# Downloading model files: 100%|██████| 5.18G/5.18G
# Application startup complete.

Check readiness:

bash
kubectl get pods -n ml-inference -w
# hf-inference-xxx   0/1   Running   0   2m
# hf-inference-xxx   1/1   Running   0   4m   <- Ready after model loads

Step 6 — Test the Inference API

bash
# Port-forward to test locally
kubectl port-forward svc/hf-inference 8080:80 -n ml-inference
 
# Send a test request
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Kubernetes in simple terms:", "max_new_tokens": 100}'

Expected response:

json
{
  "generated_text": "Kubernetes is a container orchestration platform...",
  "model": "microsoft/phi-2",
  "device": "cuda"
}
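The same call from Python, using only the standard library. This assumes the port-forward above is running; `build_request` and `generate` are helper names introduced here for illustration, not part of the app:

```python
import json
import urllib.request

API_URL = "http://localhost:8080/generate"  # reachable via the port-forward above

def build_request(prompt: str, max_new_tokens: int = 100) -> urllib.request.Request:
    """Build the POST request that the /generate endpoint expects."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode()
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(prompt: str, max_new_tokens: int = 100, timeout: int = 300) -> str:
    """Call the inference API and return only the generated text."""
    req = build_request(prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["generated_text"]

# generate("Explain Kubernetes in simple terms:")  # requires the service to be up
```

The generous timeout matters: generation with 200 new tokens can take tens of seconds on a T4, which is also why the Ingress below raises its proxy timeouts.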

Step 7 — Add an Ingress

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hf-inference
  namespace: ml-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: inference.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hf-inference
            port:
              number: 80

Cost Optimization Tips

Use Spot instances for GPU nodes — g4dn.xlarge Spot is ~$0.16/hour vs $0.53 on-demand.

bash
eksctl create nodegroup \
  --cluster ml-cluster \
  --name gpu-spot \
  --node-type g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 2

Scale to zero when not in use — a plain HPA can't go below one replica (without the HPAScaleToZero feature gate), but KEDA can scale the Deployment to zero and bring it back when traffic arrives. The Karpenter + KEDA combo can reduce GPU costs by 70%+ for bursty or batch workloads.
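A hedged sketch of the KEDA side, assuming KEDA is installed and a Prometheus instance scrapes your ingress controller — the server address and metric query here are illustrative assumptions, not drop-in values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  scaleTargetRef:
    name: hf-inference        # the Deployment above
  minReplicaCount: 0          # scale to zero when idle
  maxReplicaCount: 3
  cooldownPeriod: 300         # 5 min of no traffic before scaling to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # assumed address
      query: sum(rate(nginx_ingress_controller_requests{service="hf-inference"}[2m]))
      threshold: "1"
```

Remember the cold-start cost: scaling from zero means provisioning a GPU node and reloading the model, so the first request after an idle period can take several minutes.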

Use quantized models — 4-bit quantization (AWQ, GPTQ, or bitsandbytes; GGUF plays the same role in the llama.cpp ecosystem) cuts weight memory by roughly 4× versus fp16, letting models fit on smaller GPUs, and is often faster when inference is memory-bandwidth bound. For phi-2, look for community quantized exports on the Hub, or load the base model with load_in_4bit=True via bitsandbytes.
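Why 4-bit helps, in rough numbers: phi-2 has ~2.7B parameters, so the memory needed for weights alone scales directly with bits per parameter (activations and KV cache add overhead on top, so treat this as a lower bound):

```python
# Back-of-envelope VRAM needed just for the model weights.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Gigabytes of memory for the weights alone at a given precision."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

PHI2_PARAMS_B = 2.7  # microsoft/phi-2 parameter count, in billions

print(weight_gb(PHI2_PARAMS_B, 16))  # fp16  ≈ 5.4 GB (matches the ~5GB download)
print(weight_gb(PHI2_PARAMS_B, 4))   # 4-bit ≈ 1.35 GB
```

At 4-bit, phi-2's weights fit comfortably in the T4's 16 GB with room to spare for batching, and even much larger models come within reach of a single g4dn.xlarge.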


Summary

| Step | What it does |
| --- | --- |
| GPU node group | g4dn.xlarge with NVIDIA T4 |
| NVIDIA device plugin | Exposes GPU as a schedulable resource |
| FastAPI app | Wraps HuggingFace model as a REST API |
| PVC | Caches model across pod restarts |
| Deployment | GPU resource request + readiness probe |
| Ingress | External access with timeout tuning |

The pattern here — FastAPI + HuggingFace + K8s GPU nodes — is production-ready and works for any model from the HuggingFace hub.

Run this stack on DigitalOcean GPU Droplets or spin up EKS GPU nodes. New DigitalOcean accounts get $200 free credit — enough to experiment with GPU inference for several hours.
