Deploy HuggingFace Models on Kubernetes with GPU Nodes (2026)
Step-by-step guide to deploying HuggingFace transformer models on Kubernetes using GPU nodes — from cluster setup to inference API in production.
Running HuggingFace models locally is easy. Running them reliably in production — with GPU scheduling, autoscaling, health checks, and a proper API — is a different challenge.
This guide walks through deploying a HuggingFace model as an inference service on Kubernetes with GPU nodes.
What We're Building
- EKS cluster (or any Kubernetes cluster) with GPU node group
- NVIDIA device plugin for GPU scheduling
- A FastAPI inference server wrapping a HuggingFace model
- Kubernetes Deployment + Service + HPA
- Health checks and readiness probes
We'll deploy microsoft/phi-2 (a small but capable LLM) as an example, but the pattern works for any HuggingFace model.
Step 1 — Set Up GPU Nodes on EKS
Create EKS Cluster with GPU Node Group
eksctl create cluster \
--name ml-cluster \
--region ap-south-1 \
--nodegroup-name cpu-workers \
--node-type t3.medium \
--nodes 2
# Add GPU node group
eksctl create nodegroup \
--cluster ml-cluster \
--name gpu-workers \
--node-type g4dn.xlarge \
--nodes 1 \
--nodes-min 0 \
--nodes-max 3 \
--node-labels "role=gpu-inference" \
--asg-access
g4dn.xlarge has 1 NVIDIA T4 GPU with 16 GB VRAM and costs ~$0.53/hour on-demand or ~$0.16/hour Spot. For development, 1 node is enough.
Install NVIDIA Device Plugin
The device plugin lets Kubernetes see and allocate GPUs as resources:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml
Verify GPUs are visible:
kubectl get nodes -l role=gpu-inference -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# "1"Step 2 — Build the Inference Server
Create a FastAPI application that loads and serves a HuggingFace model.
app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os

app = FastAPI(title="HuggingFace Inference API")

MODEL_NAME = os.getenv("MODEL_NAME", "microsoft/phi-2")
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Loading {MODEL_NAME} on {device}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
    device_map="auto"
)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200
    temperature: float = 0.7

class GenerateResponse(BaseModel):
    generated_text: str
    model: str
    device: str

@app.get("/health")
def health():
    return {"status": "healthy", "device": device, "model": MODEL_NAME}

@app.post("/generate", response_model=GenerateResponse)
def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(device)
    try:
        with torch.no_grad():
            output = model.generate(
                **inputs,
                max_new_tokens=request.max_new_tokens,
                temperature=request.temperature,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
    except RuntimeError as e:  # e.g. CUDA out-of-memory on oversized prompts
        raise HTTPException(status_code=500, detail=str(e))
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    return GenerateResponse(
        generated_text=generated[len(request.prompt):].strip(),
        model=MODEL_NAME,
        device=device
    )
requirements.txt
fastapi==0.110.0
uvicorn==0.27.0
transformers==4.39.0
torch==2.2.1
accelerate==0.28.0
pydantic==2.6.0
Dockerfile
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
WORKDIR /app
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
ENV MODEL_NAME=microsoft/phi-2
ENV HF_HOME=/model-cache
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]Build and push to ECR:
# Create ECR repo
aws ecr create-repository --repository-name hf-inference --region ap-south-1
# Build and push
ECR_URI=$(aws ecr describe-repositories --repository-names hf-inference \
--query 'repositories[0].repositoryUri' --output text)
aws ecr get-login-password --region ap-south-1 | \
docker login --username AWS --password-stdin $ECR_URI
docker build -t hf-inference .
docker tag hf-inference:latest $ECR_URI:latest
docker push $ECR_URI:latest
Step 3 — Create a PersistentVolume for Model Cache
Models are large (phi-2 is ~5GB). Download once, cache on a PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ml-inference
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi
Step 4 — Deploy the Inference Server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hf-inference
  template:
    metadata:
      labels:
        app: hf-inference
    spec:
      nodeSelector:
        role: gpu-inference   # Schedule only on GPU nodes
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Exists"
          effect: "NoSchedule"
      containers:
        - name: inference
          image: <ECR_URI>:latest
          ports:
            - containerPort: 8080
          env:
            - name: MODEL_NAME
              value: "microsoft/phi-2"
            - name: HF_HOME
              value: "/model-cache"
          resources:
            requests:
              memory: "8Gi"
              cpu: "2"
              nvidia.com/gpu: "1"   # Request 1 GPU
            limits:
              memory: "12Gi"
              cpu: "4"
              nvidia.com/gpu: "1"   # Limit to 1 GPU
          volumeMounts:
            - name: model-cache
              mountPath: /model-cache
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 120   # Model loading takes time
            periodSeconds: 10
            failureThreshold: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 180
            periodSeconds: 30
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  selector:
    app: hf-inference
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
Apply:
kubectl create namespace ml-inference
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
Step 5 — Watch Model Loading
The first startup takes 3–5 minutes (downloading the model). Watch the logs:
kubectl logs -f deployment/hf-inference -n ml-inference
# Loading microsoft/phi-2 on cuda...
# Downloading model files: 100%|██████| 5.18G/5.18G
# Application startup complete.
Check readiness:
kubectl get pods -n ml-inference -w
# hf-inference-xxx 0/1 Running 0 2m
# hf-inference-xxx 1/1 Running 0 4m <- Ready after model loads
Step 6 — Test the Inference API
# Port-forward to test locally
kubectl port-forward svc/hf-inference 8080:80 -n ml-inference
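With the port-forward running, the endpoint can also be exercised from Python instead of curl. A minimal client sketch using only the standard library — the helper names (`build_request`, `generate`) are mine, not part of the app:

```python
# Minimal stdlib client for the /generate endpoint.
# Assumes the service is reachable on localhost:8080 (e.g. via port-forward).
import json
import urllib.request

def build_request(prompt: str, max_new_tokens: int = 200, temperature: float = 0.7) -> dict:
    """Build the JSON payload matching the GenerateRequest pydantic model."""
    return {
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": temperature,
    }

def generate(base_url: str, prompt: str, **kwargs) -> dict:
    """POST a generation request and return the parsed GenerateResponse."""
    payload = json.dumps(build_request(prompt, **kwargs)).encode()
    req = urllib.request.Request(
        f"{base_url}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    # Generation can take a while on first requests; use a generous timeout.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    result = generate("http://localhost:8080", "Explain Kubernetes in simple terms:", max_new_tokens=100)
    print(result["generated_text"])
```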
# Send a test request
curl -X POST http://localhost:8080/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Explain Kubernetes in simple terms:", "max_new_tokens": 100}'
Expected response:
{
  "generated_text": "Kubernetes is a container orchestration platform...",
  "model": "microsoft/phi-2",
  "device": "cuda"
}
Step 7 — Add an Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hf-inference
  namespace: ml-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
    - host: inference.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: hf-inference
                port:
                  number: 80
Cost Optimization Tips
Use Spot instances for GPU nodes — g4dn.xlarge Spot is ~$0.16/hour vs $0.53 on-demand.
eksctl create nodegroup \
--cluster ml-cluster \
--name gpu-spot \
--node-type g4dn.xlarge \
--spot \
--nodes-min 0 \
--nodes-max 2
Scale to zero when not in use — a standard HPA cannot scale below one replica, so use KEDA to scale the deployment to zero and back up on incoming HTTP traffic. The Karpenter + KEDA combo can reduce GPU costs by 70%+ for batch workloads.
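When KEDA is not an option, a plain HorizontalPodAutoscaler (which bottoms out at one replica) still helps absorb load spikes. A sketch, assuming metrics-server is installed; the thresholds here are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hf-inference
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

One caveat: the model-cache PVC above is ReadWriteOnce, so a second replica scheduled on a different node cannot mount it. For multi-replica scaling, switch to a ReadWriteMany storage class or let each pod download its own copy.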
Use quantized models — models quantized to 4-bit (GGUF, AWQ, bitsandbytes) need far less VRAM, so they fit on smaller, cheaper GPUs. For phi-2, load the model in 4-bit via bitsandbytes.
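A sketch of the 4-bit load — this assumes the bitsandbytes package is in the image and a CUDA GPU is present, and would replace the `from_pretrained` call in app.py:

```python
# 4-bit quantized load via bitsandbytes (requires a CUDA GPU and
# the bitsandbytes package installed alongside transformers).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # run matmuls in fp16
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,
)
```

This cuts phi-2's weight footprint from roughly 5 GB in fp16 to well under 2 GB, at some cost in output quality.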
Summary
| Step | What it does |
|---|---|
| GPU node group | g4dn.xlarge with NVIDIA T4 |
| NVIDIA device plugin | Exposes GPU as schedulable resource |
| FastAPI app | Wraps HuggingFace model as REST API |
| PVC | Caches model across pod restarts |
| Deployment | GPU resource request + readiness probe |
| Ingress | External access with timeout tuning |
The pattern here — FastAPI + HuggingFace + K8s GPU nodes — is a solid production baseline and, with minor changes to the app code, works for any causal LM on the HuggingFace Hub.