
Deploy HuggingFace Models on Kubernetes with GPU Nodes (2026)

Step-by-step guide to deploying HuggingFace transformer models on Kubernetes using GPU nodes — from cluster setup to inference API in production.

DevOpsBoys · Apr 19, 2026 · 5 min read

Running HuggingFace models locally is easy. Running them reliably in production — with GPU scheduling, autoscaling, health checks, and a proper API — is a different challenge.

This guide walks through deploying a HuggingFace model as an inference service on Kubernetes with GPU nodes.

What We're Building

  • EKS cluster (or any Kubernetes cluster) with GPU node group
  • NVIDIA device plugin for GPU scheduling
  • A FastAPI inference server wrapping a HuggingFace model
  • Kubernetes Deployment + Service + HPA
  • Health checks and readiness probes

We'll deploy microsoft/phi-2 (a small but capable LLM) as an example, but the pattern works for any HuggingFace model.


Step 1 — Set Up GPU Nodes on EKS

Create EKS Cluster with GPU Node Group

bash
eksctl create cluster \
  --name ml-cluster \
  --region ap-south-1 \
  --nodegroup-name cpu-workers \
  --node-type t3.medium \
  --nodes 2
 
# Add GPU node group (g4dn.xlarge: 1 NVIDIA T4 GPU, 16 GB VRAM)
# Note: no trailing comments inside the command — they would break the \ continuations
eksctl create nodegroup \
  --cluster ml-cluster \
  --name gpu-workers \
  --node-type g4dn.xlarge \
  --nodes 1 \
  --nodes-min 0 \
  --nodes-max 3 \
  --node-labels "role=gpu-inference" \
  --asg-access

g4dn.xlarge costs ~$0.53/hour on-demand or ~$0.16/hour Spot. For development, 1 node is enough.
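To put those hourly rates in perspective, here is a quick back-of-envelope calculation (assuming an average ~730-hour month and one always-on node):

```python
# Rough monthly cost of one g4dn.xlarge node at the rates above.
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(hourly_rate: float) -> float:
    """Monthly USD cost for one node running 24/7."""
    return round(hourly_rate * HOURS_PER_MONTH, 2)

print(monthly_cost(0.53))  # on-demand: ~387 USD/month
print(monthly_cost(0.16))  # Spot:     ~117 USD/month
```

Spot pricing works out to roughly a 70% saving — worth keeping in mind for the cost-optimization tips later in this guide.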

Install NVIDIA Device Plugin

The device plugin lets Kubernetes see and allocate GPUs as resources:

bash
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Verify GPUs are visible:

bash
kubectl get nodes -l role=gpu-inference -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'
# "1"

Step 2 — Build the Inference Server

Create a FastAPI application that loads and serves a HuggingFace model.

app.py

python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import os
 
app = FastAPI(title="HuggingFace Inference API")
 
MODEL_NAME = os.getenv("MODEL_NAME", "microsoft/phi-2")
device = "cuda" if torch.cuda.is_available() else "cpu"
 
print(f"Loading {MODEL_NAME} on {device}...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    trust_remote_code=True,
    device_map="auto"
)
 
class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 200
    temperature: float = 0.7
 
class GenerateResponse(BaseModel):
    generated_text: str
    model: str
    device: str
 
@app.get("/health")
def health():
    return {"status": "healthy", "device": device, "model": MODEL_NAME}
 
@app.post("/generate", response_model=GenerateResponse)
def generate(request: GenerateRequest):
    inputs = tokenizer(request.prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=request.max_new_tokens,
            temperature=request.temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generated = tokenizer.decode(output[0], skip_special_tokens=True)
    return GenerateResponse(
        generated_text=generated[len(request.prompt):].strip(),
        model=MODEL_NAME,
        device=device
    )

requirements.txt

fastapi==0.110.0
uvicorn==0.27.0
transformers==4.39.0
torch==2.2.1
accelerate==0.28.0
pydantic==2.6.0

Dockerfile

dockerfile
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
 
WORKDIR /app
 
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
 
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
 
COPY app.py .
 
ENV MODEL_NAME=microsoft/phi-2
ENV HF_HOME=/model-cache
 
EXPOSE 8080
 
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Build and push to ECR:

bash
# Create ECR repo
aws ecr create-repository --repository-name hf-inference --region ap-south-1
 
# Build and push
ECR_URI=$(aws ecr describe-repositories --repository-names hf-inference \
  --query 'repositories[0].repositoryUri' --output text)
 
aws ecr get-login-password --region ap-south-1 | \
  docker login --username AWS --password-stdin $ECR_URI
 
docker build -t hf-inference .
docker tag hf-inference:latest $ECR_URI:latest
docker push $ECR_URI:latest

Step 3 — Create a PersistentVolume for Model Cache

Models are large (phi-2 is ~5GB). Download once, cache on a PVC:

yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ml-inference
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 20Gi

Step 4 — Deploy the Inference Server

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hf-inference
  template:
    metadata:
      labels:
        app: hf-inference
    spec:
      nodeSelector:
        role: gpu-inference          # Schedule only on GPU nodes
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: inference
        image: <ECR_URI>:latest
        ports:
        - containerPort: 8080
        env:
        - name: MODEL_NAME
          value: "microsoft/phi-2"
        - name: HF_HOME
          value: "/model-cache"
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
            nvidia.com/gpu: "1"      # Request 1 GPU
          limits:
            memory: "12Gi"
            cpu: "4"
            nvidia.com/gpu: "1"      # Limit to 1 GPU
        volumeMounts:
        - name: model-cache
          mountPath: /model-cache
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 120   # Model loading takes time
          periodSeconds: 10
          failureThreshold: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 180
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: model-cache
---
apiVersion: v1
kind: Service
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  selector:
    app: hf-inference
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Apply:

bash
kubectl create namespace ml-inference
kubectl apply -f pvc.yaml
kubectl apply -f deployment.yaml
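The overview at the top also lists an HPA. A minimal sketch is below — it scales on CPU utilization, which is only a rough proxy for GPU inference load, and the names are assumed to match the Deployment above:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hf-inference
  minReplicas: 1
  maxReplicas: 3          # each replica needs its own GPU
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Keep maxReplicas at or below the GPU node group's --nodes-max, since every replica requests a full GPU. Also note the gp3 PVC above is ReadWriteOnce, so replicas scheduled on other nodes will download their own copy of the model on first start.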

Step 5 — Watch Model Loading

The first startup takes 3–5 minutes (downloading the model). Watch the logs:

bash
kubectl logs -f deployment/hf-inference -n ml-inference
# Loading microsoft/phi-2 on cuda...
# Downloading model files: 100%|██████| 5.18G/5.18G
# Application startup complete.

Check readiness:

bash
kubectl get pods -n ml-inference -w
# hf-inference-xxx   0/1   Running   0   2m
# hf-inference-xxx   1/1   Running   0   4m   <- Ready after model loads

Step 6 — Test the Inference API

bash
# Port-forward to test locally
kubectl port-forward svc/hf-inference 8080:80 -n ml-inference
 
# Send a test request
curl -X POST http://localhost:8080/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Kubernetes in simple terms:", "max_new_tokens": 100}'

Expected response:

json
{
  "generated_text": "Kubernetes is a container orchestration platform...",
  "model": "microsoft/phi-2",
  "device": "cuda"
}
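The same call from Python, using only the standard library. This assumes the port-forward above is running; `build_request` and `generate` are helper names introduced here for illustration, not part of the app:

```python
import json
import urllib.request

API_URL = "http://localhost:8080/generate"  # reachable via the port-forward above

def build_request(prompt: str, max_new_tokens: int = 100) -> urllib.request.Request:
    """Build the POST request that the /generate endpoint expects."""
    body = json.dumps({"prompt": prompt, "max_new_tokens": max_new_tokens}).encode()
    return urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )

def generate(prompt: str, max_new_tokens: int = 100, timeout: int = 300) -> str:
    """Call the inference API and return only the generated text."""
    req = build_request(prompt, max_new_tokens)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["generated_text"]

# generate("Explain Kubernetes in simple terms:")  # requires the service to be up
```

The generous timeout matters: generation with 200 new tokens can take tens of seconds on a T4, which is also why the Ingress below raises its proxy timeouts.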

Step 7 — Add an Ingress

yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hf-inference
  namespace: ml-inference
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  ingressClassName: nginx
  rules:
  - host: inference.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hf-inference
            port:
              number: 80

Cost Optimization Tips

Use Spot instances for GPU nodes — g4dn.xlarge Spot is ~$0.16/hour vs $0.53 on-demand.

bash
eksctl create nodegroup \
  --cluster ml-cluster \
  --name gpu-spot \
  --node-type g4dn.xlarge \
  --spot \
  --nodes-min 0 \
  --nodes-max 2

Scale to zero when not in use — a plain HPA can't go below one replica (without the HPAScaleToZero feature gate), but KEDA can scale the Deployment to zero and bring it back when traffic arrives. The Karpenter + KEDA combo can reduce GPU costs by 70%+ for bursty or batch workloads.
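A hedged sketch of the KEDA side, assuming KEDA is installed and a Prometheus instance scrapes your ingress controller — the server address and metric query here are illustrative assumptions, not drop-in values:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: hf-inference
  namespace: ml-inference
spec:
  scaleTargetRef:
    name: hf-inference        # the Deployment above
  minReplicaCount: 0          # scale to zero when idle
  maxReplicaCount: 3
  cooldownPeriod: 300         # 5 min of no traffic before scaling to zero
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # assumed address
      query: sum(rate(nginx_ingress_controller_requests{service="hf-inference"}[2m]))
      threshold: "1"
```

Remember the cold-start cost: scaling from zero means provisioning a GPU node and reloading the model, so the first request after an idle period can take several minutes.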

Use quantized models — 4-bit quantization (AWQ, GPTQ, or bitsandbytes; GGUF plays the same role in the llama.cpp ecosystem) cuts weight memory by roughly 4× versus fp16, letting models fit on smaller GPUs, and is often faster when inference is memory-bandwidth bound. For phi-2, look for community quantized exports on the Hub, or load the base model with load_in_4bit=True via bitsandbytes.
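Why 4-bit helps, in rough numbers: phi-2 has ~2.7B parameters, so the memory needed for weights alone scales directly with bits per parameter (activations and KV cache add overhead on top, so treat this as a lower bound):

```python
# Back-of-envelope VRAM needed just for the model weights.
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Gigabytes of memory for the weights alone at a given precision."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

PHI2_PARAMS_B = 2.7  # microsoft/phi-2 parameter count, in billions

print(weight_gb(PHI2_PARAMS_B, 16))  # fp16  ≈ 5.4 GB (matches the ~5GB download)
print(weight_gb(PHI2_PARAMS_B, 4))   # 4-bit ≈ 1.35 GB
```

At 4-bit, phi-2's weights fit comfortably in the T4's 16 GB with room to spare for batching, and even much larger models come within reach of a single g4dn.xlarge.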


Summary

| Step | What it does |
| --- | --- |
| GPU node group | g4dn.xlarge with NVIDIA T4 |
| NVIDIA device plugin | Exposes GPU as a schedulable resource |
| FastAPI app | Wraps HuggingFace model as a REST API |
| PVC | Caches model across pod restarts |
| Deployment | GPU resource request + readiness probe |
| Ingress | External access with timeout tuning |

The pattern here — FastAPI + HuggingFace + K8s GPU nodes — is production-ready and works for any model from the HuggingFace hub.

Run this stack on DigitalOcean GPU Droplets or spin up EKS GPU nodes. New DigitalOcean accounts get $200 free credit — enough to experiment with GPU inference for several hours.
