
LLM Fine-Tuning on AWS SageMaker — When, Why and How 2026

When should you fine-tune an LLM vs just prompting? How do you do it on SageMaker? This guide covers the decision framework and step-by-step fine-tuning with LoRA on AWS.

DevOpsBoys | May 31, 2026 | 4 min read

Fine-tuning is one of the most misused techniques in LLM deployment. Many teams reach for it when better prompting would solve the problem. But when you do need it, SageMaker makes it manageable.

Here's when to fine-tune and how to do it.


When to Fine-Tune (And When NOT To)

Don't fine-tune if:

  • You haven't tried few-shot prompting first
  • You have fewer than 500 training examples
  • You need the model to "know" new facts (use RAG instead)
  • Budget is a constraint (fine-tuning + inference costs add up fast)

Do fine-tune when:

  • You need a consistent output FORMAT (JSON with a specific schema, code in a specific style)
  • You need domain-specific TONE (legal language, medical terminology)
  • Inference latency matters and you want a smaller specialized model
  • You have 1000+ high-quality examples of input/output pairs
  • The base model can't perform the task reliably even with good prompts
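
The two checklists above can be condensed into a rough triage function. This is only a sketch: the function name, its parameters, and the returned strings are hypothetical, and the 500 and 1000 thresholds are the rule-of-thumb numbers from the lists, not hard limits.

```python
def fine_tuning_advice(
    tried_few_shot: bool,
    num_examples: int,
    needs_new_facts: bool,
    prompts_are_reliable: bool,
) -> str:
    """Rough triage based on the checklists above (hypothetical helper)."""
    if needs_new_facts:
        return "use RAG"
    if not tried_few_shot:
        return "try few-shot prompting first"
    if prompts_are_reliable:
        return "keep prompting"
    if num_examples < 500:
        return "collect more data"
    if num_examples < 1000:
        return "borderline: collect more data if you can"
    return "fine-tune"


print(fine_tuning_advice(True, 1200, False, False))
```

The order of the checks matters: the cheap alternatives (RAG, few-shot prompting) are ruled out before data volume is even considered.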

Fine-Tuning Methods

Full fine-tuning: Update all model weights. Most powerful, most expensive. Typically needs high-memory GPUs such as A100s or H100s.

LoRA (Low-Rank Adaptation): Freeze the base weights and train small low-rank adapter matrices alongside them. Roughly 10–100x cheaper than full fine-tuning, and as a rule of thumb it gets you most of the quality for a fraction of the cost. This is what you'll actually use.

QLoRA: LoRA on top of a base model quantized to 4-bit. Fits large models on smaller GPUs. Slight quality drop.
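
To see why LoRA is so much cheaper, it helps to count the weights it actually trains. The sketch below assumes Mistral-7B-style layer shapes (hidden size 4096, a 1024-wide value projection, 32 layers) and adapters on the query and value projections only; check your model's config.json before relying on these numbers. The function name is hypothetical.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed Mistral-7B-style shapes (verify against the model's config.json)
HIDDEN = 4096   # hidden size, also the q_proj output width
KV_DIM = 1024   # v_proj output width (8 key/value heads x 128)
LAYERS = 32
RANK = 16

per_layer = lora_params(HIDDEN, HIDDEN, RANK) + lora_params(HIDDEN, KV_DIM, RANK)
total = per_layer * LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the adapters add about 6.8 million weights, which is on the order of 0.1% of a 7B-parameter model. Only those weights need gradients and optimizer state.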


SageMaker Fine-Tuning with LoRA

Step 1: Prepare Your Data

python
# Dataset format for instruction fine-tuning
# JSONL file: each line is one training example
 
import json
 
training_data = [
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "List all pods in the production namespace that are not running",
        "output": "kubectl get pods -n production --field-selector=status.phase!=Running"
    },
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "Show me the logs of the api-server pod from the last hour",
        "output": "kubectl logs api-server --since=1h"
    },
    # ... 1000+ examples
]
 
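# Hold out the last 10% for validation (shuffle first if your data is ordered)
split = int(len(training_data) * 0.9)
splits = {"train.jsonl": training_data[:split], "val.jsonl": training_data[split:]}
 
# Write each split as JSONL: one JSON object per line
for path, rows in splits.items():
    with open(path, "w") as f:
        for item in rows:
            f.write(json.dumps(item) + "\n")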
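
Before uploading, it can be worth checking that every line parses and carries the three keys the training script expects. The following is a minimal sketch; `validate_jsonl` and `REQUIRED_KEYS` are hypothetical names, and the checks are only the obvious ones.

```python
import json

REQUIRED_KEYS = {"instruction", "input", "output"}


def validate_jsonl(path: str) -> list[str]:
    """Return a list of human-readable problems found in a JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                problems.append(f"line {line_no}: empty line")
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            if not isinstance(record, dict):
                problems.append(f"line {line_no}: expected a JSON object")
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append(f"line {line_no}: missing {sorted(missing)}")
            elif not str(record["output"]).strip():
                problems.append(f"line {line_no}: empty output")
    return problems
```

Calling `validate_jsonl("train.jsonl")` returns an empty list when the file is clean, so it can gate the upload step in a script.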

Upload to S3:

bash
aws s3 cp train.jsonl s3://my-training-data/kubectl-assistant/
aws s3 cp val.jsonl s3://my-training-data/kubectl-assistant/

Step 2: SageMaker Training Job

python
# finetune.py
import sagemaker
from sagemaker.huggingface import HuggingFace
 
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
 
# Training hyperparameters
hyperparameters = {
    "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "dataset_path": "/opt/ml/input/data/training",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
    
    # LoRA config
    "use_peft": True,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": "q_proj,v_proj",
    
    # QLoRA — enable for smaller GPUs
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
}
 
huggingface_estimator = HuggingFace(
    entry_point="train.py",          # Training script
    source_dir="./scripts",          # Add a requirements.txt here if the container lacks peft, trl, or bitsandbytes
    instance_type="ml.g5.2xlarge",   # A10G GPU
    instance_count=1,
    base_job_name="mistral-kubectl-finetune",
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters=hyperparameters,
    environment={"HF_TOKEN": "your-hf-token"},  # For gated models; placeholder, do not commit a real token
)
 
huggingface_estimator.fit({
    "training": "s3://my-training-data/kubectl-assistant/"
})
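
In script mode, SageMaker hands each entry of `hyperparameters` to the entry point as a `--name value` command-line argument, with every value arriving as a string. A training script can pick them up with `argparse` instead of hard-coding them. The sketch below mirrors the names in the dictionary above; `str_to_bool` and `parse_hyperparameters` are hypothetical helpers, and the defaults are simply the values used in this guide.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Booleans arrive as strings such as "True" (hypothetical helper)."""
    return value.strip().lower() in {"1", "true", "yes"}


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, required=True)
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--max_seq_length", type=int, default=2048)
    parser.add_argument("--lora_r", type=int, default=16)
    parser.add_argument("--lora_alpha", type=int, default=32)
    parser.add_argument("--lora_dropout", type=float, default=0.05)
    parser.add_argument("--lora_target_modules", type=lambda s: s.split(","),
                        default=["q_proj", "v_proj"])
    parser.add_argument("--load_in_4bit", type=str_to_bool, default=True)
    # Ignore hyperparameters this sketch does not declare (e.g. dataset_path)
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_hyperparameters([
    "--model_id", "mistralai/Mistral-7B-Instruct-v0.3",
    "--lora_r", "8",
    "--load_in_4bit", "False",
])
print(args.lora_r, args.load_in_4bit, args.lora_target_modules)
```

Parsing the values this way keeps the estimator as the single place where a run is configured, so changing `lora_r` does not require editing the training script.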

Step 3: Training Script

python
# scripts/train.py
import os
import json
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
 
model_id = os.environ.get("SM_HP_MODEL_ID", "mistralai/Mistral-7B-Instruct-v0.3")
output_dir = "/opt/ml/model"
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
 
# Load dataset
dataset = load_dataset("json", data_files={
    "train": f"{os.environ['SM_CHANNEL_TRAINING']}/train.jsonl",
    "validation": f"{os.environ['SM_CHANNEL_TRAINING']}/val.jsonl",
})
 
def format_prompt(example):
    return f"""<s>[INST] {example['instruction']}
 
{example['input']} [/INST] {example['output']}</s>"""
 
# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
 
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with r=16 on q_proj and v_proj, expect about 6.8M trainable parameters (well under 1%)
 
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    report_to="none",
)
 
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=format_prompt,
    max_seq_length=2048,
)
 
trainer.train()
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
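
The batch size, accumulation, and epoch settings determine how many optimizer steps the job runs, which in turn decides how often `save_steps=100` and `eval_steps=100` actually fire. A rough calculation, assuming 1000 training examples on a single GPU (the function name is hypothetical, and the trainer's own rounding may differ slightly):

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int,
                    grad_accum: int, epochs: int, num_devices: int = 1) -> int:
    """Approximate number of optimizer steps for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch) * epochs


total = optimizer_steps(num_examples=1000, per_device_batch=2, grad_accum=4, epochs=3)
checkpoints = list(range(100, total + 1, 100))
print(total, checkpoints)
```

With these settings the effective batch size is 8, so a 1000-example run takes about 375 steps and is evaluated and checkpointed three times. For much smaller datasets, lower `eval_steps` or the run may finish without a single evaluation.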

Deploy the Fine-Tuned Model

python
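# deploy.py
# Run this in the same session as finetune.py so huggingface_estimator is defined,
# or set model_data to the S3 URI of the training job's model.tar.gz.
# Note: train.py saves LoRA adapter weights, so merge them into the base model
# or make sure the serving container can load PEFT adapters.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
 
role = sagemaker.get_execution_role()
 
model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,  # S3 path to fine-tuned weights
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    env={"HF_TASK": "text-generation"},  # Explicit pipeline task for the inference toolkit
)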
 
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="kubectl-assistant"
)
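
Requests to the endpoint should use the same prompt layout the model saw during training. The sketch below builds such a payload; `build_request` is a hypothetical helper, and it assumes the endpoint serves the Hugging Face text-generation task and that the tokenizer adds the beginning-of-sequence token itself (so the leading `<s>` is left out).

```python
def build_request(instruction: str, user_input: str, max_new_tokens: int = 64) -> dict:
    """Build a payload whose prompt mirrors the [INST] format used in training."""
    prompt = f"[INST] {instruction}\n\n{user_input} [/INST]"
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,  # return only the completion, not the prompt
        },
    }


payload = build_request(
    "Convert this English Kubernetes command to a kubectl command",
    "List all pods in the production namespace that are not running",
)
print(payload["inputs"])
```

The payload can then be sent with `predictor.predict(payload)`. Keeping prompt construction in one function avoids a mismatch between training-time and inference-time formatting, which is a common cause of poor results from an otherwise well-trained adapter.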

Cost Estimate

Training: ml.g5.2xlarge ($1.21/hour) × 3 hours for 1000 examples = ~$3.63

Inference: ml.g5.xlarge ($1.01/hour) × 720 hours/month = ~$727/month

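For low-traffic workloads, avoid paying for an idle endpoint. An asynchronous inference endpoint can scale down to zero instances between requests. Serverless inference also has no idle cost, but it has traditionally been limited to CPU and a few GB of memory, so confirm it can host a 7B model before relying on it. The hourly rates above are ballpark figures; check current SageMaker pricing for your Region.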
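
The arithmetic behind these estimates is simple enough to script, which makes it easy to plug in your own rates and schedules. The rates below are the ones quoted in this section, and the business-hours schedule (10 hours a day, 22 days a month) is a hypothetical example.

```python
def instance_cost(hourly_rate: float, hours: float) -> float:
    """Cost in USD for running one instance for the given number of hours."""
    return round(hourly_rate * hours, 2)


TRAIN_RATE = 1.21      # ml.g5.2xlarge, rate quoted in this section
ENDPOINT_RATE = 1.01   # ml.g5.xlarge, rate quoted in this section

training = instance_cost(TRAIN_RATE, 3)
always_on = instance_cost(ENDPOINT_RATE, 24 * 30)
business_hours = instance_cost(ENDPOINT_RATE, 10 * 22)  # hypothetical schedule

print(training, always_on, business_hours)
```

The comparison shows where the money goes: a single training run costs a few dollars, while an always-on endpoint dominates the monthly bill, so scheduling or scale-to-zero usually saves more than tuning the training job.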


Evaluate Before Deploying

python
# Always evaluate against a held-out test set
test_cases = [
    {
        "input": "Delete all pods with label app=nginx",
        "expected": "kubectl delete pods -l app=nginx",
    }
]
 
correct = 0
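for case in test_cases:
    response = predictor.predict({
        "inputs": case["input"],
        "parameters": {"return_full_text": False},  # Drop the echoed prompt
    })
    # The text-generation response is a list with one dict per generated sequence
    prediction = response[0]["generated_text"]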
    if prediction.strip() == case["expected"].strip():
        correct += 1
 
accuracy = correct / len(test_cases)
print(f"Test accuracy: {accuracy:.1%}")
# Target: >85% for deployment
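
Strict string equality penalizes harmless differences such as extra spaces or quoting style. One option, sketched here with hypothetical helper names, is to compare shell tokens instead of raw strings. It still treats reordered flags as a mismatch, so it remains a conservative metric.

```python
import shlex


def normalize(command: str) -> tuple[str, ...]:
    """Tokenize like a shell so spacing and quoting style do not matter."""
    try:
        return tuple(shlex.split(command))
    except ValueError:  # unbalanced quotes in model output
        return tuple(command.split())


def exact_match_accuracy(predictions: list[str], expected: list[str]) -> float:
    """Share of predictions whose tokens equal the expected command's tokens."""
    if not expected:
        return 0.0
    hits = sum(normalize(p) == normalize(e) for p, e in zip(predictions, expected))
    return hits / len(expected)


preds = ["kubectl delete pods  -l 'app=nginx'", "kubectl get pod"]
gold = ["kubectl delete pods -l app=nginx", "kubectl get pods"]
print(f"{exact_match_accuracy(preds, gold):.1%}")
```

Here the first prediction counts as correct despite the extra space and quotes, while the second is still a miss because `pod` and `pods` are different tokens.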

Fine-tuning on SageMaker is straightforward. The hard part is getting good training data — 1000+ high-quality, consistent input/output pairs. That's where most fine-tuning projects succeed or fail.

For managed ML infrastructure, AWS SageMaker handles the GPU management, scaling, and model storage so you focus on data and training logic.
