LLM Fine-Tuning on AWS SageMaker — When, Why and How 2026
When should you fine-tune an LLM vs just prompting? How do you do it on SageMaker? This guide covers the decision framework and step-by-step fine-tuning with LoRA on AWS.
Fine-tuning is one of the most misused concepts in LLM deployment. Most teams reach for fine-tuning when better prompting would solve the problem. But when you do need it, SageMaker makes it manageable.
Here's when to fine-tune and how to do it.
When to Fine-Tune (And When NOT To)
Don't fine-tune if:
- You haven't tried few-shot prompting first
- You have fewer than 500 training examples
- You need the model to "know" new facts (use RAG instead)
- Budget is a constraint (fine-tuning + inference costs add up fast)
Do fine-tune when:
- You need consistent output FORMAT (JSON with specific schema, code in specific style)
- You need domain-specific TONE (legal language, medical terminology)
- Inference latency matters and you want a smaller specialized model
- You have 1000+ high-quality examples of input/output pairs
- Base model can't perform the task reliably even with good prompts
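The two checklists above can be read as a rough decision procedure. The sketch below encodes one possible interpretation in Python. The `UseCase` fields, the `recommend` function, and the order of the checks are illustrative assumptions rather than part of any library; the 500 and 1000 thresholds are the ones from the lists.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    """Hypothetical summary of a project, used only for this sketch."""
    tried_few_shot: bool         # Has few-shot prompting been tried?
    needs_new_facts: bool        # Does the model need knowledge it was not trained on?
    prompting_is_reliable: bool  # Do good prompts already solve the task?
    num_examples: int            # High-quality input/output pairs available


def recommend(case: UseCase) -> str:
    """Return a rough recommendation based on the checklists above."""
    if not case.tried_few_shot:
        return "try few-shot prompting first"
    if case.needs_new_facts:
        return "use RAG"
    if case.prompting_is_reliable:
        return "keep prompting"
    if case.num_examples < 500:
        return "collect more data"
    if case.num_examples < 1000:
        # Between the two thresholds the lists give no clear answer.
        return "borderline: collect more data if possible"
    return "fine-tune"


if __name__ == "__main__":
    case = UseCase(tried_few_shot=True, needs_new_facts=False,
                   prompting_is_reliable=False, num_examples=1200)
    print(recommend(case))
```

The checks run from cheapest remedy to most expensive, so fine-tuning is only recommended once prompting and RAG have been ruled out.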
Fine-Tuning Methods
Full fine-tuning: Update all model weights. Most powerful, most expensive. Typically needs A100/H100-class GPUs for a 7B model.
LoRA (Low-Rank Adaptation): Freeze the base weights and train small low-rank adapter matrices alongside them. As a rule of thumb it is 10–100x cheaper to train and gets close to full fine-tuning quality on most narrow tasks. This is what you'll actually use.
QLoRA: LoRA on top of a base model quantized to 4-bit. Fits large models on smaller GPUs. Slight quality drop.
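To see where LoRA's savings come from, it helps to count parameters. For one weight matrix of shape d_out × d_in, LoRA trains two matrices, A (r × d_in) and B (d_out × r), instead of the full matrix. The sketch below does that arithmetic; the function names are made up for illustration and the 4096 × 4096 shape is an assumed example size, not a measurement of any specific model.

```python
def full_weight_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is trained."""
    return d_in * d_out


def lora_adapter_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


if __name__ == "__main__":
    d_in = d_out = 4096  # Assumed, illustrative projection size
    r = 16               # LoRA rank

    full = full_weight_params(d_in, d_out)
    lora = lora_adapter_params(d_in, d_out, r)
    print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

For a 4096 × 4096 matrix at rank 16 this gives 131,072 adapter parameters against 16,777,216 in the full matrix, a 128x reduction for that layer. The saving applies to trainable parameters and optimizer state; the frozen base model still has to fit in GPU memory, which is the part QLoRA addresses.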
SageMaker Fine-Tuning with LoRA
Step 1: Prepare Your Data
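Each training record in this guide has three string fields: `instruction`, `input`, and `output`. The upload step expects both a `train.jsonl` and a `val.jsonl`, so a validation split is needed at some point. Here is a minimal sketch of validating records and holding out a validation set. The helper names, the 10% split, and the fixed seed are assumptions chosen for illustration.

```python
import json
import random

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_record(record: dict) -> list:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], str):
            problems.append(f"field is not a string: {field}")
    # "input" may be empty for some tasks; instruction and output should not be.
    for field in ("instruction", "output"):
        if isinstance(record.get(field), str) and not record[field].strip():
            problems.append(f"empty field: {field}")
    return problems


def split_records(records: list, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle deterministically and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def write_jsonl(path: str, records: list) -> None:
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

A typical use would be to drop records for which `validate_record` reports problems, call `split_records` on the rest, and write the two lists to `train.jsonl` and `val.jsonl` with `write_jsonl`.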
# Dataset format for instruction fine-tuning
# JSONL file: each line is one training example
import json

training_data = [
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "List all pods in the production namespace that are not running",
        "output": "kubectl get pods -n production --field-selector=status.phase!=Running"
    },
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "Show me the logs of the api-server pod from the last hour",
        "output": "kubectl logs api-server --since=1h"
    },
    # ... 1000+ examples
]

# Write to JSONL
with open("train.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")

Upload to S3:
aws s3 cp train.jsonl s3://my-training-data/kubectl-assistant/
aws s3 cp val.jsonl s3://my-training-data/kubectl-assistant/

Step 2: SageMaker Training Job
# finetune.py
import os

import sagemaker
from sagemaker.huggingface import HuggingFace

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Training hyperparameters
# SageMaker exposes each of these to the training script as an SM_HP_<NAME> environment variable
hyperparameters = {
    "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "dataset_path": "/opt/ml/input/data/training",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
    # LoRA config
    "use_peft": True,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": "q_proj,v_proj",
    # QLoRA: enable for smaller GPUs
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # Training script
    source_dir="./scripts",          # Put a requirements.txt here if the container lacks peft, trl or bitsandbytes
    instance_type="ml.g5.2xlarge",   # A10G GPU
    instance_count=1,
    base_job_name="mistral-kubectl-finetune",
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters=hyperparameters,
    # For gated models. Read the token from your environment instead of hardcoding it.
    environment={"HF_TOKEN": os.environ["HF_TOKEN"]},
)

# The "training" channel is mounted in the container at /opt/ml/input/data/training
huggingface_estimator.fit({
    "training": "s3://my-training-data/kubectl-assistant/"
})

Step 3: Training Script
# scripts/train.py
import os

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

model_id = os.environ.get("SM_HP_MODEL_ID", "mistralai/Mistral-7B-Instruct-v0.3")
output_dir = "/opt/ml/model"  # SageMaker packages this directory and uploads it to S3

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Load dataset
data_dir = os.environ["SM_CHANNEL_TRAINING"]
dataset = load_dataset("json", data_files={
    "train": f"{data_dir}/train.jsonl",
    "validation": f"{data_dir}/val.jsonl",
})

def format_prompt(example):
    # Mistral instruct template: [INST] prompt [/INST] answer
    # Note: some trl releases call formatting_func on a batch and expect a list
    # of strings back; adapt this function to the trl version you pin.
    return f"<s>[INST] {example['instruction']}\n{example['input']} [/INST] {example['output']}</s>"

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# QLoRA: load the base model in 4-bit, then attach the LoRA adapters
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with r=16 on q_proj and v_proj,
# well under 1% of the weights are trainable

training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    report_to="none",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=format_prompt,
    max_seq_length=2048,
)

trainer.train()
trainer.save_model(output_dir)  # For a PEFT model this saves the LoRA adapter, not the full weights
tokenizer.save_pretrained(output_dir)

Deploy the Fine-Tuned Model
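# deploy.py
# Assumes `huggingface_estimator` and `role` from finetune.py are still in scope
# (for example, in the same notebook session).
# Note: the training output contains the LoRA adapter only. Merge it into the base
# model, or load base model plus adapter in a custom inference script, before serving.
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,  # S3 path to the training output (model.tar.gz)
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    env={"HF_TASK": "text-generation"},  # Tells the inference container which pipeline to load
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="kubectl-assistant"
)

Cost Estimate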
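Training: ml.g5.2xlarge (about $1.21/hour) × 3 hours for 1000 examples = ~$3.63
Inference: ml.g5.xlarge (about $1.01/hour) × 720 hours/month = ~$727/month
These hourly rates are approximate and vary by region, so check the SageMaker pricing page for current numbers. The always-on endpoint dominates the bill, not the training run.
For low-traffic workloads, look at options with no idle cost, such as serverless inference (pay per request). Check first that it supports the GPU and memory your model needs; it has historically been limited to CPU.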
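The arithmetic behind these estimates is easy to script, which helps when comparing instance types or usage patterns. The sketch below uses the hourly rates quoted in this section as inputs, not live pricing; the function names and the 176-hour "business hours only" scenario are assumptions added for illustration.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, as used in the estimate above


def training_cost(hourly_rate: float, hours: float) -> float:
    """One-off cost of a training job, in dollars."""
    return round(hourly_rate * hours, 2)


def endpoint_monthly_cost(hourly_rate: float,
                          hours_per_month: float = HOURS_PER_MONTH,
                          instance_count: int = 1) -> float:
    """Cost of keeping a real-time endpoint running, in dollars per month."""
    return round(hourly_rate * hours_per_month * instance_count, 2)


if __name__ == "__main__":
    print("training:", training_cost(1.21, 3))
    print("always-on endpoint:", endpoint_monthly_cost(1.01))
    # Hypothetical scenario: endpoint only up 8 hours x 22 working days
    print("business hours only:", endpoint_monthly_cost(1.01, hours_per_month=176))
```

With these inputs the always-on endpoint costs about 200 times the training run each month, so shutting the endpoint down outside working hours (176 hours instead of 720) is the biggest lever in this estimate.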
Evaluate Before Deploying
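Two details can quietly break an exact-match evaluation: prompting the model with a different template than it was trained on, and comparing against output that still contains the prompt or stray whitespace. The helpers below are a sketch of one way to handle both. They assume the `[INST] ... [/INST]` template from the training script in Step 3 and that the tokenizer adds the leading `<s>` token itself at inference time; the function names are hypothetical.

```python
def build_prompt(instruction: str, user_input: str) -> str:
    """Format an inference prompt the same way as the training examples."""
    return f"[INST] {instruction}\n{user_input} [/INST]"


def extract_completion(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt, if present, and surrounding whitespace."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


def normalize(command: str) -> str:
    """Collapse runs of whitespace so spacing differences do not count as errors."""
    return " ".join(command.split())


def exact_match_accuracy(predictions: list, expected: list) -> float:
    """Fraction of predictions equal to the expected string after normalization."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(normalize(p) == normalize(e) for p, e in zip(predictions, expected))
    return hits / len(expected)
```

Exact match is a strict metric for shell commands, since `kubectl delete pods -l app=nginx` and `kubectl delete pod -l app=nginx` are both valid. Treat the score as a lower bound and review the mismatches by hand.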
# Always evaluate against a held-out test set
test_cases = [
    {
        "input": "Delete all pods with label app=nginx",
        "expected": "kubectl delete pods -l app=nginx",
    }
]

correct = 0
for case in test_cases:
    # The text-generation endpoint returns a list with one dict per input.
    # return_full_text=False keeps the prompt out of generated_text.
    response = predictor.predict({
        "inputs": case["input"],
        "parameters": {"return_full_text": False},
    })
    prediction = response[0]["generated_text"]
    if prediction.strip() == case["expected"].strip():
        correct += 1

accuracy = correct / len(test_cases)
print(f"Test accuracy: {accuracy:.1%}")
# Target: >85% for deployment

Fine-tuning on SageMaker is straightforward. The hard part is getting good training data: 1000+ high-quality, consistent input/output pairs. That's where most fine-tuning projects succeed or fail.
For managed ML infrastructure, AWS SageMaker handles the GPU management, scaling, and model storage so you focus on data and training logic.