LLM Fine-Tuning on AWS SageMaker — When, Why and How 2026

When should you fine-tune an LLM vs just prompting? How do you do it on SageMaker? This guide covers the decision framework and step-by-step fine-tuning with LoRA on AWS.

Fine-tuning is one of the most misused concepts in LLM deployment. Most teams reach for fine-tuning when better prompting would solve the problem. But when you do need it, SageMaker makes it manageable.

Here's when to fine-tune and how to do it.

When to Fine-Tune (And When NOT To)

Don't fine-tune if:

You haven't tried few-shot prompting first
You have fewer than 500 training examples
You need the model to "know" new facts (use RAG instead)
Budget is a constraint (fine-tuning + inference costs add up fast)

Do fine-tune when:

You need consistent output FORMAT (JSON with specific schema, code in specific style)
You need domain-specific TONE (legal language, medical terminology)
Inference latency matters and you want a smaller specialized model
You have 1000+ high-quality examples of input/output pairs
Base model can't perform the task reliably even with good prompts
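
As a rough illustration, the two checklists above can be encoded as a pre-flight check. The function name, its parameters, and the order in which the rules are applied are assumptions made for this sketch; only the 500 and 1000 example thresholds come from the lists above.

```python
def fine_tuning_recommendation(
    tried_few_shot: bool,
    num_examples: int,
    needs_new_facts: bool,
    prompting_is_reliable: bool,
) -> str:
    """Return a rough recommendation based on the checklist (hypothetical helper)."""
    if not tried_few_shot:
        return "try few-shot prompting first"
    if needs_new_facts:
        return "use RAG"
    if prompting_is_reliable:
        return "keep prompting"
    if num_examples < 500:
        return "collect more data"
    if num_examples < 1000:
        return "borderline: aim for 1000+ examples"
    return "fine-tune"


print(fine_tuning_recommendation(True, 1200, False, False))
```

The ordering reflects the article's advice: cheaper alternatives are ruled out before dataset size is even considered.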

Fine-Tuning Methods

Full fine-tuning — Update all model weights. Most powerful, most expensive. Needs A100/H100 GPUs.

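LoRA (Low-Rank Adaptation) — Freeze the base weights and train small low-rank adapter matrices attached to selected layers. Training is often an order of magnitude or more cheaper than full fine-tuning, and on narrow tasks the quality is usually close to it. This is what you'll actually use.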

QLoRA — LoRA + quantization (4-bit). Fits large models on smaller GPUs. Slight quality drop.
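
To make the cost difference concrete, here is a back-of-the-envelope calculation of how many parameters LoRA trains for a single weight matrix, and how much memory the frozen weights occupy at different precisions. The 4096×4096 layer shape and the 7-billion parameter count are illustrative assumptions, and the memory figure covers weights only (no activations, gradients, or optimizer state).

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA learns B (d_out x r) and A (r x d_in) instead of a full d_out x d_in update."""
    return r * (d_in + d_out)


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


full = 4096 * 4096                       # parameters in one square projection matrix
adapter = lora_params(4096, 4096, r=16)  # parameters in its rank-16 adapter
print(f"full: {full:,}  LoRA r=16: {adapter:,}  ratio: {adapter / full:.2%}")

for bits in (16, 4):
    print(f"7B weights at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

For this layer the adapter holds 131,072 parameters, under 1% of the 16,777,216 in the full matrix. Storing 7 billion weights in 4-bit takes about 3.5 GB instead of about 14 GB in 16-bit, which is why QLoRA fits on smaller GPUs.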

SageMaker Fine-Tuning with LoRA

Step 1: Prepare Your Data

python

# Dataset format for instruction fine-tuning
# JSONL file: each line is one training example
 
import json
 
training_data = [
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "List all pods in the production namespace that are not running",
        "output": "kubectl get pods -n production --field-selector=status.phase!=Running"
    },
    {
        "instruction": "Convert this English Kubernetes command to a kubectl command",
        "input": "Show me the logs of the api-server pod from the last hour",
        "output": "kubectl logs api-server --since=1h"
    },
    # ... 1000+ examples
]
 
# Write to JSONL
with open("train.jsonl", "w") as f:
    for item in training_data:
        f.write(json.dumps(item) + "\n")
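
The upload step that follows also expects a val.jsonl file, which the snippet above does not create. One way to produce it is to check each record and hold out a slice for validation. This is a minimal sketch: the 10% split, the fixed seed, and the helper names are assumptions, not part of the original pipeline.

```python
import json
import random

REQUIRED_KEYS = ("instruction", "input", "output")


def is_valid(example: dict) -> bool:
    """Keep only records where every required field is a non-empty string."""
    return all(
        isinstance(example.get(k), str) and bool(example[k].strip())
        for k in REQUIRED_KEYS
    )


def split_and_write(examples, val_fraction=0.1, seed=42,
                    train_path="train.jsonl", val_path="val.jsonl"):
    """Shuffle, hold out a validation slice, and write both files as JSONL."""
    clean = [ex for ex in examples if is_valid(ex)]
    random.Random(seed).shuffle(clean)
    n_val = max(1, int(len(clean) * val_fraction))
    splits = {val_path: clean[:n_val], train_path: clean[n_val:]}
    for path, rows in splits.items():
        with open(path, "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")
    # Return (train count, validation count) for a quick sanity check
    return len(splits[train_path]), len(splits[val_path])
```

Calling split_and_write(training_data) in place of the write loop above produces both files in one pass and silently drops records with missing or empty fields.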

Upload to S3:

bash

aws s3 cp train.jsonl s3://my-training-data/kubectl-assistant/
aws s3 cp val.jsonl s3://my-training-data/kubectl-assistant/

Step 2: SageMaker Training Job

python

# finetune.py
import sagemaker
from sagemaker.huggingface import HuggingFace
 
sess = sagemaker.Session()
role = sagemaker.get_execution_role()
 
# Training hyperparameters
hyperparameters = {
    "model_id": "mistralai/Mistral-7B-Instruct-v0.3",
    "dataset_path": "/opt/ml/input/data/training",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "max_seq_length": 2048,
    
    # LoRA config
    "use_peft": True,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "lora_target_modules": "q_proj,v_proj",
    
    # QLoRA — enable for smaller GPUs
    "load_in_4bit": True,
    "bnb_4bit_use_double_quant": True,
}
 
huggingface_estimator = HuggingFace(
    entry_point="train.py",          # Training script
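    # source_dir is uploaded with the job; a requirements.txt placed in it is
    # installed at startup (useful if the image lacks peft, trl, or bitsandbytes)
    source_dir="./scripts",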
    instance_type="ml.g5.2xlarge",   # A10G GPU
    instance_count=1,
    base_job_name="mistral-kubectl-finetune",
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters=hyperparameters,
    environment={"HF_TOKEN": "your-hf-token"},  # For gated models
)
 
# The "training" channel is mounted at /opt/ml/input/data/training in the container
huggingface_estimator.fit({
    "training": "s3://my-training-data/kubectl-assistant/"
})
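
The batch-size settings above interact: the optimizer only steps after gradients from several small batches have been accumulated. A quick calculation shows roughly how many optimizer steps the job will run, assuming a single GPU (as on ml.g5.2xlarge) and the 1000-example dataset used throughout this guide. The function names are made up for the illustration, and frameworks differ slightly in how they treat a final partial batch.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int = 1) -> int:
    """Number of examples that contribute to each optimizer step."""
    return per_device * grad_accum * num_gpus


def total_optimizer_steps(num_examples: int, epochs: int, batch: int) -> int:
    """Approximate step count; exact when num_examples divides evenly by batch."""
    return math.ceil(num_examples / batch) * epochs


batch = effective_batch_size(per_device=2, grad_accum=4)
steps = total_optimizer_steps(num_examples=1000, epochs=3, batch=batch)
print(f"effective batch size: {batch}, optimizer steps: {steps}")
```

With these settings the effective batch size is 8 and the run takes 375 optimizer steps, so an evaluation or checkpoint interval of 100 steps fires three times.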

Step 3: Training Script

python

# scripts/train.py
import os
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
 
model_id = os.environ.get("SM_HP_MODEL_ID", "mistralai/Mistral-7B-Instruct-v0.3")
output_dir = "/opt/ml/model"
 
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
 
# Load dataset
dataset = load_dataset("json", data_files={
    "train": f"{os.environ['SM_CHANNEL_TRAINING']}/train.jsonl",
    "validation": f"{os.environ['SM_CHANNEL_TRAINING']}/val.jsonl",
})
 
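def format_prompt(example):
    # Mistral-Instruct template. Some trl releases call formatting_func with a
    # batch (dict of lists) and expect a list of strings back, so handle both.
    def render(instruction, user_input, output):
        return f"<s>[INST] {instruction}\n\n{user_input} [/INST] {output}</s>"

    if isinstance(example["instruction"], list):
        return [
            render(i, x, o)
            for i, x, o in zip(example["instruction"], example["input"], example["output"])
        ]
    return render(example["instruction"], example["input"], example["output"])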
 
# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
 
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts. Exact numbers depend on the model,
# rank, and target modules; with LoRA expect well under 1% of weights to be trainable.
 
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    save_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    report_to="none",
)
 
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    args=training_args,
    formatting_func=format_prompt,
    max_seq_length=2048,
)
 
trainer.train()
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
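
The script above hardcodes most values, so the hyperparameters dictionary passed to the estimator has no effect apart from model_id. SageMaker script mode passes each hyperparameter to the entry point as a --name value command-line argument, so one option is to parse them with argparse. The sketch below is an assumption about how train.py could be wired up, not the original script; note that booleans arrive as strings.

```python
import argparse


def str2bool(value: str) -> bool:
    """Hyperparameters arrive as strings, so 'True'/'False' need explicit parsing."""
    return str(value).lower() in ("true", "1", "yes")


def parse_hyperparameters(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", default="mistralai/Mistral-7B-Instruct-v0.3")
    parser.add_argument("--num_train_epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=2)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--max_seq_length", type=int, default=2048)
    parser.add_argument("--lora_r", type=int, default=16)
    parser.add_argument("--lora_alpha", type=int, default=32)
    parser.add_argument("--lora_dropout", type=float, default=0.05)
    parser.add_argument("--lora_target_modules", default="q_proj,v_proj")
    parser.add_argument("--load_in_4bit", type=str2bool, default=True)
    # Ignore any hyperparameters this sketch does not declare
    args, _unknown = parser.parse_known_args(argv)
    args.lora_target_modules = args.lora_target_modules.split(",")
    return args


# Simulate the arguments SageMaker would pass for two overridden hyperparameters
args = parse_hyperparameters(["--lora_r", "8", "--load_in_4bit", "False"])
print(args.lora_r, args.load_in_4bit, args.lora_target_modules)
```

The parsed values can then replace the literals in LoraConfig and TrainingArguments, for example r=args.lora_r and learning_rate=args.learning_rate, so that a change in finetune.py actually reaches the training run.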

Deploy the Fine-Tuned Model

python

# deploy.py
from sagemaker.huggingface import HuggingFaceModel
 
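# Note: trainer.save_model() on a PEFT model writes only the LoRA adapter weights.
# The endpoint needs an artifact it can load directly, so you may have to merge
# the adapter into the base model before packaging it.
model = HuggingFaceModel(
    model_data=huggingface_estimator.model_data,  # S3 path to the training job's model.tar.gz
    role=role,
    transformers_version="4.36",
    pytorch_version="2.1",
    py_version="py310",
    env={"HF_TASK": "text-generation"},  # tells the default handler which pipeline to build
)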
 
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
    endpoint_name="kubectl-assistant"
)

Cost Estimate

Training: ml.g5.2xlarge ($1.21/hour) × 3 hours for 1000 examples = ~$3.63

Inference: ml.g5.xlarge ($1.01/hour) × 720 hours/month = ~$727/month

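For low-traffic workloads, an always-on endpoint is mostly idle cost. SageMaker Serverless Inference bills per request, but it has been CPU-only with a limited memory ceiling, so check whether it can host a model of this size. Asynchronous inference with autoscaling down to zero instances is another option for GPU models.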
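
The arithmetic behind the training and inference figures above is simple enough to script, which makes it easy to compare instance types or usage patterns. The hourly rates are the ones quoted in this section and should be treated as assumptions: they vary by region and over time, so check current SageMaker pricing before budgeting.

```python
def training_cost(hourly_rate: float, hours: float) -> float:
    """One-off cost of a training job."""
    return hourly_rate * hours


def monthly_endpoint_cost(hourly_rate: float, hours_per_day: float = 24, days: int = 30) -> float:
    """Cost of keeping one instance up for the given hours on the given days."""
    return hourly_rate * hours_per_day * days


print(f"Training:        ${training_cost(1.21, 3):.2f}")
print(f"Always-on:       ${monthly_endpoint_cost(1.01):.2f}/month")
print(f"8h/day, 22 days: ${monthly_endpoint_cost(1.01, hours_per_day=8, days=22):.2f}/month")
```

Running the endpoint only during working hours (8 hours on 22 days) brings the monthly bill from about $727 down to about $178, which shows that hosting, not training, dominates the budget.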

Evaluate Before Deploying

python

# Always evaluate against a held-out test set
test_cases = [
    {
        "input": "Delete all pods with label app=nginx",
        "expected": "kubectl delete pods -l app=nginx",
    }
]
 
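instruction = "Convert this English Kubernetes command to a kubectl command"

correct = 0
for case in test_cases:
    # Build the prompt with the same template used in training
    prompt = f"[INST] {instruction}\n\n{case['input']} [/INST]"
    response = predictor.predict({
        "inputs": prompt,
        "parameters": {"max_new_tokens": 64, "return_full_text": False},
    })
    # Assuming the default text-generation handler, which returns a list of dicts
    prediction = response[0]["generated_text"]
    if prediction.strip() == case["expected"].strip():
        correct += 1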
 
accuracy = correct / len(test_cases)
print(f"Test accuracy: {accuracy:.1%}")
# Target: >85% for deployment
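
Exact string comparison is strict: a prediction that differs only in spacing or quoting counts as a failure. A slightly more forgiving check, sketched below, tokenizes both commands with the standard library's shlex before comparing. Treating token-level equality as correct is an assumption made for this example; flag order, for instance, still matters here.

```python
import shlex


def same_command(predicted: str, expected: str) -> bool:
    """Compare two shell commands token by token, ignoring spacing and quote style."""
    try:
        return shlex.split(predicted) == shlex.split(expected)
    except ValueError:
        # Unbalanced quotes in either string: treat as a mismatch
        return False


def accuracy(pairs) -> float:
    """pairs is an iterable of (predicted, expected) strings."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(same_command(p, e) for p, e in pairs) / len(pairs)
```

In the evaluation loop, same_command(prediction, case["expected"]) can stand in for the strip-and-compare test without changing anything else.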

Fine-tuning on SageMaker is straightforward. The hard part is getting good training data — 1000+ high-quality, consistent input/output pairs. That's where most fine-tuning projects succeed or fail.

For managed ML infrastructure, AWS SageMaker handles the GPU management, scaling, and model storage so you focus on data and training logic.
