Fine-Tuning vs RAG vs Prompt Engineering — When to Use Which
Not every LLM problem needs fine-tuning. Understand when prompt engineering is enough, when to use RAG for knowledge, and when fine-tuning actually makes sense — with real decision criteria.
Teams waste months fine-tuning models for problems that prompt engineering would solve in an afternoon. The wrong choice costs GPU hours, engineering time, and usually doesn't even fix the actual problem. Here's the decision framework.
Quick Decision Tree
```text
Is the model missing factual knowledge?
  → YES: Use RAG (give it the documents)
  → NO ↓
Does it behave wrong (tone, format, style)?
  → YES: Try prompt engineering first
  → STILL WRONG: Consider fine-tuning
Is it too slow or expensive at inference?
  → YES: Fine-tune a smaller model
Do you need it to do something it fundamentally can't do?
  → YES: Fine-tuning won't help either; you need a more capable base model or a different approach
```
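The same questions can be written down as a small function, which is handy for checking that the order of the questions is what you intend. This is only a sketch: the `Problem` fields, the `recommend` name, and the fallback when no branch matches are illustrative assumptions, not something the tree itself defines.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Answers to the decision-tree questions (field names are illustrative)."""
    missing_knowledge: bool = False
    wrong_behavior: bool = False
    prompting_already_tried: bool = False
    too_slow_or_expensive: bool = False
    beyond_model_capability: bool = False


def recommend(p: Problem) -> str:
    """Walk the questions in the same order as the tree and return the first match."""
    if p.missing_knowledge:
        return "rag"
    if p.wrong_behavior:
        # Prompting comes first; fine-tuning is only considered once it has failed.
        return "fine-tuning" if p.prompting_already_tried else "prompt engineering"
    if p.too_slow_or_expensive:
        return "fine-tune a smaller model"
    if p.beyond_model_capability:
        return "different approach"
    # Assumption: the tree does not cover this case, so fall back to the cheapest option.
    return "prompt engineering"
```

For example, `recommend(Problem(wrong_behavior=True))` returns `"prompt engineering"`, and only flips to `"fine-tuning"` once `prompting_already_tried` is set.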
Prompt Engineering
What it is: Crafting the system prompt, user prompt, and few-shot examples to get the behavior you want.
When it works:
- Model knows the information but formats it wrong
- You want consistent tone/persona
- You need structured output (JSON, YAML)
- You want to constrain or expand behaviors
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
terraform_code = open("main.tf").read()  # hypothetical path to the Terraform source under review

# Before: generic response
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Review this Terraform code"}],
)

# After: specific, structured behavior
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system="""You are a Terraform expert reviewer. For every code review:
1. Check security issues (overpermissive IAM, public S3, unencrypted RDS)
2. Check cost issues (overprovisioned instances, missing lifecycle policies)
3. Check correctness (missing dependencies, wrong resource arguments)
Format output as JSON: {"security": [], "cost": [], "correctness": []}
Each issue: {"severity": "critical|warning", "line": N, "issue": "...", "fix": "..."}""",
    messages=[{"role": "user", "content": f"Review:\n```hcl\n{terraform_code}\n```"}],
)
```
Cost: Zero (no training)
Time to implement: Hours
Limitation: Can't add new knowledge. Can't change model weights.
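A prompt that asks for JSON does not guarantee valid JSON, so the calling code should still check the reply. Here is a minimal sketch that validates a reply against the schema described in the reviewer system prompt. The `parse_review` name and its error messages are illustrative and not part of any SDK.

```python
import json

REQUIRED_CATEGORIES = ("security", "cost", "correctness")
REQUIRED_ISSUE_KEYS = {"severity", "line", "issue", "fix"}
ALLOWED_SEVERITIES = {"critical", "warning"}


def parse_review(raw: str) -> dict:
    """Parse the model's reply; raise ValueError if it does not match the expected schema."""
    try:
        review = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(review, dict):
        raise ValueError("top-level value must be a JSON object")
    for category in REQUIRED_CATEGORIES:
        issues = review.get(category)
        if not isinstance(issues, list):
            raise ValueError(f"missing or non-list category: {category}")
        for issue in issues:
            if not isinstance(issue, dict):
                raise ValueError(f"{category} entries must be JSON objects")
            missing = REQUIRED_ISSUE_KEYS - set(issue)
            if missing:
                raise ValueError(f"{category} issue is missing keys: {sorted(missing)}")
            if issue["severity"] not in ALLOWED_SEVERITIES:
                raise ValueError(f"unexpected severity: {issue['severity']!r}")
    return review
```

On a `ValueError`, a common pattern is to retry once with the error message appended to the prompt. That is still prompt engineering, with no training involved.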
RAG (Retrieval-Augmented Generation)
What it is: Retrieve relevant documents at query time and inject them into the context.
When it works:
- Model needs to answer questions about your internal docs, runbooks, codebase
- Information changes frequently (new incidents, updated policies)
- Knowledge is too large to fit in a prompt
- You want citations/sources
When people use it wrong: Trying to use RAG to teach the model HOW to do something (behavior/style) instead of WHAT to know (facts/documents). RAG doesn't change behavior — it changes knowledge.
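Before anything can be retrieved, documents have to be split into chunks small enough to embed and to fit several of them into one prompt. Here is a minimal sketch of fixed-size chunking with overlap. It assumes word-based splitting; `chunk_text` and its default sizes are illustrative, and real pipelines often split on headings or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size words; neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of indexing some text twice.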
```python
# RAG pipeline
import anthropic
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
qdrant = QdrantClient("localhost", port=6333)


def rag_answer(question: str) -> str:
    # Retrieve relevant chunks
    query_vector = encoder.encode(question).tolist()
    # Note: recent qdrant-client releases deprecate search() in favor of query_points()
    results = qdrant.search(
        collection_name="runbooks",
        query_vector=query_vector,
        limit=5,
    )
    context = "\n\n".join(r.payload["text"] for r in results)

    # Generate with context
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system="You are a DevOps assistant. Answer based only on the provided context. "
        "If the answer isn't in the context, say so.",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text
```
Cost: Embedding model + vector DB + extra tokens per query
Time to implement: 1–2 weeks for production setup
Limitation: Quality depends entirely on retrieval quality. Wrong chunk = wrong answer.
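Because a wrong chunk produces a wrong answer, it helps to measure retrieval separately from generation. Here is a minimal sketch of hit rate at k over a small hand-labeled set of questions. The `retrieve` callable is a hypothetical stand-in that returns ranked chunk IDs; in practice it would wrap the vector search.

```python
from typing import Callable, Iterable


def hit_rate_at_k(
    labeled: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions for which at least one relevant chunk ID is in the top k."""
    hits = 0
    total = 0
    for question, relevant_ids in labeled:
        total += 1
        top_k = retrieve(question, k)[:k]
        if relevant_ids.intersection(top_k):
            hits += 1
    return hits / total if total else 0.0
```

If this number is low, changing the generation prompt will not help; the fix is in chunking, embeddings, or the query itself.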
Fine-Tuning
What it is: Training the model on examples of input→output pairs to change its behavior at the weight level.
When it actually makes sense:
- You need a behavior change that prompting can't achieve (specific reasoning style, domain-specific shorthand)
- Reducing token usage matters at scale (fine-tuned models often need shorter prompts)
- You're distilling a larger model into a smaller one
- You have 1,000+ high-quality labeled examples
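The token-usage argument is plain arithmetic, so it is worth running the numbers before committing. Here is a rough sketch that compares the monthly input-token cost of a long prompt on a base model with a shorter prompt on a fine-tuned model. Every value passed in is your own assumption, and output tokens are ignored to keep it short.

```python
from typing import Optional


def monthly_prompt_cost(requests: int, prompt_tokens: int, price_per_million: float) -> float:
    """Input-token cost in dollars for one month of traffic."""
    return requests * prompt_tokens * price_per_million / 1_000_000


def months_to_break_even(
    requests_per_month: int,
    base_prompt_tokens: int,
    tuned_prompt_tokens: int,
    base_price_per_million: float,
    tuned_price_multiplier: float,
    training_cost: float,
) -> Optional[float]:
    """Months until prompt savings repay the one-time training cost; None if they never do."""
    base = monthly_prompt_cost(requests_per_month, base_prompt_tokens, base_price_per_million)
    tuned = monthly_prompt_cost(
        requests_per_month,
        tuned_prompt_tokens,
        base_price_per_million * tuned_price_multiplier,
    )
    savings = base - tuned
    return training_cost / savings if savings > 0 else None
```

With hypothetical numbers (1M requests a month, a 2,000-token prompt cut to 200 tokens, $1 per 1M input tokens, a 2x price multiplier, $40 of training), the savings are $1,600 a month and the training cost is repaid almost immediately. If the prompt only shrinks to 1,500 tokens, the function returns `None`, because the higher per-token price outweighs the savings.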
When fine-tuning is NOT the answer:
- You want the model to "know" your company's internal docs → use RAG
- The base model gives 80% correct answers and you want 100% → engineering problem, not model problem
- You don't have quality training data → fine-tuning on bad data makes a worse model
Fine-tuning cost estimate (rough; provider pricing changes, so check current rates):
- OpenAI fine-tuning: about $8 per 1M training tokens
- 10,000 training examples × 500 tokens each = 5M tokens ≈ $40 for one pass over the data (multiply by the number of epochs)
- Inference on a fine-tuned model: roughly 2x the price of the base model
- Anthropic Claude fine-tuning: contact sales (enterprise feature)
- Open-source (Llama 3, Mistral): use Axolotl or LLaMA-Factory
Open-source fine-tuning with Axolotl:
```yaml
# axolotl config for instruction fine-tuning
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

datasets:
  - path: my_training_data.jsonl
    type: alpaca
output_dir: ./fine-tuned-model
sequence_len: 4096
sample_packing: true

# LoRA config (parameter-efficient fine-tuning)
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
```
Cost: GPU hours ($1–10/hour on AWS) + engineering time
Time to implement: 2–8 weeks including data collection
Limitation: Catastrophic forgetting (model may lose general capabilities). Ongoing maintenance as you improve the model.
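The Axolotl config above points at `my_training_data.jsonl` with `type: alpaca`, meaning one JSON object per line with `instruction`, `input`, and `output` fields. Since bad data makes a worse model, a quick sanity check before spending GPU hours is cheap insurance. This is a minimal sketch: `validate_alpaca_lines` is an illustrative helper, and treating `input` as optional is an assumption based on common Alpaca-style datasets.

```python
import json


def validate_alpaca_lines(lines):
    """Return (valid_records, errors) for an iterable of JSONL lines in Alpaca format."""
    valid, errors = [], []
    seen = set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append((number, "invalid JSON"))
            continue
        if not isinstance(record, dict):
            errors.append((number, "not a JSON object"))
            continue
        instruction = record.get("instruction")
        output = record.get("output")
        if not isinstance(instruction, str) or not instruction.strip():
            errors.append((number, "empty or missing instruction"))
            continue
        if not isinstance(output, str) or not output.strip():
            errors.append((number, "empty or missing output"))
            continue
        key = (instruction, str(record.get("input", "")))
        if key in seen:
            errors.append((number, "duplicate instruction/input pair"))
            continue
        seen.add(key)
        valid.append(record)
    return valid, errors
```

Passing an open file handle works because files iterate line by line. Review the `errors` list by hand before training rather than silently dropping rows.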
Comparison Table
| | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Use for | Behavior, format, style | Knowledge, facts, docs | Behavior at scale, distillation |
| Changes model weights? | No | No | Yes |
| Cost | Free | Low-medium | High |
| Time to production | Hours | Days-weeks | Weeks-months |
| Maintenance | Low | Medium (index updates) | High (retraining) |
| Knowledge freshness | Static (training cutoff) | Real-time | Static (training data) |
| Data required | None | Documents | Labeled examples (1K+) |
| Best for DevOps | Reviewer/formatter behavior | Runbook Q&A, incident help | Custom small models |
The Right Order
- Start with prompt engineering. 80% of problems are solvable here.
- Add RAG when you need the model to use your knowledge base.
- Fine-tune only when prompt engineering and RAG together still leave a specific gap AND you have the data and budget.
Most production AI features at companies are prompt engineering + RAG. Fine-tuning is for specialized cases that can't be solved otherwise. The cost and time of fine-tuning are rarely justified unless you're operating at serious scale or have a genuinely unique behavior requirement.