Fine-Tuning vs RAG vs Prompt Engineering — When to Use Which

Not every LLM problem needs fine-tuning. Understand when prompt engineering is enough, when to use RAG for knowledge, and when fine-tuning actually makes sense — with real decision criteria.

DevOpsBoys · Jun 8, 2026 · 4 min read

Teams waste months fine-tuning models for problems that prompt engineering would solve in an afternoon. The wrong choice costs GPU hours, engineering time, and usually doesn't even fix the actual problem. Here's the decision framework.


Quick Decision Tree

Is the model missing factual knowledge?
  → YES: Use RAG (give it the documents)
  → NO ↓

Does it behave wrong (tone, format, style)?
  → YES: Try prompt engineering first
  → STILL WRONG: Consider fine-tuning

Is it too slow or expensive at inference?
  → YES: Fine-tune a smaller model

Do you need it to do something it fundamentally can't do?
  → YES: Fine-tuning won't help either. You need a different base model or a different approach
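
The tree above can be sketched as a small function. This is only an illustration: the `Problem` fields and the returned labels are hypothetical names, not part of any library, and the fallback to prompt engineering when no question matches is an assumption based on the "start with prompt engineering" advice later in the article.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Answers to the decision-tree questions (field names are illustrative)."""
    missing_knowledge: bool = False      # the model lacks facts it needs
    wrong_behavior: bool = False         # tone, format, or style is off
    prompting_tried: bool = False        # prompt engineering was already attempted
    too_slow_or_expensive: bool = False  # inference latency or cost is the issue
    beyond_capability: bool = False      # the model fundamentally can't do the task


def recommend(p: Problem) -> str:
    """Ask the questions in the same order as the decision tree."""
    if p.missing_knowledge:
        return "RAG"
    if p.wrong_behavior:
        # Prompting comes first; fine-tuning only if it is still wrong afterwards.
        return "consider fine-tuning" if p.prompting_tried else "prompt engineering"
    if p.too_slow_or_expensive:
        return "fine-tune a smaller model"
    if p.beyond_capability:
        return "different approach"
    # Assumed default: nothing matched, so start with the cheapest option.
    return "prompt engineering"
```

For example, `recommend(Problem(wrong_behavior=True))` returns `"prompt engineering"`, and only flips to `"consider fine-tuning"` once `prompting_tried` is set.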

Prompt Engineering

What it is: Crafting the system prompt, user prompt, and few-shot examples to get the behavior you want.

When it works:

  • Model knows the information but formats it wrong
  • You want consistent tone/persona
  • You need structured output (JSON, YAML)
  • You want to constrain or expand behaviors
python
# Assumes the Anthropic Python SDK is installed and ANTHROPIC_API_KEY is set.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
terraform_code = Path("main.tf").read_text()  # hypothetical input file

# Before: generic response
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": "Review this Terraform code"}]
)

# After: specific, structured behavior
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    system="""You are a Terraform expert reviewer. For every code review:
1. Check security issues (overpermissive IAM, public S3, unencrypted RDS)
2. Check cost issues (overprovisioned instances, missing lifecycle policies)
3. Check correctness (missing dependencies, wrong resource arguments)

Format output as JSON: {"security": [], "cost": [], "correctness": []}
Each issue: {"severity": "critical|warning", "line": N, "issue": "...", "fix": "..."}""",
    messages=[{"role": "user", "content": f"Review:\n```hcl\n{terraform_code}\n```"}]
)

Cost: No training cost; you pay only for the extra prompt tokens
Time to implement: Hours
Limitation: Can't add knowledge beyond what fits in the prompt, and doesn't change model weights.
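
A prompt that asks for JSON does not guarantee valid JSON, so structured output is usually worth validating before anything downstream consumes it. The sketch below checks a reply against the schema described in the reviewer prompt above. `parse_review` and the constant names are hypothetical helpers, not part of any SDK, and it assumes the model returned the raw JSON text with no surrounding prose.

```python
import json

REQUIRED_KEYS = ("security", "cost", "correctness")
ISSUE_FIELDS = {"severity", "line", "issue", "fix"}
SEVERITIES = {"critical", "warning"}


def parse_review(raw: str) -> dict:
    """Parse the reviewer's reply and raise ValueError if it breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")

    for key in REQUIRED_KEYS:
        issues = data.get(key)
        if not isinstance(issues, list):
            raise ValueError(f"'{key}' must be a list")
        for issue in issues:
            if not isinstance(issue, dict):
                raise ValueError(f"each entry in '{key}' must be an object")
            missing = ISSUE_FIELDS - set(issue)
            if missing:
                raise ValueError(f"issue in '{key}' is missing {sorted(missing)}")
            if issue["severity"] not in SEVERITIES:
                raise ValueError(f"unknown severity: {issue['severity']!r}")
    return data
```

On a `ValueError`, one option is to retry the request with the error message appended to the prompt; whether that is worth the extra tokens depends on how often the model drifts from the format.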


RAG (Retrieval-Augmented Generation)

What it is: Retrieve relevant documents at query time and inject them into the context.

When it works:

  • Model needs to answer questions about your internal docs, runbooks, codebase
  • Information changes frequently (new incidents, updated policies)
  • Knowledge is too large to fit in a prompt
  • You want citations/sources

When people use it wrong: Trying to use RAG to teach the model HOW to do something (behavior/style) instead of WHAT to know (facts/documents). RAG doesn't change behavior — it changes knowledge.

python
# RAG pipeline
# Assumes a running Qdrant instance with a populated "runbooks" collection whose
# payloads carry a "text" field, and ANTHROPIC_API_KEY set in the environment.
import anthropic
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

client = anthropic.Anthropic()
encoder = SentenceTransformer("BAAI/bge-large-en-v1.5")
qdrant = QdrantClient("localhost", port=6333)


def rag_answer(question: str) -> str:
    # Retrieve relevant chunks
    query_vector = encoder.encode(question).tolist()
    # query_points replaces the deprecated search() in recent qdrant-client releases
    results = qdrant.query_points(
        collection_name="runbooks",
        query=query_vector,
        limit=5,
    ).points
    context = "\n\n".join(r.payload["text"] for r in results)

    # Generate with context
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,  # required by the Messages API
        system="You are a DevOps assistant. Answer based only on the provided context. "
               "If the answer isn't in the context, say so.",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            }
        ],
    )
    return response.content[0].text

Cost: Embedding model + vector DB + extra tokens per query
Time to implement: 1–2 weeks for production setup
Limitation: Quality depends entirely on retrieval quality. Wrong chunk = wrong answer.
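
Because retrieval quality sets the ceiling, it helps to measure it separately from generation. One simple metric is hit rate at k: the fraction of test questions whose known-correct chunk shows up in the top k results. The sketch below is a minimal version. `hit_rate_at_k` is a hypothetical helper, and it assumes you have a small hand-labeled set of (question, expected chunk ID) pairs plus a `retrieve` function that returns ranked chunk IDs.

```python
from typing import Callable, Iterable, Sequence, Tuple


def hit_rate_at_k(
    cases: Iterable[Tuple[str, str]],
    retrieve: Callable[[str], Sequence[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk ID is in the top-k results."""
    hits = 0
    total = 0
    for question, expected_id in cases:
        total += 1
        top_k = list(retrieve(question))[:k]
        if expected_id in top_k:
            hits += 1
    # An empty test set yields 0.0 rather than a division error.
    return hits / total if total else 0.0
```

A low score here points at chunking, embeddings, or indexing rather than at the prompt, which narrows down where to spend debugging time.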


Fine-Tuning

What it is: Training the model on examples of input→output pairs to change its behavior at the weight level.

When it actually makes sense:

  • You need a behavior change that prompting can't achieve (specific reasoning style, domain-specific shorthand)
  • Reducing token usage matters at scale (fine-tuned models often need shorter prompts)
  • You're distilling a larger model into a smaller one
  • You have 1,000+ high-quality labeled examples
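
Whether shorter prompts pay for the training run is a break-even calculation. The sketch below is one rough way to frame it. `breakeven_requests` is a hypothetical helper, every price is an input you supply rather than a real quote, and it counts only prompt tokens (output tokens and engineering time are ignored).

```python
import math
from typing import Optional


def breakeven_requests(
    training_cost_usd: float,
    base_prompt_tokens: int,
    tuned_prompt_tokens: int,
    base_price_per_1m: float,
    tuned_price_per_1m: float,
) -> Optional[int]:
    """Requests needed before prompt-token savings repay the training cost.

    Returns None when the fine-tuned model is not cheaper per request,
    in which case the training cost is never recovered.
    """
    # Cost of one request's prompt, expressed in USD per million requests.
    base_cost = base_prompt_tokens * base_price_per_1m
    tuned_cost = tuned_prompt_tokens * tuned_price_per_1m
    saving = base_cost - tuned_cost
    if saving <= 0:
        return None
    return math.ceil(training_cost_usd * 1_000_000 / saving)
```

If the fine-tuned model is billed at a higher per-token rate, the prompt has to shrink by more than that ratio before there is any saving at all, which is why the function can return `None`.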

When fine-tuning is NOT the answer:

  • You want the model to "know" your company's internal docs → use RAG
  • The base model gives 80% correct answers and you want 100% → engineering problem, not model problem
  • You don't have quality training data → fine-tuning on bad data makes a worse model
python
# Fine-tuning cost estimate (rough; prices vary by model and change often, so check current pricing)
# OpenAI fine-tuning: assume ~$8/1M training tokens
# 10,000 training examples × 500 tokens each = 5M tokens = ~$40 per training epoch
# Inference on a fine-tuned model: often priced higher than the base model (~2x in this estimate)

# Anthropic Claude fine-tuning: contact sales (enterprise feature)
# Open-source (Llama 3, Mistral): use Axolotl or LLaMA-Factory
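
The arithmetic in the estimate above can be wrapped in a small function so it is easy to rerun with your own numbers. `training_cost_usd` is a hypothetical helper. The `epochs` multiplier assumes the provider bills for every pass over the data, so confirm that against the provider's pricing page.

```python
def training_cost_usd(
    num_examples: int,
    avg_tokens_per_example: int,
    price_per_1m_tokens: float,
    epochs: int = 1,
) -> float:
    """Rough training cost: total tokens seen, times the per-million-token price."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens * price_per_1m_tokens / 1_000_000


# The example from the estimate: 10,000 examples × 500 tokens at an assumed $8/1M.
cost = training_cost_usd(10_000, 500, 8.0)
```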

Open-source fine-tuning with Axolotl:

yaml
# axolotl config for instruction fine-tuning
base_model: meta-llama/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
 
datasets:
  - path: my_training_data.jsonl
    type: alpaca
 
output_dir: ./fine-tuned-model
sequence_len: 4096
sample_packing: true
 
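# LoRA config (parameter-efficient fine-tuning)
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true

# Illustrative training hyperparameters (not from the original config; tune for your data and GPU)
micro_batch_size: 2
gradient_accumulation_steps: 4
num_epochs: 3
learning_rate: 0.0002
optimizer: adamw_torch
lr_scheduler: cosine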

Cost: GPU hours ($1–10/hour on AWS) + engineering time
Time to implement: 2–8 weeks including data collection
Limitation: Catastrophic forgetting (the model may lose general capabilities). Ongoing maintenance, since the model must be retrained as your data or the base model changes.
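
The Axolotl config above reads `my_training_data.jsonl` with `type: alpaca`, which expects one JSON object per line with `instruction`, `input`, and `output` fields. Since data quality matters more than volume, a minimal preparation step might drop incomplete and duplicate rows before writing the file. `write_alpaca_jsonl` is a hypothetical helper, and the filtering rules are assumptions about what counts as unusable.

```python
import json
from pathlib import Path
from typing import Iterable


def write_alpaca_jsonl(records: Iterable[dict], path: str) -> int:
    """Write Alpaca-style JSONL, skipping incomplete or duplicate examples.

    Returns the number of examples written.
    """
    seen = set()
    written = 0
    with Path(path).open("w", encoding="utf-8") as f:
        for rec in records:
            instruction = (rec.get("instruction") or "").strip()
            context = (rec.get("input") or "").strip()
            output = (rec.get("output") or "").strip()
            if not instruction or not output:
                continue  # nothing to learn from an empty prompt or answer
            key = (instruction, context)
            if key in seen:
                continue  # keep only the first answer for a repeated prompt
            seen.add(key)
            row = {"instruction": instruction, "input": context, "output": output}
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Comparing the returned count with the size of the raw dataset gives a quick sense of how much of it was usable.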


Comparison Table

|                        | Prompt Engineering          | RAG                        | Fine-Tuning                     |
|------------------------|-----------------------------|----------------------------|---------------------------------|
| Use for                | Behavior, format, style     | Knowledge, facts, docs     | Behavior at scale, distillation |
| Changes model weights? | No                          | No                         | Yes                             |
| Cost                   | Free                        | Low-medium                 | High                            |
| Time to production     | Hours                       | Days-weeks                 | Weeks-months                    |
| Maintenance            | Low                         | Medium (index updates)     | High (retraining)               |
| Knowledge freshness    | Static (training cutoff)    | Real-time                  | Static (training data)          |
| Data required          | None                        | Documents                  | Labeled examples (1K+)          |
| Best for DevOps        | Reviewer/formatter behavior | Runbook Q&A, incident help | Custom small models             |

The Right Order

  1. Start with prompt engineering. Most problems are solvable here.
  2. Add RAG when you need the model to use your knowledge base.
  3. Fine-tune only when both above have a specific gap AND you have the data and budget.

Most production AI features are built on prompt engineering + RAG. Fine-tuning is for specialized cases that can't be solved otherwise. Its cost and time are rarely justified unless you're operating at serious scale or have a genuinely unique behavior requirement.
