LLMs

Fine-Tuning vs RAG vs Prompting: A Practical Decision Tree for Engineers

May 30, 2026·8 min read

RAGFine-TuningAI EngineeringPrompt Engineering

Every LLM project hits the same fork early: you've got a capable base model, a task to solve, and three very different tools to choose from — prompting, RAG, or fine-tuning.

Pick the wrong one and you spend two sprints rebuilding. Pick the right one and you ship faster with less complexity.

The frustrating part? All three can look similar on simple tasks. The differences surface at the edges — when data changes, when volume scales, when behavior needs to be consistent, or when the model was never trained on your domain knowledge.

This guide walks you through each approach in plain terms and gives you a decision tree you can apply right now.

Key Takeaways

Prompting should always be your first move — it's free, fast, and reversible.

RAG solves knowledge problems: private data, fresh information, large knowledge bases the model hasn't seen.

Fine-tuning solves behavior problems: consistent style, format, domain vocabulary — and it requires hundreds of quality examples to be worth the effort.

The most common mistake is jumping to fine-tuning when better prompting would have worked in an afternoon.

What Are These Three Approaches, Really?

Before the decision tree, let's get the mental models right. Each approach customizes LLM behavior at a different layer.

Prompting means shaping output through instructions, examples, and context — entirely in the input. No training, no infrastructure beyond API calls. You describe what you want, show a few examples if needed, and the model works with it. Zero-shot, few-shot, chain-of-thought, system prompts — these are all prompting strategies.

RAG (Retrieval-Augmented Generation) adds a retrieval step before generation. Instead of relying solely on what the model already knows, you fetch relevant documents from a knowledge base and include them in the prompt at query time. The knowledge lives outside the model — in a vector database, search index, or document store — and gets pulled in as needed.

Fine-tuning actually changes the model. You take a base model and continue training it on your own labeled examples, updating the weights to shift its behavior. It's the difference between giving someone instructions and training them until they no longer need the instructions.

Step 1: Always Start With Prompting

Prompting is the right first move in almost every case. It's fast to iterate, costs nothing beyond inference, and is fully reversible. If it doesn't work, you haven't committed to an architecture.

OpenAI's fine-tuning documentation explicitly recommends exhausting prompt engineering before attempting fine-tuning. The reason is practical: most tasks that feel like they require training can be solved with a clearer system message or a few well-chosen examples.

Prompting works well when:

The task is general — summarization, Q&A, classification, code generation
The correct behavior can be described in words or shown in a handful of examples
The model already has the required knowledge — it just needs better instructions
Requirements are still evolving and you need to iterate quickly

Where prompting hits a wall:

The model needs information it was never trained on (your internal docs, last week's product updates, customer records) — no amount of prompting creates knowledge that isn't there
You need highly consistent output formatting across thousands of calls, and prompting produces occasional drift
The task uses domain-specific vocabulary or logic the base model doesn't understand

A useful heuristic: if you can describe the correct behavior clearly enough for a smart new hire to follow it without worked examples — prompting is probably enough. If you'd need to walk them through hundreds of examples before they "get it" — that's a fine-tuning signal.

Step 2: Add RAG When You Have a Knowledge Problem

RAG is the right answer when the model lacks the knowledge it needs — not because it's a bad model, but because that knowledge exists outside its training data.

The original RAG paper (Lewis et al., Facebook AI Research, NeurIPS 2020) showed that retrieval-augmented models significantly outperform standard language models on knowledge-intensive tasks, especially where factual accuracy matters. The key insight: instead of memorizing facts in model weights, you look them up at inference time.

RAG works well when:

You need to query private, internal, or proprietary data — company docs, customer records, codebases, internal wikis
The information changes frequently — product catalogs, news, pricing, regulatory updates
The knowledge base is too large to fit in a single context window
Source attribution matters — users need to know where an answer came from

A concrete example: You're building an internal support bot over a 400-page technical documentation site. The LLM knows nothing about your product's specific error codes, configuration options, or internal terminology. You don't want to retrain every time the docs change. RAG is the right call — index the docs in a vector store, retrieve the most relevant chunks at query time, and let the model answer from that grounded context.

Where RAG falls short: Retrieval quality determines answer quality. Bad chunking, weak embedding models, or a missing document means bad answers — and the failure is often silent. RAG also adds latency and infrastructure you have to build and maintain. It's not a magic layer on top of prompting; it's a retrieval system with its own failure modes.

Step 3: Fine-Tune Only When Behavior Is the Problem

Fine-tuning gets over-used. It's appealing because it feels permanent — train the behavior in and you're done. But it's expensive to set up, slow to iterate, and brittle when your task or data evolves.

Fine-tuning makes sense when:

Style or format consistency is critical at scale — if you're generating 10,000 structured JSON outputs a day and the model occasionally drifts in format, fine-tuning on well-formatted examples can lock that in more reliably than prompting
Domain-specific language is opaque to the base model — legal, medical, finance, or proprietary technical vocabulary that wasn't in the training data
You've already tried prompting and RAG, and the behavior problem persists — not the knowledge problem
You have enough data. OpenAI's documentation recommends at least 50–100 examples to begin seeing improvement, and several hundred to low thousands for reliable gains in production

What fine-tuning won't fix: A knowledge gap. Fine-tuning teaches behavior, not facts. If the model hallucinates because it doesn't know your data, adding a retrieval layer almost always works better than training more. Fine-tuning also doesn't make the model smarter — it shifts its behavior within its existing capability range.

Teams that jump to fine-tuning without first exhausting prompting and RAG almost always report the same thing: it helped, but not as much as expected, and now they're maintaining a training pipeline, a dataset versioning system, and a custom model checkpoint. That's a real operational burden on top of whatever product you're building.

The Decision Tree

Decision tree for choosing between prompting, RAG, and fine-tuning. Start at the top and follow the branches.

The Three Mistakes Engineers Make Most

1. Fine-tuning when prompting would work.

This is the most expensive mistake. Engineers see inconsistent output and jump to fine-tuning, when a better system prompt or a few well-chosen examples would fix it in an afternoon. Before fine-tuning, try: specifying the exact output structure in the system prompt, adding 3–5 representative few-shot examples, or using chain-of-thought to guide the model's reasoning step-by-step.

2. Building a RAG pipeline when the context window is large enough.

If you're chunking a 40-page document and running embedding-based retrieval over it, ask yourself: would just passing the whole document in context work? With modern models supporting 128K–1M token context windows, the answer is often yes — and it's a fraction of the implementation complexity. RAG earns its place when the knowledge base exceeds context limits, changes frequently, or needs to scale to thousands of documents.

3. Layering all three without evidence that you need each one.

It's tempting to build a pipeline that prompts + retrieves + fine-tunes. Each layer adds latency, cost, and new failure modes. Start minimal. Add a layer only when you have evidence — real test results, not intuition — that the simpler version is insufficient.

A Quick Comparison

	Prompting	RAG	Fine-tuning
Setup time	Minutes	Days–weeks	Days–weeks
Data required	None	Documents / knowledge base	100s–1000s labeled examples
Solves	Instruction + reasoning tasks	Knowledge gaps	Style, format, behavior gaps
Updates easily?	Yes	Yes (add to store)	No (retrain)
Cost	Inference only	Inference + retrieval infra	Training + inference
Try first?	Always	After prompting fails	After both fail

Frequently Asked Questions

Can I combine RAG and fine-tuning?

Yes, and it often makes sense. Fine-tune for consistent style and output format, then add RAG for fresh or private knowledge. These solve different problems, so they don't compete — they layer cleanly. The pattern: fine-tune once for behavior, retrieve always for knowledge. Just don't add both hoping one compensates for the other's weaknesses. Identify what's actually broken first.

How much data do I actually need to fine-tune?

OpenAI's fine-tuning documentation recommends starting with at least 50–100 well-formatted examples and iterating from there. For reliable, consistent gains in production, most teams end up with several hundred to a few thousand examples. Quality beats quantity — 200 clean, representative examples outperform 2,000 noisy ones. If you don't have enough yet, use prompting or RAG in the meantime while you collect data.

When should I fine-tune for private data instead of using RAG?

RAG is almost always the right first choice for private data because updates are immediate — add a document and it's queryable in minutes. Fine-tuning on private data makes sense in narrow situations: the domain vocabulary is so specialized the model can't correctly interpret retrieved content, or the data is too sensitive to include in inference-time prompts. In most cases, RAG handles this more flexibly and at lower cost.

Does prompting still matter if I'm already using RAG or fine-tuning?

Always. Prompting shapes how the model uses retrieved context in a RAG system and how it applies learned behavior in a fine-tuned model. Good prompting makes both approaches work better. Think of it as the instruction layer — it doesn't go away when you add retrieval or training, it just gets company.

Where to Go From Here

The default order is: prompt first, retrieve second, fine-tune last. Most tasks that feel like they need fine-tuning can be solved with a well-engineered prompt and, if needed, a retrieval layer on top.

Before you design your next LLM integration, run through the decision tree once. It takes five minutes and can save weeks of rebuilding.

Sources

Lewis, Patrick, et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. Facebook AI Research.
Liu, Nelson F., et al. "Lost in the Middle: How Language Models Use Long Contexts." Stanford NLP Group, 2023.
OpenAI. "Fine-tuning best practices." OpenAI Platform Documentation, retrieved 2026-05-30.

AI Engineering