Fine-Tuning vs. RAG in 2025: Has Anything Changed?
Context windows are larger and fine-tuning is cheaper. We revisit the fine-tuning vs. RAG tradeoffs in 2025 and find some guidance still holds.
The conventional wisdom in 2024 was clear: use RAG for knowledge retrieval, use fine-tuning for style and format adaptation. The guidance held up reasonably well because it mapped cleanly onto what the two approaches actually do — RAG augments a model’s available knowledge at inference time, fine-tuning adjusts the model’s behaviour and output characteristics at training time. These are different tools solving different problems, and treating them as substitutes was the mistake most commonly made by teams that did not understand either.
By 2025, three developments changed the context enough that the conventional guidance required revisiting: context windows grew substantially, fine-tuning became significantly cheaper and more accessible, and open-source models improved to the point where on-premise deployment became a realistic option for many enterprise use cases. Each of these shifts moved some combination of tradeoffs, but they moved them in different directions — and not all of them in ways that favour abandoning the original guidance.
What Changed: Context Windows
The context window expansions of 2024 and 2025 were significant. Gemini 1.5 Pro shipped with a 1 million token context window. GPT-4o extended to 128K tokens as standard. Claude 3.5 Sonnet handles 200K tokens. For many document-centric enterprise tasks, this changes the calculus on whether you need a retrieval pipeline at all.
With a 128K token context, you can fit roughly 90,000 words of text into a single prompt — the equivalent of a short novel, or several hundred pages of documentation. “In-context learning at scale” becomes viable: rather than building an indexing pipeline and retrieval layer, you stuff the relevant documents directly into the prompt and ask the model to reason over them. No infrastructure to build, no retrieval accuracy to tune, no embedding model to select.
This approach genuinely works for some use cases. If your knowledge base is small enough to fit within the context window, if you need to answer questions that require reasoning across the full document set rather than precise point retrieval, and if the economics of large-context prompts at your expected query volume are acceptable — then direct in-context use of long documents is a legitimate architecture.
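The "small enough to fit" question is a quick back-of-the-envelope check. A minimal sketch, assuming the common rule of thumb that English text averages about four tokens per three words (a real tokeniser gives the exact count), with the reserve size as an illustrative choice:

```python
def estimated_tokens(word_count: int) -> int:
    """Rough token estimate: English averages ~0.75 words per token,
    so tokens ~= words * 4 / 3. A real tokeniser gives the exact count."""
    return (word_count * 4) // 3

def fits_in_context(word_count: int, context_window: int = 128_000,
                    reserve: int = 8_000) -> bool:
    """Check whether a document set plausibly fits in the context window,
    reserving room for the system prompt, the question, and the output."""
    return estimated_tokens(word_count) <= context_window - reserve

# A 90,000-word knowledge base against a 128K-token window:
print(fits_in_context(90_000))   # just fits, with the reserve
print(fits_in_context(200_000))  # does not fit; retrieval needed
```

If the check fails at today's document count, it will fail harder as the knowledge base grows, which is itself an argument for starting with retrieval.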
The economics are where this approach hits its limits. Pricing scales linearly with context length. A 128K-token prompt at GPT-4o pricing ($5 per million input tokens) costs $0.64 per request — before the output tokens. At high query volume, this adds up faster than the infrastructure cost of a vector store and embedding pipeline. Beyond cost, there is the attention quality problem: models on very long contexts reliably underperform on information that appears in the middle of the context window (the “lost in the middle” degradation pattern, documented in multiple evaluations). For retrieving specific facts from large document sets, RAG is still more reliable — and substantially cheaper at scale.
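The cost arithmetic is worth making explicit. A sketch using the figures above ($5 per million input tokens, a 128K-token full-context prompt); the 8K-token RAG-side context and the monthly query volume are illustrative placeholders, not measured numbers:

```python
def prompt_cost(input_tokens: int, price_per_million: float = 5.0) -> float:
    """Input-token cost of a single request, before output tokens."""
    return input_tokens * price_per_million / 1_000_000

# Full-context approach: 128K input tokens per query.
full_context = prompt_cost(128_000)   # $0.64 per request

# RAG approach: retrieve, say, the top 8K tokens instead (illustrative).
rag_context = prompt_cost(8_000)      # $0.04 per request

# At 100,000 queries per month, the gap dwarfs typical vector-store costs:
print(f"full context: ${full_context * 100_000:,.0f}/month")
print(f"rag context:  ${rag_context * 100_000:,.0f}/month")
```

At this illustrative volume the difference is tens of thousands of dollars a month on input tokens alone, which is why the break-even against retrieval infrastructure arrives quickly.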
The context window expansion does not replace RAG for enterprise knowledge bases. It does make it viable to skip RAG for small, bounded document sets where simplicity of architecture is more valuable than optimised retrieval.
What Changed: Fine-Tuning Accessibility
The barrier to fine-tuning dropped meaningfully in 2024 and 2025. OpenAI’s fine-tuning API now covers GPT-4o models. Anthropic introduced fine-tuning access for Claude. On the open-source side, PEFT methods — LoRA and QLoRA in particular — allow effective fine-tuning of large models on consumer-grade hardware. A domain-specific LoRA fine-tune of Llama 3 8B on a focused dataset of several thousand examples can be trained in a few hours on a single A100, at a compute cost of well under a thousand dollars.
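One way to see why LoRA makes this affordable: the number of trainable parameters is tiny relative to the base model. For a weight matrix of shape d × k, a rank-r adapter trains r·(d + k) parameters instead of d·k. A sketch with Llama-3-8B-like shapes (4096 hidden size, 32 layers); adapting all four attention projections at equal width is an illustrative simplification, since real configurations vary:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA adapter on a d x k weight:
    two low-rank factors, A (r x k) and B (d x r), so r*k + d*r."""
    return r * (d + k)

hidden = 4096   # Llama-3-8B hidden size
rank = 16       # a common LoRA rank
layers = 32     # decoder layers in the 8B model

# Adapting the four attention projections (q, k, v, o) per layer.
per_layer = 4 * lora_params(hidden, hidden, rank)
total = layers * per_layer

print(f"trainable LoRA params: {total:,}")          # ~16.8M
print(f"fraction of 8B model:  {total / 8e9:.3%}")  # ~0.2%
```

Training on the order of 0.2% of the weights is what lets a single A100 handle the job in hours rather than requiring a multi-GPU cluster.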
What has not changed is the data requirement. Fine-tuning on low-quality or insufficient data produces a model that has overfit to whatever patterns the training data contained — which typically means confident, fluent responses that are wrong in specific ways that reflect the training set’s biases. The quality of the training data is the constraint that cost reductions do not remove.
For fine-tuning to produce a meaningfully better model than the base model plus prompting, you generally need: a clearly defined task with consistent input-output patterns, high-quality labelled examples (typically 500–5,000 minimum depending on task complexity and domain specificity), and a validation set to measure improvement. If these conditions are met, fine-tuning can produce better format consistency, better adherence to domain-specific output conventions, and better handling of specialised vocabulary than prompting alone. If they are not met, the fine-tuned model is worse than the base model plus a well-crafted system prompt.
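These preconditions can be encoded as a pre-flight check before any training spend. The thresholds below come from the ranges in the text; the record structure ({"input", "output"} pairs) and the minimum validation size are hypothetical assumptions for the sketch:

```python
def ready_to_finetune(examples: list[dict], min_examples: int = 500,
                      validation_fraction: float = 0.1) -> tuple[bool, str]:
    """Pre-flight check before paying for a fine-tune: enough labelled
    examples, consistently structured records, and a validation split
    large enough to measure improvement against the base model."""
    if len(examples) < min_examples:
        return False, f"need at least {min_examples} examples, have {len(examples)}"
    malformed = [i for i, ex in enumerate(examples)
                 if not (ex.get("input") and ex.get("output"))]
    if malformed:
        return False, f"{len(malformed)} examples missing input or output"
    n_val = int(len(examples) * validation_fraction)
    if n_val < 50:
        return False, "validation split too small to measure improvement"
    return True, f"ok: {len(examples) - n_val} train / {n_val} validation"

ok, msg = ready_to_finetune([{"input": "q", "output": "a"}] * 1000)
print(ok, msg)
```

The point of the gate is the last branch: without a held-out set you cannot demonstrate the fine-tune beat the base model plus prompting, which is the claim the investment rests on.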
The reduced cost of fine-tuning means more teams can afford to try it. It does not mean more teams should try it before they have exhausted what is achievable through prompting.
What Changed: Open-Source Model Quality
The open-source model landscape in 2025 is meaningfully different from 2023. Llama 3.1 70B performs comparably to GPT-4 on many standard benchmarks. Mistral Large and Qwen2 72B are competitive with Claude Haiku and GPT-4o mini on a range of enterprise tasks. For tasks that do not require frontier-level reasoning capability — classification, extraction, summarisation of structured data, format-constrained generation — open-source models are production-viable.
This matters most for regulated industries. A fine-tuned open-source model hosted on your own infrastructure eliminates the data-sharing concern that comes with sending sensitive documents to a closed-model API provider. For a hospital system, a financial institution, or a government agency whose documents cannot leave a controlled environment, the ability to run a fine-tuned Llama 3 70B on private infrastructure changes the architecture options available.
In practice, this means the fine-tuning decision for regulated industries now includes a separate axis: closed model via provider API (simpler, higher baseline capability, data leaves your environment) vs. open-source model hosted on-premise (more complex, lower baseline capability on most tasks, full data sovereignty). The right answer depends on the sensitivity of the data, the regulatory requirements in your jurisdiction, and whether the open-source model’s capability is sufficient for your task.
What Has Not Changed
Fine-tuning does not update a model’s knowledge. A base model whose training data ends in September 2024 still does not know what happened in October 2024 after fine-tuning. Fine-tuning teaches the model new patterns of behaviour — tone, format, specialised vocabulary, task-specific reasoning chains — but it does not reliably update the model’s factual knowledge base. A fine-tuned model asked about events after its training cutoff will either refuse (good) or confabulate (bad).
For any application where the knowledge base changes over time — product documentation, regulatory guidance, market data, internal policies — RAG remains the appropriate architecture for the knowledge retrieval component. The model can be fine-tuned to handle the format and output conventions of your use case, but RAG must supply the current knowledge.
This is the error that recurs most often in teams re-examining fine-tuning versus RAG in light of improved fine-tuning accessibility: attempting to use fine-tuning for knowledge injection. It does not work reliably. Fine-tuning for knowledge produces models that appear to have learned facts during training but confabulate plausibly when the fine-tuning examples are not directly applicable to the query. The result looks like hallucination, because it is.
The 2025 Decision Framework
The updated decision framework, incorporating the shifts above:
Use RAG when: the knowledge base is dynamic (changes frequently), the knowledge base is large relative to practical context window economics, retrieval precision matters (you need specific facts rather than general knowledge), or the application involves multiple users querying the same knowledge base at volume.
Use fine-tuning when: format and output consistency are critical and cannot be achieved through prompting alone (a model that must always output in a specific JSON schema with specific field conventions, for instance), the task involves domain-specific vocabulary or reasoning patterns that the base model handles poorly, on-premise deployment is required for data sovereignty, or you have sufficient high-quality labelled data to justify the training cost.
Use both when: you need domain-specific output behaviour (fine-tune) and access to a dynamic knowledge base (RAG). The two approaches are complementary. A fine-tuned model that retrieves context via RAG combines the output quality benefits of fine-tuning with the knowledge currency of retrieval. This is the highest-capability pattern for precision-critical enterprise applications with dynamic knowledge bases.
Use neither when: a well-crafted system prompt against a capable base model achieves acceptable performance. Before investing in either fine-tuning or RAG infrastructure, establish the performance of the base model plus prompting. Many teams that believe they need fine-tuning actually need a better prompt and few-shot examples. The infrastructure investment should be proportional to a measured performance gap, not an assumption.
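The four branches above can be collapsed into a small decision helper. The boolean inputs are the questions the framework asks; the function is only a restatement of the prose, not a substitute for judgment:

```python
def recommend(dynamic_kb: bool, large_kb: bool, precision_retrieval: bool,
              strict_format: bool, domain_vocab: bool, on_prem_required: bool,
              have_labelled_data: bool, prompting_sufficient: bool) -> str:
    """Restate the 2025 decision framework. Returns one of:
    'prompting only', 'rag', 'fine-tune', 'rag + fine-tune'."""
    if prompting_sufficient:
        return "prompting only"   # always establish this baseline first
    needs_rag = dynamic_kb or large_kb or precision_retrieval
    needs_ft = ((strict_format or domain_vocab or on_prem_required)
                and have_labelled_data)
    if needs_rag and needs_ft:
        return "rag + fine-tune"
    if needs_rag:
        return "rag"
    if needs_ft:
        return "fine-tune"
    return "prompting only"       # no measured gap justifies either

# Dynamic policy docs plus a strict JSON schema, with labelled data:
print(recommend(True, False, True, True, False, False, True, False))
```

Note that `have_labelled_data` gates the fine-tuning branch: without it, the framework routes back to prompting regardless of how attractive fine-tuning looks.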
The Headlines Changed More Than the Fundamentals
The 2025 landscape is different enough that teams should revisit their architecture decisions. Context window expansions make some RAG deployments unnecessary. Cheaper fine-tuning makes some behaviour optimisation accessible that previously required a larger investment. Better open-source models make on-premise deployment viable for regulated industries.
But the fundamental characteristic of each approach has not changed. RAG updates what a model knows at inference time. Fine-tuning changes how a model behaves at generation time. The choice between them — or the decision to use both — should be driven by which problem you are solving, not by which technology has received the most recent attention.
RAG remains the right default for most enterprise knowledge retrieval tasks. Fine-tuning remains a targeted tool for specific, measurable optimisation goals when prompting has proven insufficient. The teams that treat these as competing alternatives rather than complementary tools will continue to reach for the wrong one at the wrong time, regardless of how the technology evolves.
Related Reading
- RAG vs. Fine-Tuning vs. Prompt Engineering — The foundational decision framework that this 2025 update builds on and revisits.
- Building a Production RAG System — A practical implementation guide for teams that have confirmed RAG is the right architecture.
- Nematix Generative AI Services — How Nematix helps teams navigate these architecture decisions and implement the right approach in production.
Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.