Generative AI
Building a Production RAG System
Aug 01, 2024

A RAG system that works in a demo often fails in production. Here is the architecture, chunking strategy, and retrieval framework that closes the gap.


Retrieval-Augmented Generation is now the default pattern for knowledge-intensive GenAI applications. Most teams understand the concept — retrieve relevant context, inject it into a prompt, generate a grounded response. Far fewer have built a system that performs reliably in production. The gap between a convincing demo and a RAG system you can trust at scale is almost entirely about the retrieval side of the pipeline, and most teams underinvest there.

This is what we have learned building production RAG systems across enterprise deployments — the architectural choices, the chunking tradeoffs, and the evaluation framework that determines whether a RAG system is useful or merely plausible.

The Architecture: Two Pipelines, Not One

A RAG system is not a single pipeline — it is two pipelines that share a vector store.

The indexing pipeline runs offline (or on a schedule) and builds the knowledge base:

Document Ingestion → Parsing & Cleaning → Chunking → Embedding → Vector Store

The query pipeline runs at request time and generates responses:

User Query → Query Embedding → Vector Retrieval → Reranking → Context Assembly → LLM Generation

Most teams get the indexing pipeline mostly right and underdesign the query pipeline. The retrieval step — not the generation step — is where production RAG systems live or die. A strong LLM cannot compensate for a retrieval layer that returns the wrong chunks.

The vector store sits at the junction of both pipelines. In production we use Pinecone, Weaviate, or Qdrant depending on the hosting requirements. For teams with an existing PostgreSQL stack, pgvector is a viable starting point, though it tends to hit performance ceilings at the scale of millions of chunks. Each index entry stores the chunk embedding alongside metadata: source document identifier, chunk sequence position, section title, and document type. Metadata filtering at retrieval time is essential for enterprise deployments where a single index holds documents from multiple departments, time periods, or access tiers.
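The shape of a metadata-filtered index can be sketched with an in-memory toy store. This is not any particular vector database's API (Pinecone, Weaviate, and Qdrant each expose their own filter syntax); the field names here are illustrative, but the pattern is the same: restrict the candidate set by metadata first, then rank by vector similarity.

```python
from dataclasses import dataclass, field
import math

@dataclass
class ChunkEntry:
    # One index entry: the chunk embedding plus the metadata described above.
    embedding: list
    text: str
    metadata: dict = field(default_factory=dict)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(index, query_vec, top_k=3, filters=None):
    """Score only the entries whose metadata matches every filter key."""
    candidates = [
        e for e in index
        if not filters or all(e.metadata.get(k) == v for k, v in filters.items())
    ]
    return sorted(candidates, key=lambda e: cosine(query_vec, e.embedding), reverse=True)[:top_k]
```

A query scoped to one department and year would then look like `search(index, q_vec, filters={"department": "hr", "year": 2024})`, which is the mechanism that keeps a multi-tenant index from leaking stale or out-of-scope chunks.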

Chunking: The Decision That Matters Most

If we had to identify the single architectural choice with the greatest impact on RAG system quality, it would be chunking strategy — not model selection, not vector store choice. The way you split documents determines the granularity of what gets retrieved, and that directly constrains what can be generated.

Fixed-size chunking

The simplest approach: split the document every N tokens with a small overlap window. Easy to implement, easy to reason about, and consistently mediocre in production. Fixed-size chunks ignore document structure entirely. A chunk boundary can fall mid-sentence, mid-paragraph, or mid-table, producing fragments that have no meaningful context when retrieved in isolation. We use this only as a baseline to measure against.
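The whole strategy fits in a few lines, which is exactly why it is the baseline. A minimal sketch, operating on a pre-tokenised document:

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list every `size` tokens, carrying `overlap`
    trailing tokens into the start of the next chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]
```

Note that nothing in this function looks at the document: a boundary lands wherever the token count says, which is the structural blindness described above.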

Semantic chunking

Split on natural document boundaries — sentence groups, paragraphs, section headers — rather than at a fixed token count. The resulting chunks are semantically coherent: each chunk contains a complete thought or a set of related thoughts. Retrieval quality improves noticeably because the chunks are meaningful units rather than arbitrary slices. The tradeoff is that chunk size becomes variable, which makes token budget management at generation time slightly more complex.
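A simplified sketch of the idea: split on blank-line paragraph boundaries, then greedily pack whole paragraphs into chunks under a token budget. Production implementations usually also split on section headers and use a real tokeniser; the whitespace word count here is a stand-in.

```python
def semantic_chunks(text, max_tokens=512):
    """Split on paragraph boundaries, then pack whole paragraphs into
    chunks without exceeding max_tokens (approximated by word count)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because chunks are assembled from whole paragraphs, no boundary falls mid-thought — but chunk sizes vary, which is the token-budget complication mentioned above.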

Hierarchical chunking (parent-child)

The approach we use most often in production for long-form document corpora. The strategy: create two sets of chunks from each document — small child chunks for precise retrieval, and larger parent chunks for context-rich generation.

Parent Chunk (512–1024 tokens): full section or multiple paragraphs
    ├── Child Chunk 1 (~128 tokens)
    ├── Child Chunk 2 (~128 tokens)
    └── Child Chunk 3 (~128 tokens)

At query time, the vector index contains only the child chunks. Retrieval operates on the child chunks — small chunks mean precise, focused retrieval. But when assembling context for the generation prompt, we fetch the parent chunk that contains each retrieved child. The LLM receives the full parent context, which avoids the fragmentation problem of small-chunk retrieval while preserving retrieval precision.
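The bookkeeping is the important part: a child-to-parent map built at indexing time, consulted at context-assembly time. A minimal sketch, with word-based splitting standing in for token-based splitting:

```python
def build_parent_child(sections, child_size=128):
    """Index only child chunks; keep a child-to-parent map for context
    assembly. `sections` are parent-sized spans, e.g. document sections."""
    child_index = {}   # child_id -> child text (this is what gets embedded)
    parent_of = {}     # child_id -> parent_id
    parents = {}       # parent_id -> full parent text
    for p_id, section in enumerate(sections):
        tokens = section.split()
        parents[p_id] = section
        for j in range(0, len(tokens), child_size):
            c_id = f"{p_id}:{j // child_size}"
            child_index[c_id] = " ".join(tokens[j:j + child_size])
            parent_of[c_id] = p_id
    return child_index, parent_of, parents

def assemble_context(retrieved_child_ids, parent_of, parents):
    """Swap each retrieved child for its parent, deduplicating parents
    so the same section is not sent to the LLM twice."""
    seen, context = set(), []
    for c_id in retrieved_child_ids:
        p_id = parent_of[c_id]
        if p_id not in seen:
            seen.add(p_id)
            context.append(parents[p_id])
    return context
```

Retrieval runs against `child_index`; the prompt is built from the deduplicated parents, which is what decouples retrieval precision from generation context.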

Chunk size tradeoffs

The 512-token chunk is a reasonable starting point for most document types, but it is a starting point — not a conclusion. The right chunk size depends on your document structure and your query patterns.

Small chunks (64–128 tokens) yield precise retrieval but frequently lack sufficient context for generation. Large chunks (1024–2048 tokens) preserve context but introduce retrieval noise — the vector similarity score is averaged across a longer span, so a chunk that is only partially relevant can rank highly. The parent-child approach resolves this tension by decoupling retrieval granularity from generation context size.

The most reliable way to calibrate chunk size is empirical: build a small golden dataset of representative queries with known relevant passages, then measure retrieval recall at different chunk sizes. The numbers will tell you more than any rule of thumb.
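The calibration loop itself is small. A sketch of recall@k over a golden dataset, where `retrieve_fn` is whatever retrieval pipeline you are testing at a given chunk size:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of annotated relevant passages found in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def average_recall(golden_set, retrieve_fn, k=5):
    """golden_set: list of (query, relevant_ids) pairs.
    retrieve_fn: query -> ranked list of chunk ids."""
    scores = [recall_at_k(retrieve_fn(q), rel, k) for q, rel in golden_set]
    return sum(scores) / len(scores)
```

Run `average_recall` once per candidate chunk size against the same golden dataset, and the comparison is direct.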

Embedding Model Selection

The embedding model determines how the semantic space is constructed. A chunk is only retrievable for a given query to the degree that the model places their meanings near each other in that space.

For general enterprise use, OpenAI’s text-embedding-3-small is the pragmatic choice: competitive retrieval quality, low cost at $0.02 per million tokens, and a 1,536-dimensional output that handles most document types well. text-embedding-3-large offers measurably better performance on complex semantic queries at roughly 5x the cost — worth it for precision-critical applications, less defensible for general-purpose knowledge bases.

Cohere Embed v3 is competitive with OpenAI’s large model on MTEB benchmarks and supports a native 1,024-dimension output with int8 quantisation, which is useful for reducing vector store footprint at scale.

Open-source options have improved substantially. BGE-M3 from BAAI handles multilingual retrieval effectively — useful for Southeast Asian enterprise deployments where documents span English, Bahasa Malaysia, and Mandarin. E5-large-v2 performs well on English-language document corpora and can be self-hosted, which matters for organisations with data sovereignty requirements.

Domain-specific fine-tuning of the embedding model is worth the investment only when: (1) your document corpus contains dense technical vocabulary that general-purpose models represent poorly, and (2) you have at least several thousand high-quality (query, relevant passage) training pairs. In practice, this threshold is reached in specialised domains — medical, legal, or highly technical engineering documentation — but not in typical enterprise knowledge bases.

Retrieval: Dense, Sparse, and Hybrid

The retrieval layer determines which chunks from the index get passed to the generation step. There are three approaches in common use.

Dense retrieval uses vector similarity (cosine or dot product) between the query embedding and chunk embeddings. It handles semantic queries well — “what is the company’s policy on remote work?” retrieves relevant chunks even if they never use the word “policy”. It handles exact-match and keyword queries poorly, because lexical overlap does not drive vector similarity.

Sparse retrieval uses BM25 or similar term-frequency methods. Strong on keyword queries and exact-match lookups. Fails on paraphrased queries where the user’s language does not match the document’s language.

Hybrid search combines both. The standard approach is reciprocal rank fusion (RRF): retrieve the top-k results from both dense and sparse retrieval independently, then merge the ranked lists using RRF scoring. In our production deployments, hybrid search consistently outperforms either approach alone across diverse query sets. It is the default for any general-purpose enterprise knowledge base.
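RRF itself is simple enough to show in full. Each document's fused score is the sum of 1/(k + rank) across the lists it appears in, with rank starting at 1; k = 60 is the constant commonly used in practice.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists from dense and sparse retrieval.
    A document appearing high in both lists accumulates the most score."""
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF operates on ranks rather than raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.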

Weaviate, Qdrant, and Elasticsearch all support hybrid search natively. Pinecone requires combining results from a Pinecone dense index and a separate sparse index, which adds implementation complexity but is manageable.

Reranking: The Second-Stage Filter

Even a well-tuned retrieval layer returns some irrelevant chunks. Reranking is a second-stage model that takes the top-k retrieved chunks and produces a more accurate relevance ranking before context assembly.

The standard approach uses a cross-encoder: a model that takes (query, chunk) pairs and outputs a relevance score. Unlike the bi-encoder used for embedding, the cross-encoder attends to both the query and the chunk together, which produces more accurate relevance judgments — at significantly higher computational cost. This is why reranking is applied only to the top-k retrieved chunks (typically 20–50) rather than the full index.
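The surrounding plumbing is the same regardless of which reranker you use. In this sketch, `score_pair` stands in for the model call (a hosted rerank API or a local cross-encoder); everything else is just scoring the candidate set and truncating.

```python
def rerank(query, chunks, score_pair, top_n=5):
    """Second-stage filter: score each (query, chunk) pair jointly,
    then keep the top_n chunks by relevance score. `score_pair` is a
    placeholder for the cross-encoder call."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]
```

Note that `rerank` is only ever called on the 20–50 retrieved candidates, never the full index — that is what keeps the cross-encoder's per-pair cost tolerable.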

Cohere Rerank is the most commonly used managed reranking service. BGE Reranker is the open-source alternative. Both add 100–300ms of latency to the query pipeline. For precision-critical applications — where returning a single wrong document has real consequences — the latency cost is worth it. For latency-sensitive applications like chat interfaces, evaluate whether the quality improvement justifies the tradeoff.

Evaluating Retrieval Quality

You cannot improve what you cannot measure. This is not a platitude in RAG systems — it is the reason most teams stall. Without a retrieval quality measurement framework, every tuning decision is guesswork.

The RAGAS framework provides a standard set of metrics for RAG evaluation:

  • Context precision: of the retrieved chunks, what fraction are actually relevant to the query?
  • Context recall: of the relevant information that exists in the corpus, what fraction did retrieval find?
  • Faithfulness: does the generated answer reflect the retrieved chunks, or does it introduce information not present in the context?
  • Answer relevance: does the generated answer address the user’s question?

Building a golden dataset for your domain is the prerequisite. A golden dataset consists of representative queries with annotated relevant passages from the corpus. For an enterprise deployment, 150–300 (query, relevant passages) pairs covering the range of query types your system will encounter is a reasonable starting set. RAGAS can generate a synthetic golden dataset from your corpus using an LLM, which gives you a faster starting point — though the synthetically generated queries should be reviewed and supplemented with real queries from domain experts before you rely on the scores for production decisions.

The key insight that most teams internalise too late: retrieval quality is a ceiling on generation quality. If the retrieval layer returns chunks that do not contain the answer, no amount of prompt engineering will produce a correct response. The generation step can only work with what retrieval provides. Fix retrieval before tuning prompts — every time.

Common Production Failures

After deploying and operating RAG systems across multiple enterprise clients, we see the same failure modes repeatedly.

Metadata filtering not implemented. A single vector index often contains documents from multiple sources, departments, or time periods. Without metadata filtering at retrieval time, a query about the 2024 annual leave policy might retrieve chunks from the 2019 version. The fix is straightforward: store document metadata in the vector index and apply filters at query time. The oversight is surprisingly common in systems built by teams focused on the model layer.

Embedding model drift. OpenAI, Cohere, and open-source model maintainers update embedding models. When the embedding model changes, the new model produces vectors in a different space from the old vectors. Chunks embedded with the old model are not meaningfully comparable to query embeddings from the new model. The fix: re-index the entire corpus when the embedding model changes, and treat model updates as a breaking change that requires a migration plan.

Source document changes not reflected in the index. A policy document is updated; the vector index still contains chunks from the old version; queries return outdated information with high confidence. Production RAG systems need a re-indexing trigger that fires when source documents are modified. Document last-modified timestamps are the minimum; a proper content-addressed index that detects hash changes is better.
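The content-addressed approach can be sketched in a few lines: fingerprint each document's current content, compare against the hash recorded at index time, and re-index only what changed.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_needing_reindex(documents, indexed_hashes):
    """documents: {doc_id: current text}.
    indexed_hashes: {doc_id: hash recorded when the doc was last indexed}.
    Returns doc_ids whose content changed or was never indexed."""
    return [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]
```

Unlike last-modified timestamps, a content hash does not fire on no-op saves and does not miss edits made with a back-dated timestamp.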

Retrieval Determines Everything

The generation step gets all the attention. It is the visible face of the system — the text that users read. But the generation step is entirely dependent on what the retrieval layer provides. A RAG system with weak retrieval and strong generation produces confident-sounding answers from incomplete or wrong context. A RAG system with strong retrieval and reasonable generation produces accurate, grounded answers.

The investment goes in retrieval. Chunking strategy, hybrid search, reranking, evaluation metrics, golden datasets — these are the components that determine whether a RAG system is useful in production. The generation prompt is a relatively small lever compared to the retrieval architecture it sits on top of.


Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.