Generative AI
Evaluating LLM Output Quality in Production
Dec 01, 2024


Vibes are not a quality framework. Here is how to define, measure, and monitor LLM output quality in production, from RAGAS to LLM-as-judge.


The most common question we hear from teams with a working prototype is some version of: “how do we know when it’s good enough to ship?” They have built something that works in their testing sessions, that impresses stakeholders in demos, and that the team has grown genuinely excited about. But when pressed on whether it is production-ready, the honest answer is usually “we think so” — which is not a quality framework.

The question of “good enough” cannot be answered until you have defined what “good” means for your specific system, measured it against that definition, and established the threshold below which the system should not ship. This sounds obvious stated plainly, but the majority of GenAI systems we encounter in the pre-production stage have none of these things in place. They have a feeling.

Why Evaluation Is Hard for LLMs

Traditional software has a testable expected output for most inputs. A function that parses a date string either returns the correct date object or it does not. Classification models have ground truth labels. Even recommender systems can be evaluated against held-out interaction data with clear precision and recall metrics.

LLMs do not have a single ground truth. A question about an organisation’s remote work policy might have dozens of valid answers, all equally correct paraphrases of the same facts. “Good” is context-dependent: a concise two-sentence answer is good for a customer service chatbot and inadequate for a compliance documentation assistant. What counts as a correct answer evolves as the use case evolves, as documents are updated, as user expectations shift.

This is not an excuse for avoiding measurement — it is a reason to be thoughtful about what you measure and why. The goal is not to reduce LLM evaluation to a single number. It is to build a measurement framework that is specific enough to be actionable and stable enough to track over time.

The Evaluation Dimensions

The right evaluation dimensions depend on the system type. For RAG systems — the dominant pattern for knowledge-intensive enterprise GenAI — we evaluate across four dimensions:

Faithfulness: Does the generated answer reflect the information in the retrieved context, or does it introduce claims that are not supported by the retrieved chunks? A system with low faithfulness is hallucinating — generating plausible-sounding information that is not grounded in the source documents. This is the most consequential failure mode in knowledge-retrieval applications.

Answer relevance: Does the answer address the user’s actual question? A faithful answer to the wrong question is still a failure. This dimension catches cases where the system retrieves relevant context but then generates an answer that drifts from the query.

Context precision: Of the retrieved chunks included in the prompt, what proportion are actually relevant to the query? Low context precision means the retrieval layer is including noise, which dilutes the signal available to the generation step and increases the risk of the model attending to irrelevant content.

Context recall: Of the information in the corpus that is relevant to the query, what proportion did the retrieval step actually find? Low context recall means the system is missing things — answering questions based on incomplete evidence, which may produce technically faithful but substantively incomplete answers.
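The retrieval-side dimensions reduce to set arithmetic once you have relevance judgments. A minimal sketch, using human-labeled relevant chunks rather than the LLM-based estimation that frameworks like RAGAS perform (the chunk IDs are illustrative):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant to the query."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of the corpus's relevant chunks that retrieval actually found."""
    if not relevant:
        return 1.0  # nothing to find
    return sum(1 for c in set(retrieved) if c in relevant) / len(relevant)

# Toy example: retrieval returns 4 chunks; 3 chunks in the corpus are truly relevant.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c5"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant were found
```

Low precision here means noise in the prompt; low recall means missing evidence. The two fail independently, which is why both are tracked.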

For general text generation tasks — summarisation, drafting, rewriting — the evaluation dimensions shift: correctness (does it accurately represent the source material?), completeness (does it cover the key points?), conciseness (is it appropriately dense for the format?), and format adherence (does it conform to the specified output structure?).

RAGAS: The Standard Framework for RAG Evaluation

RAGAS (Retrieval-Augmented Generation Assessment) is the closest thing the field has to a standard evaluation framework for RAG systems. It provides automated measurement of all four dimensions above, using an LLM as the evaluator.

The workflow is: generate a synthetic test set from your document corpus, run your RAG system against the test set, and score the outputs using RAGAS metrics. RAGAS can generate the test set itself — it uses an LLM to generate plausible questions from your documents along with the ground truth answers and the relevant passages. This significantly reduces the effort of building an evaluation dataset, though the synthetic questions should be reviewed and augmented with real queries before you rely on the scores for production decisions.
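The shape of an evaluation record is the same regardless of framework version. A sketch of how the test-set rows are assembled, using the field names RAGAS conventionally expects (question / answer / contexts / ground_truth; exact names vary across ragas releases, so check your installed version):

```python
def build_eval_record(question: str, answer: str,
                      contexts: list[str], ground_truth: str) -> dict:
    """One evaluation row: the query, the system's answer, the retrieved
    chunk texts, and a reference answer for recall-style metrics."""
    return {
        "question": question,
        "answer": answer,
        "contexts": contexts,
        "ground_truth": ground_truth,
    }

records = [
    build_eval_record(
        question="What is the remote work policy?",
        answer="Employees may work remotely up to three days per week.",
        contexts=["Policy 4.2: staff may work remotely up to 3 days/week."],
        ground_truth="Up to three remote days per week.",
    ),
]

# With ragas installed, records like these would feed the scorer, roughly:
#   from datasets import Dataset
#   from ragas import evaluate
#   from ragas.metrics import (faithfulness, answer_relevancy,
#                              context_precision, context_recall)
#   scores = evaluate(Dataset.from_list(records),
#                     metrics=[faithfulness, answer_relevancy,
#                              context_precision, context_recall])
```

The call in the comments is indicative of the RAGAS API, not a guaranteed signature.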

The tradeoffs between RAGAS metrics are instructive. High faithfulness combined with low context recall produces a system that is accurate but incomplete — it answers faithfully from the chunks it retrieves, but it misses relevant information. High context recall combined with low context precision produces a system that retrieves broadly but noisily — it finds the right information but buries it in irrelevant chunks, increasing generation noise. Understanding which dimension is failing tells you exactly where to focus: retrieval tuning, chunking strategy, or generation prompt.
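That diagnostic logic can be sketched as a simple routing function. The 0.8 threshold and the score-to-action mapping are illustrative assumptions, not RAGAS defaults:

```python
def diagnose(scores: dict[str, float], threshold: float = 0.8) -> list[str]:
    """Map weak evaluation dimensions to the component most likely at fault."""
    actions = []
    if scores.get("context_recall", 1.0) < threshold:
        actions.append("retrieval: widen search or revisit chunking to find missing evidence")
    if scores.get("context_precision", 1.0) < threshold:
        actions.append("retrieval: add reranking or tighten filters to cut noise")
    if scores.get("faithfulness", 1.0) < threshold:
        actions.append("generation prompt: constrain the model to the retrieved context")
    if scores.get("answer_relevancy", 1.0) < threshold:
        actions.append("generation prompt: keep the answer anchored to the query")
    return actions

# Accurate but incomplete: faithful answers, low recall -> fix retrieval coverage.
actions = diagnose({"faithfulness": 0.95, "answer_relevancy": 0.90,
                    "context_precision": 0.85, "context_recall": 0.55})
```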

LLM-as-Judge: Scaling Evaluation

Human evaluation is the ground truth for LLM output quality, but it does not scale. Evaluating 500 outputs per day with human raters requires infrastructure, annotation guidelines, rater training, and ongoing quality control. For many teams, the economics do not work.

LLM-as-judge is the practical middle ground: use a strong, capable model — GPT-4o, Claude Sonnet, or Gemini 1.5 Pro — to evaluate the outputs of the production system. The judge model receives the query, the system’s output, and an evaluation rubric, then returns a score and a brief rationale. At scale, this is orders of magnitude cheaper than human evaluation while producing quality assessments that correlate well with human judgment on well-defined dimensions.
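The judge input is just a rubric plus the artifacts being evaluated. A minimal sketch of prompt assembly; the rubric wording, the JSON output contract, and the faithfulness focus are illustrative choices, and the resulting string would be sent to whichever judge model you use via your provider's chat API:

```python
RUBRIC = """Score the ANSWER for faithfulness to the CONTEXT on a 1-5 scale:
1: contradicts the context or is mostly unsupported
3: broadly supported, but one or more claims lack grounding
5: every claim is directly supported by the context
Return JSON: {"score": <int>, "rationale": "<one sentence>"}"""

def build_judge_prompt(query: str, context: str, output: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    return (f"{RUBRIC}\n\n"
            f"QUERY:\n{query}\n\n"
            f"CONTEXT:\n{context}\n\n"
            f"ANSWER:\n{output}\n")

prompt = build_judge_prompt(
    "What are the branch opening hours?",
    "Branches open 09:00-17:00 on weekdays.",
    "Branches are open 9am to 5pm, Monday to Friday.",
)
```

Returning structured JSON with a rationale, not just a bare score, makes disagreements auditable later.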

The risks are real and must be managed. Model bias: LLMs have systematic preferences — for longer responses (verbosity bias), for responses that resemble their own style, and in some cases for the first option in a comparison pair (positional bias). Calibration drift: the judge model’s standards may shift as it receives different batches of inputs. Overconfidence: LLM judges tend to produce high scores, compressing the useful range of the scale.

Mitigations that we apply in production: provide explicit, calibrated rubrics with behavioural anchors at each score level (not just “1 = bad, 5 = good” but specific descriptions of what a 3 looks like versus a 4). Use multi-model juries — run evaluation through two or three judge models and flag outputs where the judges disagree significantly for human review. Periodically sample outputs that scored above a certain threshold and have a human reviewer verify the scores; this calibrates your trust in the automated evaluation.
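The jury-disagreement mitigation is a one-line check once scores are collected. A sketch, where the one-point spread cutoff on a 1-5 scale is a tunable assumption:

```python
def flag_for_human_review(judge_scores: list[int], max_spread: int = 1) -> bool:
    """Flag an output for human review when judge models disagree by more
    than `max_spread` points on a 1-5 scale."""
    return (max(judge_scores) - min(judge_scores)) > max_spread

agree = flag_for_human_review([4, 4, 5])     # one-point spread: no review needed
disagree = flag_for_human_review([2, 4, 5])  # three-point spread: route to a human
```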

Human Evaluation Protocols

LLM-as-judge handles scale. Human evaluation handles correctness — particularly in domains where domain expertise is required to assess output quality and where the stakes of getting it wrong are high.

Designing an effective human evaluation protocol requires more care than it typically receives. Start with annotation guidelines that define the evaluation task precisely: what the rater is evaluating, what the scoring dimensions are, examples of outputs at each score level, and explicit guidance on edge cases. Vague guidelines produce inconsistent ratings that are not useful for decision-making.

Measure inter-annotator agreement before trusting your evaluation scores. Cohen’s kappa is the standard metric for pairwise agreement between two raters on categorical judgments; Krippendorff’s alpha is more appropriate for more than two raters or continuous scales. A kappa below 0.6 indicates that raters are interpreting the task differently, which means the evaluation itself is unreliable. Fix the guidelines before collecting more ratings.

Sample size matters for statistical significance. If you are evaluating two versions of a prompt or two model configurations and want to detect a 5% difference in quality with 80% statistical power, you need roughly 300–400 output pairs; the exact number depends on the baseline pass rate. Teams that evaluate 20 outputs and conclude that “version A is better” are drawing conclusions from noise.
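The arithmetic behind that figure is a standard two-proportion power calculation. A sketch with normal quantiles hardcoded for a two-sided 5% significance level and 80% power, and assumed (illustrative) pass rates of 92% and 97%:

```python
from math import ceil

def sample_size_two_proportions(p1: float, p2: float) -> int:
    """Per-group sample size to detect pass rate p1 vs p2 with a two-sided
    z-test at alpha = 0.05 and 80% power (quantiles hardcoded for those)."""
    z_alpha = 1.9600  # two-sided 95% normal quantile
    z_beta = 0.8416   # 80% power normal quantile
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# A 5-point quality difference at a high baseline (92% vs 97% pass rate).
n = sample_size_two_proportions(0.92, 0.97)
```

The required sample shrinks as the baseline pass rate rises (the variance term gets smaller), which is why high-quality systems are somewhat cheaper to compare than mediocre ones.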

For the review tooling, we use simple web-based interfaces — often built on Label Studio or a lightweight custom tool — that show the rater the query, the output, and the scoring dimensions side by side. Rater fatigue is real after about 45–60 minutes of continuous evaluation; session length should be managed to maintain rating quality.

Production Monitoring: Evaluation Does Not Stop at Launch

A pre-launch evaluation tells you the system was good enough on the date you shipped it. It does not tell you whether it remains good enough as inputs change, documents are updated, or the foundation model provider silently updates the underlying model.

Continuous production monitoring requires a different set of signals.

Output length distributions are a cheap and sensitive leading indicator of prompt drift or model update: if average response length shifts by 20% over a week without an intentional change, something has changed in the system’s behaviour.

User feedback signals — thumbs up/down in a chat interface, explicit corrections, repeat queries that indicate the first answer was unsatisfactory — provide direct quality signal from the people who actually use the system.

Escalation rate in human-in-the-loop systems is a quality metric: if the share of LLM outputs that human reviewers are overriding increases, the model’s outputs are degrading relative to the human standard.
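The length-distribution signal costs almost nothing to implement. A sketch comparing a baseline window against the current window, with the 20% shift cutoff from above:

```python
def length_drift_alert(baseline_lengths: list[int],
                       current_lengths: list[int],
                       max_shift: float = 0.20) -> bool:
    """Alert when mean output length shifts more than `max_shift` relative
    to the baseline window -- a cheap proxy for prompt drift or a silent
    model update upstream."""
    baseline = sum(baseline_lengths) / len(baseline_lengths)
    current = sum(current_lengths) / len(current_lengths)
    return abs(current - baseline) / baseline > max_shift

stable = length_drift_alert([400, 420, 380], [405, 395, 410])   # ~1% shift
drifted = length_drift_alert([400, 420, 380], [520, 540, 500])  # ~30% shift
```

In practice the windows would be a week of token or character counts per response, not three samples.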

A meaningful production monitoring setup samples a fraction of live outputs for automated LLM-as-judge scoring on a daily schedule, tracks the distributions of key metrics over time, and alerts when scores fall below defined thresholds. This is not a heavy build — it can be implemented with a scheduled job, a few hundred daily samples, and a simple dashboard in your existing observability tooling.
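The scheduled-job half of that setup reduces to: sample, score, compare against a threshold. A sketch where the scores stand in for LLM-as-judge results, and the 3.5 threshold and 200-output sample are illustrative assumptions:

```python
import random

def daily_quality_check(judge_scores: list[float],
                        threshold: float = 3.5,
                        sample_size: int = 200) -> dict:
    """Summarise a random sample of the day's judge scores and raise an
    alert flag when the mean falls below the threshold."""
    sample = random.sample(judge_scores, min(sample_size, len(judge_scores)))
    mean_score = sum(sample) / len(sample)
    return {"mean_score": round(mean_score, 2),
            "alert": mean_score < threshold,
            "n_scored": len(sample)}

# A day's worth of 1-5 judge scores (illustrative), sampled and summarised.
report = daily_quality_check([4.0, 4.5, 3.0, 4.0, 5.0, 4.5] * 50)
```

In production the scores would come from the judge step described earlier, the result would land in a time-series store, and the alert flag would page whoever owns the system.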

Defining “Good Enough”: The Business Question Behind the Technical Measurement

The evaluation framework exists to answer a business question, not a technical one. “Good enough” means different things depending on what the system is doing and what happens when it is wrong.

The frame we use: error rate multiplied by consequence gives the expected risk, which you then compare against the threshold you are willing to accept. A 3% error rate on a customer service FAQ chatbot that answers questions about branch opening hours is a minor inconvenience. A 3% error rate on a system that generates draft credit memos reviewed by junior analysts who rely heavily on the draft is a material risk — those 3% of cases represent real lending decisions where the draft may propagate an error through the review process.
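The frame is literal arithmetic. A sketch with invented numbers for the two examples above; all volumes and per-error costs are illustrative assumptions, not client data:

```python
def expected_risk(error_rate: float, cost_per_error: float,
                  monthly_volume: int) -> float:
    """Expected monthly cost of errors: rate x consequence x volume."""
    return error_rate * cost_per_error * monthly_volume

# Same 3% error rate, very different risk profiles (illustrative figures):
faq_risk = expected_risk(0.03, cost_per_error=2.0, monthly_volume=10_000)
memo_risk = expected_risk(0.03, cost_per_error=5_000.0, monthly_volume=500)
```

The FAQ bot's expected cost is trivial; the credit-memo drafter's is two orders of magnitude larger at a fraction of the volume, which is the whole argument for setting the quality bar per use case rather than globally.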

Define your acceptable error rate before you start measuring. Define what counts as an error — not just “wrong” in the abstract, but specifically what types of failures are consequential and what types are recoverable. Then measure against that definition. This converts the question “is it good enough?” from a judgment call into a calculation.

Ship with a clear measurement baseline. Know your pre-launch RAGAS scores, your human evaluation agreement rates, and your key production monitoring thresholds. When the system degrades — and at some point, it will — you will know because the numbers will tell you. The alternative is discovering the degradation through a user complaint or a production incident, which is a worse place to be making quality judgments.


Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.