Measuring RAG Quality That Matters
Retrieval accuracy and answer faithfulness are different problems with different fixes. Measure each to debug the right layer.
RAG quality is two problems wearing one trenchcoat
Measuring RAG quality fails when teams track a single "is the answer good" score and then cannot tell why it dropped. A retrieval-augmented generation system has two distinct stages, finding the right context and writing a faithful answer from it, and each fails in its own way for its own reasons. Conflate them and every regression becomes a guessing game. Separate them and each failure points at the layer that caused it.
The discipline is simple to state: measure retrieval and generation independently, then measure the whole thing end to end. Three lenses, three different fixes, and no more staring at one number wondering which half of the pipeline betrayed you.
First, is retrieval even finding the right chunks?
If the relevant passage never makes it into the context window, no model can answer correctly. It will either refuse or invent. So the first metric set is about retrieval in isolation, judged against a set of questions where you know which document holds the answer.
- Recall at k: of the questions whose answer lives in your corpus, how often does the correct chunk appear in the top k results? Low recall points to a problem in chunking, embeddings, or query rewriting.
- Precision at k: how much of what you retrieved is actually relevant? Low precision floods the prompt with noise and pushes the real answer down or out.
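- Mean reciprocal rank: how high up does the right chunk land? For each question, take 1 divided by the rank of the first correct chunk, then average across questions. Position matters, because models tend to make better use of context that appears early in the prompt.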
When these numbers are weak, do not touch your prompt. The fix lives upstream. Revisit chunk size, try a better embedding model, add a reranker, or rewrite the user's query before searching.
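The three retrieval metrics above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes each question has a known set of correct chunk IDs, and the function names, the `gold` and `results` dictionaries, and the chunk IDs are all hypothetical. Recall at k is computed here as a per-question hit rate, matching the definition given above.

```python
from typing import Dict, List, Set


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any correct chunk appears in the top k results, else 0.0."""
    return 1.0 if any(chunk in relevant for chunk in ranked[:k]) else 0.0


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks (divides by k)."""
    if k <= 0:
        return 0.0
    return sum(chunk in relevant for chunk in ranked[:k]) / k


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first correct chunk, or 0.0 if it was never retrieved."""
    for rank, chunk in enumerate(ranked, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(
    results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> Dict[str, float]:
    """Average each metric over every question in the labelled set."""
    n = len(gold)
    if n == 0:
        raise ValueError("gold set is empty")
    totals = {"recall_at_k": 0.0, "precision_at_k": 0.0, "mrr": 0.0}
    for question, relevant in gold.items():
        ranked = results.get(question, [])  # a missing result counts as a miss
        totals["recall_at_k"] += hit_at_k(ranked, relevant, k)
        totals["precision_at_k"] += precision_at_k(ranked, relevant, k)
        totals["mrr"] += reciprocal_rank(ranked, relevant)
    return {name: value / n for name, value in totals.items()}


# Hypothetical labelled questions and retriever output, for illustration only.
gold = {"q1": {"c3"}, "q2": {"c7"}, "q3": {"c1"}}
results = {
    "q1": ["c3", "c9", "c2"],  # correct chunk ranked first
    "q2": ["c4", "c7", "c8"],  # correct chunk ranked second
    "q3": ["c5", "c6", "c2"],  # correct chunk never retrieved
}
metrics = evaluate_retrieval(results, gold, k=3)
print(metrics)
```

In this toy run the third question is a pure retrieval miss, which is exactly the kind of failure no prompt change could repair.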
Then, is the answer faithful to what was retrieved?
Good retrieval can still produce a bad answer if the model embellishes, contradicts the source, or ignores it. This is faithfulness, and it is separate from correctness. An answer can be factually true in general while being unfaithful to the documents you gave it, which in a grounded system is still a failure, because it means the model is not actually using your data.
Measure it by checking whether each claim in the answer is supported by the retrieved context. An LLM-as-judge works well here: give a scoring model the context, the answer, and a rubric, and ask it to flag unsupported statements. Track the rate of unsupported claims over time. A rising number usually means a prompt that invites speculation, or context that is technically present but too buried for the model to use.
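As a rough sketch of that measurement, the code below splits an answer into claims, asks a judge whether each claim is supported by the context, and reports the unsupported rate. Everything here is an assumption made for illustration: the sentence splitter is naive, the names are hypothetical, and `toy_judge` is a crude word-overlap stand-in so the example runs without a model. In practice the judge would be a call to a scoring model with a rubric.

```python
import re
from typing import Callable, List, Tuple

# A judge takes (context, claim) and returns True if the claim is supported.
Judge = Callable[[str, str], bool]


def split_claims(answer: str) -> List[str]:
    """Naive claim splitter: one claim per sentence."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p.strip() for p in parts if p.strip()]


def unsupported_claims(context: str, answer: str, judge: Judge) -> List[str]:
    """Return the claims in the answer that the judge could not ground."""
    return [c for c in split_claims(answer) if not judge(context, c)]


def unsupported_rate(examples: List[Tuple[str, str]], judge: Judge) -> float:
    """Share of all claims, across (context, answer) pairs, that lack support."""
    total = 0
    unsupported = 0
    for context, answer in examples:
        total += len(split_claims(answer))
        unsupported += len(unsupported_claims(context, answer, judge))
    return unsupported / total if total else 0.0


def toy_judge(context: str, claim: str) -> bool:
    """Stand-in for an LLM judge: every longer word must occur in the context."""
    context_words = set(re.findall(r"[a-z]+", context.lower()))
    claim_words = [w for w in re.findall(r"[a-z]+", claim.lower()) if len(w) > 3]
    return all(w in context_words for w in claim_words)


context = "The warranty covers parts for two years. Labour is covered for one year."
answer = "The warranty covers parts for two years. Shipping is always free."

flagged = unsupported_claims(context, answer, toy_judge)
rate = unsupported_rate([(context, answer)], toy_judge)
print(flagged, rate)
```

Keeping the judge behind a plain callable means the same loop can run with a cheap heuristic in unit tests and a model-backed judge in the real evaluation.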
Finally, end-to-end answer quality
The two diagnostic layers tell you where things break; end-to-end quality tells you whether the system is useful at all. Build a fixed evaluation set of real questions with reference answers, and score the full pipeline against it on every meaningful change.
- Curate questions from real usage, not invented ones, because real questions expose the gaps that matter.
- Score on every meaningful change: a new model, a new prompt, a new chunking strategy.
- Compare against the previous version, not an absolute bar, because what you care about is direction.
Combine an automated judge with periodic human review. Automated scoring catches regressions cheaply and runs on every change, while human review catches the subtle failures judges miss and keeps the judge itself honest. Neither replaces the other.
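Comparing against the previous version is easiest when scores are kept per question rather than as a single average. The sketch below assumes each run produces a score between 0 and 1 per question ID, however that score was obtained. The `compare_runs` helper, the `tolerance` value, and the sample scores are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Comparison(NamedTuple):
    mean_delta: float     # average change across shared questions
    regressed: List[str]  # questions that got worse by more than the tolerance
    improved: List[str]   # questions that got better by more than the tolerance


def compare_runs(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.05,
) -> Comparison:
    """Compare per-question scores of a candidate run against the previous run."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("runs share no questions")
    deltas = {q: candidate[q] - baseline[q] for q in shared}
    return Comparison(
        mean_delta=sum(deltas.values()) / len(shared),
        regressed=[q for q in shared if deltas[q] < -tolerance],
        improved=[q for q in shared if deltas[q] > tolerance],
    )


# Hypothetical scores for the previous and the new pipeline version.
baseline = {"q1": 0.9, "q2": 0.6, "q3": 0.8}
candidate = {"q1": 0.9, "q2": 0.8, "q3": 0.5}

report = compare_runs(baseline, candidate)
print(report)
```

Here the average barely moves, yet one question clearly regressed. Listing regressed questions by ID gives the human reviewers a short, concrete list to inspect.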
Watch the answers users never get
Two related metrics quietly predict trust: how often the system should have said "I don't know" but answered anyway, and how often it refused when the answer was right there in the context. Over-refusal frustrates users and trains them to stop asking. Over-answering erodes trust the first time someone catches a confident fabrication, and that trust does not come back easily.
Both are worth tracking explicitly, because both are invisible in a single quality score. A system can post a respectable average while failing exactly the high-stakes questions where a wrong answer does the most damage, and you would never know unless you measured the refusal behaviour on its own. The fix is usually a calibrated instruction about when to abstain, plus a retrieval confidence signal the generator can lean on: if the top chunks score poorly, the model should be told it is allowed, even expected, to decline rather than stretch thin evidence into a confident-sounding paragraph.
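Both rates, and a simple confidence gate, can be sketched as follows. This assumes each evaluation question is labelled as answerable or not from the retrieved context, and that the retriever exposes a relevance score per chunk where higher is better. The `Outcome` type, the `should_abstain` helper, and the 0.5 threshold are hypothetical and would need calibrating against real data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Outcome:
    answerable: bool  # label: was the answer present in the retrieved context?
    answered: bool    # behaviour: did the system answer (True) or abstain (False)?


def abstention_rates(outcomes: List[Outcome]) -> Tuple[float, float]:
    """Return (over_answer_rate, over_refusal_rate)."""
    answerable = [o for o in outcomes if o.answerable]
    unanswerable = [o for o in outcomes if not o.answerable]
    # Over-answering: answered although the context did not contain the answer.
    over_answer = (
        sum(o.answered for o in unanswerable) / len(unanswerable)
        if unanswerable
        else 0.0
    )
    # Over-refusal: abstained although the answer was in the context.
    over_refusal = (
        sum(not o.answered for o in answerable) / len(answerable)
        if answerable
        else 0.0
    )
    return over_answer, over_refusal


def should_abstain(chunk_scores: List[float], threshold: float = 0.5) -> bool:
    """Signal the generator to decline when even the best chunk scores poorly."""
    return not chunk_scores or max(chunk_scores) < threshold


# Hypothetical evaluation outcomes, for illustration only.
outcomes = [
    Outcome(answerable=True, answered=True),
    Outcome(answerable=True, answered=True),
    Outcome(answerable=True, answered=True),
    Outcome(answerable=True, answered=False),   # over-refusal
    Outcome(answerable=False, answered=True),   # over-answering
    Outcome(answerable=False, answered=False),  # correct abstention
]
over_answer, over_refusal = abstention_rates(outcomes)
print(over_answer, over_refusal)
print(should_abstain([0.21, 0.18, 0.09]))
```

The gate only produces a signal. How that signal is phrased in the prompt, and where the threshold sits, are the parts that need calibration.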
How BSH can help
At BSH Technologies, we build RAG systems with evaluation baked in from day one: separate retrieval and faithfulness metrics, a curated test set drawn from your real questions, and dashboards that tell you which layer moved when quality changes. We have delivered grounded AI search and knowledge assistants for teams who need answers they can defend, not just answers that sound good. If your RAG system feels unpredictable, we can help you measure it properly and fix the layer that actually matters.