Quick Answer
RAG systems must be evaluated on both retrieval and generation. The RAG triad (Query, Context, Response) separates the two layers so you don't confuse retrieval errors with model errors.
TL;DR
- Measure context precision and recall.
- Evaluate faithfulness and answer correctness.
- Debug failures by isolating retrieval vs. generation.
FAQ
What is the RAG triad?
It separates evaluation into the user query, retrieved context, and final response so you can find the failing layer.
Which metrics matter most?
Context precision/recall for retrieval and faithfulness for generation are usually the highest signal.
How do I debug a poor answer?
Check whether the correct document was retrieved. If not, fix retrieval; if yes, fix prompting or reasoning.
The "Why" of RAG Evals
Evaluate a RAG system as a single black box, and you'll never know why it failed. Did it miss the document? Or did it have the document but hallucinate anyway?
We break RAG evaluation into the RAG Triad: Query, Context, and Response.
1. Retrieval Metrics (The Search Engine)
Before an LLM even sees your prompt, your retrieval system needs to find the right data. If garbage goes in, garbage comes out.
Context Precision
Definition: What proportion of retrieved chunks are actually relevant?
Why it matters: Low precision floods the LLM context window with noise, increasing cost and "lost in the middle" errors.
relevant_chunks / k_retrieved_chunks
Context Recall
Definition: Did we retrieve all the necessary information to answer the question?
Why it matters: If the legal statute is in Chunk #50 but you only retrieved top-5, the model cannot be correct without hallucinating.
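Both retrieval metrics reduce to simple set arithmetic over chunk IDs. A minimal sketch, using hypothetical chunk-ID lists (the names `retrieved` and `relevant` are illustrative, not from any specific framework):

```python
def context_precision(retrieved_ids, relevant_ids):
    # Fraction of retrieved chunks that are actually relevant.
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(retrieved_ids)

def context_recall(retrieved_ids, relevant_ids):
    # Fraction of the necessary chunks that we actually retrieved.
    hits = len(set(retrieved_ids) & set(relevant_ids))
    return hits / len(relevant_ids)

# Retrieved top-4 chunks; only two of the three needed chunks came back.
retrieved = ["c1", "c7", "c9", "c3"]
relevant = ["c1", "c3", "c5"]
# precision = 2/4 = 0.5, recall = 2/3
```

Note the asymmetric denominators: precision divides by what you fetched, recall by what you needed.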
The "Lost in the Middle" Phenomenon
Research shows LLMs prioritize information at the START and END of their context window. High precision is often more valuable than high recall if it means shorter contexts.
2. Generation Metrics (The Writer)
Once you have the right context, did the model use it correctly?
Faithfulness
This checks for Hallucinations. A faithful answer contains only information found in the retrieved context.
- Evaluation Method: Use an "LLM-as-Judge" to extract claims from the answer and verify each claim against the source chunks.
- Score: 0.0 (Pure Hallucination) to 1.0 (Fully Grounded).
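Once the judge has returned a verdict per claim, the score is just the supported fraction. A minimal sketch of that final aggregation step (the claim extraction and verification themselves are done by the judge model):

```python
def faithfulness_score(claim_verdicts):
    # claim_verdicts: one boolean per claim extracted from the answer,
    # True if the judge found the claim supported by the context.
    if not claim_verdicts:
        return 1.0  # an answer with no factual claims cannot hallucinate
    return sum(claim_verdicts) / len(claim_verdicts)

# 3 of 4 extracted claims were grounded in the retrieved chunks.
score = faithfulness_score([True, True, True, False])  # 0.75
```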
Answer Relevancy
A model can be faithful ("The sky is blue") but irrelevant to the question ("What is the capital of France?"). Relevancy measures the semantic similarity between the Query and the Response.
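One simple proxy for relevancy is cosine similarity between the query and response embeddings. A minimal sketch, assuming you already have embedding vectors from your embedding model of choice:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def answer_relevancy(query_embedding, response_embedding):
    # High similarity means the response is "about" the same thing as the
    # query, regardless of whether it is factually grounded.
    return cosine_similarity(query_embedding, response_embedding)
```

Some frameworks instead generate synthetic questions from the answer and compare those to the original query, which catches evasive answers better; the embedding approach above is the cheapest baseline.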
Implementation Strategy
Do not rely on vibes. Build a GoldenDataset of (Question, Context_Ids, Ground_Truth) tuples.
def evaluate_retrieval(retrieved_ids, expected_ids):
    # Calculate Intersection over Union (IoU) or simple Recall
    intersection = len(set(retrieved_ids) & set(expected_ids))
    recall = intersection / len(expected_ids)
    return recall
def evaluate_generation_faithfulness(answer, context_text):
    # Use a smaller, cheaper model (e.g. GPT-3.5) to verify claims
    prompt = f"""
    Context: {context_text}
    Answer: {answer}
    Identify any claims in the Answer not supported by the Context.
    Return Score (0-1).
    """
    return llm_judge(prompt)  # llm_judge: your LLM-as-Judge client wrapper
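Tying it together, the evaluation harness is just a loop over the golden dataset. A minimal sketch, with a hypothetical dataset entry and a stub retriever (the recall computation mirrors evaluate_retrieval above):

```python
# Hypothetical golden dataset: each entry pairs a question with the chunk
# IDs a perfect retriever would return and a reference answer.
golden_dataset = [
    {"question": "What is the statute of limitations?",
     "expected_ids": ["doc_12", "doc_50"],
     "ground_truth": "Six years from the date of breach."},
]

def run_eval(retriever, dataset):
    # Average retrieval recall across the golden set.
    scores = []
    for ex in dataset:
        retrieved = set(retriever(ex["question"]))
        expected = set(ex["expected_ids"])
        scores.append(len(retrieved & expected) / len(expected))
    return sum(scores) / len(scores)

# A stub retriever that always returns the same chunks, for illustration.
stub = lambda q: ["doc_12", "doc_99"]
# run_eval(stub, golden_dataset) -> 0.5 (found 1 of 2 expected chunks)
```

Run this on every retriever or chunking change; a drop in average recall tells you the failure is upstream of the LLM.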