RAG Evaluation Strategies

Deconstruct your pipeline into Retrieval and Generation validation.


Quick Answer

RAG systems must be evaluated on both retrieval and generation. The RAG triad (Query, Context, Response) prevents confusing retrieval errors with model errors.

TL;DR

  • Measure context precision and recall.
  • Evaluate faithfulness and answer correctness.
  • Debug failures by isolating retrieval vs. generation.

FAQ

What is the RAG triad?

It separates evaluation into the user query, retrieved context, and final response so you can find the failing layer.

Which metrics matter most?

Context precision/recall for retrieval and faithfulness for generation are usually the highest signal.

How do I debug a poor answer?

Check whether the correct document was retrieved. If not, fix retrieval; if yes, fix prompting or reasoning.

The "Why" of RAG Evals

Evaluate a RAG system as a single black box, and you'll never know why it failed. Did it miss the document? Or did it have the document but hallucinate anyway?

We break RAG evaluation into the RAG Triad: Query, Context, and Response.

[Pipeline diagram: User Query → Retriever → Context → LLM (Reason + Answer) → Final Answer]

1. Retrieval Metrics (The Search Engine)

Before an LLM even sees your prompt, your retrieval system needs to find the right data. If garbage goes in, garbage comes out.

Context Precision

Definition: What proportion of retrieved chunks are actually relevant?

Why it matters: Low precision floods the LLM context window with noise, increasing cost and "lost in the middle" errors.

relevant_chunks / k_retrieved_chunks
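As a minimal sketch, the precision formula above can be computed directly over chunk IDs (the function name and IDs are illustrative, not from a specific library):

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant (precision@k)."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for cid in retrieved_ids if cid in relevant)
    return hits / len(retrieved_ids)

# 3 of the 5 retrieved chunks are relevant -> precision = 0.6
print(context_precision(["a", "b", "c", "d", "e"], ["a", "c", "e", "z"]))
```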

Context Recall

Definition: Did we retrieve all the necessary information to answer the question?

Why it matters: If the legal statute is in Chunk #50 but you only retrieved top-5, the model cannot be correct without hallucinating.
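Recall is the mirror image: of all the chunks needed to answer, how many made it into the top-k? A sketch (names are illustrative):

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the necessary chunks that appear in the retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    found = len(set(retrieved_ids) & relevant)
    return found / len(relevant)

# The statute chunk "doc-50" was never retrieved -> recall = 2/3
print(context_recall(["doc-1", "doc-2", "doc-3"], ["doc-1", "doc-2", "doc-50"]))
```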

The "Lost in the Middle" Phenomenon

Research shows LLMs prioritize information at the START and END of their context window. High precision is often more valuable than high recall if it means shorter contexts.

2. Generation Metrics (The Writer)

Once you have the right context, did the model use it correctly?

Faithfulness

This checks for Hallucinations. A faithful answer contains only information found in the retrieved context.

  • Evaluation Method: Use an "LLM-as-Judge" to extract claims from the answer and verify each claim against the source chunks.
  • Score: 0.0 (Pure Hallucination) to 1.0 (Fully Grounded).

Answer Relevancy

A model can be faithful ("The sky is blue") but irrelevant to the question ("What is the capital of France?"). Relevancy measures the semantic similarity between the Query and the Response.
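To make the idea concrete, here is a toy relevancy check using bag-of-words cosine similarity; in production you would swap the token overlap for a sentence-embedding model (everything below is a hypothetical sketch):

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine similarity over bag-of-words vectors (a stand-in for embeddings)."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer_relevancy(query, response):
    # Toy tokenization; replace with an embedding model for real use.
    return cosine_similarity(query.lower().split(), response.lower().split())

faithful_but_irrelevant = answer_relevancy(
    "What is the capital of France?", "The sky is blue")
on_topic = answer_relevancy(
    "What is the capital of France?", "The capital of France is Paris")
print(faithful_but_irrelevant < on_topic)  # True
```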

Implementation Strategy

Do not rely on vibes. Build a GoldenDataset of (Question, Context_Ids, Ground_Truth) tuples.
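One way to structure such a dataset is a plain dataclass per example; the field names and the sample row below are illustrative, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """One row of the golden dataset. Field names are illustrative."""
    question: str
    context_ids: list     # chunk IDs the retriever is expected to surface
    ground_truth: str     # reference answer for correctness checks

golden = [
    GoldenExample(
        question="What is the refund window?",          # toy example
        context_ids=["policy-refunds-01"],
        ground_truth="30 days from the purchase date.",
    ),
]
```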

rag_eval_example.py
def evaluate_retrieval(retrieved_ids, expected_ids):
    # Simple recall: fraction of expected chunks that were retrieved
    intersection = len(set(retrieved_ids) & set(expected_ids))
    recall = intersection / len(expected_ids)
    return recall

def evaluate_generation_faithfulness(answer, context_text):
    # Use a smaller, cheaper model (e.g. GPT-3.5) to verify claims
    prompt = f"""
    Context: {context_text}
    Answer: {answer}

    Identify any claims in the Answer not supported by the Context.
    Return a faithfulness score between 0 and 1.
    """
    return llm_judge(prompt)  # llm_judge: your wrapper around the judge model
