Quick Answer
This glossary defines core evaluation terms so teams can align on metrics, failure modes, and workflows.
TL;DR
- Use these definitions to standardize eval discussions.
- Map each term to a measurable metric or rubric.
- Keep the glossary updated as your system evolves.
FAQ
Faithfulness vs. accuracy - what is the difference?
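Accuracy measures correctness against a reference answer; faithfulness measures whether the answer is supported by the retrieved context. An answer can be faithful yet inaccurate if the context itself is wrong, and accurate yet unfaithful if the model answers from its own knowledge.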
What is the RAG triad?
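The RAG triad covers the three links between Query, Context, and Response: context relevance (is the retrieved context relevant to the query?), groundedness (is the response supported by the context?), and answer relevance (does the response address the query?). Evaluate each link separately to diagnose whether a failure comes from retrieval or generation.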
What is drift?
Drift is a shift in data distribution or embedding space that degrades performance over time.
Drift
A change over time in the input data distribution (Data Drift) or in the relationship between inputs and expected outputs, such as shifting user expectations (Concept Drift), that degrades model performance.
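One way to make data drift measurable is to compare a reference sample of a numeric feature (for example, query length or an embedding norm) with a recent sample. The sketch below uses the Population Stability Index (PSI), which is one common choice among several and is not prescribed by this glossary. The function name, bin count, and smoothing constant are illustrative assumptions.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,      # assumed bin count, tune for your data
    eps: float = 1e-6,   # assumed floor to avoid log(0) for empty bins
) -> float:
    """Return the PSI between a reference sample and a current sample."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to 1.0 if the range is empty.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = max(0, min(bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref = proportions(reference)
    cur = proportions(current)
    # PSI sums (current - reference) * ln(current / reference) over all bins.
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

A commonly cited rule of thumb treats a PSI above roughly 0.25 as a significant shift, but that threshold is a convention and should be validated against your own data before it is used for alerting.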
Hallucination
When an LLM generates a fluent, confident answer that is factually incorrect or not grounded in the provided context.
Faithfulness
A metric measuring whether the generated answer is derived only from the retrieved context, without adding external information.
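To show how this definition can map to a number, here is a deliberately crude sketch: it splits the answer into sentences and counts a sentence as supported when most of its words also appear in the retrieved context. Production evaluators usually verify claims with an entailment model or an LLM instead; the lexical-overlap heuristic, the function names, and the 0.8 threshold are all assumptions made for illustration.

```python
import re


def _tokens(text: str) -> set:
    """Lowercase alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness_score(answer: str, context: str, min_overlap: float = 0.8) -> float:
    """Fraction of answer sentences whose tokens mostly appear in the context.

    This is a lexical proxy, not a semantic check: it cannot detect
    paraphrased support or a negated claim that reuses context words.
    """
    context_tokens = _tokens(context)
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if tokens and len(tokens & context_tokens) / len(tokens) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

With the context "The Eiffel Tower is in Paris. It was completed in 1889." and the answer "The Eiffel Tower is in Paris. It is 330 metres tall.", the first sentence is fully covered by the context and the second is not, so the score is 0.5.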
LLM-as-Judge
The practice of using a stronger model (e.g., GPT-4) to evaluate the outputs of a weaker model or application.
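A minimal sketch of the pattern follows. It is independent of any provider: the judge model is passed in as a plain callable that takes a prompt string and returns the model's reply, so `call_model`, the prompt wording, and the 1 to 5 scale are hypothetical choices rather than a standard.

```python
import re
from typing import Callable

# Assumed rubric prompt; adapt the criterion and scale to your own rubric.
JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.
Context: {context}
Question: {question}
Answer: {answer}
Reply with a single line in the form: SCORE: <integer from 1 to 5>"""


def judge_answer(
    call_model: Callable[[str], str],  # hypothetical wrapper around your judge model
    question: str,
    context: str,
    answer: str,
) -> int:
    """Ask the judge model for a 1-5 score and parse it from the reply."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    reply = call_model(prompt)
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        # Fail loudly instead of guessing a score from free-form text.
        raise ValueError(f"Could not parse judge reply: {reply!r}")
    return int(match.group(1))
```

Injecting the model as a callable keeps the parsing logic testable with a stub, for example `judge_answer(lambda prompt: "SCORE: 4", "q", "c", "a")`, before any real model is wired in.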