Glossary of Terms

Common terms in AI evaluation and safety.


Quick Answer

This glossary defines core evaluation terms so teams can align on metrics, failure modes, and workflows.

TL;DR

  • Use these definitions to standardize eval discussions.
  • Map each term to a measurable metric or rubric.
  • Keep the glossary updated as your system evolves.

FAQ

Faithfulness vs. accuracy - what is the difference?

Accuracy measures correctness against a reference answer; faithfulness measures whether the answer is supported by evidence.

What is the RAG triad?

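The RAG triad covers the three parts of a RAG pipeline: Query, Context, and Response. It is usually scored as three pairwise checks: context relevance (context vs. query), groundedness (response vs. context), and answer relevance (response vs. query). Evaluating each separately shows whether a failure comes from retrieval or from generation.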
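
As a rough illustration of evaluating each part separately, the sketch below maps three pairwise scores to the pipeline stage that most likely failed. The class name, field names, stage labels, and the 0.7 threshold are all assumptions made for this example, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass
class TriadScores:
    # All scores are assumed to lie in [0, 1]; higher is better.
    context_relevance: float  # retrieved context vs. query
    groundedness: float       # response vs. retrieved context
    answer_relevance: float   # response vs. query


def diagnose(scores, threshold=0.7):
    """Return the pipeline stages whose score falls below the threshold."""
    problems = []
    if scores.context_relevance < threshold:
        problems.append("retrieval")
    if scores.groundedness < threshold:
        problems.append("generation (unsupported claims)")
    if scores.answer_relevance < threshold:
        problems.append("generation (off-topic answer)")
    return problems
```

For example, a low context-relevance score with high scores elsewhere points at the retriever rather than the model.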

What is drift?

Drift is a shift in data distribution or embedding space that degrades performance over time.

Drift

A change over time that degrades model performance. Data drift is a change in the input data distribution; concept drift is a change in what counts as a correct output for a given input, for example because user expectations have shifted.
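
One simple way to put a number on data drift for a single numeric feature is the Population Stability Index (PSI), which compares binned shares of a reference sample and a current sample. The sketch below is a minimal version under stated assumptions: equal-width bins taken from the reference range, a small floor for empty bins, and a bin count of 4 chosen only for illustration.

```python
import math
from collections import Counter


def psi(reference, current, bins=4):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def shares(sample):
        # Assign each value to a bin, clamping values outside the reference range.
        counts = Counter(
            min(max(int((x - lo) / width), 0), bins - 1) for x in sample
        )
        # A small floor avoids log(0) for empty bins.
        return [max(counts.get(b, 0) / len(sample), 1e-6) for b in range(bins)]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Identical samples give a PSI of 0, and the value grows as the two distributions diverge. The alert threshold is a choice each team has to make for its own data.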

Hallucination

When an LLM generates a confident-sounding answer that is factually incorrect or not grounded in the provided context.

Faithfulness

A metric measuring whether the generated answer is derived only from the retrieved context, without adding external information.
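
To make the definition concrete, here is a deliberately crude lexical proxy: it counts an answer sentence as supported when most of its words also appear in the context. This is only a sketch. The function name and the 0.8 overlap threshold are assumptions, and word overlap misses paraphrase and negation, so real faithfulness checks generally need a model-based comparison of each claim against the context.

```python
import re


def _tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness_score(answer, context, min_overlap=0.8):
    """Fraction of answer sentences whose tokens mostly appear in the context."""
    context_tokens = _tokens(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = _tokens(sentence)
        if words and len(words & context_tokens) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)
```

An answer with one sentence copied from the context and one invented sentence would score 0.5 under this proxy.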

LLM-as-Judge

The practice of using a stronger model (e.g., GPT-4) to evaluate the outputs of a weaker model or application.
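
A minimal sketch of the pattern follows. The prompt wording, the 1 to 5 scale, and the `call_model` parameter are assumptions for illustration: `call_model` stands in for whatever function sends a prompt to your judge model and returns its text reply, so no specific vendor API is implied.

```python
import re

# Hypothetical rubric prompt; adapt the wording and scale to your own rubric.
JUDGE_PROMPT = """You are grading an answer for faithfulness.
Context: {context}
Answer: {answer}
Reply with a single integer from 1 (unsupported) to 5 (fully supported)."""


def judge_answer(call_model, context, answer):
    """Ask a judge model for a 1-5 score; return None if the reply cannot be parsed."""
    reply = call_model(JUDGE_PROMPT.format(context=context, answer=answer))
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```

Passing the model call in as a function keeps the scoring logic testable with a stub, and returning None for unparseable replies lets the caller decide whether to retry or discard the sample.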