Glossary of Terms

Common terms in AI evaluation and safety.

Quick Answer

This glossary defines core evaluation terms so teams can align on metrics, failure modes, and workflows.

Accuracy measures correctness against a reference answer; faithfulness measures whether the answer is supported by evidence.

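The RAG triad spans Query, Context, and Response; evaluate each link separately (context relevance to the query, groundedness of the response in the context, and answer relevance to the query) to diagnose failures.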

Drift is a shift in data distribution or embedding space that degrades performance over time.

Drift

The degradation of model performance over time due to changes in the input data distribution (Data Drift) or in what counts as a correct output, such as shifting user expectations (Concept Drift).
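One way to make the data-drift half of this definition measurable is to compare a reference sample of a numeric feature against a recent sample. The sketch below uses the Population Stability Index (PSI) as one possible statistic; the function name, bin count, and epsilon are illustrative assumptions, not part of this glossary.

```python
import math
from typing import Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples of one numeric feature.

    Returns 0.0 when both samples fill the bins identically; larger
    values mean the current sample has moved away from the reference.
    """
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range (hypothetical choice);
    # fall back to width 1.0 when the reference sample is constant.
    width = (hi - lo) / bins or 1.0

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so out-of-range values land in the edge bins.
            counts[max(0, min(bins - 1, idx))] += 1
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    shifted = [x + 0.5 for x in baseline]
    print(population_stability_index(baseline, baseline))
    print(population_stability_index(baseline, shifted))
```

A commonly cited rule of thumb treats PSI above roughly 0.2 as a meaningful shift, but any threshold should be calibrated per feature. Concept Drift cannot be detected from inputs alone; it usually requires fresh labels or user feedback.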

Hallucination

A confident but factually incorrect LLM answer that is not grounded in the provided context.

Faithfulness

A metric measuring whether the generated answer is derived only from the retrieved context, without adding external information.
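As a rough illustration of this metric, the sketch below scores an answer by the fraction of its sentences whose words mostly appear in the retrieved context. This is only a lexical proxy under stated assumptions: the function name, the sentence splitter, and the 0.8 overlap threshold are hypothetical, and practical evaluators usually verify claims with an entailment model or an LLM judge instead of word overlap.

```python
import re


def _tokens(text: str) -> set:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def faithfulness_score(answer: str, context: str, threshold: float = 0.8) -> float:
    """Fraction of answer sentences whose tokens mostly occur in the context."""
    context_tokens = _tokens(context)
    # Naive sentence split on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        tokens = _tokens(sentence)
        if not tokens:
            continue  # punctuation-only fragment counts as unsupported
        overlap = len(tokens & context_tokens) / len(tokens)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)


if __name__ == "__main__":
    context = "The Eiffel Tower is in Paris. It was completed in 1889."
    answer = "The Eiffel Tower is in Paris. It is 500 meters tall."
    # The second sentence adds information absent from the context.
    print(faithfulness_score(answer, context))
```

Word overlap misses paraphrases and can be fooled by negation, so a score like this is best treated as a cheap first filter rather than a final verdict.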

LLM-as-a-Judge

The practice of using a stronger model (e.g., GPT-4) to evaluate the outputs of a weaker model or application.
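A minimal sketch of this pattern follows. It assumes the judge model is reachable through some function that takes a prompt string and returns the model's text reply; that function (call_judge), the prompt wording, and the 1 to 5 scale are all hypothetical choices for illustration, not a specific vendor API.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real rubrics are usually more detailed.
JUDGE_PROMPT = """You are grading an answer for faithfulness to the context.
Context: {context}
Question: {question}
Answer: {answer}
Reply with a single integer from 1 (unsupported) to 5 (fully supported)."""


def judge_answer(
    question: str,
    context: str,
    answer: str,
    call_judge: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if none can be parsed."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    reply = call_judge(prompt)
    # Take the first standalone digit between 1 and 5 in the reply.
    match = re.search(r"\b[1-5]\b", reply)
    return int(match.group()) if match else None


if __name__ == "__main__":
    # Stub standing in for a real model client (assumption for the demo).
    def fake_judge(prompt: str) -> str:
        return "Score: 4"

    print(judge_answer("Where is the Eiffel Tower?", "It is in Paris.", "Paris.", fake_judge))
```

Passing the model call in as a parameter keeps the harness testable with a stub and independent of any one provider. Judge scores can be inconsistent or biased, so it is worth spot-checking them against human labels before relying on them.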