Quick Answer
This composite case study shows how a compliance assistant reduced hallucinations by enforcing citations and judge-based penalties.
TL;DR
- Hallucinations were defined as claims unsupported by retrieved policy text.
- Citation requirements and judge penalties raised faithfulness.
- Critical error rate dropped below 1% after six weeks.
FAQ
What counts as a hallucination?
Any claim that cannot be supported by retrieved policy text, especially if it contradicts policy.
How was faithfulness measured?
Claims were mapped to evidence spans and penalized if citations were missing or incorrect.
Which controls reduced risk most?
Mandatory citations, domain filtering, and LLM-judge penalties for uncited claims.
About this case study
- Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
- Data: All numbers are illustrative and anonymized to show evaluation impact.
- System: Legal compliance assistant for HR policy questions.
System Snapshot
- Traffic: ~180k questions/month across 14 policy domains.
- Stack: RAG with policy PDF ingestion, top-5 passages, answer + citations.
- Risk: incorrect advice triggers compliance exposure and audit findings.
- Success criteria: high faithfulness + citations; low critical hallucination rate.
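The "answer + citations" contract in the stack above can be sketched as a validator that rejects any answer whose claims do not cite one of the retrieved passages. The payload shape, field names, and `validate_answer` helper are assumptions for illustration, not the real API:

```python
# Hedged sketch of the answer-plus-citations contract: every claim in the
# structured answer must cite one of the top-5 retrieved passage IDs.
# Payload shape and field names are illustrative assumptions.
def validate_answer(payload: dict, retrieved_ids: set[str]) -> bool:
    """Accept an answer only if it has claims and every claim cites a retrieved passage."""
    claims = payload.get("claims", [])
    if not claims:
        return False
    return all(c.get("citation") in retrieved_ids for c in claims)
```

A gate like this is what makes "answer + citations" enforceable rather than aspirational: uncitable answers never reach the user.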
Definition of Hallucination (and Consequence)
Hallucination was defined as any claim not supported by retrieved policy text. Claims that contradicted policy or cited the wrong statute were labeled critical.
| Severity | Definition | Rate (Initial) | Consequence |
|---|---|---|---|
| Critical | Contradicts policy or fabricates legal requirement | 18% | Audit risk / legal escalation |
| Major | Missing citation or incomplete policy section | 17% | Manual review required |
| Minor | Formatting or ambiguous wording | 10% | Low impact |
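The severity tiers above can be expressed as a small labeling function. The flag names (`contradicts_policy`, `fabricates_requirement`, and so on) are illustrative stand-ins for reviewer labels, not fields from a real system:

```python
from dataclasses import dataclass

# Hypothetical reviewer flags for one answer claim, mirroring the
# severity definitions in the table above.
@dataclass
class ClaimLabel:
    contradicts_policy: bool = False
    fabricates_requirement: bool = False
    missing_citation: bool = False
    incomplete_section: bool = False

def classify_severity(label: ClaimLabel) -> str:
    """Map reviewer flags onto the critical/major/minor tiers."""
    if label.contradicts_policy or label.fabricates_requirement:
        return "critical"
    if label.missing_citation or label.incomplete_section:
        return "major"
    return "minor"
```

Encoding the tiers as code keeps labeling consistent across reviewers and lets the severity rates in the table be recomputed mechanically.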
Error Distribution (Initial)
Most failures were hallucinations; the critical cases clustered in a few HR policy topics (leave policy exceptions, contractor classification, and termination).
Eval Design That Made This Actionable
- 1,200-query golden set balanced by policy domain and severity tier.
- Faithfulness score (Ragas-style) + citation coverage (answer spans must map to sources).
- LLM-as-judge rubric: “Does each claim cite the supporting clause?”
- Automatic escalation if critical hallucination rate exceeds 2% weekly.
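The weekly escalation rule in the last bullet can be sketched as a simple threshold check over labeled traffic. Function names and the hard-coded 2% default are illustrative:

```python
# Sketch of the weekly escalation gate: flag the system when the critical
# hallucination rate on labeled traffic exceeds the 2% threshold.
def critical_rate(severities: list[str]) -> float:
    """Fraction of labeled claims rated 'critical'."""
    if not severities:
        return 0.0
    return severities.count("critical") / len(severities)

def should_escalate(severities: list[str], threshold: float = 0.02) -> bool:
    """True when the critical rate breaches the escalation threshold."""
    return critical_rate(severities) > threshold
```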
Measurement Methodology (How This Would Be Measured)
- Golden set balanced across policy domains; reviewers label evidence spans and severity.
- Faithfulness computed as claim-to-evidence overlap; citation coverage computed as cited spans / total claims.
- LLM-as-judge rubric used for consistency checks; disagreements sampled for human audit.
- Operational impact from audit logs: escalations, review time, and user trust survey scores.
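The two core metrics above can be sketched as follows, assuming claims are dicts with a `"text"` field and an optional `"citation"` key. A real pipeline would score aligned evidence spans; bag-of-words overlap is used here only to make the computation concrete:

```python
# Minimal sketch of faithfulness (claim-to-evidence overlap) and
# citation coverage (cited spans / total claims). Data shapes are assumptions.
def token_overlap(claim_text: str, evidence: str) -> float:
    """Fraction of claim tokens found in an evidence passage."""
    claim_tokens = set(claim_text.lower().split())
    if not claim_tokens:
        return 0.0
    evidence_tokens = set(evidence.lower().split())
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)

def faithfulness(claims: list[dict], passages: list[str]) -> float:
    """Mean best-overlap of each claim against the retrieved passages."""
    if not claims:
        return 0.0
    return sum(
        max(token_overlap(c["text"], p) for p in passages) for c in claims
    ) / len(claims)

def citation_coverage(claims: list[dict]) -> float:
    """Cited claims divided by total claims."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.get("citation")) / len(claims)
```

With both metrics computed from the same labeled claims, a weekly report can show faithfulness and coverage side by side, as in the results table below.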
Operational Dashboard (Before Fix)
Illustrative dashboard (synthetic data).
Interventions (and Their Measured Impact)
| Change | Why | Measured Impact |
|---|---|---|
| Mandatory citations in answer template | Force evidence grounding | +52 pts citation coverage |
| Passage filtering by policy domain | Reduce cross-policy confusion | -9 pts critical hallucinations |
| Answer length cap + “I don’t know” policy | Prevent speculative wording | -6 pts hallucinations |
| LLM-as-judge penalty for uncited claims | Enforce evidence consistency | +0.18 faithfulness |
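The last intervention, a judge-side penalty for uncited claims, can be sketched as a deduction applied to the raw judge score. The rubric string and the per-claim penalty weight (0.1) are illustrative assumptions:

```python
# Sketch of the uncited-claim penalty applied on top of a raw LLM-judge score.
# Rubric wording and the per-claim penalty weight are illustrative.
JUDGE_RUBRIC = "Does each claim cite the supporting clause? Label each claim cited/uncited."

def penalized_faithfulness(raw_score: float, uncited_claims: int,
                           penalty: float = 0.1) -> float:
    """Subtract a fixed penalty per uncited claim, floored at zero."""
    return max(0.0, raw_score - penalty * uncited_claims)
```

Tying the penalty to the same rubric the judge scores against keeps the incentive simple: an answer cannot score well unless its claims are cited.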
Results After 6 Weeks
| Metric | Before | After |
|---|---|---|
| Hallucination Rate | 45% | 2% |
| Critical Hallucinations | 18% | 0.6% |
| Faithfulness | 0.61 | 0.91 |
| Citation Coverage | 12% | 94% |
| Escalation Rate | 6.2% | 1.1% |
“Reduce hallucinations” became a measurable, auditable objective. Evals turned a vague quality goal into concrete controls with clear business impact.