Quick Answer
This composite case study shows how a compliance assistant reduced hallucinations by enforcing citations and judge-based penalties.
TL;DR
- Hallucinations were defined as claims unsupported by retrieved policy text.
- Citation requirements and judge penalties raised faithfulness.
- Critical error rate dropped below 1% after six weeks.
FAQ
What counts as a hallucination?
Any claim that cannot be supported by retrieved policy text, especially if it contradicts policy.
How was faithfulness measured?
Claims were mapped to evidence spans and penalized if citations were missing or incorrect.
Which controls reduced risk most?
Mandatory citations, domain filtering, and LLM-judge penalties for uncited claims.
About this case study
- Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
- Data: All numbers are illustrative and anonymized to show evaluation impact.
- System: Legal compliance assistant for HR policy questions.
System Snapshot
- Traffic: ~180k questions/month across 14 policy domains.
- Stack: RAG with policy PDF ingestion, top-5 passages, answer + citations.
- Risk: incorrect advice triggers compliance exposure and audit findings.
- Success criteria: high faithfulness + citations; low critical hallucination rate.
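The "answer + citations" contract in the stack above can be sketched as a validator that rejects any answer whose claims do not cite one of the retrieved passages. The payload shape, field names, and `validate_answer` helper are assumptions for illustration, not the real API:

```python
# Hedged sketch of the answer-plus-citations contract: every claim in the
# structured answer must cite one of the top-5 retrieved passage IDs.
# Payload shape and field names are illustrative assumptions.
def validate_answer(payload: dict, retrieved_ids: set[str]) -> bool:
    """Accept an answer only if it has claims and every claim cites a retrieved passage."""
    claims = payload.get("claims", [])
    if not claims:
        return False
    return all(c.get("citation") in retrieved_ids for c in claims)
```

A gate like this is what makes "answer + citations" enforceable rather than aspirational: uncitable answers never reach the user.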
Definition of Hallucination (and Consequence)
Hallucination was defined as any claim not supported by retrieved policy text. Claims that contradicted policy or cited the wrong statute were labeled critical.
| Severity | Definition | Rate (Initial) | Consequence |
|---|---|---|---|
| Critical | Contradicts policy or fabricates legal requirement | 18% | Audit risk / legal escalation |
| Major | Missing citation or incomplete policy section | 17% | Manual review required |
| Minor | Formatting or ambiguous wording | 10% | Low impact |
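The severity tiers above can be expressed as a small labeling function. The flag names (`contradicts_policy`, `fabricates_requirement`, and so on) are illustrative stand-ins for reviewer labels, not fields from a real system:

```python
from dataclasses import dataclass

# Hypothetical reviewer flags for one answer claim, mirroring the
# severity definitions in the table above.
@dataclass
class ClaimLabel:
    contradicts_policy: bool = False
    fabricates_requirement: bool = False
    missing_citation: bool = False
    incomplete_section: bool = False

def classify_severity(label: ClaimLabel) -> str:
    """Map reviewer flags onto the critical/major/minor tiers."""
    if label.contradicts_policy or label.fabricates_requirement:
        return "critical"
    if label.missing_citation or label.incomplete_section:
        return "major"
    return "minor"
```

Encoding the tiers as code keeps labeling consistent across reviewers and lets the severity rates in the table be recomputed mechanically.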
Error Distribution (Initial)
Most failures were hallucinations; the critical cases clustered in a few HR policy topics (leave policy exceptions, contractor classification, and termination).
Eval Design That Made This Actionable
- 1,200-query golden set balanced by policy domain and severity tier.
- Faithfulness score (Ragas-style) + citation coverage (answer spans must map to sources).
- LLM-as-judge rubric: “Does each claim cite the supporting clause?”
- Automatic escalation if critical hallucination rate exceeds 2% weekly.
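The weekly escalation rule in the last bullet can be sketched as a simple threshold check over labeled traffic. Function names and the hard-coded 2% default are illustrative:

```python
# Sketch of the weekly escalation gate: flag the system when the critical
# hallucination rate on labeled traffic exceeds the 2% threshold.
def critical_rate(severities: list[str]) -> float:
    """Fraction of labeled claims rated 'critical'."""
    if not severities:
        return 0.0
    return severities.count("critical") / len(severities)

def should_escalate(severities: list[str], threshold: float = 0.02) -> bool:
    """True when the critical rate breaches the escalation threshold."""
    return critical_rate(severities) > threshold
```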
Measurement Methodology (How This Would Be Measured)
- Golden set balanced across policy domains; reviewers label evidence spans and severity.
- Faithfulness computed as claim-to-evidence overlap; citation coverage computed as cited spans / total claims.
- LLM-as-judge rubric used for consistency checks; disagreements sampled for human audit.
- Operational impact from audit logs: escalations, review time, and user trust survey scores.
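The two core metrics above can be sketched as follows, assuming claims are dicts with a `"text"` field and an optional `"citation"` key. A real pipeline would score aligned evidence spans; bag-of-words overlap is used here only to make the computation concrete:

```python
# Minimal sketch of faithfulness (claim-to-evidence overlap) and
# citation coverage (cited spans / total claims). Data shapes are assumptions.
def token_overlap(claim_text: str, evidence: str) -> float:
    """Fraction of claim tokens found in an evidence passage."""
    claim_tokens = set(claim_text.lower().split())
    if not claim_tokens:
        return 0.0
    evidence_tokens = set(evidence.lower().split())
    return len(claim_tokens & evidence_tokens) / len(claim_tokens)

def faithfulness(claims: list[dict], passages: list[str]) -> float:
    """Mean best-overlap of each claim against the retrieved passages."""
    if not claims:
        return 0.0
    return sum(
        max(token_overlap(c["text"], p) for p in passages) for c in claims
    ) / len(claims)

def citation_coverage(claims: list[dict]) -> float:
    """Cited claims divided by total claims."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.get("citation")) / len(claims)
```

With both metrics computed from the same labeled claims, a weekly report can show faithfulness and coverage side by side, as in the results table below.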
Operational Dashboard (Before Fix)
Illustrative dashboard (synthetic data).
Interventions (and Their Measured Impact)
| Change | Why | Measured Impact |
|---|---|---|
| Mandatory citations in answer template | Force evidence grounding | +52 pts citation coverage |
| Passage filtering by policy domain | Reduce cross-policy confusion | -9 pts critical hallucinations |
| Answer length cap + “I don’t know” policy | Prevent speculative wording | -6 pts hallucinations |
| LLM-as-judge penalty for uncited claims | Enforce evidence consistency | +0.18 faithfulness |
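The last intervention, a judge-side penalty for uncited claims, can be sketched as a deduction applied to the raw judge score. The rubric string and the per-claim penalty weight (0.1) are illustrative assumptions:

```python
# Sketch of the uncited-claim penalty applied on top of a raw LLM-judge score.
# Rubric wording and the per-claim penalty weight are illustrative.
JUDGE_RUBRIC = "Does each claim cite the supporting clause? Label each claim cited/uncited."

def penalized_faithfulness(raw_score: float, uncited_claims: int,
                           penalty: float = 0.1) -> float:
    """Subtract a fixed penalty per uncited claim, floored at zero."""
    return max(0.0, raw_score - penalty * uncited_claims)
```

Tying the penalty to the same rubric the judge scores against keeps the incentive simple: an answer cannot score well unless its claims are cited.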
Results After 6 Weeks
| Metric | Before | After |
|---|---|---|
| Hallucination Rate | 45% | 2% |
| Critical Hallucinations | 18% | 0.6% |
| Faithfulness | 0.61 | 0.91 |
| Citation Coverage | 12% | 94% |
| Escalation Rate | 6.2% | 1.1% |
“Reduce hallucinations” became a measurable, auditable objective. Evals turned a vague quality goal into concrete controls with clear business impact.