Case Study Archetype: Reducing Hallucinations to 2%

A composite case study of grounding a legal compliance assistant.


Quick Answer

This composite case study shows how a compliance assistant reduced hallucinations by enforcing citations and judge-based penalties.

TL;DR

  • Hallucinations were defined as claims unsupported by retrieved policy text.
  • Citation requirements and judge penalties raised faithfulness.
  • Critical error rate dropped below 1% after six weeks.

FAQ

What counts as a hallucination?

Any claim that cannot be supported by retrieved policy text, especially if it contradicts policy.

How was faithfulness measured?

Claims were mapped to evidence spans and penalized if citations were missing or incorrect.

Which controls reduced risk most?

Mandatory citations, domain filtering, and LLM-judge penalties for uncited claims.

About this case study

  • Composite archetype: Synthesized from multiple production deployments to illustrate real-world eval workflows.
  • Data: All numbers are illustrative and anonymized to show evaluation impact.
  • System: Legal compliance assistant for HR policy questions.

System Snapshot

  • Traffic: ~180k questions/month across 14 policy domains.
  • Stack: RAG with policy PDF ingestion, top-5 passages, answer + citations.
  • Risk: incorrect advice triggers compliance exposure and audit findings.
  • Success criteria: high faithfulness + citations; low critical hallucination rate.

Definition of Hallucination (and Consequence)

Hallucination was defined as any claim not supported by retrieved policy text. Claims that contradicted policy or cited the wrong statute were labeled critical.

| Severity | Definition | Rate (Initial) | Consequence |
| --- | --- | --- | --- |
| Critical | Contradicts policy or fabricates a legal requirement | 18% | Audit risk / legal escalation |
| Major | Missing citation or incomplete policy section | 17% | Manual review required |
| Minor | Formatting or ambiguous wording | 10% | Low impact |
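A minimal sketch of how severity tiers like those in the table above might be assigned from reviewer flags. The flag names and rules are illustrative assumptions, not the case study's actual taxonomy.

```python
# Hypothetical severity tiering mirroring the table above.
# Flag names and ordering are assumptions for illustration.

def classify_severity(contradicts_policy: bool,
                      fabricates_requirement: bool,
                      missing_citation: bool,
                      incomplete_section: bool) -> str:
    """Map reviewer flags for one answer to a severity tier."""
    if contradicts_policy or fabricates_requirement:
        return "critical"   # audit risk / legal escalation
    if missing_citation or incomplete_section:
        return "major"      # manual review required
    return "minor"          # formatting / ambiguous wording

print(classify_severity(False, True, False, False))  # critical
```

Ordering matters: an answer that both fabricates a requirement and omits a citation is labeled by its worst tier.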

Error Distribution (Initial)

Analysis of 500 failures:

  • Invented facts (hallucination): 45%
  • Wrong context retrieved: 30%
  • Reasoning error: 15%
  • Refusal error: 10%

Most failures were hallucinations, but the critical cases were concentrated in a few HR policy topics (leave policy exceptions, contractor classification, and termination).

Eval Design That Made This Actionable

  • 1,200-query golden set balanced by policy domain and severity tier.
  • Faithfulness score (Ragas-style) + citation coverage (answer spans must map to sources).
  • LLM-as-judge rubric: “Does each claim cite the supporting clause?”
  • Automatic escalation if critical hallucination rate exceeds 2% weekly.
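The automatic escalation rule above can be sketched as a simple threshold check on weekly counts. Function and constant names are assumptions for illustration.

```python
# Illustrative weekly escalation check: flag when the critical hallucination
# rate exceeds the 2% threshold described above.
CRITICAL_THRESHOLD = 0.02

def should_escalate(critical_count: int, total_answers: int) -> bool:
    """Return True when the weekly critical-hallucination rate breaches the limit."""
    if total_answers == 0:
        return False  # no traffic, nothing to escalate
    return critical_count / total_answers > CRITICAL_THRESHOLD

print(should_escalate(18, 100))  # True: 18% is far above the 2% limit
print(should_escalate(1, 200))   # False: 0.5% is within tolerance
```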

Measurement Methodology (How This Would Be Measured)

  • Golden set balanced across policy domains; reviewers label evidence spans and severity.
  • Faithfulness computed as claim-to-evidence overlap; citation coverage computed as cited spans / total claims.
  • LLM-as-judge rubric used for consistency checks; disagreements sampled for human audit.
  • Operational impact from audit logs: escalations, review time, and user trust survey scores.
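The two core metrics above can be sketched under assumed data shapes: each answer is a list of claims, and reviewers record whether each claim maps to an evidence span and whether it is cited. The `Claim` type and example data are hypothetical.

```python
# Sketch of faithfulness (claim-to-evidence overlap) and citation coverage
# (cited spans / total claims), under assumed data shapes.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    text: str
    evidence_span: Optional[str]  # supporting policy text, if reviewers found one
    cited: bool                   # did the answer cite a source for this claim?

def faithfulness(claims: list[Claim]) -> float:
    """Fraction of claims with a mapped evidence span."""
    if not claims:
        return 0.0
    return sum(c.evidence_span is not None for c in claims) / len(claims)

def citation_coverage(claims: list[Claim]) -> float:
    """Fraction of claims backed by an explicit citation."""
    if not claims:
        return 0.0
    return sum(c.cited for c in claims) / len(claims)

claims = [
    Claim("Leave carries over 5 days.", "Policy 4.2: up to 5 days carry over.", True),
    Claim("Contractors accrue PTO.", None, False),  # unsupported and uncited
]
print(faithfulness(claims), citation_coverage(claims))  # 0.5 0.5
```

In practice the claim-to-evidence mapping would come from human labels or an extraction model, not hand-built objects.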

Operational Dashboard (Before Fix)

Illustrative dashboard (synthetic data).

| Metric | Value | vs Target |
| --- | --- | --- |
| Faithfulness | 0.61 | Below 0.80 target |
| Citation Coverage | 12% | -68 pts vs target |
| Critical Hallucinations | 18% | 9x above limit |
| Escalation Rate | 6.2% | +3.1 pts |
| Avg. Resolution Time | 9.4 min | +3.5 min |
| User Trust Score | 3.1 / 5 | -1.2 |

Interventions (and Their Measured Impact)

| Change | Why | Measured Impact |
| --- | --- | --- |
| Mandatory citations in answer template | Force evidence grounding | +52 pts citation coverage |
| Passage filtering by policy domain | Reduce cross-policy confusion | -9 pts critical hallucinations |
| Answer length cap + “I don’t know” policy | Prevent speculative wording | -6 pts hallucinations |
| LLM-as-judge penalty for uncited claims | Enforce evidence consistency | +0.18 faithfulness |
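One way the judge-side penalty for uncited claims could work is a per-claim deduction from the faithfulness score. The weight (0.1 per uncited claim) and function names are assumptions chosen for illustration, not the deployed rubric.

```python
# Illustrative judge penalty: subtract a fixed amount from the faithfulness
# score for each uncited claim, enforcing the citation rubric.

def penalized_score(base_faithfulness: float,
                    uncited_claims: int,
                    total_claims: int,
                    penalty_per_claim: float = 0.1) -> float:
    """Deduct a per-claim penalty for uncited claims, floored at 0."""
    if total_claims == 0:
        return 0.0  # an answer with no claims earns no score
    penalty = penalty_per_claim * uncited_claims
    return max(0.0, base_faithfulness - penalty)

print(round(penalized_score(0.9, 2, 5), 2))  # 0.7: two uncited claims cost 0.2
```

Scoring uncited claims below zero-value, rather than merely ignoring them, is what makes the penalty an incentive for the answer template to cite everything.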

Results After 6 Weeks

| Metric | Before | After |
| --- | --- | --- |
| Hallucination Rate | 45% | 2% |
| Critical Hallucinations | 18% | 0.6% |
| Faithfulness | 0.61 | 0.91 |
| Citation Coverage | 12% | 94% |
| Escalation Rate | 6.2% | 1.1% |

Key takeaway

“Reduce hallucinations” became a measurable, auditable objective. Evals turned a vague quality goal into concrete controls with clear business impact.