Quick Answer
LLM-as-judge uses a strong model to score another model's outputs against a rubric. It scales evaluation cheaply, but it must be calibrated against human labels and audited for bias.
TL;DR
- Use clear rubrics and few-shot examples.
- Validate with human review and spot checks.
- Monitor judge drift and disagreement rates.
FAQ
When is LLM-as-judge reliable?
It works best for structured rubrics and factuality checks; it is weaker for subjective or ambiguous tasks.
How do I calibrate the judge?
Compare judge scores against human labels on a validation set and adjust prompts or thresholds.
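A minimal calibration sketch of the comparison described above: given judge scores on a 1–5 rubric and human pass/fail labels for the same validation set, pick the score threshold that maximizes agreement. The data and helper names here are illustrative assumptions, not a fixed API.

```python
def agreement_rate(judge_scores, human_labels, threshold):
    """Fraction of examples where the thresholded judge score matches the human pass/fail label."""
    preds = [score >= threshold for score in judge_scores]
    matches = sum(p == h for p, h in zip(preds, human_labels))
    return matches / len(human_labels)

def best_threshold(judge_scores, human_labels, thresholds=(2, 3, 4, 5)):
    """Pick the rubric threshold that best agrees with human labels on the validation set."""
    return max(thresholds, key=lambda t: agreement_rate(judge_scores, human_labels, t))

# Example: the judge rated five outputs 1-5; humans marked pass (True) / fail (False).
scores = [5, 4, 2, 3, 1]
labels = [True, True, False, True, False]
print(best_threshold(scores, labels))  # → 3
```

If agreement stays low at every threshold, the fix is usually in the judge prompt (clearer rubric, few-shot examples) rather than the threshold.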
What are common failure modes?
Overconfidence, preference for verbosity, and bias toward certain styles or providers.
The Core Concept
In many tasks (creative writing, reasoning, summarization), there is no single "correct" answer string. Exact match metrics (BLEU, ROUGE) correlate poorly with human judgment.
LLM-as-Judge uses a high-capacity model to act as a human proxy, reading the input, the output, and a scoring rubric to assign a grade.
Evaluation Approaches
1. Reference-Free Evaluation
The judge sees only the Input and the Output. It decides whether the output is "good" based on internal knowledge and the prompt instructions.
- Use Case: Creative writing, Coding assistants (does it compile?), General chat.
- Pros: Doesn't require a "Golden Dataset" of answers.
- Cons: Subjective; scores depend heavily on the judge's internal preferences, so they can drift between judge models or prompt versions.
2. Reference-Based Evaluation
The judge compares the Output against a Gold Reference.
- Use Case: RAG, Fact-extraction.
- Pros: More grounded.
- Cons: Expensive to create references.
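The two modes above differ only in what the judge is shown. A sketch of assembling each judge prompt (the rubric wording and field labels here are assumptions, not a standard format):

```python
def reference_free_prompt(query: str, answer: str) -> str:
    """Judge sees only the input and output; grading rests on the rubric alone."""
    return (
        "Rate the answer's quality on a 1-5 rubric.\n"
        f"Query: {query}\n"
        f"Answer: {answer}"
    )

def reference_based_prompt(query: str, answer: str, reference: str) -> str:
    """Judge additionally compares the output against a gold reference."""
    return (
        "Rate how well the answer matches the reference on a 1-5 rubric.\n"
        f"Query: {query}\n"
        f"Answer: {answer}\n"
        f"Reference: {reference}"
    )
```

The reference-based prompt grounds the grade in the gold answer, which is why it is preferred for RAG and fact-extraction when references exist.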
Biases & Limitations
Judges are not perfect. Research (Zheng et al., 2023) has identified several recurring biases:
| Bias Type | Description | Mitigation Strategy |
|---|---|---|
| Position Bias | In pairwise comparison, models prefer the first option presented. | Run eval twice, swapping order (A vs B, then B vs A). |
| Verbosity Bias | Models prefer longer answers, even if they are repetitive or fluff. | Explicitly penalize length in the system prompt; Use "conciseness" metrics. |
| Self-Preference | GPT-4 tends to rate GPT-4 outputs higher than Claude/Llama outputs. | Use a neutral judge or ensemble of judges. |
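The position-bias mitigation in the table can be sketched as follows: run each pairwise comparison twice with the candidates swapped, and only count a win when the verdict is consistent in both orders. Here `judge` is an assumed callable returning `"first"` or `"second"` for whichever presented option it prefers.

```python
def debiased_compare(judge, answer_a, answer_b):
    """Return 'A', 'B', or 'tie' after order-swapped pairwise judging."""
    verdict_1 = judge(answer_a, answer_b)   # A shown first
    verdict_2 = judge(answer_b, answer_a)   # B shown first
    a_wins = verdict_1 == "first" and verdict_2 == "second"
    b_wins = verdict_1 == "second" and verdict_2 == "first"
    if a_wins:
        return "A"
    if b_wins:
        return "B"
    return "tie"  # inconsistent verdicts suggest position bias

# A judge that always picks the first option collapses to ties:
biased_judge = lambda x, y: "first"
print(debiased_compare(biased_judge, "answer one", "answer two"))  # → tie
```

Tracking the rate of inconsistent (tie) verdicts is also a useful drift signal: a rising tie rate means the judge's preferences are being driven by presentation order rather than content.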
The "G-Eval" Prompt Pattern
The most effective way to use LLM-as-Judge is the Chain-of-Thought Rubric pattern (popularized by the G-Eval paper).
```
You are an expert evaluator for search relevance.
Task: Rate the relevance of the AI Answer to the User Query on a scale of 1-5.
Rubric:
1 - Completely irrelevant.
2 - Tangential connection, misses intent.
3 - Addresses the topic but lacks detail or specific accuracy.
4 - Good answer, minor omissions.
5 - Perfect output, helpful, concise, and accurate.
Steps:
1. Read the Query and identify the user intent.
2. Read the Answer.
3. Compare the Answer claims to the intent.
4. Assign a score based ONLY on the rubric.
Output format: JSON { "score": int, "reasoning": "string" }
```
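Because the judge is instructed to emit JSON, the calling code should validate the verdict against the rubric's schema before trusting it. A runnable sketch: `call_judge` is a stand-in for your actual LLM client and is stubbed here so the parsing logic stands on its own.

```python
import json

def parse_verdict(raw: str) -> dict:
    """Validate the judge's JSON output against the expected { score, reasoning } schema."""
    verdict = json.loads(raw)
    score = verdict["score"]
    if not (isinstance(score, int) and 1 <= score <= 5):
        raise ValueError(f"score out of rubric range: {score!r}")
    if not isinstance(verdict.get("reasoning"), str):
        raise ValueError("missing reasoning string")
    return verdict

def call_judge(query: str, answer: str) -> str:
    # Stub: replace with a real chat-completion call using the rubric prompt above.
    return '{"score": 4, "reasoning": "Good answer, minor omissions."}'

verdict = parse_verdict(call_judge("best pizza in NYC?", "Try the place on Carmine St."))
print(verdict["score"])  # → 4
```

Rejecting malformed verdicts (and retrying) is cheaper than silently logging an out-of-range score that later skews your aggregates.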