LLM-as-Judge

Using a strong model (such as GPT-4) to evaluate the outputs of weaker models (such as Llama-2 or earlier GPT versions).


Quick Answer

LLM-as-judge uses models to score other models with a rubric. It can scale evaluation but must be calibrated and audited for bias.

TL;DR

  • Use clear rubrics and few-shot examples.
  • Validate with human review and spot checks.
  • Monitor judge drift and disagreement rates.

FAQ

When is LLM-as-judge reliable?

It works best for structured rubrics and factuality checks; it is weaker for subjective or ambiguous tasks.

How do I calibrate the judge?

Compare judge scores against human labels on a validation set and adjust prompts or thresholds.
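A minimal sketch of that calibration step, assuming you already have parallel lists of judge and human ratings on a 1-5 scale (the function name and metrics here are illustrative, not from any particular library):

```python
# Sketch: calibrate a judge against human labels on a validation set.
# `judge_scores` and `human_scores` are hypothetical parallel lists of 1-5 ratings.

def calibration_report(judge_scores, human_scores):
    assert len(judge_scores) == len(human_scores) and judge_scores
    n = len(judge_scores)
    pairs = list(zip(judge_scores, human_scores))
    exact = sum(j == h for j, h in pairs) / n          # strict agreement rate
    within_one = sum(abs(j - h) <= 1 for j, h in pairs) / n  # near-miss agreement
    mean_bias = sum(j - h for j, h in pairs) / n       # >0 means judge is lenient
    return {"exact_agreement": exact, "within_one": within_one, "mean_bias": mean_bias}

report = calibration_report([5, 4, 3, 5, 2], [4, 4, 3, 5, 1])
```

A positive `mean_bias` suggests the judge is systematically lenient; adjust the rubric wording or decision thresholds and re-run until agreement is acceptable.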

What are common failure modes?

Overconfidence, preference for verbosity, and bias toward certain styles or providers.

The Core Concept

In many tasks (creative writing, reasoning, summarization), there is no single "correct" answer string. Exact match metrics (BLEU, ROUGE) correlate poorly with human judgment.

LLM-as-Judge uses a high-capacity model to act as a human proxy, reading the input, the output, and a scoring rubric to assign a grade.

[Diagram: the user input (article + prompt) goes to Model A (weak), which produces a draft summary. The judge LLM receives the original input, the draft, and a rubric, runs hallucination and style checks, and returns a score with a rationale.]

Evaluation Approaches

1. Reference-Free Evaluation

The judge sees only the Input and the Output. It decides if the output is "good" based on internal knowledge and the prompt instructions.

  • Use Case: Creative writing, coding assistants (does it compile?), general chat.
  • Pros: Doesn't require a "Golden Dataset" of answers.
  • Cons: Subjective.

2. Reference-Based Evaluation

The judge compares the Output against a Gold Reference.

  • Use Case: RAG, fact extraction.
  • Pros: More grounded.
  • Cons: Expensive to create references.
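The reference-based variant differs only in that the prompt includes the gold answer and asks the judge to grade against it. A hedged sketch (template wording is an assumption, not a standard):

```python
# Sketch: reference-based evaluation. The judge grades the output against a
# gold reference instead of relying on its own internal knowledge.

def build_reference_based_prompt(user_input: str, model_output: str, gold_reference: str) -> str:
    return (
        "You are an impartial evaluator. Rate how well the Output matches the "
        "Gold Reference for the Input, on a 1-5 scale.\n"
        "Penalize claims in the Output that contradict the reference.\n\n"
        f"Input:\n{user_input}\n\n"
        f"Gold Reference:\n{gold_reference}\n\n"
        f"Output:\n{model_output}\n\n"
        'Respond as JSON: {"score": int, "reasoning": "string"}'
    )
```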

Biases & Limitations

Judges are not perfect. Research (Zheng et al., 2023) has identified distinct biases:

  • Position Bias — In pairwise comparisons, models prefer the first option presented. Mitigation: run the eval twice with the order swapped (A vs. B, then B vs. A).
  • Verbosity Bias — Models prefer longer answers, even when they are repetitive or padded. Mitigation: explicitly penalize length in the system prompt; use "conciseness" metrics.
  • Self-Preference — GPT-4 tends to rate GPT-4 outputs higher than Claude or Llama outputs. Mitigation: use a neutral judge or an ensemble of judges.

The "G-Eval" Prompt Pattern

The most effective way to use LLM-as-Judge is the Chain-of-Thought Rubric pattern (popularized by the G-Eval paper).

Judge System Prompt
You are an expert evaluator for search relevance.
    
Task: Rate the relevance of the AI Answer to the User Query on a scale of 1-5.

Rubric:
1 - Completely irrelevant.
2 - Tangential connection, misses intent.
3 - Addresses the topic but lacks detail or specific accuracy.
4 - Good answer, minor omissions.
5 - Perfect output, helpful, concise, and accurate.

Steps:
1. Read the Query and identify the user intent.
2. Read the Answer.
3. Compare the Answer claims to the intent.
4. Assign a score based ONLY on the rubric.

Output format: JSON { "score": int, "reasoning": "string" }
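Since the judge is asked for JSON, it is worth validating the response before trusting the score. A minimal sketch (the function and error-handling policy are illustrative):

```python
import json

# Sketch: validate the judge's JSON output against the 1-5 rubric before
# recording the score. Malformed or out-of-range responses return None so
# the caller can retry the judge or escalate to human review.

def parse_judge_output(raw: str):
    try:
        data = json.loads(raw)
        score = int(data["score"])
        if not 1 <= score <= 5:
            raise ValueError("score outside rubric range")
        return {"score": score, "reasoning": str(data.get("reasoning", ""))}
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        return None
```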