Prompt Engineering for Evals

Copy-paste prompt templates for LLM-as-Judge, consistency checking, failure analysis, and calibration.

Cheatsheet · For: Eng, ML · Est. time: Quick ref

Core Principles

🎯 Be Specific

Vague rubrics produce noisy scores. Define exactly what each score level means.

Bad: "Rate the quality on 1-5."
Good: "Rate accuracy 1-5 where 5 = all claims verifiable against source docs."

📐 One Dimension at a Time

Asking an LLM to score "overall quality" bundles accuracy, relevance, and tone into one number.

Score each dimension separately, then combine with explicit weights.

🔄 Calibrate Against Humans

Run 50+ examples through both your LLM judge and human annotators. Target ≥85% agreement.

If agreement is low, your rubric descriptions are ambiguous — refine them.
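Measuring judge–human agreement is simple arithmetic once you have paired labels. A minimal sketch (the helper name `agreement_rate` is my own, not a library function); for a stricter chance-corrected measure, consider Cohen's kappa:

```python
def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of examples where the LLM judge matches the human label
    within `tolerance` points (0 = exact match)."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    hits = sum(1 for j, h in zip(judge_scores, human_scores)
               if abs(j - h) <= tolerance)
    return hits / len(judge_scores)
```

A `tolerance=1` run alongside the exact-match run tells you whether disagreements are near-misses (rubric is roughly right) or scattered (rubric is ambiguous).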

⚖️ Randomize Order

LLMs exhibit position bias. When comparing two outputs, randomize which one appears first.

Without randomization, the first-listed option can be favoured up to 20% more often.
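Randomization only works if you can map the verdict back to the original labels afterwards. A minimal sketch (helper names are my own):

```python
import random

def randomized_pair(first, second, rng=random):
    """Randomly assign two responses to slots A and B.
    Returns (slot_a, slot_b, swapped) so the verdict can be mapped back."""
    swapped = rng.random() < 0.5
    return (second, first, True) if swapped else (first, second, False)

def unswap(winner, swapped):
    """Map the judge's 'A'/'B'/'Tie' verdict back to the original order."""
    if winner == "Tie" or not swapped:
        return winner
    return {"A": "B", "B": "A"}[winner]
```

Log the `swapped` flag with every judgment; without it you cannot audit for position bias later.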

Template 1: Basic LLM Judge

Single-score evaluation. Best for quick checks and regression testing.

Quick Judge Prompt

You are evaluating an AI assistant's response.

TASK: Score the response on a scale of 1-5 for {DIMENSION}.

SCORING RUBRIC:
5 - {Description for score 5}
4 - {Description for score 4}
3 - {Description for score 3}
2 - {Description for score 2}
1 - {Description for score 1}

USER QUERY: {query}
AI RESPONSE: {response}
REFERENCE ANSWER: {expected}

Respond with JSON only:
{"score": N, "justification": "one sentence"}
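The "JSON only" instruction is not a guarantee; validate the reply before trusting the score. A minimal parsing sketch (the helper name is my own, and it assumes the judge returned bare JSON rather than JSON wrapped in prose):

```python
import json

def parse_judge_output(raw, min_score=1, max_score=5):
    """Parse the judge's JSON reply and validate the score range.
    Raises ValueError on malformed or out-of-range output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"judge did not return valid JSON: {e}")
    score = data.get("score")
    if not isinstance(score, int) or not (min_score <= score <= max_score):
        raise ValueError(f"score out of range: {score!r}")
    return score, data.get("justification", "")
```

Treat a `ValueError` here as a failed evaluation to retry, not as a score of 0.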

Template 2: Multi-Dimension Judge

Full-spectrum evaluation. Use for release gates and weekly reports.

📊 Full Evaluation Prompt

You are an expert evaluator. Score this AI response on
each dimension independently. Do NOT let one dimension
influence another.

DIMENSIONS:
1. Accuracy (1-5): Are all facts correct?
2. Faithfulness (1-5): Is the answer grounded in context?
3. Relevance (1-5): Does it answer the actual question?
4. Completeness (1-5): Is anything important missing?
5. Safety (Pass/Fail): Any harmful or non-compliant content?

USER QUERY: {query}
RETRIEVED CONTEXT: {context}
AI RESPONSE: {response}
EXPECTED ANSWER: {expected}

CRITICAL RULES:
- If Safety = Fail, overall score is 0.
- Justify each score in one sentence.
- Do NOT add information not in the context.

Return JSON:
{
  "accuracy": {"score": N, "justification": "..."},
  "faithfulness": {"score": N, "justification": "..."},
  "relevance": {"score": N, "justification": "..."},
  "completeness": {"score": N, "justification": "..."},
  "safety": {"result": "Pass|Fail", "justification": "..."},
  "weighted_score": N,
  "summary": "overall assessment in one sentence"
}
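Rather than asking the judge to compute `weighted_score` itself, compute it in code from the per-dimension scores so the weights are explicit and auditable. A minimal sketch (helper name and example weights are my own):

```python
def weighted_score(scores, weights, safety_pass):
    """Combine per-dimension 1-5 scores with explicit weights.
    A Safety fail zeroes the overall score, per the rubric's critical rules."""
    if not safety_pass:
        return 0.0
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[dim] * w for dim, w in weights.items())
```

Keeping the weights in code also lets you re-weight historical results without re-running the judge.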

Template 3: Pairwise Comparison

Compare two models or prompt versions head-to-head. Best for A/B testing.

🆚 A/B Comparison Prompt

Compare these two AI responses to the same query.

USER QUERY: {query}

RESPONSE A: {response_a}
RESPONSE B: {response_b}

For each dimension, which response is better?

1. Accuracy: A / B / Tie
2. Relevance: A / B / Tie
3. Helpfulness: A / B / Tie
4. Safety: A / B / Tie

IMPORTANT: The order of responses was randomized.
Base your judgment ONLY on quality, not position.

Return JSON:
{
  "accuracy": {"winner": "A|B|Tie", "reasoning": "..."},
  "relevance": {"winner": "A|B|Tie", "reasoning": "..."},
  "helpfulness": {"winner": "A|B|Tie", "reasoning": "..."},
  "safety": {"winner": "A|B|Tie", "reasoning": "..."},
  "overall_winner": "A|B|Tie",
  "confidence": "high|medium|low"
}

Template 4: Hallucination Detector

Check if an answer contains claims not supported by the provided context.

🔍 Faithfulness Checker

You are a fact-checker. Your job is to verify that every
claim in the AI response is supported by the provided
context.

CONTEXT (source of truth):
{context}

AI RESPONSE (to verify):
{response}

INSTRUCTIONS:
1. Extract each factual claim from the response.
2. For each claim, check if the context supports it.
3. Label each claim as:
   - SUPPORTED: Directly stated in context
   - INFERRED: Reasonable inference from context
   - UNSUPPORTED: Not in context (hallucination)
   - CONTRADICTED: Conflicts with context

Return JSON:
{
  "claims": [
    {"claim": "...", "verdict": "...", "evidence": "..."}
  ],
  "hallucination_count": N,
  "total_claims": N,
  "faithfulness_score": 0.0-1.0
}
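The `faithfulness_score` is best computed in code from the claim verdicts rather than trusted from the judge. A minimal sketch (helper name is my own; whether INFERRED claims count as grounded is a policy choice, exposed as a flag):

```python
def faithfulness_score(claims, allow_inferred=True):
    """Fraction of claims grounded in context.
    SUPPORTED always counts; INFERRED counts only if allow_inferred."""
    if not claims:
        return 1.0  # nothing asserted, nothing hallucinated
    ok = {"SUPPORTED"} | ({"INFERRED"} if allow_inferred else set())
    grounded = sum(1 for c in claims if c["verdict"] in ok)
    return grounded / len(claims)
```

Strict pipelines (legal, medical) typically set `allow_inferred=False` and alert on any CONTRADICTED claim regardless of the aggregate score.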

Template 5: Consistency Checker

Run the same query 3-5 times and check if the model gives consistent answers.

🔁 Self-Consistency Prompt

You are comparing multiple responses from the same AI
system to the same query. Assess consistency.

QUERY: {query}

RESPONSE 1: {response_1}
RESPONSE 2: {response_2}
RESPONSE 3: {response_3}

CHECK:
1. Do all responses agree on the core facts?
2. Are there contradictions between responses?
3. Does the level of confidence vary significantly?

Return JSON:
{
  "consistent": true|false,
  "contradictions": ["list of contradictions found"],
  "confidence_variance": "low|medium|high",
  "recommendation": "reliable|needs-review|unreliable"
}
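When the core answer can be extracted and normalized (a number, an entity, a yes/no), you can score consistency without a judge at all. A minimal majority-vote sketch (helper name and thresholds are my own; it assumes answers are already normalized strings):

```python
from collections import Counter

def consistency_verdict(answers, threshold=1.0):
    """Given normalized core answers from repeated runs, report whether
    they agree often enough. threshold=1.0 demands unanimity."""
    if not answers:
        raise ValueError("need at least one answer")
    _, count = Counter(answers).most_common(1)[0]
    rate = count / len(answers)
    if rate >= threshold:
        return "reliable"
    return "needs-review" if rate >= 0.5 else "unreliable"
```

Use this cheap check first and reserve the LLM consistency judge for free-form answers that resist normalization.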

Common Pitfalls

Don't: Score Overall Quality

"Rate the response 1-10" blends dimensions and produces scores you can't act on.

Don't: Skip Justifications

Scores without rationale are unauditable. Always require one sentence explaining each score.

Don't: Use the Same Model

Don't judge GPT-4o outputs with GPT-4o. Use a different (often stronger) model as judge.

Don't: Ignore Position Bias

Always randomize answer order in pairwise comparisons. Run each pair twice with swapped positions.