Quick Answer
Human-in-the-loop systems route uncertain or high-risk outputs to people. The goal is to reduce critical errors while creating feedback for improvement.
TL;DR
- Define escalation thresholds by risk.
- Route low-confidence cases to experts.
- Feed corrections back into evaluation data.
FAQ
When should I escalate to a human?
Escalate when confidence is low or the potential impact is high, such as compliance or safety scenarios.
How do I set thresholds?
Start with conservative thresholds, then tune using false negative and false positive rates on a labeled set.
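That tuning loop can be sketched as a small threshold sweep. This is a minimal sketch, assuming each labeled example carries a model confidence and a correctness flag; `tune_threshold` and its output schema are illustrative names, not an established API.

```python
import numpy as np

def tune_threshold(confidences, correct, thresholds=(0.5, 0.7, 0.9)):
    """For each candidate threshold, report the escalation rate and the
    false-accept rate (wrong answers the agent would have acted on)."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=bool)
    results = []
    for t in thresholds:
        acted = confidences >= t              # agent acts autonomously
        escalated = ~acted                    # routed to a human
        false_accepts = (acted & ~correct).mean()  # bad actions that slipped through
        results.append((t, escalated.mean(), false_accepts))
    return results
```

Start conservative (high threshold, high escalation rate), then lower the threshold only while the false-accept rate stays within your risk budget.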
How do I use human feedback?
Convert corrections into labeled examples and update the rubric and golden sets.
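The correction-to-golden-set step can be sketched as a small transform. The field names below are illustrative assumptions, not a standard schema: the human-corrected text becomes the reference answer, and the original output is kept as a known failure case.

```python
def correction_to_example(prompt, model_output, human_correction):
    # A human correction becomes a labeled eval example: the corrected
    # text is the reference, the original output is a documented failure.
    return {
        "input": prompt,
        "reference": human_correction,
        "rejected": model_output,
        "source": "human_correction",
    }
```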
The Philosophy of Escalation
In high-stakes environments (finance, healthcare), 99% accuracy isn't good enough. You can't just ship an autonomous agent and hope for the best.
The solution is not "better models" but better workflows: design systems that know their own limits and escalate to a human expert when they are unsure.
The Escalation Logic
This decision logic determines whether an agent should act on its own or ask a human.
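The same decision can be sketched in code. This is a minimal sketch: the `HIGH_RISK` categories and the 0.85 threshold are illustrative assumptions, not prescribed values.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # "act" or "ask_human"
    reason: str

HIGH_RISK = {"compliance", "safety", "payments"}  # illustrative categories

def escalation_decision(confidence: float, risk_category: str,
                        threshold: float = 0.85) -> Decision:
    # High-risk domains always go to a human, regardless of confidence
    if risk_category in HIGH_RISK:
        return Decision("ask_human", "high-risk category")
    # Otherwise escalate only when the model is unsure
    if confidence < threshold:
        return Decision("ask_human", "low confidence")
    return Decision("act", "confident and low-risk")
```

Note that risk overrides confidence: a 99%-confident agent still asks a human in a compliance scenario.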
Capturing Feedback Signals
Evaluating agents requires capturing data from production usage. We categorize feedback into two types:
Explicit Feedback
Direct user input on quality.
- Thumbs Up / Down buttons
- "Regenerate" clicks
- Rating (1-5 stars)
Pros: Clean data. Cons: Low engagement.
Implicit Feedback
Inferred from user behavior.
- User copies code block (Positive)
- User edits the AI draft (Negative - the size of the edit indicates how far the output drifted from what the user wanted)
- Session length (Short can mean success OR rage quit)
Pros: High volume. Cons: Noisy.
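Both kinds of signal can be captured in one uniform event log. The schema below is an assumption for illustration, not a standard format; the point is to record the kind (explicit vs. implicit) and a polarity alongside each signal.

```python
from dataclasses import dataclass, field
import time

@dataclass
class FeedbackEvent:
    session_id: str
    kind: str        # "explicit" or "implicit"
    signal: str      # e.g. "thumbs_up", "copy_code", "edit_draft"
    polarity: int    # +1 positive, -1 negative, 0 ambiguous
    ts: float = field(default_factory=time.time)

def record_copy(session_id: str) -> FeedbackEvent:
    # Copying a code block is a high-volume but noisy positive signal
    return FeedbackEvent(session_id, "implicit", "copy_code", +1)
```

Ambiguous signals like session length get polarity 0 and are resolved later, in aggregate.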
Agent-Specific Metrics
- Success Rate: Percent of sessions that resolve the user intent without human takeover.
- Escalation Rate: How often the agent gives up (should trend down as the model improves).
- Intervention Rate: How often a human has to step in to stop a bad action (the critical safety metric).
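All three rates can be computed from session logs. This is a sketch under an assumed schema: each session record carries boolean `resolved`, `escalated`, and `intervened` flags (illustrative names).

```python
def agent_metrics(sessions):
    """Compute the three agent metrics from a list of session dicts."""
    n = len(sessions)
    return {
        # Resolved the intent with no human takeover of any kind
        "success_rate": sum(
            s["resolved"] and not (s["escalated"] or s["intervened"])
            for s in sessions
        ) / n,
        "escalation_rate": sum(s["escalated"] for s in sessions) / n,
        "intervention_rate": sum(s["intervened"] for s in sessions) / n,
    }
```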
Confidence Calibration
An agent is only useful if its "Confidence Score" reflects reality. If it says "I am 99% sure" but is wrong, your escalation logic breaks.
Use Expected Calibration Error (ECE) to measure this alignment.
import numpy as np

def calculate_ece(probs, labels, bins=10):
    # Group predictions by confidence bin (0.0-0.1, 0.1-0.2, ...)
    probs, labels = np.asarray(probs), np.asarray(labels)
    bin_ids = np.minimum((probs * bins).astype(int), bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = bin_ids == b
        if mask.any():
            # For each bin, |avg_confidence - avg_accuracy|, weighted by bin size
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece  # Lower is better