Quick Answer
Human-in-the-loop systems route uncertain or high-risk outputs to people. The goal is to reduce critical errors while creating feedback for improvement.
TL;DR
- Define escalation thresholds by risk.
- Route low-confidence cases to experts.
- Feed corrections back into evaluation data.
FAQ
When should I escalate to a human?
Escalate when confidence is low or the potential impact is high, such as compliance or safety scenarios.
How do I set thresholds?
Start with conservative thresholds, then tune using false negative and false positive rates on a labeled set.
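That tuning loop can be sketched as a small threshold sweep. This is a minimal sketch, assuming each labeled example carries a model confidence and a correctness flag; `tune_threshold` and its output schema are illustrative names, not an established API.

```python
import numpy as np

def tune_threshold(confidences, correct, thresholds=(0.5, 0.7, 0.9)):
    """For each candidate threshold, report the escalation rate and the
    false-accept rate (wrong answers the agent would have acted on)."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=bool)
    results = []
    for t in thresholds:
        acted = confidences >= t              # agent acts autonomously
        escalated = ~acted                    # routed to a human
        false_accepts = (acted & ~correct).mean()  # bad actions that slipped through
        results.append((t, escalated.mean(), false_accepts))
    return results
```

Start conservative (high threshold, high escalation rate), then lower the threshold only while the false-accept rate stays within your risk budget.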
How do I use human feedback?
Convert corrections into labeled examples and update the rubric and golden sets.
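The correction-to-golden-set step can be sketched as a small transform. The field names below are illustrative assumptions, not a standard schema: the human-corrected text becomes the reference answer, and the original output is kept as a known failure case.

```python
def correction_to_example(prompt, model_output, human_correction):
    # A human correction becomes a labeled eval example: the corrected
    # text is the reference, the original output is a documented failure.
    return {
        "input": prompt,
        "reference": human_correction,
        "rejected": model_output,
        "source": "human_correction",
    }
```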
The Philosophy of Escalation
In high-stakes environments (finance, healthcare), 99% accuracy isn't good enough. You can't just ship an autonomous agent and hope for the best.
The solution is not "better models" but better workflows: design systems that know their own limits and escalate to a human expert when they are unsure.
The Escalation Logic
This decision logic determines whether an agent should act on its own or ask a human.
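The same decision can be sketched in code. This is a minimal sketch: the `HIGH_RISK` categories and the 0.85 threshold are illustrative assumptions, not prescribed values.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    action: str   # "act" or "ask_human"
    reason: str

HIGH_RISK = {"compliance", "safety", "payments"}  # illustrative categories

def escalation_decision(confidence: float, risk_category: str,
                        threshold: float = 0.85) -> Decision:
    # High-risk domains always go to a human, regardless of confidence
    if risk_category in HIGH_RISK:
        return Decision("ask_human", "high-risk category")
    # Otherwise escalate only when the model is unsure
    if confidence < threshold:
        return Decision("ask_human", "low confidence")
    return Decision("act", "confident and low-risk")
```

Note that risk overrides confidence: a 99%-confident agent still asks a human in a compliance scenario.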
Capturing Feedback Signals
Evaluating agents requires capturing data from production usage. We categorize feedback into two types:
Explicit Feedback
Direct user input on quality.
- Thumbs Up / Down buttons
- "Regenerate" clicks
- Rating (1-5 stars)
Pros: Clean data. Cons: Low engagement.
Implicit Feedback
Inferred from user behavior.
- User copies code block (Positive)
- User edits the AI draft (Negative - the size of the edit indicates how far the output drifted from what the user wanted)
- Session length (Short can mean success OR rage quit)
Pros: High volume. Cons: Noisy.
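Both kinds of signal can be captured in one uniform event log. The schema below is an assumption for illustration, not a standard format; the point is to record the kind (explicit vs. implicit) and a polarity alongside each signal.

```python
from dataclasses import dataclass, field
import time

@dataclass
class FeedbackEvent:
    session_id: str
    kind: str        # "explicit" or "implicit"
    signal: str      # e.g. "thumbs_up", "copy_code", "edit_draft"
    polarity: int    # +1 positive, -1 negative, 0 ambiguous
    ts: float = field(default_factory=time.time)

def record_copy(session_id: str) -> FeedbackEvent:
    # Copying a code block is a high-volume but noisy positive signal
    return FeedbackEvent(session_id, "implicit", "copy_code", +1)
```

Ambiguous signals like session length get polarity 0 and are resolved later, in aggregate.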
Agent-Specific Metrics
- Success Rate: Percent of sessions that resolve the user intent without human takeover.
- Escalation Rate: How often the agent gives up (should trend down as the model improves).
- Intervention Rate: How often a human has to step in to stop a bad action (the critical safety metric).
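All three rates can be computed from session logs. This is a sketch under an assumed schema: each session record carries boolean `resolved`, `escalated`, and `intervened` flags (illustrative names).

```python
def agent_metrics(sessions):
    """Compute the three agent metrics from a list of session dicts."""
    n = len(sessions)
    return {
        # Resolved the intent with no human takeover of any kind
        "success_rate": sum(
            s["resolved"] and not (s["escalated"] or s["intervened"])
            for s in sessions
        ) / n,
        "escalation_rate": sum(s["escalated"] for s in sessions) / n,
        "intervention_rate": sum(s["intervened"] for s in sessions) / n,
    }
```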
Confidence Calibration
An agent is only useful if its "Confidence Score" reflects reality. If it says "I am 99% sure" but is wrong, your escalation logic breaks.
Use Expected Calibration Error (ECE) to measure this alignment.
import numpy as np

def calculate_ece(probs, labels, bins=10):
    # Group predictions by confidence bin (0.0-0.1, 0.1-0.2, ...)
    probs, labels = np.asarray(probs), np.asarray(labels)
    bin_ids = np.minimum((probs * bins).astype(int), bins - 1)
    ece = 0.0
    for b in range(bins):
        mask = bin_ids == b
        if mask.any():
            # For each bin, |avg_confidence - avg_accuracy|, weighted by bin size
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece  # Lower is better