Quick Answer
The governance framework uses a Complexity-Criticality matrix to decide oversight level, testing depth, and release controls.
TL;DR
- Classify systems by complexity and criticality.
- Assign controls that match risk.
- Review and update as systems evolve.
FAQ
What is criticality?
Criticality reflects potential harm or business impact if the system fails.
What is complexity?
Complexity reflects how many components, dependencies, and decision paths the system has.
How do I choose the right controls?
Higher criticality requires stricter evals, audit trails, and human oversight.
The Complexity-Criticality Matrix
Not all AI features require the same level of scrutiny. We categorize use cases to determine the required "Evaluation Depth".
Consequence Weighting: Not All Errors Are Equal
Traditional accuracy metrics treat every query as equal. But in production:
- Query A: "Tell me a joke" → AI fails. (Annoyance)
- Query B: "Can I return this used item?" → AI lies. (Financial Loss)
If you have 50 queries like A and 1 query like B, getting only B wrong still looks like ~98% accuracy on paper (50 of 51 passes), while the business bears the full cost of the one failure that mattered.
The Weighted Formula

WeightedAccuracy = Σᵢ (RiskWeightᵢ × Passᵢ) / Σᵢ RiskWeightᵢ

where Passᵢ is 1 if query i passes and 0 if it fails, and RiskWeightᵢ is derived from the estimated dollar cost or reputation risk of a failure in that query's category.
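The weighting described above can be sketched in a few lines. The function and variable names here are illustrative, not part of any standard API:

```python
# Weighted accuracy sketch: each eval result is a (passed, risk_weight)
# pair, one per query. The weight comes from the query's risk tier.

def weighted_accuracy(results: list[tuple[bool, float]]) -> float:
    """Return risk-weighted pass rate over (passed, risk_weight) pairs."""
    total_weight = sum(w for _, w in results)
    if total_weight == 0:
        return 0.0
    return sum(w for passed, w in results if passed) / total_weight

# 50 low-risk passes plus 1 high-risk failure: ~98% unweighted,
# but only ~71% once the failure carries a weight of 20.0.
results = [(True, 1.0)] * 50 + [(False, 20.0)]
print(round(weighted_accuracy(results), 3))  # 0.714
```

Note how a single high-weight failure moves the score far more than dozens of low-weight passes, which is exactly the behavior the plain accuracy metric lacks.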
Risk Tiers
A common approach is a four-tier system that maps consequence severity to a numeric weight, for example:
| Tier | Example | Consequence | Weight |
|---|---|---|---|
| Critical | "How do I reset my pacemaker?" | Safety Risk / Lawsuit | 50.0 |
| High | "What is the refund window?" | Financial Loss if wrong | 20.0 |
| Medium | "How do I contact support?" | User Frustration | 5.0 |
| Low | "Write a poem." | Minor Annoyance | 1.0 |
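A tier table like the one above translates directly into a weight map in code. This is a minimal sketch; the tier names and values mirror this document's table rather than any external standard:

```python
# Weight map mirroring the risk-tier table above.
RISK_WEIGHTS = {
    "critical": 50.0,
    "high": 20.0,
    "medium": 5.0,
    "low": 1.0,
}

def weight_for(tier: str) -> float:
    # Unknown or untagged tiers default to the highest weight:
    # fail safe rather than silently under-weighting a risky query.
    return RISK_WEIGHTS.get(tier.lower(), RISK_WEIGHTS["critical"])

print(weight_for("High"))  # 20.0
```

Defaulting unknown tiers to the critical weight is a deliberate choice: a mis-tagged query should raise the score's sensitivity, not lower it.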
From Risk to Testable Hypotheses
Vague risks lead to vague evals. Turn "what if it fails?" into specific, testable criteria:
| Vague Risk | Testable Hypothesis | Eval Approach |
|---|---|---|
| "It might hallucinate" | "All claims are grounded in retrieved context" | Faithfulness score > 0.9 |
| "Users might get wrong info" | "Policy questions match official docs" | Golden set accuracy > 95% |
| "It could say something unsafe" | "Safety-critical queries trigger guardrails" | 100% escalation on safety set |
| "Quality might degrade" | "Week-over-week scores don't drop >5%" | Drift monitoring alerts |
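The last row of the table, week-over-week drift, is straightforward to make testable. A minimal sketch, assuming scores are collected per week and a 5% relative drop is the alert threshold (both are illustrative choices):

```python
# Drift check: alert when this week's mean eval score drops more than
# `max_drop` (relative) below last week's mean.

def drift_alert(last_week: list[float], this_week: list[float],
                max_drop: float = 0.05) -> bool:
    prev = sum(last_week) / len(last_week)
    curr = sum(this_week) / len(this_week)
    return (prev - curr) / prev > max_drop

print(drift_alert([0.90, 0.92, 0.91], [0.80, 0.82, 0.81]))  # True: ~11% drop
```

In production this check would typically run on a schedule against logged eval scores and page an owner rather than print.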
Evaluation Maturity Model
Where does your organization stand?
Level 1: Ad-Hoc (The "Vibes" Phase)
- Method: Engineers verify outputs manually during dev.
- Dataset: None.
- Risk: High. Regressions break features constantly.
Level 2: Managed (The "Unit Test" Phase)
- Method: Deterministic regression tests (string matching).
- Dataset: Small CSV of inputs/outputs.
- Risk: Medium. Catches bugs, but fails to measure hallucinations.
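Level 2 in practice looks like a deterministic regression test over a small CSV of input/expected pairs. This sketch uses exact string matching, which is exactly why it catches regressions but says nothing about hallucinations (the file layout and `model_call` callable are illustrative assumptions):

```python
import csv

def run_regression(path: str, model_call) -> list[str]:
    """Return the inputs whose model output does not exactly match
    the expected string in the CSV (columns: input, expected)."""
    failures = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if model_call(row["input"]).strip() != row["expected"].strip():
                failures.append(row["input"])
    return failures
```

The moment an answer is a valid paraphrase of the expected string, this test reports a false failure, which is the pressure that pushes teams toward Level 3's semantic metrics.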
Level 3: Optimized (The "Production" Phase)
- Method: LLM-as-a-Judge, Semantic Similarity, RAG Metrics.
- Dataset: Curated "Golden Dataset" from production logs.
- Risk: Low. Confidence in deployment.
Level 4: Governed (The "Compliance" Phase)
- Method: Adversarial Red Teaming, Bias detection, Privacy scanning (PII).
- Dataset: Synthetic attack vectors.
- Risk: Minimal. Ready for Regulated Industry (Finance/Health).
Regulatory Compliance
The EU AI Act mandates logging and record-keeping for high-risk systems, and the NIST AI RMF calls for ongoing measurement and monitoring. Your eval pipeline is your primary evidence of compliance.
Audit Tip
Always log the input, the retrieved context (document IDs), and the generated output for every eval run. If a regulator or customer asks why the model gave a dangerous answer six months or three years from now, these logs are your evidence.
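The audit tip above can be implemented as an append-only JSON Lines log. A minimal sketch; the field names and file format are illustrative, not a compliance standard:

```python
import json
import time

def log_eval(path: str, query: str, context_ids: list[str], output: str) -> None:
    """Append one eval record (input, retrieved-context IDs, output)."""
    record = {
        "ts": time.time(),
        "input": query,
        "context_ids": context_ids,  # IDs rather than full text: smaller, and less PII
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Logging context IDs instead of full retrieved text keeps records small and reduces the amount of sensitive data retained, while still letting you reconstruct what the model saw from your document store.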
Implementation
This is implemented by tagging your GoldenDataset with a category
or tier field, and passing a weight map to your scorer.
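A sketch of that tagging approach: each golden example carries a tier field, and the scorer combines per-example pass/fail with a weight map. The dataset shape, example contents, and function names are illustrative assumptions:

```python
# Golden dataset with a `tier` tag per example (contents illustrative).
GOLDEN = [
    {"input": "What is the refund window?", "expected": "30 days", "tier": "high"},
    {"input": "Write a poem.", "expected": None, "tier": "low"},
]

WEIGHTS = {"critical": 50.0, "high": 20.0, "medium": 5.0, "low": 1.0}

def score(dataset, passed_fn, weights=WEIGHTS) -> float:
    """Risk-weighted pass rate: passed_fn(example) -> bool."""
    total = sum(weights[ex["tier"]] for ex in dataset)
    won = sum(weights[ex["tier"]] for ex in dataset if passed_fn(ex))
    return won / total if total else 0.0
```

In practice `passed_fn` would wrap whatever grader the team uses (exact match, semantic similarity, or an LLM judge); the weighting layer stays the same regardless.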