Quick Answer
The governance framework uses a Complexity-Criticality matrix to decide oversight level, testing depth, and release controls.
TL;DR
- Classify systems by complexity and criticality.
- Assign controls that match risk.
- Review and update as systems evolve.
FAQ
What is criticality?
Criticality reflects potential harm or business impact if the system fails.
What is complexity?
Complexity reflects how many components, dependencies, and decision paths the system has.
How do I choose the right controls?
Higher criticality requires stricter evals, audit trails, and human oversight.
The Complexity-Criticality Matrix
Not all AI features require the same level of scrutiny. We categorize use cases to determine the required "Evaluation Depth".
Consequence Weighting: Not All Errors Are Equal
Traditional accuracy metrics treat every query as equal. But in production:
- Query A: "Tell me a joke" → AI fails. (Annoyance)
- Query B: "Can I return this used item?" → AI lies. (Financial Loss)
If you have 50 queries like A and 1 query like B, getting only B wrong still looks like ~98% accuracy on paper (50 of 51 passes), while the business bears the full cost of the one failure that mattered.
The Weighted Formula

WeightedAccuracy = Σᵢ (RiskWeightᵢ × Passᵢ) / Σᵢ RiskWeightᵢ

where Passᵢ is 1 if query i passes and 0 if it fails, and RiskWeightᵢ is derived from the estimated dollar cost or reputation risk of a failure in that query's category.
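The weighting described above can be sketched in a few lines. The function and variable names here are illustrative, not part of any standard API:

```python
# Weighted accuracy sketch: each eval result is a (passed, risk_weight)
# pair, one per query. The weight comes from the query's risk tier.

def weighted_accuracy(results: list[tuple[bool, float]]) -> float:
    """Return risk-weighted pass rate over (passed, risk_weight) pairs."""
    total_weight = sum(w for _, w in results)
    if total_weight == 0:
        return 0.0
    return sum(w for passed, w in results if passed) / total_weight

# 50 low-risk passes plus 1 high-risk failure: ~98% unweighted,
# but only ~71% once the failure carries a weight of 20.0.
results = [(True, 1.0)] * 50 + [(False, 20.0)]
print(round(weighted_accuracy(results), 3))  # 0.714
```

Note how a single high-weight failure moves the score far more than dozens of low-weight passes, which is exactly the behavior the plain accuracy metric lacks.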
Risk Tiers
A common approach is a four-tier system that maps consequence severity to a numeric weight, for example:
| Tier | Example | Consequence | Weight |
|---|---|---|---|
| Critical | "How do I reset my pacemaker?" | Safety Risk / Lawsuit | 50.0 |
| High | "What is the refund window?" | Financial Loss if wrong | 20.0 |
| Medium | "How do I contact support?" | User Frustration | 5.0 |
| Low | "Write a poem." | Minor Annoyance | 1.0 |
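A tier table like the one above translates directly into a weight map in code. This is a minimal sketch; the tier names and values mirror this document's table rather than any external standard:

```python
# Weight map mirroring the risk-tier table above.
RISK_WEIGHTS = {
    "critical": 50.0,
    "high": 20.0,
    "medium": 5.0,
    "low": 1.0,
}

def weight_for(tier: str) -> float:
    # Unknown or untagged tiers default to the highest weight:
    # fail safe rather than silently under-weighting a risky query.
    return RISK_WEIGHTS.get(tier.lower(), RISK_WEIGHTS["critical"])

print(weight_for("High"))  # 20.0
```

Defaulting unknown tiers to the critical weight is a deliberate choice: a mis-tagged query should raise the score's sensitivity, not lower it.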
From Risk to Testable Hypotheses
Vague risks lead to vague evals. Turn "what if it fails?" into specific, testable criteria:
| Vague Risk | Testable Hypothesis | Eval Approach |
|---|---|---|
| "It might hallucinate" | "All claims are grounded in retrieved context" | Faithfulness score > 0.9 |
| "Users might get wrong info" | "Policy questions match official docs" | Golden set accuracy > 95% |
| "It could say something unsafe" | "Safety-critical queries trigger guardrails" | 100% escalation on safety set |
| "Quality might degrade" | "Week-over-week scores don't drop >5%" | Drift monitoring alerts |
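The last row of the table, week-over-week drift, is straightforward to make testable. A minimal sketch, assuming scores are collected per week and a 5% relative drop is the alert threshold (both are illustrative choices):

```python
# Drift check: alert when this week's mean eval score drops more than
# `max_drop` (relative) below last week's mean.

def drift_alert(last_week: list[float], this_week: list[float],
                max_drop: float = 0.05) -> bool:
    prev = sum(last_week) / len(last_week)
    curr = sum(this_week) / len(this_week)
    return (prev - curr) / prev > max_drop

print(drift_alert([0.90, 0.92, 0.91], [0.80, 0.82, 0.81]))  # True: ~11% drop
```

In production this check would typically run on a schedule against logged eval scores and page an owner rather than print.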
Evaluation Maturity Model
Where does your organization stand?
Level 1: Ad-Hoc (The "Vibes" Phase)
- Method: Engineers verify outputs manually during dev.
- Dataset: None.
- Risk: High. Regressions break features constantly.
Level 2: Managed (The "Unit Test" Phase)
- Method: Deterministic regression tests (string matching).
- Dataset: Small CSV of inputs/outputs.
- Risk: Medium. Catches bugs, but fails to measure hallucinations.
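Level 2 in practice looks like a deterministic regression test over a small CSV of input/expected pairs. This sketch uses exact string matching, which is exactly why it catches regressions but says nothing about hallucinations (the file layout and `model_call` callable are illustrative assumptions):

```python
import csv

def run_regression(path: str, model_call) -> list[str]:
    """Return the inputs whose model output does not exactly match
    the expected string in the CSV (columns: input, expected)."""
    failures = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if model_call(row["input"]).strip() != row["expected"].strip():
                failures.append(row["input"])
    return failures
```

The moment an answer is a valid paraphrase of the expected string, this test reports a false failure, which is the pressure that pushes teams toward Level 3's semantic metrics.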
Level 3: Optimized (The "Production" Phase)
- Method: LLM-as-a-Judge, Semantic Similarity, RAG Metrics.
- Dataset: Curated "Golden Dataset" from production logs.
- Risk: Low. Confidence in deployment.
Level 4: Governed (The "Compliance" Phase)
- Method: Adversarial Red Teaming, Bias detection, Privacy scanning (PII).
- Dataset: Synthetic attack vectors.
- Risk: Minimal. Ready for Regulated Industry (Finance/Health).
Regulatory Compliance
The EU AI Act mandates logging and record-keeping for high-risk systems, and the NIST AI RMF calls for ongoing measurement and monitoring. Your eval pipeline is your primary evidence of compliance.
Audit Tip
Always log the input, the retrieved context (document IDs), and the generated output for every eval run. If a regulator or customer asks why the model gave a dangerous answer six months or three years from now, these logs are your evidence.
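The audit tip above can be implemented as an append-only JSON Lines log. A minimal sketch; the field names and file format are illustrative, not a compliance standard:

```python
import json
import time

def log_eval(path: str, query: str, context_ids: list[str], output: str) -> None:
    """Append one eval record (input, retrieved-context IDs, output)."""
    record = {
        "ts": time.time(),
        "input": query,
        "context_ids": context_ids,  # IDs rather than full text: smaller, and less PII
        "output": output,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Logging context IDs instead of full retrieved text keeps records small and reduces the amount of sensitive data retained, while still letting you reconstruct what the model saw from your document store.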
Implementation
This is implemented by tagging your GoldenDataset with a category
or tier field, and passing a weight map to your scorer.
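A sketch of that tagging approach: each golden example carries a tier field, and the scorer combines per-example pass/fail with a weight map. The dataset shape, example contents, and function names are illustrative assumptions:

```python
# Golden dataset with a `tier` tag per example (contents illustrative).
GOLDEN = [
    {"input": "What is the refund window?", "expected": "30 days", "tier": "high"},
    {"input": "Write a poem.", "expected": None, "tier": "low"},
]

WEIGHTS = {"critical": 50.0, "high": 20.0, "medium": 5.0, "low": 1.0}

def score(dataset, passed_fn, weights=WEIGHTS) -> float:
    """Risk-weighted pass rate: passed_fn(example) -> bool."""
    total = sum(weights[ex["tier"]] for ex in dataset)
    won = sum(weights[ex["tier"]] for ex in dataset if passed_fn(ex))
    return won / total if total else 0.0
```

In practice `passed_fn` would wrap whatever grader the team uses (exact match, semantic similarity, or an LLM judge); the weighting layer stays the same regardless.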