This is the oversight room. It does not train models; it keeps score. Review how runs behaved, explore bias and drift with safe toy examples, and work through a governance checklist before trusting anything in the real world.
Total runs: 0
Completed: 0
Failed: 0
Avg duration: N/A
Avg accuracy (cls): -
Avg F1 (cls): -
Worst F1: -
Avg R² (reg): -
Avg RMSE (reg): -
Avg tokens in (LLM): -
Avg tokens out (LLM): -
Avg tool calls: -
Runs per studio
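Under the hood, a summary panel like this is just an aggregation over run records. A minimal sketch, assuming a hypothetical record schema; the field names and numbers below are illustrative, not the app's actual data model:

```python
from collections import Counter
from statistics import mean

# Hypothetical run records; the schema is an illustrative assumption.
runs = [
    {"studio": "classification", "status": "completed", "duration_s": 12.4,
     "accuracy": 0.91, "f1": 0.88},
    {"studio": "regression", "status": "completed", "duration_s": 8.1,
     "r2": 0.74, "rmse": 0.42},
    {"studio": "llm", "status": "failed", "duration_s": 3.0,
     "tokens_in": 512, "tokens_out": 128, "tool_calls": 2},
]

def avg(key):
    """Average of a field across the runs that report it; None renders as '-'."""
    vals = [r[key] for r in runs if key in r]
    return round(mean(vals), 3) if vals else None

summary = {
    "total_runs": len(runs),
    "completed": sum(r["status"] == "completed" for r in runs),
    "failed": sum(r["status"] == "failed" for r in runs),
    "avg_duration_s": avg("duration_s"),
    "avg_accuracy_cls": avg("accuracy"),
    "avg_f1_cls": avg("f1"),
    "worst_f1": min((r["f1"] for r in runs if "f1" in r), default=None),
    "avg_r2_reg": avg("r2"),
    "avg_rmse_reg": avg("rmse"),
    "avg_tokens_in_llm": avg("tokens_in"),
    "avg_tokens_out_llm": avg("tokens_out"),
    "avg_tool_calls": avg("tool_calls"),
    "runs_per_studio": Counter(r["studio"] for r in runs),
}
print(summary)
```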
These toy examples illustrate fairness concepts: adjust the decision threshold and watch how the group metrics move. Real fairness work needs richer context and data.
Accuracy by group
False positive rate
False negative rate
Accuracy treats all outcomes equally, while false positive and false negative rates show where the errors land. Differences in these rates between groups can hint at fairness gaps.
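A minimal sketch of what these three panels compute, assuming toy data, made-up group labels, and an adjustable decision threshold; none of it is the app's actual code:

```python
import numpy as np

def group_metrics(y_true, scores, groups, threshold=0.5):
    """Per-group accuracy, false positive rate, and false negative rate
    for binary predictions made by thresholding a score."""
    y_pred = (scores >= threshold).astype(int)
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = np.sum((yt == 1) & (yp == 1))
        tn = np.sum((yt == 0) & (yp == 0))
        fp = np.sum((yt == 0) & (yp == 1))
        fn = np.sum((yt == 1) & (yp == 0))
        out[g] = {
            "accuracy": (tp + tn) / len(yt),
            "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
            "fnr": fn / (fn + tp) if (fn + tp) else float("nan"),
        }
    return out

# Toy data: two groups whose score distributions differ slightly.
rng = np.random.default_rng(0)
groups = np.array(["A"] * 500 + ["B"] * 500)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=1000), 0, 1)

for t in (0.4, 0.5, 0.6):  # slide the threshold, watch group metrics move
    print(t, group_metrics(y_true, scores, groups, threshold=t))
```

Sliding the threshold trades false positives for false negatives, and rarely by the same amount in each group.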
Slide through scenarios to see how metrics degrade when data drifts. Monitoring and retraining plans keep you ready.
Accuracy: 91.0%
RMSE: 0.42
Shift intensity: 5%
Higher shift → the data distribution has moved further from the training baseline
Monitoring needed: Low
Metric trend
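A toy version of this simulation, under the assumption that drift shows up as the true decision boundary moving while the trained model stays frozen; the data, model choice, and shift values are all illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, boundary=0.0):
    # One feature; labels flip at `boundary`, which drift moves over time.
    X = rng.normal(0.0, 1.0, size=(n, 1))
    y = (X[:, 0] + rng.normal(0, 0.3, size=n) > boundary).astype(int)
    return X, y

# Fit once on the baseline, then never retrain.
X_train, y_train = make_data(5000, boundary=0.0)
model = LogisticRegression().fit(X_train, y_train)

for shift in (0.0, 0.05, 0.25, 0.5, 1.0):  # shift intensity
    X_test, y_test = make_data(2000, boundary=shift)
    print(f"shift={shift:.2f}  accuracy={model.score(X_test, y_test):.3f}")
```

The frozen model's accuracy falls as the boundary drifts away, which is the degradation the slider is meant to make visible.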
Purpose and context
- Who is affected if this model is wrong?
- Is the intended use clearly documented?
- Is there a human in the loop for critical decisions?

Data and consent
- Do you have permission to use the data?
- Are sensitive attributes handled appropriately?
- Is data retention limited and documented?

Performance and limitations
- Are the main metrics acceptable for the use case?
- Do you know where the model fails?
- Is there a rollback plan if performance drops?

Fairness and harm
- Have you checked for group disparities?
- Is there a mitigation plan for observed bias?
- Could misuse of this model cause harm?

Monitoring and rollback
- Are drift and uptime monitors defined?
- Are alerts routed to responsible owners?
- Is there a tested rollback or disable switch?
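One way to make a checklist like this enforceable is to store it as data and gate promotion on open items. A minimal sketch that mirrors the sections above; the `readiness` function and answer format are hypothetical:

```python
# The checklist as data; section and question text mirror the list above.
CHECKLIST = {
    "Purpose and context": [
        "Who is affected if this model is wrong?",
        "Is the intended use clearly documented?",
        "Is there a human in the loop for critical decisions?",
    ],
    "Data and consent": [
        "Do you have permission to use the data?",
        "Are sensitive attributes handled appropriately?",
        "Is data retention limited and documented?",
    ],
    "Performance and limitations": [
        "Are the main metrics acceptable for the use case?",
        "Do you know where the model fails?",
        "Is there a rollback plan if performance drops?",
    ],
    "Fairness and harm": [
        "Have you checked for group disparities?",
        "Is there a mitigation plan for observed bias?",
        "Could misuse of this model cause harm?",
    ],
    "Monitoring and rollback": [
        "Are drift and uptime monitors defined?",
        "Are alerts routed to responsible owners?",
        "Is there a tested rollback or disable switch?",
    ],
}

def readiness(answers):
    """answers maps each question to True once the item has been addressed."""
    open_items = [q for qs in CHECKLIST.values() for q in qs
                  if not answers.get(q, False)]
    return {"ready": not open_items, "open_items": open_items}

# Example: every item addressed -> ready with no open items.
print(readiness({q: True for qs in CHECKLIST.values() for q in qs}))
```

Keeping the questions as data means the same list can drive the review UI and block a deploy.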
Select a run to inspect.