This is the oversight room. It does not train models; it keeps score. Review how runs behaved, explore bias and drift with safe toy examples, and work through a governance checklist before trusting anything in the real world.
Total runs: 0
Completed: 0
Failed: 0
Avg duration: N/A
Avg accuracy (cls): -
Avg F1 (cls): -
Worst F1: -
Avg R² (reg): -
Avg RMSE (reg): -
Avg tokens in (LLM): -
Avg tokens out (LLM): -
Avg tool calls: -
Runs per studio
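Under the hood, a summary panel like this is just an aggregation over run records. A minimal sketch, assuming a hypothetical record schema; the field names and numbers below are illustrative, not the app's actual data model:

```python
from collections import Counter
from statistics import mean

# Hypothetical run records; the schema is an illustrative assumption.
runs = [
    {"studio": "classification", "status": "completed", "duration_s": 12.4,
     "accuracy": 0.91, "f1": 0.88},
    {"studio": "regression", "status": "completed", "duration_s": 8.1,
     "r2": 0.74, "rmse": 0.42},
    {"studio": "llm", "status": "failed", "duration_s": 3.0,
     "tokens_in": 512, "tokens_out": 128, "tool_calls": 2},
]

def avg(key):
    """Average of a field across the runs that report it; None renders as '-'."""
    vals = [r[key] for r in runs if key in r]
    return round(mean(vals), 3) if vals else None

summary = {
    "total_runs": len(runs),
    "completed": sum(r["status"] == "completed" for r in runs),
    "failed": sum(r["status"] == "failed" for r in runs),
    "avg_duration_s": avg("duration_s"),
    "avg_accuracy_cls": avg("accuracy"),
    "avg_f1_cls": avg("f1"),
    "worst_f1": min((r["f1"] for r in runs if "f1" in r), default=None),
    "avg_r2_reg": avg("r2"),
    "avg_rmse_reg": avg("rmse"),
    "avg_tokens_in_llm": avg("tokens_in"),
    "avg_tokens_out_llm": avg("tokens_out"),
    "avg_tool_calls": avg("tool_calls"),
    "runs_per_studio": Counter(r["studio"] for r in runs),
}
print(summary)
```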
These toy examples illustrate fairness concepts: adjust the decision threshold and watch how the group metrics move. Real fairness work needs richer context and data.
Accuracy by group
False positive rate
False negative rate
Accuracy treats all outcomes equally, while false positive and false negative rates show where the errors land. Differences in these rates between groups can hint at fairness gaps.
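A minimal sketch of what these three panels compute, assuming toy data, made-up group labels, and an adjustable decision threshold; none of it is the app's actual code:

```python
import numpy as np

def group_metrics(y_true, scores, groups, threshold=0.5):
    """Per-group accuracy, false positive rate, and false negative rate
    for binary predictions made by thresholding a score."""
    y_pred = (scores >= threshold).astype(int)
    out = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tp = np.sum((yt == 1) & (yp == 1))
        tn = np.sum((yt == 0) & (yp == 0))
        fp = np.sum((yt == 0) & (yp == 1))
        fn = np.sum((yt == 1) & (yp == 0))
        out[g] = {
            "accuracy": (tp + tn) / len(yt),
            "fpr": fp / (fp + tn) if (fp + tn) else float("nan"),
            "fnr": fn / (fn + tp) if (fn + tp) else float("nan"),
        }
    return out

# Toy data: two groups whose score distributions differ slightly.
rng = np.random.default_rng(0)
groups = np.array(["A"] * 500 + ["B"] * 500)
y_true = rng.integers(0, 2, size=1000)
scores = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=1000), 0, 1)

for t in (0.4, 0.5, 0.6):  # slide the threshold, watch group metrics move
    print(t, group_metrics(y_true, scores, groups, threshold=t))
```

Sliding the threshold trades false positives for false negatives, and rarely by the same amount in each group.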
Slide through scenarios to see how metrics degrade when data drifts. Monitoring and retraining plans keep you ready.
Accuracy: 91.0%
RMSE: 0.42
Shift intensity: 5%
Higher shift → the data distribution has moved further from the training baseline
Monitoring needed: Low
Metric trend
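A toy version of this simulation, under the assumption that drift shows up as the true decision boundary moving while the trained model stays frozen; the data, model choice, and shift values are all illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, boundary=0.0):
    # One feature; labels flip at `boundary`, which drift moves over time.
    X = rng.normal(0.0, 1.0, size=(n, 1))
    y = (X[:, 0] + rng.normal(0, 0.3, size=n) > boundary).astype(int)
    return X, y

# Fit once on the baseline, then never retrain.
X_train, y_train = make_data(5000, boundary=0.0)
model = LogisticRegression().fit(X_train, y_train)

for shift in (0.0, 0.05, 0.25, 0.5, 1.0):  # shift intensity
    X_test, y_test = make_data(2000, boundary=shift)
    print(f"shift={shift:.2f}  accuracy={model.score(X_test, y_test):.3f}")
```

The frozen model's accuracy falls as the boundary drifts away, which is the degradation the slider is meant to make visible.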
Purpose and context
- Who is affected if this model is wrong?
- Is the intended use clearly documented?
- Is there a human in the loop for critical decisions?

Data and consent
- Do you have permission to use the data?
- Are sensitive attributes handled appropriately?
- Is data retention limited and documented?

Performance and limitations
- Are the main metrics acceptable for the use case?
- Do you know where the model fails?
- Is there a rollback plan if performance drops?

Fairness and harm
- Have you checked for group disparities?
- Is there a mitigation plan for observed bias?
- Could misuse of this model cause harm?

Monitoring and rollback
- Are drift and uptime monitors defined?
- Are alerts routed to responsible owners?
- Is there a tested rollback or disable switch?
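One way to make a checklist like this enforceable is to store it as data and gate promotion on open items. A minimal sketch that mirrors the sections above; the `readiness` function and answer format are hypothetical:

```python
# The checklist as data; section and question text mirror the list above.
CHECKLIST = {
    "Purpose and context": [
        "Who is affected if this model is wrong?",
        "Is the intended use clearly documented?",
        "Is there a human in the loop for critical decisions?",
    ],
    "Data and consent": [
        "Do you have permission to use the data?",
        "Are sensitive attributes handled appropriately?",
        "Is data retention limited and documented?",
    ],
    "Performance and limitations": [
        "Are the main metrics acceptable for the use case?",
        "Do you know where the model fails?",
        "Is there a rollback plan if performance drops?",
    ],
    "Fairness and harm": [
        "Have you checked for group disparities?",
        "Is there a mitigation plan for observed bias?",
        "Could misuse of this model cause harm?",
    ],
    "Monitoring and rollback": [
        "Are drift and uptime monitors defined?",
        "Are alerts routed to responsible owners?",
        "Is there a tested rollback or disable switch?",
    ],
}

def readiness(answers):
    """answers maps each question to True once the item has been addressed."""
    open_items = [q for qs in CHECKLIST.values() for q in qs
                  if not answers.get(q, False)]
    return {"ready": not open_items, "open_items": open_items}

# Example: every item addressed -> ready with no open items.
print(readiness({q: True for qs in CHECKLIST.values() for q in qs}))
```

Keeping the questions as data means the same list can drive the review UI and block a deploy.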
Select a run to inspect.