Evaluation & Governance Lab

This is the oversight room. It does not train models; it keeps score. Review how runs behaved, explore bias and drift with safe examples, and walk through a governance checklist before trusting anything in the real world.

Security reminder: This studio is for education and experimentation. Do not upload production data or secrets. Outputs are demos; review them before using them anywhere safety-critical or financial.

1. Metrics board

Total runs: 0

Completed: 0

Failed: 0

Avg duration: N/A

Avg accuracy (cls): -

Avg F1 (cls): -

Worst F1: -

Avg R² (reg): -

Avg RMSE (reg): -

Avg tokens in (LLM): -

Avg tokens out (LLM): -

Avg tool calls: -

Runs per studio
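
As an illustration of how board figures like these could be derived, here is a minimal sketch. The run record shape (`studio`, `status`, `f1`) and the studio names are assumptions made for the example, not the studio's actual schema. Values with no data come back as `None`, which mirrors the "-" placeholders on an empty board.

```python
from collections import Counter
from statistics import mean

def summarize_runs(runs):
    """Aggregate a list of run records into board-style summary figures.

    Each record is a dict with hypothetical keys: "studio", "status",
    and an optional "f1" for classification runs.
    """
    completed = [r for r in runs if r["status"] == "completed"]
    failed = [r for r in runs if r["status"] == "failed"]
    # Only completed runs that report an F1 score contribute to F1 figures.
    f1_scores = [r["f1"] for r in completed if r.get("f1") is not None]
    return {
        "total_runs": len(runs),
        "completed": len(completed),
        "failed": len(failed),
        "avg_f1": mean(f1_scores) if f1_scores else None,
        "worst_f1": min(f1_scores) if f1_scores else None,
        "runs_per_studio": dict(Counter(r["studio"] for r in runs)),
    }

# Made-up example records, for illustration only.
example_runs = [
    {"studio": "classification", "status": "completed", "f1": 0.8},
    {"studio": "classification", "status": "completed", "f1": 0.6},
    {"studio": "regression", "status": "failed"},
]
summary = summarize_runs(example_runs)
print(summary)
```

Reporting the worst F1 alongside the average is useful because an average can hide a single badly performing run.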

2. Bias and fairness explorer

Toy examples to illustrate fairness concepts. Adjust the threshold to see how group metrics move. Real fairness work needs richer context and data.

Threshold: 0.50

Charts: accuracy by group, false positive rate, false negative rate.

Accuracy weights every error equally; false positive and false negative rates show which kind of error each group receives. Differences in these rates between groups can hint at fairness gaps.
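
The effect of the threshold on group metrics can be sketched in a few lines. This is a toy illustration only: the scores, labels, and the group names "A" and "B" are invented, and the function name `group_metrics` is a hypothetical helper, not part of the studio.

```python
from collections import defaultdict

def group_metrics(scores, labels, groups, threshold=0.5):
    """Return accuracy, false positive rate, and false negative rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for score, label, group in zip(scores, labels, groups):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            key = "tp" if label == 1 else "fp"
        else:
            key = "fn" if label == 1 else "tn"
        counts[group][key] += 1

    result = {}
    for group, c in counts.items():
        total = sum(c.values())
        negatives = c["fp"] + c["tn"]  # actual negatives in this group
        positives = c["tp"] + c["fn"]  # actual positives in this group
        result[group] = {
            "accuracy": (c["tp"] + c["tn"]) / total,
            "fpr": c["fp"] / negatives if negatives else None,
            "fnr": c["fn"] / positives if positives else None,
        }
    return result

# Invented toy data: four examples per group.
scores = [0.9, 0.6, 0.4, 0.2, 0.8, 0.55, 0.45, 0.3]
labels = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

at_050 = group_metrics(scores, labels, groups, threshold=0.5)
at_070 = group_metrics(scores, labels, groups, threshold=0.7)
print(at_050)
print(at_070)
```

In this toy data, raising the threshold from 0.5 to 0.7 removes group A's false positive but introduces a false negative for group B, so the same setting moves errors differently for each group.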

3. Drift and robustness simulator

Slide through scenarios to see how metrics degrade when data drifts. Monitoring and retraining plans keep you ready.

Accuracy: 91.0%

RMSE: 0.42

Shift intensity: 5% (higher shift → distribution moved further from baseline)

Monitoring needed: Low

Metric trend
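
To make the idea of degradation under drift concrete, here is a small sketch under stated assumptions. It is not the simulator's own model: it assumes two balanced classes with unit-variance Gaussian features, a classifier with a fixed decision threshold, and drift that adds the same offset to every feature value. The function name and parameters are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def accuracy_under_shift(shift, threshold=1.0, class1_mean=2.0):
    """Expected accuracy of a fixed-threshold classifier after a mean shift.

    Assumed setup: class 0 features follow N(shift, 1), class 1 features
    follow N(class1_mean + shift, 1), and the classes are balanced.
    The classifier predicts 1 when the feature is >= threshold.
    """
    # Class 0 is classified correctly when its feature falls below the threshold.
    correct_0 = STANDARD_NORMAL.cdf(threshold - shift)
    # Class 1 is classified correctly when its feature is at or above the threshold.
    correct_1 = 1.0 - STANDARD_NORMAL.cdf(threshold - class1_mean - shift)
    return 0.5 * (correct_0 + correct_1)

shifts = [0.0, 0.5, 1.0, 2.0]
accuracies = [accuracy_under_shift(s) for s in shifts]
for s, a in zip(shifts, accuracies):
    print(f"shift={s:.1f} accuracy={a:.3f}")
```

Because the threshold was tuned for the unshifted data and never moves, accuracy falls as the shift grows. That gap is what monitoring is meant to detect and retraining is meant to close.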

4. Governance checklist and risk view

Risk: Low (based on answers)

Purpose and context

Who is affected if this model is wrong?

Is the intended use clearly documented?

Is there a human in the loop for critical decisions?

Data and consent

Do you have permission to use the data?

Are sensitive attributes handled appropriately?

Is data retention limited and documented?

Performance and limitations

Are the main metrics acceptable for the use case?

Do you know where the model fails?

Is there a rollback plan if performance drops?

Fairness and harm

Have you checked for group disparities?

Is there a mitigation plan for observed bias?

Could misuse of this model cause harm?

Monitoring and rollback

Are drift and uptime monitors defined?

Are alerts routed to responsible owners?

Is there a tested rollback or disable switch?
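
One way a risk level could be derived from checklist answers is sketched below. The scoring rule is an assumption made for illustration: the page does not say how its risk view is computed, and the cutoffs (`medium_cutoff`, `high_cutoff`) are hypothetical. Each answer is recorded as `True` when the item has been satisfactorily addressed, which also covers questions that are not simple yes/no.

```python
def risk_level(answers, medium_cutoff=0.2, high_cutoff=0.5):
    """Map checklist answers to a coarse risk level.

    answers: dict of question -> True if the item is satisfactorily addressed.
    The cutoffs are illustrative assumptions, not the studio's actual rule.
    """
    if not answers:
        raise ValueError("at least one checklist answer is required")
    unresolved = sum(1 for addressed in answers.values() if not addressed)
    unresolved_share = unresolved / len(answers)
    if unresolved_share >= high_cutoff:
        return "High"
    if unresolved_share >= medium_cutoff:
        return "Medium"
    return "Low"

# Example answers for a few of the checklist questions above.
answers = {
    "Is the intended use clearly documented?": True,
    "Do you have permission to use the data?": True,
    "Do you know where the model fails?": True,
    "Have you checked for group disparities?": False,
    "Is there a tested rollback or disable switch?": True,
}
level = risk_level(answers)
print(level)
```

A share-based rule like this treats every question as equally important. A real review would likely weight items such as consent or human oversight more heavily.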

5. Run inspection panel

Select a run to inspect.

Need to revisit a run? Open the Control Room for the full log of jobs across studios.