Exercises — Module 07: Evaluations
7.1 — Build a ten-case suite
Pick a harness (yours or the capstone). Write ten eval cases across three types (sketched below): three golden cases (exact trace or output match), three rubric cases (scored by an LLM-as-judge), and four task-success cases (verified by side-effect checks).
Run them. Which are flaky? Why?
Deliverable: the suite + a flake analysis.
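A minimal sketch of the three case types, assuming the harness returns a trace dict with an `output` key. `EvalCase` and the factory helpers are illustrative names, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[dict], bool]  # trace in, pass/fail out

def golden(name: str, prompt: str, expected: str) -> EvalCase:
    # Golden: the final output must match the expected string exactly.
    return EvalCase(name, prompt, lambda trace: trace["output"] == expected)

def rubric(name: str, prompt: str, judge: Callable[[str], int], threshold: int = 4) -> EvalCase:
    # Rubric: an LLM judge scores the output (say, 1-5); pass at or above threshold.
    return EvalCase(name, prompt, lambda trace: judge(trace["output"]) >= threshold)

def task_success(name: str, prompt: str, happened: Callable[[], bool]) -> EvalCase:
    # Task-success: ignore the transcript; verify the side effect actually
    # occurred (file written, row inserted, sandbox email sent).
    return EvalCase(name, prompt, lambda trace: happened())
```

Expect the golden cases to flake first: exact-match checks fail on any nondeterministic wording change, which feeds directly into the flake analysis.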
7.2 — Calibrate an LLM judge
Label 20 cases by hand against your rubric. Build an LLM-as-judge prompt and score the same cases with it. Compare judge scores to your human scores, and adjust the prompt until agreement is ≥80%, deciding up front what counts as agreement (e.g. exact score match). Document the prompt and the disagreements. A measurement sketch follows the deliverable.
Deliverable: prompt, agreement numbers, disagreement analysis.
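One way to measure agreement, assuming integer rubric scores on both sides. The score lists in the usage block are stand-in data:

```python
def agreement(human: list[int], judge: list[int]) -> float:
    """Fraction of cases where judge and human scores match exactly.
    Loosen to `abs(h - j) <= 1` if your rubric has fine gradations."""
    assert len(human) == len(judge), "score lists must be paired per case"
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def disagreements(names, human, judge):
    # The cases worth reading closely: where and by how much did the judge diverge?
    return [(n, h, j) for n, h, j in zip(names, human, judge) if h != j]

if __name__ == "__main__":
    human = [4, 5, 2, 3, 5]
    judge = [4, 4, 2, 3, 5]
    print(f"agreement: {agreement(human, judge):.0%}")  # 80%
    for name, h, j in disagreements(["c1", "c2", "c3", "c4", "c5"], human, judge):
        print(f"{name}: human={h} judge={j}")
```

Exact-match agreement is strict; if you loosen the definition, report the loosened version alongside the number.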
7.3 — Replay
Implement a trace recorder and replayer (one possible shape is sketched below). Record 10 real sessions, save them, then re-run them through your current harness.
- How many behave identically?
- Among those that do not, what changed?
Deliverable: replay tool + analysis.
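A minimal record/replay sketch, assuming one JSONL file per session with one turn per line, and a harness exposed as a `run(input) -> output` callable. The file layout and function names are assumptions, not requirements:

```python
import json
from pathlib import Path

def record(session_id: str, turns: list[dict], trace_dir: Path = Path("traces")) -> None:
    # One JSONL file per session; each line is a {"input": ..., "output": ...} turn.
    trace_dir.mkdir(exist_ok=True)
    with open(trace_dir / f"{session_id}.jsonl", "w") as f:
        for turn in turns:
            f.write(json.dumps(turn) + "\n")

def replay(session_id: str, run, trace_dir: Path = Path("traces")) -> list[dict]:
    # Re-run each recorded input through the current harness; collect divergences.
    diffs = []
    with open(trace_dir / f"{session_id}.jsonl") as f:
        for line in f:
            turn = json.loads(line)
            new_output = run(turn["input"])
            if new_output != turn["output"]:
                diffs.append({"input": turn["input"],
                              "recorded": turn["output"], "current": new_output})
    return diffs
```

An empty diff list means the session behaved identically; a non-empty one is the raw material for the second question.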
7.4 — Hold-out set
Split your eval set into “dev” and “held-out” sets (80/20). Iterate on your harness using only dev scores. Every 5 iterations, peek at held-out. Plot the dev/held-out gap over iterations and watch for overfitting (a tracking sketch follows the deliverable).
Deliverable: plot + reflection.
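A sketch of the split-and-track loop. `improve` and `score` stand in for your own iteration and scoring code; matplotlib is one plotting option among many:

```python
import random
import matplotlib.pyplot as plt

def split(cases: list, dev_frac: float = 0.8, seed: int = 7) -> tuple[list, list]:
    # Fixed seed: the held-out set must stay stable across every iteration.
    shuffled = random.Random(seed).sample(cases, len(cases))
    cut = int(len(cases) * dev_frac)
    return shuffled[:cut], shuffled[cut:]

def track_gap(dev, held_out, improve, score, iterations=25, peek_every=5):
    """improve(dev) tweaks the harness; score(cases) returns a mean score.
    Only dev is ever passed to improve(); held-out is scored, never tuned on."""
    xs, dev_scores, held_scores = [], [], []
    for i in range(1, iterations + 1):
        improve(dev)
        if i % peek_every == 0:
            xs.append(i)
            dev_scores.append(score(dev))
            held_scores.append(score(held_out))
    plt.plot(xs, dev_scores, label="dev")
    plt.plot(xs, held_scores, label="held-out")
    plt.xlabel("iteration")
    plt.ylabel("mean score")
    plt.legend()
    plt.savefig("gap.png")
```

If the dev line climbs while the held-out line flattens or drops, you are fitting your harness to the dev set rather than to the task.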
7.5 — Regression CI
Wire your eval suite into CI so that failing evals block merges. Add a “flake quarantine” so known-flaky cases report failures but do not block (a gate sketch follows the deliverable).
Deliverable: CI config + a pull request that is correctly blocked by the evals.
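One way to implement the gate as a script your CI step runs; the nonzero exit code is what blocks the merge, regardless of CI vendor. The quarantine contents and the stand-in results are illustrative:

```python
import sys

# Known-flaky cases: their failures are logged but never block the merge.
QUARANTINE = {"rubric_tone_check", "replay_session_07"}

def gate(results: dict[str, bool]) -> int:
    """results maps case name -> passed. Returns the process exit code."""
    blocking = [n for n, ok in results.items() if not ok and n not in QUARANTINE]
    flaky = [n for n, ok in results.items() if not ok and n in QUARANTINE]
    for n in flaky:
        print(f"QUARANTINED FAIL (non-blocking): {n}")
    for n in blocking:
        print(f"FAIL: {n}")
    return 1 if blocking else 0  # any nonzero exit fails the CI step

if __name__ == "__main__":
    # Stand-in results; in CI, feed in your real suite's per-case outcomes.
    sys.exit(gate({"golden_greeting": True, "rubric_tone_check": False}))
```

Keeping the quarantine list in code rather than in a dashboard means every addition to it goes through review, which is the point: flakes should be visible debt.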