Module 10 — Capstone: build your own harness
Goals
Build a working harness end-to-end, from empty directory to passing evals, demonstrating fluency with every subsystem. This is the exercise that turns “has read about harness engineering” into “has done harness engineering.”
1. The assignment
Build `micro-harness`: a CLI agent that can:
- Read and write files in a scoped directory.
- Run shell commands through a sandbox.
- Search the web via a mock tool.
- Route to one of two subagents (`coder`, `researcher`) based on the user’s request.
- Respect a permission model with `allow | deny | ask` semantics.
- Be driven by an eval harness you also build.
Starter code is in `../capstone/`. Read it, then extend it.
You do not need to use the Anthropic SDK specifically — any model provider is fine — but the starter uses the Anthropic SDK as a reference. Substitute if you prefer.
2. Milestones
Tackle these in order; each milestone references earlier modules.
Milestone 1 — The minimal loop (Module 01)
- Implement the core `while` loop in `agent.py`.
- Add termination: end-turn, max-turns, max-wall-clock.
- Implement streaming and a cancellation token.
- Shape errors: retry transient, surface semantic, propagate catastrophic.
Done when: you can run `python -m micro_harness "say hello"` and get a streamed response; you can Ctrl-C mid-stream and it exits cleanly; a tool that always 500s does not crash the loop.
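The loop’s shape matters more than its details. A minimal sketch of the termination logic, assuming hypothetical `run_model` and `run_tools` callables standing in for your provider call and tool executor (cancellation and streaming are omitted for brevity):

```python
import time

def agent_loop(run_model, run_tools, max_turns=20, max_seconds=120):
    """Core agent loop with the three stop reasons from the milestone."""
    history = []
    deadline = time.monotonic() + max_seconds
    for _ in range(max_turns):
        if time.monotonic() > deadline:
            return ("max-wall-clock", history)
        reply = run_model(history)          # may raise; shape errors here
        history.append(reply)
        if reply["stop_reason"] == "end_turn":
            return ("end-turn", history)
        # Model asked for tools: run them and feed results back.
        history.extend(run_tools(reply["tool_calls"]))
    return ("max-turns", history)
```

Returning the stop reason explicitly, rather than just the history, makes the later milestones (logging, evals, replay) much easier to wire up.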
Milestone 2 — The tool registry (Module 02)
- Flesh out the tool registry with JSONSchema validation.
- Add tools: `read_file`, `write_file`, `list_dir`, `run_shell`, `web_search` (mocked).
- Support parallel tool calls.
- Add versioning metadata.
Done when: tools with invalid arguments return structured errors; parallel tools run concurrently; tool handlers cannot access files outside the scoped directory.
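One possible shape for the registry, with a deliberately minimal `validate` step standing in for full JSONSchema validation (in the real milestone you would use the `jsonschema` package); the class and field names here are illustrative:

```python
class ToolRegistry:
    """Maps tool names to schemas, handlers, and version metadata."""

    def __init__(self):
        self._tools = {}

    def register(self, name, schema, handler, version="1.0"):
        self._tools[name] = {"schema": schema, "handler": handler,
                             "version": version}

    def call(self, name, args):
        tool = self._tools.get(name)
        if tool is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        # Minimal stand-in for JSONSchema: required fields + type checks.
        for field, ftype in tool["schema"].items():
            if field not in args or not isinstance(args[field], ftype):
                return {"ok": False, "error": f"invalid argument: {field}"}
        return {"ok": True, "result": tool["handler"](**args)}
```

The key property to preserve: a bad call returns a structured error the model can read and correct, never an exception that kills the loop.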
Milestone 3 — Context and caching (Module 03)
- Add a compaction policy: keep last K turns verbatim, summarize older turns.
- Add prompt caching with cache breakpoints at end-of-system, end-of-tools.
- Measure: before/after token count and cache hit rate.
Done when: a 40-turn conversation fits in context; a second run of the same conversation hits cache on the prefix.
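The keep-last-K policy can be sketched in a few lines; `summarize` is a stub where you would call a cheap model to compress the older turns:

```python
def compact(history, keep_last=8,
            summarize=lambda turns: f"[summary of {len(turns)} turns]"):
    """Keep the last K turns verbatim; collapse the rest into a summary."""
    if len(history) <= keep_last:
        return history
    old, recent = history[:-keep_last], history[-keep_last:]
    return [{"role": "system", "content": summarize(old)}] + recent
```

Note that compaction invalidates any cache prefix that covered the summarized region, which is why the cache breakpoints in this milestone sit at end-of-system and end-of-tools: those prefixes are stable across compactions.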
Milestone 4 — Subagents (Module 04)
- Split into `router`, `coder`, `researcher`.
- Router is a lightweight agent with a tiny prompt.
- Subagents receive a self-sufficient brief and return a compressed result.
- Subagents receive a self-sufficient brief and return a compressed result.
- Budget each subagent: max turns, max tokens, max wall-clock.
Done when: "fix the bug in foo.py" routes to coder; "summarize the latest news on X" routes to researcher; the parent context stays under budget.
Milestone 5 — Permissions and safety (Module 05)
- Implement a permission layer with `allow | deny | ask`.
- Gate `run_shell` on an argument-level policy (denylist of obvious dangers; allowlist of safe patterns).
- Gate `write_file` outside a whitelisted directory.
- Add hooks: pre/post-tool logging, redaction of anything that looks like an API key in tool outputs.
- Simulate a prompt injection in a file the agent reads; confirm it is contained.
Done when: a deliberately crafted adversarial file cannot cause writes outside the scope; the permission layer logs every decision with a reason.
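The argument-level policy for `run_shell` might look like the following sketch; the specific patterns are illustrative, not a complete denylist, and order matters: deny wins over allow, and anything unrecognized falls through to `ask`:

```python
import re

# Illustrative patterns only -- a real policy needs many more.
DENY_SHELL = (re.compile(r"\brm\s+-rf\b"), re.compile(r"\bsudo\b"))
ALLOW_SHELL = (re.compile(r"^(ls|cat|grep|python)\b"),)

def shell_decision(command):
    """Return 'deny', 'allow', or 'ask' for a run_shell command string."""
    if any(p.search(command) for p in DENY_SHELL):
        return "deny"
    if any(p.match(command) for p in ALLOW_SHELL):
        return "allow"
    return "ask"
```

Defaulting to `ask` rather than `allow` is the load-bearing choice: an incomplete allowlist degrades to friction, while an incomplete denylist degrades to disaster.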
Milestone 6 — Prompts as code (Module 06)
- Split the system prompt into named layers (identity, behavior, tool-use policy, safety). Each in its own file.
- Version the layers.
- Add a CLI flag to swap prompt versions for A/B testing.
Done when: `git log` on any prompt layer shows its history; running with `--prompt-version v1` vs. `v2` produces measurably different behavior.
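Assembling the layered prompt can be as simple as concatenating versioned files; the directory layout below (one directory per version, one file per layer) is one possible convention, not something the starter prescribes:

```python
from pathlib import Path

# Layer order matters: identity first, safety last.
LAYERS = ("identity", "behavior", "tool_use_policy", "safety")

def build_system_prompt(prompt_dir, version="v1"):
    """Concatenate the named layer files for one prompt version."""
    parts = []
    for layer in LAYERS:
        path = Path(prompt_dir) / version / f"{layer}.md"
        parts.append(path.read_text())
    return "\n\n".join(parts)
```

Because each layer is its own file, `git log` and `git blame` work per-layer, and the A/B flag reduces to choosing a directory.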
Milestone 7 — Evals (Module 07)
- Build an eval runner.
- Write ten eval cases: three golden (exact tool-call trace match), three rubric (LLM-as-judge), four task-success (side-effect verification).
- Run the suite in CI. Require passing for merges.
Done when: `make eval` prints a green scorecard; introducing a regression (e.g., dropping a permission check) fails the suite.
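A minimal runner that dispatches on the three case types named above; `run_agent` stands in for your harness entry point and `judge` for an LLM-as-judge call, and the case-dict keys are illustrative:

```python
def run_eval(cases, run_agent, judge):
    """Run every eval case; return {case_name: passed} for the scorecard."""
    results = {}
    for case in cases:
        out = run_agent(case["prompt"])
        if case["kind"] == "golden":
            # Exact tool-call trace match.
            results[case["name"]] = out["trace"] == case["expected_trace"]
        elif case["kind"] == "rubric":
            # LLM-as-judge scores the final text against a rubric.
            results[case["name"]] = judge(case["rubric"], out["text"])
        else:
            # Task-success: verify the side effect directly on disk/env.
            results[case["name"]] = case["verify"]()
    return results
```

The CI gate is then one line: fail the build if any value in the returned dict is falsy.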
Milestone 8 — Observability (Module 08)
- Emit structured logs for every turn and tool call.
- Record a replay-able trace for every run.
- Implement `micro_harness replay <trace_id>`, which re-runs a past trace through the current harness.
Done when: you can replay a week-old trace and see whether today’s harness handles it the same or differently.
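The trace format that makes replay tractable is append-only JSON lines keyed by a trace id; one sketch, with illustrative class and field names:

```python
import json
import time
import uuid

class TraceRecorder:
    """Append every turn and tool call as one JSON line, keyed by trace id."""

    def __init__(self, path):
        self.path = path
        self.trace_id = uuid.uuid4().hex

    def record(self, event_type, payload):
        with open(self.path, "a") as f:
            f.write(json.dumps({
                "trace_id": self.trace_id,
                "ts": time.time(),
                "type": event_type,
                "payload": payload,
            }) + "\n")

def load_trace(path, trace_id):
    """Read back one trace's events, in order, for replay."""
    with open(path) as f:
        return [e for e in map(json.loads, f)
                if e["trace_id"] == trace_id]
```

The replay command then feeds the recorded user inputs and tool results back through today’s harness and diffs the resulting decisions against the recorded ones.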
Milestone 9 — Architecture review (Module 08)
- Write a one-page architecture doc: the subsystems, their interfaces, the decisions you made and rejected.
- Draw the agent graph.
- Identify the three places you would have done it differently on a second attempt.
Done when: someone who has not read this trainer could read your architecture doc and modify your harness without getting lost.
Milestone 10 — Frontier stretch (Module 09)
Pick one:
- Add a second model for cheap-then-escalate model cascade.
- Add a persistent memory file the agent reads on every run and writes summaries to at session end.
- Add shadow-traffic comparison: run two versions of the harness side-by-side on a fixture set and diff their behaviors.
- Add computer-use: one screenshot tool, one click-at-coords tool, and a test harness that drives a headless browser.
Whatever you pick, write up what worked, what did not, and what you would try next.
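For the first option, the cascade’s control flow is small enough to sketch; `cheap_model` and `strong_model` are stand-ins for your two providers, and returning a self-reported confidence from the cheap model is one possible escalation signal (logprobs or a verifier model are common alternatives):

```python
def cascade(prompt, cheap_model, strong_model, threshold=0.8):
    """Try the cheap model first; escalate when its confidence is low.

    Returns (answer, which_model) so evals can track the escalation rate.
    """
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    return strong_model(prompt), "strong"
```

The interesting engineering is not this function but the measurement around it: what fraction of traffic escalates, and how often the cheap answer would have been good enough anyway.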
3. Grading rubric
Grade yourself honestly against this rubric (this is what an interviewer would use):
| Dimension | 0 (missing) | 1 (present) | 2 (thoughtful) | 3 (excellent) |
|---|---|---|---|---|
| Loop hygiene | No termination | Basic | Handles all three stop reasons | Also handles cancellation and partial termination |
| Tool registry | Ad-hoc | Validates schemas | Structured errors, timeouts, parallelism | Versioning, metrics, MCP-compatible |
| Context mgmt | All-or-nothing | Sliding window | Compaction + caching | Adaptive policy with measurement |
| Subagents | None | One subagent | Router + specialists | Budgets, briefing, compressed returns |
| Permissions | None | Ask-on-everything | Tiered by blast radius | Argument-sensitive, logged, with recovery |
| Prompts | One blob | Layered | Versioned and tested | Separate owners, measured |
| Evals | None | Runs manually | In CI, multiple types | Replay, drift detection, dashboard |
| Architecture | Not drawn | Sketched | Justified decisions | Trade-offs articulated, migration-ready |
A capstone at all 2s is production-grade. Any 3s are bonus points and show deep engagement.
4. Reflection
After the capstone, re-read your Module 00 one-paragraph definition of harness engineering. Rewrite it. The delta between the two versions is what you learned. Save both.
5. Where to go next
- Ship it. Deploy the harness to yourself for a month. Use it daily. Take notes on every failure.
- Read production harnesses. Claude Code, OpenAI Agents SDK, the big open-source options. Read them with the vocabulary you now own. Notice what they did and did not solve.
- Pick a specialty. See Module 09, section 11.
- Teach. Nothing cements understanding like explaining it. Give a talk. Write a blog post. Mentor someone through this trainer.
You are now a harness engineer. Welcome to the field.