Overview
Agent Bench defines benchmarks as YAML, launches any CLI agent in an isolated workspace, and grades results with cascaded judge tiers from Agent Judge. The filesystem is the contract: the bench writes `INSTRUCTION.md` to the workspace, and the agent reads it and modifies files. Any CLI tool (Claude Code, Gemini CLI, Amazon Q, a shell script) can compete.
How It Works
- Provide copies the workspace template and writes `INSTRUCTION.md`
- Setup scripts prepare the workspace (clone repo, compile, measure baseline)
- Agent runs: any command that reads `INSTRUCTION.md` and modifies files
- Post scripts finalize (run tests, generate coverage reports)
- Grade evaluates with a cascaded jury
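
The five phases above can be sketched as a small driver. This is a hypothetical illustration, not Agent Bench's implementation: the function names, the argv-list representation of scripts, and the choice to abort on a failing command are all assumptions.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path
from typing import Callable


def provide(template: Path, workspace: Path, instruction: str) -> None:
    """Phase 1: copy the workspace template and write INSTRUCTION.md."""
    shutil.copytree(template, workspace, dirs_exist_ok=True)
    (workspace / "INSTRUCTION.md").write_text(instruction, encoding="utf-8")


def run_commands(commands: list[list[str]], workspace: Path, timeout: float) -> None:
    """Run each command with the workspace as working directory.

    Assumption: a non-zero exit status aborts the run (check=True).
    """
    for argv in commands:
        subprocess.run(argv, cwd=workspace, check=True, timeout=timeout)


def run_pipeline(
    template: Path,
    workspace: Path,
    instruction: str,
    setup: list[list[str]],
    agent: list[str],
    post: list[list[str]],
    grade: Callable[[Path], bool],
    timeout: float = 600.0,
) -> bool:
    provide(template, workspace, instruction)   # 1. provide
    run_commands(setup, workspace, timeout)     # 2. setup scripts
    run_commands([agent], workspace, timeout)   # 3. agent (any command)
    run_commands(post, workspace, timeout)      # 4. post scripts
    return grade(workspace)                     # 5. grade
```

Because the agent is just an argv list, swapping one CLI tool for another changes a single argument and nothing else in the driver.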
Benchmark Format
benchmark.yaml — Jury Configuration
task.yaml — Task Definition
Agent Config
Agents are defined by a command and a timeout.
CLI
| Command | Purpose |
|---|---|
| `bench list` | List available benchmarks |
| `bench tasks --benchmark <name>` | List tasks in a benchmark |
| `bench provide --benchmark <name> --task <id> --workspace <dir>` | Set up workspace |
| `bench grade --benchmark <name> --task <id> --workspace <dir>` | Evaluate agent’s work |
| `bench run --benchmark <name> --agent <config>` | Full pipeline |
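
The `provide` and `grade` commands can be scripted around an agent of your choice. The sketch below only uses the subcommands and flags listed in the table; the helper names are hypothetical, and it makes no assumption about what `bench grade` prints or what its exit status means.

```python
from __future__ import annotations

import subprocess


def bench_argv(subcommand: str, **options: str) -> list[str]:
    """Build an argv list such as ['bench', 'grade', '--benchmark', ...]."""
    argv = ["bench", subcommand]
    for name, value in options.items():
        argv += [f"--{name}", value]
    return argv


def provide_run_grade(
    benchmark: str,
    task: str,
    workspace: str,
    agent_argv: list[str],
    timeout: float,
) -> subprocess.CompletedProcess:
    """Provide a workspace, run an agent in it, then grade the result."""
    common = dict(benchmark=benchmark, task=task, workspace=workspace)
    subprocess.run(bench_argv("provide", **common), check=True)
    # Assumption: the agent is started with the workspace as its cwd.
    subprocess.run(agent_argv, cwd=workspace, timeout=timeout)
    # The grade result is returned as-is; interpreting it is left to the caller.
    return subprocess.run(bench_argv("grade", **common))
```

`bench run` covers the same ground in one step; the manual split is mainly useful when the agent has to be launched outside the harness.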
Bring Your Own Agent
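
Since the filesystem is the contract, an agent only needs to read `INSTRUCTION.md` and change files in the workspace. The following is a minimal stand-in agent, written as a sketch: the output file name `AGENT_NOTES.md` and the function name are hypothetical, and a real agent would act on the instruction rather than echo it.

```python
from pathlib import Path


def run_agent(workspace: Path) -> Path:
    """Read the task from INSTRUCTION.md and write a file in response."""
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")

    # Placeholder for real work: record what was asked.
    notes = workspace / "AGENT_NOTES.md"  # hypothetical output file
    word_count = len(instruction.split())
    notes.write_text(
        f"Received task ({word_count} words):\n\n{instruction}",
        encoding="utf-8",
    )
    return notes
```

A wrapper script would end with `run_agent(Path.cwd())`, assuming the bench starts the agent with the workspace as its working directory, which this page does not state.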
Built-in Judge Types
| Type | What it checks |
|---|---|
| `file-exists` | File exists at path |
| `file-content` | File content matches expected |
| `maven-build` | Maven build succeeds |
| `coverage-preservation` | JaCoCo coverage not dropped from baseline |
| `coverage-improvement` | JaCoCo coverage exceeds threshold |
Custom judge types can be registered with `JudgeFactory.register()`.
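
The signature of `JudgeFactory.register()` is not shown on this page, so the Python sketch below only illustrates the general idea of a registry that maps a judge type name to a builder. Every name in it is hypothetical, and the whitespace-insensitive comparison in `file_content` is an assumption about what "matches expected" means.

```python
from pathlib import Path
from typing import Callable, Dict

Judge = Callable[[Path], bool]          # takes a workspace, returns pass/fail
JudgeBuilder = Callable[..., Judge]     # takes config values, returns a judge

_REGISTRY: Dict[str, JudgeBuilder] = {}


def register(type_name: str, builder: JudgeBuilder) -> None:
    """Associate a judge type name with the function that builds it."""
    _REGISTRY[type_name] = builder


def create(type_name: str, **config: str) -> Judge:
    """Look up a judge type and configure it."""
    return _REGISTRY[type_name](**config)


def file_exists(path: str) -> Judge:
    return lambda workspace: (workspace / path).is_file()


def file_content(path: str, expected: str) -> Judge:
    def judge(workspace: Path) -> bool:
        target = workspace / path
        if not target.is_file():
            return False
        # Assumption: leading/trailing whitespace is ignored.
        return target.read_text(encoding="utf-8").strip() == expected.strip()

    return judge


register("file-exists", file_exists)
register("file-content", file_content)
```

With a registry like this, a benchmark definition can refer to judges by type name and leave construction to the factory.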
Benchmarks
| Benchmark | Tasks | Description |
|---|---|---|
| `hello-world` | 1 | Smoke test for validating the harness; submit your own agent to verify it works |
| `code-coverage` | 1 (spring-petclinic) | Write JUnit tests from scratch to maximize JaCoCo coverage, graded by a 4-tier jury |
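
To make the idea of a tiered jury concrete, here is a small sketch of a cascade that runs checks in order and stops at the first failure. The stop-on-first-failure rule is an assumption about how the cascade behaves, not a description of Agent Judge, and all names are hypothetical. The two coverage checks follow the table's wording: preservation means not below the baseline, improvement means strictly above a threshold.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Coverage:
    baseline: float  # coverage percentage measured before the agent ran
    current: float   # coverage percentage measured after the agent ran


def coverage_preserved(cov: Coverage) -> bool:
    return cov.current >= cov.baseline


def coverage_improved(cov: Coverage, threshold: float) -> bool:
    return cov.current > threshold


Tier = Tuple[str, Callable[[], bool]]


def run_cascade(tiers: List[Tier]) -> Tuple[bool, Optional[str]]:
    """Evaluate tiers in order; return (passed, name of first failing tier)."""
    for name, check in tiers:
        if not check():
            return False, name  # later tiers are skipped
    return True, None
```

Ordering cheap deterministic checks before expensive ones means a failed build or a coverage regression ends grading before any costlier tier runs.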
Architecture
Two modules: `agent-bench-core` (CLI, catalog, judge factory, result model) and `agent-bench-agents` (LLM-based judges). Module layering is enforced by ArchUnit.
Resources
- GitHub Repository: source code and benchmarks
- Agent Judge: cascaded judge framework (core dependency)
- Code Coverage Experiment: 29-run benchmark study with variant analysis
- Blog, "I Read My Agent's Diary": analysis of the code-coverage experiment results