
Overview

Agent Bench defines benchmarks in YAML, launches any CLI agent in an isolated workspace, and grades the results with cascaded judge tiers from Agent Judge. The filesystem is the contract: the bench writes INSTRUCTION.md to the workspace, and the agent reads it and modifies files. Any CLI tool — Claude Code, Gemini CLI, Amazon Q, a shell script — can compete.

How It Works

provide → setup scripts → agent → post scripts → grade → result.json
  1. Provide copies the workspace template and writes INSTRUCTION.md
  2. Setup scripts prepare the workspace (clone repo, compile, measure baseline)
  3. Agent runs — any command that reads INSTRUCTION.md and modifies files
  4. Post scripts finalize (run tests, generate coverage reports)
  5. Grade evaluates with a cascaded jury
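
The filesystem contract behind these steps can be sketched end to end in plain shell. Everything here is illustrative — the workspace path, instruction text, and inline "agent" are stand-ins, not the bench's actual internals:

```shell
#!/bin/sh
# Sketch of the filesystem contract (all names hypothetical).
# The bench writes INSTRUCTION.md; the agent is any command that
# reads it and modifies files; grading inspects the workspace after.
WS=$(mktemp -d)

# "provide": copy the template and write the instruction
printf 'Create hello.txt containing the word hello\n' > "$WS/INSTRUCTION.md"

# "agent": any CLI that reads INSTRUCTION.md and edits files
( cd "$WS" && grep -q hello INSTRUCTION.md && printf 'hello\n' > hello.txt )

# "grade": check the resulting workspace state
test -f "$WS/hello.txt" && echo PASS
```

Because the contract is just files, swapping agents means swapping the one command in the middle.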

Benchmark Format

benchmarks/code-coverage/
├── benchmark.yaml       # jury config: cascaded judge tiers
├── prompts/             # judge prompts
└── tasks/
    └── spring-petclinic/
        └── task.yaml    # instruction, setup/post scripts, metadata

benchmark.yaml — Jury Configuration

schema: bench.benchmark.v1
name: code-coverage
version: "1.0"
default-timeout: PT45M

jury:
  tiers:
    - name: build
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: maven-build
          goals: [clean, test]
    - name: coverage-improvement
      policy: ACCEPT_ON_ALL_PASS
      checks:
        - type: coverage-improvement
          min: 50.0
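
The two tier policies above can be read as short-circuit rules: a REJECT_ON_ANY_FAIL tier aborts the cascade on the first failing check, while an ACCEPT_ON_ALL_PASS tier accepts only if every check passes. A minimal sketch of this assumed semantics (an interpretation for illustration, not the Agent Judge implementation):

```shell
#!/bin/sh
# Sketch of assumed tier-policy semantics. Check results are passed
# as exit codes: 0 = pass, non-zero = fail.
evaluate_tier() {
  policy=$1; shift
  for result in "$@"; do
    if [ "$result" -ne 0 ]; then
      # a failing check: rejecting tiers stop the cascade here
      if [ "$policy" = "REJECT_ON_ANY_FAIL" ]; then echo REJECT; else echo CONTINUE; fi
      return
    fi
  done
  # all checks passed: accepting tiers conclude, others fall through
  if [ "$policy" = "ACCEPT_ON_ALL_PASS" ]; then echo ACCEPT; else echo CONTINUE; fi
}

evaluate_tier REJECT_ON_ANY_FAIL 0 1   # → REJECT
evaluate_tier REJECT_ON_ANY_FAIL 0 0   # → CONTINUE (next tier decides)
evaluate_tier ACCEPT_ON_ALL_PASS 0 0   # → ACCEPT
```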

task.yaml — Task Definition

schema: bench.task.v1
id: spring-petclinic
instruction: |
  Write JUnit tests for this Spring Boot project to maximize code coverage.
  Use narrow test slices (@WebMvcTest, @DataJpaTest) over @SpringBootTest.
timeout: PT45M
metadata:
  baselineCoverage: 0.0
setup:
  - "git init && git remote add origin ... && git fetch --depth 1 origin edf4db28affc && git checkout FETCH_HEAD"
  - "./mvnw clean compile -q"
post:
  - "./mvnw clean test jacoco:report -q"
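
Each setup and post entry is a plain shell command string. A minimal sketch of the execution model this implies — sequential, run from the workspace root, fail-fast (the fail-fast behavior is an assumption):

```shell
#!/bin/sh
# Sketch of an assumed script runner: each entry runs in the workspace,
# and a non-zero exit aborts the remaining scripts.
WORKSPACE=$(mktemp -d)

run_scripts() {
  for script in "$@"; do
    ( cd "$WORKSPACE" && sh -c "$script" ) || { echo "failed: $script" >&2; return 1; }
  done
}

# stand-in commands in place of the task's real setup phase
run_scripts "echo ready > setup.log" "test -f setup.log" && echo SETUP_OK
```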

Agent Config

Agents are defined by a command and a timeout:
# agents/claude-code.yaml
command: claude --print --dangerously-skip-permissions 'Read INSTRUCTION.md and follow the instructions precisely.'
timeout: PT45M
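
The launch semantics this config suggests — run the configured command from the workspace root under a wall-clock timeout (PT45M is an ISO 8601 duration, i.e. 2700 seconds) — can be sketched as follows; the command and workspace here are stand-ins:

```shell
#!/bin/sh
# Sketch of an assumed agent launch: the configured command string runs
# from the workspace root under a timeout (2700 s for PT45M).
AGENT_CMD="cat INSTRUCTION.md"   # stand-in for the configured command
AGENT_WS=$(mktemp -d)
echo "do the task" > "$AGENT_WS/INSTRUCTION.md"
( cd "$AGENT_WS" && timeout 2700 sh -c "$AGENT_CMD" )
```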

CLI

Command                                                          Purpose
bench list                                                       List available benchmarks
bench tasks --benchmark <name>                                   List tasks in a benchmark
bench provide --benchmark <name> --task <id> --workspace <dir>   Set up workspace
bench grade --benchmark <name> --task <id> --workspace <dir>     Evaluate the agent's work
bench run --benchmark <name> --agent <config>                    Full pipeline

Bring Your Own Agent

# 1. Set up workspace
bench provide --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic

# 2. Run any agent
cd /tmp/petclinic && your-agent "$(cat INSTRUCTION.md)"

# 3. Grade
bench grade --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic

Built-in Judge Types

Type                    What it checks
file-exists             File exists at the given path
file-content            File content matches the expected content
maven-build             Maven build succeeds
coverage-preservation   JaCoCo coverage has not dropped below the baseline
coverage-improvement    JaCoCo coverage exceeds the configured threshold

Custom types are registered via JudgeFactory.register().
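
As an illustration, the coverage checks boil down to aggregating a JaCoCo report and comparing against a threshold. The sketch below uses a simplified CSV (the column layout is reduced for the example; a real jacoco.csv has more columns):

```shell
#!/bin/sh
# Sketch of a coverage-improvement style check against a simplified
# JaCoCo-like CSV (real reports carry more columns than shown here).
REPORT=$(mktemp)
cat > "$REPORT" <<'EOF'
CLASS,LINE_MISSED,LINE_COVERED
org.demo.Foo,10,90
EOF

# aggregate line coverage across all classes
covered=$(awk -F, 'NR>1 {m+=$2; c+=$3} END {printf "%.1f", 100*c/(m+c)}' "$REPORT")
echo "coverage: $covered%"

# pass if coverage meets the configured minimum (min: 50.0 above)
awk -v cov="$covered" -v min=50.0 'BEGIN { exit !(cov >= min) }' && echo PASS
```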

Benchmarks

Benchmark       Tasks                  Description
hello-world     1                      Smoke test for validating the harness — submit your own agent to verify it works
code-coverage   1 (spring-petclinic)   Write JUnit tests from scratch to maximize JaCoCo coverage, graded by a 4-tier jury

The code-coverage benchmark was validated in a 29-run experiment testing 7 prompt/knowledge variants with Claude Sonnet 4.6 on spring-petclinic. See the blog post for analysis.

Architecture

Two modules: agent-bench-core (CLI, catalog, judge factory, result model) and agent-bench-agents (LLM-based judges). Module layering enforced by ArchUnit.

Resources

GitHub Repository

Source code and benchmarks

Agent Judge

Cascaded judge framework (core dependency)

Code Coverage Experiment

29-run benchmark study with variant analysis

Blog: I Read My Agent's Diary

Analysis of the code-coverage experiment results

License

Apache License 2.0