Agent Bench - Spring AI Community

This project has moved to markpollack/agent-bench. Current documentation is at lab.pollack.ai/projects/agent-bench. The content below may be outdated.

GitHub • Agent Judge • Agent Client

Overview

Agent Bench defines benchmarks as YAML, launches any CLI agent in an isolated workspace, and grades results with cascaded judge tiers from Agent Judge. The filesystem is the contract: the bench writes INSTRUCTION.md to the workspace, and the agent reads it and modifies files. Any CLI tool (Claude Code, Gemini CLI, Amazon Q, a shell script) can compete.

How It Works

provide → setup scripts → agent → post scripts → grade → result.json

Provide copies the workspace template and writes INSTRUCTION.md
Setup scripts prepare the workspace (clone repo, compile, measure baseline)
Agent runs — any command that reads INSTRUCTION.md and modifies files
Post scripts finalize (run tests, generate coverage reports)
Grade evaluates the workspace with a cascaded jury and writes result.json
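
The five stages above can be sketched in a few lines of Python. This is an illustrative outline, not the harness's actual implementation: the function names, the `grade` callback, and the fields written to result.json are assumptions made for the example.

```python
import json
import shutil
import subprocess
from pathlib import Path


def run_scripts(commands, workspace):
    """Run each shell command inside the workspace; raise on the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, cwd=workspace, check=True)


def run_pipeline(template, workspace, instruction, setup, agent_command, post, grade):
    """Hypothetical driver: provide -> setup -> agent -> post -> grade -> result.json."""
    template, workspace = Path(template), Path(workspace)

    # Provide: copy the workspace template and write INSTRUCTION.md.
    shutil.copytree(template, workspace, dirs_exist_ok=True)
    (workspace / "INSTRUCTION.md").write_text(instruction, encoding="utf-8")

    # Setup scripts prepare the workspace before the agent starts.
    run_scripts(setup, workspace)

    # Agent: any command; its only interface is the files in the workspace.
    agent = subprocess.run(agent_command, shell=True, cwd=workspace)

    # Post scripts finalize, for example by running tests.
    run_scripts(post, workspace)

    # Grade: `grade` stands in for the jury; the result fields are made up.
    result = {"agentExitCode": agent.returncode, "passed": bool(grade(workspace))}
    (workspace / "result.json").write_text(json.dumps(result, indent=2), encoding="utf-8")
    return result
```

Passing the grader in as a callback keeps the sketch independent of any particular judge. In the real tool that role is played by the jury configured in benchmark.yaml.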

Benchmark Format

benchmarks/code-coverage/
├── benchmark.yaml       # jury config: cascaded judge tiers
├── prompts/             # judge prompts
└── tasks/
    └── spring-petclinic/
        └── task.yaml    # instruction, setup/post scripts, metadata

benchmark.yaml — Jury Configuration

schema: bench.benchmark.v1
name: code-coverage
version: "1.0"
default-timeout: PT45M

jury:
  tiers:
    - name: build
      policy: REJECT_ON_ANY_FAIL
      checks:
        - type: maven-build
          goals: [clean, test]
    - name: coverage-improvement
      policy: ACCEPT_ON_ALL_PASS
      checks:
        - type: coverage-improvement
          min: 50.0
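
One plausible reading of the two policies is sketched below. The semantics are an assumption, since this page does not define them: REJECT_ON_ANY_FAIL is taken to end the cascade with a rejection as soon as any check in the tier fails, and ACCEPT_ON_ALL_PASS to end it with an acceptance when every check passes. In all other cases evaluation falls through to the next tier. The `Tier` class, the `evaluate` function, and the UNDECIDED outcome are hypothetical names, not Agent Judge's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Check = Callable[[], bool]  # a check returns True when it passes


@dataclass
class Tier:
    name: str
    policy: str  # "REJECT_ON_ANY_FAIL" or "ACCEPT_ON_ALL_PASS"
    checks: List[Check]


def evaluate(tiers: List[Tier]) -> Tuple[str, Optional[str]]:
    """Walk the tiers in order and return (verdict, name of the deciding tier)."""
    for tier in tiers:
        results = [check() for check in tier.checks]
        if tier.policy == "REJECT_ON_ANY_FAIL" and not all(results):
            return ("REJECTED", tier.name)
        if tier.policy == "ACCEPT_ON_ALL_PASS" and all(results):
            return ("ACCEPTED", tier.name)
        # Otherwise this tier reached no verdict; fall through to the next one.
    return ("UNDECIDED", None)  # assumed outcome when no tier decides
```

Under this reading, the build tier in the configuration above acts as a gate: a failing Maven build rejects the run before the coverage tier is consulted.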

task.yaml — Task Definition

schema: bench.task.v1
id: spring-petclinic
instruction: |
  Write JUnit tests for this Spring Boot project to maximize code coverage.
  Use narrow test slices (@WebMvcTest, @DataJpaTest) over @SpringBootTest.
timeout: PT45M
metadata:
  baselineCoverage: 0.0
setup:
  - "git init && git remote add origin ... && git fetch --depth 1 origin edf4db28affc && git checkout FETCH_HEAD"
  - "./mvnw clean compile -q"
post:
  - "./mvnw clean test jacoco:report -q"
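
Both `timeout` and `default-timeout` use ISO 8601 duration syntax, so PT45M means 45 minutes. Python's standard library has no parser for this format, so a minimal sketch for the time-only subset (hours, minutes, seconds) could look like the following. `parse_duration` is a hypothetical helper, not part of Agent Bench, and it deliberately ignores date components such as P1D.

```python
import re
from datetime import timedelta

# Time-only subset of ISO 8601 durations: PT[nH][nM][nS] with whole numbers.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def parse_duration(text: str) -> timedelta:
    """Convert a string such as 'PT45M' into a timedelta (illustrative only)."""
    match = _DURATION.match(text)
    # Reject non-matching input and the bare prefix "PT" with no components.
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)
```

For example, `parse_duration("PT45M").total_seconds()` gives the value in seconds that a process timeout would need.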

Agent Config

Agents are defined by a command and a timeout:

# agents/claude-code.yaml
command: claude --print --dangerously-skip-permissions 'Read INSTRUCTION.md and follow the instructions precisely.'
timeout: PT45M
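
To make the command-plus-timeout contract concrete, here is a rough sketch of how a harness might launch such a command. It assumes a POSIX-style command line split with `shlex` and a timeout already converted to seconds. `run_agent` is a hypothetical name, and the real launcher may behave differently, for example in how it captures output or cleans up child processes.

```python
import shlex
import subprocess
from typing import Optional


def run_agent(command: str, workspace: str, timeout_seconds: float) -> Optional[int]:
    """Run the agent command in the workspace.

    Returns the exit code, or None if the timeout expired first.
    """
    # shlex.split keeps the quoted prompt together as a single argument.
    argv = shlex.split(command)
    try:
        completed = subprocess.run(argv, cwd=workspace, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child process at this point.
        return None
    return completed.returncode
```

Splitting the command instead of passing it to a shell means the single-quoted prompt in the example config reaches the agent as one argument.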

CLI

Command	Purpose
`bench list`	List available benchmarks
`bench tasks --benchmark <name>`	List tasks in a benchmark
`bench provide --benchmark <name> --task <id> --workspace <dir>`	Set up workspace
`bench grade --benchmark <name> --task <id> --workspace <dir>`	Evaluate agent’s work
`bench run --benchmark <name> --agent <config>`	Full pipeline

Bring Your Own Agent

# 1. Set up workspace
bench provide --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic

# 2. Run any agent
cd /tmp/petclinic && your-agent "$(cat INSTRUCTION.md)"

# 3. Grade
bench grade --benchmark code-coverage --task spring-petclinic --workspace /tmp/petclinic
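
Because the contract is only the filesystem, even a tiny script qualifies as an agent. The sketch below is a stand-in written in Python: it reads INSTRUCTION.md from the workspace and writes a file in response. The output file name AGENT_NOTES.md and the `run` function are invented for illustration. A real agent would edit whatever files the instruction calls for.

```python
from pathlib import Path


def run(workspace):
    """Toy agent: read the instruction, then leave a file behind in the workspace."""
    workspace = Path(workspace)
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")

    # A real agent would act on the instruction; this one only records it.
    lines = instruction.strip().splitlines()
    summary = lines[0] if lines else "(empty instruction)"
    notes = workspace / "AGENT_NOTES.md"  # hypothetical output file
    notes.write_text(f"Received task: {summary}\n", encoding="utf-8")
    return notes
```

Saved as a script that calls `run(Path.cwd())`, this could take the place of your-agent in step 2, which is a quick way to check the provide and grade steps before plugging in a real agent.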

Built-in Judge Types

Type	What it checks
`file-exists`	File exists at path
`file-content`	File content matches expected
`maven-build`	Maven build succeeds
`coverage-preservation`	JaCoCo coverage not dropped from baseline
`coverage-improvement`	JaCoCo coverage exceeds threshold

Custom types can be registered via JudgeFactory.register().
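
As an illustration of what a coverage check involves, the sketch below reads the report-level LINE counter from a JaCoCo XML report and compares it with a threshold. Which counter the built-in judge actually uses, and whether its comparison is strict, are assumptions here. The function names are hypothetical and this is not the JudgeFactory API.

```python
import xml.etree.ElementTree as ET


def line_coverage_percent(report_xml: str) -> float:
    """Return line coverage (0 to 100) from the top-level counters of a JaCoCo XML report."""
    root = ET.fromstring(report_xml)
    # Direct children of <report> hold the totals for the whole project.
    for counter in root.findall("counter"):
        if counter.get("type") == "LINE":
            missed = int(counter.get("missed"))
            covered = int(counter.get("covered"))
            total = missed + covered
            return 100.0 * covered / total if total else 0.0
    raise ValueError("no report-level LINE counter found")


def coverage_exceeds(report_xml: str, minimum: float) -> bool:
    """Assumed reading of coverage-improvement: coverage must be above `minimum`."""
    return line_coverage_percent(report_xml) > minimum


def coverage_preserved(report_xml: str, baseline: float) -> bool:
    """Assumed reading of coverage-preservation: coverage has not dropped below baseline."""
    return line_coverage_percent(report_xml) >= baseline
```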

Benchmarks

Benchmark	Tasks	Description
`hello-world`	1	Smoke test for validating the harness — submit your own agent to verify it works
`code-coverage`	1 (spring-petclinic)	Write JUnit tests from scratch to maximize JaCoCo coverage, graded by a 4-tier jury

The code-coverage benchmark was validated in a 29-run experiment testing 7 prompt/knowledge variants with Claude Sonnet 4.6 on spring-petclinic. See the blog post for analysis.

Architecture

Agent Bench has two modules: agent-bench-core (CLI, catalog, judge factory, result model) and agent-bench-agents (LLM-based judges). Module layering is enforced by ArchUnit.

Resources

GitHub Repository

Source code and benchmarks

Agent Judge

Cascaded judge framework (core dependency)

Code Coverage Experiment

29-run benchmark study with variant analysis

Blog: I Read My Agent's Diary

Analysis of the code-coverage experiment results

License

Apache License 2.0
