Overview
Agent Bench evaluates AI coding agents on enterprise Java development tasks. It defines benchmarks in YAML, launches any CLI agent in a workspace, and grades results with cascaded judge tiers from Agent Judge. The filesystem is the interface: the bench writes INSTRUCTION.md; the agent reads it and modifies files. Claude Code, Gemini CLI, Amazon Q, or a shell script: any tool that reads a file and writes files can compete.
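To make that filesystem contract concrete, a minimal "agent" might look like the sketch below. Everything beyond "read INSTRUCTION.md, write files into the workspace" is an assumption for illustration, including the output filename.

```python
#!/usr/bin/env python3
"""Minimal illustrative agent for the filesystem contract.

Assumptions (not from the Agent Bench docs): the workspace is the current
working directory, and the output filename below is purely hypothetical.
"""
from pathlib import Path

# The bench has already written the task description here.
instruction = Path("INSTRUCTION.md").read_text(encoding="utf-8")

# A real agent would interpret the instruction and edit source files;
# this stub simply creates a file so the run leaves a gradable artifact.
Path("ANSWER.md").write_text(f"Read the task:\n\n{instruction}", encoding="utf-8")
```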
How It Works
Technical Foundation
YAML-Driven Benchmarks
Benchmarks, tasks, and agent configs defined in YAML; no code changes needed to add a new benchmark (sketched below)
Cascaded Jury
Multi-tier judge cascade with policies (REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER)
Any CLI Agent
Agent config is just command + timeout; any executable that reads files works
Pass@k Metrics
Multi-attempt runs with pass@k computation (Chen et al. formula; worked example below)
Extensible Judges
5 built-in judge types + custom types via JudgeFactory.register() (hypothetical sketch below)
Open Source
Apache 2.0 licensed, community contributions welcome
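To illustrate the YAML-driven pieces above (a benchmark with tasks, cascaded judge tiers with policies, and a command-plus-timeout agent config), here is a hypothetical definition. The field names and judge types are guesses for illustration, not Agent Bench's actual schema; only the concepts come from the feature list.

```yaml
# Hypothetical sketch: field names are illustrative, not Agent Bench's real schema.
benchmark:
  name: code-coverage
  tasks:
    - id: spring-petclinic
      instruction: INSTRUCTION.md     # written into the agent's workspace
  judges:                             # cascaded tiers graded by Agent Judge
    - tier: T0
      type: build                     # hypothetical judge type
      policy: REJECT_ON_ANY_FAIL      # any failure here rejects the attempt
    - tier: T1
      type: coverage                  # hypothetical judge type
      policy: ACCEPT_ON_ALL_PASS
    - tier: T2
      type: llm-review                # hypothetical judge type
      policy: FINAL_TIER

agent:
  name: my-agent
  command: ./run-agent.sh             # any executable that reads files works
  timeout: 1800                       # seconds (illustrative)
```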
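The pass@k metric is the unbiased estimator from Chen et al. (2021): with n attempts of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A short Python sketch of that computation (not Agent Bench's own code):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. (2021).

    n: attempts run, c: attempts that passed the judges, k: selection budget.
    Returns the probability that at least one of k sampled attempts passes.
    """
    if n - c < k:
        # Fewer than k failures exist, so any sample of k attempts must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts with 3 passes -> pass@1 = 0.3, pass@5 ~ 0.917
print(pass_at_k(10, 3, 1), pass_at_k(10, 3, 5))
```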
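The feature list names JudgeFactory.register() but does not document its signature or host language, so the following is only a hypothetical Python-style sketch of what plugging in a custom judge type might look like, not the actual API.

```python
from pathlib import Path

# Hypothetical custom judge; the real judge interface, result type, and
# JudgeFactory.register() signature are not documented in this overview.
class FileExistsJudge:
    """Illustrative judge: passes when a required file exists in the workspace."""

    def __init__(self, path: str):
        self.path = path

    def evaluate(self, workspace: str) -> bool:
        return (Path(workspace) / self.path).exists()

# Assumed registration call, mirroring the JudgeFactory.register() named above:
# JudgeFactory.register("file-exists", FileExistsJudge)
```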
Available Benchmarks
| Benchmark | Tasks | What it measures |
|---|---|---|
| hello-world | 1 | File creation; validates infrastructure |
| code-coverage | 1 (spring-petclinic) | JUnit test generation, JaCoCo coverage improvement |
Current Status
Agent Bench is in active development. The harness, CLI, result model, and judge integration are working. The code-coverage benchmark has validated judges (T0-T2 pass against a preserved experiment workspace with 91.6% coverage). A live end-to-end agent run is next.
Resources
Agent Bench Project
Full documentation with CLI reference, YAML formats, and architecture
GitHub Repository
Source code and benchmarks
Agent Judge
Cascaded judge framework used for grading
Contact Us
Questions and collaboration