Overview

Agent Bench evaluates AI coding agents on enterprise Java development tasks. It defines benchmarks as YAML, launches any CLI agent in a workspace, and grades results with cascaded judge tiers from Agent Judge. The filesystem is the interface. The bench writes INSTRUCTION.md; the agent reads it and modifies files. Claude Code, Gemini CLI, Amazon Q, or a shell script — any tool that reads a file and writes files can compete.
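
To make the filesystem contract concrete, here is a minimal sketch of a stand-in agent in Python. It assumes only what the overview states: the agent finds INSTRUCTION.md in the workspace and responds by writing files. The function name `run_agent` and the output file `AGENT_NOTES.md` are hypothetical, chosen for illustration, and are not part of Agent Bench.

```python
from pathlib import Path


def run_agent(workspace: Path) -> Path:
    """Read the task from INSTRUCTION.md and write a file in response.

    A real agent would modify source files; this stand-in only records
    the first line of the task. AGENT_NOTES.md is a hypothetical name.
    """
    instruction = (workspace / "INSTRUCTION.md").read_text(encoding="utf-8")
    lines = instruction.strip().splitlines()
    first_line = lines[0] if lines else "(empty task)"

    output = workspace / "AGENT_NOTES.md"
    output.write_text(f"Task received: {first_line}\n", encoding="utf-8")
    return output
```

Calling `run_agent(Path.cwd())` from a script would be enough to take part, provided the script is started inside the workspace; nothing in this contract depends on how the tool is implemented.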

How It Works

1. Provide: Copy workspace template and write INSTRUCTION.md with the task description
2. Setup: Run setup scripts in the workspace (clone repo, compile, measure baseline coverage)
3. Agent: Launch the agent command in the workspace; it reads INSTRUCTION.md and does the work
4. Post: Run post scripts (build, test, generate JaCoCo reports)
5. Grade: Evaluate with a cascaded jury; each tier gates the next (build passes before checking coverage)
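
The five phases above can be sketched as a small Python driver. This is an illustration under stated assumptions, not the harness's actual code: `run_phase` and `run_task` are hypothetical names, each phase is modeled as a single command, and stopping at the first failed phase is an assumption about the control flow. Grading is left out and represented only by the returned status string.

```python
import shutil
import subprocess
from pathlib import Path


def run_phase(command: list[str], workspace: Path, timeout_s: float) -> bool:
    """Run one command with the workspace as cwd; True means exit code 0."""
    try:
        completed = subprocess.run(command, cwd=workspace, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False  # a phase that exceeds its timeout counts as failed
    return completed.returncode == 0


def run_task(template: Path, workspace: Path, instruction: str,
             setup: list[str], agent: list[str], post: list[str],
             timeout_s: float = 60.0) -> str:
    """Drive phases 1-4 and report whether the workspace is ready to grade."""
    # 1. Provide: copy the template and write the task description.
    shutil.copytree(template, workspace, dirs_exist_ok=True)
    (workspace / "INSTRUCTION.md").write_text(instruction, encoding="utf-8")

    # 2. Setup, 3. Agent, 4. Post: each runs inside the workspace.
    for phase, command in (("setup", setup), ("agent", agent), ("post", post)):
        if not run_phase(command, workspace, timeout_s):
            return f"failed:{phase}"  # assumption: stop at the first failure

    # 5. Grade: handed to the jury, which is outside this sketch.
    return "ready-to-grade"
```

Because the agent is only a command run in a directory, swapping one agent for another changes the `agent` argument and nothing else.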

Technical Foundation

YAML-Driven Benchmarks

Benchmarks, tasks, and agent configs defined in YAML — no code changes to add a new benchmark

Cascaded Jury

Multi-tier judge cascade with policies (REJECT_ON_ANY_FAIL, ACCEPT_ON_ALL_PASS, FINAL_TIER)
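
As a rough illustration of how tier policies could gate a cascade, the Python sketch below models one plausible reading of the three policy names. The semantics are assumptions, not Agent Judge's documented behavior: REJECT_ON_ANY_FAIL ends the cascade with a failure if any judge fails and otherwise escalates, ACCEPT_ON_ALL_PASS ends it with a pass if every judge passes and otherwise escalates, and FINAL_TIER always returns a verdict. The names `Policy`, `decide`, and `run_cascade` are hypothetical.

```python
from enum import Enum
from typing import Callable, Optional

# A judge is modeled as a function from workspace facts to pass/fail.
Judge = Callable[[dict], bool]


class Policy(Enum):
    REJECT_ON_ANY_FAIL = "REJECT_ON_ANY_FAIL"
    ACCEPT_ON_ALL_PASS = "ACCEPT_ON_ALL_PASS"
    FINAL_TIER = "FINAL_TIER"


def decide(policy: Policy, results: list[bool]) -> Optional[bool]:
    """Return a final verdict, or None to escalate to the next tier."""
    all_pass = all(results)
    if policy is Policy.REJECT_ON_ANY_FAIL:
        return None if all_pass else False
    if policy is Policy.ACCEPT_ON_ALL_PASS:
        return True if all_pass else None
    return all_pass  # FINAL_TIER: always decides (assumed semantics)


def run_cascade(tiers: list[tuple[Policy, list[Judge]]],
                facts: dict) -> tuple[bool, int]:
    """Run tiers in order; return (verdict, index of the deciding tier)."""
    for index, (policy, judges) in enumerate(tiers):
        verdict = decide(policy, [judge(facts) for judge in judges])
        if verdict is not None:
            return verdict, index
    raise ValueError("no tier produced a verdict; end the cascade with FINAL_TIER")
```

With a build check in a REJECT_ON_ANY_FAIL tier followed by a coverage check in a FINAL_TIER tier, a broken build is rejected at the first tier and the coverage judge never runs, which matches the "build passes before checking coverage" gating described earlier.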

Any CLI Agent

Agent config is just command + timeout — any executable that reads files works

Pass@k Metrics

Multi-attempt runs with pass@k computation (Chen et al. formula)
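
For reference, the unbiased pass@k estimator from Chen et al. is 1 - C(n - c, k) / C(n, k), where n is the number of attempts and c the number that passed. The Python sketch below computes it for a single task; the function name `pass_at_k` is an assumption, and how Agent Bench aggregates the value across tasks is not shown here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c passed, is a passing one."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 5 attempts of which 2 passed, `pass_at_k(5, 2, 1)` is 1 - 3/5 = 0.4 and `pass_at_k(5, 2, 2)` is 1 - 3/10 = 0.7.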

Extensible Judges

Five built-in judge types, plus custom types via JudgeFactory.register()

Open Source

Apache 2.0 licensed, community contributions welcome

Available Benchmarks

| Benchmark | Tasks | What it measures |
| --- | --- | --- |
| hello-world | 1 | File creation (validates infrastructure) |
| code-coverage | 1 (spring-petclinic) | JUnit test generation, JaCoCo coverage improvement |

Current Status

Agent Bench is in active development. The harness, CLI, result model, and judge integration are working. The judges for the code-coverage benchmark have been validated: T0 through T2 pass against a preserved experiment workspace with 91.6% coverage. A live end-to-end agent run is the next step.

Resources

Agent Bench Project

Full documentation with CLI reference, YAML formats, and architecture

GitHub Repository

Source code and benchmarks

Agent Judge

Cascaded judge framework used for grading

Contact Us

Questions and collaboration