# Agent Bench

> Benchmarking framework for AI coding agents on enterprise Java tasks

## Overview

Agent Bench evaluates AI coding agents on enterprise Java development tasks. It defines benchmarks in YAML, launches any CLI agent in a workspace, and grades the results with cascaded judge tiers from [Agent Judge](https://github.com/spring-ai-community/agent-judge).

The filesystem is the interface. The bench writes `INSTRUCTION.md`; the agent reads it and modifies files. Claude Code, Gemini CLI, Amazon Q, or a shell script — any tool that reads a file and writes files can compete.
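
To make that contract concrete, here is a minimal sketch of a stand-in agent in Java. It assumes the workspace directory is passed as the first argument, falling back to the current directory. The class name `EchoAgent` and the output file `agent-output.txt` are hypothetical and not part of Agent Bench.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical stand-in agent: reads the task file, then writes a file. */
class EchoAgent {

    /** Reads INSTRUCTION.md from the workspace and writes a result file beside it. */
    static Path run(Path workspace) throws IOException {
        String instruction =
                Files.readString(workspace.resolve("INSTRUCTION.md"), StandardCharsets.UTF_8);

        // A real agent would act on the instruction. This sketch only records
        // that it was read. The output file name is an assumption.
        Path output = workspace.resolve("agent-output.txt");
        Files.writeString(output,
                "Read " + instruction.length() + " characters of instructions\n");
        return output;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the workspace is the first argument or the current directory.
        Path workspace = Path.of(args.length > 0 ? args[0] : ".");
        System.out.println("Wrote " + run(workspace));
    }
}
```

Anything with the same shape (read the instruction file, change files in the workspace, exit) should be usable as an agent command, whatever language it is written in.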

## How It Works

<Steps>
  <Step title="Provide">
    Copy the workspace template and write `INSTRUCTION.md` with the task description
  </Step>

  <Step title="Setup">
    Run setup scripts in the workspace (clone repo, compile, measure baseline coverage)
  </Step>

  <Step title="Agent">
    Launch the agent command in the workspace — it reads `INSTRUCTION.md` and does the work
  </Step>

  <Step title="Post">
    Run post scripts (build, test, generate JaCoCo reports)
  </Step>

  <Step title="Grade">
    Evaluate with a cascaded jury in which each tier gates the next (the build must pass before coverage is checked)
  </Step>
</Steps>
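
The gating in the Grade step can be sketched as follows. This illustrates the idea only: `CascadeSketch`, `Tier`, `Outcome`, and `grade` are hypothetical names, not the Agent Judge API, and the sketch models just the "stop at the first failing tier" behavior, not the jury policies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Hypothetical illustration of tiered gating; not the Agent Judge API. */
class CascadeSketch {

    /** A named tier whose check returns true when the tier passes. */
    record Tier(String name, BooleanSupplier check) {}

    /** Whether every tier passed, and the names of the tiers that actually ran. */
    record Outcome(boolean passed, List<String> tiersRun) {}

    /** Runs tiers in order and stops at the first failure. */
    static Outcome grade(List<Tier> tiers) {
        List<String> ran = new ArrayList<>();
        for (Tier tier : tiers) {
            ran.add(tier.name());
            if (!tier.check().getAsBoolean()) {
                // A failed tier gates every tier after it.
                return new Outcome(false, ran);
            }
        }
        return new Outcome(true, ran);
    }

    public static void main(String[] args) {
        // Stubbed checks: the build fails, so the coverage tier never runs.
        List<Tier> tiers = List.of(
                new Tier("build", () -> false),
                new Tier("coverage", () -> true));
        System.out.println(grade(tiers));
    }
}
```

Ordering cheap, deterministic checks first means a broken build is rejected before any time is spent measuring coverage.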

## Technical Foundation

<CardGroup cols={3}>
  <Card title="YAML-Driven Benchmarks" icon="file-code">
    Benchmarks, tasks, and agent configs defined in YAML — no code changes to add a new benchmark
  </Card>

  <Card title="Cascaded Jury" icon="gavel">
    Multi-tier judge cascade with policies (`REJECT_ON_ANY_FAIL`, `ACCEPT_ON_ALL_PASS`, `FINAL_TIER`)
  </Card>

  <Card title="Any CLI Agent" icon="terminal">
    Agent config is just `command` + `timeout` — any executable that reads files works
  </Card>

  <Card title="Pass@k Metrics" icon="chart-simple">
    Multi-attempt runs with pass\@k computed using the unbiased estimator from Chen et al.
  </Card>

  <Card title="Extensible Judges" icon="puzzle-piece">
    Five built-in judge types, plus custom types registered via `JudgeFactory.register()`
  </Card>

  <Card title="Open Source" icon="code-branch">
    Apache 2.0 licensed, community contributions welcome
  </Card>
</CardGroup>
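
For reference, the `pass@k` estimator from Chen et al. takes `n` attempts of which `c` passed, and estimates the probability that at least one of `k` attempts drawn from them passes: `1 - C(n - c, k) / C(n, k)`. The sketch below shows one way this could be computed in Java using the usual product form, which avoids large binomial coefficients. It is an assumption about the calculation, not Agent Bench's own code, and `PassAtK` and `passAtK` are hypothetical names.

```java
/** Hypothetical helper illustrating the pass@k estimator; not Agent Bench code. */
class PassAtK {

    /**
     * Estimates pass@k as 1 - C(n - c, k) / C(n, k).
     *
     * @param n total attempts
     * @param c attempts that passed
     * @param k sample size being scored
     */
    static double passAtK(int n, int c, int k) {
        if (n <= 0 || c < 0 || c > n || k <= 0 || k > n) {
            throw new IllegalArgumentException("require 0 <= c <= n and 1 <= k <= n");
        }
        if (n - c < k) {
            // Fewer than k failures exist, so any k attempts include a pass.
            return 1.0;
        }
        // C(n - c, k) / C(n, k) equals the product of (1 - k / i)
        // for i from n - c + 1 through n.
        double allFail = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            allFail *= 1.0 - (double) k / i;
        }
        return 1.0 - allFail;
    }

    public static void main(String[] args) {
        // With k = 1 the estimate reduces to c / n.
        System.out.println(passAtK(10, 3, 1));
    }
}
```

Averaging this value over all tasks in a benchmark gives the benchmark-level score for a given `k`.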

## Available Benchmarks

| Benchmark       | Tasks                | What it measures                                   |
| --------------- | -------------------- | -------------------------------------------------- |
| `hello-world`   | 1                    | File creation — validates infrastructure           |
| `code-coverage` | 1 (spring-petclinic) | JUnit test generation, JaCoCo coverage improvement |

## Current Status

<Note>
  Agent Bench is in active development. The harness, CLI, result model, and judge integration are working. The `code-coverage` benchmark has validated judges (T0-T2 pass against a preserved experiment workspace with 91.6% coverage). A live end-to-end agent run is next.
</Note>

## Resources

<CardGroup cols={2}>
  <Card title="Agent Bench Project" icon="flask" href="/projects/incubating/agent-bench">
    Full documentation with CLI reference, YAML formats, and architecture
  </Card>

  <Card title="GitHub Repository" icon="github" href="https://github.com/spring-ai-community/agent-bench">
    Source code and benchmarks
  </Card>

  <Card title="Agent Judge" icon="gavel" href="https://github.com/spring-ai-community/agent-judge">
    Cascaded judge framework used for grading
  </Card>

  <Card title="Contact Us" icon="envelope" href="https://github.com/spring-ai-community/community/issues">
    Questions and collaboration
  </Card>
</CardGroup>
