
Overview

Spring AI Bench measures modern agents on real enterprise development tasks: issue triage, PR review, coverage uplift, compliance validation, and dependency upgrades. Run benchmarks on YOUR repos to measure YOUR scenarios. If agents have evolved, benchmarks must evolve too.
Existing benchmarks (SWE-bench) measure yesterday’s agents on static 2023 Python patches. They can’t evaluate the agents teams actually use (Claude, Gemini, Amazon Q, Amp) on enterprise Java workflows.

Why Different from SWE-bench

  • Full Dev Lifecycle: Beyond patch loops to triage, PR review, coverage, and compliance
  • Java-First: Agents score 7-10% lower on Java; we need better benchmarks
  • Any Agent: Claude, Gemini, Amazon Q, Amp, or custom agents, not just one architecture
  • Reproducible: One-click Docker + open scaffolding
  • Modern Paradigm: 2025 declarative goal agents, not 2024 patch loops
  • Open Standards: Following best practices for benchmark design

What Spring AI Bench Does

Can AI act as a true Java developer agent? Not just fixing bugs, but:
  1. Issue Analysis: Analyzing and labeling issues with domain-specific labels
  2. PR Review: Comprehensive pull request analysis with risk assessment
  3. Test Coverage: Raising coverage while keeping builds green
  4. Static Analysis: Cleaning up Checkstyle violations and code quality issues
  5. API Migration: Migrating APIs and upgrading dependencies
  6. Compliance: Keeping builds compliant with enterprise standards
That’s the standard enterprise developers hold themselves to — and the standard we should evaluate AI against.

Run It Yourself

Unlike static benchmarks, Spring AI Bench runs on YOUR repos:
# 1. Clone and build dependencies (5 minutes)
git clone https://github.com/spring-ai-community/spring-ai-agents.git
cd spring-ai-agents && ./mvnw clean install -DskipTests

git clone https://github.com/spring-ai-community/spring-ai-bench.git
cd spring-ai-bench

# 2. Set your API keys
export ANTHROPIC_API_KEY=your_key
export GEMINI_API_KEY=your_key

# 3. Run on YOUR codebase
./mvnw test -Dtest=HelloWorldMultiAgentTest -pl bench-agents

# 4. View results in your browser
open file:///tmp/bench-reports/index.html

Current Implementation

Supported Agent Providers

Spring AI Bench integrates with multiple AI agent providers through the Spring AI Agents framework:

  • Claude Code: Anthropic’s Claude via CLI integration
  • Gemini: Google’s Gemini models
  • Amazon Q: AWS’s AI development assistant
  • Amp: The Amp AI agent platform
  • Codex: OpenAI Codex integration
  • Custom Agents: Bring your own agent implementation (see the sketch below)
All agent providers support the same benchmark specifications, enabling fair comparisons across different AI models and platforms.
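
The provider SPI itself isn’t shown in this overview, so the sketch below is hypothetical: the BenchAgent interface, its signature, and the LoggingAgent class are invented names meant only to illustrate how a single contract makes results comparable across providers.

import java.nio.file.Path;

// Hypothetical contract; names and signatures are assumptions for
// illustration, not Spring AI Bench's real SPI.
public interface BenchAgent {

    // Pursue a natural-language goal (e.g. "label this issue") inside
    // the given workspace and report whether the agent finished.
    boolean run(String goal, Path workspace);
}

// A custom provider wraps any CLI or API behind the same contract, so
// its results can be scored the same way as the built-in providers.
final class LoggingAgent implements BenchAgent {

    @Override
    public boolean run(String goal, Path workspace) {
        System.out.printf("Pursuing goal '%s' in %s%n", goal, workspace);
        return true;
    }
}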

Benchmark Tracks

  • hello-world: File creation and basic infrastructure validation
  • Code Coverage Uplift: Autonomous test generation achieving 71.4% coverage on Spring tutorials
  • Issue Analysis & Labeling: Automated issue triage and classification
  • Pull Request Review: Comprehensive PR analysis with structured reports
  • Static Analysis Remediation: Fix code quality issues while preserving functionality
  • Integration Testing
  • Bug Fixing
  • Dependency Upgrades
  • API Migration
  • Compliance Validation
  • Performance Optimization
  • Documentation Generation

Code Coverage Achievement

One of the most impressive demonstrations of Spring AI Bench is the autonomous code coverage agent that increased test coverage from 0% to 71.4% on Spring’s official gs-rest-service tutorial.

Coverage Results

  • Starting Coverage: 0%
  • Final Coverage: 71.4%
  • Tests Generated: Complete test suite
  • Build Status: ✅ All tests passing

Code Quality

  • Claude: Production-ready tests with @WebMvcTest, jsonPath(), and BDD naming (see the example below)
  • Gemini: Same coverage, but used slower @SpringBootTest patterns
  • Key Insight: Model quality matters beyond just metrics
Learn More: See the full Code Coverage Analysis in the official documentation for detailed results and methodology.
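
The gs-rest-service tutorial exposes a /greeting endpoint backed by a GreetingController, so a test in the @WebMvcTest style described above might look like the following sketch (illustrative only, not the agents’ actual output):

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.test.web.servlet.MockMvc;

// Illustrative sketch of a test in the style the agents produced,
// written against the tutorial's GreetingController.
@WebMvcTest(GreetingController.class)
class GreetingControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    void shouldReturnDefaultGreetingWhenNoNameIsGiven() throws Exception {
        mockMvc.perform(get("/greeting"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.content").value("Hello, World!"));
    }

    @Test
    void shouldGreetByNameWhenNameParameterIsGiven() throws Exception {
        mockMvc.perform(get("/greeting").param("name", "Spring"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.content").value("Hello, Spring!"));
    }
}

Because @WebMvcTest loads only the web layer, this style runs faster than the @SpringBootTest approach Gemini chose, which boots the full application context.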

Architecture

Spring AI Bench is built around a Sandbox abstraction:

  • LocalSandbox: Direct process execution (fast, development)
  • DockerSandbox: Container isolation (secure, production-ready)
  • CloudSandbox: Distributed execution (planned)
Key components:
  • BenchHarness: End-to-end benchmark execution
  • AgentRunner: Agent execution with Spring AI Agents integration
  • SuccessVerifier: Validation of benchmark results
  • ReportGenerator: HTML and JSON report generation
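
To make the abstraction concrete, here is a minimal sketch of what the Sandbox contract and its LocalSandbox variant could look like; the method names and signatures are assumptions for illustration, not the project’s actual API:

import java.io.IOException;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch; the real Sandbox API may differ.
public interface Sandbox extends AutoCloseable {

    record ExecResult(int exitCode, String output) {}

    // Run a command such as "./mvnw test" inside the sandbox.
    ExecResult exec(List<String> command) throws IOException, InterruptedException;

    // The directory the agent works in, e.g. a cloned repository.
    Path workspace();
}

// Direct process execution: fast, suited to local development.
final class LocalSandbox implements Sandbox {

    private final Path workspace;

    LocalSandbox(Path workspace) {
        this.workspace = workspace;
    }

    @Override
    public ExecResult exec(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .directory(workspace.toFile())
                .redirectErrorStream(true)
                .start();
        String output = new String(process.getInputStream().readAllBytes());
        return new ExecResult(process.waitFor(), output);
    }

    @Override
    public Path workspace() {
        return workspace;
    }

    @Override
    public void close() {
        // Nothing to release for local execution; a DockerSandbox would
        // stop and remove its container here.
    }
}

Swapping LocalSandbox for DockerSandbox changes where exec runs without touching the harness, which is what keeps benchmark runs reproducible across machines.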

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.