Overview
Spring AI Bench measures modern agents on real enterprise development tasks — issue triage, PR review, coverage uplift, compliance validation, dependency upgrades. Run benchmarks on YOUR repos to measure YOUR scenarios. Agents have evolved; benchmarks must evolve too.
Existing benchmarks such as SWE-bench measure yesterday’s agents on static 2023 Python patches. They can’t evaluate the agents teams actually use (Claude, Gemini, Amazon Q, Amp) on enterprise Java workflows.
Why Different from SWE-bench
- Full Dev Lifecycle: Beyond patch loops to triage, PR review, coverage, and compliance
- Java-First: Agents score a dismal 7-10% lower on Java; we need better benchmarks
- Any Agent: Claude, Gemini, Amazon Q, Amp, custom — not just one architecture
- Reproducible: One-click Docker + open scaffolding
- Modern Paradigm: 2025 declarative goal agents, not 2024 patch loops
- Open Standards: Following best practices for benchmark design
What Spring AI Bench Does
Can AI act as a true Java developer agent? Not just fixing bugs, but:

1. Issue Analysis: Analyzing and labeling issues with domain-specific labels
2. PR Review: Comprehensive pull request analysis with risk assessment
3. Test Coverage: Raising coverage while keeping builds green
4. Static Analysis: Cleaning up checkstyle violations and code quality issues
5. API Migration: Migrating APIs and upgrading dependencies
6. Compliance: Keeping builds compliant with enterprise standards
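In keeping with the declarative-goal paradigm above, each task is framed as a goal plus success criteria rather than a reference patch. Purely as an illustrative sketch (the real track format is defined by the project, and every name here is hypothetical):

```java
/** Hypothetical track definition; field names are illustrative only. */
record TrackSpec(String id, String goal, String successCriterion) {}

class Tracks {
    // Example: the test-coverage task from the list above, stated as a goal,
    // not as an expected diff.
    static final TrackSpec COVERAGE_UPLIFT = new TrackSpec(
            "coverage-uplift",
            "Raise test coverage while keeping the build green",
            "coverage increases and the full build passes");
}
```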
Run It Yourself
Unlike static benchmarks, Spring AI Bench runs on YOUR repos.

Current Implementation
Supported Agent Providers
Spring AI Bench integrates with multiple AI agent providers through the Spring AI Agents framework:

- Claude Code
- Gemini
- Amazon Q
- Amp
- Codex
- Custom Agents: Bring your own agent implementation (see the sketch below)
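The Custom Agents option implies a small adapter between the harness and your agent. The sketch below is a rough illustration only; the actual SPI is defined by the Spring AI Agents framework, and every type and method name here is hypothetical.

```java
import java.nio.file.Path;

/** Hypothetical outcome record; the real API will differ. */
record AgentResult(boolean success, String summary) {}

/** Hypothetical bring-your-own-agent SPI; names are illustrative only. */
interface AgentProvider {
    String name();                                 // id used to select this agent in a run
    AgentResult run(String goal, Path workspace);  // act on the checked-out repo
}

/** Example adapter wrapping your own model/tool loop. */
final class MyAgentProvider implements AgentProvider {
    @Override public String name() { return "my-agent"; }

    @Override public AgentResult run(String goal, Path workspace) {
        // Call your own model here: read the goal, edit files under `workspace`,
        // run builds or tests, then report the outcome.
        return new AgentResult(true, "no-op agent finished");
    }
}
```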
Benchmark Tracks
✅ Production Ready
- hello-world: File creation and basic infrastructure validation
- Code Coverage Uplift: Autonomous test generation achieving 71.4% coverage on Spring tutorials
🚧 In Active Development
- Issue Analysis & Labeling: Automated issue triage and classification
- Pull Request Review: Comprehensive PR analysis with structured reports
- Static Analysis Remediation: Fix code quality issues while preserving functionality
📋 Future Roadmap
- Integration Testing
- Bug Fixing
- Dependency Upgrades
- API Migration
- Compliance Validation
- Performance Optimization
- Documentation Generation
Code Coverage Achievement
One of the most impressive demonstrations of Spring AI Bench is the autonomous code coverage agent that increased test coverage from 0% to 71.4% on Spring’s official gs-rest-service tutorial.

Coverage Results
- Starting Coverage: 0%
- Final Coverage: 71.4%
- Tests Generated: Complete test suite
- Build Status: ✅ All tests passing
Code Quality
- Claude: Production-ready tests with @WebMvcTest, jsonPath(), and BDD naming (illustrated below)
- Gemini: Same coverage, but slower @SpringBootTest patterns
- Key Insight: Model quality matters beyond raw metrics; equal coverage can hide very different test code
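To make the difference concrete, here is a minimal slice-style test in the spirit described above, targeting the greeting endpoint of the gs-rest-service tutorial. It is an illustrative sketch, not the agent's actual output, and assumes the tutorial's GreetingController is on the test classpath.

```java
import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.WebMvcTest;
import org.springframework.test.web.servlet.MockMvc;

// @WebMvcTest loads only the web slice for the controller under test,
// which is why it runs faster than booting the full context with @SpringBootTest.
@WebMvcTest(GreetingController.class)
class GreetingControllerTest {

    @Autowired
    private MockMvc mockMvc;

    @Test
    @DisplayName("should return the default greeting when no name is given")
    void shouldReturnDefaultGreeting() throws Exception {
        mockMvc.perform(get("/greeting"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.content").value("Hello, World!"));
    }
}
```

A @SpringBootTest variant asserts the same behavior but starts the whole application context, which accounts for the slower runs noted for Gemini.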
Learn More: See the full Code Coverage Analysis in the official documentation for detailed results and methodology.
Architecture
Spring AI Bench is built around a Sandbox abstraction:

- LocalSandbox: Direct process execution (fast, development)
- DockerSandbox: Container isolation (secure, production-ready)
- CloudSandbox: Distributed execution (planned)

Core components:

- BenchHarness: End-to-end benchmark execution
- AgentRunner: Agent execution with Spring AI Agents integration
- SuccessVerifier: Validation of benchmark results
- ReportGenerator: HTML and JSON report generation
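Concretely, these pieces compose into a single run pipeline. The sketch below is illustrative only: every type and method name is hypothetical, standing in for the project's actual API.

```java
import java.nio.file.Path;

/** Hypothetical wiring of the components above; all names are illustrative. */
final class BenchPipelineSketch {

    /** Stand-in for the Sandbox abstraction (LocalSandbox, DockerSandbox, ...). */
    interface Sandbox extends AutoCloseable {
        Path checkout(String repoUrl);   // materialize the target repo in the sandbox
        int exec(String... command);     // run a command inside the sandbox, return exit code
        @Override void close();          // tear down processes/containers
    }

    record RunReport(String goal, boolean passed) {}

    static RunReport run(Sandbox sandbox, String repoUrl, String goal) {
        try (sandbox) {
            Path workspace = sandbox.checkout(repoUrl);             // BenchHarness: prepare workspace
            // AgentRunner: hand `goal` and `workspace` to the configured agent here.
            boolean passed = sandbox.exec("./mvnw", "verify") == 0; // SuccessVerifier: build still green?
            return new RunReport(goal, passed);                     // ReportGenerator renders HTML/JSON from this
        }
    }
}
```

Swapping LocalSandbox for DockerSandbox changes only where `exec` runs, not the pipeline itself, which is the point of the abstraction.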
Resources
- Official Documentation: Complete documentation and analysis
- GitHub Repository: View source code and contribute
- Getting Started: Quick start guide and setup
- Code Coverage Results: Detailed coverage analysis and results
- Agent Integration: Setting up AI agents for benchmarking
- Why Different from SWE-bench: Evidence and comparative analysis