Agent Judge

Overview

Agent Judge is an agent-agnostic evaluation framework for verifying AI agent task completion. It provides a pluggable architecture with deterministic rules, command execution, and LLM-powered evaluation - all with zero coupling to any specific agent implementation. The library follows a clean separation of concerns: agent-judge-core has zero external dependencies, while specialized modules add capabilities like process execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).

Core Abstractions

Judge

Functional interface for evaluation logic - takes JudgmentContext, returns Judgment

Judgment

Result containing Score, JudgmentStatus, reasoning, and granular Checks

Score

Sealed interface: BooleanScore, NumericalScore, or CategoricalScore

Jury

Multi-judge aggregation with configurable voting strategies

Module Structure

Module	Description	Dependencies
`agent-judge-core`	Core Judge API and abstractions	None (zero deps)
`agent-judge-exec`	Command execution judges	agent-sandbox, zt-exec
`agent-judge-llm`	LLM-powered evaluation	spring-ai-client-chat
`agent-judge-agent`	Agent-as-judge bridge interface	Core only
`agent-judge-advisor`	AgentClient advisors	spring-ai-agent-client
`agent-judge-bom`	Bill of Materials	N/A

Quick Start

Maven BOM

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springaicommunity.judge</groupId>
            <artifactId>agent-judge-bom</artifactId>
            <version>0.1.0-SNAPSHOT</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springaicommunity.judge</groupId>
        <artifactId>agent-judge-core</artifactId>
    </dependency>
</dependencies>

Judge Interface

The core Judge interface is a functional interface for lambda support:

@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}

Async and Reactive Variants:

// For CompletableFuture-based async
public interface AsyncJudge {
    CompletableFuture<Judgment> judgeAsync(JudgmentContext context);
}

// For Spring WebFlux / Project Reactor
public interface ReactiveJudge {
    Mono<Judgment> judge(JudgmentContext context);
}
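
The relationship between the synchronous and asynchronous variants can be sketched as a small adapter. This is a minimal illustration, not the library's implementation: `AsyncAdapterSketch` and `toAsync` are hypothetical names, and `String` and `Boolean` stand in for `JudgmentContext` and `Judgment` so the block compiles on its own.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Hypothetical sketch: String stands in for JudgmentContext and Boolean for Judgment,
// so the block compiles without the library on the classpath.
final class AsyncAdapterSketch {

    // Wraps a blocking judge function so callers receive a CompletableFuture instead.
    static Function<String, CompletableFuture<Boolean>> toAsync(
            Function<String, Boolean> syncJudge, Executor executor) {
        return context -> CompletableFuture.supplyAsync(() -> syncJudge.apply(context), executor);
    }
}
```

Passing the executor explicitly keeps the choice of thread pool with the caller, which matters when the wrapped judge blocks on a build or an LLM call.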

Judgment Context

Complete evaluation input with all context about an agent execution:

JudgmentContext context = JudgmentContext.builder()
    .goal("Increase test coverage to 80%")
    .workspace(Path.of("/project"))
    .executionTime(Duration.ofMinutes(5))
    .startedAt(Instant.now())
    .agentOutput("Added 15 test cases...")
    .status(ExecutionStatus.SUCCESS)
    .build();

ExecutionStatus values: SUCCESS, FAILED, TIMEOUT, CANCELLED, UNKNOWN

Judgment Results

Judgment result = judge.judge(context);

// Result properties
Score score = result.score();              // BooleanScore, NumericalScore, or CategoricalScore
JudgmentStatus status = result.status();   // PASS, FAIL, ABSTAIN, or ERROR
String reasoning = result.reasoning();      // Human-readable explanation
List<Check> checks = result.checks();       // Granular sub-assertions

JudgmentStatus values: PASS, FAIL, ABSTAIN, ERROR

Checks (Sub-Assertions)

Provide granular failure reporting:

Judgment.builder()
    .score(BooleanScore.FAIL)
    .status(JudgmentStatus.FAIL)
    .reasoning("2 of 3 checks failed")
    .check(Check.passed("file-exists", "output.txt exists"))
    .check(Check.failed("content-valid", "Missing required header"))
    .check(Check.failed("format-correct", "Invalid JSON structure"))
    .build();
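
One way to derive the overall outcome and a reasoning string from a list of checks is sketched below. `SketchCheck` and `CheckSummarySketch` are hypothetical stand-ins written for this illustration; the library's own `Check` type may carry different fields.

```java
import java.util.List;

// Hypothetical stand-in for the library's Check type, for illustration only.
record SketchCheck(String name, boolean passed, String message) {}

final class CheckSummarySketch {

    // A judgment passes only when every check passed.
    static boolean allPassed(List<SketchCheck> checks) {
        return checks.stream().allMatch(SketchCheck::passed);
    }

    // Builds a reasoning string of the form "<failed> of <total> checks failed".
    static String reasoning(List<SketchCheck> checks) {
        long failed = checks.stream().filter(c -> !c.passed()).count();
        return failed + " of " + checks.size() + " checks failed";
    }
}
```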

Score Types

BooleanScore

Simple pass/fail scoring:

BooleanScore.PASS   // true
BooleanScore.FAIL   // false

NumericalScore

Continuous scoring with bounds and normalization:

// Score with min/max bounds
NumericalScore score = NumericalScore.of(85.0, 0.0, 100.0);

// Auto-normalizes to [0.0, 1.0]
double normalized = score.normalized();  // 0.85
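
The 0.85 result is consistent with plain min-max normalization, (value - min) / (max - min). The sketch below shows that arithmetic under this assumption; `NormalizationSketch` is a hypothetical class, and the bounds checks are a design choice of the sketch rather than documented library behavior.

```java
// Hypothetical sketch of min-max normalization; not the library's NumericalScore.
final class NormalizationSketch {

    // Maps value from [min, max] onto [0.0, 1.0].
    static double normalize(double value, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        if (value < min || value > max) {
            throw new IllegalArgumentException("value must lie within [min, max]");
        }
        return (value - min) / (max - min);
    }
}
```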

CategoricalScore

Discrete categories from an allowed set:

CategoricalScore score = CategoricalScore.of(
    "GOOD",
    Set.of("EXCELLENT", "GOOD", "FAIR", "POOR")
);

The Scores utility class converts between score types for heterogeneous aggregation.

Built-in Judges

Deterministic Judges (agent-judge-core)

FileExistsJudge

Verifies file existence:

FileExistsJudge judge = FileExistsJudge.of("target/output.txt");
Judgment result = judge.judge(context);

FileContentJudge

Verifies file content with match modes:

// Exact match
FileContentJudge.exact("config.json", expectedContent);

// Contains substring
FileContentJudge.contains("output.log", "BUILD SUCCESS");

// Regex pattern
FileContentJudge.regex("version.txt", "\\d+\\.\\d+\\.\\d+");
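
The three match modes can be pictured as follows, applied to content that has already been read from the file. This is a sketch with hypothetical names (`MatchModeSketch`, `Mode`); in particular, whether the library's regex mode requires a full match or accepts a match anywhere in the content is an assumption here.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of the three match modes applied to already-loaded file content.
final class MatchModeSketch {

    enum Mode { EXACT, CONTAINS, REGEX }

    static boolean matches(Mode mode, String content, String expected) {
        return switch (mode) {
            case EXACT -> content.equals(expected);
            case CONTAINS -> content.contains(expected);
            // Assumption: the pattern may match anywhere in the content (find, not full match).
            case REGEX -> Pattern.compile(expected).matcher(content).find();
        };
    }
}
```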

Custom Deterministic Judge

Build rule-based judges:

DeterministicJudge judge = DeterministicJudge.builder()
    .name("config-valid")
    .check(ctx -> validateConfig(ctx.workspace()))
    .build();

Command Judges (agent-judge-exec)

CommandJudge

Execute shell commands and evaluate results:

CommandJudge judge = CommandJudge.builder()
    .name("maven-tests")
    .command("mvn", "test", "-q")
    .expectedExitCode(0)
    .timeout(Duration.ofMinutes(5))
    .build();
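
The underlying idea, run a command, wait up to a timeout, compare the exit code, can be sketched with the JDK alone. `CommandCheckSketch` is a hypothetical class; the real module lists agent-sandbox and zt-exec as dependencies and may behave differently, for example in how it captures output or treats timeouts.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch using only the JDK; not the library's CommandJudge.
final class CommandCheckSketch {

    // Returns true when the command exits with expectedExitCode before the timeout elapses.
    static boolean passes(List<String> command, int expectedExitCode, Duration timeout) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                process.destroyForcibly(); // assumption: a timeout counts as a failure
                return false;
            }
            return process.exitValue() == expectedExitCode;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return false;
        }
    }
}
```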

BuildSuccessJudge

Specialized for build tools with wrapper auto-detection:

// Maven - auto-detects mvnw wrapper
BuildSuccessJudge maven = BuildSuccessJudge.maven("clean", "verify");

// Gradle - auto-detects gradlew wrapper
BuildSuccessJudge gradle = BuildSuccessJudge.gradle("build", "test");

Default timeout: 10 minutes.
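
Wrapper auto-detection could work along these lines: prefer the wrapper script in the workspace and fall back to the globally installed tool. The sketch is an assumption about the approach, not the library's code; `WrapperDetectionSketch` is a hypothetical class, and the file names `mvnw` and `mvnw.cmd` follow the standard Maven Wrapper layout.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of wrapper detection; not the library's BuildSuccessJudge.
final class WrapperDetectionSketch {

    // Prefers the project's wrapper script when present, otherwise falls back to mvn on the PATH.
    static String mavenExecutable(Path workspace) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        Path wrapper = workspace.resolve(windows ? "mvnw.cmd" : "mvnw");
        return Files.isRegularFile(wrapper) ? wrapper.toAbsolutePath().toString() : "mvn";
    }
}
```

The same pattern would apply to Gradle with `gradlew` and `gradlew.bat`.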

LLM Judges (agent-judge-llm)

CorrectnessJudge

Uses an LLM to evaluate whether the agent accomplished its goal:

CorrectnessJudge judge = CorrectnessJudge.builder()
    .chatClient(chatClient)
    .build();

// Returns YES/NO with reasoning
Judgment result = judge.judge(context);

Uses the template method pattern for customization.

Custom LLM Judge

Build custom LLM-powered evaluation:

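LLMJudge judge = new LLMJudge(chatClient) {
    @Override
    protected String buildPrompt(JudgmentContext ctx) {
        // Prompt sent to the model; built from the goal of the agent run
        return "Evaluate code quality for: " + ctx.goal();
    }

    @Override
    protected Judgment parseResponse(String response, JudgmentContext ctx) {
        // Parse the LLM response into a Judgment.
        // Illustrative assumption: a response starting with "YES" counts as a pass.
        boolean passed = response.trim().toUpperCase().startsWith("YES");
        return Judgment.builder()
            .score(passed ? BooleanScore.PASS : BooleanScore.FAIL)
            .status(passed ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
};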

Agent Judges (agent-judge-agent)

Delegate evaluation to an AI agent using a bridge interface:

public interface JudgeAgentClient {
    JudgeAgentResponse execute(String goal, Path workspace);
}

// Adapt any agent client without hard dependencies
JudgeAgentClient adapter = (goal, workspace) ->
    myAgentClient.run(goal, workspace);

Judge Composition

The Judges utility class provides composition operators:

// Logical AND (short-circuit)
Judge combined = Judges.and(fileExistsJudge, contentValidJudge);

// Logical OR (short-circuit)
Judge either = Judges.or(primaryJudge, fallbackJudge);

// Combine multiple judges
Judge all = Judges.allOf(judge1, judge2, judge3);
Judge any = Judges.anyOf(judge1, judge2, judge3);

// Test judges
Judge pass = Judges.alwaysPass("Skipped in test mode");
Judge fail = Judges.alwaysFail("Feature not implemented");

// Add metadata to any judge
Judge named = Judges.named(myJudge, "my-judge", "Description", JudgeType.DETERMINISTIC);
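
Short-circuit composition means the second judge is skipped once the first has decided the outcome, which avoids running an expensive judge unnecessarily. A minimal sketch of that behavior follows; `CompositionSketch` is hypothetical, and `Predicate<String>` stands in for `Judge`, with the `String` playing the role of `JudgmentContext`.

```java
import java.util.function.Predicate;

// Hypothetical sketch: Predicate<String> stands in for Judge and the boolean result
// for a pass/fail Judgment, so the block compiles without the library.
final class CompositionSketch {

    // Passes only if both judges pass; the second is not run when the first fails.
    static Predicate<String> and(Predicate<String> first, Predicate<String> second) {
        return ctx -> first.test(ctx) && second.test(ctx);
    }

    // Passes if either judge passes; the second is not run when the first passes.
    static Predicate<String> or(Predicate<String> first, Predicate<String> second) {
        return ctx -> first.test(ctx) || second.test(ctx);
    }
}
```

Ordering cheap deterministic judges before command or LLM judges lets the short-circuit skip the costly evaluation whenever the cheap one already settles the result.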

Jury System

Combine multiple judges with voting strategies:

SimpleJury jury = SimpleJury.builder()
    .name("comprehensive-check")
    .judge(fileJudge)
    .judge(testJudge, 2.0)  // Weight of 2.0
    .judge(codeQualityJudge)
    .votingStrategy(VotingStrategy.majority())
    .parallel(true)  // Execute judges in parallel (default)
    .build();

Verdict verdict = jury.vote(context);

// Access results
Judgment aggregated = verdict.aggregated();
List<Judgment> individual = verdict.individual();
Map<String, Judgment> byName = verdict.individualByName();
Map<String, Double> weights = verdict.weights();

Voting Strategies

Strategy	Description	Pass Condition
`MajorityVotingStrategy`	Majority vote	passCount > failCount
`ConsensusStrategy`	Unanimous agreement	All judges agree
`AverageVotingStrategy`	Simple average	average >= 0.5
`WeightedAverageStrategy`	Weighted average	weighted avg >= 0.5
`MedianVotingStrategy`	Median (outlier-robust)	median >= 0.5
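
The pass conditions in the table can be written out as plain arithmetic. The sketch below assumes each judgment has already been reduced to a boolean vote or a score normalized to [0.0, 1.0]; `VotingSketch` is a hypothetical class, and edge cases such as empty input or ERROR judgments are handled here by assumption rather than by documented library behavior.

```java
import java.util.Arrays;

// Hypothetical sketch of the pass conditions; scores are assumed normalized to [0.0, 1.0].
final class VotingSketch {

    static boolean majority(boolean[] votes) {
        int pass = 0;
        for (boolean v : votes) if (v) pass++;
        return pass > votes.length - pass; // passCount > failCount, so a tie does not pass
    }

    static boolean average(double[] scores) {
        return Arrays.stream(scores).average().orElse(0.0) >= 0.5;
    }

    static boolean weightedAverage(double[] scores, double[] weights) {
        double weighted = 0.0, total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weighted += scores[i] * weights[i];
            total += weights[i];
        }
        return total > 0 && weighted / total >= 0.5;
    }

    static boolean median(double[] scores) {
        if (scores.length == 0) return false;
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        return median >= 0.5;
    }
}
```

With scores {0.0, 1.0, 0.0} the simple average is about 0.33 and fails, while weights {1.0, 2.0, 1.0} give a weighted average of exactly 0.5 and pass. With scores {0.6, 0.6, 0.0} the average is 0.4 and fails, but the median of 0.6 passes, which is the outlier robustness the table refers to.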

Majority Voting Policies:

MajorityVotingStrategy.builder()
    .tiePolicy(TiePolicy.FAIL)           // PASS, FAIL, or ABSTAIN on ties
    .errorPolicy(ErrorPolicy.TREAT_AS_FAIL)  // How to handle ERROR judgments
    .build();

Jury Utilities

// Create jury from judges with auto-naming
Jury jury = Juries.fromJudges(VotingStrategy.majority(), judge1, judge2, judge3);

// Combine juries into meta-jury
Jury metaJury = Juries.combine(jury1, jury2, VotingStrategy.consensus());

// Create meta-jury from multiple juries
Jury combined = Juries.allOf(VotingStrategy.average(), jury1, jury2, jury3);

Utilities

MavenTestRunner

Run Maven tests with wrapper auto-detection:

ExecResult result = MavenTestRunner.run(projectPath, Duration.ofMinutes(10));
if (result.exitCode() == 0) {
    // Tests passed
}

JaCoCoReportParser

Parse JaCoCo XML reports for coverage metrics:

CoverageMetrics metrics = JaCoCoReportParser.parse(
    projectPath.resolve("target/site/jacoco/jacoco.xml")
);

double lineCoverage = metrics.linePercentage();      // e.g., 85.5
double branchCoverage = metrics.branchPercentage();  // e.g., 72.3
double methodCoverage = metrics.methodPercentage();  // e.g., 90.1
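
A coverage percentage of this kind can be computed from the `counter` elements of a JaCoCo XML report as covered / (covered + missed) * 100. The sketch below illustrates that calculation with the JDK's DOM parser; `JacocoSketch` is a hypothetical class and is not the library's parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch, not the library's JaCoCoReportParser.
final class JacocoSketch {

    // Returns covered / (covered + missed) * 100 for the report-level counter of the given type.
    static double percentage(String xml, String counterType) {
        Element report;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real reports declare a DOCTYPE pointing at report.dtd; skip loading it.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            report = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse JaCoCo XML", e);
        }
        // Only direct children of <report> are totals; nested counters belong to packages and classes.
        for (Node n = report.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element e && e.getTagName().equals("counter")
                    && e.getAttribute("type").equals(counterType)) {
                double covered = Double.parseDouble(e.getAttribute("covered"));
                double missed = Double.parseDouble(e.getAttribute("missed"));
                return covered + missed == 0 ? 0.0 : 100.0 * covered / (covered + missed);
            }
        }
        throw new IllegalArgumentException("No report-level counter of type " + counterType);
    }
}
```

Calling `percentage(xml, "LINE")`, `"BRANCH"`, or `"METHOD"` selects the corresponding counter type from the report totals.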

Spring AI Agents Integration

Agent Judge powers the evaluation system in Agent Client:

CoverageJudge judge = new CoverageJudge(80.0);

AgentClientResponse response = agentClient
    .goal("Increase test coverage to 80%")
    .advisors(JudgeAdvisor.builder().judge(judge).build())
    .run();

Resources

GitHub Repository

Source code and contribution guidelines

Maven Central

Published artifacts

License

Agent Judge is Open Source software released under the Apache 2.0 license.
