
Overview

Agent Judge is an agent-agnostic evaluation framework for verifying AI agent task completion. It provides a pluggable architecture with deterministic rules, command execution, and LLM-powered evaluation - all with zero coupling to any specific agent implementation. The library follows a clean separation of concerns: agent-judge-core has zero external dependencies, while specialized modules add capabilities like process execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).

Core Abstractions

Judge

Functional interface for evaluation logic - takes JudgmentContext, returns Judgment

Judgment

Result containing Score, JudgmentStatus, reasoning, and granular Checks

Score

Sealed interface: BooleanScore, NumericalScore, or CategoricalScore

Jury

Multi-judge aggregation with configurable voting strategies

Module Structure

| Module | Description | Dependencies |
| --- | --- | --- |
| agent-judge-core | Core Judge API and abstractions | None (zero deps) |
| agent-judge-exec | Command execution judges | agent-sandbox, zt-exec |
| agent-judge-llm | LLM-powered evaluation | spring-ai-client-chat |
| agent-judge-agent | Agent-as-judge bridge interface | Core only |
| agent-judge-advisor | AgentClient advisors | spring-ai-agent-client |
| agent-judge-bom | Bill of Materials | N/A |

Quick Start

Maven BOM

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springaicommunity.judge</groupId>
            <artifactId>agent-judge-bom</artifactId>
            <version>0.1.0-SNAPSHOT</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springaicommunity.judge</groupId>
        <artifactId>agent-judge-core</artifactId>
    </dependency>
</dependencies>

Judge Interface

The core Judge interface is a functional interface for lambda support:
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
Async and Reactive Variants:
// For CompletableFuture-based async
public interface AsyncJudge {
    CompletableFuture<Judgment> judgeAsync(JudgmentContext context);
}

// For Spring WebFlux / Project Reactor
public interface ReactiveJudge {
    Mono<Judgment> judge(JudgmentContext context);
}
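
As a rough illustration of how the synchronous and CompletableFuture-based variants relate, the sketch below wraps a blocking judge in an async one. The nested `Context`, `Result`, `Judge`, and `AsyncJudge` types and the `toAsync` helper are simplified stand-ins invented for this example (Java 17+), not the library's own classes.

```java
import java.util.concurrent.CompletableFuture;

// Stand-in types for illustration only; the real Judge, AsyncJudge,
// JudgmentContext, and Judgment live in agent-judge-core.
final class AsyncJudgeSketch {
    record Context(String goal, String agentOutput) {}
    record Result(boolean pass, String reasoning) {}

    @FunctionalInterface
    interface Judge { Result judge(Context context); }

    @FunctionalInterface
    interface AsyncJudge { CompletableFuture<Result> judgeAsync(Context context); }

    // Hypothetical helper: run a blocking judge on the common ForkJoinPool.
    static AsyncJudge toAsync(Judge judge) {
        return context -> CompletableFuture.supplyAsync(() -> judge.judge(context));
    }

    public static void main(String[] args) {
        // A judge written as a lambda: pass when the agent produced any output.
        Judge nonEmpty = ctx -> ctx.agentOutput().isBlank()
            ? new Result(false, "Agent produced no output")
            : new Result(true, "Agent produced output");

        Result result = toAsync(nonEmpty)
            .judgeAsync(new Context("Increase test coverage to 80%", "Added 15 test cases..."))
            .join();
        System.out.println(result.pass() + ": " + result.reasoning());
    }
}
```

Because the judge is a single-method interface, any lambda with the right shape can be adapted this way without touching the judge itself.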

Judgment Context

Complete evaluation input with all context about an agent execution:
JudgmentContext context = JudgmentContext.builder()
    .goal("Increase test coverage to 80%")
    .workspace(Path.of("/project"))
    .executionTime(Duration.ofMinutes(5))
    .startedAt(Instant.now())
    .agentOutput("Added 15 test cases...")
    .status(ExecutionStatus.SUCCESS)
    .build();
ExecutionStatus values: SUCCESS, FAILED, TIMEOUT, CANCELLED, UNKNOWN

Judgment Results

Judgment result = judge.judge(context);

// Result properties
Score score = result.score();              // BooleanScore, NumericalScore, or CategoricalScore
JudgmentStatus status = result.status();   // PASS, FAIL, ABSTAIN, or ERROR
String reasoning = result.reasoning();      // Human-readable explanation
List<Check> checks = result.checks();       // Granular sub-assertions
JudgmentStatus values: PASS, FAIL, ABSTAIN, ERROR

Checks (Sub-Assertions)

Provide granular failure reporting:
Judgment.builder()
    .score(BooleanScore.FAIL)
    .status(JudgmentStatus.FAIL)
    .reasoning("2 of 3 checks failed")
    .check(Check.passed("file-exists", "output.txt exists"))
    .check(Check.failed("content-valid", "Missing required header"))
    .check(Check.failed("format-correct", "Invalid JSON structure"))
    .build();
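
A reasoning string like the one above can be derived from the checks rather than written by hand. The following is a minimal sketch under that assumption; the `Check` record and `summarize` method are hypothetical stand-ins, not the library's API.

```java
import java.util.List;

final class CheckSummarySketch {
    // Stand-in for the library's Check type: name, outcome, and message.
    record Check(String name, boolean passed, String message) {}

    // Builds a reasoning string such as "2 of 3 checks failed".
    static String summarize(List<Check> checks) {
        long failed = checks.stream().filter(c -> !c.passed()).count();
        return failed == 0
            ? "All " + checks.size() + " checks passed"
            : failed + " of " + checks.size() + " checks failed";
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
            new Check("file-exists", true, "output.txt exists"),
            new Check("content-valid", false, "Missing required header"),
            new Check("format-correct", false, "Invalid JSON structure"));
        System.out.println(summarize(checks)); // prints "2 of 3 checks failed"
    }
}
```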

Score Types

BooleanScore provides simple pass/fail scoring:
BooleanScore.PASS   // true
BooleanScore.FAIL   // false
NumericalScore provides continuous scoring with bounds and normalization:
// Score with min/max bounds
NumericalScore score = NumericalScore.of(85.0, 0.0, 100.0);

// Auto-normalizes to [0.0, 1.0]
double normalized = score.normalized();  // 0.85
CategoricalScore holds a discrete category from an allowed set:
CategoricalScore score = CategoricalScore.of(
    "GOOD",
    Set.of("EXCELLENT", "GOOD", "FAIR", "POOR")
);
The Scores utility class converts between score types for heterogeneous aggregation.
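
The normalization shown for NumericalScore is consistent with plain min-max scaling, (value - min) / (max - min). The sketch below works through that arithmetic. The class and method names are hypothetical, and the boolean-to-numeric mapping (PASS as 1.0, FAIL as 0.0) is an assumed convention for mixing score types, not a documented one.

```java
final class ScoreNormalizationSketch {
    // Min-max normalization: maps a value in [min, max] onto [0.0, 1.0].
    static double normalize(double value, double min, double max) {
        if (max <= min) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        if (value < min || value > max) {
            throw new IllegalArgumentException("value must lie within [min, max]");
        }
        return (value - min) / (max - min);
    }

    // Assumed convention for aggregating mixed score types: PASS -> 1.0, FAIL -> 0.0.
    static double fromBoolean(boolean pass) {
        return pass ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(normalize(85.0, 0.0, 100.0)); // 85 on a 0..100 scale
        System.out.println(normalize(3.0, 1.0, 5.0));    // 3 on a 1..5 scale
        System.out.println(fromBoolean(true));
    }
}
```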

Built-in Judges

Deterministic Judges (agent-judge-core)

FileExistsJudge verifies file existence:
FileExistsJudge judge = FileExistsJudge.of("target/output.txt");
Judgment result = judge.judge(context);
FileContentJudge verifies file content with match modes:
// Exact match
FileContentJudge.exact("config.json", expectedContent);

// Contains substring
FileContentJudge.contains("output.log", "BUILD SUCCESS");

// Regex pattern
FileContentJudge.regex("version.txt", "\\d+\\.\\d+\\.\\d+");
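
To make the three match modes concrete, here is a minimal sketch of how such a comparison could work. The `MatchMode` enum and both `matches` methods are hypothetical, and the regex mode is assumed to search anywhere in the content rather than require a full match; the library may behave differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

final class FileContentMatchSketch {
    enum MatchMode { EXACT, CONTAINS, REGEX }

    // Pure comparison step, kept separate from file I/O so it is easy to test.
    static boolean matches(String content, MatchMode mode, String expected) {
        return switch (mode) {
            case EXACT -> content.equals(expected);
            case CONTAINS -> content.contains(expected);
            // Assumption: the pattern may match anywhere in the content.
            case REGEX -> Pattern.compile(expected).matcher(content).find();
        };
    }

    // Reads the file and applies the comparison; a missing file never matches.
    static boolean matches(Path file, MatchMode mode, String expected) {
        if (!Files.isRegularFile(file)) {
            return false;
        }
        try {
            return matches(Files.readString(file), mode, expected);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("BUILD SUCCESS in 3s", MatchMode.CONTAINS, "BUILD SUCCESS"));
        System.out.println(matches("1.4.2", MatchMode.REGEX, "\\d+\\.\\d+\\.\\d+"));
        System.out.println(matches(Path.of("no-such-file.txt"), MatchMode.EXACT, ""));
    }
}
```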
Build rule-based judges with DeterministicJudge:
DeterministicJudge judge = DeterministicJudge.builder()
    .name("config-valid")
    .check(ctx -> validateConfig(ctx.workspace()))
    .build();

Command Judges (agent-judge-exec)

CommandJudge executes shell commands and evaluates the results:
CommandJudge judge = CommandJudge.builder()
    .name("maven-tests")
    .command("mvn", "test", "-q")
    .expectedExitCode(0)
    .timeout(Duration.ofMinutes(5))
    .build();
BuildSuccessJudge is specialized for build tools, with wrapper auto-detection:
// Maven - auto-detects mvnw wrapper
BuildSuccessJudge maven = BuildSuccessJudge.maven("clean", "verify");

// Gradle - auto-detects gradlew wrapper
BuildSuccessJudge gradle = BuildSuccessJudge.gradle("build", "test");
Default timeout: 10 minutes.
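
Wrapper auto-detection and a timeout-bounded run could look roughly like the sketch below, which uses only the JDK's ProcessBuilder. The class and method names are hypothetical, and the detection rule (use `mvnw` or `mvnw.cmd` from the workspace root when present, otherwise `mvn` from the PATH) is an assumption about how such detection typically works, not a description of the library's internals.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class BuildCommandSketch {
    // Prefer the project's wrapper script when present, else the tool on the PATH.
    static List<String> mavenCommand(Path workspace, String... goals) {
        boolean windows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        Path wrapper = workspace.resolve(windows ? "mvnw.cmd" : "mvnw");
        List<String> command = new ArrayList<>();
        command.add(Files.isRegularFile(wrapper) ? wrapper.toAbsolutePath().toString() : "mvn");
        command.addAll(List.of(goals));
        return command;
    }

    // Runs the command in the workspace; true only for exit code 0 within the timeout.
    static boolean runsSuccessfully(Path workspace, List<String> command, Duration timeout)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
            .directory(workspace.toFile())
            .inheritIO() // forward output so the child never blocks on a full pipe
            .start();
        if (!process.waitFor(timeout.toMillis(), TimeUnit.MILLISECONDS)) {
            process.destroyForcibly(); // timed out: kill the build and report failure
            return false;
        }
        return process.exitValue() == 0;
    }

    public static void main(String[] args) {
        // Only prints the resolved command; nothing is executed here.
        System.out.println(mavenCommand(Path.of("."), "clean", "verify"));
    }
}
```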

LLM Judges (agent-judge-llm)

CorrectnessJudge uses an LLM to evaluate whether the agent accomplished its goal:
CorrectnessJudge judge = CorrectnessJudge.builder()
    .chatClient(chatClient)
    .build();

// Returns YES/NO with reasoning
Judgment result = judge.judge(context);
LLMJudge uses the template method pattern for customization. Override buildPrompt and parseResponse to build custom LLM-powered evaluation:
LLMJudge judge = new LLMJudge(chatClient) {
    @Override
    protected String buildPrompt(JudgmentContext ctx) {
        return "Evaluate code quality for: " + ctx.goal();
    }

    @Override
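    protected Judgment parseResponse(String response, JudgmentContext ctx) {
        // Parse LLM response into Judgment (placeholder so the example compiles)
        throw new UnsupportedOperationException("parseResponse is not implemented yet");
    }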
};
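
The template method pattern mentioned above can be sketched without any LLM dependency. In the example below, a `Function<String, String>` stands in for the chat client, and `PromptJudge`, `Result`, and `yesNoJudge` are hypothetical names. The YES/NO convention is an assumption enforced by the example's own prompt, not something the library guarantees.

```java
import java.util.Locale;
import java.util.function.Function;

final class TemplateJudgeSketch {
    record Result(boolean pass, String reasoning) {}

    // Template method: evaluate() fixes the flow, subclasses supply the two steps.
    abstract static class PromptJudge {
        private final Function<String, String> model; // stand-in for a chat client

        PromptJudge(Function<String, String> model) {
            this.model = model;
        }

        final Result evaluate(String goal) {
            String prompt = buildPrompt(goal);
            String response = model.apply(prompt);
            return parseResponse(response);
        }

        protected abstract String buildPrompt(String goal);

        protected abstract Result parseResponse(String response);
    }

    // A concrete judge that asks for a YES/NO answer and parses the first word.
    static PromptJudge yesNoJudge(Function<String, String> model) {
        return new PromptJudge(model) {
            @Override
            protected String buildPrompt(String goal) {
                return "Did the agent accomplish this goal? Answer YES or NO, then explain.\nGoal: " + goal;
            }

            @Override
            protected Result parseResponse(String response) {
                String trimmed = response.strip();
                boolean pass = trimmed.toUpperCase(Locale.ROOT).startsWith("YES");
                return new Result(pass, trimmed);
            }
        };
    }

    public static void main(String[] args) {
        // A canned response replaces the model so the sketch runs offline.
        Result result = yesNoJudge(prompt -> "YES. Fifteen test cases were added.")
            .evaluate("Increase test coverage to 80%");
        System.out.println(result.pass() + ": " + result.reasoning());
    }
}
```

Keeping the model behind a plain function makes the parsing logic testable with canned responses.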

Agent Judges (agent-judge-agent)

Delegate evaluation to an AI agent using a bridge interface:
public interface JudgeAgentClient {
    JudgeAgentResponse execute(String goal, Path workspace);
}

// Adapt any agent client without hard dependencies
JudgeAgentClient adapter = (goal, workspace) ->
    myAgentClient.run(goal, workspace);

Judge Composition

The Judges utility class provides composition operators:
// Logical AND (short-circuit)
Judge combined = Judges.and(fileExistsJudge, contentValidJudge);

// Logical OR (short-circuit)
Judge either = Judges.or(primaryJudge, fallbackJudge);

// Combine multiple judges
Judge all = Judges.allOf(judge1, judge2, judge3);
Judge any = Judges.anyOf(judge1, judge2, judge3);

// Test judges
Judge pass = Judges.alwaysPass("Skipped in test mode");
Judge fail = Judges.alwaysFail("Feature not implemented");

// Add metadata to any judge
Judge named = Judges.named(myJudge, "my-judge", "Description", JudgeType.DETERMINISTIC);
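
Short-circuit composition means a later judge is skipped once the outcome is already decided. The following is a minimal sketch of that behaviour with stand-in types; `Result`, `Judge`, `and`, and `or` here are illustrative and take a plain String context rather than the library's JudgmentContext.

```java
final class CompositionSketch {
    record Result(boolean pass, String reasoning) {}

    @FunctionalInterface
    interface Judge { Result judge(String context); }

    // AND: the second judge runs only if the first one passes.
    static Judge and(Judge first, Judge second) {
        return context -> {
            Result r = first.judge(context);
            return r.pass() ? second.judge(context) : r;
        };
    }

    // OR: the second judge runs only if the first one fails.
    static Judge or(Judge first, Judge second) {
        return context -> {
            Result r = first.judge(context);
            return r.pass() ? r : second.judge(context);
        };
    }

    public static void main(String[] args) {
        Judge cheap = ctx -> new Result(ctx.contains("output"), "cheap check");
        Judge expensive = ctx -> {
            System.out.println("expensive judge invoked");
            return new Result(true, "expensive check");
        };
        // The cheap judge fails here, so the expensive one is never invoked.
        System.out.println(and(cheap, expensive).judge("nothing produced").reasoning());
    }
}
```

Ordering cheap deterministic judges before expensive ones (builds, LLM calls) lets the short-circuit save the most work.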

Jury System

Combine multiple judges with voting strategies:
SimpleJury jury = SimpleJury.builder()
    .name("comprehensive-check")
    .judge(fileJudge)
    .judge(testJudge, 2.0)  // Weight of 2.0
    .judge(codeQualityJudge)
    .votingStrategy(VotingStrategy.majority())
    .parallel(true)  // Execute judges in parallel (default)
    .build();

Verdict verdict = jury.vote(context);

// Access results
Judgment aggregated = verdict.aggregated();
List<Judgment> individual = verdict.individual();
Map<String, Judgment> byName = verdict.individualByName();
Map<String, Double> weights = verdict.weights();

Voting Strategies

| Strategy | Description | Pass Condition |
| --- | --- | --- |
| MajorityVotingStrategy | Majority vote | passCount > failCount |
| ConsensusStrategy | Unanimous agreement | All judges agree |
| AverageVotingStrategy | Simple average | average >= 0.5 |
| WeightedAverageStrategy | Weighted average | weighted avg >= 0.5 |
| MedianVotingStrategy | Median (outlier-robust) | median >= 0.5 |
Majority Voting Policies:
MajorityVotingStrategy.builder()
    .tiePolicy(TiePolicy.FAIL)           // PASS, FAIL, or ABSTAIN on ties
    .errorPolicy(ErrorPolicy.TREAT_AS_FAIL)  // How to handle ERROR judgments
    .build();
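
The pass conditions in the table reduce to simple arithmetic over the individual results. The sketch below illustrates majority voting with a tie policy, the weighted average, and the median, assuming each judgment has already been converted to a pass/fail vote or a score in [0.0, 1.0]. All names are hypothetical and error handling is reduced to a minimum.

```java
import java.util.Arrays;

final class VotingSketch {
    enum Outcome { PASS, FAIL, ABSTAIN }

    // Majority: passCount > failCount passes; ties resolve to the given policy.
    static Outcome majority(boolean[] votes, Outcome onTie) {
        int pass = 0;
        for (boolean vote : votes) {
            if (vote) pass++;
        }
        int fail = votes.length - pass;
        if (pass == fail) return onTie;
        return pass > fail ? Outcome.PASS : Outcome.FAIL;
    }

    // Weighted average: sum(score * weight) / sum(weight).
    static double weightedAverage(double[] scores, double[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("need one weight per score");
        }
        double sum = 0.0;
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            sum += scores[i] * weights[i];
            total += weights[i];
        }
        return sum / total;
    }

    // Median: middle value of the sorted scores, so one outlier cannot drag it.
    static double median(double[] scores) {
        if (scores.length == 0) throw new IllegalArgumentException("no scores");
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Threshold shared by the average-based strategies in the table.
    static boolean passes(double aggregate) {
        return aggregate >= 0.5;
    }

    public static void main(String[] args) {
        System.out.println(majority(new boolean[] {true, false}, Outcome.ABSTAIN));
        System.out.println(passes(weightedAverage(new double[] {1.0, 0.0, 1.0}, new double[] {1.0, 2.0, 1.0})));
        System.out.println(median(new double[] {0.9, 0.1, 0.8}));
    }
}
```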

Jury Utilities

// Create jury from judges with auto-naming
Jury jury = Juries.fromJudges(VotingStrategy.majority(), judge1, judge2, judge3);

// Combine juries into meta-jury
Jury metaJury = Juries.combine(jury1, jury2, VotingStrategy.consensus());

// Create meta-jury from multiple juries
Jury combined = Juries.allOf(VotingStrategy.average(), jury1, jury2, jury3);

Utilities

MavenTestRunner

Run Maven tests with wrapper auto-detection:
ExecResult result = MavenTestRunner.run(projectPath, Duration.ofMinutes(10));
if (result.exitCode() == 0) {
    // Tests passed
}

JaCoCoReportParser

Parse JaCoCo XML reports for coverage metrics:
CoverageMetrics metrics = JaCoCoReportParser.parse(
    projectPath.resolve("target/site/jacoco/jacoco.xml")
);

double lineCoverage = metrics.linePercentage();      // e.g., 85.5
double branchCoverage = metrics.branchPercentage();  // e.g., 72.3
double methodCoverage = metrics.methodPercentage();  // e.g., 90.1
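
For readers curious where such percentages come from: a JaCoCo XML report carries `counter` elements with `missed` and `covered` attributes, and the report-level counters are direct children of the root `report` element. The sketch below computes covered / (covered + missed) * 100 using only the JDK's DOM parser. It is an independent illustration with hypothetical names, not the JaCoCoReportParser implementation.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class JacocoCounterSketch {
    // Percentage for a report-level counter type such as LINE, BRANCH, or METHOD.
    static double percentage(String reportXml, String counterType) {
        Element report;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Real JaCoCo reports declare an external DTD; skip fetching it.
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            report = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(reportXml)))
                .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse JaCoCo report", e);
        }
        // Only direct children of <report> hold the whole-project totals.
        for (Node n = report.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element e
                    && e.getTagName().equals("counter")
                    && e.getAttribute("type").equals(counterType)) {
                double missed = Double.parseDouble(e.getAttribute("missed"));
                double covered = Double.parseDouble(e.getAttribute("covered"));
                double total = missed + covered;
                return total == 0 ? 0.0 : 100.0 * covered / total;
            }
        }
        throw new IllegalArgumentException("No report-level counter of type " + counterType);
    }

    public static void main(String[] args) {
        String xml = """
            <report name="demo">
              <counter type="LINE" missed="15" covered="85"/>
              <counter type="BRANCH" missed="1" covered="3"/>
            </report>
            """;
        System.out.println(percentage(xml, "LINE"));
        System.out.println(percentage(xml, "BRANCH"));
    }
}
```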

Spring AI Agents Integration

Agent Judge powers the evaluation system in Agent Client:
CoverageJudge judge = new CoverageJudge(80.0);

AgentClientResponse response = agentClient
    .goal("Increase test coverage to 80%")
    .advisors(JudgeAdvisor.builder().judge(judge).build())
    .run();

License

Agent Judge is Open Source software released under the Apache 2.0 license.