Overview
Agent Judge is an agent-agnostic evaluation framework for verifying AI agent task completion. It provides a pluggable architecture with deterministic rules, command execution, and LLM-powered evaluation, all with zero coupling to any specific agent implementation.
The library follows a clean separation of concerns: agent-judge-core has zero external dependencies, while specialized modules add capabilities like process execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).
Core Abstractions
Judge: Functional interface for evaluation logic; takes a JudgmentContext, returns a Judgment
Judgment: Result containing Score, JudgmentStatus, reasoning, and granular Checks
Score: Sealed interface: BooleanScore, NumericalScore, or CategoricalScore
Jury: Multi-judge aggregation with configurable voting strategies
Module Structure
| Module | Description | Dependencies |
| --- | --- | --- |
| agent-judge-core | Core Judge API and abstractions | None (zero deps) |
| agent-judge-exec | Command execution judges | agent-sandbox, zt-exec |
| agent-judge-llm | LLM-powered evaluation | spring-ai-client-chat |
| agent-judge-agent | Agent-as-judge bridge interface | Core only |
| agent-judge-advisor | AgentClient advisors | spring-ai-agent-client |
| agent-judge-bom | Bill of Materials | N/A |
Quick Start
Maven BOM
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.springaicommunity.judge</groupId>
      <artifactId>agent-judge-bom</artifactId>
      <version>0.9.1</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>org.springaicommunity.judge</groupId>
    <artifactId>agent-judge-core</artifactId>
  </dependency>
</dependencies>
Judge Interface
The core Judge interface is a functional interface, so a judge can be written as a lambda:
@FunctionalInterface
public interface Judge {
    Judgment judge(JudgmentContext context);
}
Async and Reactive Variants:
// For CompletableFuture-based async
public interface AsyncJudge {
    CompletableFuture<Judgment> judgeAsync(JudgmentContext context);
}
// For Spring WebFlux / Project Reactor
public interface ReactiveJudge {
    Mono<Judgment> judge(JudgmentContext context);
}
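As a rough illustration of how a blocking evaluation can be exposed through a CompletableFuture-based signature, the sketch below wraps a synchronous function in `CompletableFuture.supplyAsync`. `Function<C, R>` is only a stand-in for a judge, and `AsyncAdapterSketch` and `toAsync` are hypothetical names; the library may provide its own adapter or work differently.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

// Illustrative only: Function<C, R> stands in for a synchronous judge.
// These are not the library's Judge or AsyncJudge types.
final class AsyncAdapterSketch {

    // Wraps a blocking evaluation so that it runs on the supplied executor.
    static <C, R> Function<C, CompletableFuture<R>> toAsync(Function<C, R> judge, Executor executor) {
        return context -> CompletableFuture.supplyAsync(() -> judge.apply(context), executor);
    }

    public static void main(String[] args) {
        // Hypothetical rule: the agent output must not be blank.
        Function<String, Boolean> nonBlankOutput = output -> !output.isBlank();

        // Runnable::run executes on the calling thread; a real setup would pass a thread pool.
        Function<String, CompletableFuture<Boolean>> async = toAsync(nonBlankOutput, Runnable::run);

        System.out.println(async.apply("Added 15 test cases...").join());
    }
}
```

Passing the executor explicitly keeps the choice of thread pool with the caller rather than relying on the common pool.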
Judgment Context
The complete evaluation input, with all context about an agent execution:
JudgmentContext context = JudgmentContext.builder()
    .goal("Increase test coverage to 80%")
    .workspace(Path.of("/project"))
    .executionTime(Duration.ofMinutes(5))
    .startedAt(Instant.now())
    .agentOutput("Added 15 test cases...")
    .status(ExecutionStatus.SUCCESS)
    .build();
ExecutionStatus values: SUCCESS, FAILED, TIMEOUT, CANCELLED, UNKNOWN
Judgment Results
Judgment result = judge.judge(context);
// Result properties
Score score = result.score();             // BooleanScore, NumericalScore, or CategoricalScore
JudgmentStatus status = result.status();  // PASS, FAIL, ABSTAIN, or ERROR
String reasoning = result.reasoning();    // Human-readable explanation
List<Check> checks = result.checks();     // Granular sub-assertions
JudgmentStatus values: PASS, FAIL, ABSTAIN, ERROR
Checks (Sub-Assertions)
Provide granular failure reporting:
Judgment.builder()
    .score(BooleanScore.FAIL)
    .status(JudgmentStatus.FAIL)
    .reasoning("2 of 3 checks failed")
    .check(Check.passed("file-exists", "output.txt exists"))
    .check(Check.failed("content-valid", "Missing required header"))
    .check(Check.failed("format-correct", "Invalid JSON structure"))
    .build();
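The reasoning string in the example above can be derived from the checks themselves rather than written by hand. The sketch below shows one way to do that with a simplified stand-in for a check. `CheckSummarySketch` and its nested `Check` class are hypothetical and are not the library's types.

```java
import java.util.List;

// Illustrative stand-in types; not the library's Check or Judgment classes.
final class CheckSummarySketch {

    static final class Check {
        final String name;
        final boolean passed;
        final String message;

        Check(String name, boolean passed, String message) {
            this.name = name;
            this.passed = passed;
            this.message = message;
        }
    }

    // Counts the checks that did not pass.
    static long failedCount(List<Check> checks) {
        return checks.stream().filter(check -> !check.passed).count();
    }

    // Builds a one-line summary from the individual check results.
    static String reasoning(List<Check> checks) {
        long failed = failedCount(checks);
        return failed == 0
                ? "All " + checks.size() + " checks passed"
                : failed + " of " + checks.size() + " checks failed";
    }

    public static void main(String[] args) {
        List<Check> checks = List.of(
                new Check("file-exists", true, "output.txt exists"),
                new Check("content-valid", false, "Missing required header"),
                new Check("format-correct", false, "Invalid JSON structure"));
        System.out.println(reasoning(checks));
    }
}
```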
Score Types
Simple pass/fail scoring:
BooleanScore.PASS  // true
BooleanScore.FAIL  // false
Continuous scoring with bounds and normalization:
// Score with min/max bounds
NumericalScore score = NumericalScore.of(85.0, 0.0, 100.0);
// Auto-normalizes to [0.0, 1.0]
double normalized = score.normalized();  // 0.85
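The 0.85 in the example is consistent with plain min-max normalization, (value - min) / (max - min). The sketch below spells out that arithmetic. `NormalizationSketch` is a hypothetical name, and rejecting out-of-range values is an assumption; the library may clamp or handle them differently.

```java
// Illustrative only: shows the min-max arithmetic, not the library's NumericalScore.
final class NormalizationSketch {

    // Maps a value in [min, max] onto [0.0, 1.0].
    static double normalize(double value, double min, double max) {
        if (!(max > min)) {
            throw new IllegalArgumentException("max must be greater than min");
        }
        if (value < min || value > max) {
            // Assumption: out-of-range values are rejected rather than clamped.
            throw new IllegalArgumentException("value must lie within [min, max]");
        }
        return (value - min) / (max - min);
    }

    public static void main(String[] args) {
        System.out.println(normalize(85.0, 0.0, 100.0));
        System.out.println(normalize(3.0, 1.0, 5.0));
    }
}
```

Normalizing every numerical score to the same [0.0, 1.0] range is what makes scores with different bounds comparable when they are later averaged.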
Discrete categories from an allowed set:
CategoricalScore score = CategoricalScore.of(
    "GOOD",
    Set.of("EXCELLENT", "GOOD", "FAIR", "POOR")
);
The Scores utility class converts between score types for heterogeneous aggregation.
Built-in Judges
Deterministic Judges (agent-judge-core)
Verifies file existence:
FileExistsJudge judge = FileExistsJudge.of("target/output.txt");
Judgment result = judge.judge(context);
Verifies file content with match modes:
// Exact match
FileContentJudge.exact("config.json", expectedContent);
// Contains substring
FileContentJudge.contains("output.log", "BUILD SUCCESS");
// Regex pattern
FileContentJudge.regex("version.txt", "\\d+\\.\\d+\\.\\d+");
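The version pattern from the regex example can be tried out with `java.util.regex` alone. The source does not say whether the regex mode requires the whole file content to match or only a substring, so the sketch below shows both behaviours. `VersionPatternSketch` and its method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative only: exercises the pattern with the JDK, not with FileContentJudge.
final class VersionPatternSketch {

    // Same pattern as in the example: three dot-separated groups of digits.
    static final Pattern VERSION = Pattern.compile("\\d+\\.\\d+\\.\\d+");

    // True only if the entire content is a version string.
    static boolean matchesWhole(String content) {
        return VERSION.matcher(content).matches();
    }

    // True if a version string appears anywhere in the content.
    static boolean containsMatch(String content) {
        return VERSION.matcher(content).find();
    }

    public static void main(String[] args) {
        String fileContent = "version=1.4.2\n";
        System.out.println(matchesWhole(fileContent));
        System.out.println(containsMatch(fileContent));
    }
}
```

A trailing newline or a prefix such as `version=` is enough to make a whole-content match fail, so it is worth checking which semantics a given judge uses before relying on it.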
Custom Deterministic Judge
Build rule-based judges:
DeterministicJudge judge = DeterministicJudge.builder()
    .name("config-valid")
    .check(ctx -> validateConfig(ctx.workspace()))
    .build();
Command Judges (agent-judge-exec)
Execute shell commands and evaluate results:
CommandJudge judge = CommandJudge.builder()
    .name("maven-tests")
    .command("mvn", "test", "-q")
    .expectedExitCode(0)
    .timeout(Duration.ofMinutes(5))
    .build();
Specialized for build tools with wrapper auto-detection:
// Maven - auto-detects mvnw wrapper
BuildSuccessJudge maven = BuildSuccessJudge.maven("clean", "verify");
// Gradle - auto-detects gradlew wrapper
BuildSuccessJudge gradle = BuildSuccessJudge.gradle("build", "test");
Default timeout: 10 minutes.
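Wrapper auto-detection usually means preferring a wrapper script in the project directory and falling back to the globally installed tool. The sketch below shows that idea for Maven. It is an assumption about the general approach, not the library's actual lookup logic, and `WrapperDetectionSketch` and `mavenCommand` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: one plausible way to pick between mvnw and mvn.
final class WrapperDetectionSketch {

    // Builds the command line, using the project's wrapper script when it exists.
    static List<String> mavenCommand(Path projectDir, String... goals) {
        boolean windows = System.getProperty("os.name", "").toLowerCase().startsWith("windows");
        Path wrapper = projectDir.resolve(windows ? "mvnw.cmd" : "mvnw");

        List<String> command = new ArrayList<>();
        // Fall back to the mvn found on the PATH when no wrapper is present.
        command.add(Files.isRegularFile(wrapper) ? wrapper.toAbsolutePath().toString() : "mvn");
        command.addAll(List.of(goals));
        return command;
    }

    public static void main(String[] args) {
        System.out.println(mavenCommand(Path.of("."), "clean", "verify"));
    }
}
```

The resulting list can be handed to `ProcessBuilder` together with a timeout; the same pattern applies to `gradlew` and `gradle`.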
LLM Judges (agent-judge-llm)
Uses an LLM to evaluate whether the agent accomplished its goal:
CorrectnessJudge judge = CorrectnessJudge.builder()
    .chatClient(chatClient)
    .build();
// Returns YES/NO with reasoning
Judgment result = judge.judge(context);
Uses the template method pattern for customization.
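Build custom LLM-powered evaluation:
LLMJudge judge = new LLMJudge(chatClient) {
    @Override
    protected String buildPrompt(JudgmentContext ctx) {
        return "Evaluate code quality for: " + ctx.goal();
    }
    @Override
    protected Judgment parseResponse(String response, JudgmentContext ctx) {
        // Parse LLM response into Judgment.
        // Illustrative parsing only: treats a reply that starts with "YES" as a pass.
        boolean pass = response.trim().toUpperCase().startsWith("YES");
        return Judgment.builder()
            .score(pass ? BooleanScore.PASS : BooleanScore.FAIL)
            .status(pass ? JudgmentStatus.PASS : JudgmentStatus.FAIL)
            .reasoning(response)
            .build();
    }
};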
Agent Judges (agent-judge-agent)
Delegate evaluation to an AI agent using a bridge interface:
public interface JudgeAgentClient {
    JudgeAgentResponse execute(String goal, Path workspace);
}
// Adapt any agent client without hard dependencies
JudgeAgentClient adapter = (goal, workspace) ->
    myAgentClient.run(goal, workspace);
Judge Composition
The Judges utility class provides composition operators:
// Logical AND (short-circuit)
Judge combined = Judges.and(fileExistsJudge, contentValidJudge);
// Logical OR (short-circuit)
Judge either = Judges.or(primaryJudge, fallbackJudge);
// Combine multiple judges
Judge all = Judges.allOf(judge1, judge2, judge3);
Judge any = Judges.anyOf(judge1, judge2, judge3);
// Test judges
Judge pass = Judges.alwaysPass("Skipped in test mode");
Judge fail = Judges.alwaysFail("Feature not implemented");
// Add metadata to any judge
Judge named = Judges.named(myJudge, "my-judge", "Description", JudgeType.DETERMINISTIC);
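Short-circuiting matters when later judges are expensive, for example a build or an LLM call. The sketch below illustrates the semantics with `Predicate<C>` standing in for a judge: an AND stops at the first failure and an OR stops at the first success. `CompositionSketch` is a hypothetical name, and the behaviour for an empty list of judges is an assumption.

```java
import java.util.function.Predicate;

// Illustrative only: Predicate<C> stands in for a judge; not the library's Judges class.
final class CompositionSketch {

    // Logical AND: evaluation stops at the first judge that fails.
    @SafeVarargs
    static <C> Predicate<C> allOf(Predicate<C>... judges) {
        return context -> {
            for (Predicate<C> judge : judges) {
                if (!judge.test(context)) {
                    return false;
                }
            }
            return true; // Assumption: an empty AND passes.
        };
    }

    // Logical OR: evaluation stops at the first judge that passes.
    @SafeVarargs
    static <C> Predicate<C> anyOf(Predicate<C>... judges) {
        return context -> {
            for (Predicate<C> judge : judges) {
                if (judge.test(context)) {
                    return true;
                }
            }
            return false; // Assumption: an empty OR fails.
        };
    }

    public static void main(String[] args) {
        Predicate<String> fileExists = workspace -> false;
        Predicate<String> contentValid = workspace -> {
            System.out.println("content judge ran");
            return true;
        };
        // The second judge is never invoked because the first one already failed.
        System.out.println(allOf(fileExists, contentValid).test("/project"));
    }
}
```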
Jury System
Combine multiple judges with voting strategies:
SimpleJury jury = SimpleJury.builder()
    .name("comprehensive-check")
    .judge(fileJudge)
    .judge(testJudge, 2.0)  // Weight of 2.0
    .judge(codeQualityJudge)
    .votingStrategy(VotingStrategy.majority())
    .parallel(true)  // Execute judges in parallel (default)
    .build();
Verdict verdict = jury.vote(context);
// Access results
Judgment aggregated = verdict.aggregated();
List<Judgment> individual = verdict.individual();
Map<String, Judgment> byName = verdict.individualByName();
Map<String, Double> weights = verdict.weights();
Voting Strategies
| Strategy | Description | Pass Condition |
| --- | --- | --- |
| MajorityVotingStrategy | Majority vote | passCount > failCount |
| ConsensusStrategy | Unanimous agreement | All judges agree |
| AverageVotingStrategy | Simple average | average >= 0.5 |
| WeightedAverageStrategy | Weighted average | weighted avg >= 0.5 |
| MedianVotingStrategy | Median (outlier-robust) | median >= 0.5 |
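The three score-based pass conditions differ only in how the normalized scores are aggregated before the 0.5 threshold is applied. The sketch below works through the arithmetic on plain arrays. `VotingSketch` is a hypothetical name, and the code assumes each judge's score has already been normalized to [0.0, 1.0]; it is not the library's implementation.

```java
import java.util.Arrays;

// Illustrative only: aggregation arithmetic for average, weighted average, and median.
final class VotingSketch {

    static double average(double[] scores) {
        return Arrays.stream(scores).average().orElseThrow();
    }

    // Each score counts in proportion to its weight.
    static double weightedAverage(double[] scores, double[] weights) {
        if (scores.length == 0 || scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must be non-empty and equal in length");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weightedSum += scores[i] * weights[i];
            totalWeight += weights[i];
        }
        if (totalWeight <= 0.0) {
            throw new IllegalArgumentException("total weight must be positive");
        }
        return weightedSum / totalWeight;
    }

    // The middle value (or the mean of the two middle values) of the sorted scores.
    static double median(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("scores must be non-empty");
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Pass condition from the table: aggregate >= 0.5.
    static boolean passes(double aggregate) {
        return aggregate >= 0.5;
    }

    public static void main(String[] args) {
        double[] scores = {0.9, 0.8, 0.0};   // one judge is an outlier
        double[] weights = {1.0, 1.0, 2.0};  // the outlier carries double weight
        System.out.println("average passes:  " + passes(average(scores)));
        System.out.println("weighted passes: " + passes(weightedAverage(scores, weights)));
        System.out.println("median passes:   " + passes(median(scores)));
    }
}
```

With these inputs the heavily weighted outlier pulls the weighted average below the threshold while the plain average and the median stay above it, which is the trade-off the strategy choice controls.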
Majority Voting Policies:
MajorityVotingStrategy.builder()
    .tiePolicy(TiePolicy.FAIL)                // PASS, FAIL, or ABSTAIN on ties
    .errorPolicy(ErrorPolicy.TREAT_AS_FAIL)   // How to handle ERROR judgments
    .build();
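The two policies decide what happens in the cases the plain passCount > failCount rule leaves open. The sketch below is one reading of that behaviour: ERROR votes optionally count as failures, ABSTAIN votes are ignored, and a tie returns whatever the tie policy dictates. `MajoritySketch` and its `Status` enum are hypothetical stand-ins, and ignoring ABSTAIN votes is an assumption.

```java
import java.util.List;

// Illustrative only: a stand-in for majority voting with tie and error policies.
final class MajoritySketch {

    enum Status { PASS, FAIL, ABSTAIN, ERROR }

    // tieResult plays the role of the tie policy; errorsCountAsFail plays the role of the error policy.
    static Status majority(List<Status> votes, Status tieResult, boolean errorsCountAsFail) {
        long passCount = votes.stream().filter(vote -> vote == Status.PASS).count();
        long failCount = votes.stream()
                .filter(vote -> vote == Status.FAIL || (errorsCountAsFail && vote == Status.ERROR))
                .count();
        if (passCount > failCount) {
            return Status.PASS;
        }
        if (failCount > passCount) {
            return Status.FAIL;
        }
        return tieResult; // Equal counts: defer to the tie policy.
    }

    public static void main(String[] args) {
        List<Status> votes = List.of(Status.PASS, Status.ERROR);
        // With errors treated as failures this is a 1:1 tie, so the tie policy decides.
        System.out.println(majority(votes, Status.FAIL, true));
        // With errors ignored, the single PASS vote wins.
        System.out.println(majority(votes, Status.FAIL, false));
    }
}
```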
Jury Utilities
// Create jury from judges with auto-naming
Jury jury = Juries.fromJudges(VotingStrategy.majority(), judge1, judge2, judge3);
// Combine juries into meta-jury
Jury metaJury = Juries.combine(jury1, jury2, VotingStrategy.consensus());
// Create meta-jury from multiple juries
Jury combined = Juries.allOf(VotingStrategy.average(), jury1, jury2, jury3);
Utilities
MavenTestRunner
Run Maven tests with wrapper auto-detection:
ExecResult result = MavenTestRunner.run(projectPath, Duration.ofMinutes(10));
if (result.exitCode() == 0) {
    // Tests passed
}
JaCoCoReportParser
Parse JaCoCo XML reports for coverage metrics:
CoverageMetrics metrics = JaCoCoReportParser.parse(
    projectPath.resolve("target/site/jacoco/jacoco.xml")
);
double lineCoverage = metrics.linePercentage();      // e.g., 85.5
double branchCoverage = metrics.branchPercentage();  // e.g., 72.3
double methodCoverage = metrics.methodPercentage();  // e.g., 90.1
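For orientation, a JaCoCo XML report stores each metric as a `<counter>` element with `missed` and `covered` attributes, and a percentage is covered / (missed + covered) * 100. The sketch below reads the report-level counter of one type from an inline XML string using the JDK's DOM parser. `JacocoCounterSketch` is a hypothetical name and this is not the library's parser. Real report files also carry a DOCTYPE that references `report.dtd`, which this minimal example leaves out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Illustrative only: computes a coverage percentage from JaCoCo-style counters.
final class JacocoCounterSketch {

    // Returns covered / (missed + covered) * 100 for the report-level counter of the given type.
    static double percentage(String xml, String type) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // Report-level totals are the <counter> elements that are direct children of <report>;
        // counters nested inside <package> or <class> elements are skipped.
        for (Node node = doc.getDocumentElement().getFirstChild(); node != null; node = node.getNextSibling()) {
            if (node instanceof Element && "counter".equals(node.getNodeName())) {
                Element counter = (Element) node;
                if (type.equals(counter.getAttribute("type"))) {
                    long missed = Long.parseLong(counter.getAttribute("missed"));
                    long covered = Long.parseLong(counter.getAttribute("covered"));
                    long total = missed + covered;
                    return total == 0 ? 0.0 : 100.0 * covered / total;
                }
            }
        }
        throw new IllegalArgumentException("No report-level counter of type " + type);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<report name=\"demo\">"
                + "<counter type=\"LINE\" missed=\"29\" covered=\"171\"/>"
                + "<counter type=\"BRANCH\" missed=\"10\" covered=\"30\"/>"
                + "</report>";
        System.out.println("line:   " + percentage(xml, "LINE"));
        System.out.println("branch: " + percentage(xml, "BRANCH"));
    }
}
```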
Spring AI Agents Integration
Agent Judge powers the evaluation system in Agent Client:
CoverageJudge judge = new CoverageJudge(80.0);
AgentClientResponse response = agentClient
    .goal("Increase test coverage to 80%")
    .advisors(JudgeAdvisor.builder().judge(judge).build())
    .run();
Resources
GitHub Repository: Source code and contribution guidelines
Maven Central: Published artifacts
License
Agent Judge is open source software released under the Apache 2.0 license.