Overview
Agent Judge is an agent-agnostic evaluation framework for verifying that an AI agent completed its task. It provides a pluggable architecture with deterministic rules, command execution, and LLM-powered evaluation, all with zero coupling to any specific agent implementation. The library follows a clean separation of concerns: agent-judge-core has zero external dependencies, while specialized modules add capabilities such as process execution (agent-judge-exec) and LLM evaluation (agent-judge-llm).
Core Abstractions
Judge
Functional interface for evaluation logic - takes JudgmentContext, returns Judgment
Judgment
Result containing Score, JudgmentStatus, reasoning, and granular Checks
Score
Sealed interface: BooleanScore, NumericalScore, or CategoricalScore
Jury
Multi-judge aggregation with configurable voting strategies
Module Structure
| Module | Description | Dependencies |
|---|---|---|
| agent-judge-core | Core Judge API and abstractions | None (zero deps) |
| agent-judge-exec | Command execution judges | agent-sandbox, zt-exec |
| agent-judge-llm | LLM-powered evaluation | spring-ai-client-chat |
| agent-judge-agent | Agent-as-judge bridge interface | Core only |
| agent-judge-advisor | AgentClient advisors | spring-ai-agent-client |
| agent-judge-bom | Bill of Materials | N/A |
Quick Start
Maven BOM
Judge Interface
The core Judge interface is a functional interface, so a judge can be written as a lambda:
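As a rough illustration of what "functional interface for lambda support" means here, the sketch below defines simplified stand-ins for Judge, JudgmentContext, Judgment, and JudgmentStatus. The field names (goal, agentOutput, reasoning) are assumptions made for this example; the library's real types carry more information, such as Score and Checks.

```java
// Simplified stand-ins for illustration only; not the library's actual classes.
enum JudgmentStatus { PASS, FAIL, ABSTAIN, ERROR }

// Assumed fields: the real context holds the full agent execution details.
record JudgmentContext(String goal, String agentOutput) {}

// Assumed fields: the real Judgment also carries a Score and Checks.
record Judgment(JudgmentStatus status, String reasoning) {}

@FunctionalInterface
interface Judge {
    Judgment judge(JudgmentContext context);
}

final class JudgeExamples {
    // One abstract method means a lambda is enough to define a judge.
    static final Judge NON_EMPTY_OUTPUT = ctx ->
            ctx.agentOutput() == null || ctx.agentOutput().isBlank()
                    ? new Judgment(JudgmentStatus.FAIL, "Agent produced no output")
                    : new Judgment(JudgmentStatus.PASS, "Agent produced output");
}
```

Calling `JudgeExamples.NON_EMPTY_OUTPUT.judge(new JudgmentContext("write a file", "done"))` yields a PASS judgment, while a blank output yields FAIL.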
Judgment Context
Complete evaluation input with all context of an agent execution. Execution status values: SUCCESS, FAILED, TIMEOUT, CANCELLED, UNKNOWN
Judgment Results
JudgmentStatus values: PASS, FAIL, ABSTAIN, ERROR
Checks (Sub-Assertions)
Provide granular failure reporting.
Score Types
BooleanScore
Simple pass/fail scoring:
NumericalScore
Continuous scoring with bounds and normalization:
CategoricalScore
Discrete categories from an allowed set:
The Scores utility class converts between score types for heterogeneous aggregation.
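To make the three score kinds and the idea of conversion concrete, here is a minimal sketch that models Score as a sealed interface with record implementations. The record fields and the `toNormalized` helper are assumptions for illustration. In particular, mapping a category to its position in the allowed list is just one plausible rule, not necessarily what the Scores utility does.

```java
import java.util.List;

// Simplified stand-ins for the score types named above; fields are assumed.
sealed interface Score permits BooleanScore, NumericalScore, CategoricalScore {}

record BooleanScore(boolean value) implements Score {}

record NumericalScore(double value, double min, double max) implements Score {
    NumericalScore {
        if (min >= max || value < min || value > max) {
            throw new IllegalArgumentException("value must lie within [min, max]");
        }
    }

    // Rescale the bounded value into [0, 1].
    double normalized() {
        return (value - min) / (max - min);
    }
}

record CategoricalScore(String value, List<String> allowed) implements Score {
    CategoricalScore {
        allowed = List.copyOf(allowed);
        if (!allowed.contains(value)) {
            throw new IllegalArgumentException("value is not an allowed category");
        }
    }
}

final class ScoreConversions {
    // Hypothetical conversion to a common [0, 1] scale so mixed scores can be aggregated.
    static double toNormalized(Score score) {
        if (score instanceof BooleanScore b) {
            return b.value() ? 1.0 : 0.0;
        }
        if (score instanceof NumericalScore n) {
            return n.normalized();
        }
        CategoricalScore c = (CategoricalScore) score;
        int last = c.allowed().size() - 1;
        // Assumption: categories are listed from worst to best.
        return last == 0 ? 1.0 : (double) c.allowed().indexOf(c.value()) / last;
    }
}
```

Because the interface is sealed, the final cast is safe: any Score that is neither boolean nor numerical must be categorical.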
Built-in Judges
Deterministic Judges (agent-judge-core)
FileExistsJudge
Verifies file existence:
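The check itself can be pictured with plain java.nio calls. This is a sketch of the idea, not FileExistsJudge's actual constructor or API; `FileCheckResult` and `FileExistsSketch` are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical result type for this sketch.
record FileCheckResult(boolean pass, String reasoning) {}

final class FileExistsSketch {
    // Resolve the path against the agent's workspace and report whether it exists.
    static FileCheckResult judge(Path workspace, String relativePath) {
        Path target = workspace.resolve(relativePath).normalize();
        boolean exists = Files.exists(target);
        return new FileCheckResult(exists, (exists ? "Found " : "Missing ") + target);
    }
}
```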
FileContentJudge
Verifies file content with match modes:
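The source does not list the available match modes, so the sketch below assumes three common ones (EXACT, CONTAINS, REGEX) purely to show how mode-based matching might work. All names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Assumed match modes; the library's actual set may differ.
enum MatchMode { EXACT, CONTAINS, REGEX }

final class FileContentSketch {
    // Pure matching logic, kept separate from file I/O so it is easy to test.
    static boolean matches(String content, String expected, MatchMode mode) {
        return switch (mode) {
            case EXACT -> content.equals(expected);
            case CONTAINS -> content.contains(expected);
            case REGEX -> Pattern.compile(expected).matcher(content).find();
        };
    }

    // Read the file and apply the selected mode.
    static boolean judge(Path file, String expected, MatchMode mode) {
        try {
            return matches(Files.readString(file), expected, mode);
        } catch (IOException e) {
            return false; // a missing or unreadable file counts as a failure
        }
    }
}
```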
Custom Deterministic Judge
Build rule-based judges:
Command Judges (agent-judge-exec)
CommandJudge
Execute shell commands and evaluate results:
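The general shape of a command-based check is: run the command, wait with a timeout, compare the exit code. The sketch below uses the JDK's ProcessBuilder rather than the zt-exec and agent-sandbox dependencies listed for agent-judge-exec, and every name in it is hypothetical.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical result type for this sketch.
record CommandOutcome(boolean pass, String reasoning) {}

final class CommandJudgeSketch {
    // Turn an exit code into a pass/fail outcome.
    static CommandOutcome evaluate(int exitCode, int expectedExitCode) {
        boolean pass = exitCode == expectedExitCode;
        return new CommandOutcome(pass,
                "Exit code " + exitCode + (pass ? " matched" : ", expected " + expectedExitCode));
    }

    // Run the command, discard its output, and judge the exit code.
    static CommandOutcome run(List<String> command, long timeoutSeconds, int expectedExitCode) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return new CommandOutcome(false, "Timed out after " + timeoutSeconds + "s");
            }
            return evaluate(process.exitValue(), expectedExitCode);
        } catch (IOException e) {
            return new CommandOutcome(false, "Could not start command: " + e.getMessage());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new CommandOutcome(false, "Interrupted while waiting for command");
        }
    }
}
```

Failures to start the process are reported as failed outcomes instead of exceptions, so a judge built this way always returns a result.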
BuildSuccessJudge
Specialized for build tools with wrapper auto-detection. Default timeout: 10 minutes.
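The source does not describe how wrapper auto-detection works. One plausible reading, sketched here as an assumption, is to prefer a wrapper script in the project directory (such as mvnw or gradlew) and fall back to the globally installed tool otherwise. The class and method names are hypothetical.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

final class BuildCommandSketch {
    // Mirrors the documented default timeout of 10 minutes.
    static final Duration DEFAULT_TIMEOUT = Duration.ofMinutes(10);

    // Assumed rule: use the wrapper script if the project ships one, else the fallback tool.
    static String resolveExecutable(Path projectDir, String wrapper, String fallback) {
        return Files.isRegularFile(projectDir.resolve(wrapper)) ? "./" + wrapper : fallback;
    }
}
```

For example, `resolveExecutable(projectDir, "mvnw", "mvn")` returns `./mvnw` only when that file exists in the project directory.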
LLM Judges (agent-judge-llm)
CorrectnessJudge
Uses an LLM to evaluate whether the agent accomplished its goal. Uses the template method pattern for customization.
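The template method pattern fixes the overall workflow (build a prompt, call the model, parse the answer) and lets subclasses override the individual steps. The sketch below shows that shape with the model call replaced by a plain `Function<String, String>`; the class names, hook names, and YES/NO prompt are assumptions, not the library's API.

```java
import java.util.Locale;
import java.util.function.Function;

abstract class LlmJudgeSketch {
    // Stand-in for a chat client call: prompt in, response text out.
    private final Function<String, String> model;

    LlmJudgeSketch(Function<String, String> model) {
        this.model = model;
    }

    // Template method: the sequence is fixed, the steps are customizable.
    final boolean judge(String goal, String agentOutput) {
        String prompt = buildPrompt(goal, agentOutput);
        String response = model.apply(prompt);
        return parseResponse(response);
    }

    protected abstract String buildPrompt(String goal, String agentOutput);

    protected abstract boolean parseResponse(String response);
}

final class YesNoCorrectnessSketch extends LlmJudgeSketch {
    YesNoCorrectnessSketch(Function<String, String> model) {
        super(model);
    }

    @Override
    protected String buildPrompt(String goal, String agentOutput) {
        return "Goal: " + goal + "\nAgent output: " + agentOutput
                + "\nDid the agent accomplish the goal? Answer YES or NO.";
    }

    @Override
    protected boolean parseResponse(String response) {
        return response.trim().toUpperCase(Locale.ROOT).startsWith("YES");
    }
}
```

Passing a lambda as the model, for instance `new YesNoCorrectnessSketch(prompt -> "YES")`, makes the judge testable without a live LLM.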
Custom LLM Judge
Build custom LLM-powered evaluation:
Agent Judges (agent-judge-agent)
Delegate evaluation to an AI agent using a bridge interface.
Judge Composition
The Judges utility class provides composition operators:
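The source does not name the operators, so the following sketch assumes two typical ones, `allOf` and `anyOf`, over a deliberately reduced pass/fail judge type. It illustrates the composition idea only and should not be read as the Judges class's method list.

```java
import java.util.Arrays;

// Reduced stand-in for a judge: output in, pass/fail out.
@FunctionalInterface
interface PassFailJudge {
    boolean passes(String agentOutput);
}

final class CompositionSketch {
    // Passes only if every composed judge passes.
    static PassFailJudge allOf(PassFailJudge... judges) {
        return output -> Arrays.stream(judges).allMatch(j -> j.passes(output));
    }

    // Passes if at least one composed judge passes.
    static PassFailJudge anyOf(PassFailJudge... judges) {
        return output -> Arrays.stream(judges).anyMatch(j -> j.passes(output));
    }
}
```

Because the combinators return the same interface they accept, composed judges can be nested, e.g. `allOf(nonEmpty, anyOf(mentionsTests, mentionsBuild))`.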
Jury System
Combine multiple judges with voting strategies.
Voting Strategies
| Strategy | Description | Pass Condition |
|---|---|---|
| MajorityVotingStrategy | Majority vote | passCount > failCount |
| ConsensusStrategy | Unanimous agreement | All judges agree |
| AverageVotingStrategy | Simple average | average >= 0.5 |
| WeightedAverageStrategy | Weighted average | weighted avg >= 0.5 |
| MedianVotingStrategy | Median (outlier-robust) | median >= 0.5 |
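The pass conditions in the table can be written out directly. The sketch below implements four of them over plain booleans and normalized scores in [0, 1]; the method signatures are hypothetical, and the handling of empty input (treated as a fail) is an assumption.

```java
import java.util.Arrays;
import java.util.List;

final class VotingSketch {
    // Majority: passCount > failCount, so a tie does not pass.
    static boolean majority(List<Boolean> votes) {
        long passCount = votes.stream().filter(v -> v).count();
        long failCount = votes.size() - passCount;
        return passCount > failCount;
    }

    // Simple average: average >= 0.5.
    static boolean average(double[] scores) {
        return Arrays.stream(scores).average().orElse(0.0) >= 0.5;
    }

    // Weighted average: sum(score * weight) / sum(weight) >= 0.5.
    static boolean weightedAverage(double[] scores, double[] weights) {
        double total = Arrays.stream(weights).sum();
        if (scores.length != weights.length || total == 0.0) {
            return false;
        }
        double weighted = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weighted += scores[i] * weights[i];
        }
        return weighted / total >= 0.5;
    }

    // Median: middle value (or mean of the two middle values) >= 0.5.
    static boolean median(double[] scores) {
        if (scores.length == 0) {
            return false;
        }
        double[] sorted = scores.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double m = n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        return m >= 0.5;
    }
}
```

The scores 0.6, 0.6, 0.0 show why the median is called outlier-robust: the average is 0.4 and fails, while the median is 0.6 and passes.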