Understanding Prompt Caching
Prompt caching allows you to mark portions of your prompt for reuse across multiple API requests. When you enable it, Anthropic caches the specified content and charges reduced rates for cached segments in subsequent requests.

How It Works
The cache operates on exact prefix matching: a request reuses a cache entry only when everything before the cache breakpoint (tools, system prompt, and preceding messages) is identical to a previously cached prefix. Any change within that prefix results in a cache miss.

Cost Structure
Pricing varies significantly by model tier:

| Model | Base Input | Cache Write | Cache Read | Savings |
|---|---|---|---|---|
| Claude Sonnet 4.5 | $3/MTok | $3.75/MTok (+25%) | $0.30/MTok | 90% |
| Claude Sonnet 4 | $3/MTok | $3.75/MTok (+25%) | $0.30/MTok | 90% |
| Claude Opus 4.1 | $15/MTok | $18.75/MTok (+25%) | $1.50/MTok | 90% |
| Claude Opus 4 | $15/MTok | $18.75/MTok (+25%) | $1.50/MTok | 90% |
| Claude Haiku 4.5 | $1/MTok | $1.25/MTok (+25%) | $0.10/MTok | 90% |
| Claude Haiku 3.5 | $0.80/MTok | $1/MTok (+25%) | $0.08/MTok | 90% |
| Claude Haiku 3 | $0.25/MTok | $0.30/MTok (+25%) | $0.03/MTok | 90% |
For example, caching a 5,000-token prompt on Claude Sonnet 4.5:
- First request: 5,000 tokens × $3.75/MTok = $0.01875 (cache write)
- Subsequent requests: 5,000 tokens × $0.30/MTok = $0.0015 (cache read)
- Savings: 90% reduction on cached content
- Breakeven: 2nd request (the 25% write premium is recovered immediately)
Requirements and Limitations
Minimum token thresholds vary by model:

| Model | Minimum Cacheable Tokens |
|---|---|
| Claude Sonnet 4.5, Claude Sonnet 4, Claude Opus 4.1, Claude Opus 4 | 1,024 |
| Claude Haiku 3.5, Claude Haiku 3 | 2,048 |
| Claude Haiku 4.5 | 4,096 |
- Maximum 4 cache breakpoints per request
- Cache TTL: 5 minutes default, 1 hour optional (at higher write cost)
- Cache refreshes on each use within TTL window
- Cache entries only become available after the first response begins, so requests running in parallel before that point cannot hit the cache
Cache Hierarchy and Cascade Invalidation
Anthropic processes request components in a fixed order (tools, then system, then messages), and this order determines how cache invalidation works: changing a component invalidates its own cache and the cache of everything that follows it in the hierarchy.

Spring AI Cache Strategies
Rather than requiring you to manually place cache breakpoints (which can be error-prone and tedious), Spring AI provides five strategic patterns through the AnthropicCacheStrategy enum. Each strategy handles cache control directive placement automatically while respecting Anthropic’s 4-breakpoint limit.
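As a minimal sketch of how a strategy is selected: only the AnthropicCacheStrategy enum is named above, so the AnthropicCacheOptions builder and the cacheOptions(...) setter are assumed names for the 1.1 API, and chatModel and largeStableSystemPrompt are placeholders.

```java
// Sketch only: AnthropicCacheOptions and cacheOptions(...) are assumed names for
// the Spring AI 1.1 cache API; AnthropicCacheStrategy is the enum described above.
AnthropicChatOptions options = AnthropicChatOptions.builder()
        .model("claude-sonnet-4-5")
        .cacheOptions(AnthropicCacheOptions.builder()
                .strategy(AnthropicCacheStrategy.SYSTEM_ONLY)
                .build())
        .build();

// chatModel: an auto-configured AnthropicChatModel bean.
String answer = ChatClient.create(chatModel)
        .prompt()
        .system(largeStableSystemPrompt)   // must meet the model's minimum cacheable token count
        .user("What are the termination clauses?")
        .options(options)
        .call()
        .content();
```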
Strategy Overview
| Strategy | Breakpoints | Cached Content | Typical Use Case |
|---|---|---|---|
| NONE | 0 | Nothing | One-off requests, testing |
| SYSTEM_ONLY | 1 | System message | Stable system prompts, <20 tools |
| TOOLS_ONLY | 1 | Tool definitions | Large tools, dynamic system prompts |
| SYSTEM_AND_TOOLS | 2 | Tools + System | 20+ tools, both stable |
| CONVERSATION_HISTORY | 1-4 | Full conversation | Multi-turn conversations |
SYSTEM_ONLY Strategy
This strategy caches the system message content. Since tools appear before the system message in Anthropic’s request hierarchy (Tools → System → Messages), they automatically become part of the cache prefix when you place a cache breakpoint on the system message. Important: changing any tool definition will invalidate the system cache due to the cache hierarchy.

On the first request, cacheCreationInputTokens will be greater than zero and cacheReadInputTokens will be zero. Subsequent requests with the same system prompt will show zero for cacheCreationInputTokens and a positive value for cacheReadInputTokens. This is how you can verify that caching is working as expected in your application.
Use this strategy when your system prompt is large (meeting the minimum token threshold) and stable, but user questions vary.
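To observe that behavior, you can compare usage metadata across two requests that share the same prefix. This is a sketch only: it assumes the systemOnlyOptions object is built as in the earlier example, that chatClient and bigSystemPrompt are placeholders you supply, and that the provider-native usage object returned by getNativeUsage() carries the cacheCreationInputTokens and cacheReadInputTokens counters.

```java
// First call with a large, stable system prompt: expect a cache write
// (cacheCreationInputTokens > 0, cacheReadInputTokens == 0 in the native usage).
ChatResponse first = chatClient.prompt()
        .system(bigSystemPrompt)
        .user("Summarize the key obligations.")
        .options(systemOnlyOptions)        // AnthropicChatOptions with SYSTEM_ONLY, as sketched earlier
        .call()
        .chatResponse();
System.out.println(first.getMetadata().getUsage().getNativeUsage());

// Second call with the identical system prompt within the TTL window: expect a
// cache read (cacheCreationInputTokens == 0, cacheReadInputTokens > 0).
ChatResponse second = chatClient.prompt()
        .system(bigSystemPrompt)
        .user("List the termination clauses.")
        .options(systemOnlyOptions)
        .call()
        .chatResponse();
System.out.println(second.getMetadata().getUsage().getNativeUsage());
```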
TOOLS_ONLY Strategy
This strategy caches tool definitions while processing the system message fresh on each request. The use case becomes clear in multi-tenant scenarios where tools are shared but system prompts need customization. Consider a SaaS application serving multiple organizations:
- First request (any tenant): Tools cached at 1.25x cost
- All subsequent requests (all tenants): Tools read from cache at 0.1x cost
- Each tenant’s system prompt: Processed fresh at 1.0x cost (by design)
With 5,000 tokens of shared tool definitions on Claude Sonnet 4.5, that works out to $0.01875 once to create the cache, then $0.0015 per request for cache reads, regardless of which tenant is making the request.
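A sketch of that setup, using the same assumed AnthropicCacheOptions builder as before; sharedToolbox, tenantSystemPrompt(...), tenantId, and userQuestion are hypothetical placeholders for a shared @Tool-annotated bean and per-tenant inputs.

```java
// Tools are shared across tenants and cached; each tenant's system prompt is
// deliberately left out of the cache and processed fresh on every request.
AnthropicChatOptions toolCaching = AnthropicChatOptions.builder()
        .model("claude-sonnet-4-5")
        .cacheOptions(AnthropicCacheOptions.builder()        // assumed API name
                .strategy(AnthropicCacheStrategy.TOOLS_ONLY)
                .build())
        .build();

String reply = chatClient.prompt()
        .tools(sharedToolbox)                      // hypothetical shared @Tool-annotated bean (large, stable definitions)
        .system(tenantSystemPrompt(tenantId))      // hypothetical per-tenant lookup; uncached by design
        .user(userQuestion)
        .options(toolCaching)
        .call()
        .content();
```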
SYSTEM_AND_TOOLS Strategy
This strategy creates two independent cache breakpoints: one for tools (breakpoint 1) and one for the system message (breakpoint 2). This separation matters when you have more than 20 tools or when you need deterministic caching of both components. The key advantage: changing the system message does not invalidate the tool cache.
- Breakpoint 1 (tools): hash(tools)
- Breakpoint 2 (system): hash(tools + system)
- System changes only: Tool cache (breakpoint 1) remains valid, system cache (breakpoint 2) invalidated
- Tool changes: Both caches invalidated
CONVERSATION_HISTORY Strategy
For multi-turn conversations, this strategy caches the entire conversation history incrementally. Spring AI places a cache breakpoint on the last user message in the conversation history. This is particularly useful when building conversational AI applications (such as chatbots, virtual assistants, and customer support systems). When using CONVERSATION_HISTORY, both tools and system prompts must remain stable throughout the conversation. Changes to either invalidate the entire conversation cache.
Example: Partnership Agreement Analysis
Here’s the cost profile for multi-question document analysis (an illustrative sketch of the flow follows the breakdown):
- First question: 3,500 tokens × $3.75/MTok ≈ $0.013 (cache write)
- Questions 2-5: 3,500 tokens × $0.30/MTok ≈ $0.001 (cache read) each
- Total cached content cost: $0.013 + (4 × $0.001) = $0.017
- Without caching: 5 × (3,500 tokens × $3/MTok) ≈ $0.053, so caching cuts the cached-content cost by roughly 68%
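A sketch of that flow, again assuming the AnthropicCacheOptions builder from the earlier examples; agreementText and chatModel are placeholders for the document text and an auto-configured AnthropicChatModel.

```java
// Each turn appends to the same history, so the growing prefix is cached
// incrementally and every follow-up question reads the earlier turns from cache.
AnthropicChatOptions conversationCaching = AnthropicChatOptions.builder()
        .model("claude-sonnet-4-5")
        .cacheOptions(AnthropicCacheOptions.builder()        // assumed API name
                .strategy(AnthropicCacheStrategy.CONVERSATION_HISTORY)
                .build())
        .build();

List<Message> history = new ArrayList<>();
history.add(new UserMessage("Here is our partnership agreement:\n" + agreementText));

String[] questions = {
        "What are the profit-sharing terms?",
        "How can a partner exit the partnership?",
        "What happens to intellectual property created by a partner?",
        "How are disputes between partners resolved?",
        "Under what conditions can the agreement be terminated?"
};

for (String question : questions) {
    history.add(new UserMessage(question));
    ChatResponse response = chatModel.call(new Prompt(history, conversationCaching));
    history.add(response.getResult().getOutput());   // keep the assistant reply in the cached prefix
    System.out.println(response.getResult().getOutput().getText());
}
```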
Getting Started
Note: Prompt caching support is available in Spring AI 1.1.0 and later. Try it with the latest 1.1.0-SNAPSHOT version.
Add the Spring AI Anthropic starter to your project:
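For Maven, that looks like the following, assuming the 1.x starter artifact id and that the Spring AI BOM manages the version:

```xml
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-anthropic</artifactId>
</dependency>
```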
Advanced Configuration Options
Extended Cache TTL
The default cache TTL is 5 minutes. For scenarios where requests arrive less frequently, you can configure a 1-hour cache. Spring AI sends the required beta header (anthropic-beta: extended-cache-ttl-2025-04-11) when you configure the 1-hour TTL.
When to use each TTL:
- 5 minutes (default): Real-time conversations, frequently updated content
- 1 hour: Infrequent requests (>5 min apart), stable reference materials, lower traffic
Content Length Filtering
You can set minimum content lengths per message type to optimize breakpoint usage.

Implementation Details
For those interested in internals, here’s how Spring AI handles cache management. The CacheEligibilityResolver determines whether each message or tool qualifies for caching based on the chosen strategy, message type eligibility, content length requirements, and available breakpoints. The CacheBreakpointTracker enforces Anthropic’s 4-breakpoint limit with thread-safe tracking per request.
For CONVERSATION_HISTORY, Spring AI uses aggregate eligibility checking—it considers the combined content of all message types (user, assistant, tool) within the last ~20 content blocks when determining cache eligibility. This prevents short user questions (such as “Tell me more”) from blocking cache creation when there are substantial assistant responses in the conversation history.
Practical Considerations
When Caching Doesn’t Help
Avoid caching when:
- Content changes frequently (cache miss rate >50%)
- Prompts are below minimum token thresholds for your model
- Making one-off requests with no reuse patterns
- Tools or system prompts change more often than content is reused
Strategy Anti-Patterns
Avoid these common mistakes:
- Don’t use SYSTEM_ONLY if your system prompt changes frequently—you’ll pay cache write costs without getting cache hits
- Don’t use TOOLS_ONLY if your tools change frequently—you’ll pay cache write costs without getting cache hits. Note that SYSTEM_ONLY won’t help either when tools change frequently, since tool changes invalidate the system cache
- Don’t use CONVERSATION_HISTORY if you can’t guarantee tool and system stability—changes invalidate the entire conversation cache
- Don’t use SYSTEM_AND_TOOLS if you only have a few small tools (<20)—SYSTEM_ONLY’s implicit caching is sufficient