
Conversation

sobychacko
Contributor


… incremental caching pattern

This commit updates the CONVERSATION_HISTORY cache strategy to align with
Anthropic's official documentation and cookbook examples
(https://github.com/anthropics/claude-cookbooks/blob/main/misc/prompt_caching.ipynb)
for incremental conversation caching.

**Cache breakpoint placement:**
- Before: Cache breakpoint on penultimate (second-to-last) user message
- After: Cache breakpoint on last user message

**Aggregate eligibility:**
- Before: Only considered user messages for min content length check
- After: Considers all message types (user, assistant, tool) within 20-block
  lookback window for aggregate eligibility
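
To make the placement change concrete, here is a minimal sketch (message contents are illustrative; this is not the Spring AI code in this PR) of the same three-message exchange under the old and new breakpoint positions:

```python
# Before: cache_control on the penultimate (second-to-last) user message,
# so the cached prefix stops at "Question 1".
before = [
    {"role": "user", "content": [{"type": "text", "text": "Question 1",
                                  "cache_control": {"type": "ephemeral"}}]},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": "Question 2"},
]

# After: cache_control on the last user message, so the cached prefix
# also covers "Answer 1" and "Question 2" for the next turn.
after = [
    {"role": "user", "content": "Question 1"},
    {"role": "assistant", "content": "Answer 1"},
    {"role": "user", "content": [{"type": "text", "text": "Question 2",
                                  "cache_control": {"type": "ephemeral"}}]},
]
```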

Anthropic's documentation and cookbook demonstrate incremental caching by
placing cache_control on the LAST user message:

```python
result.append({
    "role": "user",
    "content": [{
        "type": "text",
        "text": turn["content"][0]["text"],
        "cache_control": {"type": "ephemeral"}  # On LAST user message
    }]
})
```

This pattern is also shown in their official docs:
https://docs.claude.com/en/docs/build-with-claude/prompt-caching#large-context-caching-example
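
For reference, a hedged sketch of what a full multi-turn request looks like with the breakpoint on the last user message, using the Anthropic Python SDK directly (the model id, prompt text, and the optional system-prompt breakpoint are illustrative assumptions, not part of this PR):

```python
import anthropic

client = anthropic.Anthropic()

# Padding so the cached prefix clears Anthropic's minimum cacheable size.
LONG_SYSTEM_PROMPT = "You are a code-review assistant for a large Java codebase. " * 200

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # optional breakpoint on the system prompt
    }],
    messages=[
        {"role": "user", "content": "Summarize the caching design."},
        {"role": "assistant", "content": "It caches the conversation prefix."},
        {"role": "user", "content": [{
            "type": "text",
            "text": "How does the prefix grow across turns?",
            "cache_control": {"type": "ephemeral"},  # breakpoint on the LAST user message
        }]},
    ],
)
print(response.usage)
```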

Anthropic's caching system uses prefix matching to find the longest matching
prefix from the cache. By placing cache_control on the last user message,
we enable the following incremental caching pattern:

```
Turn 1: Cache [System + User1]
Turn 2: Reuse [System + User1], process [Assistant1 + User2],
        cache [System + User1 + Assistant1 + User2]
Turn 3: Reuse [System + User1 + Assistant1 + User2],
        process [Assistant2 + User3],
        cache [System + User1 + Assistant1 + User2 + Assistant2 + User3]
```

The cache grows incrementally with each turn, building a larger prefix that
can be reused. This is the recommended pattern from Anthropic.
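
A minimal sketch of that loop against the Anthropic Python SDK (again illustrative, not the Spring AI implementation; the model id and the 4-turn driver are assumptions). Each turn moves the single breakpoint to the newest user message, so breakpoints never accumulate across turns, and the usage fields show the growing prefix being read back from cache:

```python
import anthropic

client = anthropic.Anthropic()
history = []  # plain turns, without cache_control

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    # Rebuild the request and mark only the LAST user message as the breakpoint.
    messages = list(history)
    messages[-1] = {
        "role": "user",
        "content": [{
            "type": "text",
            "text": question,
            "cache_control": {"type": "ephemeral"},
        }],
    }
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model id
        max_tokens=1024,
        messages=messages,
    )
    usage = response.usage
    print(f"cache_read={usage.cache_read_input_tokens} "
          f"cache_write={usage.cache_creation_input_tokens}")
    answer = response.content[0].text
    history.append({"role": "assistant", "content": answer})
    return answer

for question in ["Summarize the design document below: ...",
                 "What are the main risks?",
                 "How would you mitigate them?",
                 "Draft a short summary for the team."]:
    ask(question)  # cache_read should grow once the prefix clears the minimum
```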

The new implementation considers all message types (user, assistant, tool)
within the 20-block lookback window when checking minimum content length.
This ensures that:

- Short user questions don't prevent caching when the conversation has long
  assistant responses
- The full conversation context is considered for the 1024+ token minimum
- Aligns with Anthropic's note: "The automatic prefix checking only looks
  back approximately 20 content blocks from each explicit breakpoint"
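
A rough sketch of that aggregate check (pure illustration, not the Java implementation changed in this PR; the chars-per-token heuristic is an assumption, and the real minimum is model-dependent):

```python
MIN_CACHEABLE_TOKENS = 1024  # Anthropic's minimum cacheable prefix for most models
LOOKBACK_BLOCKS = 20         # approximate lookback window from an explicit breakpoint
CHARS_PER_TOKEN = 4          # rough heuristic for this sketch only

def is_cache_eligible(messages: list[dict]) -> bool:
    """Aggregate the last <=20 content blocks across ALL roles
    (user, assistant, tool) and compare against the token minimum."""
    blocks = []
    for message in reversed(messages):
        content = message["content"]
        if isinstance(content, str):  # normalize plain-string content to one text block
            content = [{"type": "text", "text": content}]
        for block in reversed(content):
            blocks.append(block)
            if len(blocks) == LOOKBACK_BLOCKS:
                break
        if len(blocks) == LOOKBACK_BLOCKS:
            break
    total_chars = sum(len(block.get("text", "")) for block in blocks)
    return total_chars // CHARS_PER_TOKEN >= MIN_CACHEABLE_TOKENS

history = [
    {"role": "user", "content": "Explain the caching design."},
    {"role": "assistant", "content": "Long answer... " * 600},  # long assistant response
    {"role": "user", "content": "And the risks?"},              # short final question
]
print(is_cache_eligible(history))  # True: the aggregate clears the minimum
```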

No breaking changes: this is an implementation detail of the CONVERSATION_HISTORY
strategy, and the API surface remains unchanged. Users may observe:

- Different cache hit patterns (should be more effective)
- Cache metrics may show higher cache read tokens as conversations grow

**Test changes:**
- Updated `shouldRespectMinLengthForUserHistoryCaching()` to test aggregate
  eligibility with combined message lengths
- Renamed `shouldRespectAllButLastUserMessageForUserHistoryCaching()` to
  `shouldApplyCacheControlToLastUserMessageForConversationHistory()`
- Added `shouldDemonstrateIncrementalCachingAcrossMultipleTurns()` integration
  test showing the cache growth pattern across 4 conversation turns
- Updated mock test assertions to verify the last message has cache_control

Updated anthropic-chat.adoc to clarify:
- CONVERSATION_HISTORY strategy description now mentions incremental prefix
  caching
- Code example comments updated to reflect cache breakpoint on last user
  message
- Implementation Details section expanded with explanation of prefix matching
  and aggregate eligibility checking

**References:**
- Anthropic Prompt Caching Docs: https://docs.claude.com/en/docs/build-with-claude/prompt-caching
- Anthropic Cookbook: https://github.com/anthropics/claude-cookbooks/blob/main/misc/prompt_caching.ipynb

Signed-off-by: Soby Chacko <[email protected]>