Q: What exactly does the cache hash include?

Every token of the prompt prefix: the tools list, the system message, the messages up to the breakpoint, and the cache_control markers, whitespace included. A single changed character is a cache miss. In production, make prompt construction deterministic (alphabetical tool list, etc.).
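One way to keep the prefix deterministic is to canonicalize the tool list before every request and track a local fingerprint of the static parts. The sketch below is a minimal illustration, assuming tools are dicts with a `name` key; `canonical_tools` and `prefix_fingerprint` are hypothetical helper names, and the SHA-256 fingerprint is only a local drift detector, not the hash Anthropic computes.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> list[dict]:
    """Return the tools sorted by name so their order never varies between requests."""
    return sorted(tools, key=lambda tool: tool["name"])


def prefix_fingerprint(system_text: str, tools: list[dict]) -> str:
    """Local fingerprint of the static prefix (hypothetical helper, not Anthropic's hash)."""
    payload = json.dumps(
        {"system": system_text, "tools": canonical_tools(tools)},
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


search = {"name": "search", "description": "Search the catalog", "input_schema": {"type": "object"}}
refund = {"name": "refund", "description": "Issue a refund", "input_schema": {"type": "object"}}

fp_a = prefix_fingerprint("You are a support assistant.", [search, refund])
fp_b = prefix_fingerprint("You are a support assistant.", [refund, search])
fp_c = prefix_fingerprint("You are a support assistant. ", [search, refund])  # trailing space

print(fp_a == fp_b)  # tool order no longer matters
print(fp_a == fp_c)  # a single extra character changes the fingerprint
```

Logging this fingerprint next to each request makes an unexpected drop in hit rate easy to trace back to a prompt-building change.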

Anthropic Prompt Caching Deep-Dive: 1.25× Write, 0.10× Read — Turning the Math Into Maximum Savings

Q: Can I achieve 95%+ cache hit?

In theory yes; in practice 85-92% is the sweet spot. Reaching 95%+ means aggressively minimizing cache writes, which points to the 1h TTL, while high-traffic workloads are usually cheaper on the 5-minute TTL. That is the trade-off. Module 7.5 covers hit-rate optimization in detail.

Anthropic's caching math looks simple: write 1.25×, read 0.10×. But to reach 90% savings in production you need to master breakpoint count, TTL choice, multi-cache layering, and refresh strategies. This lesson takes you to that level.

Şükrü Yusuf KAYA

22 min read

5/14/2026

Advanced


💎 This lesson is a gem

If you could learn one technique in this course knowing it would pay for itself ten times over, this would be it. Anthropic prompt caching applies directly to 75-90% of production LLM workloads, with minimal engineering effort and maximum savings. Now, the anatomy.

The Math: Get It in 5 Seconds

Sonnet 4.6 input price: $3.00/M tokens

Cache pricing:

Operation	Multiplier	Price
Standard input	1×	$3.00/M
Cache write (new cache)	1.25×	$3.75/M
Cache read (existing cache)	0.10×	$0.30/M ⭐

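Break-even

A cache write costs 25% more than normal input. When does it pay for itself?

Cached cost    = Write_cost + N_reads × Read_cost
Standard cost  = (N_reads + 1) × Standard_cost

Setting the two equal: 1.25 + 0.10 × N_reads = N_reads + 1, so N_reads = 0.25 / 0.90 ≈ 0.28, which rounds up to 1 read.

In other words: you are ahead from the 2nd request onward.

At 3 requests you already save more than 50% on the cached prefix. At 10 requests about 78%. At 100 requests about 89%.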
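The break-even arithmetic can be checked with a few lines of Python. This is a minimal sketch that only uses the 1.25× write and 0.10× read multipliers from the pricing table; `prefix_savings_pct` is an illustrative name, and the result covers the cached prefix only, not dynamic input or output tokens.

```python
WRITE_MULT = 1.25  # cache write multiplier (5-minute TTL)
READ_MULT = 0.10   # cache read multiplier


def prefix_savings_pct(total_requests: int) -> float:
    """Savings on the cached prefix: one write, then (total_requests - 1) reads."""
    cached = WRITE_MULT + (total_requests - 1) * READ_MULT
    uncached = total_requests * 1.0
    return 100.0 * (1.0 - cached / uncached)


# Number of reads at which the extra write cost is recovered:
# 1.25 + 0.10 * N = N + 1  ->  N = 0.25 / 0.90
break_even_reads = (WRITE_MULT - 1.0) / (1.0 - READ_MULT)
print(f"break-even reads: {break_even_reads:.2f}")

for n in (1, 2, 3, 10, 100):
    print(f"{n:>3} requests -> {prefix_savings_pct(n):6.1f}% savings on the prefix")
```

A single request with no follow-up read comes out negative, which is the 25% write premium with nothing to offset it.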

How Does It Work?

Anthropic uses a hash of a specific part of the prompt as the cache key. When you send the same prompt prefix again, it restores the model's internal state from the cache and processes only the dynamic part that follows.

First request:
  [Static 4K]  [Dynamic 200]  →  The model processes everything
  Cache written: hash(Static 4K)
  Bill: 4K × 1.25× + 200 × 1× = 5,200 token equivalents

Second request (within 5 minutes):
  [Static 4K]  [Different dynamic 200]  →  hash hit!
  Only the dynamic part is reprocessed
  Bill: 4K × 0.10× + 200 × 1× = 600 token equivalents

The second request is 88% cheaper than the first, and about 86% cheaper than the same request without caching (4,200 token equivalents).
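The same bill can be expressed as a small function, which is handy for plugging in your own prompt sizes. A minimal sketch, assuming the 5-minute write multiplier and treating 4K as 4,000 tokens; `billed_token_equivalents` is an illustrative name.

```python
def billed_token_equivalents(static_tokens: int, dynamic_tokens: int, cache_hit: bool) -> float:
    """Input bill in 'token equivalents' (multiples of the standard input price)."""
    prefix_multiplier = 0.10 if cache_hit else 1.25  # read vs. write of the cached prefix
    return static_tokens * prefix_multiplier + dynamic_tokens * 1.0


first = billed_token_equivalents(4_000, 200, cache_hit=False)   # cache is written
second = billed_token_equivalents(4_000, 200, cache_hit=True)   # cache is read
uncached = 4_000 + 200                                          # no caching at all

print(first, second)
print(f"vs first request: {100 * (1 - second / first):.0f}% cheaper")
print(f"vs no caching:    {100 * (1 - second / uncached):.0f}% cheaper")
```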

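Ephemeral (5 min) vs 1h TTL

Anthropic offers two TTL options. In the API both are set through cache_control with type "ephemeral"; the 5-minute TTL is the default, and this lesson calls it "Ephemeral" for short.

Feature	Ephemeral (5 min)	1 Hour
Cache write price	1.25×	2×
Cache read price	0.10×	0.10×
TTL	5 minutes	60 minutes
Maximum cache size	1MB	1MB
Minimum cacheable prompt	1024 tokens (Sonnet/Opus), 2048 tokens (Haiku)	same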
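In request code, the TTL choice comes down to one field on the content block. The helper below is a minimal sketch: `cached_text_block` is a hypothetical name, and the `"ttl": "1h"` field reflects the API as I understand it at the time of writing, so verify it against the current Anthropic documentation before relying on it.

```python
from typing import Literal


def cached_text_block(text: str, ttl: Literal["5m", "1h"] = "5m") -> dict:
    """Build a text content block that ends with a cache breakpoint."""
    cache_control = {"type": "ephemeral"}  # 5-minute TTL is the default
    if ttl == "1h":
        cache_control["ttl"] = "1h"        # assumption: 1-hour TTL is requested via this field
    return {"type": "text", "text": text, "cache_control": cache_control}


short_lived = cached_text_block("General system prompt")
long_lived = cached_text_block("FAQ and rules", ttl="1h")
print(short_lived["cache_control"])
print(long_lived["cache_control"])
```

Blocks built this way can be passed in the `system` list of a `client.messages.create(...)` call.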

Which One, When?

High-traffic scenario (50+ requests/minute on the same prefix):
  → Ephemeral (5 min)
  → Per 5 minutes: at most 1 write + ~250 reads (each hit refreshes the TTL, so steady traffic keeps the cache warm)
  → Each write costs only 1.25× (cheap)

Low-traffic scenario (20-50 requests/hour):
  → 1h TTL
  → 1 write + 20-50 reads per hour
  → 2× write price, but 12 times less frequent = net gain

Decision rule

If requests/hour ≥ 100  →  Ephemeral (5 min)
If 30 ≤ requests/hour < 100  →  Marginal, test it
If requests/hour < 30  →  1h TTL is more economical
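The thresholds above are rules of thumb. One way to test the marginal band for your own traffic is a rough expected-cost model. The sketch below is not from the lesson: it assumes requests arrive as a Poisson process, that a hit refreshes the TTL, and that the prefix itself never changes, so the cache is cold only when the gap since the previous request exceeds the TTL. Rewrites caused by prompt changes, where the cheaper 1.25× write matters, are ignored.

```python
import math

READ_MULT = 0.10
WRITE_MULT = {"5m": 1.25, "1h": 2.00}
TTL_MINUTES = {"5m": 5.0, "1h": 60.0}


def expected_prefix_multiplier(requests_per_hour: float, ttl: str) -> float:
    """Expected price multiplier per request for the cached prefix.

    Assumption: Poisson arrivals, so P(gap > TTL) = exp(-rate * TTL).
    A cold cache costs a write; a warm cache costs a read.
    """
    rate_per_minute = requests_per_hour / 60.0
    p_miss = math.exp(-rate_per_minute * TTL_MINUTES[ttl])
    return p_miss * WRITE_MULT[ttl] + (1.0 - p_miss) * READ_MULT


for rph in (0.5, 5, 30, 100):
    five = expected_prefix_multiplier(rph, "5m")
    hour = expected_prefix_multiplier(rph, "1h")
    print(f"{rph:>5} req/h  5m: {five:.3f}×  1h: {hour:.3f}×")
```

Real traffic is burstier than a Poisson process, so treat the output as a starting point and confirm it against the measured `cache_read_input_tokens` and `cache_creation_input_tokens` in production.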

4 Breakpoints: Anthropic's Unique Mechanism

Anthropic lets you place up to 4 cache breakpoints in a prompt. Each breakpoint creates its own cache key.

Why 4 breakpoints?

Scenario: the prompt has four layers, three of them cacheable:

System (general role)
Static knowledge (FAQ, rules): updated once a day
Tenant config (per customer): updated hourly
Dynamic user query

If you cache all of it behind a single breakpoint, the entire cache is invalidated whenever the tenant config changes.

With multiple breakpoints, each layer gets its own cache entry:

# The Messages API takes the system prompt as the top-level `system`
# parameter, not as a "system" role inside `messages`.
system = [
    # Breakpoint 1: general system prompt, unchanged for hundreds of days
    {"type": "text", "text": GENERAL_SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}},

    # Breakpoint 2: FAQ + rules, updated once a day
    {"type": "text", "text": FAQ_AND_RULES,
     "cache_control": {"type": "ephemeral"}},

    # Breakpoint 3: tenant config, updated hourly
    {"type": "text", "text": tenant_config,
     "cache_control": {"type": "ephemeral"}},
]

# User message: dynamic, not cached
messages = [
    {"role": "user", "content": user_query},
]

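Advantage

When the tenant config changes, only breakpoint 3 is invalidated
Breakpoints 1 and 2 still produce cache hits
This works because caching is prefix-based, so order the layers from most stable to most volatile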
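The prefix behaviour behind this can be simulated locally. The sketch below is an illustration only: it models each breakpoint's key as a SHA-256 hash of everything up to that breakpoint, which is an assumption about the general idea and not Anthropic's actual hashing scheme; `prefix_keys` is a hypothetical helper.

```python
import hashlib


def prefix_keys(layers: list[str]) -> list[str]:
    """One key per breakpoint, each covering all layers up to and including it."""
    keys = []
    running = hashlib.sha256()
    for layer in layers:
        running.update(layer.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries affect the key
        keys.append(running.copy().hexdigest())
    return keys


before = prefix_keys(["general system prompt", "faq and rules", "tenant config v1"])
tenant_changed = prefix_keys(["general system prompt", "faq and rules", "tenant config v2"])
system_changed = prefix_keys(["general system prompt!", "faq and rules", "tenant config v1"])

print([a == b for a, b in zip(before, tenant_changed)])  # → [True, True, False]
print([a == b for a, b in zip(before, system_changed)])  # → [False, False, False]
```

The second line shows why ordering matters: a change in the first layer invalidates every breakpoint after it.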

A Real RAG Example: A More Complex Cache Pattern

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = """You are a customer service assistant..."""  # ~2K tokens
# load_faq() and load_catalog() are your own loaders (not defined here).
# They must return the same string on every call, or the cache will miss.
FAQ = load_faq()  # ~4K tokens
PRODUCT_CATALOG = load_catalog()  # ~6K tokens

def rag_answer(user_query: str, retrieved_chunks: list[str]):
    """Cache-optimized RAG answer."""

    retrieved_context = "\n\n".join(retrieved_chunks)  # varies per request

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},  # ← cache layer 1

            {"type": "text", "text": FAQ + "\n\n" + PRODUCT_CATALOG,
             "cache_control": {"type": "ephemeral"}},  # ← cache layer 2
        ],
        messages=[
            {
                # Dynamic content goes last so it never breaks the cached prefix
                "role": "user",
                "content": f"CONTEXT:\n{retrieved_context}\n\nQUESTION: {user_query}"
            }
        ],
    )
    return response

# 100-request test; test_queries is your own iterable of (query, chunks) pairs
for query, chunks in test_queries:
    response = rag_answer(query, chunks)
    usage = response.usage
    print(f"input: {usage.input_tokens}")
    print(f"cache_read: {usage.cache_read_input_tokens}")
    print(f"cache_create: {usage.cache_creation_input_tokens}")

Monitoring Cache Metrics

After every request, log these metrics:

def log_metric(name, value):
    """Placeholder sink: swap in your own metrics client."""
    print(f"{name}={value}")

def log_cache_metrics(response, request_id):
    # request_id is available for tagging metrics in your own sink
    usage = response.usage
    # input_tokens counts only the uncached part, so add the cached parts back
    total_input = (
        usage.input_tokens
        + (usage.cache_read_input_tokens or 0)
        + (usage.cache_creation_input_tokens or 0)
    )
    cache_hit_pct = (
        (usage.cache_read_input_tokens or 0) / total_input * 100
        if total_input > 0 else 0
    )

    log_metric("cache_hit_pct", cache_hit_pct)
    log_metric("cache_read_tokens", usage.cache_read_input_tokens or 0)
    log_metric("cache_creation_tokens", usage.cache_creation_input_tokens or 0)

    # Cost calculation (Sonnet prices; $3.75/M is the 5-minute write price,
    # 1h-TTL writes are billed at 2×, i.e. $6.00/M)
    cost_input = usage.input_tokens / 1_000_000 * 3.00
    cost_cache_read = (usage.cache_read_input_tokens or 0) / 1_000_000 * 0.30
    cost_cache_write = (usage.cache_creation_input_tokens or 0) / 1_000_000 * 3.75
    cost_output = usage.output_tokens / 1_000_000 * 15.00
    total_cost = cost_input + cost_cache_read + cost_cache_write + cost_output

    log_metric("cost_per_request", total_cost)

    # What would it have cost without the cache?
    counterfactual_input = total_input
    counterfactual_cost = (
        counterfactual_input / 1_000_000 * 3.00
        + usage.output_tokens / 1_000_000 * 15.00
    )
    savings = counterfactual_cost - total_cost
    log_metric("cache_savings_per_request", savings)

Dashboard targets

Cache hit % > 70: good
Cache hit % > 85: excellent
Savings/request > $0.005: economically meaningful
Daily savings > $10: you have achieved the goal of the course
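For a dashboard, a token-weighted hit rate over many requests is usually more informative than an average of per-request percentages, which overweights small requests. The sketch below shows one way to aggregate it and map it to the targets above; `CacheStats` and its fields are hypothetical names, and the inputs mirror the three usage counters the API returns.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    uncached: int = 0     # sum of usage.input_tokens
    cache_read: int = 0   # sum of usage.cache_read_input_tokens
    cache_write: int = 0  # sum of usage.cache_creation_input_tokens

    def add(self, input_tokens: int, cache_read_tokens, cache_creation_tokens) -> None:
        self.uncached += input_tokens
        self.cache_read += cache_read_tokens or 0
        self.cache_write += cache_creation_tokens or 0

    @property
    def hit_pct(self) -> float:
        total = self.uncached + self.cache_read + self.cache_write
        return 100.0 * self.cache_read / total if total else 0.0

    def rating(self) -> str:
        if self.hit_pct > 85:
            return "excellent"
        if self.hit_pct > 70:
            return "good"
        return "below target"


stats = CacheStats()
stats.add(3_200, 0, 12_000)       # first request writes the 12K prefix
for _ in range(9):
    stats.add(3_200, 12_000, 0)   # next nine requests read it

print(f"{stats.hit_pct:.1f}% -> {stats.rating()}")
```

Note that the uncached dynamic tokens cap the achievable hit percentage: with 3.2K dynamic tokens per request and a 12K prefix, the ceiling is roughly 79% even if every request hits.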

Case Study: The 88% Savings Story

A Turkish e-commerce assistant, 500K requests per month:

Before (no cache)

System: 2K tokens
FAQ + Catalog: 10K tokens
Retrieved chunks: 3K tokens
User query: 200 tokens
Total input/request: 15.2K

Output: 400 tokens

Monthly cost:
  Input:  500K × 15.2K × $3/M  = $22,800
  Output: 500K × 400 × $15/M   = $3,000
  TOTAL: $25,800

After (2-layer cache, ephemeral)

First request (cache write):
  System write:  2K × $3.75/M  = $0.0075
  Catalog write: 10K × $3.75/M = $0.0375
  Retrieved:     3K × $3/M     = $0.009
  Query:         200 × $3/M    = $0.0006
  Output:        400 × $15/M   = $0.006
  TOTAL:         $0.0606

Next 99 requests (cache read):
  System read:   2K × $0.30/M  = $0.0006
  Catalog read:  10K × $0.30/M = $0.003
  Retrieved:     3K × $3/M     = $0.009  (different every time)
  Query:         200 × $3/M    = $0.0006
  Output:        400 × $15/M   = $0.006
  TOTAL/req:     $0.0192
  100 requests:  $0.0606 + 99 × $0.0192 = $1.96

Monthly (assuming the cache is rewritten every 5 minutes, ~8,640 rewrites/month):
  Cache write cost: 8,640 × $0.045 = $389
  Cache read cost: ...
  ~$3,000 total for the cached input portion
  +  Output: $3,000 (unchanged)
  +  Dynamic retrieved chunks: ~$4,500
  ≈ $10,500/month

Savings: $25,800 → $10,500 = 59% overall ✅

The headline 88% appears to apply to the cached 12K prefix alone, whose monthly cost falls from $18,000 to roughly $2,200-$3,000; the uncached chunks and the output dilute the overall figure.

If you also compress the retrieved chunks using the selection technique from Module 6, savings rise to 75-85%.
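The monthly estimate can be reproduced with a small cost model. The sketch below uses only the case-study inputs (500K requests, 12K static tokens, 3.2K dynamic tokens, 400 output tokens, 8,640 cache writes) and the Sonnet prices from the pricing table; `monthly_cost` is an illustrative name. Because it does the arithmetic exactly instead of rounding each component, its total lands somewhat below the rounded ≈ $10,500 estimate.

```python
M = 1_000_000
PRICE_INPUT, PRICE_WRITE, PRICE_READ, PRICE_OUTPUT = 3.00, 3.75, 0.30, 15.00  # $ per M tokens


def monthly_cost(requests: int, static_tokens: int, dynamic_tokens: int,
                 output_tokens: int, cache_writes: int) -> float:
    """Monthly cost in dollars; cache_writes=0 means caching is switched off."""
    output = requests * output_tokens / M * PRICE_OUTPUT
    dynamic = requests * dynamic_tokens / M * PRICE_INPUT
    if cache_writes == 0:
        static = requests * static_tokens / M * PRICE_INPUT
    else:
        cache_reads = requests - cache_writes  # every non-write request is a read
        static = (cache_writes * static_tokens / M * PRICE_WRITE
                  + cache_reads * static_tokens / M * PRICE_READ)
    return static + dynamic + output


before = monthly_cost(500_000, 12_000, 3_200, 400, cache_writes=0)
after = monthly_cost(500_000, 12_000, 3_200, 400, cache_writes=8_640)
savings_pct = 100 * (1 - after / before)

print(f"before: ${before:,.0f}  after: ${after:,.0f}  savings: {savings_pct:.0f}%")
```

Changing `dynamic_tokens` shows how strongly the uncached retrieved chunks limit the overall savings.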

🧪 Lab 7: Do it when you are ready

Lab 7: Add prompt caching to a RAG chatbot. Target: a 75% cost reduction at quality parity. Take the code examples in this lesson as a starting point and apply them to your own RAG architecture. Pass criteria: cache hit % > 80 and $/request < 50% of baseline.

▶️ Next lesson

7.2: OpenAI Automatic Cached Input. While Anthropic gives you manual breakpoint control, OpenAI caches automatically. Advantage: no code changes. Disadvantage: limited room for optimization. Let's look at the patterns.

Frequently Asked Questions