Q: What exactly does the cache hash cover?

All tokens of the prompt prefix, plus the tools list, the system message, and the cache_control markers, whitespace included. A single-character change means a cache miss. In production, make prompt construction deterministic (for example, an alphabetically sorted tool list).
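A minimal sketch of what "deterministic prompt build" can look like in practice. The helper names (`canonical_tools`, `prefix_fingerprint`) are hypothetical, and the SHA-256 fingerprint is only a local sanity check for catching accidental prefix drift; it is not Anthropic's actual cache key.

```python
import hashlib
import json


def canonical_tools(tools):
    """Return the tools sorted by name so list order never changes the prefix."""
    return sorted(tools, key=lambda tool: tool["name"])


def prefix_fingerprint(system_text, tools):
    """Hash a canonical JSON rendering of the static prefix (local check only)."""
    payload = json.dumps(
        {"system": system_text, "tools": canonical_tools(tools)},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


tools_a = [
    {"name": "search", "description": "Search docs"},
    {"name": "calc", "description": "Do math"},
]
tools_b = list(reversed(tools_a))  # same tools, different order

fp_a = prefix_fingerprint("You are a helpful assistant.", tools_a)
fp_b = prefix_fingerprint("You are a helpful assistant.", tools_b)
fp_c = prefix_fingerprint("You are a helpful assistant. ", tools_a)  # trailing space

print(fp_a == fp_b)  # tool order no longer matters
print(fp_a == fp_c)  # one extra space changes the fingerprint
```

Logging such a fingerprint next to each request makes it easier to spot when an unexpected cache miss was caused by a change in the static prefix.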

Anthropic Prompt Caching In Depth: 1.25× Writes, 0.10× Reads, and Turning the Math into Maximum Savings

Q: Can I get a cache hit rate of 95%+?

In theory yes; in practice the sweet spot is 85-92%. To go above 95% you have to minimize cache writes very aggressively, which means moving to a 1h TTL, but under high traffic the cache needs to be refreshed within 5 minutes. It is a trade-off. Module 7.5 covers hit-rate optimization in detail.

Anthropic's caching math looks simple: writes at 1.25×, reads at 0.10×. But to capture 90% savings in production you need to understand breakpoint count, TTL selection, multi-layer caching, and refresh strategies. This lesson takes you to full mastery.

Şükrü Yusuf KAYA

22-minute read

14.05.2026

Advanced

💎 This lesson is a gem

If you had to pick one technique from this course that pays for itself tenfold, this would be it. Anthropic prompt caching applies directly to 75-90% of production LLM workloads, with minimal engineering effort and maximum savings. Now, the anatomy.

The Math: Understand It in 5 Seconds

Sonnet 4.6 input price: $3.00/M tokens

Cache pricing:

Operation	Multiplier	Price
Standard input	1×	$3.00/M
Cache write (new cache entry)	1.25×	$3.75/M
Cache read (existing cache entry)	0.10×	$0.30/M ⭐

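Break-even

A cache write costs 25% more than standard input. When does it pay for itself?

Cached cost    = Write_cost + N_reads × Read_cost
Standard cost  = (N_reads + 1) × Standard_cost

Solving the equality: 1.25 + 0.10 × N_reads = N_reads + 1, so N_reads = 0.25 / 0.90 ≈ 0.28, which rounds up to 1 read.

In other words, you are in profit from the 2nd request onward.

On the cached portion, 3 requests already save over 50%, 10 requests about 78%, and 100 requests about 89%.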
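The break-even arithmetic can be checked with a few lines of Python. This is a sketch that only models the cached portion of the prompt, in units of the standard input price, and assumes one write followed by reads that all hit the cache; the function names are illustrative.

```python
WRITE_MULT = 1.25  # cache write, relative to standard input
READ_MULT = 0.10   # cache read, relative to standard input


def cached_cost(n_requests):
    """Cost of n requests on the cached portion: one write, then reads."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")
    return WRITE_MULT + (n_requests - 1) * READ_MULT


def savings_fraction(n_requests):
    """Fraction saved versus paying the standard price on every request."""
    return 1 - cached_cost(n_requests) / n_requests


def break_even_reads():
    """Solve WRITE + N * READ = (N + 1) * 1 for N."""
    return (WRITE_MULT - 1) / (1 - READ_MULT)


print(f"break-even reads: {break_even_reads():.2f}")
for n in (2, 3, 10, 100):
    print(f"{n:>3} requests: {savings_fraction(n):.1%} saved")
```

Changing `WRITE_MULT` to 2.0 gives the same picture for the 1-hour TTL, where the break-even point moves further out.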

How Does It Work?

Anthropic uses a hash of a designated part of the prompt as the cache key. If you send the same prompt prefix again, it restores the model's internal state from the cache and processes only the dynamic parts that follow.

First request:
  [Static 4K]  [Dynamic 200]  →  The model processes everything
  Cache written: hash(Static 4K)
  Bill: 4K × 1.25 + 200 × 1 = 5,200 token equivalents

Second request (within 5 min):
  [Static 4K]  [Different dynamic 200]  →  hash hit!
  Only the dynamic part is processed again
  Bill: 4K × 0.10 + 200 × 1 = 600 token equivalents

The second request costs 88% less than the first (and about 86% less than the 4,200 token equivalents of an uncached request).
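The two bills above follow from one small formula. The sketch below expresses cost in "token equivalents" (tokens weighted by their price multiplier); the function name is illustrative and the model ignores output tokens.

```python
def billed_token_equivalents(static_tokens, dynamic_tokens, cache_hit):
    """Weight the static prefix by the write or read multiplier; dynamic tokens cost 1x."""
    prefix_multiplier = 0.10 if cache_hit else 1.25
    return static_tokens * prefix_multiplier + dynamic_tokens


first = billed_token_equivalents(4000, 200, cache_hit=False)   # cache write
second = billed_token_equivalents(4000, 200, cache_hit=True)   # cache read
uncached = 4000 + 200                                          # no caching at all

print(f"first request:  {first:.0f}")
print(f"second request: {second:.0f}")
print(f"second vs first:    {1 - second / first:.1%} cheaper")
print(f"second vs uncached: {1 - second / uncached:.1%} cheaper")
```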

Ephemeral (5 min) vs 1h TTL

Anthropic offers two TTL options. Both are set through cache_control with type "ephemeral"; the 1-hour variant additionally sets ttl to "1h".

Feature	Ephemeral (5 min)	1 Hour
Cache write price	1.25×	2×
Cache read price	0.10×	0.10×
TTL	5 minutes	60 minutes
Maximum cache size	1MB	1MB
Minimum cacheable prompt	1024 tokens (Sonnet/Opus), 2048 tokens (Haiku)	same
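As a sketch of how the two TTLs differ in a request, the helper below builds a text block with the matching `cache_control` value. The helper name `cached_text_block` is hypothetical; the `"ttl": "1h"` field follows Anthropic's documented format, but depending on your API or SDK version the 1-hour TTL may require an extra beta header, so check the current documentation before relying on it.

```python
def cached_text_block(text, ttl="5m"):
    """Build a system/content text block that ends in a cache breakpoint."""
    if ttl not in ("5m", "1h"):
        raise ValueError("ttl must be '5m' or '1h'")
    cache_control = {"type": "ephemeral"}
    if ttl == "1h":
        # The 5-minute TTL is the default, so only the 1-hour case is explicit.
        cache_control["ttl"] = "1h"
    return {"type": "text", "text": text, "cache_control": cache_control}


short_lived = cached_text_block("General system prompt")
long_lived = cached_text_block("FAQ and rules", ttl="1h")

print(short_lived["cache_control"])
print(long_lived["cache_control"])
```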

Which One, and When?

High-traffic scenario (50+ requests/minute on the same prefix):
  → Ephemeral 5 min
  → At most 1 write per 5 minutes + ~250 reads (each cache hit refreshes the TTL)
  → Each write costs only 1.25× (cheap)

Low-traffic scenario (20-50 requests/hour):
  → 1h TTL
  → 1 write per hour + 20-50 reads
  → 2× write price, but up to 12 times fewer writes = net gain

Decision Rule

If requests/hour ≥ 100  →  Ephemeral (5 min)
If 30 ≤ requests/hour < 100  →  Borderline, test both
If requests/hour < 30  →  1h TTL is more economical
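The decision rule translates directly into code. The sketch below uses the thresholds stated in this lesson; the second function compares worst-case hourly write cost (in multiples of the standard prefix price), assuming traffic is sparse enough that a 5-minute cache expires in every window. Function names are illustrative.

```python
def recommend_ttl(requests_per_hour):
    """Apply the lesson's thresholds to pick a TTL."""
    if requests_per_hour >= 100:
        return "5m"
    if requests_per_hour >= 30:
        return "test both"
    return "1h"


def worst_case_hourly_write_cost(ttl):
    """Writes per hour times write multiplier, if the cache expires every window."""
    writes_per_hour = 12 if ttl == "5m" else 1
    write_multiplier = 1.25 if ttl == "5m" else 2.0
    return writes_per_hour * write_multiplier


for rate in (10, 30, 99, 100, 3000):
    print(f"{rate:>5} req/h -> {recommend_ttl(rate)}")

print(worst_case_hourly_write_cost("5m"), worst_case_hourly_write_cost("1h"))
```

The worst-case comparison is what makes the 1h TTL attractive at low traffic: twelve 1.25× writes per hour cost far more than a single 2× write.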

4 Breakpoints: Anthropic's Distinctive Mechanism

Anthropic lets you place up to 4 cache breakpoints in a prompt. Each breakpoint creates its own cache key.

Why 4 breakpoints?

Scenario: the prompt consists of three cacheable layers plus a dynamic query:

System (general role)
Static knowledge (FAQ, rules): updated once a day
Tenant config (per customer): updated hourly
Dynamic user query

If you cache everything behind a single breakpoint, the entire cache is invalidated whenever the tenant config changes.

With multiple breakpoints, each layer gets its own cache entry (each breakpoint covers the prefix up to that point):

# The Messages API takes the system prompt as the top-level `system`
# parameter, not as a message with role "system".
system = [
    # Breakpoint 1: general system prompt, unchanged for hundreds of days
    {"type": "text", "text": GENERAL_SYSTEM_PROMPT,
     "cache_control": {"type": "ephemeral"}},

    # Breakpoint 2: FAQ + rules, updated once a day
    {"type": "text", "text": FAQ_AND_RULES,
     "cache_control": {"type": "ephemeral"}},

    # Breakpoint 3: tenant config, updated hourly
    {"type": "text", "text": tenant_config,
     "cache_control": {"type": "ephemeral"}},
]

# User message: dynamic, not cached
messages = [
    {"role": "user", "content": user_query},
]

# Pass both to client.messages.create(system=system, messages=messages, ...)

Advantage

When the tenant config changes, only breakpoint 3 is invalidated
Breakpoints 1 and 2 still produce cache hits
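To see why earlier breakpoints survive a change in a later layer, it helps to model each cache key as a hash of everything up to that breakpoint. The sketch below is a conceptual model only: the SHA-256 prefix hashing and the function names are assumptions for illustration, not Anthropic's actual implementation.

```python
import hashlib


def prefix_keys(layers):
    """Return one key per breakpoint: a hash of all layers up to and including it."""
    keys = []
    running = hashlib.sha256()
    for layer in layers:
        running.update(layer.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries matter
        keys.append(running.copy().hexdigest())
    return keys


def surviving_breakpoints(old_layers, new_layers):
    """1-based breakpoint numbers whose prefix key is unchanged."""
    old_keys, new_keys = prefix_keys(old_layers), prefix_keys(new_layers)
    return [i + 1 for i, (a, b) in enumerate(zip(old_keys, new_keys)) if a == b]


before = ["general system prompt", "faq and rules", "tenant config v1"]
tenant_changed = ["general system prompt", "faq and rules", "tenant config v2"]
faq_changed = ["general system prompt", "faq and rules (updated)", "tenant config v1"]

print(surviving_breakpoints(before, tenant_changed))  # tenant change: 1 and 2 survive
print(surviving_breakpoints(before, faq_changed))     # FAQ change: only 1 survives
```

The second case shows why ordering matters: put the most stable content first, because a change in any layer invalidates every breakpoint after it.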

A Real RAG Example: A Complex Cache Pattern

import anthropic

client = anthropic.Anthropic()

# load_faq, load_catalog and test_queries are assumed to be defined
# elsewhere in your project.
SYSTEM_PROMPT = """You are a customer service assistant..."""  # 2K tokens
FAQ = load_faq()  # 4K tokens
PRODUCT_CATALOG = load_catalog()  # 6K tokens

def rag_answer(user_query: str, retrieved_chunks: list[str]):
    """Cache-optimized RAG answer."""

    retrieved_context = "\n\n".join(retrieved_chunks)  # varies per request

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=500,
        system=[
            {"type": "text", "text": SYSTEM_PROMPT,
             "cache_control": {"type": "ephemeral"}},  # ← cache layer 1

            {"type": "text", "text": FAQ + "\n\n" + PRODUCT_CATALOG,
             "cache_control": {"type": "ephemeral"}},  # ← cache layer 2
        ],
        messages=[
            {
                # Dynamic part: placed after the cached prefix, never cached
                "role": "user",
                "content": f"CONTEXT:\n{retrieved_context}\n\nQUESTION: {user_query}"
            }
        ],
    )
    return response

# 100-request test: print the cache counters reported for each response
for query, chunks in test_queries:
    response = rag_answer(query, chunks)
    usage = response.usage
    print(f"input: {usage.input_tokens}")
    print(f"cache_read: {usage.cache_read_input_tokens}")
    print(f"cache_create: {usage.cache_creation_input_tokens}")

Monitoring Cache Metrics

After every request, log the following metrics:

def log_cache_metrics(response, request_id):
    usage = response.usage
    total_input = (
        usage.input_tokens
        + (usage.cache_read_input_tokens or 0)
        + (usage.cache_creation_input_tokens or 0)
    )
    cache_hit_pct = (
        (usage.cache_read_input_tokens or 0) / total_input * 100
        if total_input > 0 else 0
    )

    log_metric("cache_hit_pct", cache_hit_pct)
    log_metric("cache_read_tokens", usage.cache_read_input_tokens or 0)
    log_metric("cache_creation_tokens", usage.cache_creation_input_tokens or 0)

    # Cost calc
    cost_input = usage.input_tokens / 1_000_000 * 3.00
    cost_cache_read = (usage.cache_read_input_tokens or 0) / 1_000_000 * 0.30
    cost_cache_write = (usage.cache_creation_input_tokens or 0) / 1_000_000 * 3.75
    cost_output = usage.output_tokens / 1_000_000 * 15.00
    total_cost = cost_input + cost_cache_read + cost_cache_write + cost_output

    log_metric("cost_per_request", total_cost)

    # What would it have cost without cache?
    counterfactual_input = total_input
    counterfactual_cost = (
        counterfactual_input / 1_000_000 * 3.00
        + usage.output_tokens / 1_000_000 * 15.00
    )
    savings = counterfactual_cost - total_cost
    log_metric("cache_savings_per_request", savings)
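The same arithmetic is easier to unit-test as a pure function that returns the numbers instead of logging them. The sketch below is one possible variant, not part of the original code: `cost_breakdown` and the `PRICES_PER_M` table are hypothetical names, the prices are the Sonnet figures used in this lesson, and a `SimpleNamespace` stands in for a real `response.usage` object.

```python
from types import SimpleNamespace

# USD per million tokens, as used throughout this lesson
PRICES_PER_M = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75, "output": 15.00}


def cost_breakdown(usage):
    """Return hit rate, actual cost and savings for one response's usage."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    total_input = usage.input_tokens + read + write

    actual = (
        usage.input_tokens * PRICES_PER_M["input"]
        + read * PRICES_PER_M["cache_read"]
        + write * PRICES_PER_M["cache_write"]
        + usage.output_tokens * PRICES_PER_M["output"]
    ) / 1_000_000
    # Counterfactual: every input token billed at the standard price
    uncached = (
        total_input * PRICES_PER_M["input"]
        + usage.output_tokens * PRICES_PER_M["output"]
    ) / 1_000_000

    hit_pct = read / total_input * 100 if total_input else 0.0
    return {"cache_hit_pct": hit_pct, "cost": actual, "savings": uncached - actual}


# Stand-in for response.usage on a cache-hit request
fake_usage = SimpleNamespace(
    input_tokens=3200,
    cache_read_input_tokens=12000,
    cache_creation_input_tokens=0,
    output_tokens=400,
)
result = cost_breakdown(fake_usage)
print(result)
```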

Dashboard Targets

Cache hit % > 70: good
Cache hit % > 85: excellent
Savings/request > $0.005: economically meaningful
Daily savings > $10: you have reached the goal of the course

Case Study: The 88% Savings Story

A Turkish e-commerce assistant handling 500K requests per month:

Before (no cache)

System: 2K tokens
FAQ + Catalog: 10K tokens
Retrieved chunks: 3K tokens
User query: 200 tokens
Total input/request: 15.2K

Output: 400 tokens

Monthly cost:
  Input:  500K × 15.2K × $3/M  = $22,800
  Output: 500K × 400 × $15/M   = $3,000
  TOTAL:  $25,800

After (2-layer cache, ephemeral)

First request (cache write):
  System write:  2K × $3.75/M  = $0.0075
  Catalog write: 10K × $3.75/M = $0.0375
  Retrieved:     3K × $3/M     = $0.009
  Query:         200 × $3/M    = $0.0006
  Output:        400 × $15/M   = $0.006
  TOTAL:         $0.0606

Next 99 requests (cache read):
  System read:   2K × $0.30/M  = $0.0006
  Catalog read:  10K × $0.30/M = $0.003
  Retrieved:     3K × $3/M     = $0.009  (different every time)
  Query:         200 × $3/M    = $0.0006
  Output:        400 × $15/M   = $0.006
  TOTAL/req:     $0.0192
  100 requests:  $0.0606 + 99 × $0.0192 = $1.96

Monthly (assuming the cache is rewritten every 5 minutes, ~8,640 writes/month):
  Cache write cost: 8,640 × $0.045 = $389
  Cache read cost: ...
  ~$3,000 total for the cached input pattern
  +  Output: $3,000 (unchanged)
  +  Dynamic retrieved chunks: ~$4,500
  ≈ $10,500/month

Savings: $25,800 → $10,500 = 59% ✅

The overall figure is lower than the per-request savings on the cached prefix because the retrieved chunks and the output are billed at full price on every request.

If you also compress the retrieved chunks with the selection techniques from Module 6, savings rise to 75-85%.
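The monthly estimate can be reproduced line by line. The sketch below is a direct calculation under the same stated assumptions (500K requests, a 30-day month, one cache write per 5-minute window, every other request a cache hit). Because it does not round intermediate figures, it lands somewhat below the rounded ≈ $10,500 quoted in the case study.

```python
REQUESTS = 500_000
PREFIX_TOKENS = 12_000    # system (2K) + FAQ and catalog (10K), cached
CHUNK_TOKENS = 3_000      # retrieved chunks, never cached
QUERY_TOKENS = 200
OUTPUT_TOKENS = 400
CACHE_WRITES = 30 * 24 * 12  # one write per 5-minute window, 30-day month


def usd(tokens, price_per_million):
    """Convert a token count to dollars at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


before = REQUESTS * (
    usd(PREFIX_TOKENS + CHUNK_TOKENS + QUERY_TOKENS, 3.00) + usd(OUTPUT_TOKENS, 15.00)
)

write_cost = CACHE_WRITES * usd(PREFIX_TOKENS, 3.75)
read_cost = (REQUESTS - CACHE_WRITES) * usd(PREFIX_TOKENS, 0.30)
dynamic_cost = REQUESTS * usd(CHUNK_TOKENS + QUERY_TOKENS, 3.00)
output_cost = REQUESTS * usd(OUTPUT_TOKENS, 15.00)
after = write_cost + read_cost + dynamic_cost + output_cost

savings = 1 - after / before
print(f"before: ${before:,.0f}")
print(f"after:  ${after:,.0f}  (writes ${write_cost:,.0f}, reads ${read_cost:,.0f})")
print(f"savings: {savings:.1%}")
```

Raising `CHUNK_TOKENS` or lowering it shows quickly how much the uncached dynamic context limits the overall savings.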

🧪 Lab 7: To do when you are ready

Lab 7: Add prompt caching to a RAG chatbot. Target: a 75% cost reduction with quality parity. Take the code examples in this lesson as a starting point and apply them to your own RAG architecture. Pass criteria: cache hit % > 80 and cost per request below 50% of baseline.

▶️ Next lesson

7.2: OpenAI Automatic Cached Input. While Anthropic gives you manual breakpoint control, OpenAI caches automatically. Advantage: no code changes. Disadvantage: limited room for optimization. Let's look at the patterns.
