Cache Monitoring: Dashboards and Alerting

What do you measure in a production cache stack? Cache hit rate, p50/p95 latency, cost per request, regression detection. Grafana dashboard JSON example.

Şükrü Yusuf KAYA

12 min read

27.06.2026

Intermediate

Production Cache Monitoring

You have deployed the cache and it works. But how can you be sure it keeps working? Without monitoring, a caching regression can go unnoticed.

There are 5 metrics to watch.

5 Critical Metrics

Metric	Target	Alert threshold
Cache hit rate	90%+	Alert below 85%
Cache write rate	Low	A spike signals a regression
Avg cost per request	Track the trend	Increase of 20% or more
p50 latency	<2s	Increase of more than 50%
p95 latency	<5s	Increase of more than 30%
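As a minimal sketch of how these thresholds could be evaluated, the block below compares a current window against a baseline window, reading the cost and latency thresholds as relative increases over that baseline. The `Snapshot` container, the function names, and the sample numbers are all hypothetical; the write-rate spike is left out because the table gives no numeric threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Aggregated cache metrics over one time window (hypothetical container)."""
    cache_read_tokens: int
    cache_write_tokens: int
    avg_cost_usd: float
    p50_seconds: float
    p95_seconds: float


def hit_rate(s: Snapshot) -> float:
    """Token-weighted hit rate: reads / (reads + writes). NaN when there is no traffic."""
    total = s.cache_read_tokens + s.cache_write_tokens
    return s.cache_read_tokens / total if total else float("nan")


def pct_increase(current: float, baseline: float) -> float:
    """Relative increase over the baseline, e.g. 0.25 means +25%."""
    return (current - baseline) / baseline if baseline else 0.0


def check(current: Snapshot, baseline: Snapshot) -> list[str]:
    """Return the names of the thresholds that the current window breaches."""
    alerts = []
    if hit_rate(current) < 0.85:  # NaN compares False, so no traffic means no alert
        alerts.append("cache_hit_rate")
    if pct_increase(current.avg_cost_usd, baseline.avg_cost_usd) >= 0.20:
        alerts.append("avg_cost")
    if pct_increase(current.p50_seconds, baseline.p50_seconds) > 0.50:
        alerts.append("p50_latency")
    if pct_increase(current.p95_seconds, baseline.p95_seconds) > 0.30:
        alerts.append("p95_latency")
    return alerts


# Made-up numbers: hit rate drops to 60%, cost rises 50%, p95 rises 50%.
baseline = Snapshot(90_000, 10_000, 0.020, 1.0, 3.0)
regressed = Snapshot(60_000, 40_000, 0.030, 1.2, 4.5)
print(check(regressed, baseline))
```

In practice Prometheus alert rules would do this comparison; the sketch only makes the arithmetic behind each threshold explicit.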

Prometheus Metric Export

python

import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics. The counters accumulate token counts, so a hit rate derived from them is token-weighted.
cache_hits = Counter('llm_cache_hits_total', 'Cache read tokens')
cache_writes = Counter('llm_cache_writes_total', 'Cache write tokens')
input_tokens = Counter('llm_input_tokens_total', 'Fresh input tokens')
output_tokens = Counter('llm_output_tokens_total', 'Output tokens')
request_latency = Histogram('llm_request_latency_seconds', 'Request latency', buckets=(0.1, 0.5, 1, 2, 5, 10))
# prometheus_client appends _total to counter names, so this series is exposed as llm_request_cost_usd_total.
request_cost = Counter('llm_request_cost_usd', 'Cost in USD')

def instrument_call(start_time, usage):
    """Call this after every LLM call; start_time is the time.time() value taken before the call."""
    duration = time.time() - start_time
    request_latency.observe(duration)

    # The cache fields can be missing (None), so fall back to 0.
    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0

    cache_writes.inc(cw)
    cache_hits.inc(cr)
    input_tokens.inc(usage.input_tokens)
    output_tokens.inc(usage.output_tokens)

    # Cost in USD; the hard-coded rates are per million tokens, adjust them to your model's pricing.
    cost = (
        usage.input_tokens / 1e6 * 3.0
        + cw / 1e6 * 3.75
        + cr / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0
    )
    request_cost.inc(cost)

# Start the metrics server
start_http_server(9100)  # serves the /metrics endpoint

Prometheus metric export
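One way to wire `instrument_call` into request handling is a thin wrapper that records the start time, runs the call, and forwards the usage. The sketch below is an assumption about how that could look: `call_with_metrics` and `fake_llm_call` are hypothetical names, and the fake call stands in for a real client so the example runs without network access or a Prometheus dependency.

```python
import time
from types import SimpleNamespace


def call_with_metrics(llm_call, instrument, *args, **kwargs):
    """Run llm_call, then report its start time and usage to instrument.

    Both callables are hypothetical: llm_call returns a response with a
    .usage attribute, and instrument takes (start_time, usage), the same
    shape as instrument_call.
    """
    start_time = time.time()
    response = llm_call(*args, **kwargs)
    instrument(start_time, response.usage)
    return response


def fake_llm_call(prompt):
    """Stand-in for a real client call, returning a response-like object."""
    usage = SimpleNamespace(
        input_tokens=120,
        output_tokens=40,
        cache_creation_input_tokens=None,  # nothing written to the cache on this call
        cache_read_input_tokens=8000,
    )
    return SimpleNamespace(text="ok", usage=usage)


# Record what the instrument callback receives instead of exporting metrics.
recorded = []
response = call_with_metrics(
    fake_llm_call,
    lambda start, usage: recorded.append((start, usage)),
    "hello",
)
print(recorded[0][1].cache_read_input_tokens)
```

In a real service the second argument would be `instrument_call` itself. Note that this sketch does not report calls that raise an exception; whether failed calls should count toward latency is a design choice left open here.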

Grafana Dashboard

The dashboard should have 4 main panels:

Query:

sum(rate(llm_cache_hits_total[5m]))
/
(sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))

Visualization: Time series. Y axis 0-100% (the query returns a 0-1 ratio, so set the unit to Percent (0.0-1.0)). Alert: <85% → Slack notification.
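A dashboard JSON for this panel can be generated rather than written by hand. The sketch below builds a minimal dashboard with the hit rate panel plus a p95 latency panel whose query is taken from the alert rules in this lesson. The field names follow Grafana's dashboard JSON model as commonly exported (`panels`, `targets`, `fieldConfig`, `gridPos`), but treat them as an assumption and verify against your Grafana version; the dashboard title, refresh interval, and the `timeseries_panel` helper are hypothetical, and the remaining panels are left to you.

```python
import json

# Queries taken from this lesson.
HIT_RATE_EXPR = (
    "sum(rate(llm_cache_hits_total[5m])) / "
    "(sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))"
)
P95_EXPR = "histogram_quantile(0.95, rate(llm_request_latency_seconds_bucket[5m]))"


def timeseries_panel(panel_id, title, expr, unit, x, y, **defaults):
    """Build one time series panel dict (hypothetical helper)."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 12, "x": x, "y": y},
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {"defaults": {"unit": unit, **defaults}, "overrides": []},
    }


dashboard = {
    "title": "LLM Cache Monitoring",  # hypothetical title
    "refresh": "30s",
    "time": {"from": "now-6h", "to": "now"},
    "panels": [
        # percentunit renders a 0-1 ratio as 0-100%
        timeseries_panel(1, "Cache hit rate", HIT_RATE_EXPR, "percentunit", 0, 0, min=0, max=1),
        timeseries_panel(2, "p95 latency", P95_EXPR, "s", 12, 0),
    ],
}

print(json.dumps(dashboard, indent=2))
```

No datasource is set, so the panels would use whatever default datasource the Grafana instance has configured.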

Alerting Best Practices

yaml

# Prometheus alert rules
groups:
  - name: llm_cache
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(llm_cache_hits_total[10m]))
          /
          (sum(rate(llm_cache_hits_total[10m])) + sum(rate(llm_cache_writes_total[10m])))
          < 0.85
        for: 5m
        annotations:
          summary: "Cache hit rate < 85% for 5 minutes"
 
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(llm_request_latency_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "p95 latency > 5s"
 
      - alert: CostSpike
        expr: |
          sum(increase(llm_request_cost_usd_total[1h]))
          > 2 * avg_over_time(sum(increase(llm_request_cost_usd_total[1h]))[24h:1h])
        annotations:
          summary: "Hourly cost 2× baseline"

Prometheus alerting rules
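The logic of the CostSpike rule can be restated in plain Python to make the comparison explicit: the latest hourly cost is checked against twice the average of the hourly costs in the trailing window. This is only an illustration of the idea, not how Prometheus evaluates the rule; `is_cost_spike` and its input list are hypothetical.

```python
from statistics import mean


def is_cost_spike(hourly_costs, factor=2.0):
    """True when the latest hourly cost exceeds factor times the window average.

    hourly_costs: hourly totals in USD, oldest to newest (hypothetical input,
    e.g. 24 values). The latest hour is included in the average, so a single
    spike also raises its own baseline slightly.
    """
    if not hourly_costs:
        return False
    baseline = mean(hourly_costs)
    return hourly_costs[-1] > factor * baseline


steady = [1.0] * 24
spiking = [1.0] * 23 + [3.0]   # average 26/24, about 1.08; 3.0 exceeds twice that
doubling = [1.0] * 23 + [2.0]  # average 25/24, about 1.04; 2.0 does not exceed twice that

print(is_cost_spike(steady), is_cost_spike(spiking), is_cost_spike(doubling))
```

The last case shows a consequence of including the current hour in the baseline: an hour that costs exactly twice the usual amount does not trigger the alert.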

Cost Surveillance

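The cost spike alert is critical. If a bug starts causing cache misses, you might otherwise notice only hours later, and every request in those hours is billed at the uncached rate. A real-time alert keeps that gap, and its cost, small.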
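To see why a silent cache miss is expensive, the sketch below prices the same request twice with the per-million-token rates from the cost formula in `instrument_call`: once with the shared prefix read from the cache, once with the prefix rewritten to the cache on every call. The token counts are made up for illustration.

```python
# USD per million tokens, copied from the cost formula in instrument_call.
PRICE_INPUT = 3.0
PRICE_CACHE_WRITE = 3.75
PRICE_CACHE_READ = 0.30
PRICE_OUTPUT = 15.0


def request_cost(input_tokens, cache_write, cache_read, output_tokens):
    """Cost of one request in USD."""
    return (
        input_tokens / 1e6 * PRICE_INPUT
        + cache_write / 1e6 * PRICE_CACHE_WRITE
        + cache_read / 1e6 * PRICE_CACHE_READ
        + output_tokens / 1e6 * PRICE_OUTPUT
    )


# Hypothetical request: 50,000-token shared prefix, 1,000 fresh tokens, 500 output tokens.
hit = request_cost(1_000, 0, 50_000, 500)   # prefix served from the cache
miss = request_cost(1_000, 50_000, 0, 500)  # bug: prefix written to the cache again on every call

print(f"hit:  ${hit:.4f}")
print(f"miss: ${miss:.4f}")
print(f"ratio: {miss / hit:.1f}x")
```

With these numbers the request costs about $0.0255 on a hit and $0.1980 on a miss, roughly 7.8 times more, which is the kind of jump the CostSpike rule is meant to catch.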

In the Next Lesson

Production observability stack lab.
