Cache Monitoring: Dashboards and Alerting
What do you measure in a production cache stack? Cache hit rate, p50/p95 latency, cost per request, and regression detection, with a Grafana dashboard JSON example.
Şükrü Yusuf KAYA
12 min read
Intermediate
Production Cache Monitoring
You deployed the cache and it works. But how can you be sure it keeps working? Without monitoring, a caching regression can go unnoticed.
There are five metrics to watch.
5 Critical Metrics
| Metric | Target | Alert threshold |
|---|---|---|
| Cache hit rate | 90%+ | Alert below 85% |
| Cache write rate | Low | A spike signals a regression |
| Avg cost per request | Track the trend | Increase of 20% or more |
| p50 latency | <2s | Increase of more than 50% |
| p95 latency | <5s | Increase of more than 30% |
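The thresholds in the table can be sketched as a small check that compares a current measurement window against a baseline window. This is only an illustration: the `Snapshot` type and the function names are hypothetical, the latency thresholds are read as relative increases over the baseline, and the cache write rate is left out because the table gives no numeric threshold for a "spike".

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for one measurement window."""
    hit_rate: float          # fraction between 0.0 and 1.0
    cost_per_request: float  # USD
    p50_latency: float       # seconds
    p95_latency: float       # seconds


def relative_increase(current: float, baseline: float) -> float:
    """Fractional increase of current over baseline (0.2 means +20%)."""
    if baseline <= 0:
        return 0.0
    return (current - baseline) / baseline


def triggered_alerts(current: Snapshot, baseline: Snapshot) -> list[str]:
    """Return the names of the table thresholds that the current window violates."""
    alerts = []
    if current.hit_rate < 0.85:
        alerts.append("cache_hit_rate")
    if relative_increase(current.cost_per_request, baseline.cost_per_request) >= 0.20:
        alerts.append("cost_per_request")
    if relative_increase(current.p50_latency, baseline.p50_latency) > 0.50:
        alerts.append("p50_latency")
    if relative_increase(current.p95_latency, baseline.p95_latency) > 0.30:
        alerts.append("p95_latency")
    return alerts
```

In practice these comparisons would live in Prometheus alert rules rather than application code; the sketch only makes the arithmetic behind each threshold explicit.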
Prometheus Metric Export
```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics
cache_hits = Counter('llm_cache_hits_total', 'Cache read tokens')
cache_writes = Counter('llm_cache_writes_total', 'Cache write tokens')
input_tokens = Counter('llm_input_tokens_total', 'Fresh input tokens')
output_tokens = Counter('llm_output_tokens_total', 'Output tokens')
request_latency = Histogram('llm_request_latency_seconds', 'Request latency',
                            buckets=(0.1, 0.5, 1, 2, 5, 10))
# The Python client appends "_total" to counter names, so this series
# is scraped as llm_request_cost_usd_total.
request_cost = Counter('llm_request_cost_usd', 'Cost in USD')


def instrument_call(start_time, usage):
    """Call this after every LLM call."""
    duration = time.time() - start_time
    request_latency.observe(duration)

    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0
    cache_writes.inc(cw)
    cache_hits.inc(cr)
    input_tokens.inc(usage.input_tokens)
    output_tokens.inc(usage.output_tokens)

    # Cost in USD, from per-million-token prices
    cost = (
        usage.input_tokens / 1e6 * 3.0
        + cw / 1e6 * 3.75
        + cr / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0
    )
    request_cost.inc(cost)


# Start the server that exposes the /metrics endpoint
start_http_server(9100)
```
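To see what the cost formula implies, here is a minimal worked example that applies the same per-million-token prices to two made-up requests, one served from cache and one that writes the cache. The helper names `request_cost_usd` and `token_hit_rate` and the token counts are hypothetical; `SimpleNamespace` merely stands in for the `usage` object of a real response.

```python
from types import SimpleNamespace


def request_cost_usd(usage) -> float:
    """Cost of one request in USD, using the per-million-token prices from the lesson."""
    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0
    return (
        usage.input_tokens * 3.0
        + cw * 3.75
        + cr * 0.30
        + usage.output_tokens * 15.0
    ) / 1e6


def token_hit_rate(usage) -> float:
    """Cache-read tokens as a share of all cache tokens (reads plus writes)."""
    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0
    total = cw + cr
    return cr / total if total else 0.0


# Same request shape, once as a cache hit and once as a cache miss (made-up numbers).
hit = SimpleNamespace(input_tokens=200, cache_creation_input_tokens=0,
                      cache_read_input_tokens=10_000, output_tokens=500)
miss = SimpleNamespace(input_tokens=200, cache_creation_input_tokens=10_000,
                       cache_read_input_tokens=0, output_tokens=500)

print(f"hit:  ${request_cost_usd(hit):.4f}")   # (600 + 3000 + 7500) / 1e6 = $0.0111
print(f"miss: ${request_cost_usd(miss):.4f}")  # (600 + 37500 + 7500) / 1e6 = $0.0456
```

With these numbers the miss costs roughly four times as much as the hit, which is why a silent drop in hit rate shows up quickly in the cost counter.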
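Grafana Dashboard
The dashboard should have four main panels. The cache hit rate panel:
Query:
sum(rate(llm_cache_hits_total[5m])) / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))
Display: Time series, Y axis 0-100%. The query returns a ratio between 0 and 1, so pick a percent unit that expects that range.
Alert: <85% → Slack notification.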
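The hit rate panel can also be generated as dashboard JSON from Python. The sketch below is an assumption-heavy outline rather than a complete dashboard: it uses only a small subset of fields from Grafana's panel JSON model (`type`, `title`, `targets`, `fieldConfig`), the dashboard title and the function name are hypothetical, and the result should be compared against a dashboard exported from your own Grafana version before importing it.

```python
import json

# Same PromQL expression as the panel query above.
HIT_RATE_QUERY = (
    "sum(rate(llm_cache_hits_total[5m])) / "
    "(sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))"
)


def hit_rate_panel(panel_id: int = 1) -> dict:
    """Build a time series panel that plots the cache hit rate as a percentage."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": "Cache hit rate",
        "targets": [{"refId": "A", "expr": HIT_RATE_QUERY}],
        "fieldConfig": {
            # "percentunit" renders a 0-1 ratio as 0-100%.
            "defaults": {"unit": "percentunit", "min": 0, "max": 1},
        },
    }


# Hypothetical dashboard wrapper; only the one panel described in the lesson is included.
dashboard = {"title": "LLM cache monitoring", "panels": [hit_rate_panel()]}
dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

Generating the JSON from code keeps the query string in one place, so the panel and any alert rule built on the same expression are less likely to drift apart.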
Alerting Best Practices
```yaml
# Prometheus alert rules
groups:
  - name: llm_cache
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(llm_cache_hits_total[10m]))
          /
          (sum(rate(llm_cache_hits_total[10m])) + sum(rate(llm_cache_writes_total[10m])))
          < 0.85
        for: 5m
        annotations:
          summary: "Cache hit rate < 85% for 5 minutes"
      - alert: HighRequestLatency
        # Aggregate buckets by "le" so the quantile covers all instances.
        expr: histogram_quantile(0.95, sum by (le) (rate(llm_request_latency_seconds_bucket[5m]))) > 5
        for: 5m
        annotations:
          summary: "p95 latency > 5s"
      - alert: CostSpike
        # The Python client exposes the cost counter with a "_total" suffix.
        # A range over an expression needs subquery syntax: [24h:1h].
        expr: |
          sum(increase(llm_request_cost_usd_total[1h]))
          > 2 * avg_over_time(sum(increase(llm_request_cost_usd_total[1h]))[24h:1h])
        annotations:
          summary: "Hourly cost 2× baseline"
```
Cost Surveillance
The cost spike alert is critical. If a bug causes cache misses, you would otherwise notice the difference only hours later. A real-time alert closes that expensive gap.
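The logic of a cost spike check can be sketched in plain Python: compare the most recent hourly cost with a multiple of the average over the preceding hours. The function name and the input list are hypothetical, and unlike a rolling Prometheus average, this version excludes the latest hour from the baseline.

```python
from statistics import mean


def is_cost_spike(hourly_costs: list[float], factor: float = 2.0) -> bool:
    """Return True if the latest hourly cost exceeds factor × the recent average.

    hourly_costs is ordered oldest first; the last element is the most recent hour.
    The baseline is the mean of up to 24 hours before the latest one.
    """
    if len(hourly_costs) < 2:
        return False  # not enough history to form a baseline
    *history, latest = hourly_costs
    baseline = mean(history[-24:])
    return baseline > 0 and latest > factor * baseline


# A steady $1.00 per hour followed by a jump to $2.50 counts as a spike.
print(is_cost_spike([1.0] * 24 + [2.5]))  # True
print(is_cost_spike([1.0] * 24 + [1.5]))  # False
```

A factor of 2 matches the "2× baseline" wording of the alert summary; a tighter factor catches smaller regressions at the price of more false alarms.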
In the Next Lesson
Production observability stack lab.