Cache Monitoring: Dashboards and Alerting
What should you measure in a production cache stack? Cache hit rate, p50/p95 latency, cost per request, and regression detection, with Grafana dashboard and Prometheus alerting examples.
Şükrü Yusuf KAYA
12 min read
Intermediate · Production Cache Monitoring
You've deployed the cache and it works. But how do you know it keeps working? Without monitoring, a caching regression can go unnoticed for hours.
There are five metrics worth watching.
5 Critical Metrics#
| Metric | Target | Alert threshold |
|---|---|---|
| Cache hit rate | 90%+ | <85% |
| Cache write rate | Low | Spike → likely regression |
| Avg cost per request | Track the trend | >20% increase |
| p50 latency | <2s | >50% increase |
| p95 latency | <5s | >30% increase |
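The table's two headline metrics can be derived directly from per-request token counts. A minimal sketch; the prices are illustrative (the $3/M input, 1.25× write, 0.1× read, $15/M output rates used later in this lesson), and the token counts are made up:

```python
def cache_hit_rate(cache_read_tokens: int, cache_write_tokens: int) -> float:
    """Fraction of cacheable input tokens served from cache."""
    total = cache_read_tokens + cache_write_tokens
    return cache_read_tokens / total if total else 0.0

def request_cost_usd(input_tokens: int, cache_write: int,
                     cache_read: int, output_tokens: int) -> float:
    """Per-request cost in USD, with cache write premium and read discount."""
    return (input_tokens / 1e6 * 3.00      # fresh input
            + cache_write / 1e6 * 3.75     # cache write (1.25x input price)
            + cache_read / 1e6 * 0.30      # cache read (0.1x input price)
            + output_tokens / 1e6 * 15.00) # output

rate = cache_hit_rate(cache_read_tokens=45_000, cache_write_tokens=5_000)
assert rate == 0.9  # exactly at the 90% target; below 0.85 should alert
```

The hit-rate definition (reads over reads-plus-writes) is the same ratio the Grafana query below computes from the Prometheus counters.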
Prometheus Metric Export#
```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics
cache_hits = Counter('llm_cache_hits_total', 'Cache read tokens')
cache_writes = Counter('llm_cache_writes_total', 'Cache write tokens')
input_tokens = Counter('llm_input_tokens_total', 'Fresh input tokens')
output_tokens = Counter('llm_output_tokens_total', 'Output tokens')
request_latency = Histogram('llm_request_latency_seconds', 'Request latency',
                            buckets=(0.1, 0.5, 1, 2, 5, 10))
request_cost = Counter('llm_request_cost_usd', 'Cost in USD')

def instrument_call(start_time, usage):
    """Call after every LLM request."""
    duration = time.time() - start_time
    request_latency.observe(duration)

    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0
    cache_writes.inc(cw)
    cache_hits.inc(cr)
    input_tokens.inc(usage.input_tokens)
    output_tokens.inc(usage.output_tokens)

    # Cost: $3/M fresh input, 1.25x for cache writes, 0.1x for cache reads,
    # $15/M output
    cost = (
        usage.input_tokens / 1e6 * 3.0
        + cw / 1e6 * 3.75
        + cr / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0
    )
    request_cost.inc(cost)

# Start the metrics server (exposes the /metrics endpoint)
start_http_server(9100)
```
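To make the cost arithmetic concrete, here is what `instrument_call` would record for a hypothetical warm-cache request. The `Usage` dataclass is a stand-in for the SDK's usage object, and all token counts are made up:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    """Minimal stand-in for an LLM SDK usage object (assumed fields)."""
    input_tokens: int
    output_tokens: int
    cache_creation_input_tokens: int
    cache_read_input_tokens: int

# Warm-cache request: 40k tokens read from cache, only 200 fresh input tokens
usage = Usage(input_tokens=200, output_tokens=500,
              cache_creation_input_tokens=0,
              cache_read_input_tokens=40_000)

cost = (usage.input_tokens / 1e6 * 3.0
        + usage.cache_creation_input_tokens / 1e6 * 3.75
        + usage.cache_read_input_tokens / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0)
# 0.0006 (fresh) + 0.012 (cache reads) + 0.0075 (output) = $0.0201
```

The same request without caching would pay full input price on all 40,200 tokens, roughly six times more, which is why the cost-per-request counter is worth exporting alongside the token counters.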
Grafana Dashboard#
The dashboard should have four main panels. The headline panel is cache hit rate:
Query:
```promql
sum(rate(llm_cache_hits_total[5m]))
  / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))
```
Visualization: time series, Y axis 0–100%.
Alert: below 85% → Slack notification.
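The remaining three panels can be built from the same exported metrics. A sketch of possible queries (panel layout and windows are up to you):

```promql
# Panel 2 — p95 latency from the histogram buckets
histogram_quantile(0.95, sum(rate(llm_request_latency_seconds_bucket[5m])) by (le))

# Panel 3 — hourly spend in USD
sum(increase(llm_request_cost_usd[1h]))

# Panel 4 — cache write rate; a sustained spike suggests a cache-key regression
sum(rate(llm_cache_writes_total[5m]))
```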
Alerting Best Practices#
```yaml
# Prometheus alert rules
groups:
  - name: llm_cache
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(llm_cache_hits_total[10m]))
            / (sum(rate(llm_cache_hits_total[10m]))
               + sum(rate(llm_cache_writes_total[10m]))) < 0.85
        for: 5m
        annotations:
          summary: "Cache hit rate < 85% for 5 minutes"
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(llm_request_latency_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "p95 latency > 5s"
      - alert: CostSpike
        expr: |
          sum(increase(llm_request_cost_usd[1h]))
            > 2 * avg_over_time(sum(increase(llm_request_cost_usd[1h]))[24h:1h])
        annotations:
          summary: "Hourly cost 2× baseline"
```
Cost Surveillance
The cost spike alert is critical. If a bug silently turns cache hits into misses, you would otherwise notice only hours later, on the bill. A real-time alarm turns an expensive surprise into a quick fix.
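The CostSpike rule above can also be expressed as plain application logic, which is handy for testing the threshold before wiring up Prometheus. A stand-alone sketch (the 2× factor and 24-hour baseline mirror the rule; the dollar amounts are made up):

```python
from collections import deque

class CostSpikeDetector:
    """Alert when the last hour's spend exceeds factor × trailing average."""

    def __init__(self, factor: float = 2.0, window_hours: int = 24):
        self.factor = factor
        self.hourly = deque(maxlen=window_hours)  # rolling baseline window

    def observe_hour(self, cost_usd: float) -> bool:
        """Record one hour of spend; return True if it should alert."""
        baseline = sum(self.hourly) / len(self.hourly) if self.hourly else None
        self.hourly.append(cost_usd)
        return baseline is not None and cost_usd > self.factor * baseline

detector = CostSpikeDetector()
normal = [detector.observe_hour(1.0) for _ in range(24)]  # steady $1/hour
spike = detector.observe_hour(2.5)  # $2.50 > 2 × $1.00 baseline
print(any(normal), spike)  # False True
```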
In the Next Lesson#
Production observability stack lab.