Cache Monitoring: Dashboard and Alerting

What do you measure in a production cache stack? Cache hit rate, p50/p95 latency, cost per request, and regression detection, with a Grafana dashboard JSON example.

Şükrü Yusuf KAYA
12-minute read
Intermediate

Production Cache Monitoring#

You have deployed the cache and it works. But how can you be sure it keeps working? Without monitoring, a caching regression can go unnoticed.
There are 5 metrics to watch.

5 Critical Metrics#

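| Metric | Target | Alert threshold |
| --- | --- | --- |
| Cache hit rate | 90%+ | Alert below 85% |
| Cache write rate | Low | A spike signals a regression |
| Avg cost per request | Track the trend | Increase of 20% or more |
| p50 latency | <2s | Increase of more than 50% |
| p95 latency | <5s | Increase of more than 30% |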
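
The thresholds in the table can be read as a small rule set. The sketch below is one hypothetical way to check a metrics snapshot against a baseline. The `Snapshot` class, its field names, and `breached_alerts` are illustrative and not part of the lesson. It reads the cost and latency thresholds as relative increases over a baseline, and it leaves out the write-rate spike because the table gives no number for it.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical container for one observation window."""
    cache_read_tokens: int
    cache_write_tokens: int
    avg_cost_usd: float
    p50_seconds: float
    p95_seconds: float

    @property
    def hit_rate(self) -> float:
        # Token-based hit rate: reads / (reads + writes); 0.0 when idle.
        total = self.cache_read_tokens + self.cache_write_tokens
        return self.cache_read_tokens / total if total else 0.0


def relative_increase(current: float, baseline: float) -> float:
    """Fractional change versus baseline (0.2 means +20%)."""
    return (current - baseline) / baseline if baseline else 0.0


def breached_alerts(current: Snapshot, baseline: Snapshot) -> list[str]:
    """Names of the table rows whose alert threshold is crossed."""
    alerts = []
    if current.hit_rate < 0.85:
        alerts.append("cache_hit_rate")
    if relative_increase(current.avg_cost_usd, baseline.avg_cost_usd) >= 0.20:
        alerts.append("avg_cost_per_request")
    if relative_increase(current.p50_seconds, baseline.p50_seconds) > 0.50:
        alerts.append("p50_latency")
    if relative_increase(current.p95_seconds, baseline.p95_seconds) > 0.30:
        alerts.append("p95_latency")
    return alerts


# Made-up numbers: a healthy baseline and a window after a regression.
baseline = Snapshot(9_200, 800, 0.010, 1.2, 3.0)
regressed = Snapshot(6_000, 4_000, 0.016, 1.3, 4.2)
print(breached_alerts(regressed, baseline))
```

In production these checks belong in the alerting system itself; a function like this is mainly useful for tests or offline analysis of exported metrics.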

Prometheus Metric Export#

python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics. The token counters track token volume, not request counts.
cache_hits = Counter('llm_cache_hits_total', 'Cache read tokens')
cache_writes = Counter('llm_cache_writes_total', 'Cache write tokens')
input_tokens = Counter('llm_input_tokens_total', 'Fresh input tokens')
output_tokens = Counter('llm_output_tokens_total', 'Output tokens')
request_latency = Histogram(
    'llm_request_latency_seconds',
    'Request latency',
    buckets=(0.1, 0.5, 1, 2, 5, 10),
)
# prometheus_client exposes counters with a "_total" suffix, so the name
# spells it out to match what queries and alert rules must reference.
request_cost = Counter('llm_request_cost_usd_total', 'Cost in USD')


def instrument_call(start_time, usage):
    """Call after every LLM call. start_time must come from time.time()."""
    duration = time.time() - start_time
    request_latency.observe(duration)

    # Cache fields may be None when caching is not in play.
    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0

    cache_writes.inc(cw)
    cache_hits.inc(cr)
    input_tokens.inc(usage.input_tokens)
    output_tokens.inc(usage.output_tokens)

    # Cost in USD. Rates are per million tokens and hardcoded here;
    # adjust them to the model you actually call.
    cost = (
        usage.input_tokens / 1e6 * 3.0
        + cw / 1e6 * 3.75
        + cr / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0
    )
    request_cost.inc(cost)


# Start the server
start_http_server(9100)  # serves the /metrics endpoint on port 9100
Prometheus metric export
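
To see why hit rate drives cost, the per-million-token rates hardcoded above can be applied to a single request with and without a cache hit. This is a minimal sketch: `request_cost` and the token counts are hypothetical, and the rates are simply the ones in the lesson's code, so treat them as assumptions to adjust for your model.

```python
# USD per million tokens, copied from the cost formula in the export code.
RATE_INPUT = 3.0
RATE_CACHE_WRITE = 3.75
RATE_CACHE_READ = 0.30
RATE_OUTPUT = 15.0


def request_cost(fresh_in: int, cache_write: int, cache_read: int, out: int) -> float:
    """Cost of one request in USD for the given token counts."""
    return (
        fresh_in / 1e6 * RATE_INPUT
        + cache_write / 1e6 * RATE_CACHE_WRITE
        + cache_read / 1e6 * RATE_CACHE_READ
        + out / 1e6 * RATE_OUTPUT
    )


# Hypothetical request: 10,000-token shared prefix, 200 fresh tokens, 500 output tokens.
hit = request_cost(fresh_in=200, cache_write=0, cache_read=10_000, out=500)
miss = request_cost(fresh_in=200, cache_write=10_000, cache_read=0, out=500)

print(f"cache hit:  ${hit:.4f}")
print(f"cache miss: ${miss:.4f}")
print(f"miss/hit ratio: {miss / hit:.1f}x")
```

With these made-up token counts a miss costs roughly four times a hit, which is why a falling hit rate tends to show up in the cost metric before anyone notices it elsewhere.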

Grafana Dashboard#

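The dashboard should have 4 main panels. For the cache hit rate panel:
Query:
sum(rate(llm_cache_hits_total[5m])) / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))
Display: Time series. Y axis 0-100%. Alert: <85% → Slack notification. The query returns a ratio between 0 and 1, so either set the panel unit to Percent (0.0-1.0) or multiply the expression by 100.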
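
A panel like this can be expressed as dashboard JSON. The sketch below builds a minimal time series panel in Python and prints it. The titles, `refId`, and grid position are placeholder assumptions, no datasource is set, and a full Grafana export carries many more fields than shown here. Because the query returns a ratio between 0 and 1, the sketch uses the `percentunit` unit instead of multiplying by 100.

```python
import json

# Same hit rate query as the panel described above.
HIT_RATE_QUERY = (
    "sum(rate(llm_cache_hits_total[5m])) / "
    "(sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))"
)

panel = {
    "type": "timeseries",
    "title": "Cache hit rate",  # placeholder title
    "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},  # placeholder layout
    "targets": [{"refId": "A", "expr": HIT_RATE_QUERY}],
    "fieldConfig": {
        "defaults": {
            "unit": "percentunit",  # render 0..1 as 0%..100%
            "min": 0,
            "max": 1,
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "red", "value": None},    # base step
                    {"color": "green", "value": 0.85},  # at or above the 85% line
                ],
            },
        },
        "overrides": [],
    },
}

# Placeholder dashboard wrapper holding just this one panel.
dashboard = {"title": "LLM cache monitoring", "panels": [panel]}
print(json.dumps(dashboard, indent=2))
```

Generating the JSON from code keeps the query string in one place, so the panel and the alert rule are less likely to drift apart.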

Alerting Best Practices#

yaml
# Prometheus alert rules
groups:
  - name: llm_cache
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(llm_cache_hits_total[10m]))
          /
          (sum(rate(llm_cache_hits_total[10m])) + sum(rate(llm_cache_writes_total[10m])))
          < 0.85
        for: 5m
        annotations:
          summary: "Cache hit rate < 85% for 5 minutes"

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(llm_request_latency_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "p95 latency > 5s"

      - alert: CostSpike
        # The Python client exposes the cost counter as llm_request_cost_usd_total.
        # A subquery needs the [range:resolution] form, hence [24h:1h].
        expr: |
          sum(increase(llm_request_cost_usd_total[1h]))
          > 2 * avg_over_time(sum(increase(llm_request_cost_usd_total[1h]))[24h:1h])
        annotations:
          summary: "Hourly cost 2× baseline"
Prometheus alerting rules
Cost Surveillance
The cost spike alert is critical. If a bug is causing cache misses, you might otherwise notice the difference only hours later. A real-time alert is what keeps that gap from becoming expensive.
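
The CostSpike rule compares the latest hour against a trailing average. The same logic can be sketched in plain Python for offline checks, for example in a unit test or a backfill script. `is_cost_spike` and the sample figures are hypothetical, and the 2× factor mirrors the rule's summary. Unlike the Prometheus expression, this baseline excludes the hour under test.

```python
from statistics import mean


def is_cost_spike(hourly_costs: list[float], factor: float = 2.0) -> bool:
    """Return True when the latest hour exceeds `factor` times the trailing mean.

    `hourly_costs` is ordered oldest to newest. The last entry is the hour
    under test; up to 24 entries before it form the baseline.
    """
    if len(hourly_costs) < 2:
        return False  # not enough history to form a baseline
    *history, latest = hourly_costs
    baseline = mean(history[-24:])
    return latest > factor * baseline


# Made-up hourly costs in USD.
steady = [4.0] * 24 + [4.5]   # a normal hour
broken = [4.0] * 24 + [11.0]  # cache misses after a bad deploy
print(is_cost_spike(steady), is_cost_spike(broken))
```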

In the Next Lesson#

Production observability stack lab.
