
Cache Monitoring: Dashboards and Alerting

What do you measure in a production cache stack? Cache hit rate, p50/p95 latency, cost per request, and regression detection, with a Grafana dashboard JSON example.

Şükrü Yusuf KAYA
12 min read
Intermediate

Production Cache Monitoring

You've deployed the cache and it works. But how do you know for sure? Without monitoring, a caching regression can go unnoticed for a long time.
There are five metrics worth watching.

5 Critical Metrics

| Metric | Target | Alert threshold |
|---|---|---|
| Cache hit rate | 90%+ | Alarm below 85% |
| Cache write rate | Low | A spike signals a regression |
| Avg cost per request | Track the trend | 20%+ increase |
| p50 latency | <2s | >50% increase |
| p95 latency | <5s | >30% increase |
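
The first three metrics fall straight out of the token counts, so they are easy to sanity-check before wiring up Prometheus. A minimal sketch; the `UsageTotals` container and the per-million-token prices are illustrative assumptions (adjust to your model's pricing):

```python
from dataclasses import dataclass

@dataclass
class UsageTotals:
    # Hypothetical aggregate over some time window
    cache_read_tokens: int
    cache_write_tokens: int
    input_tokens: int
    output_tokens: int
    requests: int

def hit_rate(u: UsageTotals) -> float:
    """Share of cacheable tokens that were served from cache."""
    total = u.cache_read_tokens + u.cache_write_tokens
    return u.cache_read_tokens / total if total else 0.0

def avg_cost_per_request(u: UsageTotals) -> float:
    """Assumed Sonnet-class pricing: $3/$3.75/$0.30/$15 per million tokens."""
    cost = (u.input_tokens * 3.00 + u.cache_write_tokens * 3.75
            + u.cache_read_tokens * 0.30 + u.output_tokens * 15.00) / 1e6
    return cost / max(u.requests, 1)

u = UsageTotals(cache_read_tokens=900_000, cache_write_tokens=50_000,
                input_tokens=40_000, output_tokens=120_000, requests=500)
assert hit_rate(u) > 0.85  # the alert threshold from the table
```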

Prometheus Metric Export

```python
import time

from prometheus_client import Counter, Histogram, start_http_server

# Metrics
cache_hits = Counter('llm_cache_hits_total', 'Cache read tokens')
cache_writes = Counter('llm_cache_writes_total', 'Cache write tokens')
input_tokens = Counter('llm_input_tokens_total', 'Fresh input tokens')
output_tokens = Counter('llm_output_tokens_total', 'Output tokens')
request_latency = Histogram('llm_request_latency_seconds', 'Request latency',
                            buckets=(0.1, 0.5, 1, 2, 5, 10))
# Counters get a _total suffix on export: llm_request_cost_usd_total
request_cost = Counter('llm_request_cost_usd', 'Cost in USD')

def instrument_call(start_time, usage):
    """Call after every LLM request."""
    duration = time.time() - start_time
    request_latency.observe(duration)

    cw = usage.cache_creation_input_tokens or 0
    cr = usage.cache_read_input_tokens or 0

    cache_writes.inc(cw)
    cache_hits.inc(cr)
    input_tokens.inc(usage.input_tokens)
    output_tokens.inc(usage.output_tokens)

    # Cost: $3/M input, $3.75/M cache write, $0.30/M cache read, $15/M output
    cost = (
        usage.input_tokens / 1e6 * 3.0
        + cw / 1e6 * 3.75
        + cr / 1e6 * 0.30
        + usage.output_tokens / 1e6 * 15.0
    )
    request_cost.inc(cost)

# Start the /metrics endpoint
start_http_server(9100)
```
Prometheus metric export
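
As a usage sketch, here is how `instrument_call` hooks into a client call. This assumes the Anthropic Python SDK, whose `usage` object carries the `cache_read_input_tokens` fields used above; any client exposing equivalent usage counters works the same way, and the model id is just an example:

```python
import time
import anthropic  # assumption: the official Anthropic SDK

client = anthropic.Anthropic()

def cached_completion(system_blocks, messages):
    """Wrap a call so every request feeds the Prometheus metrics."""
    start = time.time()
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # example model id
        max_tokens=1024,
        system=system_blocks,  # blocks carrying cache_control markers
        messages=messages,
    )
    instrument_call(start, resp.usage)
    return resp
```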

Grafana Dashboard

The dashboard should have four main panels: cache hit rate, cache write rate, cost per request, and latency (p50/p95 together). For the cache hit rate panel:
Query:
```promql
sum(rate(llm_cache_hits_total[5m])) / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))
```
Visualization: time series, Y axis 0–100%. Alert: <85% → Slack notification.
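
For provisioned dashboards, the same panel can be expressed as dashboard JSON. A minimal sketch of just this panel, assuming a recent Grafana timeseries panel schema; the datasource `uid` is a placeholder:

```json
{
  "title": "Cache Hit Rate",
  "type": "timeseries",
  "datasource": { "type": "prometheus", "uid": "YOUR_PROM_UID" },
  "fieldConfig": {
    "defaults": { "unit": "percentunit", "min": 0, "max": 1 }
  },
  "targets": [
    {
      "expr": "sum(rate(llm_cache_hits_total[5m])) / (sum(rate(llm_cache_hits_total[5m])) + sum(rate(llm_cache_writes_total[5m])))",
      "legendFormat": "hit rate"
    }
  ]
}
```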

Alerting Best Practices

```yaml
# Prometheus alert rules
groups:
  - name: llm_cache
    rules:
      - alert: LowCacheHitRate
        expr: |
          sum(rate(llm_cache_hits_total[10m]))
            /
          (sum(rate(llm_cache_hits_total[10m])) + sum(rate(llm_cache_writes_total[10m])))
            < 0.85
        for: 5m
        annotations:
          summary: "Cache hit rate < 85% for 5 minutes"

      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(llm_request_latency_seconds_bucket[5m])) > 5
        for: 5m
        annotations:
          summary: "p95 latency > 5s"

      - alert: CostSpike
        expr: |
          sum(increase(llm_request_cost_usd_total[1h]))
            > 2 * avg_over_time(sum(increase(llm_request_cost_usd_total[1h]))[24h:1h])
        annotations:
          summary: "Hourly cost 2x baseline"
```
Prometheus alerting rules
Cost Surveillance
The cost spike alert is critical. If a bug is causing cache misses, you would otherwise only notice hours later, on the bill. A real-time alarm turns an expensive surprise into a quick fix.
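
Delivery matters as much as the rules themselves. A sketch of an Alertmanager route that sends the three alerts above to Slack; the webhook URL and channel name are placeholders:

```yaml
# Alertmanager: route llm_cache alerts to Slack (webhook/channel are placeholders)
route:
  receiver: default
  routes:
    - matchers:
        - alertname =~ "LowCacheHitRate|HighRequestLatency|CostSpike"
      receiver: llm-cache-slack
receivers:
  - name: default
  - name: llm-cache-slack
    slack_configs:
      - api_url: "https://hooks.slack.com/services/..."  # placeholder webhook
        channel: "#llm-alerts"
```

Before deploying, `promtool check rules` validates the rules file syntax.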


In the Next Lesson

Production observability stack lab.
