No on-call rotation and a small team: what is the minimum to monitor?

Minimum monitoring (single dev / small team):

**Sev 1 alerts (phone call)**:
1. Entire service down (health check failing for 5 min)
2. Error rate > 10% (for 15 min)
3. p95 latency > 10 s (for 15 min)

**Daily check (10 min in the morning)**:
1. Conversation count trend (abnormal drop/spike)
2. Error log for the last 24 h
3. GPU utilization trend

**Weekly check (Monday, 30 min)**:
1. Cost analysis (past week)
2. Top error patterns
3. User feedback (thumbs up/down, if any)

Invest more as you grow. **Over-monitoring adds cost and distraction**.

Monitoring, Observability and Alerting: Watch Your Production LLM — From Metrics to Action

The monitoring and observability layer of production LLM serving: Prometheus metrics (vLLM native), Grafana dashboard design, OpenTelemetry tracing, log aggregation (Loki/Elastic), alerting rules (Slack/PagerDuty), and error tracking with Sentry. Turkish-specific anomalies: hallucination detection, tokenizer errors, prompt injection alerts. An LLM engineer's 'what to monitor' guide.

Şükrü Yusuf KAYA

75 min read

5/13/2026

Advanced


👁️ Production LLM: You Can't Manage What You Can't See

Building a production LLM is a technical achievement. Running it is a separate discipline. Your model is in production right now: what are users asking? How fast does it respond? Are the answers correct? Is it hallucinating? Is the GPU saturated? Do you get a warning before a crash? If you cannot answer these questions, you will be flying blind on the day something actually goes wrong. This lesson teaches the engineering of truly seeing a Turkish LLM in production. Prometheus metrics come 'for free' from vLLM. You build dashboards with Grafana. You write rules that detect Turkish-specific anomalies (hallucination, tokenizer errors, prompt injection). You learn where to look when a Slack alert arrives at 3:00 AM. After 75 minutes, you will start becoming an LLM operator, not just an LLM engineer.

What's in This Lesson? (12 Sections)

Why monitoring: the eyes of production
3 layers: metrics, logs, traces
Prometheus + vLLM native metrics
Grafana dashboard design: the rules
OpenTelemetry tracing: following a request end to end
Log aggregation: Loki vs Elastic
Turkish-specific metrics: what should you measure?
Hallucination detection: quality monitoring
Alert rules: the ones that don't get silenced
Error tracking with Sentry
On-call playbook: what do you do at 3:00 AM?
Exercises

1-6. Monitoring Stack

1.1 Three Layers: Metrics, Logs, Traces

Metrics: numeric values that change over time (CPU %, requests/s, latency). Prometheus + Grafana.

Logs: event descriptions (request body, error stack trace). Loki or Elasticsearch.

Traces: the full journey of a single request (frontend → API → vLLM → DB → response). OpenTelemetry + Jaeger/Tempo.

All three together: 'a user complained about getting a slow response at 14:32' →

From metrics, see the rise in p95 latency
From logs, extract the error patterns
From traces, see where that specific request slowed down

3.1 vLLM Prometheus Metrics

vLLM exposes a /metrics endpoint automatically. Key metrics:

Latency:

vllm:time_to_first_token_seconds: time to first token (TTFT)
vllm:time_per_output_token_seconds: time per output token (TPOT)
vllm:e2e_request_latency_seconds: end-to-end request latency

Throughput:

vllm:prompt_tokens_total
vllm:generation_tokens_total
vllm:num_requests_running
vllm:num_requests_waiting

Resource:

vllm:gpu_cache_usage_perc: KV cache utilization
vllm:cpu_cache_usage_perc: CPU cache (swap)
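The latency metrics above are Prometheus histograms, so a percentile such as p95 is estimated from cumulative bucket counts rather than read directly. The following is a minimal sketch of that estimation using linear interpolation inside the matching bucket, the same idea PromQL's `histogram_quantile` applies. The function name and the input shape are assumptions made for illustration, not part of vLLM or Prometheus client code.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound, cumulative_count) pairs, mirroring the
    `_bucket{le="..."}` series of a Prometheus histogram. The largest
    bound is expected to be +Inf.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan  # no observations yet

    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With hypothetical buckets `[(0.5, 50), (1.0, 80), (2.0, 95), (5.0, 100), (inf, 100)]`, the p95 estimate is the upper edge of the 2-second bucket. The estimate is only as fine as the bucket boundaries, which is worth remembering when an alert threshold sits between two bounds.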

4.1 Grafana Dashboard Structure

An ideal dashboard has 4 rows of up to 4-6 panels each:

Row 1: Top-line Health:

Active users (concurrent)
Total requests / s
Error rate %
p95 latency

Row 2: Latency Detail:

TTFT (p50/p95/p99)
TPOT (token generation speed)
E2E request time

Row 3: Resource:

GPU utilization %
KV cache utilization %
GPU memory usage
Queue depth

Row 4: Business Metrics:

Conversation count
User satisfaction (thumbs up/down ratio)
Token cost / hour
Turkish quality score (e.g. MT-Bench-TR moving average)
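Each panel ultimately needs a query. As a rough sketch, a few of the panels above can be mapped to PromQL expressions built from the vLLM metric names listed in section 3.1. The panel titles, the 5-minute rate window, and the helper function names are assumptions chosen for illustration. Panels whose metrics the lesson does not name (error rate, cost, quality score) are left out rather than guessed.

```python
def quantile_query(metric: str, q: float, window: str = "5m") -> str:
    """PromQL for the q-quantile of a Prometheus histogram metric."""
    return (
        f"histogram_quantile({q}, "
        f"sum by (le) (rate({metric}_bucket[{window}])))"
    )


def rate_query(counter: str, window: str = "5m") -> str:
    """PromQL for the per-second rate of a counter, summed over replicas."""
    return f"sum(rate({counter}[{window}]))"


# Hypothetical panel titles mapped to queries over the vLLM metrics.
PANELS = {
    "p95 latency (E2E)": quantile_query("vllm:e2e_request_latency_seconds", 0.95),
    "TTFT p50": quantile_query("vllm:time_to_first_token_seconds", 0.5),
    "TTFT p99": quantile_query("vllm:time_to_first_token_seconds", 0.99),
    "TPOT p95": quantile_query("vllm:time_per_output_token_seconds", 0.95),
    "Generation tokens / s": rate_query("vllm:generation_tokens_total"),
    "Queue depth": "sum(vllm:num_requests_waiting)",
    "KV cache utilization": "avg(vllm:gpu_cache_usage_perc)",
}

if __name__ == "__main__":
    for title, query in PANELS.items():
        print(f"{title}: {query}")
```

Generating the queries from two small helpers keeps the window and aggregation consistent across panels, so that latency rows are comparable with each other.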

7-9. Turkish Anomaly Detection + Alerting

7.1 Turkish-Specific Metrics

Tokenizer errors:

token_count >> word_count × 2 → tokenization is pathological (a fertility of 1.5-2.5 tokens per word is normal for Turkish)
If fertility > 3 → probably a tokenizer bug or non-Turkish input
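The fertility check can be sketched in a few lines. Here fertility is taken as tokens per whitespace-separated word, and the token count is assumed to be supplied by whatever tokenizer the serving stack already uses. The function names and the intermediate "elevated" label are illustrative choices; only the 2.5 and 3 cut-offs come from the text.

```python
def token_fertility(text: str, token_count: int) -> float:
    """Tokens per whitespace-separated word; 0.0 for empty input."""
    words = text.split()
    if not words:
        return 0.0
    return token_count / len(words)


def classify_fertility(fertility: float,
                       normal_max: float = 2.5,
                       pathological: float = 3.0) -> str:
    """Label a fertility value using the thresholds quoted above."""
    if fertility > pathological:
        return "pathological"  # likely tokenizer bug or non-Turkish input
    if fertility > normal_max:
        return "elevated"      # above the normal range, not yet alarming
    return "normal"


if __name__ == "__main__":
    prompt = "Bugün hava çok güzel"  # 4 words
    for tokens in (8, 11, 14):
        f = token_fertility(prompt, tokens)
        print(tokens, round(f, 2), classify_fertility(f))
```

Tracking this as a per-request metric makes a tokenizer regression visible as a shift in the distribution instead of a single odd request.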

Hallucination indicators:

Response too short (<10 tokens) → likely an 'I can't answer that' refusal
Response too long (>2000 tokens) → probably repetition / a loop
High share of 'Üzgünüm' ('I'm sorry') / 'bilmiyorum' ('I don't know') → low confidence
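These indicators are cheap enough to compute on every response. The sketch below flags a single response by length and refusal phrases, then computes the refusal share over a batch. The flag names and function names are hypothetical, and the token count is assumed to come from the serving layer. The dotted capital İ is mapped to `i` before lowercasing because Python's default `lower()` turns it into `i` plus a combining dot, which would break the substring match.

```python
from typing import List, Set

REFUSAL_MARKERS = ("üzgünüm", "bilmiyorum")


def _lower_tr(text: str) -> str:
    """Lowercase while mapping Turkish dotted capital I to plain 'i'."""
    return text.replace("İ", "i").lower()


def has_refusal(text: str) -> bool:
    lowered = _lower_tr(text)
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def response_flags(text: str, token_count: int) -> Set[str]:
    """Return the heuristic flags that apply to one model response."""
    flags: Set[str] = set()
    if token_count < 10:
        flags.add("too_short")       # possible refusal
    if token_count > 2000:
        flags.add("too_long")        # possible repetition / loop
    if has_refusal(text):
        flags.add("refusal_phrase")
    return flags


def refusal_rate(responses: List[str]) -> float:
    """Share of responses containing a refusal marker (0.0 if empty)."""
    if not responses:
        return 0.0
    return sum(has_refusal(r) for r in responses) / len(responses)
```

A single flagged response means little; the useful signal is the refusal rate over a window, compared against its own recent baseline.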

Prompt injection alert:

Keywords in the user input such as 'ignore previous instructions', 'sen artık' ('you are now'), 'system prompt'
Turkish variants: 'önceki talimatları unut' ('forget the previous instructions'), 'sen şimdi' ('you are now'), 'sistem mesajı' ('system message')
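A minimal keyword matcher for this alert might look as follows. The phrase list is taken from the text; the folding step, which collapses the four Turkish/Latin i-variants (İ, I, ı, i) into one and normalizes whitespace, is an assumption added so that all-caps input still matches. Function names are illustrative.

```python
from typing import List

INJECTION_PHRASES = (
    "ignore previous instructions",
    "sen artık",
    "system prompt",
    "önceki talimatları unut",
    "sen şimdi",
    "sistem mesajı",
)

# Collapse dotted/dotless i variants so case folding is script-agnostic.
_I_FOLD = str.maketrans({"İ": "i", "I": "i", "ı": "i"})


def fold(text: str) -> str:
    """Case-fold text and collapse runs of whitespace to single spaces."""
    return " ".join(text.translate(_I_FOLD).lower().split())


_FOLDED_PHRASES = tuple(fold(p) for p in INJECTION_PHRASES)


def injection_hits(user_input: str) -> List[str]:
    """Return the watched phrases that occur in the user input."""
    folded = fold(user_input)
    return [
        phrase
        for phrase, folded_phrase in zip(INJECTION_PHRASES, _FOLDED_PHRASES)
        if folded_phrase in folded
    ]
```

Phrases like 'sen şimdi' also occur in ordinary conversation, so a hit is better treated as a counter to alert on than as grounds to block a request.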

8.1 Hallucination Detection (Lightweight)

Running an LLM judge in real time is expensive in production. A heuristic approach:

Confidence proxy: the mean token-level log-probability. A low value (a mean around -2.0) → low confidence → higher likelihood of hallucination.
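As a sketch, the confidence proxy reduces to an average over the per-token log-probabilities of the generated answer. This assumes the serving stack returns those log-probabilities with the response; the function names are hypothetical and the -2.0 threshold is the one quoted above. For intuition, a mean of -2.0 corresponds to a geometric-mean token probability of exp(-2.0), roughly 0.14.

```python
import math
from typing import List


def mean_logprob(token_logprobs: List[float]) -> float:
    """Average log-probability of the generated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(token_logprobs) / len(token_logprobs)


def geometric_mean_prob(token_logprobs: List[float]) -> float:
    """Geometric mean of token probabilities, i.e. exp(mean logprob)."""
    return math.exp(mean_logprob(token_logprobs))


def is_low_confidence(token_logprobs: List[float],
                      threshold: float = -2.0) -> bool:
    """Flag a response whose mean logprob is at or below the threshold."""
    return mean_logprob(token_logprobs) <= threshold
```

The threshold is best calibrated per model, since quantization and sampling temperature both shift the typical log-probability range.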

Fact density: if the response contains numeric or date-based facts, compare them with the source in a retrieval-augmented setup. Flag the response if they do not match.

Repetition check: the same n-gram (4-gram) repeated 3+ times → a loop, which is a hallucination indicator.
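The repetition check can be sketched with a counter over sliding n-grams. Whitespace-separated words stand in for model tokens here, which is a simplifying assumption; the function names are illustrative, while n = 4 and the threshold of 3 repeats come from the text.

```python
from collections import Counter
from typing import Sequence


def max_ngram_repeats(tokens: Sequence[str], n: int = 4) -> int:
    """Highest occurrence count of any n-gram in the token sequence."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(tokens) < n:
        return 0
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return max(counts.values())


def looks_like_loop(text: str, n: int = 4, min_repeats: int = 3) -> bool:
    """True if some n-gram occurs at least `min_repeats` times."""
    return max_ngram_repeats(text.split(), n) >= min_repeats


if __name__ == "__main__":
    looping = "bu çok önemli bir " * 3
    print(looks_like_loop(looping))                              # True
    print(looks_like_loop("Ankara Türkiye'nin başkentidir."))    # False
```

Legitimate outputs such as tables or code can repeat short phrases too, so this works better as a flag for sampling into review than as an automatic rejection.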

9.1 Alert Rules

Good alert rules are actionable. Bad alerts (the ones that get muted or ignored) are harmful.

Critical (Severity 1, page):

p95 latency > 5 s (for 10 min)
Error rate > 5% (for 5 min)
GPU memory > 95% (for 5 min)
All replicas down

High (Severity 2, Slack):

p95 latency > 3 s (for 15 min)
Error rate > 1% (for 15 min)
Single replica down
Token cost spike (50% increase)

Info (Severity 3, log):

Latency drift (hour-over-hour change)
Quality score drop (daily report)

Anti-pattern: an alert like 'CPU > 80%'. It triggers all the time, so nobody looks at it.
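What keeps these rules actionable is the duration clause: a threshold must be breached continuously for the stated time before anyone is notified. The sketch below evaluates that logic over a series of samples. The `Rule` dataclass and function names are hypothetical, and in practice this evaluation is normally delegated to the alerting system rather than hand-written.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    name: str
    severity: int
    threshold: float
    for_seconds: int  # how long the breach must last before firing


# Two of the Severity 1 rules from the list above.
P95_LATENCY_SEV1 = Rule("p95_latency_s", 1, 5.0, 600)
ERROR_RATE_SEV1 = Rule("error_rate_pct", 1, 5.0, 300)


def first_fire_time(rule: Rule,
                    samples: List[Tuple[int, float]]) -> Optional[int]:
    """Return the timestamp at which the rule fires, or None.

    `samples` are (unix_seconds, value) pairs sorted by time. Any sample
    at or below the threshold resets the breach timer.
    """
    breach_start: Optional[int] = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= rule.for_seconds:
                return ts
        else:
            breach_start = None
    return None
```

A one-minute latency spike never reaches the pager under this logic, which is the difference between these rules and the 'CPU > 80%' anti-pattern.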

✅ Lesson 16.4 Summary: Monitoring & Observability

A production LLM is monitored in 3 layers: metrics (Prometheus, Grafana), logs (Loki/Elastic), and traces (OpenTelemetry, Jaeger). The vLLM /metrics endpoint is native and needs no extra work. The Grafana dashboard has 4 rows: top-line health, latency detail, resource, business. Turkish-specific anomalies: tokenizer fertility > 3 is pathological, a high refusal percentage, prompt injection keywords. Alert rules must be actionable: p95 > 5 s (sev 1), error rate > 1% (sev 2). Anti-pattern: 'CPU > 80%' triggers constantly and gets muted. The next lesson is the capstone: a Turkish ChatGPT clone, shipping the final product on top of this infrastructure.

Next Lesson: Capstone, a Turkish ChatGPT Clone

Lesson 16.5 is the capstone of Module 16: combine these 4 lessons to build a Turkish ChatGPT clone. The DPO model from Module 15.6 + quantization from 16.3 + vLLM serving from 16.2 + monitoring from 16.4 + a Next.js frontend + Vercel deployment. It is the curriculum's 7th production artifact: production grade and usable by real customers.

Frequently Asked Questions

**Metrics**: ~1-2% overhead (Prometheus scraping). Negligible.
**Logs**: 3-5% overhead (if logging every request). Reducible via sampling.
**Traces**: 5-10% overhead (full tracing). 1-5% sampling is recommended in production.

Total (full monitoring): ~10-15% overhead. Trade-off: you detect problems in **seconds instead of minutes**. Worth it.

**Optimization**:
- Metrics: collect everything (cheap)
- Logs: sampling (1/10 normal, 100/100 errors)
- Traces: sampling (1/100 normal, 100/100 errors)
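The sampling policy can be sketched as a deterministic decision per request: errors are always kept, and normal requests are kept when a hash of the request ID falls into a 1-in-N slot. The function name, the use of SHA-256, and the `request_id` format are assumptions for illustration; tracing libraries usually ship their own samplers.

```python
import hashlib

LOG_ONE_IN = 10     # keep 1/10 of normal request logs
TRACE_ONE_IN = 100  # keep 1/100 of normal request traces


def keep(request_id: str, is_error: bool, one_in: int) -> bool:
    """Decide whether to keep telemetry for a request.

    Errors are always kept. Normal requests are kept when the hash of
    the request ID is divisible by `one_in`, so the decision is stable
    across services that see the same ID.
    """
    if one_in <= 0:
        raise ValueError("one_in must be positive")
    if is_error:
        return True
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % one_in == 0


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    logged = sum(keep(r, False, LOG_ONE_IN) for r in ids)
    traced = sum(keep(r, False, TRACE_ONE_IN) for r in ids)
    print(f"logged {logged}, traced {traced} of {len(ids)} normal requests")
```

Because 100 is a multiple of 10, every traced request is also a logged one, which keeps logs and traces joinable for the sampled subset.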


Pillar topics this article maps to

Pillar Topic

LLMOps: Production-Grade LLM Operations

LLMOps is the engineering discipline that covers the development, deployment, monitoring, evaluation and cost management of LLM-powered applications — extending classic MLOps with prompt versioning, eval-driven CI and observability tailored for non-deterministic systems.