# AI Observability and LLM Monitoring Engineering Training (Langfuse + Phoenix + Helicone + Weave + Braintrust + LangSmith)

> Source: https://sukruyusufkaya.com/en/training/ai-observability-llm-monitoring-muhendisligi-egitimi
> Updated: 2026-05-19T15:34:36.171Z
> Level: advanced
> Topics: llm observability, ai observability, langfuse, arize phoenix, helicone, w&b weave, braintrust, langsmith, opentelemetry genai, openllmetry, openinference, litellm observability, llm tracing, llm monitoring, prompt management, llm cost tracking, llm eval framework, llm-as-judge, production llm, kvkk-compliant observability

**TLDR:** A 3-day advanced training, delivered in Turkish, that covers the observability discipline for production generative-AI and LLM applications end to end. It includes Langfuse, Arize Phoenix + AX, Helicone, Weights & Biases Weave, Braintrust, LangSmith, OpenTelemetry GenAI Semantic Conventions, OpenLLMetry, OpenInference, LiteLLM observability, KVKK-compliant PII redaction, eval-driven observability, cost + latency + quality monitoring, and production incident response.

## Description

The AI Observability and LLM Monitoring Engineering Training is a 3-day advanced program designed for ML Engineers, ML Platform Engineers, MLOps practitioners, Senior Backend Developers, and AI/LLM SREs who want to bring production generative-AI applications under a discipline of observability, measurement, evaluation, and incident response.

## Learning Outcomes

- Clearly frame how LLM observability differs from classical APM.
- Build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions.
- Make team-appropriate choices among Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith.
- Provide KVKK-compliant observability by setting up self-hosted Langfuse + Helicone + Phoenix deployments.
- Integrate the eval-driven observability discipline into the CI/CD pipeline.
- Build a cost + latency + quality three-dimensional monitoring dashboard.
- Continuously measure production quality with an LLM-as-judge eval framework.
- Manage production incidents with failed-trace analysis + RCA + blameless post-mortem.
- Set up PagerDuty + Slack alerting + on-call rotation + escalation policy.
- Address the special observability needs of reasoning models (o3/R1/Claude Extended Thinking) and agents.

<p>This training covers AI observability end to end, in Turkish: the discipline of observing generative-AI and LLM applications in production, measuring and evaluating them, and keeping them operationally sustainable. The 2024-2026 period saw the emergence of LLM observability platforms (Langfuse, Arize Phoenix, Helicone, W&B Weave, Braintrust, LangSmith) and their race to set the standard; in the same period, a vendor-agnostic trace standard took shape with the OpenTelemetry GenAI Semantic Conventions. In Turkey, training that covers this discipline end to end at the intersection of math, tool stack, production experience, and KVKK compliance is virtually nonexistent: existing content either stops at short single-tool tutorials or stays stuck in the APM perspective. This program is designed to fill that gap as Turkey's most comprehensive production-grade AI observability reference training.</p>

<p>The program's strategic backbone is the first module, which clarifies how LLM observability differs from the classical APM (Application Performance Monitoring) approach. It details why classical APM solutions such as Datadog, New Relic, and Dynatrace fall short on LLM applications, and covers the LLM-specific observability needs: non-deterministic, semantic output; hallucination; prompt drift; cost explosion; token-level cost attribution; RAG retrieval quality; and agent tool-selection accuracy. The module establishes the 4-pillar framework of generative-AI observability (trace + eval + cost + quality drift). The 2026 ecosystem map compares Langfuse (open-source, 13K+ GitHub stars), Arize Phoenix + AX (ML observability tradition), Helicone (proxy-based, YC W23), W&B Weave + Braintrust (eval-first), and LangSmith (LangChain native). The decision framework covers open-source vs SaaS vs enterprise hybrid, self-hosted Langfuse vs Helicone vs Phoenix, and selection from the KVKK + EU AI Act + GDPR compliance perspective.</p>

<p>The second module covers in detail the OpenTelemetry GenAI Semantic Conventions specification that shaped the AI observability standard in the 2024-2026 period. Topics include the gen_ai.* attribute namespace (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens), span events (gen_ai.content.prompt, gen_ai.content.completion), and metrics (the gen_ai.client.token.usage histogram); auto-instrumentation in Python + Node.js with the Traceloop OpenLLMetry SDK; Arize OpenInference wrappers for OpenAI / Anthropic / LlamaIndex / LangChain; and custom span addition and context-propagation patterns. Multi-backend routing with the OpenTelemetry Collector (Langfuse + Phoenix in parallel), sampling strategies (head sampling vs tail sampling, cost vs visibility trade-off), and a self-hosted OTLP gateway with KVKK-compliant PII redaction are practiced hands-on. Thanks to this standard, traces become portable across backends such as Langfuse, Phoenix, Helicone, and W&B Weave, which provides a strategic advantage against vendor lock-in.</p>
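
To make the attribute namespace and the two sampling strategies concrete, here is a minimal standard-library sketch. It does not use the OpenTelemetry SDK. The `LlmCall` record, the function names, and the 5-second slow-trace threshold are hypothetical choices for illustration. The attribute keys follow the gen_ai.* names listed above, plus `gen_ai.usage.output_tokens` from the same convention.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LlmCall:
    """Hypothetical record of one model call; not part of any SDK."""
    system: str
    model: str
    input_tokens: int
    output_tokens: int


def to_genai_attributes(call: LlmCall) -> dict[str, str | int]:
    """Map a call record onto gen_ai.* span attribute keys."""
    return {
        "gen_ai.system": call.system,
        "gen_ai.request.model": call.model,
        "gen_ai.usage.input_tokens": call.input_tokens,
        "gen_ai.usage.output_tokens": call.output_tokens,
    }


def head_sample(trace_id: int, ratio: float) -> bool:
    """Head sampling: decide at trace start, using only the trace id."""
    return trace_id % 10_000 < round(ratio * 10_000)


def tail_sample(had_error: bool, duration_s: float, slow_after_s: float = 5.0) -> bool:
    """Tail sampling: decide after the trace ends; keep errors and slow traces."""
    return had_error or duration_s >= slow_after_s
```

Head sampling is cheap because the decision needs no buffering, but it can drop the one trace that failed. Tail sampling keeps that trace, at the price of holding every trace until it completes.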

<p>The third module covers end to end Langfuse, the leading open-source LLM observability platform of the 2024-2026 period. On the SDK side: the Python SDK's @observe decorator and low-level SDK integration; Node.js + Java SDK + OpenTelemetry adapter usage; trace + span + generation + score hierarchy modeling. The prompt-management layer: prompt versioning + production label + A/B testing pipeline; dataset creation + ground truth + LLM-as-judge eval framework; custom evaluator (Python function) + scheduled eval runs. On the self-hosting side: Docker Compose + Kubernetes Helm chart deployment; PostgreSQL + ClickHouse + S3 storage architecture; PII redaction + masking + KVKK-compliant Turkish-data handling. Langfuse is the stack chosen by approximately 80% of enterprise AI teams in Turkey: open-source, flexible, on-premise deployable, and eval-first in philosophy.</p>
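
The PII redaction and masking step for Turkish data can be sketched as a plain function applied to text before it leaves the application. This is an illustrative sketch and not Langfuse's own masking API. The function names and placeholder tokens are hypothetical. The Turkish national ID (TCKN) check uses the publicly known checksum rule, so that arbitrary 11-digit numbers such as order IDs are left alone.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
ELEVEN_DIGITS_RE = re.compile(r"(?<![0-9])[0-9]{11}(?![0-9])")


def is_valid_tckn(candidate: str) -> bool:
    """Check the TCKN checksum: 11 ASCII digits, non-zero first digit."""
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate[0] == "0":
        return False
    d = [int(ch) for ch in candidate]
    odd_sum = d[0] + d[2] + d[4] + d[6] + d[8]    # 1st, 3rd, ... 9th digits
    even_sum = d[1] + d[3] + d[5] + d[7]          # 2nd, 4th, 6th, 8th digits
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def redact(text: str) -> str:
    """Replace e-mail addresses and checksum-valid TCKNs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return ELEVEN_DIGITS_RE.sub(
        lambda m: "[TCKN]" if is_valid_tckn(m.group()) else m.group(), text
    )
```

Regex-based redaction of this kind only catches structured identifiers. Names, addresses, and free-text health or financial details would need a separate detection step.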

<p>The fourth module covers in detail the 2024-2026 products of Arize, a company with an ML observability heritage. Phoenix (open-source, MIT licensed, shaping the OpenInference standard): Docker + local setup; OpenInference instrumentation (OpenAI, Anthropic, Bedrock, LlamaIndex, LangChain auto-tracing); span tree visualization + RAG retrieval debugging. Phoenix LLM Evals: built-in evaluators (hallucination, toxicity, relevance, QA correctness, code readability); custom evaluator + LLM-as-judge prompt templates; batched eval + analysis via the Phoenix dashboard. Production monitoring: embedding drift detection + UMAP visualization; RAG context relevance + retrieval quality monitoring; Arize AX SaaS enterprise scaling + multi-tenancy + RBAC. Phoenix's ML maturity in production embedding monitoring is the most important advantage it carries over to LLM observability, which makes it ideal for RAG-heavy teams.</p>
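
One simple way to think about embedding drift is to compare the centroid of a baseline set of embeddings with the centroid of recent production embeddings. The sketch below uses cosine distance between centroids. It is an assumption chosen for illustration and is not claimed to be the metric Phoenix or Arize AX computes. All function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def drift_score(baseline: Sequence[Vector], production: Sequence[Vector]) -> float:
    """Distance between the baseline centroid and the production centroid."""
    return cosine_distance(centroid(baseline), centroid(production))
```

A centroid comparison only detects a shift in the average direction of the data. A change in spread or the appearance of a new cluster can leave the centroid almost unchanged, which is one reason visual tools such as UMAP projections are used alongside a scalar score.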

<p>The fifth module covers in detail Helicone's (YC W23, open-source) differentiating proxy architecture. Tracing without SDK integration via a single base_url change (OpenAI / Anthropic / OpenRouter); async log ingestion + tagging via Helicone-Property headers; custom property + user-level cost attribution. Token usage + cost-tracking dashboard + budget alerts; the 30-50% cost-reduction recipe with semantic cache; rate limiting + retry logic + provider failover. Self-hosting: Helicone OSS Docker setup; sub-100ms overhead with Cloudflare Workers Edge deployment; Vault (API-key rotation + KVKK-compliant secret management). Ideal for fast-iteration teams preferring development speed + zero-config setup — especially Turkish startups.</p>
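
The proxy pattern described above amounts to changing the client's base URL and attaching a few headers. The sketch below only builds the configuration as a dictionary and makes no network call. The gateway URL and the `Helicone-Auth` and `Helicone-User-Id` header names are assumptions recalled from Helicone's public documentation and should be verified against the current docs. The helper name and its parameters are hypothetical.

```python
from typing import Mapping

# Assumption: Helicone's OpenAI-compatible gateway URL; verify before use.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_client_kwargs(
    provider_key: str,
    helicone_key: str,
    user_id: str,
    properties: Mapping[str, str],
) -> dict:
    """Build keyword arguments for an OpenAI-style client routed via the proxy."""
    headers = {
        "Helicone-Auth": f"Bearer {helicone_key}",  # assumed header name
        "Helicone-User-Id": user_id,                # assumed header name
    }
    # Custom properties become Helicone-Property-<Name> headers for tagging.
    for name, value in properties.items():
        headers[f"Helicone-Property-{name}"] = value
    return {
        "api_key": provider_key,
        "base_url": HELICONE_OPENAI_BASE_URL,
        "default_headers": headers,
    }
```

With the OpenAI Python client installed, these values would typically be passed as `OpenAI(**helicone_client_kwargs(...))`. Because only the base URL and headers change, the application code that issues completions stays untouched.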

<p>The sixth module covers in detail Weave (the W&B team's LLM-specific product, launched in 2024) and Braintrust (supported by Andrej Karpathy + Imbue team, eval-first paradigm). Weave: ML-experiment-tracking heritage + @weave.op() decorator auto-tracing + dataset versioning + interactive Jupyter / Colab integration + comparison view. Braintrust: offline + online eval with the braintrust SDK + Eval() function; the AutoEvals library's built-in LLM-as-judge prompts; production span analysis + prompt playground. Eval-first philosophy: the 'regression test on every PR' approach; CI/CD pipeline integration with prompt-change gating. A detailed decision matrix shows which teams should prefer Weave/Braintrust over Langfuse/Phoenix.</p>
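
Prompt-change gating reduces to a comparison between eval scores for the current prompt and for the candidate prompt on the same dataset. The sketch below is tool-agnostic and does not call the Braintrust or Weave SDKs. The function name, the result shape, and the 0.02 tolerance are hypothetical.

```python
from statistics import mean
from typing import Sequence


def gate_prompt_change(
    baseline_scores: Sequence[float],
    candidate_scores: Sequence[float],
    max_drop: float = 0.02,
) -> dict:
    """Pass the gate unless the candidate's mean score drops by more than max_drop."""
    baseline_mean = mean(baseline_scores)
    candidate_mean = mean(candidate_scores)
    drop = baseline_mean - candidate_mean
    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "drop": drop,
        "passed": drop <= max_drop,
    }
```

In a CI job, a failed gate would usually end the step with a non-zero exit code so that the pull request cannot merge. A mean over a small dataset is noisy, so the tolerance has to be chosen with the dataset size in mind.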

<p>The seventh module addresses LangSmith, the LangChain team's commercial observability product (Plus $39/month, Enterprise SaaS + on-prem). LangChain / LangGraph native integration; zero-config tracing via LANGSMITH_TRACING=true; LangGraph + LangChain Runnable hierarchy trace visualization; production debugging with run metadata + custom tags. Dataset upload + ground truth + golden-answer management; built-in evaluators (correctness, conciseness, helpfulness); experiment compare view + A/B prompt regression test. Prompt Hub (shared prompt registry + versioning); self-hosted LangSmith (on-prem) Kubernetes deployment; enterprise tier SOC2 + RBAC + audit logging. The lowest-friction choice for teams using the LangChain / LangGraph ecosystem.</p>

<p>The eighth module gives a mathematical treatment of the foundational data model of LLM observability. Trace (user session) → root span (request) → child span (LLM call + tool call + retriever call + nested chain) → event hierarchy; span types (LLM call, tool call, retriever, custom function); distributed tracing with context propagation across microservices. LLM-specific metrics: TTFT (Time To First Token, the critical metric for streaming UX), TPOT (Time Per Output Token, a throughput measure), and the prompt + completion + cached + reasoning token breakdown (which matters for reasoning-model billing). Cost calculation: model price table + dynamic price computation (up-to-date OpenAI/Anthropic/Gemini pricing); per-user + per-feature + per-endpoint cost attribution. Quality metrics: groundedness, faithfulness, and relevance, implemented with LLM-as-judge. Agent-specific metrics: tool-selection accuracy, planning depth, max-iterations breach rate.</p>
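
The latency and cost metrics named here can be computed from three timestamps and two token counts. The sketch below assumes a common definition of TPOT, the time after the first token divided by the remaining output tokens. The `Generation` record, the model name, and the prices in the table are hypothetical placeholders and not real provider pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Generation:
    """Hypothetical timing and token record for one streamed LLM call."""
    request_start_s: float
    first_token_s: float
    last_token_s: float
    input_tokens: int
    output_tokens: int


def ttft(g: Generation) -> float:
    """Time To First Token in seconds."""
    return g.first_token_s - g.request_start_s


def tpot(g: Generation) -> float:
    """Time Per Output Token: decode time spread over tokens after the first."""
    if g.output_tokens <= 1:
        return 0.0
    return (g.last_token_s - g.first_token_s) / (g.output_tokens - 1)


# Hypothetical price table: USD per one million (input, output) tokens.
PRICE_PER_MILLION_USD = {"example-model": (2.0, 8.0)}


def cost_usd(model: str, g: Generation) -> float:
    """Token-level cost for one call from the price table."""
    input_price, output_price = PRICE_PER_MILLION_USD[model]
    return (g.input_tokens * input_price + g.output_tokens * output_price) / 1_000_000
```

Per-user or per-feature attribution then becomes a matter of summing `cost_usd` over calls grouped by the relevant trace metadata. Cached and reasoning tokens would need their own price columns if the provider bills them separately.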

<p>The ninth module is dedicated to the eval-driven observability discipline at the heart of systematically monitoring LLM quality in production. Offline eval pipeline: regression eval on prompt changes in CI/CD; GitHub Actions + Langfuse / Braintrust eval integration; golden-dataset versioning + drift detection. Online eval + user feedback: continuous LLM-as-judge scoring of production traces; thumbs up/down + structured feedback + NPS collection; the user feedback → dataset → eval improvement loop. LLM-as-judge discipline: judge prompt design + bias mitigation (position bias, length bias, verbosity bias); pairwise comparison + reference-based + reference-free judge; multi-judge ensemble + human-judge agreement validation. With this discipline, production quality regressions feed back to CI/CD.</p>
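
Position bias in pairwise LLM-as-judge comparison is commonly mitigated by asking the judge twice with the answer order swapped and accepting a winner only when both verdicts agree. The sketch below shows that logic with the judge passed in as a plain function, so no model call is made. The `Judge` signature and the verdict strings "first" and "second" are hypothetical.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer)
# returns "first" or "second" for whichever shown answer it prefers.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)    # A is shown first
    backward = judge(question, answer_b, answer_a)   # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    # The verdict flipped with the order, so it reflects position, not quality.
    return "tie"
```

A judge that always prefers whichever answer is shown first produces a tie under this scheme, which is the intended outcome. Swapping does not address length or verbosity bias, which need separate controls such as length-normalized rubrics.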

<p>The tenth module addresses the mandatory three-dimensional monitoring discipline for the economic and operational sustainability of production LLM applications. Cost monitoring: token-usage trend + model distribution + per-endpoint breakdown; user-level cost attribution + per-tenant budgeting; semantic cache hit-rate + cost-reduction effectiveness. Latency + SLO/SLI: P50/P95/P99 TTFT + TPOT histograms; SLO/SLI definition ('P95 TTFT < 1.5s, success rate > 99.5%'); error budget + alerting threshold management. Quality monitoring: hallucination rate + sycophancy drift + refusal rate tracking; Grafana dashboard + Prometheus metrics integration; Datadog LLM Observability + New Relic AI Monitoring overview. These three dimensions together provide production sustainability for enterprise AI applications.</p>
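
The example SLO quoted above (P95 TTFT below 1.5 s, success rate above 99.5%) can be checked with a few lines of arithmetic. The sketch below uses the nearest-rank percentile definition. Monitoring backends may interpolate differently. The function names and report keys are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(
    ttft_seconds: Sequence[float],
    failures: int,
    total_requests: int,
    ttft_p95_target_s: float = 1.5,
    success_target: float = 0.995,
) -> dict:
    """Evaluate the latency and success SLOs and the remaining error budget."""
    p95 = percentile(ttft_seconds, 95)
    success_rate = 1 - failures / total_requests
    allowed_failures = (1 - success_target) * total_requests
    return {
        "p95_ttft_s": p95,
        "latency_ok": p95 < ttft_p95_target_s,
        "success_ok": success_rate > success_target,
        # Fraction of the error budget still unspent; negative means overspent.
        "error_budget_remaining": 1 - failures / allowed_failures,
    }
```

With 1,000 requests and a 99.5% target, the budget is five failed requests. Two failures leave 60% of the budget, and alert thresholds are often set on how quickly that fraction shrinks instead of on individual errors.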

<p>The eleventh module focuses on the moment AI observability proves its value in practice: production incident debugging and resolution. Failed-trace analysis: error spans, retry chain, timeout breakdown; provider outage handling (OpenAI 5XX storm, Anthropic capacity throttling, Gemini RPC errors); agent infinite loop + max-iteration safeguard pattern. Alerting + on-call: PagerDuty + Slack + Discord alerting integration; threshold tuning + alert-fatigue prevention; on-call rotation + escalation policy + runbook preparation. RCA + post-mortem: root cause analysis with 5-Whys + Ishikawa diagrams; blameless post-mortem template + action-item tracking; Linear / Jira ticket integration + incident retrospective. The operational maturity of AI systems depends on the rigor of this discipline.</p>
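
The max-iteration safeguard against agent infinite loops can be sketched as a bounded loop that raises a dedicated error when the budget is exhausted. The `step` callback contract, the exception name, and the default limit of 8 are hypothetical. A real agent framework would expose its own equivalent setting.

```python
from typing import Any, Callable, Tuple


class MaxIterationsExceeded(RuntimeError):
    """Raised when the agent does not finish within its iteration budget."""


def run_agent(
    step: Callable[[int], Tuple[bool, Any]],
    max_iterations: int = 8,
) -> Tuple[Any, int]:
    """Call step(iteration) until it reports done, or fail loudly at the limit.

    step returns (done, output). The result is (output, iterations_used).
    """
    for iteration in range(1, max_iterations + 1):
        done, output = step(iteration)
        if done:
            return output, iteration
    raise MaxIterationsExceeded(
        f"agent did not finish within {max_iterations} iterations"
    )
```

Raising a distinct exception type lets the tracing layer mark the span as a budget breach, which is what feeds a max-iterations breach rate metric and separates runaway loops from ordinary tool errors.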

<p>In the capstone module, each participant designs an end-to-end AI observability stack tailored to their own production scenario: provider selection (Langfuse self-hosted, Phoenix, Helicone, Weave, Braintrust, LangSmith), integration approach (OpenTelemetry GenAI vs native SDK), eval framework (offline + online), cost + latency + quality monitoring dashboard, alerting + on-call setup, KVKK-compliant PII redaction, 90-day production roadmap. By the end of the training, participants reach a level of technical competence to clearly frame how LLM observability differs from classical APM; build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions; make team-appropriate choices among Langfuse / Phoenix / Helicone / Weave / Braintrust / LangSmith; integrate the eval-driven observability discipline into the CI/CD pipeline; build a cost + latency + quality three-dimensional monitoring dashboard; manage production incidents with failed-trace analysis + RCA + post-mortem framework; and build a KVKK + EU AI Act + GDPR-compliant Turkish-data handling pipeline. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.</p>