Advanced Level | 3 Days

AI Observability and LLM Monitoring Engineering Training (Langfuse + Phoenix + Helicone + Weave + Braintrust + LangSmith)

A 3-day advanced training, delivered in Turkish, that covers the observability of production generative-AI and LLM applications end to end. Topics include Langfuse, Arize Phoenix + AX, Helicone, Weights & Biases Weave, Braintrust, LangSmith, OpenTelemetry GenAI Semantic Conventions, OpenLLMetry, OpenInference, LiteLLM observability, KVKK-compliant PII redaction, eval-driven observability, cost + latency + quality monitoring, and production incident response.

About This Course

This training covers AI observability end to end, in Turkish: the discipline of observing, measuring, and evaluating generative-AI and LLM applications in production and keeping them operationally sustainable. The 2024-2026 period saw the emergence of LLM observability platforms (Langfuse, Arize Phoenix, Helicone, W&B Weave, Braintrust, LangSmith) and their race to set the standard; in the same period, a vendor-agnostic trace standard took shape with the OpenTelemetry GenAI Semantic Conventions. In Turkey, training that covers this discipline across mathematics, tool stack, production experience, and KVKK compliance is virtually nonexistent: existing content is either a short single-tool tutorial or stays within the classical APM perspective. This program is designed to fill that gap as Turkey's most comprehensive production-grade AI observability reference training.



The program's strategic backbone is the first module, which clarifies how LLM observability differs from classical APM (Application Performance Monitoring). It explains why classical APM solutions such as Datadog, New Relic, and Dynatrace fall short for LLM applications, and covers LLM-specific observability needs: non-deterministic, semantic output; hallucination; prompt drift; cost explosion; token-level cost attribution; RAG retrieval quality; and agent tool-selection accuracy. The module establishes the four-pillar framework of generative-AI observability (trace + eval + cost + quality drift). The 2026 ecosystem map compares Langfuse (open-source, 13K+ GitHub stars), Arize Phoenix + AX (ML observability tradition), Helicone (proxy-based, YC W23), W&B Weave + Braintrust (eval-first), and LangSmith (LangChain native). The module closes with a decision framework: open-source vs SaaS vs enterprise hybrid; self-hosted Langfuse vs Helicone vs Phoenix; and selection from the KVKK + EU AI Act + GDPR compliance perspective.



The second module covers in detail the OpenTelemetry GenAI Semantic Conventions specification, which shaped the AI observability standard in the 2024-2026 period. Topics include the gen_ai.* attribute namespace (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens), span events (gen_ai.content.prompt, gen_ai.content.completion), and metrics (the gen_ai.client.token.usage histogram); auto-instrumentation in Python and Node.js with the Traceloop OpenLLMetry SDK; Arize OpenInference wrappers for OpenAI, Anthropic, LlamaIndex, and LangChain; and patterns for adding custom spans and propagating context. Hands-on work covers multi-backend routing with the OpenTelemetry Collector (Langfuse + Phoenix in parallel), sampling strategies (head vs tail sampling, and the cost vs visibility trade-off), and a self-hosted OTLP gateway with KVKK-compliant PII redaction. Thanks to this standard, traces become portable across backends such as Langfuse, Phoenix, Helicone, and W&B Weave, which gives teams a strategic defense against vendor lock-in.
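
To make the head vs tail sampling trade-off concrete, here is a minimal standard-library sketch. It does not use the OpenTelemetry SDK: spans are plain dictionaries, and the helper names (make_llm_span, head_sample, tail_sample), the slow_s threshold, and the model name are all hypothetical. The attribute keys follow the gen_ai.* names listed above; gen_ai.usage.output_tokens is added as the counterpart of the input-token attribute.

```python
import hashlib
from typing import Any, Dict, List

Span = Dict[str, Any]


def make_llm_span(model: str, input_tokens: int, output_tokens: int,
                  duration_s: float, error: bool = False) -> Span:
    """Plain-dict stand-in for a span; keys follow the gen_ai.* attribute names."""
    return {
        "attributes": {
            "gen_ai.system": "openai",  # illustrative provider value
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
        "duration_s": duration_s,
        "error": error,
    }


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head sampling: decide from the trace ID alone, before any span is recorded."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return bucket < ratio


def tail_sample(trace_id: str, spans: List[Span], ratio: float,
                slow_s: float = 5.0) -> bool:
    """Tail sampling: decide after the trace is complete.

    Failed and slow traces are always kept; the rest fall back to the
    same ratio-based decision as head sampling.
    """
    if any(span["error"] for span in spans):
        return True
    if max((span["duration_s"] for span in spans), default=0.0) >= slow_s:
        return True
    return head_sample(trace_id, ratio)
```

Head sampling is cheap because the decision needs no span data, but it can drop the one trace that failed. Tail sampling keeps those traces at the price of buffering every trace until it completes.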



The third module covers Langfuse, the leading open-source LLM observability platform of the 2024-2026 period, end to end. It starts with the Python SDK (the @observe decorator and low-level SDK integration), the Node.js and Java SDKs, the OpenTelemetry adapter, and modeling of the trace + span + generation + score hierarchy. The prompt-management layer follows: prompt versioning, production labels, and an A/B testing pipeline; dataset creation with ground truth and an LLM-as-judge eval framework; and custom evaluators (Python functions) with scheduled eval runs. The self-hosting part covers Docker Compose and Kubernetes Helm chart deployment, the PostgreSQL + ClickHouse + S3 storage architecture, and PII redaction, masking, and KVKK-compliant handling of Turkish data. This is the stack chosen by approximately 80% of enterprise AI teams in Turkey: open-source, flexible, deployable on-premise, and eval-first in philosophy.
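
As a rough illustration of the Turkish PII masking step, the sketch below redacts e-mail addresses and Turkish national ID numbers (TCKN) before a trace payload is stored. It is a standalone standard-library example, not Langfuse's masking API. The function names and placeholder tokens are hypothetical, and the e-mail pattern is deliberately simplistic. The TCKN check uses the publicly documented checksum rule so that arbitrary 11-digit numbers such as order IDs are left alone.

```python
import re

# Illustrative patterns only; a production redactor needs a broader rule set.
_TCKN_RE = re.compile(r"(?<![0-9])[1-9][0-9]{10}(?![0-9])")
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_tckn(candidate: str) -> bool:
    """Check the two checksum digits of an 11-digit Turkish national ID."""
    if len(candidate) != 11 or not (candidate.isascii() and candidate.isdigit()):
        return False
    if candidate[0] == "0":
        return False
    d = [int(c) for c in candidate]
    odd_sum = d[0] + d[2] + d[4] + d[6] + d[8]   # digits 1, 3, 5, 7, 9
    even_sum = d[1] + d[3] + d[5] + d[7]         # digits 2, 4, 6, 8
    tenth = (odd_sum * 7 - even_sum) % 10
    eleventh = sum(d[:10]) % 10
    return d[9] == tenth and d[10] == eleventh


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and checksum-valid TCKNs with placeholders."""
    text = _EMAIL_RE.sub("[EMAIL]", text)
    return _TCKN_RE.sub(
        lambda m: "[TCKN]" if is_valid_tckn(m.group()) else m.group(), text
    )


print(redact_pii("TCKN 10000000146, mail ali@example.com"))
```

Running the redaction on the client, before the payload leaves the application, keeps raw identifiers out of the observability backend entirely.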



The fourth module covers in detail the 2024-2026 versions of the Arize products, which come from an ML observability heritage. For Phoenix (open-source, released under the Elastic License 2.0, and shaping the OpenInference standard) it covers Docker and local setup; OpenInference instrumentation with auto-tracing for OpenAI, Anthropic, Bedrock, LlamaIndex, and LangChain; and span tree visualization with RAG retrieval debugging. Phoenix LLM Evals come next: built-in evaluators (hallucination, toxicity, relevance, QA correctness, code readability), custom evaluators and LLM-as-judge prompt templates, and batched evals analyzed in the Phoenix dashboard. The module closes with production embedding drift detection and UMAP visualization, RAG context relevance and retrieval quality monitoring, and enterprise scaling with Arize AX SaaS (multi-tenancy + RBAC). Phoenix's maturity in production embedding monitoring is the most important advantage it carries over to LLM observability, which makes it well suited to RAG-heavy teams.
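
The idea behind embedding drift detection can be sketched with a very simple stand-in metric: the cosine distance between the centroid of a baseline embedding set and the centroid of recent production embeddings. This is not how Phoenix computes drift. The function names are hypothetical, and the alert threshold is left to the reader.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline: Sequence[Vector], production: Sequence[Vector]) -> float:
    """Distance between the baseline centroid and the production centroid."""
    return cosine_distance(centroid(baseline), centroid(production))


baseline = [[1.0, 0.0], [0.9, 0.1]]
production = [[0.2, 0.8], [0.1, 0.9]]
print(round(embedding_drift(baseline, production), 3))
```

A centroid comparison only catches shifts in the mean; changes in spread or the appearance of new clusters need distribution-level measures, which is where a visual tool such as UMAP helps.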



The fifth module covers in detail the proxy architecture that sets Helicone (YC W23, open-source) apart. Tracing requires no SDK integration, only a single base_url change (OpenAI / Anthropic / OpenRouter); the module also covers async log ingestion, tagging via Helicone-Property headers, and custom properties with user-level cost attribution. On the cost side it covers the token usage and cost-tracking dashboard, budget alerts, the 30-50% cost-reduction recipe with semantic caching, and rate limiting, retry logic, and provider failover. Self-hosting covers the Helicone OSS Docker setup, Cloudflare Workers edge deployment with sub-100ms overhead, and Vault (API-key rotation + KVKK-compliant secret management). Helicone suits fast-iterating teams that value development speed and zero-config setup, especially Turkish startups.
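
The proxy approach amounts to pointing the LLM client at a different base URL and attaching a few headers. The sketch below only builds that configuration and makes no network call. The proxy URL and the Helicone-Auth, Helicone-User-Id, and Helicone-Property-* header names reflect our understanding of Helicone's public documentation and should be checked against the current docs. The helper name and the example values are hypothetical.

```python
from typing import Dict, Optional

# Assumed OpenAI-compatible proxy endpoint; verify against Helicone's docs.
HELICONE_OPENAI_BASE_URL = "https://oai.helicone.ai/v1"


def helicone_headers(
    helicone_api_key: str,
    user_id: Optional[str] = None,
    properties: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build the extra HTTP headers that tag a proxied request."""
    headers = {"Helicone-Auth": f"Bearer {helicone_api_key}"}
    if user_id:
        # Enables per-user cost attribution.
        headers["Helicone-User-Id"] = user_id
    for name, value in (properties or {}).items():
        # Custom properties become filterable dimensions.
        headers[f"Helicone-Property-{name}"] = str(value)
    return headers


headers = helicone_headers(
    "sk-helicone-example",  # placeholder, not a real key
    user_id="user-42",
    properties={"Feature": "summarize", "Tenant": "acme"},
)
# With the OpenAI Python client, these would typically be passed as
# OpenAI(base_url=HELICONE_OPENAI_BASE_URL, default_headers=headers).
print(headers)
```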



The sixth module covers in detail Weave (the W&B team's LLM-specific product, launched in 2024) and Braintrust (supported by Andrej Karpathy and the Imbue team, with an eval-first paradigm). Weave: ML-experiment-tracking heritage, auto-tracing with the @weave.op() decorator, dataset versioning, interactive Jupyter / Colab integration, and a comparison view. Braintrust: offline and online evals with the braintrust SDK and its Eval() function; the built-in LLM-as-judge prompts of the AutoEvals library; and production span analysis with the prompt playground. The eval-first philosophy means a 'regression test on every PR' approach and CI/CD pipeline integration that gates prompt changes. A detailed decision matrix shows which teams should prefer Weave/Braintrust over Langfuse/Phoenix.
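
The 'regression test on every PR' idea reduces to comparing the eval scores of a candidate prompt against a stored baseline and failing the build on a drop. The following is a tool-agnostic sketch, not the Braintrust or Weave API. The function name, metric names, and the 0.02 tolerance are hypothetical.

```python
from typing import Dict, List


def regression_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return one message per metric that regressed beyond the tolerance.

    An empty list means the prompt change may be merged.
    """
    failures = []
    for metric, base_score in sorted(baseline.items()):
        new_score = candidate.get(metric)
        if new_score is None:
            failures.append(f"{metric}: missing in candidate run")
        elif new_score < base_score - tolerance:
            failures.append(f"{metric}: {base_score:.3f} -> {new_score:.3f}")
    return failures


baseline_scores = {"faithfulness": 0.90, "relevance": 0.85}
candidate_scores = {"faithfulness": 0.91, "relevance": 0.80}
problems = regression_gate(baseline_scores, candidate_scores)
print(problems)
# In a CI job, a non-empty list would translate into a non-zero exit code,
# for example: sys.exit(1 if problems else 0)
```

The tolerance matters because LLM-as-judge scores are noisy; without it, the gate would block merges on run-to-run variation alone.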



The seventh module addresses LangSmith, the LangChain team's commercial observability product (Plus $39/month, Enterprise SaaS + on-prem). It covers native LangChain / LangGraph integration; zero-config tracing via LANGSMITH_TRACING=true; trace visualization of the LangGraph + LangChain Runnable hierarchy; and production debugging with run metadata and custom tags. Evaluation topics include dataset upload with ground truth and golden-answer management, built-in evaluators (correctness, conciseness, helpfulness), and the experiment comparison view with A/B prompt regression tests. The module also covers Prompt Hub (shared prompt registry + versioning), self-hosted (on-prem) LangSmith deployment on Kubernetes, and the enterprise tier's SOC2 + RBAC + audit logging. LangSmith is the lowest-friction choice for teams using the LangChain / LangGraph ecosystem.



The eighth module treats the foundational data model of LLM observability mathematically. It covers the hierarchy trace (user session) → root span (request) → child span (LLM call + tool call + retriever call + nested chain) → event; span types (LLM call, tool call, retriever, custom function); and distributed tracing with context propagation across microservices. LLM-specific metrics include TTFT (Time To First Token, a critical metric for streaming UX), TPOT (Time Per Output Token, a throughput measure), and the prompt + completion + cached + reasoning token breakdown, which matters for reasoning-model billing. Cost calculation covers a model price table with dynamic price computation (up-to-date OpenAI / Anthropic / Gemini pricing) and per-user, per-feature, and per-endpoint cost attribution. Quality metrics cover LLM-as-judge implementations of groundedness, faithfulness, and relevance. Agent-specific metrics cover tool-selection accuracy, planning depth, and max-iterations breach rate.
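
The latency and cost metrics above follow directly from three timestamps and two token counts. Below is a minimal sketch under stated assumptions: the LLMCall record, the function names, the model name, and above all the prices in the table are hypothetical placeholders rather than real provider pricing. TPOT is computed here as the time between the first and last token divided by the number of remaining tokens, which is one common convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMCall:
    request_ts: float      # when the request was sent (seconds)
    first_token_ts: float  # when the first streamed token arrived
    last_token_ts: float   # when the final token arrived
    input_tokens: int
    output_tokens: int


def ttft(call: LLMCall) -> float:
    """Time To First Token, in seconds."""
    return call.first_token_ts - call.request_ts


def tpot(call: LLMCall) -> float:
    """Time Per Output Token: decode time spread over the tokens after the first."""
    if call.output_tokens <= 1:
        return 0.0
    return (call.last_token_ts - call.first_token_ts) / (call.output_tokens - 1)


# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICE_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def cost_usd(call: LLMCall, model: str) -> float:
    """Token counts multiplied by the per-million-token price of the model."""
    price = PRICE_PER_MTOK[model]
    total = call.input_tokens * price["input"] + call.output_tokens * price["output"]
    return total / 1_000_000


call = LLMCall(request_ts=0.0, first_token_ts=0.4, last_token_ts=2.4,
               input_tokens=1000, output_tokens=101)
print(ttft(call), tpot(call), cost_usd(call, "example-model"))
```

Per-user or per-feature attribution is then a matter of summing cost_usd over calls grouped by the relevant tag. Cached and reasoning tokens would need their own price entries.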



The ninth module is dedicated to eval-driven observability, the discipline at the heart of systematically monitoring LLM quality in production. The offline eval pipeline covers regression evals on prompt changes in CI/CD, GitHub Actions integration with Langfuse / Braintrust evals, and golden-dataset versioning with drift detection. Online eval and user feedback cover continuous LLM-as-judge scoring of production traces; collection of thumbs up/down, structured feedback, and NPS; and the user feedback → dataset → eval improvement loop. The LLM-as-judge discipline covers judge prompt design and bias mitigation (position bias, length bias, verbosity bias); pairwise, reference-based, and reference-free judging; and multi-judge ensembles validated against human-judge agreement. With this discipline, production quality regressions feed back into CI/CD.
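
One widely used mitigation for position bias in pairwise judging is to ask the judge twice with the answer order swapped and accept a winner only when both verdicts agree. The sketch below shows that control flow with stub judges in place of a real LLM call. The Judge signature, the verdict strings, and the function names are hypothetical.

```python
from typing import Callable

# A judge sees (question, first_answer, second_answer) and returns
# "first" or "second". A real judge would call an LLM here.
Judge = Callable[[str, str, str], str]


def debiased_pairwise(judge: Judge, question: str,
                      answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both presentation orders."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    for verdict in (forward, backward):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
    forward_winner = "A" if forward == "first" else "B"
    backward_winner = "B" if backward == "first" else "A"
    # Disagreement means the verdict depended on position, not content.
    return forward_winner if forward_winner == backward_winner else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Stub judge that is position-independent (but length-biased)."""
    return "first" if len(first) >= len(second) else "second"


print(debiased_pairwise(always_first, "q", "short", "a much longer answer"))
print(debiased_pairwise(prefers_longer, "q", "short", "a much longer answer"))
```

The second stub shows the limit of the technique: swapping the order neutralizes position bias but does nothing about length or verbosity bias, which need separate controls.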



The tenth module addresses three-dimensional monitoring, which production LLM applications need for economic and operational sustainability. Cost monitoring: token-usage trends, model distribution, and per-endpoint breakdown; user-level cost attribution and per-tenant budgeting; and semantic cache hit rate and cost-reduction effectiveness. Latency and SLO/SLI: P50/P95/P99 TTFT and TPOT histograms; SLO/SLI definitions (for example, 'P95 TTFT < 1.5s, success rate > 99.5%'); and error budget and alerting threshold management. Quality monitoring: tracking of hallucination rate, sycophancy drift, and refusal rate; Grafana dashboards with Prometheus metrics integration; and an overview of Datadog LLM Observability and New Relic AI Monitoring. Together, these three dimensions make enterprise AI applications sustainable in production.
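
The example SLO quoted above ('P95 TTFT < 1.5s, success rate > 99.5%') can be checked with a few lines of arithmetic. The sketch below uses a nearest-rank percentile and reports how much of the error budget has been consumed. The function names and report keys are hypothetical; a real setup would read these values from a metrics backend rather than from in-memory lists.

```python
import math
from typing import Dict, Sequence, Union


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_report(
    ttft_seconds: Sequence[float],
    successes: int,
    total: int,
    ttft_slo_s: float = 1.5,
    success_slo: float = 0.995,
) -> Dict[str, Union[float, bool]]:
    """Evaluate the latency and success-rate SLOs for one time window."""
    p95 = percentile(ttft_seconds, 95)
    success_rate = successes / total
    allowed_failures = (1.0 - success_slo) * total  # the window's error budget
    return {
        "p95_ttft_s": p95,
        "latency_ok": p95 < ttft_slo_s,
        "success_rate": success_rate,
        "availability_ok": success_rate > success_slo,
        "error_budget_used": (total - successes) / allowed_failures,
    }


window = slo_report([0.2] * 19 + [3.0], successes=997, total=1000)
print(window)
```

An error_budget_used value above 1.0 means the window has already spent more failures than the SLO allows, which is a natural trigger for an alert.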



The eleventh module focuses on the moment AI observability is put to real use: debugging and resolving production incidents. Failed-trace analysis covers error spans, retry chains, and timeout breakdowns; provider outage handling (OpenAI 5XX storms, Anthropic capacity throttling, Gemini RPC errors); and the agent infinite-loop problem with the max-iteration safeguard pattern. Alerting and on-call cover PagerDuty + Slack + Discord alerting integration, threshold tuning and alert-fatigue prevention, and on-call rotation, escalation policy, and runbook preparation. RCA and post-mortem cover root cause analysis with 5-Whys and Ishikawa diagrams, a blameless post-mortem template with action-item tracking, and Linear / Jira ticket integration with incident retrospectives. The operational maturity of AI systems depends on the rigor of this discipline.
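
The max-iteration safeguard is the simplest defense against an agent that never reaches a final answer: cap the loop and raise a distinct error that can be counted as a metric and alerted on. The sketch below shows the pattern with a stub step function in place of a real agent. The exception name, the run_agent signature, and the default limit of 8 are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


class MaxIterationsExceeded(RuntimeError):
    """Raised when the agent loop hits its iteration cap without an answer."""


def run_agent(
    step: Callable[[List[str]], Optional[str]],
    max_iterations: int = 8,
) -> Tuple[str, int]:
    """Call `step` until it returns a final answer or the cap is reached.

    `step` receives the history so far and returns None to keep going.
    Returns the answer together with the number of iterations used.
    """
    history: List[str] = []
    for iteration in range(1, max_iterations + 1):
        result = step(history)
        if result is not None:
            return result, iteration
        history.append(f"iteration {iteration}: no final answer")
    raise MaxIterationsExceeded(
        f"no final answer after {max_iterations} iterations"
    )


def answers_on_third_call(history: List[str]) -> Optional[str]:
    """Stub agent step that finishes once two iterations are in the history."""
    return "done" if len(history) >= 2 else None


print(run_agent(answers_on_third_call))
```

Counting how often MaxIterationsExceeded is raised per window gives the max-iterations breach rate, and recording the iteration count on the trace makes the runaway cases easy to find.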



In the capstone module, each participant designs an end-to-end AI observability stack tailored to their own production scenario: provider selection (self-hosted Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith), integration approach (OpenTelemetry GenAI vs native SDK), eval framework (offline + online), a cost + latency + quality monitoring dashboard, alerting and on-call setup, KVKK-compliant PII redaction, and a 90-day production roadmap. By the end of the training, participants have the technical competence to clearly frame how LLM observability differs from classical APM; build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions; make team-appropriate choices among Langfuse / Phoenix / Helicone / Weave / Braintrust / LangSmith; integrate eval-driven observability into the CI/CD pipeline; build a three-dimensional cost + latency + quality monitoring dashboard; manage production incidents with failed-trace analysis, RCA, and a post-mortem framework; and build a KVKK + EU AI Act + GDPR-compliant pipeline for handling Turkish data. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.

Training Methodology

The only production-grade advanced program in Turkey that addresses AI observability and LLM monitoring end to end in Turkish

Six-platform comparison: Langfuse + Arize Phoenix + Helicone + W&B Weave + Braintrust + LangSmith

OpenTelemetry GenAI Semantic Conventions + OpenLLMetry + OpenInference vendor-agnostic standard

Eval-driven observability (offline + online + LLM-as-judge + user feedback) discipline

Mathematical construction of trace + span + token + cost + quality + agent metric anatomy

Cost monitoring + latency SLO/SLI + quality drift detection three-dimensional monitoring

Production incident debugging + PagerDuty alerting + RCA + blameless post-mortem framework

KVKK + EU AI Act + GDPR-compliant Turkish PII redaction + self-hosted observability deployment

Who Is This For?

ML Engineers and ML Platform Engineers who want to tie production LLM applications to the observability and monitoring discipline
Engineers seeking to bring MLOps + LLMOps maturity to teams scaling enterprise LLM products
Senior backend developers establishing cost + latency + quality SLO/SLI discipline for AI-powered SaaS products
AI/LLM SREs responsible for on-call rotation + production incident response
Enterprise AI compliance teams that need to build a KVKK + EU AI Act + GDPR-compliant Turkish AI observability stack
AI engineers who want to systematically monitor quality drift and hallucination in RAG + agent + reasoning-model deployments

Why This Course?

1. The only advanced, production-grade program in Turkey that covers the AI observability discipline end to end in Turkish.
2. Builds the skill of choosing the right platform through a six-platform comparison: Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith.
3. Teaches vendor-agnostic standardization with OpenTelemetry GenAI Semantic Conventions.
4. Feeds production quality regressions back into CI/CD with eval-driven observability (offline + online + LLM-as-judge).
5. Covers three-dimensional cost + latency + quality monitoring with Grafana / Prometheus / Datadog integration.
6. Establishes operational maturity with production incident debugging, RCA, and a blameless post-mortem framework.
7. Teaches KVKK + EU AI Act + GDPR-compliant Turkish PII redaction and self-hosted deployment.
8. Completes the six-training production-grade LLM engineering frontier set: RLHF + Reasoning + Mech Interp + CPT + Quantization + Observability.

Learning Outcomes

Clearly frame how LLM observability differs from classical APM.
Build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions.
Make team-appropriate choices among Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith.
Provide KVKK-compliant observability by setting up self-hosted Langfuse + Helicone + Phoenix deployments.
Integrate the eval-driven observability discipline into the CI/CD pipeline.
Build a cost + latency + quality three-dimensional monitoring dashboard.
Continuously measure production quality with an LLM-as-judge eval framework.
Manage production incidents with failed-trace analysis + RCA + blameless post-mortem.
Set up PagerDuty + Slack alerting + on-call rotation + escalation policy.
Address the special observability needs of reasoning models (o3/R1/Claude Extended Thinking) and agents.

Requirements

Active Python or Node.js experience (intermediate to advanced), REST API + JSON experience
Basic experience using LLM APIs (OpenAI, Anthropic, Google, or self-hosted)
Docker + Docker Compose + basic Kubernetes knowledge (for self-hosted deployment)
Basic experience with PostgreSQL or ClickHouse + log analysis
OpenTelemetry foundations (recommended; also covered in the training)
Langfuse, Phoenix, Helicone, and LangSmith accounts (free tier), created before the training

Course Curriculum

104 Lessons
01 Module 1: Strategic Introduction to LLM Observability - The Difference from Classical APM (9 Lessons)
02 Module 2: OpenTelemetry GenAI Semantic Conventions - The Standardizing Trace Format (9 Lessons)
03 Module 3: Langfuse Deep Dive - The Leader of Open-Source LLM Observability (9 Lessons)
04 Module 4: Arize Phoenix and Arize AX - The Generative-AI Continuation of ML Observability (9 Lessons)
05 Module 5: Helicone - Proxy-Based LLM Observability and Cost Tracking (9 Lessons)
06 Module 6: W&B Weave and Braintrust - Eval-First LLM Observability (9 Lessons)
07 Module 7: LangSmith - LangChain Native Observability (9 Lessons)
08 Module 8: Trace + Span Anatomy and LLM-Specific Metrics (9 Lessons)
09 Module 9: Eval-Driven Observability - Online and Offline Evaluation (9 Lessons)
10 Module 10: Cost + Latency + Quality Monitoring - Three-Dimensional LLM Surveillance (9 Lessons)
11 Module 11: Production Debugging + Alerting + Incident Response (9 Lessons)
12 Module 12: Capstone - Building a Multi-Provider Observability Stack (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.

Frequently Asked Questions