What is the key difference between classical APM (Datadog, New Relic, Dynatrace) and LLM observability?

Classical APM was designed for deterministic applications: it measures per-endpoint latency, error rate, and throughput. LLM applications are non-deterministic: the same prompt can yield different responses, and they add dimensions that classical APM cannot measure, such as response quality (hallucination, refusal, sycophancy), token-level cost (prompt + completion + reasoning), and semantic properties like prompt drift. AI observability platforms (Langfuse, Phoenix, Helicone) cover these with traces and spans carrying LLM-specific attributes (model, tokens, cost), an eval framework (LLM-as-judge), prompt management, and semantic drift detection. The 2026 solution is to run AI observability and classical APM together (Datadog moved in this direction in 2024 with LLM Observability). Module 1 covers this in detail.

Among Langfuse, Phoenix, Helicone, LangSmith — which should I choose?

It depends on the scenario. Open-source + on-premise + KVKK-critical → Langfuse (the most widespread choice among Turkish enterprises). RAG-heavy + ML observability heritage → Phoenix (the Arize team's ML maturity). Zero-config + proxy + cost optimization → Helicone (semantic cache + Cloudflare edge). LangChain / LangGraph native + commercial support → LangSmith. Fast iteration + eval-first + existing W&B user → Weave. Production span analysis + Karpathy team sponsorship → Braintrust. The Module 12 capstone walks you through making this choice for your own scenario.

Are the OpenTelemetry GenAI Semantic Conventions really becoming the standard?

Yes. In the 2024-2026 wave, the CNCF and the OpenTelemetry working group shaped the standard; Langfuse, Phoenix, Helicone, and Datadog LLM Observability all support the gen_ai.* namespace and OpenInference / OpenLLMetry. The Traceloop OpenLLMetry SDK and the Arize OpenInference SDK provide auto-instrumentation (OpenAI, Anthropic, LangChain, LlamaIndex). The strategic benefit of the standard is protection against vendor lock-in: if you want to migrate from Langfuse to Phoenix one day, you can carry your traces with you. Module 2 covers this in detail.
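As a rough illustration of what the shared namespace buys you, the sketch below maps a provider-specific call record onto gen_ai.* span attributes. The attribute names gen_ai.system, gen_ai.request.model, and gen_ai.usage.input_tokens come from the Module 2 description; gen_ai.usage.output_tokens and the shape of the `call` dictionary are assumptions made for this example, and attribute names can change between specification versions.

```python
def to_genai_attributes(call: dict) -> dict:
    """Map a hypothetical call record to gen_ai.* span attributes."""
    attributes = {
        "gen_ai.system": call.get("provider"),
        "gen_ai.request.model": call.get("model"),
        "gen_ai.usage.input_tokens": call.get("input_tokens"),
        # Assumed counterpart of input_tokens; verify against the spec version in use.
        "gen_ai.usage.output_tokens": call.get("output_tokens"),
    }
    # Drop missing values so a backend does not receive null attributes.
    return {key: value for key, value in attributes.items() if value is not None}


record = {"provider": "openai", "model": "gpt-4o", "input_tokens": 120, "output_tokens": 45}
span_attributes = to_genai_attributes(record)
```

Because every backend reads the same keys, the dictionary can be attached to a span once and exported to more than one platform.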

What is eval-driven observability? How does it differ from trace-only tracking?

Trace-only is the passive approach: log every LLM call and open it when needed. Eval-driven means each trace is automatically assigned a quality score (LLM-as-judge + custom evaluator), regression tests run in CI/CD, and prompt changes pass through eval gating. The result: bad prompts do not reach production, and drift is caught early. Braintrust, Weave, and Langfuse are eval-first in philosophy; Helicone is trace-first. Modules 6 and 9 cover this in detail.
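Eval gating can be sketched as a simple comparison between a baseline prompt and a candidate prompt scored on the same golden dataset. The function name, the score range of 0 to 1, and the 0.02 tolerance are assumptions chosen for illustration, not values prescribed by any of the platforms.

```python
from statistics import mean


def eval_gate(baseline_scores, candidate_scores, max_drop=0.02):
    """Return (passed, delta) for a candidate prompt versus the baseline.

    Scores are per-example quality scores in [0, 1], for example from an
    LLM-as-judge run over the same golden dataset for both prompts.
    """
    if not baseline_scores or not candidate_scores:
        raise ValueError("both score lists must be non-empty")
    delta = mean(candidate_scores) - mean(baseline_scores)
    # The gate fails only when quality drops by more than the tolerance.
    return delta >= -max_drop, delta


passed, delta = eval_gate([0.9, 0.8, 1.0, 0.7], [0.9, 0.7, 0.9, 0.7])
print(f"passed={passed} delta={delta:+.3f}")
```

In a CI job, a failed gate would typically translate into a non-zero exit code that blocks the merge.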

Self-hosted Langfuse vs SaaS Langfuse — which is more appropriate?

If KVKK compliance and data sovereignty are required, choose self-hosted (mandatory in Turkey for finance, healthcare, and the public sector). SaaS Langfuse Cloud (US / EU regions) is suitable for a fast start, but if Turkish production data must not leave for the EU or the US, self-hosting is the only option. Docker Compose offers a 5-minute setup; the Kubernetes Helm chart takes it to enterprise scale. Module 3.3 covers KVKK-compliant self-hosted deployment, PII redaction, and Turkish-data handling in detail.

Are there special needs in observability for reasoning models (o3, R1, Claude Extended Thinking)?

Yes, there are 3 main needs: (1) Reasoning-token billing: prompt, completion, and reasoning tokens must be tracked as separate categories (in OpenAI o3, reasoning tokens are billed as output tokens, although the API reports them as a separate count). (2) Thinking-trace storage: thinking traces of 16K-128K tokens are large, so a cost-aware sampling strategy is needed. (3) Reasoning eval: answer correctness and reasoning quality (PRM-style step-by-step eval) must be measured separately. The 2025 releases of Langfuse and Phoenix added native support for reasoning-token tracking. Module 8 covers the reasoning-category breakdown in detail.
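A minimal sketch of the three-category breakdown follows. It assumes a usage record in which the output token total already includes the reasoning tokens and both are priced at the output rate; the `Usage` and `Price` types and the price values are hypothetical and should be replaced with your provider's actual reporting fields and price table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    input_tokens: int
    output_tokens: int     # assumed to include reasoning tokens
    reasoning_tokens: int  # subset of output_tokens spent on reasoning


@dataclass(frozen=True)
class Price:
    input_per_mtok: float   # USD per 1M input tokens (hypothetical value)
    output_per_mtok: float  # USD per 1M output tokens (hypothetical value)


def cost_breakdown(usage: Usage, price: Price) -> dict:
    """Split the cost of one call into prompt, completion, and reasoning."""
    if usage.reasoning_tokens > usage.output_tokens:
        raise ValueError("reasoning_tokens cannot exceed output_tokens")
    visible_tokens = usage.output_tokens - usage.reasoning_tokens
    per_input = price.input_per_mtok / 1_000_000
    per_output = price.output_per_mtok / 1_000_000
    parts = {
        "prompt": usage.input_tokens * per_input,
        "completion": visible_tokens * per_output,
        "reasoning": usage.reasoning_tokens * per_output,
    }
    parts["total"] = sum(parts.values())
    return parts


breakdown = cost_breakdown(Usage(1_000, 5_000, 4_000), Price(2.0, 8.0))
```

Keeping reasoning as its own category makes it visible when most of a call's cost comes from tokens the user never sees.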

Is the latency overhead of Helicone's proxy architecture acceptable?

With Cloudflare Workers edge deployment, the overhead is under 100 ms (typically 30-80 ms), which is negligible compared to production LLM calls (1-30 seconds). In a self-hosted Docker setup the network round-trip can add roughly 50-150 ms, which is still acceptable. Advantages: async log ingestion, no SDK integration, and instant zero-config tracing. Practical rule: Helicone is ideal for production scenarios where streaming-first, sub-second TTFT is not critical. Module 5 covers this in detail.

How is observability done for agents (multi-step tool calling)?

Agent-specific metrics are critical: tool-selection accuracy (rate of calling the right tool), planning depth (number of steps), max-iterations breach rate (infinite-loop risk), and tool-call latency breakdown (time in LLM calls vs tool calls). The span tree visualization in Phoenix and Langfuse is ideal for agent debugging because each tool call appears as a separate span. LangSmith provides native agent traces for LangGraph, and Langfuse integrates with LlamaIndex agents. Module 8.3 covers agent-specific metric anatomy in detail.
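The first three metrics can be computed from labelled agent runs along these lines. The `AgentRun` record is a hypothetical shape for this sketch: it assumes a reviewer has labelled the expected tool for each step and that the runtime records whether the iteration cap was hit.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    expected_tools: list      # tool a reviewer labelled as correct, per step
    called_tools: list        # tool the agent actually called, per step
    hit_max_iterations: bool  # True when the run was cut off by the cap


def tool_selection_accuracy(runs):
    """Share of steps where the agent called the expected tool."""
    correct = total = 0
    for run in runs:
        # zip stops at the shorter list, so unlabelled steps are ignored.
        for expected, called in zip(run.expected_tools, run.called_tools):
            total += 1
            correct += expected == called
    return correct / total if total else 0.0


def max_iterations_breach_rate(runs):
    """Share of runs that ended by hitting the iteration cap."""
    return sum(run.hit_max_iterations for run in runs) / len(runs) if runs else 0.0


def mean_planning_depth(runs):
    """Average number of tool calls per run."""
    return sum(len(run.called_tools) for run in runs) / len(runs) if runs else 0.0


runs = [
    AgentRun(["search", "calculator"], ["search", "search"], False),
    AgentRun(["search"], ["search"], True),
]
```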

Can I really reduce monthly LLM bills with cost monitoring?

Yes — practical experience shows 30-60% cost reduction. Keys: (1) Identifying the most expensive endpoints via per-endpoint + per-user attribution; (2) 30-50% cache hit rate with semantic cache (Helicone, LiteLLM); (3) Model routing (simple query → Haiku 4.5 / Gemini Flash, complex → Opus 4.7 / GPT-5); (4) Reducing token count via prompt optimization (typically 20-40% reduction); (5) Using reasoning models only when needed (mixed-mode router). Module 10 covers cost monitoring + budget alerting in detail.
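Step (1), attribution, is mostly an aggregation problem. The sketch below sums cost per endpoint or per user from call records; the record fields (`endpoint`, `user`, `cost_usd`) are hypothetical names for this example and would come from trace attributes in practice.

```python
from collections import defaultdict


def attribute_cost(calls, key="endpoint"):
    """Return (key, cost, share) rows, most expensive first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    grand_total = sum(totals.values())
    rows = [
        (name, cost, cost / grand_total if grand_total else 0.0)
        for name, cost in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


calls = [
    {"endpoint": "/chat", "user": "u1", "cost_usd": 0.5},
    {"endpoint": "/search", "user": "u2", "cost_usd": 0.25},
    {"endpoint": "/chat", "user": "u2", "cost_usd": 0.25},
]
by_endpoint = attribute_cost(calls)
by_user = attribute_cost(calls, key="user")
```

Sorting by cost puts the endpoints worth optimizing first at the top of the report.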

How is PII redaction done? What should I watch out for with KVKK and Turkish data?

Critical PII categories for Turkish data: the TC ID number (11 digits, validated by two mod-10 check digits in positions 10 and 11), IBAN (TR + 24 digits), phone (+90 / 0 5XX prefix), email, name and surname, and address. A hybrid approach is recommended: regex plus ML-based detection (Microsoft Presidio with custom Turkish patterns). Langfuse data masking, Helicone vault, and a Phoenix custom interceptor perform PII redaction at the OTLP gateway layer, so production data does not reach the platform unmasked. The KVKK Generative AI Guide (2024) explicitly recommends not sending PII to LLM providers. Modules 2.3, 3.3, and 12 cover this in detail.
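The regex half of the hybrid approach can be sketched for the TC ID number alone. The checksum below follows the commonly documented rule (digit 10 from the odd- and even-position sums, digit 11 from the sum of the first ten digits); the function names and the `[TC_ID]` placeholder are choices made for this example, and IBAN, phone, and name detection are deliberately left out.

```python
import re

# Eleven digits, not starting with 0, not embedded in a longer digit run.
TC_CANDIDATE = re.compile(r"(?<![0-9])[1-9][0-9]{10}(?![0-9])")


def is_valid_tc_id(value: str) -> bool:
    """Check the two mod-10 check digits of a TC ID number."""
    if not re.fullmatch(r"[1-9][0-9]{10}", value):
        return False
    digits = [int(ch) for ch in value]
    odd_sum = sum(digits[0:9:2])   # positions 1, 3, 5, 7, 9
    even_sum = sum(digits[1:8:2])  # positions 2, 4, 6, 8
    check10 = (odd_sum * 7 - even_sum) % 10
    check11 = sum(digits[:10]) % 10
    return digits[9] == check10 and digits[10] == check11


def redact_tc_ids(text: str, placeholder: str = "[TC_ID]") -> str:
    """Replace only candidates that pass the checksum."""
    def replace(match):
        return placeholder if is_valid_tc_id(match.group()) else match.group()
    return TC_CANDIDATE.sub(replace, text)


masked = redact_tc_ids("Customer 10000000146 placed order 12345678901.")
```

Validating the checksum before masking reduces false positives on other 11-digit values such as order numbers.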

What concrete artifacts will I have at the end of the training?

The following artifacts are produced in the capstone project: (1) an observability stack tailored to your production scenario (Langfuse self-hosted + OpenTelemetry GenAI + Phoenix evaluator); (2) a Docker Compose / Kubernetes Helm chart deployment template; (3) a PII redaction + KVKK-compliant Turkish-data handling pipeline; (4) an eval framework (offline CI/CD + online LLM-as-judge); (5) a Grafana cost + latency + quality dashboard; (6) PagerDuty alerting + Slack integration + on-call runbook; (7) RCA + blameless post-mortem templates; (8) a 90-day observability roadmap (cost-reduction + quality-improvement + incident-response targets).

Can the training be customized for our enterprise team?

Yes. Beyond the standard 3-day program, we offer customized private-classroom versions for enterprise clients. Module weights and capstone scenarios are tailored to your team's existing LLM stack (OpenAI / Anthropic / Google / DeepSeek / your own CPT model), existing observability infrastructure (Datadog, New Relic, Grafana, ELK), domain (finance, healthcare, legal, public sector, e-commerce), compliance requirements (KVKK, EU AI Act, ISO/IEC 42001, HIPAA), production SLA goals, and cost-optimization priorities.

About this training

A 3-day advanced training, delivered in Turkish, that covers the observability discipline of production generative-AI and LLM applications end to end. It includes Langfuse, Arize Phoenix + AX, Helicone, Weights & Biases Weave, Braintrust, LangSmith, OpenTelemetry GenAI Semantic Conventions, OpenLLMetry, OpenInference, LiteLLM observability, KVKK-compliant PII redaction, eval-driven observability, cost + latency + quality monitoring, and production incident response.

Advanced Level | 3 Days

AI Observability and LLM Monitoring Engineering Training (Langfuse + Phoenix + Helicone + Weave + Braintrust + LangSmith)

About This Course

This training is designed to address AI observability end to end, in Turkish: the discipline of placing generative-AI and LLM applications under observation in production, measuring them, evaluating them, and ensuring their operational sustainability. The 2024-2026 period witnessed the birth of LLM observability platforms (Langfuse, Arize Phoenix, Helicone, W&B Weave, Braintrust, LangSmith) and their race to set the standard; in the same period, the vendor-agnostic trace standard took shape with OpenTelemetry GenAI Semantic Conventions. In Turkey, training that covers this discipline end to end across math, tool stack, production experience, and KVKK compliance is virtually nonexistent: existing content either stops at short single-tool tutorials or stays stuck in the classical APM perspective. This program is designed to fill that gap as Turkey's most comprehensive production-grade AI observability reference training.

The program's strategic backbone is the first module, which clarifies how LLM observability differs from the classical APM (Application Performance Monitoring) approach. It details why classical APM solutions like Datadog, New Relic, and Dynatrace fall short on LLM applications, and covers LLM-specific observability needs such as non-deterministic semantic output, hallucination, prompt drift, cost explosion, token-level cost attribution, RAG retrieval quality, and agent tool-selection accuracy. It establishes the 4-pillar framework of generative-AI observability (trace + eval + cost + quality drift). The 2026 ecosystem map compares Langfuse (open-source, 13K+ GitHub stars), Arize Phoenix + AX (ML observability tradition), Helicone (proxy-based, YC W23), W&B Weave + Braintrust (eval-first), and LangSmith (LangChain native). The decision framework covers open-source vs SaaS vs enterprise hybrid, self-hosted Langfuse vs Helicone vs Phoenix, and selection from the KVKK + EU AI Act + GDPR compliance perspective.

The second module covers in detail the OpenTelemetry GenAI Semantic Conventions specification that shaped the AI observability standard in the 2024-2026 period. Topics: the gen_ai.* attribute namespace (gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens), span events (gen_ai.content.prompt, gen_ai.content.completion), and metrics (the gen_ai.client.token.usage histogram); auto-instrumentation in Python and Node.js with the Traceloop OpenLLMetry SDK; Arize OpenInference wrappers for OpenAI, Anthropic, LlamaIndex, and LangChain; custom span addition and context-propagation patterns. Multi-backend routing with the OpenTelemetry Collector (Langfuse + Phoenix in parallel), sampling strategies (head sampling vs tail sampling, cost vs visibility trade-off), and a self-hosted OTLP gateway with KVKK-compliant PII redaction are practiced hands-on. Thanks to this standard, traces become portable across backends like Langfuse, Phoenix, Helicone, and W&B Weave, which provides a strategic advantage against vendor lock-in.
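The cost vs visibility trade-off of tail sampling can be sketched as a decision made once a trace is complete: keep everything that failed or was slow, and keep only a fraction of the rest. The function name, the 5-second threshold, and the 10% base rate are assumptions for illustration; in a real pipeline this policy would usually be configured in the collector rather than written by hand.

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, duration_s: float,
               slow_threshold_s: float = 5.0, base_rate: float = 0.1) -> bool:
    """Decide after a trace has finished whether to export it."""
    # Always keep traces that matter for debugging.
    if had_error or duration_s >= slow_threshold_s:
        return True
    # Deterministic sampling: hash the trace ID into [0, 1) so that every
    # service makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate


decisions = [keep_trace(f"trace-{i}", False, 1.0) for i in range(1000)]
kept_share = sum(decisions) / len(decisions)
```

Head sampling would make the same hash-based decision at the start of the request, before it is known whether the trace ends in an error.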

The third module covers end to end Langfuse, the leading open-source LLM observability platform of the 2024-2026 period. It covers the Python SDK's @observe decorator and low-level SDK integration; Node.js + Java SDK + OpenTelemetry adapter usage; and trace + span + generation + score hierarchy modeling. The prompt-management layer: prompt versioning + production label + A/B testing pipeline; dataset creation + ground truth + LLM-as-judge eval framework; custom evaluator (Python function) + scheduled eval runs. On the self-hosting side: Docker Compose + Kubernetes Helm chart deployment; PostgreSQL + ClickHouse + S3 storage architecture; PII redaction + masking + KVKK-compliant Turkish-data handling. Langfuse is the stack chosen by approximately 80% of enterprise AI teams in Turkey: open-source, flexible, on-premise deployable, and eval-first in philosophy.

The fourth module covers in detail the 2024-2026 versions of Arize, which has an ML observability heritage. Phoenix (source-available under the Elastic License 2.0, shaping the OpenInference standard): Docker + local setup, OpenInference instrumentation (OpenAI, Anthropic, Bedrock, LlamaIndex, LangChain auto-tracing); span tree visualization + RAG retrieval debugging. Phoenix LLM Evals (built-in evaluators: hallucination, toxicity, relevance, QA correctness, code readability); custom evaluator + LLM-as-judge prompt templates; batched eval + analysis via the Phoenix dashboard. Production embedding drift detection + UMAP visualization; RAG context relevance + retrieval quality monitoring; Arize AX SaaS enterprise scaling + multi-tenancy + RBAC. Phoenix's ML maturity in production embedding monitoring is the most important advantage it carries over to LLM observability, which makes it ideal for RAG-heavy teams.

The fifth module covers in detail Helicone's (YC W23, open-source) differentiating proxy architecture. Tracing without SDK integration via a single base_url change (OpenAI / Anthropic / OpenRouter); async log ingestion + tagging via Helicone-Property headers; custom property + user-level cost attribution. Token usage + cost-tracking dashboard + budget alerts; the 30-50% cost-reduction recipe with semantic cache; rate limiting + retry logic + provider failover. Self-hosting: Helicone OSS Docker setup; sub-100ms overhead with Cloudflare Workers Edge deployment; Vault (API-key rotation + KVKK-compliant secret management). Ideal for fast-iteration teams preferring development speed + zero-config setup — especially Turkish startups.

The sixth module covers in detail Weave (W&B team's LLM-specific product launched 2024) and Braintrust (supported by Andrej Karpathy + Imbue team, eval-first paradigm). Weave: ML-experiment-tracking heritage + @weave.op() decorator auto-tracing + dataset versioning + interactive Jupyter / Colab integration + comparison view. Braintrust: offline + online eval with braintrust SDK + eval() function; AutoEvals library built-in LLM-as-judge prompts; production span analysis + prompt playground. Eval-first philosophy: 'regression test on every PR' approach; CI/CD pipeline integration with prompt-change gating. Which team should prefer Weave/Braintrust vs Langfuse/Phoenix — a detailed decision matrix is provided.

The seventh module addresses LangSmith, the LangChain team's commercial observability product (Plus $39/month, Enterprise SaaS + on-prem). LangChain / LangGraph native integration; zero-config tracing via LANGSMITH_TRACING=true; LangGraph + LangChain Runnable hierarchy trace visualization; production debugging with run metadata + custom tags. Dataset upload + ground truth + golden-answer management; built-in evaluators (correctness, conciseness, helpfulness); experiment compare view + A/B prompt regression test. Prompt Hub (shared prompt registry + versioning); self-hosted LangSmith (on-prem) Kubernetes deployment; enterprise tier SOC2 + RBAC + audit logging. The lowest-friction choice for teams using the LangChain / LangGraph ecosystem.

The eighth module mathematically addresses the foundational data model of LLM observability. Trace (user session) → root span (request) → child span (LLM call + tool call + retriever call + nested chain) → event hierarchy; span types (LLM call, tool call, retriever, custom function); distributed tracing with context propagation across microservices. LLM-specific metrics: TTFT (Time To First Token, critical metric for streaming UX), TPOT (Time Per Output Token, throughput measurement), prompt + completion + cached + reasoning token breakdown (reasoning-model billing matters). Cost calculation: model price table + dynamic price computation (OpenAI/Anthropic/Gemini up-to-date pricing); per-user + per-feature + per-endpoint cost attribution. Quality metrics: groundedness, faithfulness, relevance LLM-as-judge implementation. Agent-specific metrics: tool-selection accuracy, planning depth, max-iterations breach rate.
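The two streaming metrics reduce to simple arithmetic over token arrival timestamps. The sketch below assumes timestamps in seconds on a single clock and defines TPOT as the average gap between consecutive tokens after the first one; the function name and return shape are choices made for this example.

```python
def streaming_latency(request_start: float, token_times: list) -> dict:
    """Compute TTFT and TPOT from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    count = len(token_times)
    # TPOT averages the gaps between consecutive tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / (count - 1) if count > 1 else 0.0
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "total_s": token_times[-1] - request_start,
    }


metrics = streaming_latency(0.0, [0.5, 0.75, 1.0, 1.25])
```

With these definitions, total latency is TTFT plus TPOT times the number of tokens after the first.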

The ninth module is dedicated to the eval-driven observability discipline at the heart of systematically monitoring LLM quality in production. Offline eval pipeline: regression eval on prompt changes in CI/CD; GitHub Actions + Langfuse / Braintrust eval integration; golden-dataset versioning + drift detection. Online eval + user feedback: continuous LLM-as-judge scoring of production traces; thumbs up/down + structured feedback + NPS collection; the user feedback → dataset → eval improvement loop. LLM-as-judge discipline: judge prompt design + bias mitigation (position bias, length bias, verbosity bias); pairwise comparison + reference-based + reference-free judge; multi-judge ensemble + human-judge agreement validation. With this discipline, production quality regressions feed back to CI/CD.
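One common way to reduce position bias in pairwise comparison is to ask the judge twice with the answer order swapped and accept a winner only when both passes agree. In the sketch below, `judge` is a hypothetical callable standing in for an LLM-as-judge call; it receives the question and two answers and returns "first" or "second".

```python
from typing import Callable


def debiased_pairwise(judge: Callable[[str, str, str], str],
                      question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging in both answer orders."""
    first_pass = judge(question, answer_a, answer_b)   # A shown first
    second_pass = judge(question, answer_b, answer_a)  # A shown second
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "A"
    if not a_wins_first and not a_wins_second:
        return "B"
    # The verdict flipped with the order, so treat it as position bias.
    return "tie"


def always_first(question, first, second):
    """Stub judge with maximal position bias, for demonstration only."""
    return "first"


verdict = debiased_pairwise(always_first, "What is 2 + 2?", "4", "5")
```

A judge that always prefers the first position produces a tie here instead of a false win, which is the behavior the swap is meant to guarantee.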

The tenth module addresses the mandatory three-dimensional monitoring discipline for the economic and operational sustainability of production LLM applications. Cost monitoring: token-usage trend + model distribution + per-endpoint breakdown; user-level cost attribution + per-tenant budgeting; semantic cache hit-rate + cost-reduction effectiveness. Latency + SLO/SLI: P50/P95/P99 TTFT + TPOT histograms; SLO/SLI definition ('P95 TTFT < 1.5s, success rate > 99.5%'); error budget + alerting threshold management. Quality monitoring: hallucination rate + sycophancy drift + refusal rate tracking; Grafana dashboard + Prometheus metrics integration; Datadog LLM Observability + New Relic AI Monitoring overview. These three dimensions together provide production sustainability for enterprise AI applications.
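The example SLO above ('P95 TTFT < 1.5s, success rate > 99.5%') can be checked with a few lines of arithmetic. This sketch uses the nearest-rank percentile and expresses the error budget as the remaining share of allowed failures; function names and the return shape are assumptions for illustration, and a monitoring backend would normally compute these from histograms.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def check_slo(ttft_samples, successes, total,
              ttft_p95_limit=1.5, success_target=0.995):
    """Evaluate the latency and availability SLOs for one window."""
    p95 = percentile(ttft_samples, 95)
    success_rate = successes / total
    failures = total - successes
    allowed_failures = (1 - success_target) * total
    return {
        "p95_ttft_s": p95,
        "latency_ok": p95 < ttft_p95_limit,
        "success_rate": success_rate,
        "availability_ok": success_rate > success_target,
        # 1.0 means the budget is untouched; negative means it is overspent.
        "error_budget_remaining": (
            1 - failures / allowed_failures if allowed_failures else 0.0
        ),
    }


report = check_slo([0.4] * 95 + [2.0] * 5, successes=9960, total=10000)
```

Alerting on the remaining error budget rather than on single slow requests is one way to limit alert fatigue.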

The eleventh module focuses on the real-world use moment of AI observability — production incident debugging and resolution. Failed-trace analysis: error spans, retry chain, timeout breakdown; provider outage handling (OpenAI 5XX storm, Anthropic capacity throttling, Gemini RPC errors); agent infinite loop + max-iteration safeguard pattern. Alerting + on-call: PagerDuty + Slack + Discord alerting integration; threshold tuning + alert-fatigue prevention; on-call rotation + escalation policy + runbook preparation. RCA + post-mortem: root cause analysis with 5-Whys + Ishikawa diagrams; blameless post-mortem template + action-item tracking; Linear / Jira ticket integration + incident retrospective. The operational maturity of AI systems depends on the rigor of this discipline.
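The max-iteration safeguard pattern amounts to a hard cap on the agent loop that raises a distinct, alertable error. In this sketch, `step` is a hypothetical function that advances the agent by one iteration and reports whether it is done; the exception name and the default cap of 8 are likewise assumptions.

```python
class MaxIterationsExceeded(RuntimeError):
    """Raised when an agent loop hits its iteration cap."""


def run_agent(step, state, max_iterations=8):
    """Run step(state) -> (new_state, done) until done or the cap is hit."""
    for iteration in range(1, max_iterations + 1):
        state, done = step(state)
        if done:
            return state, iteration
    # A dedicated exception type lets tracing mark the span as a breach.
    raise MaxIterationsExceeded(
        f"agent did not finish within {max_iterations} iterations"
    )


def count_to_three(state):
    """Toy step function: finishes once the counter reaches 3."""
    state = state + 1
    return state, state >= 3


final_state, iterations_used = run_agent(count_to_three, 0)
```

Counting how often this exception fires per run gives the max-iterations breach rate described in the eighth module.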

In the capstone module, each participant designs an end-to-end AI observability stack tailored to their own production scenario: provider selection (Langfuse self-hosted, Phoenix, Helicone, Weave, Braintrust, LangSmith), integration approach (OpenTelemetry GenAI vs native SDK), eval framework (offline + online), cost + latency + quality monitoring dashboard, alerting + on-call setup, KVKK-compliant PII redaction, 90-day production roadmap. By the end of the training, participants reach a level of technical competence to clearly frame how LLM observability differs from classical APM; build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions; make team-appropriate choices among Langfuse / Phoenix / Helicone / Weave / Braintrust / LangSmith; integrate the eval-driven observability discipline into the CI/CD pipeline; build a cost + latency + quality three-dimensional monitoring dashboard; manage production incidents with failed-trace analysis + RCA + post-mortem framework; and build a KVKK + EU AI Act + GDPR-compliant Turkish-data handling pipeline. The training consists of 3 days, 12 modules, and over 100 hands-on lessons.

Training Methodology

The only production-grade advanced program in Turkey that addresses AI observability and LLM monitoring end to end in Turkish

Six-platform comparison: Langfuse + Arize Phoenix + Helicone + W&B Weave + Braintrust + LangSmith

OpenTelemetry GenAI Semantic Conventions + OpenLLMetry + OpenInference vendor-agnostic standard

Eval-driven observability (offline + online + LLM-as-judge + user feedback) discipline

Mathematical construction of trace + span + token + cost + quality + agent metric anatomy

Cost monitoring + latency SLO/SLI + quality drift detection three-dimensional monitoring

Production incident debugging + PagerDuty alerting + RCA + blameless post-mortem framework

KVKK + EU AI Act + GDPR-compliant Turkish PII redaction + self-hosted observability deployment

Who Is This For?

ML Engineers and ML Platform Engineers who want to tie production LLM applications to the observability and monitoring discipline

Engineers seeking to bring MLOps + LLMOps maturity to teams scaling enterprise LLM products

Senior backend developers establishing cost + latency + quality SLO/SLI discipline for AI-powered SaaS products

AI/LLM SREs responsible for on-call rotation + production incident response

Enterprise AI compliance teams that need to build a KVKK + EU AI Act + GDPR-compliant Turkish AI observability stack

AI engineers who want to systematically monitor quality drift and hallucination in RAG + agent + reasoning-model deployments

Why This Course?

The only advanced program in Turkey that addresses AI observability discipline end to end + production-grade in Turkish.

Instills the discipline of right selection via the six-platform comparison: Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith.

Teaches the vendor-agnostic standardization approach with OpenTelemetry GenAI Semantic Conventions.

Carries production quality regressions to CI/CD with eval-driven observability (offline + online + LLM-as-judge).

Offers cost + latency + quality three-dimensional monitoring + Grafana / Prometheus / Datadog integration.

Establishes operational maturity with production incident debugging + RCA + blameless post-mortem framework.

Teaches KVKK + EU AI Act + GDPR-compliant Turkish PII redaction + self-hosted deployment discipline.

Completes the six-training production-grade LLM engineering frontier set with RLHF + Reasoning + Mech Interp + CPT + Quantization + Observability.

Learning Outcomes

Clearly frame how LLM observability differs from classical APM.

Build a vendor-agnostic trace pipeline with OpenTelemetry GenAI Semantic Conventions.

Make team-appropriate choices among Langfuse, Phoenix, Helicone, Weave, Braintrust, LangSmith.

Provide KVKK-compliant observability by setting up self-hosted Langfuse + Helicone + Phoenix deployments.

Integrate the eval-driven observability discipline into the CI/CD pipeline.

Build a cost + latency + quality three-dimensional monitoring dashboard.

Continuously measure production quality with an LLM-as-judge eval framework.

Manage production incidents with failed-trace analysis + RCA + blameless post-mortem.

Set up PagerDuty + Slack alerting + on-call rotation + escalation policy.

Address the special observability needs of reasoning models (o3/R1/Claude Extended Thinking) and agents.

Requirements

Active Python or Node.js experience (intermediate to advanced), REST API + JSON experience

Basic experience using LLM APIs (OpenAI, Anthropic, Google, or self-hosted)

Docker + Docker Compose + basic Kubernetes knowledge (for self-hosted deployment)

Basic experience with PostgreSQL or ClickHouse + log analysis

OpenTelemetry foundations (recommended, built in the training)

Langfuse, Phoenix, Helicone, LangSmith accounts (free tier) before the training

Course Curriculum

104 Lessons

Module 1: Strategic Introduction to LLM Observability - The Difference from Classical APM (9 Lessons)

Module 2: OpenTelemetry GenAI Semantic Conventions - The Standardizing Trace Format (9 Lessons)

Module 3: Langfuse Deep Dive - The Leader of Open-Source LLM Observability (9 Lessons)

Module 4: Arize Phoenix and Arize AX - The Generative-AI Continuation of ML Observability (9 Lessons)

Module 5: Helicone - Proxy-Based LLM Observability and Cost Tracking (9 Lessons)

Module 6: W&B Weave and Braintrust - Eval-First LLM Observability (9 Lessons)

Module 7: LangSmith - LangChain Native Observability (9 Lessons)

Module 8: Trace + Span Anatomy and LLM-Specific Metrics (9 Lessons)

Module 9: Eval-Driven Observability - Online and Offline Evaluation (9 Lessons)

Module 10: Cost + Latency + Quality Monitoring - Three-Dimensional LLM Surveillance (9 Lessons)

Module 11: Production Debugging + Alerting + Incident Response (9 Lessons)

Module 12: Capstone - Building a Multi-Provider Observability Stack (5 Lessons)

Instructor

Şükrü Yusuf KAYA

AI Architect | Enterprise AI & LLM Training | Stanford University | Software & Technology Consultant

Şükrü Yusuf KAYA is an internationally experienced AI Consultant and Technology Strategist leading the integration of artificial intelligence technologies into the global business landscape. With operations spanning 6 different countries, he bridges the gap between the theoretical boundaries of technology and practical business needs, overseeing end-to-end AI projects in data-critical sectors such as banking, e-commerce, retail, and logistics. Deepening his technical expertise particularly in Generative AI and Large Language Models (LLMs), KAYA ensures that organizations build architectures that shape the future rather than relying on short-term solutions. His visionary approach to transforming complex algorithms and advanced systems into tangible business value aligned with corporate growth targets has positioned him as a sought-after solution partner in the industry. Distinguished by his role as an instructor alongside his consulting and project management career, Şükrü Yusuf KAYA is driven by the motto of "Making AI accessible and applicable for everyone." Through comprehensive training programs designed for a wide spectrum of professionals—from technical teams to C-level executives—he prioritizes increasing organizational AI literacy and establishing a sustainable culture of technological transformation.
