LLM Observability | Şükrü Yusuf Kaya

TL;DR — 2026 was the year LLM observability turned from a nice-to-have into a necessity. OpenTelemetry's GenAI Semantic Conventions standardized how LLM operations are recorded: which model was called, how many input/output tokens were spent, how tool calls and completions went. As of March 2026 most GenAI semantic conventions are still experimental, but Datadog, New Relic and Dynatrace support them natively; OTel-instrumented agent code sends data to these platforms without any SDK change. Langfuse, meanwhile, was acquired by ClickHouse on 16 January 2026 and reports more than 2,000 paying customers and tens of millions of SDK installs per month. In this piece I explain why LLM observability is different, what the GenAI semantic conventions bring, how to build an observability pipeline, and how to manage the cost-latency-quality triangle in a KVKK context.

Why Classic Observability Isn't Enough for LLMs

For years we've monitored software systems with metrics, logs and traces. CPU, memory, latency, error rate — familiar. But LLM systems bring a new layer classic observability doesn't see: the semantic layer. An HTTP endpoint returns either 200 or 500; easy to measure. An LLM returns 200 but the answer can be completely wrong. The system can look "healthy" while it's talking nonsense to the user.

That is why LLM observability is different. It must answer not just "is the system up" but "is the system correct, high-quality, and how much does it cost?" Classic metrics (latency, error) are still necessary but not sufficient. There are new LLM-specific dimensions: token usage, cost, hallucination rate, answer faithfulness, prompt version, model version, tool-call success.

The most dangerous situation I see in the field is the "seems to be working" confidence. The team watches metrics, latency is normal, error rate is low, everyone is relaxed. But no one is monitoring output quality. Then a customer says "this chatbot gave me wrong information and I made a decision based on it," and it turns out a quality problem has been running for weeks unnoticed. LLM observability exists precisely to illuminate this blind spot.

What the OpenTelemetry GenAI Semantic Conventions Brought

For years everyone did LLM monitoring in their own proprietary format. Every tool used different field names, data didn't port, moving from one platform to another meant rewriting everything. OpenTelemetry's GenAI Semantic Conventions ended this chaos: a common standard for how GenAI operations are recorded.

What does this standard record? The called model, input and output token counts, and (when opted in) the full content of prompts, completions, tool calls and tool results. Concrete examples of fields: gen_ai.request.model (the model requested), gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (token counts for each LLM call), and gen_ai.response.finish_reasons (why the model stopped generating).
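To make the field names concrete, here is a minimal sketch of the attribute map one LLM call might carry. It uses a plain dictionary rather than any SDK; the helper name genai_span_attributes and the model name "example-model" are my own placeholders, and since the conventions are still experimental the names may change.

```python
from typing import Any


def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          finish_reasons: list) -> dict:
    """Build a flat attribute map using the field names quoted in the text."""
    attributes: dict[str, Any] = {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,
    }
    return attributes


# "example-model" is a placeholder, not a real model identifier.
attrs = genai_span_attributes("example-model", 1200, 350, ["stop"])
total_tokens = attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
```

Because the keys are the same whichever tool produced them, a query such as "sum of input tokens per model" can be written once and reused across backends.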

To understand why this is a revolution, look at the lock-in perspective. Before the standard, changing your monitoring tool meant rewriting all instrumentation. After the standard, you instrument once with OTel and can send that data to any platform that supports it. Datadog, New Relic and Dynatrace now natively support the GenAI semantic conventions — so your OTel-instrumented agent code sends data to these platforms without any SDK change. This is a powerful architectural decision that breaks vendor lock-in.

Important caveat: as of March 2026 most GenAI semantic conventions are still experimental. This means field names and structure may change. Still, instrumenting to the standard is far safer than instrumenting to a proprietary format — because as the standard matures it evolves with you, whereas a proprietary format can leave you behind.

Langfuse and the OTel Ecosystem

One of the most visible players in LLM observability, Langfuse aims to be compliant with the OTel GenAI semantic conventions and to support major LLM instrumentation frameworks. It can operate as an OpenTelemetry backend: receiving traces via an OTLP endpoint. This is designed to increase compatibility with frameworks, libraries and languages beyond Langfuse's own SDKs and native integrations.

Langfuse maps received OTel traces to its own data model and supports additional fields popular in the OTel GenAI ecosystem. This flexibility matters because the semantic conventions are still evolving: even if the standard changes, Langfuse continues to make sense of the data.

A sign of the importance the industry attaches to this space: ClickHouse acquired Langfuse on 16 January 2026. Langfuse reports more than 2,000 paying customers and tens of millions of SDK installs per month. These figures show LLM observability is no longer a niche topic but an enterprise necessity. An open-source tool being acquired by a large data-infrastructure company confirms the strategic importance of this layer.

How to Build an Observability Pipeline

Let's leave theory for practice. An LLM observability pipeline is built in three stages: instrumentation, collection, analysis.

Stage 1 — Instrumentation. Instrument your LLM calls to the OTel GenAI semantic conventions. On every call record the model, token counts, latency, prompt version and (if KVKK allows) content. In agent systems trace every step as a span: tool calls, sub-agent delegations, replans. The goal is to see a request's end-to-end journey.

Stage 2 — Collection. Send the instrumented data to a backend: Langfuse, Datadog, New Relic or a self-hosted OTel collector. Because you use OTel, this choice is portable — changing the backend doesn't change the instrumentation.

Stage 3 — Analysis. Extract insight from the collected data. Which prompts burn the most tokens? Which model calls are slowest? In which scenarios does the hallucination rate rise? In which user segment does cost explode? This analysis is the map for improving the system.

These three stages are not a one-off setup but a continuously running loop. As the system changes, instrumentation is updated, new metrics are added, new analyses are done. Observability is not a project but a practice.
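The three stages can be sketched end to end in a few lines. This is an in-memory illustration under stated assumptions, not a production setup: COLLECTED stands in for a real backend, fake_llm_call is a hypothetical model client that counts words instead of tokens, and app.prompt_version is a custom attribute name I made up for the example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

COLLECTED = []  # stand-in for a backend; a real setup would export the spans


@contextmanager
def llm_span(name, model, prompt_version):
    """Stage 1: record one step as a span with timing and attributes."""
    span = {"name": name, "attributes": {
        "gen_ai.request.model": model,
        "app.prompt_version": prompt_version,  # hypothetical custom attribute
    }}
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["duration_s"] = time.perf_counter() - start
        COLLECTED.append(span)  # Stage 2: hand the finished span to the backend


def fake_llm_call(prompt):
    """Hypothetical model client; counts words as a crude token stand-in."""
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}


def answer(prompt, prompt_version):
    with llm_span("chat", "example-model", prompt_version) as span:
        result = fake_llm_call(prompt)
        span["attributes"]["gen_ai.usage.input_tokens"] = result["input_tokens"]
        span["attributes"]["gen_ai.usage.output_tokens"] = result["output_tokens"]
        return result["text"]


def tokens_by_prompt_version(spans):
    """Stage 3: which prompt version burns the most input tokens?"""
    totals = defaultdict(int)
    for s in spans:
        a = s["attributes"]
        totals[a["app.prompt_version"]] += a.get("gen_ai.usage.input_tokens", 0)
    return dict(totals)


answer("summarize this short text", "v12")
answer("summarize this much longer text with extra instructions", "v13")
totals = tokens_by_prompt_version(COLLECTED)
```

Swapping the backend only touches the line that appends to COLLECTED; the instrumentation in answer stays as it is, which is the portability argument in miniature.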

Trace, Span and LLM-Specific Structure

Concepts from classic observability — trace and span — apply in the LLM world too but take on new meaning. A trace is a user request's end-to-end journey. A span is a single step in that journey. In a classic web request, spans are HTTP calls and database queries. In an LLM agent, spans are LLM calls, tool calls, retrieval steps, replans.

An agent trace should tell you the whole story: a user request arrived, the agent made this plan, called this tool (it took this long and used this many tokens), did this retrieval, and produced this answer. This inner visibility turns "why is the agent slow" or "why did the agent answer wrongly" from a guess into a diagnosis. If you see that a span is slow, you focus there; if you see that a retrieval step returned nothing, you fix your RAG pipeline.
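A rough sketch of that diagnosis, assuming a trace is simply a list of spans. The Span class, the kind labels and the documents_returned attribute are illustrative names of mine, not part of any standard.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str            # "llm", "tool" or "retrieval" (illustrative labels)
    duration_ms: float
    attributes: dict = field(default_factory=dict)


def diagnose(trace):
    """Point at the slowest step and at retrieval steps that returned nothing."""
    findings = []
    slowest = max(trace, key=lambda s: s.duration_ms)
    findings.append(f"slowest step: {slowest.name} ({slowest.duration_ms:.0f} ms)")
    for s in trace:
        if s.kind == "retrieval" and s.attributes.get("documents_returned", 0) == 0:
            findings.append(f"empty retrieval: {s.name}")
    return findings


trace = [
    Span("plan", "llm", 420, {"gen_ai.usage.input_tokens": 900}),
    Span("search_docs", "retrieval", 35, {"documents_returned": 0}),
    Span("lookup_order", "tool", 2100),
    Span("final_answer", "llm", 800, {"gen_ai.usage.input_tokens": 1500}),
]
findings = diagnose(trace)
```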

The beauty of the GenAI semantic conventions is that they add standard fields to these spans. gen_ai.usage.input_tokens appears under the same name on every LLM-call span, regardless of which tool instrumented it. This standardization lets different teams, systems and platforms speak the same language. An engineer looking at a trace can understand it without knowing which tool produced it.

Cost: The Most Neglected Dimension

The most practical benefit of LLM observability is cost visibility. Token-based pricing can make cost invisible and unpredictable. Adding an extra paragraph to a prompt can cost thousands of dollars across millions of calls, and no one notices — until the bill arrives.

Observability solves this. If you track each call's token usage and cost, you can answer: Which feature is most expensive? Which user generates the most cost? Which prompt is unnecessarily long? Is our context window bloating? This visibility is the foundation of cost optimization.

Typical gains I see in the field: trimming unnecessarily long system prompts, caching repeated context with prompt caching, routing simple tasks to a cheap model, and reducing token usage with context compaction. All of these are possible only if you can see the cost. You can't optimize what you don't measure. And LLM costs, when unmeasured, are an expense line that grows silently.
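The arithmetic behind cost visibility is simple enough to sketch. The prices below are hypothetical numbers chosen for easy math, not any provider's real price list, and the model and feature names are placeholders.

```python
from collections import defaultdict

# Hypothetical prices in USD per one million tokens; real prices vary by provider.
PRICE_PER_MTOK = {
    "example-large": {"input": 3.00, "output": 15.00},
    "example-small": {"input": 0.25, "output": 1.25},
}


def call_cost(model, input_tokens, output_tokens):
    """Cost of one call (or one batch of calls) in USD."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def cost_by_feature(calls):
    """Answer 'which feature is most expensive?' from per-call token counts."""
    totals = defaultdict(float)
    for c in calls:
        totals[c["feature"]] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return dict(totals)


calls = [
    {"feature": "support_chat", "model": "example-large",
     "input_tokens": 2_000_000, "output_tokens": 200_000},
    {"feature": "autocomplete", "model": "example-small",
     "input_tokens": 4_000_000, "output_tokens": 400_000},
]
per_feature = cost_by_feature(calls)

# An extra 150-token paragraph in the prompt, repeated over 10 million calls.
extra_paragraph_cost = call_cost("example-large", 150 * 10_000_000, 0)
```

Under these assumed prices the extra paragraph alone comes to 4,500 USD, which is the kind of line item nobody sees until it is measured per call.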

KVKK and Observability: The Content-Logging Dilemma

Here is a critical dilemma for Turkish companies. Observability delivers the most value when you log prompt and answer content — to diagnose a problem you need to see what was asked and what was answered. But this content often contains personal data. The GenAI semantic conventions make content logging opt-in for exactly this reason.

In a KVKK context, content logging must be carefully designed. Logging a customer support conversation means copying that customer's personal data into an observability system. This triggers data-minimization, purpose-limitation and retention-period principles. The solution is not blindly "log everything" or "log nothing," but a balanced approach.

Practical pattern: always log metadata (model, tokens, latency, prompt version); log content (prompt, answer) by masking, sampling or anonymizing personal data. Detect sensitive fields (national ID, health, finance) before logging and apply redaction. And limit log retention per KVKK. This preserves both diagnostic capability and compliance.
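A minimal sketch of this pattern, assuming regex-based detection is acceptable as a first line of defense. The two patterns are deliberately simple shape checks (an 11-digit number and an email address) and will miss many kinds of personal data; the function names and the 10% sampling rate are illustrative, and none of this is legal advice.

```python
import hashlib
import re

# Shape checks only: an 11-digit number and a basic email pattern.
NATIONAL_ID = re.compile(r"\b[1-9]\d{10}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def redact(text):
    """Mask sensitive-looking fields before the text reaches the log store."""
    text = NATIONAL_ID.sub("[NATIONAL_ID]", text)
    return EMAIL.sub("[EMAIL]", text)


def sampled(trace_id, rate):
    """Deterministic sampling: the same trace id always gets the same decision."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000


def log_record(trace_id, metadata, prompt, answer, content_rate=0.1):
    record = {"trace_id": trace_id, **metadata}   # metadata is always kept
    if sampled(trace_id, content_rate):           # content only for a sample
        record["prompt"] = redact(prompt)
        record["answer"] = redact(answer)
    return record


masked = redact("ID 12345678901, mail ali@example.com")
with_content = log_record("t-1", {"model": "example-model"}, "my id is 12345678901",
                          "noted", content_rate=1.0)
metadata_only = log_record("t-1", {"model": "example-model"}, "my id is 12345678901",
                           "noted", content_rate=0.0)
```

Hashing the trace id instead of calling a random generator keeps every span of a trace on the same side of the sampling decision.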

Another measure: access control for content logging. Not everyone should see the content of every trace; only those who need it for diagnosis, only as much as needed. This strengthens both KVKK and internal data security. Observability is a powerful tool but brings data responsibility proportional to its power.

Evaluation and Observability Work Together

Observability answers "what is happening in production"; evaluation (eval) answers "how good is the system." Together they are powerful. From production traces you collect interesting or failing cases and add them to an eval set. This set becomes the basis for measuring later improvements. So observability is a source that continuously feeds your eval set.

A mature pipeline works like this: in production a user gives a low score or an answer looks suspicious → that trace is flagged → a human reviews it → if it is truly an error it is added to the eval set → the next model/prompt change is tested against this case. This loop is a learning machine that improves the system over time. Without observability this loop can't be built because you can't see which cases are problematic.
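The loop can be sketched as three small functions. Everything here is illustrative: the score threshold, the record shapes and the exact-match comparison are simplifying assumptions, and a real eval would usually need a softer notion of correctness.

```python
def flag_for_review(traces, min_score=3):
    """Flag traces where the user gave a low score."""
    return [t for t in traces
            if t.get("user_score") is not None and t["user_score"] < min_score]


def add_confirmed(eval_set, flagged, verdicts):
    """verdicts maps trace id to the expected answer when a reviewer confirmed an error."""
    for t in flagged:
        expected = verdicts.get(t["id"])
        if expected is not None:
            eval_set.append({"input": t["input"], "expected": expected})
    return eval_set


def run_eval(candidate, eval_set):
    """Share of eval cases the candidate answers exactly as expected."""
    if not eval_set:
        return 1.0
    passed = sum(1 for case in eval_set if candidate(case["input"]) == case["expected"])
    return passed / len(eval_set)


traces = [
    {"id": "t1", "input": "refund window?", "output": "90 days", "user_score": 1},
    {"id": "t2", "input": "opening hours?", "output": "9 to 5", "user_score": 5},
    {"id": "t3", "input": "shipping cost?", "output": "free", "user_score": 2},
]
flagged = flag_for_review(traces)
verdicts = {"t1": "30 days"}          # the reviewer confirmed t1; t3 was a false alarm
eval_set = add_confirmed([], flagged, verdicts)

old_score = run_eval(lambda q: "90 days", eval_set)   # stand-in for the old prompt
new_score = run_eval(lambda q: "30 days", eval_set)   # stand-in for the fixed prompt
```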

That is why I build observability and eval not as separate projects but as two faces of a single quality infrastructure. One measures "what is happening," the other "how good"; together they answer "how do I improve."

Common Mistakes

Mistake 1 — Monitoring only classic metrics. Latency and error rate are necessary but insufficient for LLMs. Also track quality, cost and token dimensions.

Mistake 2 — Locking into a proprietary format. Embedding a monitoring tool's own format into your code means every migration is a rewrite of the instrumentation. Instrument to the OTel GenAI semantic conventions and stay portable.

Mistake 3 — Logging content without thinking. In a KVKK context, blind content logging is a breach risk. Metadata always, content balanced and redacted.

Mistake 4 — Adding observability later. Trying to add instrumentation after the system is in production is hard. Embed it from the start.

Mistake 5 — Observing but not acting. Collecting data isn't enough; if you don't extract insight and improve the system, observability is just an expensive storage cost.

Alerts and SLOs: From Observing to Monitoring

Observability must be active, not passive. Collecting data is a start but the real value is in alerting you when something goes wrong. In classic software we define SLOs (Service Level Objectives): "99.9% uptime," "p95 latency under 200ms." LLM systems need SLOs too, but in different dimensions.

LLM-specific SLO examples: "answer faithfulness above 95%," "average token cost per request below this limit," "hallucination flag rate below 2%," "p95 time-to-first-token below this many seconds." When these SLOs are breached, an automatic alert must fire. Then you learn of a quality drop from your system, not from a customer complaint.
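A sketch of how such SLOs might be checked over a window of recent traffic. The 95% and 2% thresholds come from the examples above; the cost and time-to-first-token limits are placeholders I picked, since the text leaves them open. The metric names are also my own.

```python
# (direction, limit) per metric; the last two limits are assumed placeholders.
SLOS = {
    "faithfulness": ("above", 0.95),
    "hallucination_flag_rate": ("below", 0.02),
    "avg_cost_usd": ("below", 0.01),
    "p95_ttft_s": ("below", 1.5),
}


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def breached(window, slos):
    """Return the names of the SLOs that the current window violates."""
    out = []
    for name, (direction, limit) in slos.items():
        value = window[name]
        ok = value > limit if direction == "above" else value < limit
        if not ok:
            out.append(name)
    return out


ttft_samples = [0.4] * 18 + [3.0] * 2      # time-to-first-token in seconds
window = {
    "faithfulness": 0.91,
    "hallucination_flag_rate": 0.01,
    "avg_cost_usd": 0.004,
    "p95_ttft_s": p95(ttft_samples),
}
alerts = breached(window, SLOS)
```

In a real pipeline the list returned by breached would feed whatever paging or chat alerting the team already uses.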

A mature pattern I see in the field: closely monitoring quality metrics after a prompt or model change. Say you switched to a new model version: cost dropped, but did faithfulness drop too? The observability pipeline shows this instantly. Without alerts you learn of this regression weeks later, through user churn. With alerts you notice it within minutes and roll back. The difference is proactive versus reactive.

Prompt and Model Versioning

An often-skipped but critical dimension of observability: knowing which prompt and model version produced each trace. Prompts change, models update, and these changes affect quality both for better and worse. If you don't tag each trace with the prompt version and model version, you can never answer "why did quality drop."

A concrete example: one Monday your quality metrics drop. If your observability pipeline versioned each trace, you see immediately: "we switched from prompt v12 to v13 on Friday and the drop started then." Instant diagnosis. If you didn't version, you search blindly for days. That is why prompt and model versioning is not a luxury of observability but a basic requirement.

Good practice: version prompts like code, tag each deployment, and keep these tags filterable in the observability pipeline. Then a "v13 vs v12" comparison is one click away. Tools like Langfuse combine prompt management with observability, making this workflow natural — you manage the prompt and measure its effect in the same place.
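The "v13 vs v12" comparison reduces to a group-by on the version tag. A minimal sketch, assuming each trace already carries a prompt_version tag and a faithfulness score; the field names and the 0.02 tolerance are illustrative.

```python
from collections import defaultdict
from statistics import mean


def quality_by_version(traces, tag="prompt_version", metric="faithfulness"):
    """Average a quality metric per version tag."""
    scores = defaultdict(list)
    for t in traces:
        scores[t[tag]].append(t[metric])
    return {version: mean(values) for version, values in scores.items()}


def regressed(by_version, before, after, tolerance=0.02):
    """True when the newer version scores lower than the older one by more than tolerance."""
    return by_version[before] - by_version[after] > tolerance


traces = [
    {"prompt_version": "v12", "model_version": "m1", "faithfulness": 0.96},
    {"prompt_version": "v12", "model_version": "m1", "faithfulness": 0.98},
    {"prompt_version": "v13", "model_version": "m1", "faithfulness": 0.90},
    {"prompt_version": "v13", "model_version": "m1", "faithfulness": 0.88},
]
by_version = quality_by_version(traces)
drop_detected = regressed(by_version, "v12", "v13")
```

Passing tag="model_version" gives the same comparison across model upgrades.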

Distributed Tracing in Agent Fleets

Tracing a single LLM call is relatively simple. But in an orchestrator-worker agent fleet, a user request spreads across dozens of spans, multiple agents and many tool calls. This is where distributed tracing becomes vital. A trace must combine all these scattered steps into a single view so you can see "where did the request slow down, which agent erred."

OTel's power shines exactly here. OTel was designed for distributed tracing; the GenAI semantic conventions add LLM-specific fields on top. So if you instrument an agent fleet with OTel, both classic distributed tracing (which service, which call) and LLM-specific dimensions (which model, how many tokens) merge into a single consistent view. This is what makes complex agent systems debuggable.

Debugging a complex agent system without distributed tracing is nearly impossible. A request fails, but why? At which agent? At which tool call? A distributed trace shows the answer on a single timeline. This visibility is the core infrastructure that makes agent fleets manageable in production.
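What makes this possible is that every span carries a trace id and a parent id. A sketch of answering "at which agent, at which call" from those two fields, using plain dictionaries; the agent and span names are invented for the example.

```python
def failing_path(spans, trace_id):
    """Chain of 'agent:step' labels from the root down to the first span marked as error."""
    in_trace = {s["span_id"]: s for s in spans if s["trace_id"] == trace_id}
    failed = next((s for s in in_trace.values() if s.get("status") == "error"), None)
    if failed is None:
        return []
    path = []
    node = failed
    while node is not None:
        path.append(f'{node["agent"]}:{node["name"]}')
        node = in_trace.get(node.get("parent_id"))   # walk up; the root has no parent
    return list(reversed(path))


spans = [
    {"trace_id": "a", "span_id": "1", "parent_id": None,
     "agent": "orchestrator", "name": "handle_request", "status": "ok"},
    {"trace_id": "a", "span_id": "2", "parent_id": "1",
     "agent": "research-worker", "name": "retrieve", "status": "ok"},
    {"trace_id": "a", "span_id": "3", "parent_id": "1",
     "agent": "billing-worker", "name": "plan", "status": "ok"},
    {"trace_id": "a", "span_id": "4", "parent_id": "3",
     "agent": "billing-worker", "name": "lookup_invoice", "status": "error"},
    {"trace_id": "b", "span_id": "1", "parent_id": None,
     "agent": "orchestrator", "name": "handle_request", "status": "ok"},
]
path = failing_path(spans, "a")
```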

A Small Case: The Silent Cost Leak

Working with a SaaS company in Türkiye, we ran into a classic "silent cost leak." The company's LLM bill was growing slowly but steadily month over month and no one knew why. User count was flat and the feature set unchanged, yet the bill kept climbing. Management was worried.

When we built an observability pipeline, the cause emerged in a week. In one feature, the conversation history was appended to every call without any compaction. The longer a user extended a conversation, the more tokens each new message loaded. In long conversations a single message spent ten times more tokens than needed. No one had noticed because no one monitored token usage.

The fix was simple: context compaction. Past a certain length we summarized the conversation history and put the summary in its place. The result: that feature's cost dropped markedly while quality was preserved because the summary carried the conversation's essence. The lesson of this case is clear: you can't optimize a cost you can't see. Observability made this leak visible and made the fix possible.
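The shape of the leak and of the fix is easy to reproduce with made-up numbers. In this sketch every turn adds 200 tokens and the history is compacted to a 300-token summary once it passes 1,000 tokens; these figures are illustrative, not the client's, and the cost of producing the summary itself is not counted.

```python
def input_tokens_per_call(turns, tokens_per_turn, max_history=None, summary_tokens=0):
    """Input tokens sent on each call when history is appended, optionally compacted."""
    sent = []
    history = 0
    for _ in range(turns):
        history += tokens_per_turn                       # the new turn joins the history
        if max_history is not None and history > max_history:
            history = summary_tokens + tokens_per_turn   # older turns become a summary
        sent.append(history)
    return sent


plain = input_tokens_per_call(turns=20, tokens_per_turn=200)
compacted = input_tokens_per_call(turns=20, tokens_per_turn=200,
                                  max_history=1000, summary_tokens=300)

total_plain = sum(plain)          # grows quadratically with conversation length
total_compacted = sum(compacted)  # grows roughly linearly once compaction kicks in
```

With these numbers the twentieth message sends 4,000 input tokens without compaction and 900 with it, and the conversation total falls from 42,000 to 13,500.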

Self-Host or Buy

Building an observability pipeline also brings an architectural-ownership decision: will you self-host the tool or use a managed service? This decision matters for both cost and KVKK.

Self-hosting lets you keep data in your own infrastructure, a strong advantage for KVKK because prompt and answer content never leaves your environment. Open-source tools like Langfuse make this possible. The price is operational load: servers, updates and scaling are your responsibility. For small teams this load can feel heavy.

A managed service removes the operational load but sends data to a third party. For KVKK this triggers data-processor agreements, data-residency and cross-border-transfer provisions. For Turkish companies the question becomes: where is the observability data (especially if content is logged) processed — the EU, the US, where? Like model choice, this is a compliance decision.

The balanced approach I see in the field: instrument with OTel to stay portable, send metadata to a managed service if you like, and keep sensitive content in a self-hosted layer. This hybrid balances operational ease and KVKK compliance. And because you use OTel, if the decision changes tomorrow, migration doesn't require rewriting instrumentation; you just change the destination.
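The hybrid split can be sketched as a routing step in front of the two destinations. The key names in CONTENT_KEYS are illustrative, and the two lists stand in for a managed service and a self-hosted store; in practice this logic would more likely live in collector configuration than in application code.

```python
# Illustrative names for fields that carry user content.
CONTENT_KEYS = {"prompt", "completion", "tool_arguments", "tool_result"}

managed_service = []   # stand-in for a managed backend (metadata only)
self_hosted = []       # stand-in for a store inside your own infrastructure


def split_record(record):
    """Separate metadata from content; the trace id is kept on both sides as a join key."""
    metadata = {k: v for k, v in record.items() if k not in CONTENT_KEYS}
    content = {k: v for k, v in record.items() if k in CONTENT_KEYS}
    if content:
        content["trace_id"] = record["trace_id"]
    return metadata, content


def route(record):
    metadata, content = split_record(record)
    managed_service.append(metadata)
    if content:
        self_hosted.append(content)


route({
    "trace_id": "t-42",
    "gen_ai.request.model": "example-model",
    "gen_ai.usage.input_tokens": 812,
    "prompt": "Where is my invoice?",
    "completion": "It was sent on Monday.",
})
```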

Closing: Visibility Is the Foundation of Trust

LLM systems are powerful but opaque. Unlike an HTTP service that "either works or doesn't," they can be silently, insidiously wrong. The only thing that breaks this opacity is observability. Tokens, cost, latency, quality, versions — the team that can see these dimensions manages its system; the team that can't is at the system's mercy.

In 2026 this space matured. The OTel GenAI semantic conventions brought a standard, major platforms adopted it, open-source tools became accessible and the ecosystem converged. There is no longer a technical barrier; only discipline remains. Instrument your most critical pipeline, always log metadata, log content with KVKK balance, version prompts and models, define SLOs and set up alerts. The company that takes these steps turns LLM systems from a black box into a manageable engineering asset. Good engineering begins with managing what you can see. In the LLM world, seeing means observability. Start measuring today, because the day you measure is the day you truly begin to understand your system, and that understanding protects your cost, your quality and your user's trust.
