TL;DR: If you are running LLM and agent systems in production, observability is no longer a luxury; it is a survival requirement. The biggest problem I see in the field is that every team uses its own tracing format (Langfuse one way, Helicone another, LangSmith completely differently), and six months later, when you want to switch providers, you have to rewrite your entire observability layer from scratch. To solve exactly this fragmentation, the OpenTelemetry community has been developing a standard called GenAI Semantic Conventions since April 2024. This standard lets you record, with uniform attribute names, which model was called, how many tokens were spent, cost, latency, tool calls, and optionally the prompt/completion content. As of early 2026, most of these conventions are still "experimental" (the API isn't fully settled), but Datadog, Honeycomb, and New Relic support them, and frameworks like LangChain, CrewAI, AutoGen, and AG2 emit OTel spans natively or through instrumentation libraries. In the Turkish/KVKK world, the most critical point is logging prompt content: this is a personal-data processing decision and must not be done without opt-in and redaction. In this article I lay out the whole picture with examples from the field.
An Honest Confession First: We Fly LLM Systems Blind
The scene I encounter most often in enterprise projects is this: an agent system is built, the demo works beautifully, everyone is happy, it goes to production. Then three weeks pass, and a user complains, "the system gave me a nonsensical answer." The team lead turns to me and asks: "Şükrü, why did this happen?" And the team cannot answer that question.
Why can't they answer? Because no mechanism was ever built to see what was happening inside the system. Which tools did the agent call? Which documents did it retrieve? How many times was the model called? Which prompt was sent? How many tokens were burned? Why did the answer take 14 seconds? None of this was recorded. The system runs like a closed box, and at the moment of failure all we have left is the user's complaint.
This is a situation we rarely face in traditional software. When something breaks in a classic web application, you look at the logs, see the stack trace, find the offending line. But LLM systems are probabilistic. The same input can produce two different outputs. An agent can solve the same task in two different ways. So the answer to "what happened?" depends on how well you recorded the system's internal state at that moment.
Let me explain with an analogy. Suppose you have an aircraft, but it has no instruments — no altitude, no speed, no fuel gauge. While the plane is in the air, everything looks fine. But when a problem arises, there is no way to understand what went wrong. An unobservable LLM system in production is exactly this: an aircraft with no instruments. Observability is building that instrument panel.
The central thesis of this article is this: you must build LLM observability on a standard, not on your own proprietary format. And today, the most mature, vendor-neutral standard we have is OpenTelemetry's GenAI Semantic Conventions. Let's unpack this step by step.
The Concepts: Trace, Span, Metric, and Event
For those new to OpenTelemetry, let me first clarify the fundamental concepts, because if you confuse these four words, everything else gets muddled.
Trace: The entire journey of a request from start to finish. A user asked a question, the agent ran, called three tools, went to the model twice, and queried a vector database — this entire chain of events is a single trace. A trace is the full answer to "what happened in this request?"
Span: A single unit of work within a trace. A call to the model is a span. A tool call is a separate span. A vector database query is another span. Spans nest (parent-child): the agent span is the parent, and under it are model-call and tool-call spans. A span has a start time, an end time, a duration, and attributes.
Metric: Numerical measurements you aggregate over time. How many requests came in per minute, what's the average latency, how many tokens were spent in total, what's the error rate. Metrics deal not with individual requests but with aggregate behavior. The charts on your dashboard are usually fed by metrics.
Event: A marked point at a specific moment inside a span. In the GenAI world, the most important events are prompt and completion content — that is, the text sent to the model and the text the model returned. These are recorded optionally (opt-in) because they may contain personal data.
Bringing these four together gives us a table like this:
| Concept | What It Captures | GenAI Example | When You Look at It |
|---|---|---|---|
| Trace | The whole request journey | The full agent flow from user question to answer | "What exactly happened in this request?" |
| Span | A single unit of work | One model call, one tool call, one retrieval | "Which step was slow / failed?" |
| Metric | Aggregate measurement over time | Hourly token consumption, p95 latency | "How is the overall system doing?" |
| Event | A marked moment inside a span | Outgoing prompt, incoming completion | "What exactly went to / came from the model?" |
This distinction matters because it lets you ask the right question of the right data. When investigating a user complaint, you look at the trace and spans. When tracking overall cost and performance trends, you look at metrics. When asking "why did the model give this answer?", you look at the prompt/completion content in the events.
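To make the four concepts concrete, here is a minimal sketch in plain Python. The Span class is a hypothetical stand-in for illustration, not the OpenTelemetry SDK, and the span names, timings, and the "example-model" name are invented.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Span:
    """Stand-in for an OTel span: one timed unit of work with attributes."""
    name: str
    start_ms: int
    end_ms: int
    parent: Optional["Span"] = field(default=None, repr=False)
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)    # (name, payload) pairs
    children: list = field(default_factory=list)

    def __post_init__(self):
        # Register with the parent so the trace forms a tree.
        if self.parent is not None:
            self.parent.children.append(self)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten a trace (root span plus descendants) into indented lines."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One trace = the root span and everything nested under it.
root = Span("agent.run", 0, 1400)
Span("tool.sales_query", 10, 310, parent=root)
llm = Span("chat example-model", 320, 1390, parent=root,
           attributes={"gen_ai.usage.input_tokens": 812})

# An event marks a moment inside a span; content stays out unless opted in.
llm.events.append(("prompt_sent", {"content_recorded": False}))

# A metric aggregates over many spans instead of describing a single one.
total_child_ms = sum(child.duration_ms for child in root.children)

print("\n".join(render(root)))
```

The tree answers "which step was slow?", while the aggregate at the end is the kind of number a dashboard chart would be built from.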
The Real Problem: Everyone Speaks Their Own Language
Now let me explain why we need a standard, because this isn't just about "let's record data."
Today there are plenty of great tools in the LLM observability market: Langfuse, Helicone, Traceloop, LangSmith, and many more. Each gives you a slick dashboard showing prompts, responses, tokens, and cost. So what's the problem? The problem is this: each one uses its own proprietary, mutually incompatible format.
Langfuse records a request with its own "trace/observation" model. LangSmith uses its own "run" structure. Helicone imposes its own proxy-based schema. Traceloop stays close to OTel but still has its own additions. The result is that when you pick a tool, your entire observability layer gets shaped around that tool's language. You embed its SDK in your code, write attribute names according to its rules, and wire your dashboards to its API.
What happens six months later? Suppose the pricing changed, or data sovereignty requires that data not leave the country, or another tool's features look better. You want to switch providers. And you realize this isn't a simple "swap the SDK" job — you have to rewrite all of your instrumentation. This is the classic vendor lock-in trap.
I lived through exactly this with a client. The team started with one tracing tool, the system grew, and then the enterprise security team required that data stay in their own environment (on-prem / VPC). The tool had to change. But that tool's calls were sprinkled throughout the entire agent codebase. The migration took weeks. Had they started with a vendor-neutral standard from the beginning, that transition would have come down to "change the exporter."
This is exactly why OpenTelemetry's GenAI Semantic Conventions exists. The idea is simple but powerful: let the instrumentation be standard, and let the backend be swappable. Let your code emit spans with standard OTel attribute names (like gen_ai.request.model, gen_ai.usage.input_tokens), and then decide in a separate layer which backend to send those spans to. Switching backends should mean changing exporter configuration, not code.
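The "standard instrumentation, swappable backend" idea can be sketched in a few lines. This is an illustration under assumptions, not the OTel SDK: the Exporter protocol, both exporter classes, and the config dictionary are hypothetical. Only the two gen_ai.* attribute names come from the text above.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""
    def export(self, span: dict) -> None: ...


class ConsoleExporter:
    def export(self, span: dict) -> None:
        print("console:", span)


class InMemoryExporter:
    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


EXPORTERS = {"console": ConsoleExporter, "memory": InMemoryExporter}


def build_exporter(config: dict) -> Exporter:
    # The backend choice lives in configuration, not in application code.
    return EXPORTERS[config["exporter"]]()


def record_llm_call(exporter: Exporter, model: str,
                    input_tokens: int, output_tokens: int) -> None:
    # Instrumentation only knows the standard attribute names.
    exporter.export({
        "name": f"chat {model}",
        "attributes": {
            "gen_ai.request.model": model,
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens,
        },
    })


exporter = build_exporter({"exporter": "memory"})
record_llm_call(exporter, "example-model", 120, 45)
```

Changing {"exporter": "memory"} to {"exporter": "console"} reroutes every span without touching record_llm_call, which is the property the client in the story above was missing.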
What Exactly Do GenAI Semantic Conventions Standardize?
Within OpenTelemetry, this work is led by the GenAI SIG (Special Interest Group) and has been active since April 2024. The goal is to unify attribute names and types for LLM calls, agent steps, vector database queries, token usage, cost, and quality metrics. It roughly covers four areas:
1. LLM client spans: The anatomy of a single call to a model. Which model (gen_ai.request.model), which system/provider (gen_ai.system, which newer releases of the conventions rename to gen_ai.provider.name), the requested parameters (temperature, max tokens), and most importantly the usage information: input token count (gen_ai.usage.input_tokens) and output token count (gen_ai.usage.output_tokens). This span carries the "a call went to the model, took this long, and burned this many tokens" information under standard names.
2. Agent spans: In agent-based systems, the execution of an agent or a step. The agent's name, role, and which task it is running. In a multi-agent system, each agent's span nests under the main workflow, so "which agent did what" becomes visible.
3. Events (prompt/completion content): Optionally, the full prompt sent to the model, the full completion returned, tool calls, and tool results. This is the most powerful but also the most sensitive part of observability — because here you have real user text, i.e. potential personal data. The standard recommends this content be opt-in, meaning off by default.
4. Metrics: Standard metric names for measurements like token usage, latency, and request counts. This lets you aggregate metrics from different frameworks and different models on the same dashboard.
There is one point where I must be honest: as of early 2026, most of these conventions are still "experimental." The API isn't fully stable, and attribute names can change between versions. This doesn't mean "don't use them"; on the contrary, adopting early and giving feedback to the community is valuable. But it does mean: track your OTel SDK versions, know that attribute names may change, and build your dashboards to be a bit flexible accordingly. Experimental doesn't mean "it doesn't work"; it means "the names aren't set in concrete yet."
What Should You Log? A Field Checklist
Saying "let's set up observability" is easy; the real question is "what exactly will we record?" Here is the practical list that works in the field. For every LLM/agent request, try to capture the following:
- Latency: Total request duration and the separate duration of each step (model call, tool call, retrieval). Broken down into p50, p95, p99 — the average lies, look at the tail.
- Tokens: Input and output token count for each call. This is the foundation of both cost and context-window fullness.
- Cost: Cost per call. Calculated by multiplying the input and output token counts by the model's respective per-token prices. Carrying this as a span attribute makes later cost attribution much easier.
- Tool calls: Which tool the agent called, with which arguments, and what it returned. If you use MCP (Model Context Protocol) tools, each tool call should be a separate span.
- Retrieval hits: If you use RAG, which documents were retrieved, what the similarity scores were, and how many contributed to the answer.
- Errors: Model timeout, rate limit, invalid output, JSON parse error. The error span must be clearly marked.
- Guardrail triggers: Did a safety filter, content moderation, or policy check kick in? If so, which one, and on which input?
Let me put this list into a table, because it's important to see that each signal serves a different purpose:
| Signal | Which Question It Answers | Related OTel Area |
|---|---|---|
| Latency (p95/p99) | "Are we meeting the SLO?" | Span duration, metric |
| Input/output tokens | "How are cost and context?" | gen_ai.usage.* |
| Cost per call | "Where is the money going?" | Span attribute, metric |
| Tool calls | "Did the agent pick the right tools?" | Agent/tool spans |
| Retrieval hits | "Did we retrieve the right documents?" | Retrieval span, event |
| Errors | "Where does it break?" | Span status, metric |
| Guardrail triggers | "Are the safety filters working?" | Span event/attribute |
We used this table as the skeleton of a dashboard design at a client. Each row became a chart or an alert rule on the dashboard. So instead of a vague goal like "let's log everything," each signal had a concrete purpose.
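The "average lies, look at the tail" point from the latency row is easy to demonstrate. This is a small sketch using the nearest-rank percentile definition (one of several common definitions); the latency values are invented, with one 14-second outlier.

```python
import math
from statistics import mean


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of the
    observations at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Span durations in milliseconds for ten requests (illustrative data).
latencies_ms = [420, 450, 470, 480, 500, 510, 530, 560, 600, 14000]

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

The mean lands at a value no actual request experienced, while p50 shows the typical request and p95/p99 expose the slow one. With only ten samples the tail percentiles collapse onto the single outlier; in production you would compute them over far larger windows.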
Debugging Agent Reasoning: The Real Power of the Trace
Now let's look at the most critical scenario: debugging in an agent system. This is where observability truly shines.
Imagine a multi-step agent. The user says, "Check the stock status of the three best-selling products from last quarter and prepare a procurement recommendation for the low ones." To solve this, the agent does the following: first it queries the sales database (a tool), then it checks the stock of each of the three products separately (three tool calls), then it retrieves a procurement policy document (retrieval), then it synthesizes all of this and produces a recommendation (model call).
Now suppose the answer came out wrong: the agent picked the wrong three products. Without a trace, finding this error is nearly impossible. But with a properly instrumented trace, you can unpack it step by step:
When you look at the trace, you see nested spans under the agent span. Start with the first tool call, the sales query: which date range was sent? Did the agent interpret "last quarter" correctly? Maybe it converted "last quarter" to the wrong dates and retrieved the wrong products from the start. The root cause becomes visible right there, in the first span.
This is an extremely common pattern in agent systems: the error usually appears not in the most visible place (the final answer) but somewhere in the middle of the chain. The agent calls a tool with a wrong argument, that tool returns wrong data, all subsequent steps build on that wrong data, and in the end the answer is nonsense. Without a trace, you only see the nonsensical answer; with a trace, you see exactly where the chain broke.
When MCP tools are involved, this becomes even more critical. MCP is a protocol through which agents connect to external tools and data sources with a standard interface. If an agent is connected to three different MCP servers and one is responding slowly or returning bad data, you cannot know which one is the problem unless you instrument each MCP call as a separate span. The beauty of the GenAI Semantic Conventions is that they let you record these tool calls with standard attributes — so how long each MCP tool took and what it returned becomes trackable in a uniform way.
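Here is a rough sketch of the two questions you would ask of such a trace: what arguments did each tool receive, in order, and which MCP server is eating the time. The spans are plain dictionaries with invented values. The app.mcp.server and app.tool.arguments attributes are hypothetical custom names, and gen_ai.tool.name follows my reading of the conventions, so check it against your pinned version.

```python
from collections import defaultdict


def tool_span(tool: str, server: str, start_ms: int, end_ms: int,
              arguments: dict) -> dict:
    return {"start_ms": start_ms, "end_ms": end_ms,
            "attributes": {"gen_ai.tool.name": tool,
                           "app.mcp.server": server,        # hypothetical
                           "app.tool.arguments": arguments}}  # hypothetical


spans = [
    tool_span("stock_check", "inventory", 400, 2900, {"sku": "B-2"}),
    tool_span("sales_query", "sales", 10, 310,
              {"from": "2025-01-01", "to": "2025-03-31"}),
    tool_span("stock_check", "inventory", 320, 400, {"sku": "A-1"}),
    tool_span("stock_check", "inventory", 2900, 2980, {"sku": "C-3"}),
    tool_span("policy_search", "docs", 2990, 3100, {"query": "procurement"}),
    {"start_ms": 3110, "end_ms": 4200,
     "attributes": {"gen_ai.request.model": "example-model"}},
]


def tool_timeline(spans: list[dict]) -> list[tuple]:
    """Tool calls in chronological order, with the arguments they received."""
    tools = [s for s in spans if "gen_ai.tool.name" in s["attributes"]]
    return [(s["attributes"]["gen_ai.tool.name"],
             s["attributes"].get("app.tool.arguments"))
            for s in sorted(tools, key=lambda s: s["start_ms"])]


def time_per_server(spans: list[dict]) -> dict:
    """Total milliseconds spent in each MCP server's tool calls."""
    totals = defaultdict(int)
    for s in spans:
        server = s["attributes"].get("app.mcp.server")
        if server is not None:
            totals[server] += s["end_ms"] - s["start_ms"]
    return dict(totals)


timeline = tool_timeline(spans)
per_server = time_per_server(spans)
slowest_server = max(per_server, key=per_server.get)
print(timeline[0])       # the first tool call and its date range
print(slowest_server)
```

Reading the first timeline entry is the "which date range was sent?" check from the example above; the per-server totals point at the slow MCP server without guesswork. Recording tool arguments is itself a content-logging decision, so the same opt-in and redaction caution applies.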
Cost Attribution: "Why Did This Bill Balloon?"
In enterprise LLM projects, there is a question that inevitably arrives one day: "Why did this month's OpenAI/Anthropic bill double?" And without observability, the answer to that question turns into a guessing game.
If you record token usage and cost in spans as standard, you can slice cost along any dimension you want (attribution). Which feature burns the most tokens? Which user segment? Which model? Which agent step? We did exactly this at a client: when we sliced cost by feature, we saw that 60% of the bill came from a single feature — and that feature was needlessly resending a giant system prompt on every call. Once we streamlined the prompt and put it into prompt caching, the cost dropped significantly.
This is the moment observability turns directly into money. Recording tokens and cost with standard attributes makes the later question "where is the money going?" as easy as a SQL query. Without a standard, you'd have to parse each model's and each framework's own cost format separately.
A word of warning here: output tokens are usually much more expensive than input tokens. So when looking at cost attribution, don't look only at total tokens but also at the input/output split. Sometimes a feature is called rarely but produces very long answers on each call, and that's where the bill swells.
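A minimal sketch of this kind of attribution, keeping input and output cost separate. Everything numeric here is invented: the prices are hypothetical (expressed in micro-dollars per token so the arithmetic stays in integers), and app.feature is a hypothetical custom attribute you would add to your spans yourself.

```python
from collections import defaultdict

# Hypothetical prices in micro-dollars per token (1 = $1 per million tokens).
PRICE_MICRO_USD = {"example-model": {"input": 1, "output": 4}}


def call(feature: str, input_tokens: int, output_tokens: int) -> dict:
    return {"app.feature": feature,                 # hypothetical attribute
            "gen_ai.request.model": "example-model",
            "gen_ai.usage.input_tokens": input_tokens,
            "gen_ai.usage.output_tokens": output_tokens}


calls = (
    [call("search_summary", 20_000, 500)] * 3   # big prompt resent each time
    + [call("report_writer", 1_000, 6_000)]     # rare, but very long answers
    + [call("chat", 2_000, 400)]
)


def cost_by_feature(calls: list[dict]) -> dict:
    """Sum input and output cost per feature, in micro-dollars."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for c in calls:
        price = PRICE_MICRO_USD[c["gen_ai.request.model"]]
        bucket = totals[c["app.feature"]]
        bucket["input"] += c["gen_ai.usage.input_tokens"] * price["input"]
        bucket["output"] += c["gen_ai.usage.output_tokens"] * price["output"]
    return dict(totals)


report = cost_by_feature(calls)
grand_total = sum(b["input"] + b["output"] for b in report.values())
for feature, b in sorted(report.items(),
                         key=lambda kv: -(kv[1]["input"] + kv[1]["output"])):
    share = 100 * (b["input"] + b["output"]) / grand_total
    print(f"{feature:15} input={b['input']:>6} output={b['output']:>6} "
          f"share={share:.0f}%")
```

In this toy data the two expensive features are expensive for opposite reasons: one is dominated by input cost (a candidate for prompt trimming or caching), the other by output cost (a candidate for shorter answers). Total tokens alone would hide that difference.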
Quality and Drift Monitoring: The System Degrades Silently
Now let's move to a subtler topic. LLM systems have an insidious property: they can degrade without crashing. When classic software crashes, an alarm goes off and everyone runs. An LLM system, by contrast, can keep responding while answer quality slowly declines. We call this, broadly, drift.
Drift happens in several ways. The provider updates the model in the background and behavior changes. The distribution of questions users ask shifts over time (a new product launches, a new topic comes up) and your system isn't ready for this new distribution. Or the documents in your RAG grow stale and the system starts answering with outdated information. None of these appear as an "error" — the system keeps running, just increasingly worse.
Observability serves as an early-warning system here. If you track quality metrics over time (for example the distribution of answer length, the frequency of certain keywords, the guardrail trigger rate, user feedback/approval rates), you can catch drift before users start complaining. Because GenAI Semantic Conventions let you carry these metrics under standard names, comparing different time windows and different versions becomes easy.
Here's how I set this up in the field: a baseline is established for critical metrics, and an alarm fires when there's a certain deviation from that baseline. For example, if the guardrail trigger rate is normally 2% and suddenly jumps to 8%, something has changed — maybe a prompt-injection attack, maybe the model's behavior drifted, maybe user behavior changed. Whatever the cause, now you know, and you can investigate.
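The baseline-and-deviation rule can be sketched as follows. The thresholds are assumptions chosen for illustration (alert when the rate moves by a factor of two in either direction, and only once the current window has at least 200 requests); the Window class and drifted function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Counts observed in one time window."""
    requests: int
    guardrail_triggers: int

    @property
    def rate(self) -> float:
        return self.guardrail_triggers / self.requests if self.requests else 0.0


def drifted(baseline: Window, current: Window, *,
            max_ratio: float = 2.0, min_requests: int = 200) -> bool:
    """True when the current rate deviates from baseline by max_ratio or more."""
    if current.requests < min_requests:
        return False                  # too little data to judge
    if baseline.rate == 0:
        return current.rate > 0
    ratio = current.rate / baseline.rate
    # A sudden drop is also a signal: the filter itself may have stopped firing.
    return ratio >= max_ratio or ratio <= 1 / max_ratio


baseline = Window(requests=10_000, guardrail_triggers=200)   # 2%
spike = Window(requests=500, guardrail_triggers=40)          # 8%
quiet = Window(requests=500, guardrail_triggers=11)          # 2.2%
tiny = Window(requests=50, guardrail_triggers=10)            # 20%, 50 requests

print(drifted(baseline, spike), drifted(baseline, quiet),
      drifted(baseline, tiny))
```

The minimum-request guard keeps a handful of triggers in a quiet hour from paging anyone. A ratio test is the simplest possible rule; a statistical test over the two windows would be a reasonable next step if false alarms become a problem.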
Sampling: You Don't Have to Record Everything
When scale grows, you face a reality: recording the full trace of every request, complete with prompt/completion content, is both expensive and unnecessary. This is where sampling comes in.
Sampling means keeping a detailed trace of a subset of requests instead of all of them. There are two main approaches:
- Head-based sampling: You decide "will I trace this?" at the moment the request starts, for example 10% of requests. Simple and cheap, but there's a risk of missing the problematic request — maybe the error was precisely in the 90% you didn't trace.
- Tail-based sampling: You decide after the request finishes. This lets you set smart rules like "record every failed and slow request, and only 5% of the successful and fast ones." More powerful, but the infrastructure is more complex.
My recommendation in the field is this: always record errors and anomalies, sample the normal flow. If a request errored, slowed down, or triggered a guardrail, keep its full trace without exception — because these are exactly the requests you'll be debugging. Sample the ordinary, successful, fast requests to keep cost under control. Since metrics already give the aggregate picture, you don't need a detailed trace of every request.
There's an added KVKK benefit too: sampling naturally reduces the amount of personal data you record. The less raw content you record, the closer you stay to the data minimization principle.
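The two sampling approaches and the "always keep errors and anomalies" rule can be sketched like this. It illustrates the decision logic only, not the OTel SDK samplers or the Collector's tail-sampling processor; the 5-second slow threshold and the summary dictionary fields are assumptions.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start: keep a stable fraction of traces.

    Hashing the trace id makes the decision deterministic, so every
    service that sees the same trace id makes the same choice.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")      # 0 .. 2**64 - 1
    return bucket < int(rate * 2**64)


def tail_sample(summary: dict, *, slow_ms: int = 5000,
                normal_rate: float = 0.05) -> bool:
    """Decide after the request finished, when the outcome is known."""
    if summary["error"] or summary["guardrail_triggered"]:
        return True                                  # always keep anomalies
    if summary["duration_ms"] >= slow_ms:
        return True                                  # always keep slow requests
    # Ordinary traffic: keep only a small, stable fraction.
    return head_sample(summary["trace_id"], normal_rate)


failed = {"trace_id": "t-1", "error": True,
          "guardrail_triggered": False, "duration_ms": 900}
slow = {"trace_id": "t-2", "error": False,
        "guardrail_triggered": False, "duration_ms": 14_000}

print(tail_sample(failed), tail_sample(slow))
```

The trade-off from the list above shows up directly in the code: head_sample needs only the trace id, whereas tail_sample needs the finished request's summary, which means something has to buffer spans until the request completes.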
The Ecosystem: Who Supports It, How Do You Start?
The good news is this: this standard hasn't stayed theoretical — the ecosystem is growing fast. On the observability backend side, Datadog, Honeycomb, and New Relic support the GenAI Semantic Conventions — meaning if you send spans in OTel format, these tools understand them and offer LLM-specific views.
On the framework side, popular agent frameworks like LangChain, CrewAI, AutoGen, and AG2 emit OTel spans either natively or through instrumentation libraries. So if you're using one of these frameworks, you don't have to write observability from scratch — most of the time, adding the right instrumentation package and configuring the exporter is enough.
Tools like Traceloop already work in an OTel-centric way, so the migration friction is low. Tools with their own format, like Langfuse, Helicone, and LangSmith, are increasingly adding OTel compatibility too, because the market is shifting toward the standard. There is momentum in this direction, and if you want to be on the right side of it, starting with the standard makes sense.
For a practical start, here is the roadmap I recommend in the field:
- Add the OTel SDK and set up the basic trace/span infrastructure. Start without content, using only metadata (model, tokens, latency).
- Turn on framework instrumentation. If you use LangChain/CrewAI etc., add the relevant OTel instrumentation package so agent and model spans come automatically.
- Set up an OTel Collector. This is the intermediate layer that collects spans and routes them to the backend. When you want to switch providers, you change only this — you don't touch your code.
- Set up metrics and alerts. Based on the signal table above, create dashboards and alerts for tokens, cost, latency, and errors.
- Turn on content logging last and carefully. Do the KVKK assessment, add the redaction layer, and enable it opt-in with limited access.
I specifically recommend this order because it leaves the riskiest step (content logging) for last and delivers the most value earliest (metadata observability).
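For that last step, here is a minimal sketch of an opt-in gate with a redaction pass in front of it. The two patterns (email addresses and 11-digit Turkish national ID numbers) are illustrative only and nowhere near complete personal-data detection; the function names and the app.prompt.redacted attribute are hypothetical.

```python
import re

# Illustrative patterns only; real redaction needs a far broader rule set.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
NATIONAL_ID = re.compile(r"\b[1-9]\d{10}\b")   # 11 digits, first digit non-zero


def redact(text: str) -> str:
    """Replace recognizable personal data with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = NATIONAL_ID.sub("[NATIONAL_ID]", text)
    return text


def content_for_span(prompt: str, *, capture_content: bool = False) -> dict:
    """Return content attributes only when logging was explicitly enabled."""
    if not capture_content:
        return {}                      # default: metadata only, no content
    return {"app.prompt.redacted": redact(prompt)}


prompt = "Mail ayse@example.com, TCKN 12345678901."
print(content_for_span(prompt))                         # off by default
print(content_for_span(prompt, capture_content=True))   # redacted when on
```

The design choice worth copying is the default: content capture is off unless someone turns it on, and even then raw text never reaches the span without passing through redact first.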
Concrete Steps to Take Today
After reading this article, sit down with your team and do these three things. First, identify which tracing format your current LLM system uses — a proprietary tool's format, or OTel? If you're locked into a proprietary format, put a migration plan to OTel on the agenda; if you have no observability yet, start directly with OTel.
Second, take the signal table above and mark, for each row, "are we recording this right now?" Latency, tokens, cost, tool calls, retrieval, errors, guardrails — which are missing? Prioritize the missing ones. Usually tokens and cost give the fastest return; start there.
Third, put your KVKK decision on prompt-content logging in writing. Are you logging content? If so, what is your legal basis, are you redacting, how long do you retain it, where do you keep it? Documenting this decision both makes your life easier in an audit and ensures the team acts with a shared understanding. If today you are logging raw prompts unredacted and keeping them indefinitely, stop that today — it is a risk quietly accumulating.
Observability is not a topic you can put off in LLM systems with a "we'll handle it later." You need it the moment the system goes to production, because when the first serious problem arrives, you'll either have a trace in hand or just a guess. I always prefer the trace — and I'll take a standard trace over a vendor-locked one any day.