# Evaluating LLM and RAG Systems in 2026: From 'Seems to Work' to Measurable Quality (Eval Sets, LLM-as-Judge, RAGAS)

> Source: https://sukruyusufkaya.com/en/blog/llm-rag-degerlendirme-llm-as-judge-2026
> Updated: 2026-06-28T13:10:11.592Z
> Type: blog
> Category: yapay-zeka
**TLDR:** 'Seems to work' is the costliest sentence in AI. Moving to measurable quality with eval sets, LLM-as-judge, RAGAS, and retrieval metrics — including Turkish eval.

**TL;DR —** In AI projects, the most expensive sentence is "it seems to work in the demo." The only way to know whether a system is genuinely good is to measure it systematically. In this post I describe how to move LLM and RAG systems from "feels right" demos to measurable quality: building representative eval (golden) datasets, separating offline from online evaluation, using LLM-as-judge deliberately (pairwise vs rubric, position/verbosity/self-preference biases, calibration against human labels), RAG-specific RAGAS-style metrics (faithfulness, answer relevance, context precision/recall), retrieval metrics (recall@k, MRR, nDCG, hit rate), hallucination/groundedness detection, regression testing in CI, A/B testing and production monitoring, and finally the Turkish-language angle, KVKK, and the EU AI Act dimension. At the end there is a practical workflow, a metrics table, and a list of pitfalls.

## The cost of "it seems to work in the demo"

I have spent years consulting and training corporations on AI, and there is one sentence that sets off a quiet alarm inside me the moment I hear it: "We tried a few examples, and it seems to work pretty well." That sentence sounds positive; it even lifts the project team's morale. But in my experience it is the most expensive sentence you can utter in an AI project. The word "seems" is the load-bearing column of an entire structure built on unmeasured hope, and that column collapses sooner or later.

Why is it so dangerous? Because a language model gives genuinely great answers when you, in the demo, ask it the easiest, cleanest, most expected questions. Without realizing it, you warm the model up: you pick questions from areas where the model is strong, you recognize the correct answer because you already know it, and when it's wrong you tweak the prompt with "let's fix this a bit." The person running the demo is also the judge, and that judge is biased. When real users arrive, the picture changes: misspelled questions, missing context, contradictory requests, niche topics the model has never seen, multi-step requests stacked on top of each other. That is the moment the system that "seemed to work" confronts the reality that "it actually didn't" — and that confrontation usually happens in front of the customer, the executive, or the auditor.

My central claim in this post is this: before you put an AI system into production, you must be able to talk about its quality in numbers. "It looks good" should not be a feeling; it should be the result of a measurement. Setting up that measurement is both easier than you think and the highest-return investment in the project, because you cannot improve what you cannot measure, and you cannot defend what you cannot improve.

## Why evaluation is its own discipline

People coming from the software world might say, "We already write tests." But there is a fundamental difference between classic software testing and AI evaluation. In classic testing the input is fixed and the expected output is single and exact: `2 + 2` always returns `4`. With language models, the same question can have dozens of different answers, all of which are "correct." To "What is the population of Istanbul?" the model might say "About 15-16 million" or "Close to 16 million," and both are acceptable. So evaluation is less about a binary pass/fail logic and more about measuring a quality spectrum.

That is why AI evaluation is a discipline in its own right. We must separate the answers to three distinct questions: (1) Is the system retrieving the right information (retrieval quality)? (2) Is it using the retrieved information correctly and faithfully (generation quality)? (3) Does it meet the user's actual need (end-to-end usefulness)? If you blur these three layers together, you won't know where to fix things when the system performs poorly. This is the most common mistake I see in RAG systems: when the answer comes out badly, everyone blames the prompt, while most of the time the problem is that the right document never reached the model in the first place.

## Eval sets: the backbone of evaluation

Everything starts with a good evaluation dataset, variously called a golden set, an eval set, or an evaluation collection. It is the reference collection that records the questions your system genuinely needs to answer, together with what counts as a "good answer" for each. Without this set, every improvement you make is a step taken in the dark.

A good eval set has a few properties. **Representativeness** is the most important: your set should reflect the distribution of questions the system will face in real life. If 40 percent of users ask about invoices, then roughly 40 percent of your eval set should be invoice questions. The most robust way to ensure this is to feed it from **real user questions**. It's tempting to start by writing imaginary, "plausible" questions, but that misleads you, because real users ask in ways that would never occur to you. Feed it from live logs, call-center records, and old support tickets.
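One way to keep the category mix honest is proportional (stratified) sampling from labeled logs. The sketch below assumes each log row already carries a category label, which this post does not specify; `stratified_sample` and the category names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(logs: list[tuple[str, str]], size: int, seed: int = 0) -> list[tuple[str, str]]:
    """Sample (category, question) rows so each category keeps its share of traffic."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_category: dict[str, list[str]] = defaultdict(list)
    for category, question in logs:
        by_category[category].append(question)

    total = len(logs)
    sample = []
    for category, questions in by_category.items():
        # Proportional quota; rare categories still get at least one example.
        quota = max(1, round(size * len(questions) / total))
        quota = min(quota, len(questions))
        sample.extend((category, q) for q in rng.sample(questions, quota))
    return sample
```

Because of rounding and the one-example minimum, the result can differ from `size` by a few rows when there are many small categories.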

What do we store for each example? Ideally: the question, the relevant context/document if any, and the **expected answer** or at least the **acceptance criteria**. Some questions have a single correct answer ("How many days is the return period?" → "14 days"). For others it is impossible to write a single golden answer; in that case you write criteria instead of an answer: "The answer must state 14 days, must say returns are not unconditional, and must use a polite tone." These criteria also form the basis for the LLM-as-judge I'll describe shortly.
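As a rough illustration, one eval example could be stored like this. The class and field names are my own assumptions, not a standard schema, and the second question is invented to match the criteria quoted above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalExample:
    """One row of a golden set (field names are illustrative)."""
    question: str
    expected_answer: Optional[str] = None                        # single golden answer, if one exists
    acceptance_criteria: list[str] = field(default_factory=list)  # used when no single answer fits
    relevant_doc_ids: list[str] = field(default_factory=list)     # needed later for retrieval metrics

    def is_gradable(self) -> bool:
        # An example is only useful if it states what "good" means.
        return self.expected_answer is not None or bool(self.acceptance_criteria)


eval_set = [
    EvalExample(
        question="How many days is the return period?",
        expected_answer="14 days",
        relevant_doc_ids=["returns-policy"],  # hypothetical document id
    ),
    EvalExample(
        question="Can I return a product for any reason?",  # invented example question
        acceptance_criteria=[
            "states 14 days",
            "says returns are not unconditional",
            "uses a polite tone",
        ],
    ),
]
```

A quick `all(e.is_gradable() for e in eval_set)` check before each run catches rows that were added without any definition of a good answer.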

Setting up the eval set once and leaving it is not enough. Every new failure the system produces is a new example that should be added to the set. I call this "domesticating the failure": when you see a bug, you capture it and put it into the eval set, so that bug can never quietly return. Over time, your eval set becomes a map of all your system's weak spots.

### Offline and online evaluation

We're talking about two different worlds. **Offline evaluation** is the measurement you do on a fixed eval set, before going to production, in a controlled environment. It is reproducible, cheap, and fast; often within minutes you can see whether a prompt change or a new model version improved things. **Online evaluation** is the measurement you do while the system is live, on real user traffic: user feedback, click behavior, abandonment rates, A/B tests. Offline answers "how good is it under lab conditions?"; online answers "does it actually work in the real world?" You need both, and you cannot substitute one for the other.

## LLM-as-judge: making AI the referee

Checking the expected answer with a string comparison ("is the answer exactly this?") usually doesn't work for language models, because there are hundreds of different phrasings of the right answer. This is where **LLM-as-judge** comes in: you use a language model as a referee that evaluates the answer produced by another model. This method has democratized evaluation in recent years because it is fast, scalable, and far cheaper than human evaluation. But it's like a knife: hold it right and it does the job; hold it wrong and it cuts your hand.

There are two basic modes of use. **Pairwise comparison**: you give the judge two answers (say, model A's and model B's) and ask "which is better?" Like humans, models struggle to give absolute scores but are good at comparison; that's why pairwise usually yields more reliable results. It is ideal when choosing between two models or two prompt versions. **Rubric/criteria-based scoring**: you give the judge an answer and clear criteria ("Score from 1 to 5 on accuracy, faithfulness, tone, and completeness, with a rationale for each") and ask for an absolute score. This is more appropriate when you want to measure a single answer against a specific standard.

### Known biases of LLM-as-judge

I have to be honest here, because excessive optimism on this topic is dangerous. LLM judges have systematic biases, and using them without knowing these will mislead you:

- **Position bias:** In pairwise comparison, the judge tends to favor an answer because of where it appears, most often the one shown first. The fix: run each comparison twice, swapping the order; only count an answer as genuinely "better" if it wins in both orders.
- **Verbosity bias:** The judge tends to prefer the longer, more detailed-looking answer even when its content isn't better. This quietly pushes your systems toward producing needlessly long answers. Explicitly adding a "be concise and free of redundant repetition" criterion to your rubric counterbalances this.
- **Self-preference bias:** A model tends to favor answers it produced itself or that resemble its own style. So a judge from the GPT family may nudge GPT's answers up; another family may nudge its own. Where possible, choose the judge model from a different family than the model being evaluated.

In addition, judges can be inconsistent (different scores for the same answer at different times), overly generous (giving everything a 4-5), and wrong on topics requiring complex reasoning.
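The order-swapping fix for position bias can be sketched in a few lines. Here the judge is any callable that returns `"first"` or `"second"`; that interface is an assumption for illustration, and the two toy judges stand in for a real LLM call.

```python
from typing import Callable

# A judge takes (question, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging the pair in both orders."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the verdict flipped with the order, so neither answer counts as better


def always_first(question: str, first: str, second: str) -> str:
    # Toy judge that is fully position-biased.
    return "first"


def prefers_shorter(question: str, first: str, second: str) -> str:
    # Toy judge with a consistent, order-independent preference.
    return "first" if len(first) <= len(second) else "second"
```

With `always_first`, every comparison comes back as `"tie"`, which is exactly the signal you want: a high tie rate on real data suggests the judge is reacting to position rather than content.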

### Calibration: test the judge before you trust the judge

Instead of trusting LLM-as-judge blindly, you must **calibrate it against human labels**. In practice I do this: a few hundred examples are first evaluated by humans (the golden human label), then the same examples are evaluated by the LLM judge. I measure the agreement between the two. If the judge shows high agreement with human decisions, I can trust that judge for that task and use it to extend the human judgment to volumes that humans could not label by hand. If agreement is low, I need to clarify the rubric, change the judge model, or leave that task to humans. The key point: **the LLM judge does not replace human judgment; it scales it.** You must first capture that judgment at small scale, then verify how well the judge imitates it.
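The post does not prescribe a particular agreement statistic. One common choice is Cohen's kappa, which corrects raw agreement for the agreement you would expect by chance. A minimal sketch, assuming categorical labels such as `"pass"` / `"fail"`:

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if each rater labelled at random with their own label frequencies.
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(human_labels, judge_labels))  # 0.5 (raw agreement is 0.75)
```

The gap between 0.75 raw agreement and 0.5 kappa is the reason to prefer a chance-corrected number: a generous judge that passes almost everything can look agreeable on raw agreement alone.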

When do I trust it, and when don't I? On relatively surface-level and clear topics — tone, format, fluency, adherence to explicit criteria — the LLM judge is excellent. On deep domain expertise, legal/medical accuracy, subtle logical errors, security flaws, I never trust the judge alone; there a human expert is essential.

## RAG-specific evaluation: the RAGAS framework

In RAG (Retrieval-Augmented Generation) systems things are more layered, because there are two separate engines: the one that retrieves information (retrieval) and the one that produces the answer (generation). Frameworks like RAGAS offer metrics that let you measure these two layers separately. This separation is sacred to me, because half the diagnosis lives here.

I can describe the four core RAGAS-style metrics like this:

- **Faithfulness (groundedness):** Is the produced answer genuinely grounded in the retrieved context, or is the model making things up on its own? This is a direct measure of hallucination. It checks whether each claim in the answer is supported by the retrieved documents. Low faithfulness means the model is "fabricating."
- **Answer relevance:** Does the answer actually address the question asked, or does it circle the topic and leave the real question unanswered? Correct but irrelevant answers score low here.
- **Context precision:** Among the retrieved documents, are the genuinely relevant ones at the top? That is, is retrieval making noise with irrelevant documents, or is it on target? High precision means the model is handed a clean table.
- **Context recall:** Was all the information needed for the answer present in the retrieved context? If the document needed for the correct answer was never retrieved, the model cannot answer that question no matter how good it is. Low recall shows that retrieval fell short.

Reading these four together yields a diagnosis. Say the answer is bad. If faithfulness is low, the model is fabricating (a generation problem — fix the prompt). If context recall is low, the right document never arrived (a retrieval problem — look at indexing/chunking/embeddings). If context precision is low, the right document arrived but drowned in noise (look at re-ranking). You see: the same "bad answer" symptom has three completely different cures, and without metrics you won't know which one to apply.
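The diagnostic reading above can be written down as a small rule table. This is only a sketch: the `diagnose` function is hypothetical, and the 0.7 threshold is an arbitrary placeholder that you would tune against your own baseline.

```python
def diagnose(faithfulness: float, context_recall: float, context_precision: float,
             threshold: float = 0.7) -> list[str]:
    """Map low RAG metrics to the layer most likely responsible."""
    findings = []
    if context_recall < threshold:
        findings.append("retrieval: right document missing; check indexing, chunking, embeddings")
    if context_precision < threshold:
        findings.append("ranking: relevant documents buried in noise; consider re-ranking")
    if faithfulness < threshold:
        findings.append("generation: answer not grounded in context; tighten the prompt")
    return findings


# Example: answers are faithful to what was retrieved, but the right document rarely arrives.
print(diagnose(faithfulness=0.9, context_recall=0.4, context_precision=0.9))
```

Retrieval checks come first on purpose: if the document never arrived, a low faithfulness score downstream is a symptom rather than the cause.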

## Retrieval metrics: measuring the retrieval layer

Independent of generation, we measure how good the retrieval layer alone is with classic information-retrieval metrics. For this you need a set that contains, for each query, the information of "which document(s) were genuinely correct."

- **Recall@k:** What fraction of the genuinely relevant documents appear among the first k results? It is the most critical metric for catching cases where the right document never reached the system. In RAG, recall is usually more important than precision, because if the right document isn't in the first k, the model can never access that information.
- **MRR (Mean Reciprocal Rank):** The average, over all queries, of 1/rank of the first correct document (1.0 if it is always first, 0.5 if it is always second). It measures how high the correct answer rises; especially useful in single-correct-document scenarios.
- **nDCG (normalized Discounted Cumulative Gain):** The richest metric, taking into account both the degree of relevance and the ranking. It values hits at the top more; it is the most informative measure in multi-document, graded-relevance scenarios.
- **Hit rate:** In what fraction of queries did at least one relevant document appear in the first k? A simple but powerful health indicator; it answers "in how many queries did retrieval miss entirely?"

The value of measuring these separately is this: you can optimize retrieval on its own without ever running generation. Decisions like changing the embedding model, tuning the chunk size, or adding a re-ranker can be tested quickly through these metrics, and the tests are cheap and deterministic.
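For reference, here is a plain-Python sketch of the four retrieval metrics for a single query. The function names are my own; MRR and hit rate are then the averages of `reciprocal_rank` and `hit_at_k` over all queries. The nDCG version assumes the retrieved list contains no duplicate document ids.

```python
import math


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def hit_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & relevant else 0.0


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], gains: dict[str, float], k: int) -> float:
    """nDCG with graded relevance; `gains` maps doc id to a relevance grade."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(i + 2)
              for i, doc_id in enumerate(retrieved[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d3", "d1", "d7"]   # ranked output of the retriever (hypothetical ids)
relevant = {"d1", "d2"}          # ground-truth relevant documents for this query
print(recall_at_k(retrieved, relevant, k=3))  # 0.5
print(reciprocal_rank(retrieved, relevant))   # 0.5
```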

## Hallucination and groundedness detection

The thing corporations fear most is hallucination: the model fabricating wrong information in a confident tone. RAG, when set up correctly, reduces hallucination but does not eliminate it. Groundedness detection means taking each produced claim and checking, "is this claim supported in the retrieved sources?" I do this both automatically (giving an LLM judge the answer and the sources and asking whether each sentence is supported) and as human review on a sample. In high-risk areas (law, health, finance), it is vital to teach the system the behavior of "don't say anything you can't find in the source; if you're unsure, say 'I don't know'", and to measure this in eval. A corporate assistant that learns to say "I don't know" is often more valuable than one that gives cleverer answers.
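The sentence-by-sentence check can be sketched as follows. The support check is passed in as a callable so that an LLM judge can be plugged in; the substring-based `naive_substring_check` is only a toy stand-in so the example runs offline, and the sentence splitter is deliberately simplistic.

```python
import re
from typing import Callable

# Returns True when `claim` is supported by `sources`; in practice this would be an LLM judge call.
SupportCheck = Callable[[str, list[str]], bool]


def groundedness(answer: str, sources: list[str],
                 is_supported: SupportCheck) -> tuple[float, list[str]]:
    """Return (share of supported sentences, list of unsupported sentences)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 1.0, []
    unsupported = [s for s in sentences if not is_supported(s, sources)]
    return 1 - len(unsupported) / len(sentences), unsupported


def naive_substring_check(claim: str, sources: list[str]) -> bool:
    # Toy stand-in: far too strict for real use, where paraphrases must also count.
    needle = claim.rstrip(".!?").lower()
    return any(needle in src.lower() for src in sources)


answer = "Returns are accepted within 14 days. Shipping is always free."
sources = ["Returns are accepted within 14 days of delivery."]
score, flagged = groundedness(answer, sources, naive_substring_check)
print(score, flagged)  # 0.5 ['Shipping is always free.']
```

Returning the flagged sentences alongside the score matters in practice: the list is what a human reviewer reads when sampling high-risk answers.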

## Regression testing in CI: catching silent breakages

AI systems have a sneaky side: while fixing one place you can quietly break another. You improve a prompt for one question, but that change breaks five other questions and you don't notice, because you didn't re-test those five. That is exactly why I strongly recommend putting the eval set into your **continuous integration (CI) pipeline**. Just as unit tests run on every commit in classic software, here the eval set should run automatically on every prompt or model change, and the change should be stopped if metrics fall below a threshold.

Think of this as "prompt regression testing." You're switching from GPT-4 to a newer version, or trying another provider's model: making that decision without running the eval set and laying the before-and-after metrics side by side is like changing lanes with your eyes closed. When model providers quietly update their models, the same eval protects you from the "worked yesterday, broken today" surprise. Automated eval is the price of sleeping soundly at night in AI projects, and it's a cheap price.
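A CI gate of this kind can be as simple as comparing the current run against the frozen baseline. In this sketch the metric names, the numbers, and the 0.02 tolerance are all made up for illustration; a CI job would fail the build when the returned list is non-empty.

```python
def regression_failures(baseline: dict[str, float], current: dict[str, float],
                        max_drop: float = 0.02) -> list[str]:
    """List metrics that fell more than `max_drop` below baseline, or went missing."""
    failures = []
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None:
            failures.append(f"{name}: missing from current run")
        elif base_value - value > max_drop:
            failures.append(f"{name}: {base_value:.3f} -> {value:.3f}")
    return failures


baseline = {"recall@5": 0.91, "faithfulness": 0.88}  # illustrative numbers only
current = {"recall@5": 0.92, "faithfulness": 0.81}
problems = regression_failures(baseline, current)
print(problems)  # ['faithfulness: 0.880 -> 0.810']
```

A small tolerance avoids failing builds on noise from non-deterministic model outputs, while still blocking real drops.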

## A/B testing and production monitoring

No matter how good offline metrics are, the real decision is made with real users. **A/B testing** lets you show two versions (say, two prompts or two models) to different slices of real traffic and measure which performs better in real user behavior: resolution rate, user satisfaction, follow-up questions, abandonment. You'll surprisingly often see the version that won offline lose online; that's why you must run the two together.
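When comparing something like resolution rate between two variants, it helps to check that the difference is larger than noise. The post does not prescribe a statistical test; a two-proportion z-test is one standard option, sketched here with the standard library only and with invented counts.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # both variants have identical, degenerate rates
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Invented example: variant A resolved 500 of 1000 conversations, variant B 600 of 1000.
z, p = two_proportion_z_test(500, 1000, 600, 1000)
print(round(z, 2), p < 0.001)  # 4.49 True
```

This approximation assumes reasonably large samples and independent users per variant; for small traffic slices, treat the result with caution.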

Production monitoring (observability) is continuously keeping a finger on the pulse. I monitor three things: **drift** (is the query distribution changing over time, are users now asking very different things, is the model's performance dropping on certain clusters?), **user feedback** (like/dislike, free-text comments, escalation rate), and **traces** (the end-to-end trail of each request: which documents were retrieved, what was sent to the model, what came back, how long it took). A good trace infrastructure lets you answer "what exactly happened?" within minutes when a user complaint comes in. Running a live AI system without monitoring is like flying a plane without instruments.
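As a rough picture of what a trace can contain, here is a minimal record with the fields listed above. The `Trace` class and its field names are assumptions for illustration, not the schema of any particular observability tool.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Trace:
    """End-to-end trail of one request (illustrative schema)."""
    request_id: str
    query: str
    retrieved_doc_ids: list[str]     # which documents were retrieved
    prompt: str                      # what was sent to the model
    response: str                    # what came back
    latency_ms: float                # how long it took
    feedback: Optional[str] = None   # e.g. "like" / "dislike", filled in later
    timestamp: float = field(default_factory=time.time)

    def to_json(self) -> str:
        # ensure_ascii=False keeps Turkish characters readable in the log.
        return json.dumps(asdict(self), ensure_ascii=False)


trace = Trace(
    request_id="r1",
    query="iade süresi kaç gün?",
    retrieved_doc_ids=["returns-policy"],
    prompt="(full prompt text)",
    response="14 gün",
    latency_ms=120.5,
)
print(trace.to_json())
```

Since traces contain user text, the same masking and retention rules that apply to eval data apply to them as well.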

## Human evaluation: the moment it is irreplaceable

After all this automation, let me say it clearly: human evaluation is never fully replaceable. When you enter a new use case, when you first establish the definition of a "good answer," in high-risk decisions, and when calibrating the LLM judge, humans are essential. Human evaluation is expensive and slow, but it provides something no other method can: the gold standard of real-world judgment.

Two things are critical to making human evaluation work. First, **good rubric design**: asking evaluators "is it good?" doesn't work, because everyone's "good" is different. Instead you describe it with clear dimensions and examples: "Accuracy: are the facts in the answer consistent with the source? 1 = serious error, 3 = minor gap, 5 = fully correct." Putting an example at each score level enormously increases consistency across evaluators. Second, **inter-annotator agreement**: giving the same examples to more than one person and measuring how much they agree. Low agreement usually doesn't mean your evaluators are bad; it means your rubric is ambiguous, and you clarify the rubric to raise agreement. The quality of human evaluation is the quality of your rubric.

## The Turkey and Turkish dimension: inherited off-the-shelf metrics will mislead you

This is the point I insist on most in the field. The overwhelming majority of the AI evaluation literature and off-the-shelf benchmarks are in English, and **these benchmarks do not transfer to Turkish.** Assuming that because a model scores great on an English benchmark it will be equally good in Turkish is one of the most common and most expensive fallacies I see in corporate projects. Turkish's agglutinative structure, rich morphology, corporate jargon, abbreviations, and users' real writing habits can only be captured with a Turkish eval set.

That is why **building your own Turkish eval set** is non-negotiable. Without a set fed from real Turkish user questions, containing your domain's terminology, and testing Turkish-specific subtleties (upper/lowercase, Turkish characters, formal/casual tone shifts), you cannot know which model is genuinely good for you. You must also separately verify that the LLM-as-judge is good at evaluating in Turkish; don't assume that because a judge is good in English it's good in Turkish — calibrate it too against Turkish human labels.

The second critical dimension is **KVKK** (Turkey's personal data protection law). Feeding eval sets from real production data is very valuable, but that data often contains personal data. When using production logs in eval, you must mask/anonymize personal data, ensure the data-processing purpose permits it, authorize access, and comply with retention periods. "Let's test with real data" is a good engineering instinct, but if applied without being disciplined within the KVKK framework, it creates legal risk. My practical approach: take a representative subset of production questions, clean out the personal data, and build a permanent, shareable eval set.
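A first masking pass over production questions might look like the sketch below. The patterns are simplified assumptions (email addresses, Turkish mobile numbers, 11-digit identity numbers) and will miss names, addresses, and many other formats, so this is a starting point for a review process and not a compliance guarantee.

```python
import re

# (placeholder, pattern) pairs, applied in order. All patterns are deliberately simple.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    # 11-digit number not starting with 0, the shape of a Turkish identity number.
    ("[ID_NUMBER]", re.compile(r"\b[1-9]\d{10}\b")),
    # Turkish mobile numbers such as 0532 123 45 67 or +90 532 123 45 67.
    ("[PHONE]", re.compile(
        r"(?<!\d)(?:(?:\+90|0)\s?)?5\d{2}[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}\b")),
]


def mask_personal_data(text: str) -> str:
    """Replace matches of each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


raw = "My email is ayse@example.com, phone 0532 123 45 67, ID 12345678901."
print(mask_personal_data(raw))
# My email is [EMAIL], phone [PHONE], ID [ID_NUMBER].
```

Free-text names and indirect identifiers need a separate step (for example human review of the sampled subset) before the set is shared.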

The third is the **EU AI Act**. For organizations in Turkey that provide products/services to the EU market or have EU customers, this is increasingly binding. For high-risk AI systems in particular, the Act brings expectations of systematic testing, accuracy/robustness measurement, continuous monitoring, and record-keeping. So everything I've described in this post (eval sets, metrics, monitoring, regression testing) becomes not just good engineering but also a compliance requirement. If you build your evaluation infrastructure today, tomorrow you'll have a documented story to tell an auditor: "this is how we measure our quality and this is how we monitor it." You are making today an investment that will become mandatory later.

## A practical evaluation workflow

Let me distill everything so far into a concrete flow. The sequence I try to set up in organizations is roughly this:

1. **Build the eval set from real questions.** Start with 100-300 representative examples from production logs and support records. Clean out personal data. Add an expected answer or acceptance criteria to each example. A Turkish set for Turkish; a separate English set if you have English-speaking customers.
2. **Separate the layers.** Accept from the start that you'll measure retrieval and generation separately. For each query, also mark "which document was correct" so you can compute retrieval metrics.
3. **Build an offline metrics dashboard.** Recall@k, MRR, nDCG, hit rate for retrieval; faithfulness, answer relevance, context precision/recall for RAG; LLM-as-judge rubric scores for end-to-end. Freeze the first measurement as the "baseline."
4. **Calibrate the LLM judge.** Label a few hundred examples with humans, compare against the judge, measure agreement. Swap the order for position bias, add a "be concise" criterion to the rubric for verbosity bias, and choose the judge from a different model family where possible.
5. **Make changes through eval.** Measure every prompt/model/retrieval change before and after on the eval set. Wire this into CI; don't let a change pass if metrics drop below threshold.
6. **Ship and monitor.** Collect traces, capture user feedback, watch for drift. Validate important changes with A/B testing on real traffic.
7. **Feed back.** Add every new failure type you see in production to the eval set. As the set grows, your system gets smarter; this is a loop, not a one-time project.

## Metrics table: which metric measures what

The table below is the summary I use as a quick reference in organizations.

| Metric | What it measures | Layer | When to watch out |
|---|---|---|---|
| Recall@k | Share of relevant documents found in the first k | Retrieval | If low: right info never reaches the model |
| MRR | Mean of 1/rank of the first correct document | Retrieval | If low: correct document sits in later ranks |
| nDCG | Rank + degree of relevance | Retrieval | Most informative in multi-doc, graded scenarios |
| Hit rate | Did at least one relevant doc appear | Retrieval | Catches queries where retrieval missed entirely |
| Faithfulness | Is the answer faithful to the source | Generation | If low: hallucination, tighten the prompt |
| Answer relevance | Does the answer address the question | Generation | If low: correct but irrelevant answers |
| Context precision | Are relevant docs at the top | Retrieval/Re-rank | If low: noise, consider re-ranking |
| Context recall | Was needed info in the context | Retrieval | If low: index/chunking/embedding problem |
| LLM-as-judge (rubric) | End-to-end quality, tone, completeness | End-to-end | Calibrate against humans first; mind the biases |
| Human evaluation | Gold-standard judgment | End-to-end | Indispensable for high risk and calibration |

## Common pitfalls

Let me gather the mistakes I see again and again in the field; knowing these up front saves you a lot of time:

- **Building the eval set from imaginary questions.** A set not fed from real user questions pushes you to optimize a system that doesn't exist.
- **Not separating retrieval and generation.** The prompt gets the blame whenever the answer is bad, when the problem is often that the right document never reached the model.
- **Trusting the LLM judge blindly.** Relying on judge scores without calibration and without knowing the biases means you are not actually measuring what you think you're measuring.
- **Ignoring position and verbosity bias.** Pairwise comparison without swapping the order, and the "long = good" trap, produce systematically wrong results.
- **Relying on English benchmarks and not building a Turkish set.** Assuming a model good in English will be good in Turkish is the most expensive fallacy.
- **Not wiring eval into CI.** Manual, occasional evaluation cannot catch silent regressions.
- **Relying on offline only.** The winner in the lab can lose with real users; online validation is essential.
- **Skipping KVKK.** Using production data in eval without cleaning it is a legal risk.
- **Not teaching the model to say "I don't know" instead of hallucinating.** A model that speaks confidently without being sure is the most dangerous failure mode in a high-risk domain.
- **Thinking eval is a one-time project.** Evaluation is a loop; it grows with every new failure.

## Where to go from here

If you have an AI system and you still talk about its quality in terms of "it looks good," I want you to take one thing from this post: this week, take 50 of your real user questions and write them into a table along with their expected answers. That table is your first eval set. Run your system over it once and see the numbers. Most likely your first reaction will be "wow, this is different from what I thought" — because the moment you measure, you also see the gap between your intuition and reality.

Evaluation is not the brake in AI projects; it is the steering wheel. It doesn't slow you down; it lets you drive in the right direction. The moment you move from "it seems to work" to "it works in this way, on this eval set, with these metrics," your project stops being a demo and turns into a product you can trust. And the most genuine thing I can tell you as a consultant is this: organizations that start measuring move much faster than those that don't, because they are no longer walking in the dark but in the light. The light you kindle today with a small eval set will, tomorrow, illuminate your entire AI strategy.