Reranking and Hybrid Search in RAG: Production Quality with Cross-Encoders (2026)

If your RAG talks nonsense, the problem is usually retrieval, not generation. How to lift quality with BM25 + dense hybrid search, RRF, and cross-encoder reranking.

Şükrü Yusuf KAYA
AI Expert · Enterprise AI Consultant

TL;DR: In RAG projects, most of the quality is lost in retrieval, not in generation. What I keep seeing in the field is simple: fix your search before you touch your model. The right recipe is a two-stage architecture. First, a fast retrieval stage runs BM25 and dense (bi-encoder) search side by side and merges their results via Reciprocal Rank Fusion (RRF) into ~100 candidates; then a cross-encoder reranker narrows those down to the top 10. For Turkish, this isn't a preference, it's nearly mandatory: our language is agglutinative, so BM25 acts as a safety net while the cross-encoder captures meaning. In the KVKK (Turkish data protection) world, reranker choice becomes a data sovereignty decision: open-source, self-hostable models like BGE, or managed APIs where data leaves your environment? In this article I walk through the whole pipeline, piece by piece, with examples from the field.

First, an Honest Confession: The Problem Usually Isn't the Model

The sentence I hear most often in enterprise RAG projects is: "We upgraded to the latest GPT version, but the answers still aren't satisfying." Then a long debate begins — maybe we should change the prompt, maybe switch to a model with a bigger context window, maybe do fine-tuning.

After building dozens of RAG systems in the field and auditing many more, I can tell you something very clearly: the vast majority of problems are not in the generation model, but in the retrieval layer. In other words, if the model is giving bad answers, most of the time the document chunks you fed it were wrong from the start. If you feed the model garbage, even the best model in the world will hand you back polished garbage.

Let me explain with an analogy. Say you have a very talented lawyer, but you keep handing them the wrong files for the case. No matter how brilliant the lawyer is, if the documents in their hands are irrelevant, their defense will be irrelevant too. In RAG, the generation model is that lawyer; the retrieval layer is the assistant deciding which files reach them. We usually try to replace the lawyer, when what we actually need is to fix the assistant.

That is exactly the central thesis of this article: most of RAG quality is won or lost in retrieval. And the most effective, most concrete way to fix retrieval is to build a two-stage search architecture. Let's unpack this step by step.

Two-Stage Retrieval: Why Isn't One Stage Enough?

In classic RAG architecture there is a single stage: you feed the query to an embedding model, search for nearest neighbors in a vector database, take the top k chunks, and hand them to the model. We call this "search with a bi-encoder."

What does a bi-encoder do? It turns the query and the documents into vectors separately. It vectorizes documents in advance and stores them in the database; when a query arrives, it only vectorizes the query and computes similarity. This is very fast, because the documents were precomputed. You can search across millions of documents in milliseconds.

But this speed comes at a cost. The bi-encoder never sees the query and the document side by side. It vectorizes each on its own, unaware of the other. As a result, it can't capture subtle differences in meaning or the real interaction between query and document. "Restructuring credit card debt" and "rejection of a credit card application" can look dangerously similar to a bi-encoder because the words overlap, even though the intent is entirely different.

This is where the cross-encoder comes in. A cross-encoder processes the query and the document together, at the same time. It feeds both as a single input to the model, and the model produces a far more nuanced relevance score by seeing how the words interact. The result: a much more accurate ranking.

So why don't we do everything with a cross-encoder? Because it's very expensive. A cross-encoder requires a separate computation for every query-document pair. If you have a million documents, each query would require processing a million pairs — impossibly slow. That's why the smart solution is to combine the two:

  1. First stage (retrieve): Quickly pull ~100 candidates from millions of documents using a fast bi-encoder and BM25.
  2. Second stage (rerank): Re-rank only those 100 candidates with the slow but accurate cross-encoder, and pick the top 10.
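The two stages above can be sketched in a few lines. This is a minimal illustration, not a production implementation: `two_stage_search`, `toy_cheap_score`, and `toy_cross_score` are hypothetical names, and both scorers are simple word-overlap stand-ins for a real BM25/bi-encoder index and a real cross-encoder model. The point is only the shape of the pipeline: the expensive scorer runs on the shortlist, never on the whole corpus.

```python
from typing import Callable, List, Tuple

Scorer = Callable[[str, str], float]

def two_stage_search(query: str, corpus: List[str], cheap_score: Scorer,
                     expensive_score: Scorer, n_candidates: int = 100,
                     top_k: int = 10) -> List[Tuple[int, float]]:
    """Return (doc_index, rerank_score) pairs for the top_k documents."""
    # Stage 1 (retrieve): rank with the cheap scorer and keep a shortlist.
    # A real system would query a prebuilt index instead of scanning the corpus.
    shortlist = sorted(range(len(corpus)),
                       key=lambda i: cheap_score(query, corpus[i]),
                       reverse=True)[:n_candidates]
    # Stage 2 (rerank): the expensive scorer sees only the shortlist.
    reranked = sorted(((i, expensive_score(query, corpus[i])) for i in shortlist),
                      key=lambda pair: pair[1], reverse=True)
    return reranked[:top_k]

# Toy stand-ins (assumptions for illustration only).
EXPENSIVE_CALLS = {"n": 0}

def toy_cheap_score(query: str, doc: str) -> float:
    # Bag-of-words overlap: fast, ignores word order.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def toy_cross_score(query: str, doc: str) -> float:
    # Looks at query and document together: shared adjacent word pairs count most.
    EXPENSIVE_CALLS["n"] += 1
    q, d = query.lower().split(), doc.lower().split()
    shared_bigrams = set(zip(q, q[1:])) & set(zip(d, d[1:]))
    return len(shared_bigrams) + 0.1 * len(set(q) & set(d))

corpus = [
    "rejection of a credit card application",
    "restructuring credit card debt in installments",
    "opening a savings account",
    "credit score basics",
]
result = two_stage_search("credit card debt restructuring options", corpus,
                          toy_cheap_score, toy_cross_score,
                          n_candidates=2, top_k=1)
```

With `n_candidates=2`, the expensive scorer is called twice regardless of corpus size, which is the whole reason the cascade is affordable.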

Benchmarks clearly support this approach: two-stage retrieval (bi-encoder retrieve + cross-encoder rerank) significantly outperforms single-stage pure vector search across various test sets. Complex queries in particular benefit more from reranking — for a simple keyword search the difference may be small, but for multi-criteria, long, ambiguous questions, reranking rescues the answer.

Hybrid Search: Combining BM25 + Dense with RRF

Now let's go deeper on the first stage, because this is where most teams fall short. Many RAG systems rely on dense (vector) search alone. But pure vector search is weak in two situations:

  • Rare, specific terms: Product codes, statute article numbers, proper nouns, abbreviations. For a query like "KVKK Law No. 6698, Article 11," vector search captures the general meaning but may miss the exact term match.
  • Cases requiring exact phrase matching: If a user is searching for a specific sentence or code, semantic similarity is not enough.

Here an old but powerful friend enters the scene: BM25. BM25 is a classic keyword (lexical) search algorithm. It looks at exact word matches and gives more weight to rare words. It catches the exact term matches that vector search misses. Dense search captures "meaning," BM25 captures "the word." The two complement each other.
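To make "exact word matches, with more weight for rare words" concrete, here is a small, self-contained sketch of Okapi BM25 scoring. The class name `TinyBM25` is hypothetical, the parameter values `k1=1.5` and `b=0.75` are common defaults rather than anything prescribed here, and the IDF uses the non-negative variant `ln((N - n + 0.5) / (n + 0.5) + 1)`. A production system would normally rely on a search engine's built-in BM25 rather than code like this.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

class TinyBM25:
    """Minimal Okapi BM25 over pre-tokenized documents (illustrative only)."""

    def __init__(self, docs: List[List[str]], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(doc) for doc in docs]          # term frequencies per doc
        self.doc_len = [len(doc) for doc in docs]
        self.avgdl = sum(self.doc_len) / len(docs)        # average document length
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))
        # Rare terms (small document frequency) get a larger IDF weight.
        self.idf: Dict[str, float] = {
            term: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
            for term, n in df.items()
        }

    def score(self, query: List[str], idx: int) -> float:
        total = 0.0
        norm = self.k1 * (1 - self.b + self.b * self.doc_len[idx] / self.avgdl)
        for term in query:
            freq = self.tf[idx].get(term, 0)
            if freq == 0:
                continue  # BM25 only rewards exact token matches
            total += self.idf[term] * freq * (self.k1 + 1) / (freq + norm)
        return total

    def top_n(self, query: List[str], n: int) -> List[Tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.tf))]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

docs = [
    "kvkk law no 6698 article 11 rights of the data subject".split(),
    "general principles of personal data protection law".split(),
    "article 5 conditions for processing personal data".split(),
]
bm25 = TinyBM25(docs)
best = bm25.top_n("6698 article 11".split(), n=1)
```

In this toy corpus the token `6698` appears in one document and `article` in two, so `6698` carries the larger IDF weight, which is exactly the behavior that rescues statute numbers and product codes.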

The pattern I recommend and apply in the field is this:

  • Top 50 documents from BM25
  • Top 50 documents from dense (vector) search
  • Merge these two via Reciprocal Rank Fusion (RRF) into up to ~100 candidates (fewer when the same document appears in both lists)
  • Reduce those 100 candidates to the top 10 with a cross-encoder reranker

Why Does RRF Work So Well?

Reciprocal Rank Fusion is a surprisingly simple and robust way to merge two different ranked lists. The logic: for each document, look at its rank in each list and assign a score with the formula 1 / (k + rank) (k is usually taken as 60). The document's scores across the two lists are summed. This way, documents that rank high in both BM25 and dense naturally climb to the top.

The beauty of RRF is that it doesn't try to normalize the scores of the two systems. BM25 scores and cosine similarity scores live on completely different scales, and converting one into the other is a headache. RRF sidesteps this problem entirely; it only looks at the ranks, not the raw scores. That's why it works very robustly in practice and needs almost no tuning: the constant k is the only knob, and the usual default of 60 is a reasonable starting point.
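The formula 1 / (k + rank) is short enough to implement directly. The sketch below assumes each input list is already sorted best-first and contains document IDs; the function name `reciprocal_rank_fusion` and the sample IDs are illustrative, and ties are broken by document ID purely to keep the output deterministic.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60) -> List[Tuple[str, float]]:
    """Fuse several best-first ranked lists of doc IDs into one ranked list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Only the rank matters; raw BM25 or cosine scores are never used.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc ID as a deterministic tie-breaker.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

bm25_top = ["d3", "d1", "d7"]    # hypothetical BM25 result, best first
dense_top = ["d1", "d9", "d3"]   # hypothetical dense result, best first
fused = reciprocal_rank_fusion([bm25_top, dense_top])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here `d1` (ranks 2 and 1) edges out `d3` (ranks 1 and 3), and both beat the documents that appear in only one list.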

| Approach | Strength | Weakness | When to Use |
|---|---|---|---|
| BM25 only | Exact terms, rare words, code matching | Misses synonyms and meaning | Legal text, product-code-heavy searches |
| Dense only | Meaning, synonyms, paraphrase | Weak on rare terms, exact matching | Natural language, conceptual questions |
| Hybrid (RRF) | The strength of both | Slightly more complex infrastructure | Nearly every serious RAG system |
| Hybrid + Rerank | Highest accuracy | Extra latency and cost | Production systems where quality is critical |

The Special Case of Turkish: Why an Agglutinative Language Changes Everything

Now we arrive at the most critical part of this article for anyone working in Turkey. Turkish is an agglutinative language. We produce new meanings by stacking suffixes onto a root word. From the word "ev" (house), we can derive a single word like "evlerimizdekilerden," which corresponds to a whole phrase in English.

This morphological richness seriously complicates exact term matching (lexical matching). A user searching for "sözleşme" (contract) will encounter inflected forms in the document such as "sözleşmenin," "sözleşmeye," "sözleşmelerdeki." A BM25 index that tokenizes on whitespace alone treats each of these inflections as a different word and misses the match. That's why setting up BM25 correctly for Turkish is very important: stemming, a Turkish-aware analyzer, a tokenizer that normalizes suffixes.

The practical conclusion is clear: for Turkish, hybrid search is not a preference, it's nearly mandatory. Dense search is more resilient to morphological variation in Turkish because it captures meaning; but it weakens on rare terms and exact matching. BM25 serves as a safety net but must be carefully tuned for Turkish morphology. When you combine the two, you get search that actually works in Turkish.
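To show what "tuned for Turkish morphology" means at the token level, here is a deliberately crude sketch. Two things are going on. First, lowercasing: Python's default `str.lower()` maps "I" to "i" and "İ" to "i" plus a combining dot, while Turkish needs "I" to become "ı" and "İ" to become "i". Second, suffix stripping: `TOY_SUFFIXES` is a tiny hand-picked list chosen only to cover the "sözleşme" example, and `toy_stem` will over-stem or under-stem real text. Both names are hypothetical; in production you would use a proper Turkish analyzer or morphological stemmer instead.

```python
from typing import Tuple

def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish dotted/dotless i distinction preserved."""
    # Handle the two special capitals before the generic lower() call.
    return text.replace("İ", "i").replace("I", "ı").lower()

# Toy list for illustration only; real Turkish has far more suffixes
# and vowel-harmony variants than this.
TOY_SUFFIXES: Tuple[str, ...] = (
    "lerdeki", "lardaki", "deki", "daki",
    "ler", "lar", "nin", "nın", "nun", "nün", "ye", "ya",
)

def toy_stem(token: str, min_stem: int = 4) -> str:
    """Repeatedly strip the longest matching toy suffix, keeping a minimum stem."""
    ordered = sorted(TOY_SUFFIXES, key=len, reverse=True)
    stripped = True
    while stripped:
        stripped = False
        for suffix in ordered:
            if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
                token = token[: -len(suffix)]
                stripped = True
                break
    return token

forms = ["sözleşme", "Sözleşmenin", "sözleşmeye", "SÖZLEŞMELERDEKİ"]
stems = {toy_stem(turkish_lower(form)) for form in forms}
```

After normalization all four surface forms collapse to the same index term, so a BM25 query for "sözleşme" can match documents that only contain the inflected forms.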

One more warning, and I want to stress it: always evaluate rerankers on a Turkish test set. A reranker that shines on English benchmarks may disappoint in Turkish. How well does the model understand inflected Turkish sentences? Does it capture idioms, formal language, and sector jargon (banking, health, law)? You can only find out by testing with your own Turkish data. Don't trust someone else's English leaderboard.

Reranker Families in 2026: Which One to Choose?

There are three reranker families I encounter most in the field, and the choice usually has less to do with technical quality and more with data sovereignty. Especially for my clients under KVKK, this decision must be clarified upfront.

  • BGE (BAAI): Open source and self-hostable. The safest option for KVKK and data sovereignty. Data never leaves your environment. In sectors like banking, health, and public services where data cannot go abroad or to a third-party API, this is the first family you'll look at. Its quality/cost balance is also excellent.
  • Jina: Stands out with its multilingual support. Definitely worth testing for Turkish, because models trained multilingually sometimes capture Turkish morphology surprisingly well. Try it on your own Turkish test set and decide.
  • Cohere Rerank: Offered as a managed API. Easy to set up, high quality, but data leaves your environment. This requires serious consideration for KVKK: where the data will be processed, transfer security, and contractual guarantees must all be on the table. For public or non-sensitive data it can be a practical choice; but if personal data or special-category personal data is involved, think twice.

A rule from the field: before recommending a reranker to a client, the first question I ask is not technical. I ask, "Can this data leave the environment?" If the answer is "no," the discussion runs through open source and self-hosting. If the answer is "yes, that's fine," managed APIs come to the table too. The law drives the architecture decision, and the technical part follows.

On the quality side, the expectation is this: a well-configured cross-encoder reranking typically brings a lift of +5 to +15 points on the NDCG@10 metric across benchmarks like MTEB/BEIR. That is not a small number — it directly reflects how accurate the top documents in the ranking are, which in turn directly determines the quality of the context that reaches the generation model.

The Invisible Foundation: Nothing Works Until You Chunk Correctly

Now we come to something most teams skip, but which I call the "invisible foundation": chunking, the strategy of splitting documents into pieces. Your reranker can be the best in the world, but if your chunks are bad, it can never rank a good chunk because there isn't one to begin with.

The principles I apply in the field for chunking are:

  • Structural chunking: Split the document not by a random character count, but by its natural structure — headings, paragraphs, sections. Don't cut in the middle of a sentence or half a table. Chunks that respect structure are chunks that preserve meaning.
  • Overlapping chunks: Leave some overlap between pieces. That way, if the start of a sentence is in one chunk and the end is in another, context isn't broken. Information lost at boundaries is the silent killer of RAG.
  • Contextual chunking: Add a prefix to the start of each chunk indicating where that piece came from — source document name, section heading, date. That way a chunk saying "according to Article 11..." carries within itself which law's Article 11 it means. This strengthens both retrieval and reranking noticeably.

Let me make it concrete with an example. Say you're chunking a bank's lending policy. The naive approach: cut every 500 characters. The result: a chunk ending with "...the interest rate is annually" and another chunk starting with "2.5% is applied..." Both are meaningless on their own. In the structural + overlapping + contextual approach, though: "[Lending Policy > Section 3: Interest] For consumer loans, the interest rate is applied at an annual 2.5%." A single chunk, fully meaningful, with a clear source. The reranker ranks that one with love.
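The three principles can be combined in one small function. This is a sketch under stated assumptions: the document has already been parsed into `(heading, paragraphs)` sections by some upstream step, overlap is measured in whole paragraphs, and `Chunk`, `chunk_document`, `max_chars`, and `overlap_paragraphs` are hypothetical names and parameters. Paragraphs are never split, so a chunk can exceed `max_chars` when a single paragraph is long.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chunk:
    source: str    # document name
    heading: str   # section the text came from
    text: str

    def render(self) -> str:
        # Contextual prefix: the chunk carries its own origin.
        return f"[{self.source} > {self.heading}] {self.text}"

def chunk_document(source: str, sections: List[Tuple[str, List[str]]],
                   max_chars: int = 400, overlap_paragraphs: int = 1) -> List[Chunk]:
    chunks: List[Chunk] = []
    for heading, paragraphs in sections:           # structural: never cross a section
        groups: List[List[str]] = []
        current: List[str] = []
        for para in paragraphs:                    # never cut inside a paragraph
            if current and sum(len(p) for p in current) + len(para) > max_chars:
                groups.append(current)
                current = []
            current.append(para)
        if current:
            groups.append(current)
        for i, group in enumerate(groups):
            # Overlap: repeat the tail of the previous group at the start.
            tail = groups[i - 1][-overlap_paragraphs:] if i > 0 and overlap_paragraphs > 0 else []
            chunks.append(Chunk(source, heading, " ".join(tail + group)))
    return chunks

policy = [("Section 3: Interest",
           ["For consumer loans, the interest rate is applied at an annual 2.5%."])]
rendered = chunk_document("Lending Policy", policy)[0].render()
```

For the lending-policy example, `rendered` is the single self-describing chunk shown above, prefix included.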

Embedding Choice: The Triangle of Dimension, Cost, and Sovereignty

The heart of the first stage is the embedding model. Think about the choice along three axes:

  1. Dimension vs cost: Higher-dimensional embeddings theoretically carry more information, but storage and compute costs rise and search slows down. The sweet spot I see in the field is the 768-1024 dimension range. Going above that rarely justifies the cost in most enterprise scenarios.
  2. Turkish / multilingual support: How well does the model know Turkish? An embedding model trained only in English will be weak in Turkish. Prefer multilingual or Turkish-specific models and — again — test with your own Turkish data.
  3. Data sovereignty: In areas like banking, health, and public services, you may need to choose a self-hosted, open-source embedding model too. Even sending data to an embedding API is a transfer under KVKK and must be evaluated.

These three axes often conflict. The highest-dimensional, highest-quality model might live in a managed API, but if your data can't leave the environment, you can't use that model. Then you settle for a self-hosted, slightly lower-dimensional but sovereignty-safe model. This is not a purely engineering decision; it is a balance between legal and engineering constraints.
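The dimension-versus-cost axis is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes raw float32 vectors (4 bytes per value) and ignores index overhead, metadata, and replication, all of which add to the real footprint. The corpus size of 10 million chunks and the 3072-dimension comparison point are hypothetical numbers chosen for illustration.

```python
def raw_vector_size_gib(n_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Storage for the raw vectors alone, in GiB (float32 by default)."""
    return n_vectors * dim * bytes_per_value / 2**30

N_CHUNKS = 10_000_000  # hypothetical corpus size

sizes = {dim: raw_vector_size_gib(N_CHUNKS, dim) for dim in (768, 1024, 3072)}
for dim, gib in sizes.items():
    print(f"{dim:>5} dims: {gib:7.2f} GiB")
```

Storage scales linearly with dimension, so quadrupling the dimension quadruples the raw footprint, and brute-force similarity computation grows in the same proportion.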

Measuring Every Layer: You Can't Get Anywhere Without a Golden Test Set

Now we come to the most neglected but most value-producing topic: measurement. "The answers look better" is not a metric. You need to measure each layer of RAG separately, because only that way can you understand which layer the problem is coming from.

Layer-by-layer metrics:

  • Retrieval layer — recall@k: Did the document containing the correct answer appear among the top k candidates? If recall is low, the problem is not in the reranker but in the first stage. A reranker cannot rank a document that isn't there.
  • Reranking layer — NDCG@10 and MRR: How well were the top 10 results ranked? NDCG@10 measures the quality of the ranking; MRR (Mean Reciprocal Rank) measures how high up the first correct answer sits.
  • Generation layer — faithfulness and answer relevancy: Did the model stay faithful to the given context or did it make things up (faithfulness)? Does the answer actually answer the question (answer relevancy)?
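The retrieval and reranking metrics above are simple enough to compute without a framework. A minimal sketch follows; the function names are illustrative, `retrieved` is a best-first list of document IDs, and NDCG uses the common `gain / log2(rank + 1)` discount with graded relevance passed in as a dictionary. MRR is the mean of `reciprocal_rank` over all test questions. The generation-layer metrics need a judge (human or model) and are not covered here.

```python
import math
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: List[str], gains: Dict[str, float], k: int) -> float:
    """Discounted gain of the actual ranking divided by that of the ideal ranking."""
    dcg = sum(gains.get(doc_id, 0.0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 1) for rank, gain in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

retrieved = ["d2", "d5", "d1"]   # hypothetical ranking, best first
relevant = {"d1"}                # the annotated source document

r2 = recall_at_k(retrieved, relevant, k=2)     # correct doc missing from top 2
r3 = recall_at_k(retrieved, relevant, k=3)     # present in top 3
rr = reciprocal_rank(retrieved, relevant)      # first hit at rank 3
ndcg = ndcg_at_k(retrieved, {"d1": 1.0}, k=10)
```

Reading the numbers together is what localizes the fault: a low recall@k points at the first stage, while a good recall@k with poor NDCG or MRR points at the reranker.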

For all of this to work you need one thing: a golden test set of 50-100 questions. The questions should come from your own domain and reflect real user queries, with the correct answer and the correct source document annotated by hand. Without this test set, every change is a guess. With this test set, every change is a measurement.

I always tell my clients: starting a RAG project without building a golden test set is like setting sail without a compass. Preparing 50 questions in the first week feels tedious, but it saves you from the "did it actually improve?" uncertainty for the next three months.

How do you build a golden test set? Start from real user logs. Collect frequently asked questions. Deliberately include hard, ambiguous, multi-criteria questions — because that's exactly where the system breaks down. For each question, mark the correct answer and the source document where that answer appears. This is labor-intensive work, but it is the most valuable asset of your RAG project.
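One possible shape for such a test set, and a loop that runs a retriever against it, is sketched below. The record fields, `evaluate_retriever`, and the sample questions are assumptions for illustration; `retrieve` stands for whatever function returns a best-first list of document IDs in your system. The loop reports a hit rate (was at least one annotated source document in the top k?) and, more usefully, the list of questions that failed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

@dataclass(frozen=True)
class GoldenExample:
    question: str
    expected_answer: str
    source_doc_ids: FrozenSet[str]   # documents annotated as containing the answer

def evaluate_retriever(golden: List[GoldenExample],
                       retrieve: Callable[[str], List[str]],
                       k: int = 10) -> Dict[str, object]:
    """Run every golden question through the retriever and collect the misses."""
    misses: List[str] = []
    for example in golden:
        top_k = set(retrieve(example.question)[:k])
        if not (example.source_doc_ids & top_k):
            misses.append(example.question)   # keep failures for manual review
    hit_rate = (len(golden) - len(misses)) / len(golden)
    return {"hit_rate_at_k": hit_rate, "misses": misses}

# Hypothetical two-question set and a canned retriever standing in for the real one.
golden_set = [
    GoldenExample("What is the consumer loan interest rate?", "2.5% annually",
                  frozenset({"lending-policy-s3"})),
    GoldenExample("Who approves loans above the branch limit?", "Regional credit committee",
                  frozenset({"lending-policy-s7"})),
]
canned_results = {
    golden_set[0].question: ["faq-12", "lending-policy-s3"],
    golden_set[1].question: ["faq-03", "lending-policy-s2"],
}
report = evaluate_retriever(golden_set, lambda q: canned_results[q], k=2)
```

Rerunning this after every pipeline change turns "the answers look better" into a number, and the `misses` list tells you which questions to inspect first.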

A Long Context Window Is No Excuse for Bad Retrieval

In 2026, model context windows have reached enormous sizes. Some teams draw the wrong conclusion from this: "Since I can fit so many documents, why bother with retrieval? I'll stuff everything that might be relevant into the model and let it handle it."

This is one of the most expensive fallacies I see in the field. Blindly stuffing documents into the model (blind stuffing) is harmful in several ways:

  • Cost and latency: Every token is money and time. Unnecessary documents inflate the bill and slow down the answer.
  • Distraction: The model may overlook what truly matters amid irrelevant documents. The fuller the context, the lower the signal-to-noise ratio.
  • The "lost in the middle" effect: In long contexts, models recall information in the middle less reliably than at the beginning and end.

Instead, adopt a context pruning and information gain approach: give the model only the highly relevant pieces that will genuinely help. Smart retrieval always beats blind stuffing. The reranker's top 10 is precisely the way to do this — filtering the genuinely valuable ones from the 100 candidates and offering the model a clean table.
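A simple form of context pruning is a relevance floor plus a token budget applied to the reranker's output. The sketch below is one way to do it, not a prescribed method: `prune_context`, the `min_score` threshold, and the `max_tokens` budget are hypothetical, and `approx_tokens` counts whitespace-separated words as a rough stand-in for a real tokenizer.

```python
from typing import Callable, List, Tuple

def approx_tokens(text: str) -> int:
    # Rough approximation; swap in your model's tokenizer for real budgets.
    return len(text.split())

def prune_context(scored_chunks: List[Tuple[str, float]],
                  min_score: float = 0.5,
                  max_tokens: int = 1500,
                  count_tokens: Callable[[str], int] = approx_tokens) -> List[str]:
    """Keep the highest-scoring chunks that clear the floor and fit the budget."""
    selected: List[str] = []
    used = 0
    for text, score in sorted(scored_chunks, key=lambda pair: pair[1], reverse=True):
        if score < min_score:
            break                      # everything after this is less relevant
        cost = count_tokens(text)
        if used + cost > max_tokens:
            continue                   # skip chunks that would blow the budget
        selected.append(text)
        used += cost
    return selected

reranked = [
    ("interest rate for consumer loans", 0.9),
    ("office opening hours", 0.2),
    ("early repayment fee rules apply here", 0.7),
]
context = prune_context(reranked, min_score=0.5, max_tokens=10)
tight_context = prune_context(reranked, min_score=0.5, max_tokens=5)
```

The irrelevant chunk never reaches the model in either call, and under the tighter budget only the single most relevant chunk survives.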

In short: a long context window is a luxury, a safety margin — not an excuse to cover up bad retrieval.

Multi-Turn Dialogue: Query Rewriting

RAG systems usually work well on a single question, but real users have conversations. The second and third questions are usually ambiguous and lean on prior context:

  • User: "What are the obligations of the data controller under KVKK?"
  • User: "And what if there's a breach?"

The second question is meaningless on its own. If you feed "And what if there's a breach?" to retrieval as-is, the system won't know what breach you're talking about. This is where query rewriting comes in. Before retrieval, you turn ambiguous follow-up questions into standalone queries:

  • "And what if there's a breach?" → "What are the obligations of the data controller in the event of a personal data breach under KVKK?"

This rewritten, self-contained query goes to retrieval and brings back far more accurate documents. It looks like a small step, but in multi-turn dialogues this is exactly what rescues RAG quality from the cliff. In every system I build that supports multi-turn conversation, I put query rewriting in as a standard layer.
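In practice the rewriting step is usually a small LLM call placed in front of retrieval. The sketch below keeps the model abstract: `llm` is any callable that takes a prompt string and returns text, so no specific vendor API is assumed. The prompt wording, `build_rewrite_prompt`, and `rewrite_query` are illustrative, not a recommended canonical prompt.

```python
from typing import Callable, List

def build_rewrite_prompt(history: List[str], follow_up: str) -> str:
    """Assemble a prompt asking the model for a standalone version of follow_up."""
    turns = "\n".join(f"- {turn}" for turn in history)
    return (
        "Rewrite the follow-up question so it can be understood without the "
        "conversation. Keep the user's intent, add the missing context, and "
        "return only the rewritten question.\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )

def rewrite_query(history: List[str], follow_up: str,
                  llm: Callable[[str], str]) -> str:
    if not history:
        return follow_up               # first turn: nothing to resolve
    rewritten = llm(build_rewrite_prompt(history, follow_up)).strip()
    return rewritten or follow_up      # fall back if the model returns nothing

# Stand-in model for demonstration; a real system would call its LLM here.
def fake_llm(prompt: str) -> str:
    return ("  What are the obligations of the data controller in the event of "
            "a personal data breach under KVKK?\n")

standalone = rewrite_query(
    ["What are the obligations of the data controller under KVKK?"],
    "And what if there's a breach?",
    fake_llm,
)
```

The fallback to the original question matters: a failed or empty rewrite should degrade to ordinary retrieval, not to an empty query.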

Cascading Architecture: Don't Treat Every Question the Same

Finally, we arrive at the secret of a mature RAG system: cascading architecture. Not every question deserves the same heavy processing. Running the full pipeline on a simple question is a waste of money and time.

The logic:

  • Simple queries: Hybrid search alone (BM25 + dense + RRF) is enough. No need for reranking. A fast and cheap answer.
  • Complex queries: The full pipeline — hybrid search + cross-encoder reranking + query rewriting if needed. Slow but accurate.

You can determine a query's complexity with a lightweight classifier or simple heuristics (length, whether it contains multiple criteria, whether it's ambiguous). You route the simple ones through the fast path and steer the hard ones onto the heavy path.
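A heuristic router along these lines can be very small. The version below is a sketch with made-up thresholds: `max_simple_words=8`, the English marker list, and the path names are all assumptions, and a Turkish deployment would need its own markers (or a trained classifier) validated against the golden test set.

```python
import re

# Words and punctuation that often signal several criteria in one question.
CRITERIA_MARKERS = re.compile(r"\b(?:and|or|but|versus|vs|compared)\b|[,;]",
                              re.IGNORECASE)

def is_complex(query: str, max_simple_words: int = 8) -> bool:
    """Crude complexity check: long queries or queries with multiple criteria."""
    long_query = len(query.split()) > max_simple_words
    multi_criteria = len(CRITERIA_MARKERS.findall(query)) >= 2
    return long_query or multi_criteria

def route(query: str) -> str:
    # "hybrid": BM25 + dense + RRF only. "hybrid+rerank": add the cross-encoder.
    return "hybrid+rerank" if is_complex(query) else "hybrid"

simple_path = route("KVKK Article 11")
heavy_path = route("Which loans allow early repayment, have no fee, "
                   "and apply to consumers under the 2024 policy?")
```

Because a misrouted hard question silently loses quality, it is worth logging the chosen path and checking on the golden test set that the fast path does not lower NDCG@10 for the queries it handles.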

Alongside the cascading architecture, add two more performance practices:

  • Cache embeddings and frequent results: The same or similar queries come in again and again. Keeping the embedding computation and frequent results in a cache significantly reduces cost and latency.
  • Always show citations: Always show which document the model's answer is based on. This both increases trust and, in the KVKK world, is often mandatory for auditability. The user should be able to ask "where did this information come from?" and you should be able to show it.

When we put all of this together, the picture that emerges is: RAG is not a "model selection" problem, it's a retrieval engineering problem. Get chunking right, set up hybrid search, merge with RRF, re-rank with a cross-encoder, measure every layer with a golden test set, and make the Turkish-specific and KVKK-specific decisions from the very start. Change the model last — because most likely the problem was never there.

Where to Start: A Concrete Action Plan

Let's end not with a piece of advice, but with an ordered action list. If you return to your RAG system tomorrow morning, do the following in order:

  1. Build the golden test set. 50-100 real questions, correct answers, correct sources. This before anything else.
  2. Measure the recall@k of your current retrieval. If it's low, the problem is most likely in chunking and the first stage; fix that before adding a reranker.
  3. Make chunking structural + overlapping + contextual. Solidify the invisible foundation.
  4. Move to hybrid search. BM25 (tuned for Turkish) + dense, merged with RRF. For Turkish this is nearly mandatory.
  5. Add a cross-encoder reranker. Among BGE / Jina / Cohere, first eliminate based on your data sovereignty decision, then measure on your own Turkish test set.
  6. Add query rewriting (if you have multi-turn dialogue).
  7. Move to a cascading architecture and set up caching. Light path for simple questions, heavy path for hard ones.
  8. Always show citations. For trust and KVKK auditability.

If you proceed in this order, at each step your golden test set will tell you, in numbers, how far you've come. And most likely you'll realize this: the biggest jumps came not when you changed the model, but when you fixed the search. I'm telling you from the field — whoever repairs retrieval, repairs RAG.
