
The Art of Retrieval

Every retrieval algorithm exists because the previous one had a flaw. This is a deep dive into how search evolved, and why it had to.

Gokul JS · 41 min read

I started with the basics of RAG. Then I wanted to go deeper. What actually makes a retrieval pipeline good, what are the real problems, and how far can you push it. So I built a bare-metal implementation of the entire stack, from keyword search all the way to multimodal RAG, every layer by hand. You can find it at DeepRAG. Read the blog first. Then go to the code. You will have a much better mental model of why each piece exists.

But before any of that. Why does RAG exist at all?

The Problem RAG Solves

Every LLM has a training cutoff. Everything that happened after that date, the model does not know. And your internal documents, your org's wikis, support tickets, codebases, contracts, the model has never seen any of it. You need a way to get that information in.

The obvious answer: just put the documents in the context window. And honestly, for small documents, that is the right answer. If your data fits under roughly 200k tokens, pass it in directly. Simpler, no infrastructure to maintain, no retrieval pipeline to debug. The simplest solution wins.

But what if you have more data than fits? What if you have an infinite set of documents?

When the Simple Answer Breaks

Three things go wrong as your document set grows.

First, accuracy drops. When you push more than 200k tokens into the context, LLMs get worse at using it. The signal gets buried in the noise. The model starts missing things it should find.

Second, the re-reading tax. In a multi-turn conversation, you are sending all that text back to the model on every single turn. Token usage explodes. Cost scales with every message.

Third, the needle-in-the-haystack problem. Even with a perfectly stuffed context, models still hallucinate. They pick the wrong needle, or they invent one. The model does not magically become reliable just because you gave it more text.

RAG is the answer to all three. Instead of giving the model everything, you give it exactly what it needs. The relevant documents, retrieved at query time. You hand it the needle.

The basic idea is simple: retrieve relevant documents, attach them to your prompt, generate an answer. That is the version most people know. But the surface-level understanding hides a lot of problems. What counts as relevant? How do you retrieve it? What happens when retrieval is wrong?

That is what this blog is about. Every retrieval algorithm exists because the previous one had a flaw. Let's trace all of them.

RAG is not perfect either. The worst failures are the silent ones. The system runs, returns an answer, no crash, no error. But the document with the right information never showed up in retrieval. The model answered without it. You will never know.

The whole game is how close you can get to surfacing the right documents every single time. And that is much harder than it sounds.

Most RAG systems start with semantic search. You embed the query, embed the documents, find the nearest vectors. It works well for meaning-based queries. But say the user asks about "GPT-4o", or "CVE-2024-1234", or "numpy==1.24.0", or an internal ticket ID like "ENG-4821". These are exact identifiers. They have no semantic meaning in the embedding space. The similarity search drifts toward general content and misses the exact thing the user needed. No error. Just the wrong documents, returned confidently.

This is the gap keyword search was built to fill.

Keyword Search

Keyword search is exactly what it sounds like. Does the term the user typed exist in the document or not. That is the entire idea. You look for the words. You find them or you do not.

But before you can match anything, you have to clean the query. Raw user input is messy. And if you match on raw text, you will miss things that should obviously match.

The first step is lowercasing. A user searching for "The Matrix" and a document that says "the matrix" should match. They are the same thing. Case should never be the reason a result is missed. So everything goes lowercase before anything else.

Next, remove punctuation. It adds nothing to the search. "Nolan's best film" and "Nolans best film" should hit the same documents. The apostrophe is noise.

Then remove stopwords. Words like "the", "is", "a", "of" carry no retrieval signal. If you keep them, they show up in almost every document and pollute your results. Strip them out.

Now you tokenize. Split the cleaned query into individual terms. "best sci-fi movies about space" becomes "best", "sci-fi", "movies", "space". Each token is what you search for.

Finally, stemming. Users do not search in base forms. They type naturally. "running", "runner", "ran" should all match documents about "run". Stemming reduces each word to its root so variations do not cause misses. "watching" becomes "watch". "jumping" becomes "jump". The query and the document meet at the same root, even if the surface form was different.

By the time you are done, "The Matrix is a great film" has become "matrix", "great", "film". Clean, consistent, ready to match.
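Here is that pipeline as code. A minimal sketch in Python: the stopword list is truncated, and the stemmer is a toy suffix-stripper standing in for a real algorithm like Porter stemming.

import re

STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}  # truncated for brevity

def stem(word):
    # Toy stemmer: strip a few common suffixes. A real system would
    # use a proper stemming algorithm like Porter.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                  # 1. lowercase
    text = re.sub(r"[^\w\s-]", "", text)                 # 2. strip punctuation, keep hyphens
    tokens = text.split()                                # 3. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    return [stem(t) for t in tokens]                     # 5. stem

print(preprocess("The Matrix is a great film"))  # ['matrix', 'great', 'film']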

That is the query side. But what about the documents? You cannot scan every document on every search. That does not scale. You need a data structure that makes lookups fast. That structure is the inverted index.

Inverted Index

A forward index maps document to words. Given a document, you can tell what words are in it. That is useful for display, not for search.

An inverted index flips it. It maps word to documents. Given a word, you instantly know every document that contains it. That is what makes search fast.

Think of it like the index at the back of a book. You do not read the whole book to find where "recursion" appears. You look it up in the index and it tells you exactly which pages. An inverted index is that, but for your entire document corpus.

At query time, you take your cleaned tokens, look each one up in the index, and get back the list of matching documents instantly. No scanning. No brute force. Just a lookup.

"matrix" → [doc_1, doc_4, doc_7]
"film"   → [doc_1, doc_2, doc_9]
"nolan"  → [doc_2, doc_4]
Keyword search index build pipeline

A query for "nolan matrix" looks up both terms and intersects the lists. doc_4 appears in both. That is your result. This is called boolean search. A document either contains the term or it does not. AND intersects. OR unions. Simple, fast, predictable.

And the symmetry matters. The same cleaning pipeline runs on both sides. Every document goes through it once at index time. Every query goes through it at search time. Both land on the same root forms. A user types "running shoes", the query becomes "run", "shoe". A document indexed with "runner" and "shoes" also became "run", "shoe". They meet. The match happens.
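In code, the index is just a dictionary from token to a set of document ids, and boolean AND is a set intersection. A sketch, reusing the preprocess function from the earlier snippet:

from collections import defaultdict

def build_index(docs):
    # Map each root-form token to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in preprocess(text):
            index[token].add(doc_id)
    return index

def boolean_and(index, query):
    # Look up each query token, then intersect the posting sets.
    postings = [index.get(t, set()) for t in preprocess(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    "doc_2": "Nolan's best film",
    "doc_4": "Nolan reviews The Matrix",
}
print(boolean_and(build_index(docs), "nolan matrix"))  # {'doc_4'}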

The flaw: boolean search has no ranking. If 50 documents match "nolan matrix", they all come back as equal results. There is no score, no signal for which one is actually more relevant. You either matched or you did not.

When you have 5 results, that is fine. When you have 5000, it is useless. You need a way to score and rank. That is what TF-IDF was built to solve.

Term Frequency (TF)

The inverted index tells you which documents contain your search terms. But it does not tell you which document is most relevant. Two documents both contain the word "space". How do you pick the better one?

The first signal is how often the term appears. A document that mentions "space" fifteen times is probably more about space than one that mentions it once. But you cannot just use the raw count. A longer document will naturally contain more occurrences of any term. Raw counts are biased toward longer documents.

So you normalize. TF is the ratio of how many times the term appears to the total number of terms in the document. That makes it length-independent.

TF(term, doc) = (occurrences of term in doc) / (total terms in doc)

doc_1 (5 words):  "python is fast readable python"         → TF("python") = 2/5 = 0.40
doc_2 (6 words):  "java is verbose but runs everywhere"    → TF("python") = 0/6 = 0.0

doc_1 ranks higher for "python". But TF alone has a problem. It cannot tell the difference between a meaningful term and a common one. A document full of the word "code" scores high on TF for "code", but so does every other document in a programming dataset. The term carries no real signal.

Inverse Document Frequency (IDF)

IDF answers a different question. Not how often does this term appear in this document, but how rare is this term across all documents. The rarer the term, the more discriminating it is. The more common the term, the less it tells you.

IDF(term) = ln((total_docs + 1) / (docs_containing_term + 1))

100 documents total:

"code"     → in 98 docs → ln(101/99) = 0.02  ← universal, tells you nothing
"function" → in 60 docs → ln(101/61) = 0.50  ← common, weak signal
"deadlock" → in 3 docs  → ln(101/4)  = 3.22  ← rare, strong signal

A term that appears in every document scores zero. It is noise. A term that appears in very few documents scores high. It is a genuine signal.

The +1 in both numerator and denominator is for numerical stability, to avoid division by zero in edge cases. It is not for handling terms that were never seen. If a term is not in the index, you simply do not compute IDF for it at all.

TF and IDF are never used alone. They are always multiplied together to produce a single score per term per document. That combined score is TF-IDF.

TF-IDF(term, doc) = TF(term, doc) × IDF(term)

TF-IDF

TF and IDF solve two opposite problems.

TF asks: does this document talk a lot about this word? IDF asks: is this word even worth caring about? Neither answer alone is enough. You need both to be true at the same time.

That is why you multiply. A word that appears often but means nothing scores zero on IDF. A word that is rare but absent from the document scores zero on TF. TF-IDF is high only when both hold. Miss either one and the score collapses.

That is the entire insight.

Here is a full example. Three documents, query is "python machine learning".

query: "python machine learning"

doc_1: "python python machine learning python deep learning model training python"
doc_2: "python web development flask django api"
doc_3: "java enterprise spring boot microservices deployment"

--- Step 1: TF ---
formula: TF = occurrences of term in doc / total terms in doc

              doc_1       doc_2       doc_3
"python"      4/10=0.40   1/6=0.17    0/6=0.0
"machine"     1/10=0.10   0/6=0.0     0/6=0.0
"learning"    2/10=0.20   0/6=0.0     0/6=0.0

--- Step 2: IDF ---
formula: IDF = ln((total_docs + 1) / (docs_containing_term + 1))

"python"   → in 2 docs → ln(4/3) = 0.29   ← common across corpus
"machine"  → in 1 doc  → ln(4/2) = 0.69   ← more specific
"learning" → in 1 doc  → ln(4/2) = 0.69   ← more specific

--- Step 3: TF-IDF ---
formula: TF-IDF = TF × IDF

              doc_1                        doc_2                doc_3
"python"      0.40 × 0.29 = 0.116         0.17 × 0.29 = 0.049  0.0
"machine"     0.10 × 0.69 = 0.069         0.0                  0.0
"learning"    0.20 × 0.69 = 0.138         0.0                  0.0

score         0.116+0.069+0.138 = 0.323   0.049                0.0

--- Final ranking ---

1. doc_1  0.323  ← talks about python AND machine learning
2. doc_2  0.049  ← only has python
3. doc_3  0.0    ← no match at all

doc_1 wins because it has all three query terms, and the rarer terms like "machine" and "learning" carry more weight than the common "python". doc_2 matches on python but contributes nothing for the other terms. doc_3 scores zero.

That is TF-IDF doing exactly what you want. Frequent, specific terms in the right document rise to the top.
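All three steps collapse into a few lines of code. A sketch that reproduces the table above, with tokenization reduced to a plain split for clarity:

import math

def tf(term, doc_tokens):
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, all_docs):
    df = sum(1 for d in all_docs if term in d)
    return math.log((len(all_docs) + 1) / (df + 1))

def tfidf_score(query_tokens, doc_tokens, all_docs):
    return sum(tf(t, doc_tokens) * idf(t, all_docs) for t in query_tokens)

docs = [
    "python python machine learning python deep learning model training python".split(),
    "python web development flask django api".split(),
    "java enterprise spring boot microservices deployment".split(),
]
query = "python machine learning".split()
for i, d in enumerate(docs, 1):
    print(f"doc_{i}: {tfidf_score(query, d, docs):.3f}")
# doc_1: 0.323, doc_2: 0.048, doc_3: 0.000 (the table above rounds intermediates)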

But TF-IDF has a flaw. And it is a quiet one.

BM25

TF-IDF works. But it breaks in three specific ways that matter in production. BM25, formally called Okapi BM25, was built to fix all three. It is what Elasticsearch, Solr, and Lucene use as their default ranking function today.

The three problems: an unstable IDF, term frequency that grows without limit, and no awareness of document length. Take them one at a time.

Problem 1: Unstable IDF

Classic TF-IDF uses log(N / df) for IDF. It breaks at the edges in two ways.

When a term is extremely rare, df is tiny and the score explodes. A term appearing in just one document out of a million gets an absurdly high IDF that can dominate the entire score, regardless of how the term actually contributes to relevance.

When a term appears in every document, you get log(1) = 0. Fine so far. But the probabilistic IDF variant, log((N - df) / df), goes negative for any term that appears in more than half the documents. A negative relevance score is meaningless.

BM25 replaces the formula entirely.

BM25 IDF = log((N - df + 0.5) / (df + 0.5) + 1)

Core ratio: (N - df) / df
  Numerator   (N - df)  documents that do NOT contain the term
  Denominator (df)      documents that DO contain the term

Instead of asking "how many docs have this term",
it asks "how many docs don't have it vs how many do".
Rare terms: large numerator, high score.
Common terms: small numerator, low score.

+0.5 on both sides   Laplace smoothing, prevents division by zero
+1 at the end        guarantees IDF is always positive, no exceptions

Rare terms score high but not infinitely. Common terms score low but never negative. Stable across all cases.

Problem 2: TF Never Saturates

In TF-IDF, TF is linear. A document that mentions "bear" 100 times scores 10x higher than one that mentions it 10 times. But is it 10x more relevant? No.

Consider these two documents for the query "bear hunting":

doc_A: "bear bear bear bear bear"              → tf("bear") = 5
doc_B: "bear hunting guide for beginners"      → tf("bear") = 1

Basic TF: doc_A scores 5x higher. But doc_B is clearly more useful.

After a term appears enough times to establish that a document is about that concept, more occurrences add almost no new relevance signal. TF-IDF does not model this. BM25 does, through saturation.

BM25 TF = (tf * (k1 + 1)) / (tf + k1)

k1 controls how fast saturation kicks in. Typical value: 1.2 to 2.0

tf    Basic TF    BM25 TF (k1=1.5)
1     1           1.0
2     2           1.4
5     5           1.9
10    10          2.2
20    20          2.3

Basic TF grows forever. BM25 TF flattens toward k1+1 = 2.5.
The first few occurrences matter. After that, almost nothing.
Basic TF (linear) vs BM25 TF (k1=1.5): basic TF keeps growing, BM25 TF flattens toward k1+1

Problem 3: Document Length Is Ignored

Longer documents naturally accumulate more term occurrences just by being long. A 2000-word article will almost always outscore a focused 200-word summary on raw TF, even if the summary is more relevant.

Query: "bear"

doc_A: "Boots is a silly bear wizard."
doc_B: "Ted is a wonderful human who has a stuffed bear that loves honey,
        salmon, picnics, and hanging out with other bears in the woods.
        Ted's bear is so nice to hang out with Ted all day long."

doc_B has more occurrences of "bear" — not because it is more relevant,
but because it is longer.

BM25 normalizes term frequency against the document's length relative to the average document length in the corpus.

length_norm = 1 - b + b * (doc_length / avg_doc_length)

BM25 TF (full) = (tf * (k1 + 1)) / (tf + k1 * length_norm)

b controls how aggressively length is penalized.
  b = 0   ignore length entirely
  b = 1   full normalization
  b = 0.75 (default)  partial, works well for most corpora

doc_length / avg_doc_length:
  = 1.0   average length, no change
  > 1.0   longer than average, penalized
  < 1.0   shorter than average, boosted

A focused short document that mentions the term twice will outscore a bloated long document that mentions it five times. That is the behaviour you want.

Tuning k1 and b

Both parameters have sensible defaults that work for most cases. But your corpus is not generic, and tuning them matters more than most people expect.

k1 controls how much term repetition still counts. A low k1 means the first occurrence of a term does almost all the work. A high k1 means multiple occurrences keep adding meaningful score before the curve flattens.

k1 = 0.5   saturation is aggressive. First occurrence dominates.
           Good for: short documents, product titles, code identifiers,
           queries where one mention is as strong as ten.

k1 = 1.2   default. Works well for most general text corpora.

k1 = 2.0   saturation is slow. Repetition keeps contributing longer.
           Good for: long technical documents, legal text, scientific papers
           where a term recurring throughout genuinely signals relevance.

b controls how hard you penalize long documents. A high b means document length matters a lot. A low b means you mostly ignore it.

b = 0.0   length is ignored entirely. Every document competes on raw TF.
          Good for: corpora where length correlates with coverage,
          encyclopedic content, or documents of uniform length.

b = 0.75  default. Partial normalization, works well in practice.

b = 1.0   full normalization. Length is fully accounted for.
          Good for: mixed corpora with huge length variance,
          FAQs mixed with long articles, or boilerplate-heavy documents.

When tuning, change one parameter at a time and evaluate against real queries from your domain. The defaults are a good starting point, not an endpoint.

The Complete BM25 Formula

Put the fixed IDF and the saturating, length-normalized TF together and you get the full BM25 score:

BM25(doc, query) = sum for each query term qt of:
  BM25_IDF(qt) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * (doc_length / avg_doc_length)))

Steps:
1. Tokenize the query
2. For each token, compute BM25 score against each document
3. Sum the scores across all query tokens
4. Sort documents by total score, descending
5. Return top N
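As code, the whole thing is a direct transcription of the formula. A sketch, assuming documents are already token lists from the same preprocessing pipeline as the queries:

import math

def bm25_score(query_tokens, doc_tokens, all_docs, k1=1.2, b=0.75):
    N = len(all_docs)
    avg_len = sum(len(d) for d in all_docs) / N
    score = 0.0
    for term in query_tokens:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        df = sum(1 for d in all_docs if term in d)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)    # stable, never negative
        norm = 1 - b + b * (len(doc_tokens) / avg_len)     # length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)  # saturating TF
    return score

def bm25_search(query_tokens, all_docs, top_n=5):
    scored = [(bm25_score(query_tokens, d, all_docs), i) for i, d in enumerate(all_docs)]
    return sorted(scored, reverse=True)[:top_n]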

BM25 is fast, explainable, and hard to beat on exact keyword queries. Every score can be traced to specific term statistics, which matters when you need to debug why a result ranked where it did.

But it is still a bag-of-words model. It matches tokens, not meaning. Search for "how to fix a deadlock" and a document about "concurrency issues and thread contention" scores zero, even if it is exactly what you needed. The tokens did not match.

That is the limit of keyword search. That is where semantic search begins.

Keyword search query processing pipeline

Semantic Search

Keyword search matches words. Semantic search matches meaning. That is the fundamental difference.

Search for "heart attack" and keyword search will never find a document that says "myocardial infarction". Search for "how to make code run faster" and it will miss a document titled "performance optimization techniques". The words are different. The meaning is the same. Keyword search cannot cross that gap.

Semantic search can. It handles three things that keyword search fundamentally cannot.

Synonym matching. "Car" and "automobile" are different strings but the same concept. Keyword search treats them as unrelated. Semantic search knows they are the same.

Conceptual queries. A user searching for "movies about loneliness in space" is not looking for documents that contain those exact words. They are looking for documents about that idea. Keyword search needs the exact terms. Semantic search understands the concept.

Natural language. People do not search in keywords. They ask questions. "Why does my database keep locking up?" is a natural language query. The relevant document might talk about "deadlocks" and "contention". Semantic search bridges that gap.

The thing that makes all of this work is embeddings.

An embedding is a vector. A vector is a list of numbers. That is it. You take a piece of text, pass it through a model, and you get back a list of numbers that represents the meaning of that text. Not the words. The meaning.

Texts with similar meaning end up close together in this space. Texts with different meaning end up far apart.

And unlike keyword search, you do not need to preprocess anything. No lowercasing, no stemming, no stopword removal, no punctuation stripping. The embedding model handles all of that internally. It is context-aware. "Running a server" and "running in a park" produce different vectors because the model understands the difference. Keyword search would treat both as the same stem.

Embedding space: similar words cluster together, different words are far apart

Choosing an Embedding Model

The embedding model you pick determines how good your semantic search is. Get this wrong and nothing downstream can fix it.

For most use cases, a general-purpose embedding model works fine. OpenAI and Gemini both offer embedding APIs that handle broad queries well out of the box. If you are building a standard RAG system over general text, start here.

But general models have limits. They were trained on broad data, not your data. If your domain has specialized vocabulary, medical records, legal contracts, codebases, the general model will miss nuances that matter. "Infarction" and "heart attack" might not land as close as they should. Domain-specific terms get blurred.

This is where fine-tuning or training your own embedding model pays off. Cursor did exactly this for code search. They trained a custom embedding model specifically for semantic code retrieval, and it made a measurable difference. You can read about their approach at cursor.com/blog/semsearch.

If you want to explore open-source embedding models, the MTEB leaderboard on HuggingFace ranks them across retrieval benchmarks. Pick one that scores well on tasks similar to yours.

One rule that is non-negotiable: the same embedding model must be used for both indexing and querying. If you embed your documents with model A and your queries with model B, the vector spaces will not align. The distances will be meaningless. Your results will be garbage and you will not get an error telling you why.

Also check what similarity metric your model was trained for. Most modern models are trained for cosine similarity. Some use dot product. If the model was optimized for cosine and you run dot product at query time, the rankings shift. Match the metric to the model.

Note: this is one of those silent failures. The system runs, results come back, no error anywhere. But the rankings are wrong because the metric does not match what the model was trained for. You will spend hours debugging retrieval quality before you think to check this.

Cosine Similarity

Cosine similarity measures the angle between two vectors. It does not care about length. Two vectors pointing in the same direction score 1, no matter how long or short they are. Two vectors at right angles score 0. Opposite directions score -1.

This makes it useful when you care purely about meaning. A short document and a long document about the same topic will have vectors pointing in the same direction. Cosine treats them equally.

cosine_similarity(A, B) = (A . B) / (|A| * |B|)

Range: -1 to 1
  1  = identical direction
  0  = perpendicular (unrelated)
 -1  = opposite direction
Cosine similarity: measures angle between vectors, ignores magnitude

Dot Product

Dot product cares about both direction and magnitude. Two vectors pointing the same way score high, but a longer vector scores even higher. It rewards not just similar meaning, but confidence in that meaning.

If your embedding model produces vectors where magnitude encodes importance or confidence, dot product captures that signal. Cosine throws it away.

dot_product(A, B) = A1*B1 + A2*B2 + ... + An*Bn

No fixed range. Scales with vector magnitude.
Same direction + longer vectors = higher score.
Dot product: measures both direction and magnitude of vectors
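Both metrics are one-liners with numpy, and a toy pair of vectors makes the difference concrete. The numbers here are made up:

import numpy as np

def cosine_similarity(a, b):
    # Angle only: divide out both magnitudes.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude

print(cosine_similarity(a, b))  # 1.0  -> identical direction, length ignored
print(np.dot(a, b))             # 28.0 -> magnitude inflates the score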

Which One to Use

For most RAG systems, cosine similarity is the right default. You care about whether two texts mean the same thing, not how long or detailed they are. Cosine normalizes that away.

Use dot product when magnitude carries signal. Some models encode confidence or specificity in the vector length. A longer vector means the model is more certain about the meaning. In that case, you want magnitude to influence the score. Dot product preserves it. Cosine discards it.

In practice, check the model card. If it says cosine, use cosine. If it says inner product or dot product, use that. Do not mix them. The model was trained to optimize for one, and using the other will shift your rankings in ways that are hard to debug.

How Semantic Search Works

The pipeline has two phases. Indexing happens once. Search happens on every query.

Indexing (once):
1. Take each document
2. Convert it to a vector using your embedding model
3. Store the vector in a vector store

Search (per query):
1. Convert the user query to a vector using the same model
2. Compare the query vector against all stored vectors
3. Rank by similarity score
4. Return the top results

That is it. No text preprocessing, no inverted index, no BM25 scoring. You embed, you compare, you rank.
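At small scale you can write the entire pipeline in a dozen lines. A sketch using the sentence-transformers library, with all-MiniLM-L6-v2 as one concrete model choice rather than a recommendation:

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing (once): every document becomes a vector.
docs = [
    "performance optimization techniques",
    "myocardial infarction treatment",
    "deploying flask apps behind nginx",
]
doc_vecs = model.encode(docs)

# Search (per query): embed with the SAME model, compare against everything.
query_vec = model.encode(["how to make code run faster"])[0]
scores = [np.dot(query_vec, v) / (np.linalg.norm(query_vec) * np.linalg.norm(v))
          for v in doc_vecs]   # O(n): one comparison per stored vector
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")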

But this only works at small scale. Comparing the query vector against every stored vector one by one is O(n). With a million documents, that is a million comparisons per query. It does not scale.

Vector Databases

Looping through a list of embeddings works for a demo. At a million documents, it does not. You need a database built for vectors.

The key difference is indexing. Instead of comparing every vector, the database pre-organizes them so it only checks a small subset. The details of how these indexes work are beyond the scope of this post, but the tradeoff is simple: you might miss the absolute nearest neighbor, but you get very close ones in milliseconds instead of seconds.

Chunking

With keyword search, you index entire documents. With semantic search, you cannot. An embedding model compresses text into a fixed-size vector. The longer the text, the more meaning gets averaged together, and the less precise the vector becomes.

A long technical document covers architecture decisions, deployment steps, error handling, and performance benchmarks. If you embed all of that as one vector, a query about "how to handle timeout errors" matches weakly because the vector represents everything at once. The specific signal gets diluted.

Chunking fixes this. You split the document into smaller pieces and embed each piece separately. A query about timeout errors hits the chunk that actually talks about error handling, not the chunk about deployment steps.

Fixed-Size Chunking

The simplest approach. Split text every N words or N tokens. Predictable sizes, simple to implement, easy to control token limits. But it is dumb. It splits in the middle of sentences, in the middle of thoughts.

Chunk Overlap

Fixed-size chunking breaks context at boundaries. Consider this text:

"the bear attack was terrifying. The stunning special effects led to record breaking sales."

Without overlap:
  chunk 1: "the bear attack was"
  chunk 2: "terrifying. The stunning special effects led"
  chunk 3: "to record breaking sales."

What was terrifying? What had record breaking sales? Context is lost.

With overlap:
  chunk 1: "the bear attack was terrifying."
  chunk 2: "terrifying. The stunning special effects led to record breaking sales."

Now each chunk carries enough context to make sense on its own.

How much overlap? There is no universal answer. Make it configurable and test on your data.
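A minimal sketch with both knobs exposed. Chunk size and overlap are in words here; a token-based version looks the same with a tokenizer in place of split:

def chunk_words(text, chunk_size=200, overlap=50):
    # Each chunk repeats the last `overlap` words of the previous one,
    # so context at the boundary is not lost. Requires overlap < chunk_size.
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks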

Semantic Chunking

Even with overlap, word-based splitting cuts at arbitrary positions. "Ted explores themes" gets separated from "of friendship and growing up". The author's intended meaning is split across chunks.

Semantic chunking respects natural language structure. Instead of splitting at word counts, you split at sentence or paragraph boundaries. Each chunk contains a complete thought as the author organized it.

Word-based:
  "Ted explores themes"
  "of friendship and growing up"
  "growing up while John must"

Semantic:
  "Ted is a 2012 comedy film directed by Seth MacFarlane."
  "The story follows John Bennett and his magical teddy bear."
  "The film explores themes of friendship and growing up."

The Edge Cases

Chunking seems straightforward until you point it at real data. Then you spend a week fixing things you did not expect to break.

The first thing you notice with PDFs is the headers and footers. Every single page has the same company name, the same page number, the same disclaimer. And all of it ends up in your chunks. So you start looking for commonalities across pages and stripping them out. That helps, until you hit a document with two-column layouts. The text extractor reads straight across both columns and merges them into one garbled paragraph. You fix that, and then you find tables where the rows and columns got flattened into a single line of text that makes no sense at all. Then the font encoding breaks and half your text is random symbols.

Markdown and HTML have their own version of this. You chunk a technical doc and suddenly a code block is split in half across two chunks. Bullet points that made sense as a list now live in separate chunks with no connection to each other. A long quote lands in one chunk while the sentence that says who said it lands in the next one.

There is no library that handles all of this for you. You chunk your data, you look at what came out, and you fix what broke. Then you chunk again. The way you get reliable chunks is by manually inspecting the output over and over until it stops surprising you.

Contextual Retrieval

You spend all that effort getting your chunks clean. You fix the PDFs, you handle the edge cases, you get the boundaries right. And then you realize there is another problem you did not see coming.

When you break a document into pieces, each piece loses the context of the whole. A chunk that says "revenue grew by 3% over the previous quarter" is perfectly clean. But it does not say which company. It does not say which quarter. Someone queries for "ACME Corp Q2 2023 revenue" and this chunk does not match, even though it is exactly the right answer. The information is there. The context is not.

You start noticing this everywhere. A chunk says "the API returns a 429 status code" but does not mention which service. A chunk says "latency increased by 40ms" but does not say which endpoint or when. Every chunk is a sentence ripped out of a conversation. It makes sense if you read the whole document. It makes no sense on its own.

Anthropic published a solution for this called Contextual Retrieval. The idea is simple: before you embed a chunk, pass it to an LLM along with the full document and ask for a short context that situates the chunk. Then prepend that context to the chunk before embedding.

Before:
  "Revenue grew by 3% over the previous quarter."

After:
  "This chunk is from ACME Corp's Q2 2023 SEC filing.
   Previous quarter revenue was $314 million.
   Revenue grew by 3% over the previous quarter."

Now the chunk carries enough information to match the right queries. The embedding captures not just what the text says, but what it is about.
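The preprocessing step looks roughly like this. The prompt paraphrases the idea from Anthropic's post rather than quoting it, and generate is a stand-in for whatever LLM client you use:

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is a chunk from that document:
<chunk>
{chunk}
</chunk>

Write a short context that situates this chunk within the overall
document, for the purpose of improving search retrieval of the chunk.
Answer with only the context."""

def contextualize(chunk, document, generate):
    # generate(prompt) -> str is a placeholder for your LLM call.
    context = generate(CONTEXT_PROMPT.format(document=document, chunk=chunk))
    return context.strip() + "\n" + chunk   # prepend, then embed this string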

Anthropic tested this across codebases, scientific papers, and financial filings. Contextual embeddings alone reduced retrieval failures by 35%. Combining contextual embeddings with contextual BM25 reduced failures by 49%. Add reranking on top and failures dropped by 67%. Each layer fixes what the previous one missed.

The cost is a one-time preprocessing step. You run it once when you index your documents. After that, the chunks carry their context forever.

If you are building a RAG system today and want a solid starting point, this is it. Contextual retrieval with hybrid search gives you the best of both keyword and semantic, with the context problem already handled. Start here and optimize from this baseline.

ColBERT

There is another approach to the chunking problem that takes a completely different angle. Instead of creating one embedding per chunk, what if you created one embedding per word?

That is what ColBERT does. Every token in the text gets its own contextualized embedding. The word "bank" in "river bank" gets a different vector than "bank" in "bank account" because the surrounding words shape each embedding individually.

This is called multi-vector retrieval. Instead of compressing an entire chunk into one vector and losing detail, you keep the full granularity. When a query comes in, each query token is matched against each document token, and the best matches are aggregated into a final score.
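The aggregation step is usually called MaxSim: for each query token, take its best match among the document tokens, then sum those maxima. A sketch, assuming L2-normalized token embedding matrices from both sides:

def maxsim_score(query_tokens, doc_tokens):
    # query_tokens: (q, dim) array, doc_tokens: (d, dim) array, rows normalized.
    sims = query_tokens @ doc_tokens.T    # every query token vs every doc token
    return sims.max(axis=1).sum()         # best doc match per query token, summed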

The tradeoff is straightforward: storage and compute. A single chunk that used to be one vector is now hundreds of vectors. Multiply that across your entire corpus and the storage cost grows fast. Search gets more expensive too because you are comparing token-level vectors instead of chunk-level ones.

I have not used ColBERT in production myself. But it is worth knowing about because it solves a real problem, the loss of precision when you compress meaning into a single vector, in a fundamentally different way than chunking strategies do.

Late Chunking

There is yet another way to think about the chunking problem. With normal chunking, you split the text first, then embed each chunk separately. Each chunk has no idea what the other chunks say. The context is gone before the embedding model ever sees it.

Late chunking flips the order. You pass the entire document through the embedding model first. The model processes all the tokens at once, so every token gets a vector that is informed by the full document context. Then you chunk the token-level vectors after the fact and pool each chunk into a single embedding.

That is why it is called late chunking. The chunking happens late, after the model has already seen everything.

The difference matters. In normal chunking, a sentence that says "the city has 3.85 million inhabitants" gets embedded without knowing that "the city" refers to Berlin, because that was mentioned three chunks ago. In late chunking, the model already processed the full document, so the vector for "the city" carries the Berlin context with it.
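Mechanically, the late step is just pooling. A sketch, assuming a model that returns one contextualized vector per token for the whole document in a single pass:

def late_chunk(token_vectors, chunk_spans):
    # token_vectors: (num_tokens, dim) array from ONE pass over the full document,
    # so every token vector already carries whole-document context.
    # chunk_spans: [(start, end), ...] token boundaries decided after encoding.
    return [token_vectors[start:end].mean(axis=0)   # mean-pool each span
            for start, end in chunk_spans]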

The catch: you need a long-context embedding model that can handle the full document in one pass. If your document is 50,000 tokens and your model caps at 8,192, you cannot do late chunking on the whole thing. You can read more about this approach at Jina AI's writeup on late chunking.

Hybrid Search

By now you have seen what keyword search is good at and where it fails. And you have seen the same for semantic search. They fail in opposite ways. Keyword search misses meaning. Semantic search misses exact terms. No single approach wins on every query.

So you run both. That is hybrid search. You take the same query, run BM25 against your inverted index, and run a vector similarity search against your embeddings. You get two separate ranked lists of results. Then you merge them into one.

From here the game changes. You have results from both systems. BM25 scores and semantic scores on completely different scales. The job is to bring them to the same scale, rerank the combined list to get the best documents to the top, and pass those to the LLM. That is the rest of the pipeline.

Score Normalization

You cannot compare these scores directly. BM25 gives you numbers anywhere from 0 to 100 or higher. Cosine similarity lives between -1 and 1, in practice mostly 0 to 1. A BM25 score of 12 and a cosine score of 0.85 mean nothing next to each other. You need them on the same scale first.

The simplest way to do this is min-max normalization. Take the list of scores from each system, find the lowest and highest, and rescale everything to sit between 0 and 1. The lowest score becomes 0, the highest becomes 1, and everything else falls proportionally in between.

normalized = (score - min_score) / (max_score - min_score)

Now both systems speak the same language. Every score is between 0 and 1.

Weighted Combination

Normalization puts both scores on the same scale. But you still have two separate scores per document. You need one final score to sort by. And you probably do not want to treat both systems equally for every query. Some queries are better served by keyword search, some by semantic. You need a way to control that balance. That is what the weighted combination does.

hybrid_score = alpha * bm25_score + (1 - alpha) * semantic_score

Alpha is a dial. Turn it all the way to 1 and you are doing pure keyword search. Turn it to 0 and you are doing pure semantic search. Anywhere in between is a blend.

alpha = 1.0   100% keyword, 0% semantic
alpha = 0.7   70% keyword, 30% semantic
alpha = 0.5   50/50 split
alpha = 0.2   20% keyword, 80% semantic
alpha = 0.0   0% keyword, 100% semantic

The right alpha depends on what your users are searching for. Someone typing "The Revenant" is looking for an exact title. Keyword should dominate. Alpha 0.8. Someone typing "feel good family movies" is searching by meaning. Semantic should dominate. Alpha 0.2. Someone typing "2015 comedies" needs both the year as a keyword and the concept of comedy. Alpha 0.5.

There is no universal right answer. Build your system so alpha is configurable, test it against real queries from your users, and tune from there.
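Both steps fit in a few lines. A sketch, with scores keyed by document id:

def min_max(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {doc: (s - lo) / (hi - lo) if hi > lo else 0.0
            for doc, s in scores.items()}

def hybrid(bm25_scores, semantic_scores, alpha=0.5):
    b, s = min_max(bm25_scores), min_max(semantic_scores)
    docs = set(b) | set(s)
    # Weighted blend; a doc missing from one system contributes 0 there.
    return sorted(((alpha * b.get(d, 0.0) + (1 - alpha) * s.get(d, 0.0), d)
                   for d in docs), reverse=True)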

But weighted combination has a flaw. Here is where it breaks.

Query: "python asyncio tutorial"

BM25 scores:                    Semantic scores:
  doc_A: 45.2                     doc_A: 0.92
  doc_B: 44.8                     doc_C: 0.85
  doc_C: 44.1                     doc_B: 0.41
  doc_D: 41.0                     doc_D: 0.38

After min-max normalization (0 to 1):

BM25 normalized:                Semantic normalized:
  doc_A: 1.00                     doc_A: 1.00
  doc_B: 0.90                     doc_C: 0.87
  doc_C: 0.74                     doc_B: 0.06
  doc_D: 0.00                     doc_D: 0.00

Weighted combination (alpha = 0.5):
  doc_A: 0.5 * 1.00 + 0.5 * 1.00 = 1.00
  doc_B: 0.5 * 0.90 + 0.5 * 0.06 = 0.48
  doc_C: 0.5 * 0.74 + 0.5 * 0.87 = 0.81
  doc_D: 0.5 * 0.00 + 0.5 * 0.00 = 0.00

Final ranking: doc_A, doc_C, doc_B, doc_D

Look at doc_B. In BM25, it scored 44.8, barely behind doc_A at 45.2. Almost identical. But after normalization, that tiny gap of 0.4 became a gap of 0.10. And semantic gave doc_B a 0.41, which normalized to 0.06, almost zero. So a document that was genuinely relevant by keyword match got crushed because normalization turned small score differences into large ones. The weighted combination punished doc_B for something the raw scores never said.

This is the fundamental problem. Min-max normalization is sensitive to the distribution of scores. When one system returns tightly clustered scores and the other returns spread out ones, the blending gets distorted. Your rankings end up shaped by the math, not by relevance.

Reciprocal Rank Fusion (RRF)

RRF sidesteps the problem entirely. It throws away the scores and uses only the ranks. It does not matter what the BM25 score was or what the cosine similarity was. All that matters is: where did each document land in each list?

RRF score = sum of 1 / (k + rank) for each system

k is a constant (typically 60)

Example: document X ranks #2 in BM25 and #5 in semantic
  RRF = 1/(60+2) + 1/(60+5) = 0.016 + 0.015 = 0.031

Document Y ranks #1 in BM25 but absent from semantic
  RRF = 1/(60+1) + 0 = 0.016

X wins because it showed up in both lists.

No normalization needed. No alpha to tune. Documents that rank well in both systems rise to the top naturally.

The k parameter controls how steeply the ranking drops off. A lower k like 20 gives much more weight to the top-ranked results and almost nothing to the rest. A higher k like 100 flattens the curve, so lower-ranked results still have meaningful influence. The default of 60 is a reasonable middle ground for most use cases.
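RRF is short enough to write from memory. A sketch that takes one ranked list per retrieval system; the document ids are made up:

def rrf(rankings, k=60):
    # rankings: one ranked list of document ids per retrieval system.
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranked     = ["doc_Y", "doc_X", "doc_Z"]
semantic_ranked = ["doc_Z", "doc_W", "doc_X"]   # doc_Y absent here
print(rrf([bm25_ranked, semantic_ranked]))
# doc_Z and doc_X rise because they appear in both lists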

Query Enhancement

Everything so far assumes the user typed a good query. They usually did not. People misspell words, write vague questions, and leave out context that would make the search work better. The retrieval pipeline can only find what the query asks for. If the query is bad, the results are bad.

You can fix this before the search even runs. Pass the raw query through an LLM first. Let it fix typos, expand abbreviations, break apart complex questions into simpler ones, and add missing context. Then search with the cleaned-up version instead of the original.

The user types "how do i fix timout erors in my api". The LLM rewrites it to "how to fix timeout errors in REST API requests". Now the search has the right spelling, the right terminology, and enough specificity to return useful results.

It is a small step that makes everything downstream work better. The retrieval does not get smarter. The query just stops being the bottleneck.

Semantic search pipeline

Reranking

A search might return 50 or 100 documents. The user cares about the top 5 to 10. Getting those right is what matters.

Reranking does something that neither BM25 nor vector search can. It looks at the query and the full document together at the same time. BM25 matches keywords without understanding the query. Vector search compares pre-computed embeddings without seeing the actual text. Reranking sees both, side by side, and scores relevance based on that full picture.

That is why it is more accurate. It is also why it is slow. You cannot pre-compute anything. Every query-document pair has to be scored from scratch. You cannot afford to run this on your entire corpus.

So you do not. You use BM25 and vector search to quickly eliminate the majority of documents and get a rough top 50 to 100. Then you run the expensive reranker only on those candidates. Fast retrieval first, precise reranking second.

You can make this faster. Instead of scoring each document individually against the query in separate LLM calls, you batch them. Pass all the candidate documents to the LLM in one call and ask it to rank them together.

This is better for two reasons. Speed and cost, obviously. You are not re-sending the system prompt and query for every single document. But the bigger win is quality. When the LLM scores documents one at a time, each score is independent. It picks a number on some arbitrary scale with no reference point. When it sees all the documents together, it compares them against each other in the same context. The ranking is relative, which is what you actually want.

Cross-Encoder Reranking

There is a faster alternative to using an LLM for reranking. A cross-encoder.

The embeddings we used for semantic search came from a bi-encoder. It embeds the query and the document separately, then you compare them with cosine similarity. A cross-encoder does something different. It takes the query and the document together as a single input and outputs a relevance score directly. No separate embeddings, no cosine similarity. Just one number.

Because the cross-encoder sees the query and document side by side in the same pass, it catches subtle relationships that bi-encoders miss. It understands how the query interacts with the document, not just how similar their embeddings are.

The big advantage over LLM reranking is speed. A cross-encoder is a small regression model. It does one thing: take a query-document pair, output a score. No generation, no chain of thought, no token-by-token output. It is much faster and cheaper than calling an LLM.

The other advantage is that you can fine-tune it on your own data. Train it on query-document pairs from your domain, and it learns the exact relevance patterns your users care about. That is harder and more expensive to do with an LLM. You can explore pre-trained cross-encoders and how to use them at sbert.net.
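Using a pre-trained cross-encoder through the sentence-transformers library looks like this. The ms-marco model named here is one common off-the-shelf choice:

from sentence_transformers import CrossEncoder

# Scores (query, document) pairs directly: one number per pair.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how to fix a deadlock"
candidates = [
    "concurrency issues and thread contention in multithreaded services",
    "how to bake sourdough bread at home",
]
scores = reranker.predict([(query, doc) for doc in candidates])
for score, doc in sorted(zip(scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc}")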

Evaluation

This is the most important part of your retrieval pipeline. Everything else you have built so far, the chunking, the embeddings, the hybrid search, the reranking, all of it is guesswork until you measure it.

Evaluation is the knob. It is what tells you whether your chunk size is right. Whether your chunk overlap is too small. Whether your alpha is wrong. Whether the reranker is making things better or worse. Without evaluation, you are tuning blind.

And the quality of your evaluation depends entirely on the quality of your dataset. A golden dataset of queries paired with their expected results. The better the dataset, the more you can trust the numbers. The more you can trust the numbers, the more confidently you can change things. Spend time here. Build the dataset carefully. It is the foundation everything else sits on.

Three metrics matter.

Precision

Your system returns 10 documents for a query. You compare them against your golden dataset. 7 of the 10 are actually relevant. Precision is 7 out of 10. That is 0.7.

precision = relevant_retrieved / total_retrieved

In practice you measure this as Precision@K, where K is the number of results the user actually sees. If your UI shows the top 5, you care about Precision@5. The documents beyond that do not matter because nobody looks at them.

Recall

Precision asks how much of what you returned is relevant. Recall asks a different question. How much of what is relevant did you actually find?

recall = relevant_retrieved / total_relevant

Say there are 20 relevant documents in your corpus for a query. Your system retrieved 8 of them. Recall is 8 out of 20. That is 0.4. You found less than half of what was there.

This matters more than most people think. If your pipeline does not retrieve the relevant documents in the first stage, nothing downstream can save you. Your reranker cannot rerank documents it never received. In medical, legal, and safety applications, missing relevant information is not just bad UX. It has consequences.

I always optimize for recall first. If the pipeline is not surfacing all the relevant documents, that is the bigger problem. Precision you can improve later with reranking. Missing documents you cannot fix after the fact.

The Tradeoff

Higher recall usually means lower precision. You retrieve more documents to make sure you do not miss anything, but that means more irrelevant ones slip in. Tighten precision and you start dropping relevant results. They pull in opposite directions.

The right balance is a product question as much as a technical one. What is worse for your users, seeing some irrelevant results or missing the answer entirely?

F1 Score

When you need a single number that captures both precision and recall, you use F1. It is the harmonic mean of the two.

f1 = 2 * (precision * recall) / (precision + recall)

F1 punishes imbalance. A system with 0.95 precision and 0.10 recall gets an F1 of 0.18. It looks great on precision but it is barely finding anything. F1 exposes that. A system with 0.70 precision and 0.70 recall gets an F1 of 0.70. Balanced and honest.

Use F1 when precision and recall matter equally. Use it when you want to compare different pipeline configurations against each other with one number. But if one metric matters more than the other for your use case, optimize for that directly instead of hiding behind an average.
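All three metrics come from the same two inputs: what you retrieved and what the golden dataset says is relevant. A sketch:

def precision_recall_f1(retrieved, relevant, k=None):
    # retrieved: ranked list of doc ids. relevant: set of golden doc ids.
    if k is not None:
        retrieved = retrieved[:k]   # Precision@K / Recall@K
    hits = len(set(retrieved) & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(precision_recall_f1(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))
# (0.5, 0.666..., 0.571...)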

Error Analysis

Metrics tell you something is wrong. They do not tell you where. When a query returns bad results, you need to trace it through the entire pipeline and find exactly where it broke.

Did the preprocessing strip something it should not have? Did query rewriting change the meaning? Did keyword search miss it? Did semantic search rank it low? Did the reranker push it down? The failure could be at any stage, and each stage has different fixes.

Add debug logging at every step. Make it optional so it does not slow things down in production, but make sure you can turn it on and see what each stage received, what it returned, and what got dropped. When something goes wrong, that trail is how you find it.

Overall Architecture

Overall retrieval pipeline architecture

Conclusion

Every retrieval algorithm exists because the previous one had a flaw.

If you are starting from scratch, start with BM25 for keyword search and contextual retrieval for semantic search. That is a strong baseline. From there, spend time building a good evaluation dataset. Give recall more weight than precision. The documents your pipeline misses are the ones that hurt you, and no amount of reranking can fix what was never retrieved.

Good evals are the only way to make your pipeline better. They are the knob you turn. Without them, every change is a guess.

That is the full arc. Each layer exists for a reason. If you understand the reason, you know when to use it and when to skip it.

The bare-metal implementation of everything covered here is at DeepRAG. Read the code. Trace the trail.