
    Understanding Context and Contextual Retrieval in RAG

By Editor Times Featured | March 7, 2026 | 11 Mins Read


In my latest post, I showed how hybrid search can be used to considerably improve the effectiveness of a RAG pipeline. RAG in its basic form, using only semantic search over embeddings, can be very effective, allowing us to harness the power of AI on our own documents. However, semantic search, as powerful as it is, can sometimes miss exact matches of the user's query in large knowledge bases, even when they exist in the documents. This weakness of traditional RAG can be addressed by adding a keyword search component, such as BM25, to the pipeline. In this way, hybrid search, combining semantic and keyword search, yields much more comprehensive results and significantly improves the performance of a RAG system.
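As a quick illustration of how the two result lists can be merged, here is a minimal sketch; reciprocal rank fusion is one common choice for combining semantic and BM25 rankings, not necessarily the exact method any particular pipeline uses:

```python
# Reciprocal Rank Fusion (RRF): merge a semantic ranking and a keyword
# (BM25) ranking of chunk IDs into a single hybrid ranking.
def rrf_merge(semantic_ranking, keyword_ranking, k=60):
    """Return chunk IDs sorted by fused score; higher score = better."""
    scores = {}
    for ranking in (semantic_ranking, keyword_ranking):
        for rank, chunk_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank + 1); chunks ranked
            # highly by either retriever rise to the top.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge(["c1", "c2", "c3"], ["c3", "c1", "c4"])
```

A chunk that appears near the top of both lists (here `c1`) outranks one that appears in only one of them, which is exactly the behaviour we want from hybrid search.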

Be that as it may, even with hybrid search, RAG can still sometimes miss important information that is scattered across different parts of the document. This can happen because, when a document is broken down into text chunks, the context (that is, the surrounding text that forms part of the chunk's meaning) is sometimes lost. This is especially likely for complex text, whose meaning is interconnected and spread across multiple pages and inevitably cannot be wholly contained within a single chunk. Think, for instance, of referencing a table or an image across several different text sections without explicitly stating which table is meant (e.g., "as shown in the Table, earnings increased by 6%", but which table?). As a result, when the text chunks are later retrieved, they arrive stripped of their context, sometimes resulting in the retrieval of irrelevant chunks and the generation of irrelevant responses.

This loss of context was a major issue for RAG systems for quite some time, and several not-so-successful solutions were explored to address it. An obvious attempt is increasing chunk size, but this often also dilutes the semantic meaning of each chunk and ends up making retrieval less precise. Another approach is increasing chunk overlap. While this helps preserve context, it also increases storage and computation costs. Most importantly, it does not fully solve the problem: important connections to the chunk can still lie beyond the chunk boundaries. More advanced approaches attempting to solve this issue include Hypothetical Document Embeddings (HyDE) and the Document Summary Index. Nonetheless, these still fail to deliver substantial improvements.

Ultimately, an approach that effectively resolves this and significantly improves the results of a RAG system is contextual retrieval, originally introduced by Anthropic in 2024. Contextual retrieval aims to solve the loss of context by preserving the context of the chunks and, consequently, improving the accuracy of the retrieval step of the RAG pipeline.

    . . .

    What about context?

Before saying anything about contextual retrieval, let's take a step back and talk a little about what context actually is. Sure, we've all heard about the context of LLMs or context windows, but what are these about, really?

To be precise, context refers to all the tokens that are available to the LLM and on which it bases its prediction of the next token (remember, LLMs generate text by predicting it one token at a time). Thus, context includes the user prompt, the system prompt, instructions, skills, or any other guidelines influencing how the model produces a response. Importantly, the part of the final response the model has produced so far is also part of the context, since each new token is generated based on everything that came before it.

Naturally, different contexts lead to very different model outputs. For example:

    • 'I went to a restaurant and ordered a' might output 'pizza.'
    • 'I went to the pharmacy and bought some' might output 'medicine.'

A fundamental limitation of LLMs is their context window. The context window of an LLM is the maximum number of tokens that can be passed at once as input to the model and taken into account to produce a single response. Some LLMs have larger context windows, some smaller. Modern frontier models can handle hundreds of thousands of tokens in a single request, whereas earlier models often had context windows as small as 8k tokens.

In an ideal world, we would simply pass all the information the LLM needs to know in the context, and we would most likely get excellent answers. And this is true to some extent: a frontier model like Opus 4.6 with a 200k-token context window can take in roughly 500-600 pages of text. If all the information we need to provide fits within this limit, we can indeed just include everything as-is as input to the LLM and get a great answer.
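That page estimate can be sanity-checked with simple arithmetic; the conversion rates below (roughly 0.75 English words per token and about 300 words per page) are common rules of thumb, not figures from the article:

```python
# Rough estimate of how many pages of text fit in a context window.
# Assumed rates: ~0.75 English words per token, ~300 words per page.
def pages_in_context(context_tokens, words_per_token=0.75, words_per_page=300):
    return context_tokens * words_per_token / words_per_page

pages = pages_in_context(200_000)  # a 200k-token window
```

With these assumptions a 200k-token window comes out to about 500 pages, consistent with the 500-600 page figure above.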

The issue is that most real-world AI use cases require some form of knowledge base far beyond this threshold (think, for instance, of legal libraries or manuals for technical equipment). Since models have these context window limitations, we unfortunately cannot just pass everything to the LLM and let it magically answer; we have to somehow select the most important information to include in our limited context window. And that is essentially what the RAG methodology is all about: selecting the right information from a large knowledge base so as to effectively answer a user's query. Ultimately, this emerges as an optimization/engineering problem, context engineering: identifying the right information to include in a limited context window so as to produce the best possible responses.

This is the most crucial part of a RAG system: making sure the right information is retrieved and passed as input to the LLM. This can be achieved with semantic search and keyword search, as already explained. But even after bringing in all semantically relevant chunks and all exact matches, there is still a good chance that some important information is left behind.

But what kind of information would that be? Since we have covered meaning with semantic search and exact matches with keyword search, what other kind of information is there to consider?

Different documents with inherently different subject matter may contain passages that are similar or even identical. Consider a recipe book and a chemical processing manual, both instructing the reader to 'Heat the mixture slowly'. The semantic meaning of such a text chunk and the exact words are very similar, if not identical. In this example, what shapes the meaning of the text and allows us to distinguish between cooking and chemical engineering is what we refer to as context.

Thus, this is the kind of extra information we aim to preserve. And this is exactly what contextual retrieval does: it preserves the context, the surrounding meaning, of each text chunk.

    . . .

    What about contextual retrieval?

So, contextual retrieval is a technique used in RAG that aims to preserve the context of each chunk. In this way, when a chunk is retrieved and passed to the LLM as input, we are able to preserve as much of its original meaning as possible: the semantics, the keywords, and the context, all of it.

To achieve this, contextual retrieval suggests that we first generate a helper text for each chunk, the contextual text, that allows us to situate the text chunk within the original document it comes from. In practice, we ask an LLM to generate this contextual text for each chunk. To do this, we provide the document along with the specific chunk in a single request to the LLM and prompt it to "provide the context to situate the specific chunk within the document". A prompt for generating the contextual text for our Italian Cookbook chunk would look something like this:

     
    <document>
    {the entire Italian Cookbook document the chunk comes from}
    </document>

    Here is the chunk we want to situate within the whole document.

    <chunk>
    {the specific chunk}
    </chunk>

    Please give a short, concise context that situates this chunk within the
    overall document, in order to improve search retrieval. Answer only with
    the concise context and nothing else.
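Programmatically, assembling this prompt for each chunk might look like the sketch below; `build_context_prompt` and the commented-out `call_llm` are hypothetical names, and the exact wording is just one way to phrase the template shown above:

```python
# Build the contextualizing prompt for one chunk. Sending it to an LLM
# (any chat-completion API) is left as a stub, since providers differ.
def build_context_prompt(document: str, chunk: str) -> str:
    return (
        "<document>\n"
        f"{document}\n"
        "</document>\n\n"
        "Here is the chunk we want to situate within the whole document.\n\n"
        "<chunk>\n"
        f"{chunk}\n"
        "</chunk>\n\n"
        "Please give a short, concise context that situates this chunk "
        "within the overall document, in order to improve search retrieval. "
        "Answer only with the concise context and nothing else."
    )

prompt = build_context_prompt(
    document="Italian Cookbook ...full text...",
    chunk="Heat the mixture slowly and stir regularly to prevent sticking.",
)
# context_text = call_llm(prompt)  # hypothetical LLM call, one per chunk
```

Note that the full document is repeated in every per-chunk request, which is exactly why the prompt caching discussed later matters for cost.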

The LLM returns the contextual text, which we combine with our initial text chunk. In this way, for each chunk of our original text, we generate a contextual text that describes how this specific chunk is situated within its parent document. For our example, this would be something like:

    Context: Recipe step for simmering homemade tomato pasta sauce.
    Chunk: Heat the mixture slowly and stir regularly to prevent it from sticking.

This is indeed far more informative and specific! Now there is no doubt about what this mysterious mixture is, because all the information needed to identify whether we are talking about tomato sauce or laboratory starch solutions is conveniently included within the same chunk.

From this point on, we treat the initial chunk text and its contextual text as an inseparable pair. The rest of the RAG-with-hybrid-search steps are then carried out essentially the same way. That is, for each text chunk, prepended with its contextual text, we create embeddings that are stored in a vector store, along with a BM25 index.
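To make the effect concrete, here is a toy sketch; it is not production code: `keyword_score` is a crude stand-in for BM25, and a real pipeline would also embed the combined text for semantic search:

```python
# Index the context/chunk pair as a single unit, then show why the
# prepended context helps keyword retrieval.
def contextualize(context: str, chunk: str) -> str:
    # The pair is stored, embedded, and indexed together.
    return f"Context: {context}\nChunk: {chunk}"

def keyword_score(query: str, text: str) -> int:
    # Toy stand-in for BM25: count query terms present in the text.
    terms = set(query.lower().split())
    words = set(text.lower().replace(".", "").split())
    return len(terms & words)

chunk = "Heat the mixture slowly and stir regularly."
ctx_chunk = contextualize(
    "Recipe step for simmering homemade tomato pasta sauce.", chunk
)

plain = keyword_score("tomato sauce recipe", chunk)           # bare chunk
contextual = keyword_score("tomato sauce recipe", ctx_chunk)  # with context
```

The bare chunk matches none of the query terms, while the contextualized chunk matches all three ("tomato", "sauce", "recipe"), so a query about tomato sauce now retrieves the cooking chunk rather than, say, the chemistry one.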

This approach, simple as it is, leads to remarkable improvements in the retrieval performance of RAG pipelines. According to Anthropic, contextual retrieval reduces the retrieval failure rate by an impressive 35%.

    . . .

Reducing cost with prompt caching

I hear you asking, "But isn't this going to break the bank?". Surprisingly, no.

Intuitively, we understand that this setup is going to significantly increase the cost of ingestion for a RAG pipeline, essentially doubling it, if not more. After all, we just added a bunch of extra calls to the LLM, didn't we? That is true to some extent: for each chunk, we now make an additional call to the LLM in order to situate it within its source document and obtain the contextual text.

However, this is a cost we pay only once, at the document ingestion stage. Unlike alternative techniques that attempt to preserve context at runtime, such as Hypothetical Document Embeddings (HyDE), contextual retrieval does the heavy lifting during document ingestion. Runtime approaches require extra LLM calls for every user query, which can quickly inflate latency and operational costs. In contrast, contextual retrieval shifts the computation to the ingestion phase, meaning that the improved retrieval quality comes with no extra overhead at query time. On top of this, further techniques can reduce the cost of contextual retrieval itself. More precisely, caching can be used to produce the summary of the document only once and then situate each chunk against that cached document summary.
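The summary-caching idea can be sketched as follows; `summarize`, `situate`, and the cache layout are hypothetical stand-ins for LLM calls, used only to show that the expensive per-document work runs once no matter how many chunks follow:

```python
# Compute a per-document summary once, cache it, and situate every chunk
# against the cached summary instead of re-sending the full document.
calls = {"summarize": 0}

def summarize(document: str) -> str:
    calls["summarize"] += 1          # track how often the expensive call runs
    return document[:80]             # placeholder for an LLM-generated summary

_summary_cache: dict[int, str] = {}

def situate(document: str, chunk: str) -> str:
    key = hash(document)
    if key not in _summary_cache:    # summary computed only on first chunk
        _summary_cache[key] = summarize(document)
    # Placeholder for the cheaper per-chunk LLM call against the summary.
    return f"Context (from summary): {_summary_cache[key][:40]} | Chunk: {chunk}"

doc = "Italian Cookbook: classic recipes for homemade pasta and sauces ..."
chunks = ["Heat the mixture slowly.", "Add the basil at the end.", "Serve warm."]
contextualized = [situate(doc, c) for c in chunks]
```

Provider-side prompt caching achieves a similar effect at the API level: the shared document prefix is cached by the provider, so only the per-chunk suffix is processed at full price on repeat calls.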

    . . .

On my mind

Contextual retrieval is a simple yet powerful improvement to traditional RAG systems. By enriching each chunk with a contextual text that pinpoints its semantic position within its source document, we dramatically reduce the ambiguity of each chunk and thus improve the quality of the information passed to the LLM. Combined with hybrid search, this technique allows us to preserve semantics, keywords, and context simultaneously.


Loved this post? Let's be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!

All images by the author, unless otherwise noted.


