
    RAG with Hybrid Search: How Does Keyword Search Work?

By Editor Times Featured | March 4, 2026 | 11 Mins Read


In my previous posts, I’ve talked quite a lot about Retrieval-Augmented Generation (RAG). Specifically, I’ve covered the basics of the RAG methodology, as well as a number of related ideas, like chunking, embeddings, reranking, and retrieval evaluation.

The standard RAG methodology is so useful because it allows us to search for relevant pieces of text in a large knowledge base based on the meaning of the text rather than the exact words it contains. In this way, it lets us harness the power of AI on our own custom documents. Ironically, as useful as this similarity search is, it often fails to retrieve pieces of text that are exact matches to the user’s prompt. More specifically, when searching in a large knowledge base, specific keywords (such as particular technical terms or names) can get lost, and relevant chunks may not be retrieved even when the user’s query contains those exact words.

Fortunately, this issue can easily be tackled by utilising an older keyword-based search technique, like BM25 (Best Matching 25). Then, by combining the results of the similarity search and the BM25 search, we essentially get the best of both worlds and can significantly improve the results of our RAG pipeline.

    . . .

In information retrieval systems, BM25 is a ranking function used to evaluate how relevant a document is to a search query. Unlike similarity search, BM25 evaluates the document’s relevance to the user’s query based not on the semantic meaning of the document, but rather on the exact words it contains. More specifically, BM25 is a bag-of-words (BoW) model, meaning that it does not take into account the order of the words in a document (from which the semantic meaning emerges), but rather the frequency with which each word appears in the document.
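To make the bag-of-words idea concrete, here is a minimal Python sketch (the example sentence is my own): a document is reduced to term counts, and word order is discarded entirely.

```python
from collections import Counter

# Bag-of-words: a document becomes a multiset of terms; word order is lost.
doc = "time travel and more time travel"
bow = Counter(doc.split())
print(bow)  # 'time' and 'travel' each appear twice
```

Any two documents with the same words in a different order produce exactly the same bag-of-words representation.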

The BM25 score for a given query q containing terms t and a document d can be (not so) easily calculated as follows:

BM25(q, d) = Σ_{t ∈ q} IDF(t) · f(t, d) · (k1 + 1) / ( f(t, d) + k1 · (1 − b + b · |d| / avg(dl)) )

Since this expression can be a bit overwhelming, let’s take a step back and look at it piece by piece.

    . . .

Starting simple with TF-IDF

The fundamental concept underlying BM25 is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is a classic information retrieval concept that aims to measure how important a word is in a particular document of a knowledge base. In other words, it measures in how many documents of the knowledge base a term appears, allowing us to express how specific and informative a term is about a particular document. The rarer a term is in the knowledge base, the more informative it is considered to be for a particular document.

In particular, for a document d in a knowledge base and a term t, the Term Frequency TF(t, d) can be defined as follows:

TF(t, d) = f(t, d) / |d|

where f(t, d) is the number of times the term t appears in the document d, and |d| is the size of the document. The Inverse Document Frequency IDF(t) can be defined as follows:

IDF(t) = log( N / n(t) )

where N is the total number of documents in the knowledge base and n(t) is the number of documents containing the term t. Then, the TF-IDF score can be calculated as the product of TF and IDF:

TF-IDF(t, d) = TF(t, d) · IDF(t)

    . . .

Let’s do a quick example to get a better grip on TF-IDF. Let’s assume a tiny knowledge base containing three movies with the following descriptions:

1. “A sci-fi thriller about time travel and a dangerous journey across alternate realities.”
2. “A romantic drama about two strangers who fall in love during unexpected time travel.”
3. “A sci-fi adventure featuring an alien explorer forced to travel across galaxies.”

After removing the stopwords, we are left with the following terms in each document:

    • document 1: sci-fi, thriller, time, travel, dangerous, journey, alternate, realities
      • size of document 1, |d1| = 8
    • document 2: romantic, drama, two, strangers, fall, love, unexpected, time, travel
      • size of document 2, |d2| = 9
    • document 3: sci-fi, adventure, featuring, alien, explorer, forced, travel, galaxies
      • size of document 3, |d3| = 8
    • total documents in the knowledge base, N = 3

We can then calculate f(t, d), the raw count of each term in each document. Next, for each term, we also calculate the Document Frequency and the Inverse Document Frequency. And then, finally, we calculate the TF-IDF score of each term.

So, what can we get from this? Let’s look, for example, at the TF-IDF scores of document 1. The word ‘travel’ is not informative at all, since it is included in all documents of the knowledge base. On the flip side, words like ‘thriller’ and ‘dangerous’ are very informative, particularly for document 1, since they are only included in it.

In this way, the TF-IDF score provides a simple and straightforward way to identify and quantify the importance of the words in each document of a knowledge base. To put it differently, the higher the total score of the words in a document, the rarer the information in this document is compared to the information contained in all other documents in the knowledge base.
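The calculation above can be sketched in a few lines of Python. This is a minimal illustration over the toy knowledge base from the example (the helper names `tf`, `idf`, and `tfidf` are my own), not a production implementation:

```python
import math

# Toy knowledge base from the example above, with stopwords already removed.
docs = {
    "d1": ["sci-fi", "thriller", "time", "travel", "dangerous", "journey", "alternate", "realities"],
    "d2": ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    "d3": ["sci-fi", "adventure", "featuring", "alien", "explorer", "forced", "travel", "galaxies"],
}
N = len(docs)

def tf(term, doc):
    # Term frequency: raw count f(t, d), normalised by document size |d|.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log of N over the number of documents containing the term.
    n_t = sum(term in doc for doc in docs.values())
    return math.log(N / n_t)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

# 'travel' appears in every document, so its IDF (and TF-IDF) is zero:
print(tfidf("travel", docs["d1"]))
# 'thriller' appears only in d1, so it is highly informative for d1:
print(tfidf("thriller", docs["d1"]))
```

As expected, ‘travel’ scores exactly 0 for every document, while terms unique to a single document, like ‘thriller’, get the highest scores.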

    . . .

Understanding the BM25 score

In BM25, we utilise the TF-IDF concept in order to quantify how informative (how rare or important) each document in a knowledge base is with respect to a particular query. To do this, for the BM25 calculation, we only take into account the words of each document that are contained in the user’s query, and perform a calculation somewhat similar to TF-IDF.

BM25 uses the TF-IDF concept, but with a few mathematical tweaks in order to address two important weaknesses of TF-IDF.

    . . .

The first pain point of TF-IDF is that TF is linear in the number of times a term t appears in a document d, f(t, d), as is any function of the form:

TF(t, d) = a · f(t, d)

This means that as a term t appears more and more often in a document d, TF grows linearly, which, as you may imagine, can be problematic for large documents, where a term may appear over and over again without necessarily being correspondingly more important.

A simple way to solve this is to use a saturation curve instead of a linear function. This means that the output increases with the input but asymptotically approaches a maximum limit, unlike the linear function, where the output keeps increasing with the input forever.

Thus, we can rewrite TF in this form as follows, introducing a parameter k1, which allows us to control the frequency scaling:

TF(t, d) = f(t, d) / ( f(t, d) + k1 )

In this way, the parameter k1 introduces diminishing returns. That is, the first occurrence of the term t in a document has a large impact on the TF score, whereas the twentieth appearance only adds a small extra gain.

Nonetheless, this would result in values in the range 0 to 1. We can tweak this a bit more and add a (k1 + 1) factor in the numerator, so that the resulting values of TF are comparable with the initial definition of TF used in TF-IDF:

TF(t, d) = f(t, d) · (k1 + 1) / ( f(t, d) + k1 )
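The diminishing returns are easy to see numerically. A small sketch of the saturated TF above (with the length normalisation introduced later omitted, and k1 = 1.2):

```python
# Saturated term frequency used by BM25: early occurrences of a term matter a
# lot, later ones add less and less; the value never exceeds k1 + 1.
def saturated_tf(f, k1=1.2):
    return f * (k1 + 1) / (f + k1)

for f in [1, 2, 5, 20, 100]:
    print(f, round(saturated_tf(f), 3))
```

The first occurrence already yields 1.0, the second adds about 0.375, while going from the 19th to the 20th occurrence adds less than 0.01: the curve flattens toward the asymptote k1 + 1 = 2.2.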

    . . .

So far, so good, but one important piece of information still missing from this expression is the size of the document |d|, which was included in the initial calculation of TF. However, before adding the |d| term, we also need to modify it a little, since this is the second pain point of the initial TF-IDF expression. More specifically, the issue is that a knowledge base will contain documents of variable lengths |d|, resulting in the scores of different terms not being comparable. BM25 resolves this by normalising |d|. That is, instead of |d|, the following expression is used:

1 − b + b · |d| / avg(dl)

where avg(dl) is the average document length in the knowledge base. Moreover, b is a parameter in [0, 1] that controls the length normalisation, with b = 0 corresponding to no normalisation and b = 1 corresponding to full normalisation.

So, adding the normalised expression of |d|, we get the fancier version of TF used in BM25:

TF(t, d) = f(t, d) · (k1 + 1) / ( f(t, d) + k1 · (1 − b + b · |d| / avg(dl)) )

Typically, the parameter values used are k1 ≈ 1.2 to 2.0 and b ≈ 0.75.

    . . .

BM25 also uses a slightly altered expression for the IDF calculation:

IDF(t) = log( (N − n(t) + 0.5) / (n(t) + 0.5) + 1 )

This expression is derived by asking a better question. In the initial IDF calculation, we ask:

“How rare is the term?”

Instead, when trying to calculate the IDF for BM25, we ask:

“How much more likely is this term to appear in relevant documents than in non-relevant documents?”

The probability of a document containing the term t, in a knowledge base of N documents, can be expressed as:

P(t) = n(t) / N

We can then express the odds of a document containing the term t versus not containing it as:

n(t) / (N − n(t))

And then, taking the inverse, we end up with:

(N − n(t)) / n(t)

Similarly to the traditional IDF, we take the log of this expression to compress the extreme values. An additional transformation known as Robertson–Sparck Jones smoothing is also applied, adding 0.5 to both the numerator and the denominator, and in this way we finally get the IDF expression used in BM25.

    . . .

Eventually, we can calculate the BM25 score of a particular document d for a given query q that contains one or more terms t, by summing each query term’s contribution.

In this way, we can score the documents available in a knowledge base based on their relevance to a particular query, and then retrieve the most relevant documents.

All this is just to say that the BM25 score is something like the much more easily understood TF-IDF score, but a bit more refined. As a result, BM25 is very popular for performing keyword searches, and it is also what we use for keyword search in a RAG system.
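Putting all the pieces together, the full BM25 score can be sketched in plain Python. This is a minimal illustration of the formulas above over the toy movie knowledge base (in practice, a library such as rank_bm25 or a search engine would typically be used instead):

```python
import math

def bm25_score(query_terms, doc, docs, k1=1.2, b=0.75):
    # docs: list of tokenised documents; doc: the document being scored.
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for t in query_terms:
        n_t = sum(t in d for d in docs)                     # documents containing t
        idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)   # smoothed BM25 IDF
        f = doc.count(t)                                    # raw term frequency f(t, d)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)     # saturated, length-normalised TF
        score += idf * f * (k1 + 1) / denom
    return score

docs = [
    ["sci-fi", "thriller", "time", "travel", "dangerous", "journey", "alternate", "realities"],
    ["romantic", "drama", "two", "strangers", "fall", "love", "unexpected", "time", "travel"],
    ["sci-fi", "adventure", "featuring", "alien", "explorer", "forced", "travel", "galaxies"],
]
query = ["sci-fi", "thriller"]
scores = [bm25_score(query, d, docs) for d in docs]
print(scores)  # document 1 ranks highest: it contains both query terms
```

Document 1 wins because it matches both query terms, document 3 matches only ‘sci-fi’, and document 2 matches neither, so it scores exactly zero.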

    RAG with Hybrid Search

So, now that we have an idea of how BM25 works and how it scores the various documents in a knowledge base based on keyword frequency, we can further take a look at how BM25 scores are incorporated into a traditional RAG pipeline.

As discussed in several of my previous posts, a very simple RAG pipeline would look something like this:

Such a pipeline uses a similarity score (like cosine similarity) of embeddings in order to search for, find, and retrieve chunks that are semantically similar to the user’s query. While similarity search is very useful, it can sometimes miss exact matches. Thus, by incorporating a keyword search on top of the similarity search in the RAG pipeline, we can identify relevant chunks more effectively and comprehensively. This transforms our landscape as follows:

For each text chunk, apart from the embedding, we now also calculate a BM25 index, allowing for quick calculation of the respective BM25 scores for various user queries. In this way, for each user query, we can identify the chunks with the highest BM25 scores: that is, the chunks that contain the rarest, most informative words with respect to the user’s query, in comparison to all other chunks in the knowledge base.

Notice how we now match the user’s query against both the embeddings in the vector store (semantic search) and the BM25 index (keyword search). Different chunks are retrieved by the semantic search and the keyword search; the retrieved chunks are then combined, deduplicated, and ranked using rank fusion.
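One common choice for this last step is Reciprocal Rank Fusion (RRF), which combines ranked lists using only the rank positions, so the semantic and BM25 scores never need to be on the same scale. A minimal sketch (the chunk IDs are made up for illustration):

```python
# Reciprocal Rank Fusion: each retriever contributes 1 / (k + rank) per chunk;
# chunks ranked highly by either retriever float to the top of the fused list.
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_hits = ["c3", "c1", "c7"]   # top chunks from embedding similarity search
keyword_hits = ["c1", "c9", "c3"]    # top chunks from the BM25 index
print(rrf([semantic_hits, keyword_hits]))
```

Here ‘c1’ and ‘c3’ end up first because both retrievers rank them highly, while chunks found by only one retriever are kept but demoted; duplicates are merged automatically since scores accumulate per chunk ID.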

    . . .

On my mind

Integrating BM25 keyword search into a RAG pipeline allows us to get the best of both worlds: the semantic understanding of embeddings and the precision of exact keyword matching. By combining these approaches, we can retrieve the most relevant chunks more reliably, even from a larger knowledge base, ensuring that critical phrases, technical terms, or names are not overlooked. In this way, we can significantly improve the effectiveness of our retrieval process and make sure that no important relevant information is left behind.


Loved this post? Let’s be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee



    Source link
