Inspired by the ICLR 2026 blogpost/article, The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
As an Edinburgh-trained PhD in Data Retrieval from Victor Lavrenko’s Multimedia Data Retrieval Lab, where I trained in the late 2000s, I have long seen retrieval through the framework of traditional IR thinking:
- Did we retrieve at least one relevant chunk?
- Did recall go up?
- Did the ranker improve?
- Did downstream answer quality look acceptable on a benchmark?
These are still useful questions. But after reading the recent work on Bits over Random (BoR), I think they are incomplete for the agentic systems many of us are now actually building.
The ICLR blogpost sharpened something I had felt for a while in production LLM systems: retrieval quality should account both for how much good content we find and for how much irrelevant material we bring along with it. In other words, as we crank up recall we also increase the risk of context pollution.
What makes BoR useful is that it gives us a language for this. BoR tells us whether retrieval is genuinely selective, or whether we are achieving success mostly by stuffing the context window with more material. When BoR falls, it is a sign that the retrieved bundle is becoming less discriminative relative to chance. In practice, that often correlates with the model being forced to read more junk, more overlap, or more weakly relevant material.
The important nuance is that BoR does not directly measure what the model “feels” when reading a prompt. It measures retrieval selectivity relative to random chance. But lower selectivity usually goes hand in hand with more irrelevant context, more prompt pollution, more attention dilution, and worse downstream performance. Put simply, BoR helps tell us when retrieval is still selective and when it has started to degenerate into context stuffing.
That idea matters far more for RAG and agents than it did for classic search.
Why retrieval dashboards can mislead agent teams
One of the easiest traps in RAG is to look at your retrieval dashboard, see healthy metrics, and conclude that the system is doing well. You might see:
- high Success@K,
- strong recall,
- a good ranking metric,
- and a larger K seeming to improve coverage.
On paper things may look better but, in reality, the agent might actually behave worse. Your agent may suffer any number of maladies, such as diffuse answers to queries, unreliable tool use, or simply a rise in latency and token cost without any real user benefit.
This disconnect happens because most retrieval dashboards still reflect a human search worldview. They assume the consumer of the retrieved set can skim, filter, and ignore junk. Humans are surprisingly good at this. LLMs are not consistently good at it.
An LLM does not “browse” ten retrieved items and casually focus on the best two the way a strong analyst would. It processes the full bundle as prompt context. That means the retrieval layer is surfacing evidence that is actively shaping the model’s working memory.
This is why I think agent teams should stop treating retrieval as a back-office ranking problem and start treating it as a reasoning-budget allocation problem. When building performant agentic systems, the key question is both:
- Did we retrieve something relevant?
and:
- How much noise did we force the model to process in order to get that relevance?
That is the lens BoR pushes you toward, and I have found it to be a very useful one.
Context engineering is becoming a first-class discipline
One reason this paper has resonated with me is that it fits a broader shift already happening in practice. Software engineers and ML practitioners working on LLM systems are gradually becoming something closer to context engineers.
That means designing systems that decide:
- what should enter the prompt,
- when it should enter,
- in what form,
- with what granularity,
- and what should be excluded entirely.
In traditional software, we worry about memory, compute, and API boundaries. In LLM systems, we also need to worry about context purity. The context window is contested cognitive real estate.
Every irrelevant passage, duplicated chunk, weakly related example, verbose tool definition, and poorly timed retrieval result competes with the thing the model most needs to focus on. That is why I like the pollution metaphor. Irrelevant context contaminates the model’s workspace.
The BoR poster gives this intuition a more rigorous shape by telling us that we should stop evaluating retrieval solely by whether it succeeds. We should also ask how much better the retrieval is compared to chance, at the depth (top K retrieved items) that we are actually using. That is a very practitioner-friendly question.
Why tool overload breaks agents
This is where I think the BoR work becomes especially important for real-world agent systems.
In classic RAG, the corpus is often large. You may be retrieving from tens of thousands or millions of chunks. In that regime, random chance stays weak for longer. Tool selection is very different.
In an agent, the model may be choosing among 20, 50, or 100 tools. That sounds manageable until you realize that several tools are often vaguely plausible for the same task. Once that happens, dumping all tools into context is not thoroughness. It is confusion disguised as completeness.
I have seen this pattern repeatedly in agent design:
- the team adds more tools,
- descriptions become longer,
- overlap between tools increases,
- the agent starts making brittle or inconsistent choices,
- and the first instinct is to tune the prompt harder.
But often the real problem is architectural, not prompt-level. The model is being asked to pick from an overloaded context where distinctions are too weak and too numerous.
What BoR offers here is a useful way to formalize something people often feel only intuitively: there is a point where the selection task becomes so crowded that the model is no longer demonstrating meaningful selectivity.
That is why I strongly prefer agent designs with:
- Staged tool retrieval: narrowing the search in steps, first finding a small set of plausible tools, then making the final choice from that shortlist rather than from the full library at once.
- Domain routing: before the final tool choice, first deciding which broad area the task belongs to, such as search, CRM, finance, or coding, and only then selecting a specific tool within that domain.
- Compressed capability summaries: presenting each tool with a short, high-signal description of what it is for, when it should be used, and how it differs from nearby tools, instead of dumping long verbose specs into the prompt.
- Explicit exclusion of irrelevant tools: deliberately removing tools that are not appropriate for the current task so the model is not distracted by plausible but unnecessary options.
In my experience tool choice should be treated more like retrieval than like static prompt decoration.
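To make the staged approach concrete, here is a minimal route-then-shortlist sketch. The tool names, domains, and keyword scorer are all hypothetical placeholders, not a real API; a production system would score candidates with embeddings or a learned router rather than keyword counts.

```python
# Staged tool retrieval: Stage 1 routes the task to a broad domain,
# Stage 2 shortlists a few tools only within that domain.
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    domain: str   # e.g. "search", "crm", "finance"
    summary: str  # short, high-signal description

TOOLS = [
    Tool("web_search", "search", "find recent public information on the web"),
    Tool("doc_lookup", "search", "retrieve passages from the internal wiki"),
    Tool("crm_get_contact", "crm", "fetch a customer record by name or email"),
    Tool("invoice_total", "finance", "sum invoice line items for a customer"),
]

def route_domain(task: str) -> str:
    """Stage 1: pick a broad domain before any individual tool is considered."""
    keywords = {"search": ["find", "look up", "web"],
                "crm": ["customer", "contact"],
                "finance": ["invoice", "total", "payment"]}
    scores = {d: sum(k in task.lower() for k in ks) for d, ks in keywords.items()}
    return max(scores, key=scores.get)

def shortlist(task: str, k: int = 2) -> list[Tool]:
    """Stage 2: choose a small shortlist only within the routed domain."""
    domain = route_domain(task)
    candidates = [t for t in TOOLS if t.domain == domain]
    # A real system would rank candidates here; we simply truncate to k.
    return candidates[:k]

print([t.name for t in shortlist("find the onboarding doc on the web")])
# prints ['web_search', 'doc_lookup']
```

The point of the sketch is structural: the model that makes the final call never sees the CRM or finance tools at all, so they cannot pollute the decision.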
Understanding BoR through tool selection
One of the most useful things about BoR is that it sharpens what top-K really means in tool-using agents.
In document retrieval, increasing top-K typically means moving from top-5 passages to top-20 or top-50 from a very large corpus. In tool selection, the same move has a very different character. When an agent only has a modest tool library, increasing top-K may mean moving from a shortlist of 3 candidate tools, to 5, to 8, and eventually to the familiar but dangerous fallback: just give it all 15 tools to be safe.
That often improves recall or Success@K, because the correct tool is more likely to be somewhere in the visible set. But that improvement can be misleading. As K grows, you are not only helping the router. You are also making it easier for a random selector to include a relevant tool.
So the real question is not simply: Did top-8 contain a useful tool more often than top-3? The more important question is: Did top-8 improve meaningful selectivity, or did it mostly make the task easier through brute-force inclusion? That is exactly where BoR becomes useful.
A simple example makes the intuition clearer. Suppose you have 10 tools, and for a given class of task 2 of them are genuinely relevant. If you show the model just one tool, the random chance of surfacing a relevant one is 20 percent. At 3 tools, the random baseline rises sharply. At 5 tools, random inclusion is already fairly strong. At 10 tools, it is 100 percent, because you have shown everything. So yes, Success@K rises as K rises. But the meaning of that success changes. At low K, success indicates real discrimination. At high K, success may simply mean you included enough of the menu that failure became difficult.
That is what I mean by helping random chance rather than meaningful selectivity.
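The arithmetic in this example is easy to check. The sketch below computes the random-chance Success@K baseline with the standard hypergeometric formula, plus a BoR-style score as log2 of observed success over that baseline. The exact definition in the blogpost may differ, so treat this as an illustration of the metric's shape rather than the official formula.

```python
# Random-chance Success@K: probability that a random K-subset of n_total
# items contains at least one of the n_relevant relevant ones.
from math import comb, log2

def random_success_at_k(n_total: int, n_relevant: int, k: int) -> float:
    """P(random K-subset contains >= 1 relevant item), hypergeometric."""
    return 1.0 - comb(n_total - n_relevant, k) / comb(n_total, k)

def bits_over_random(observed: float, n_total: int, n_relevant: int, k: int) -> float:
    """One plausible BoR-style score: log2 of observed success over chance."""
    return log2(observed / random_success_at_k(n_total, n_relevant, k))

# 10 tools, 2 relevant, as in the example above:
for k in (1, 3, 5, 10):
    print(k, round(random_success_at_k(10, 2, k), 3))
# prints 1 0.2 / 3 0.533 / 5 0.778 / 10 1.0
```

A router with 90 percent Success@1 on this task beats chance by log2(0.9 / 0.2) ≈ 2.17 bits, while a perfect Success@10 of 100 percent beats chance by exactly zero bits: success at full exposure carries no information about selectivity.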
This matters because, with tools, the problem is worse than a misleading metric. When you show too many tools, the prompt gets longer, descriptions begin to overlap, the model sees more near-matches, distinctions become fuzzier, parameter confusion rises, and the chance of picking a plausible-but-wrong tool increases. So even though top-K recall improves, the quality of the final decision may get worse. This is the small-tool paradox: adding more candidate tools can increase apparent coverage while reducing the agent’s ability to choose cleanly.
A practical way to think about this is that tool selection typically falls into three regimes.
In the healthy regime, K is small relative to the number of tools, and the appearance of a relevant tool in the shortlist tells you the router actually did something useful. For example, 30 total tools, 2 or 3 relevant, and a shortlist of 3 or 4 still feels like genuine selection.
In the gray zone, K is large enough that recall improves, but random inclusion is also rising quickly. For example, 20 tools, 3 relevant, shortlist of 8. Here you may still gain something, but you should already be asking whether you are actually routing or merely widening the funnel.
Finally, there is the collapse regime, where K is so large that success mostly comes from exposing enough of the tool menu that random selection would also succeed often. If you have 15 tools, 3 relevant ones, and a shortlist of 12 or all 15, then “high recall” is no longer saying much. You are getting close to brute-force exposure.
Operationally, this pushes me toward a better question. In a small-tool system, I recommend avoiding the overexposure mindset that asks:
- How large must K be before recall looks good?
The better question is:
- How small can my shortlist be while still preserving strong task performance?
That mindset encourages disciplined routing.
In practice, that usually means routing first and choosing second, keeping the shortlist very small, compressing tool descriptions so distinctions are obvious, splitting tools into domains before final selection, and testing whether increasing K improves end-to-end task accuracy, not just tool recall. A useful sanity check is this: if giving the model all tools performs about the same as your routed shortlist, then your routing layer may not be adding much value. And if giving the model more tools improves recall but worsens overall task performance, you are likely in exactly the regime where K helps random chance more than real selectivity.
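That sanity check can be automated. The sketch below uses hypothetical offline-eval numbers (not real measurements) to pick the smallest shortlist size whose end-to-end accuracy stays near the best observed, rather than the K that maximizes tool recall.

```python
# Choose shortlist size by task accuracy, not tool recall. The recall and
# accuracy arrays are illustrative: recall keeps rising with K, while
# end-to-end accuracy peaks and then degrades as the menu gets crowded.
def smallest_good_k(ks: list[int], task_accs: list[float],
                    tolerance: float = 0.02) -> int:
    """Smallest K whose task accuracy is within `tolerance` of the best."""
    best = max(task_accs)
    for k, acc in zip(ks, task_accs):
        if acc >= best - tolerance:
            return k
    return ks[-1]  # unreachable when task_accs is non-empty

ks          = [1, 3, 5, 8, 15]
tool_recall = [0.62, 0.84, 0.91, 0.96, 1.00]  # keeps improving with K
task_acc    = [0.58, 0.74, 0.75, 0.71, 0.64]  # peaks at K=5, then falls

print(smallest_good_k(ks, task_acc))  # prints 3
```

In this made-up run, recall at K=15 looks perfect, but the diagnostic picks K=3: beyond that, the extra tools mostly help random chance while hurting the final decision.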
When the failure mode changes: large tool libraries
The large-tool case is different, and this is where an important nuance matters. A larger tool universe does not mean we should dump hundreds of tools into context and expect the system to work better. It just means the failure mode changes.
If an agent has 1,000 tools available and only a handful are relevant, then increasing top-K from 10 to 50 or even 100 may still represent meaningful selectivity. Random chance stays weaker for longer than it does in the small-tool case. In that sense, BoR is still useful: it helps stop us from mistaking broader exposure for better routing. It asks whether a larger shortlist reflects genuine selectivity, or whether it is merely helping by exposing a larger slice of the search space.
But BoR does not capture the whole problem here. With very large tool libraries, the issue may no longer be that random chance has become too strong. The issue may be that the model is simply drowning in options. A shortlist of 200 tools can still be better than random in BoR terms and yet still be a terrible prompt. Tool descriptions overlap, near-matches proliferate, distinctions become harder to maintain, and the model is forced to reason over a crowded semantic menu.
So BoR is valuable, but it is not sufficient on its own. It is better at telling us whether a shortlist is genuinely discriminative relative to chance than whether that shortlist is still cognitively manageable for the model. In large tool libraries, we therefore need both perspectives: BoR to measure selectivity, and downstream measures such as tool-choice quality, latency, parameter correctness, and end-to-end task success to measure usability.
The design implication is the same even though the reason differs. With small tool sets, broad exposure quickly becomes bad because it helps random chance too much. With very large tool sets, broad exposure becomes bad because it overwhelms the model. In both cases, the answer is not to stuff more into context. It is to design better routing.
My own rule of thumb: the model should see less, but cleaner
If I had to summarize the practical shift in one sentence, it would be this: for LLM systems, smaller and cleaner is often better than larger and more comprehensive.
That sounds obvious, but many systems are still designed as if “more context” is automatically safer. In reality, once a baseline level of useful evidence is present, additional retrieval can become harmful. It increases token cost and latency, but more importantly it widens the field of competing cues inside the prompt.
I have come to think about prompt construction in three layers:
Layer 1: mandatory task context
- The core instruction, constraints, and immediate user objective.
Layer 2: highly selective grounding
- Only the minimum supporting evidence or tool definitions needed for the next reasoning step.
Layer 3: optional overflow
- Material that is merely plausible, loosely related, or included “just in case.”
Most failures come from letting Layer 3 invade Layer 2. That is why retrieval should be judged not just by coverage, but by its ability to preserve a clean Layer 2.
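One way to keep Layer 3 from invading Layer 2 is to make it the only budget-gated part of prompt assembly. The sketch below is a toy: the function name, the whitespace token counting, and the example strings are all illustrative stand-ins, not a real tokenizer or API.

```python
# Toy three-layer prompt builder. Layers 1 and 2 (task + selective
# grounding) are always included; Layer 3 overflow is admitted only while
# a strict token budget holds, so "just in case" material is dropped first.
def build_prompt(task: str, grounding: list[str], overflow: list[str],
                 budget_tokens: int = 60) -> str:
    def tokens(s: str) -> int:
        return len(s.split())  # crude stand-in for a real tokenizer
    parts = [task] + grounding                # Layers 1 and 2: non-negotiable
    used = sum(tokens(p) for p in parts)
    for extra in overflow:                    # Layer 3: budget-gated
        if used + tokens(extra) > budget_tokens:
            break
        parts.append(extra)
        used += tokens(extra)
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Summarize why the login fails after a password reset.",
    grounding=["Ticket 4812: users report login failure right after reset."],
    overflow=["Loosely related FAQ entry about two-factor authentication."],
    budget_tokens=18,
)
print("FAQ" in prompt)  # prints False: the overflow item exceeded the budget
```

The design choice worth copying is the asymmetry: the budget never truncates Layers 1 or 2, only Layer 3, so under pressure the prompt degrades toward a cleaner core rather than a noisier one.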
Where I think BoR is especially useful
I do not see BoR as a replacement for all retrieval metrics. I see it as a very useful additional lens, especially in these cases:
1. Choosing K in production
- Many teams still increase top-K until recall looks good enough. BoR encourages a more disciplined question: at what point is increasing K mostly helping random chance rather than meaningful selectivity?
2. Evaluating agent tool routing
- This may be the most compelling use-case. Agents often fail not because no good tool exists, but because too many nearly relevant tools are presented simultaneously.
3. Diagnosing why downstream quality falls despite “better retrieval”
- This is the classic paradox. Coverage goes up. Final answer quality goes down. BoR helps explain why.
4. Comparing systems with different retrieval depths
- Raw success rates can be deceptive when one system retrieves far more material than another. BoR helps normalize for that.
5. Preventing overconfidence in benchmark results
- Some benchmarks may simply be too easy at the chosen retrieval depth. A strong-looking result may be closer to luck than we think.
Where I think BoR may be insufficient on its own
I like the paper, but I would not treat BoR as the final answer to retrieval evaluation. There are at least a few important caveats.
First, not every task only needs one good item. Some tasks genuinely require synthesis across multiple pieces of evidence. In those cases, a success-style view can understate the need for broader retrieval.
Second, retrieval usefulness is not binary. Two chunks may both count as “relevant,” while one is far more actionable, concise, or decision-useful for the model.
Third, prompt organization still matters. A noisy bundle that is carefully structured may perform better than a slightly cleaner bundle that is poorly ordered or badly formatted.
Fourth, the model itself matters. Different LLMs have different tolerance for clutter, different long-context behavior, and different tool-use reliability. A retrieval policy that pollutes one model may be acceptable for another.
Fifth, and this is especially relevant for large tool libraries, BoR tells us more about selectivity than about usability. A shortlist can still look meaningfully better than random and yet be too crowded, too overlapping, or too semantically messy for the model to use well.
So I would not use BoR in isolation. I would pair it with:
- downstream task accuracy,
- latency and token-cost analysis,
- tool-call quality,
- parameter correctness,
- and some explicit measure of prompt cleanliness or redundancy.
Still, even with those caveats, BoR contributes something important: it forces us to stop confusing coverage with selectivity.
How this changes evaluation practice for me
The biggest practical shift is that I would now evaluate retrieval systems more like this:
- First, look at standard retrieval metrics. They still matter. You should ideally consider a bag-of-metrics approach, leveraging multiple complementary metrics.
Then ask:
- What is the random baseline at this depth?
- Is higher Success@K actually demonstrating skill, or just easier conditions?
- How much extra context did we add to get that gain?
- Did downstream answer quality improve, stay flat, or worsen?
- Are we making the model reason, or merely making it read more?
For agents, I would go even further:
- How many tools were visible at decision time?
- How much overlap existed between candidate tools?
- Could the system have routed first and selected second?
- Was the model asked to choose from a clean shortlist, or from a crowded menu?
That is a more realistic evaluation setup for the kinds of systems many teams are actually deploying.
The broader lesson
The main lesson I took from the ICLR poster is much broader than a single new metric: it’s that LLM system quality depends heavily on the cleanliness of the context we construct around the model. That has consequences across the Agentic stack:
- retrieval,
- memory,
- tool routing,
- agent planning,
- multi-step workflows,
- and even UI design for human-in-the-loop systems.
The best LLM systems will be the ones that expose the right information, at the right moment, in the smallest clean bundle that still supports the task. That is what good context engineering looks like.
Final thought
For years, retrieval was mostly about finding needles in haystacks. For LLM systems, that is no longer enough. Now the job is also to avoid dragging half the haystack into the prompt along with the needle.
That is why I think the BoR idea matters and is so impactful. It gives practitioners a better language for a real production problem: how to measure when useful context has quietly turned into polluted context. And once you start looking at your systems that way, a lot of familiar agent failures begin to make much more sense.
BoR does not directly measure what the model “feels” when reading a prompt, but it does tell us when retrieval is ceasing to be meaningfully selective and starting to resemble brute-force context stuffing. In practice, that is often exactly the regime where LLMs begin to read more junk, reason less cleanly, and perform worse downstream.
More broadly, I think this points to an important emerging sub-field: developing better metrics for measuring LLM system performance in realistic settings, not just model capability in isolation. We have become reasonably good at measuring accuracy, recall, and benchmark performance, but much less good at measuring what happens when a model is forced to reason through cluttered, overlapping, or weakly filtered context.
That, to me, exposes a real gap. BoR helps measure selectivity relative to chance, which is valuable. But there is still a missing concept around what I would term cognitive overload: the point at which a model may still have the right information somewhere in view, yet performs worse because too many competing options, snippets, tools, or cues are presented at once. In other words, the failure is no longer just retrieval failure. It is a reasoning failure induced by prompt pollution.
I suspect that better ways of measuring this kind of cognitive overload will become increasingly important as agentic systems grow more complex. The next leap forward may not just come from larger models or bigger context windows, but from better ways of quantifying when the model’s working context has crossed the line from useful breadth into harmful overload.
Disclaimer: The views and opinions expressed in this article are solely my own and do not represent those of my employer or any affiliated organisations. The content is based on personal reflections and speculative thinking about the future of science and technology. It should not be interpreted as professional, academic, or investment advice. These forward-looking perspectives are intended to spark discussion and imagination, not to make predictions with certainty.

