This is about a failure that turned into something interesting.
For months, I (along with hundreds of others) have tried to build a neural network that could learn to detect when AI systems hallucinate: when they confidently generate plausible-sounding nonsense instead of actually engaging with the information they were given. The idea is simple: train a model to recognize the subtle signatures of fabrication in how language models respond.
But it didn't work. The learned detectors I designed collapsed. They learned shortcuts. They failed on any data distribution slightly different from training. Every approach I tried hit the same wall.
So I gave up on "learning" and started to wonder: why don't we turn this into a geometry problem? And that is what I did.
Backing Up
Before I get into the geometry, let me explain what we're dealing with, because "hallucination" has become one of those words that means everything and nothing. Here's the precise situation. You have a Retrieval-Augmented Generation system, a RAG system. When you ask it a question, it first retrieves relevant documents from some knowledge base. Then it generates a response that is supposed to be grounded in those documents.
- The promise: answers backed by sources.
- The reality: sometimes the model ignores the sources entirely and generates something that sounds reasonable but has nothing to do with the retrieved content.
This matters because the whole point of RAG is trustworthiness. If you wanted creative improvisation, you wouldn't bother with retrieval. You're paying the computational and latency cost of retrieval specifically because you want grounded answers.
So: can we tell when grounding failed?
Sentences on a Sphere
LLMs represent text as vectors. A sentence becomes a point in high-dimensional space: 768 embedding dimensions for the early models, though the exact number doesn't matter much (DeepSeek-V3 and R1 have an embedding size of 7,168). These embedding vectors are normalized. Every sentence, regardless of length or complexity, gets projected onto a unit sphere.
Once we think in this projection, we can work with angles and distances on the sphere. For example, we expect similar sentences to cluster together. "The cat sat on the mat" and "A feline rested on the rug" end up near each other. Unrelated sentences end up far apart. This clustering is how embedding models are trained.
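The sphere picture can be made concrete in a few lines of NumPy. The toy 3-d vectors below stand in for real 768-d embeddings; the numbers are illustrative inventions, not the output of any embedding model:

```python
import numpy as np

def normalize(x):
    """Project a vector onto the unit sphere."""
    return x / np.linalg.norm(x)

def angular_distance(u, v):
    """Angle in radians between two unit vectors: geodesic distance on the sphere."""
    # Clip guards against floating-point drift just outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

# Toy 3-d "embeddings" for three sentences.
cat_mat = normalize(np.array([0.90, 0.10, 0.10]))  # "The cat sat on the mat"
feline  = normalize(np.array([0.85, 0.15, 0.10]))  # "A feline rested on the rug"
taxes   = normalize(np.array([0.10, 0.20, 0.95]))  # "File your taxes by April"

# Paraphrases cluster; unrelated sentences land far apart.
assert angular_distance(cat_mat, feline) < angular_distance(cat_mat, taxes)
```

Real embedding models expose exactly this structure; the only change is that the vectors come from the model instead of being written by hand.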
So now consider what happens in RAG. We have three pieces of text (Figure 1):
- The question, q (one point on the sphere)
- The retrieved context, c (another point)
- The generated response, r (a third point)
Three points on a sphere form a triangle. And triangles have geometry (Figure 2).
The Laziness Hypothesis
When a model uses the retrieved context, what should happen? The response should depart from the question and move toward the context. It should pick up the vocabulary, framing, and concepts of the source material. Geometrically, this means the response should be closer to the context than to the question (Figure 1).
But when a model hallucinates, ignoring the context and generating something from its own parametric knowledge, the response stays in the question's neighborhood. It continues the question's semantic framing without venturing into unfamiliar territory. I called this semantic laziness. The response doesn't travel. It stays home. Figure 1 illustrates the laziness signature: question q, context c, and response r form a triangle on the unit sphere. A grounded response ventures toward the context; a hallucinated one stays home near the question. The geometry is high-dimensional, but the intuition is spatial: did the response actually go anywhere?
Semantic Grounding Index
To measure this, I defined a ratio:

SGI = θ(q, r) / θ(c, r)

where θ(·, ·) is the angular distance between two points on the unit sphere. I called it the Semantic Grounding Index, or SGI.

If SGI is greater than 1, the response departed toward the context. If SGI is less than 1, the response stayed close to the question, meaning the model wasn't able to find a way to explore the answer space and remained too close to the question (a kind of safe state). The SGI is just two angles and a division. No neural networks, no learned parameters, no training data. Pure geometry.
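As a sketch (my own minimal implementation, not the paper's reference code), the whole metric fits in a few lines; the toy vectors are illustrative:

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

def angle(u, v):
    """Angular distance between two unit vectors, in radians."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def sgi(q, c, r):
    """Semantic Grounding Index: how far the response traveled from the
    question, relative to its remaining distance from the context."""
    return angle(q, r) / angle(c, r)

q = normalize(np.array([1.0, 0.0, 0.0]))          # question
c = normalize(np.array([0.0, 1.0, 0.0]))          # retrieved context
grounded = normalize(np.array([0.2, 1.0, 0.1]))   # response near the context
lazy     = normalize(np.array([1.0, 0.1, 0.0]))   # response near the question

assert sgi(q, c, grounded) > 1.0   # departed toward the context
assert sgi(q, c, lazy) < 1.0       # stayed home
```

In practice q, c, and r would be the normalized embeddings of the question, the retrieved passages, and the generated answer from whatever embedding model you use.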

Does It Actually Work?
Simple ideas need empirical validation. I ran this on 5,000 samples from HaluEval, a benchmark where we know the ground truth: which responses are genuine and which are hallucinated.

I ran the same analysis with five completely different embedding models. Different architectures, different training procedures, different organizations: Sentence-Transformers, Microsoft, Alibaba, BAAI. If the signal were an artifact of one particular embedding space, these models would disagree. They didn't. The average correlation across models was r = 0.85 (from 0.80 to 0.95).

When the Math Predicted Something
Up to this point, I had a useful heuristic. Useful heuristics are fine. But what happened next turned a heuristic into something more principled. The triangle inequality. You probably remember it from school: the sum of any two sides of a triangle must be greater than the third side. This constraint applies on spheres too, though the formulation looks slightly different.

If the question and context are very close together, semantically similar, then there isn't much "room" for the response to differentiate between them. The geometry forces the two angles to be similar regardless of response quality. SGI values get squeezed toward 1. But when the question and context are far apart on the sphere? Now there is geometric space for divergence. Valid responses can clearly depart toward the context. Lazy responses can clearly stay home. The triangle inequality loosens its grip.
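A deterministic toy calculation (my own construction, not taken from the paper) makes the squeeze visible. Put q and c on a great circle, let a "grounded" response land a fixed offset past the context and a "lazy" one the same offset on the question's far side, and measure how far apart their SGI values can get:

```python
import numpy as np

def angle(u, v):
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def on_circle(theta):
    """Point on a unit great circle (embedded in 3-d), theta radians from q."""
    return np.array([np.cos(theta), np.sin(theta), 0.0])

def sgi_range(theta_qc, offset=0.3):
    """Spread between the SGI of a clearly grounded response and a clearly
    lazy one, as a function of the question-context separation theta_qc."""
    q = on_circle(0.0)
    c = on_circle(theta_qc)
    grounded = on_circle(theta_qc + offset)  # ventured past the context
    lazy = on_circle(-offset)                # stayed on the question's side
    sgi = lambda r: angle(q, r) / angle(c, r)
    return sgi(grounded) - sgi(lazy)

# The attainable spread of SGI grows with question-context separation:
assert sgi_range(0.1) < sgi_range(0.8) < sgi_range(1.5)
```

With q and c only 0.1 radians apart, even these best-case responses produce SGI values pinned near 1; at 1.5 radians the same construction spreads them wide apart.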
This implies a prediction:
SGI's discriminative power should increase as question-context separation increases.
The results confirm this prediction: a monotonic increase, exactly as the triangle inequality predicted.
| Question-Context Separation | Effect Size (d) | AUC |
| --- | --- | --- |
| Low (similar) | 0.61 | 0.72 |
| Medium | 0.90 | 0.77 |
| High (different) | 1.27 | 0.83 |
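The stratified analysis itself needs nothing beyond binning question-context pairs by θ(q, c) and computing AUC inside each bin. A rank-based AUC is a few lines of pure Python; the scores below are synthetic sanity checks, not the HaluEval numbers:

```python
def auc(scores_pos, scores_neg):
    """Probability that a randomly chosen grounded response outscores a
    randomly chosen hallucinated one (rank-based AUC; ties count half)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(scores_pos) * len(scores_neg))

# Synthetic SGI scores: grounded responses vs. hallucinated ones.
assert auc([2.0, 1.5, 1.2], [0.6, 0.8, 0.9]) == 1.0   # perfect separation
assert auc([1.0, 1.0], [1.0, 1.0]) == 0.5             # no separation
```

Running this within low-, medium-, and high-separation bins is exactly the computation behind the table above.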
This distinction carries epistemic weight. Observing behaviour in data after the fact offers weak evidence; such behaviour could reflect noise or analyst degrees of freedom rather than genuine structure. The stronger test is prediction: deriving what should happen from basic principles before analyzing the data. The triangle inequality implied a specific relationship between θ(q, c) and discriminative power. The empirical results confirmed it.
Where It Doesn't Work
TruthfulQA is a benchmark designed to test factual accuracy, with questions like "What causes the seasons?", correct answers ("Earth's axial tilt"), and common misconceptions ("Distance from the Sun"). I ran SGI on TruthfulQA. The result: AUC = 0.478. Slightly worse than random guessing.
Angular geometry captures topical similarity. "The seasons are caused by axial tilt" and "The seasons are caused by solar distance" are about the same topic. They occupy nearby regions of the semantic sphere. One is true and one is false, but both are responses that engage with the astronomical content of the question.
SGI detects whether a response departed toward its sources. It cannot detect whether the response got the facts right. These are fundamentally different failure modes. It's a scope boundary. And knowing your scope boundaries is arguably more important than knowing where your method works.
What This Means Practically
If you're building RAG systems, SGI correctly ranks hallucinated responses below valid ones about 80% of the time, without any training or fine-tuning.
- If your retrieval system returns documents that are semantically very close to the questions, SGI will have limited discriminative power. Not because it's broken, but because the geometry doesn't permit differentiation. Consider whether your retrieval is actually adding information or just echoing the query.
- Effect sizes roughly doubled for long-form responses compared to short ones. This is precisely where human verification is most expensive; reading a five-paragraph response takes time. Automated flagging is most valuable exactly where SGI works best.
- SGI detects disengagement. Natural language inference detects contradiction. Uncertainty quantification detects model confidence. These measure different things. A response can be topically engaged but logically inconsistent, or confidently wrong, or lazily correct by accident. Defense in depth.
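In a production pipeline, the practical upshot might look like a simple triage rule. The threshold and length cutoff below are illustrative placeholders, not tuned values from the paper:

```python
def flag_for_review(sgi_value: float, response_text: str,
                    sgi_threshold: float = 1.0, min_words: int = 100) -> bool:
    """Flag long responses whose SGI suggests the model never departed
    toward its sources. Long responses are where human review is most
    expensive, and where the effect sizes were largest."""
    is_long = len(response_text.split()) >= min_words
    return sgi_value < sgi_threshold and is_long

# A short lazy answer is cheap to eyeball; a long one gets flagged.
assert flag_for_review(0.7, "word " * 150)
assert not flag_for_review(0.7, "too short to bother flagging")
assert not flag_for_review(1.4, "word " * 150)
```

Such a rule would sit alongside NLI and uncertainty checks rather than replace them, in line with the defense-in-depth point above.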
The Scientific Question
I have a hypothesis about why semantic laziness happens. I want to be honest that it's speculation; I haven't confirmed the causal mechanism.
Language models are autoregressive predictors. They generate text token by token, each choice conditioned on everything before it. The question provides strong conditioning: familiar vocabulary, established framing, a semantic neighborhood the model knows well.
The retrieved context represents a departure from that neighborhood. Using it well requires confident bridging: taking concepts from one semantic region and integrating them into a response that started in another region.
When an LLM is uncertain about how to bridge, the path of least resistance is to stay home. Models generate something fluent that continues the question's framing without venturing into unfamiliar territory because that is statistically safe. As a consequence, the model becomes semantically lazy.
If this is right, SGI should correlate with internal model uncertainty: attention patterns, logit entropy, that sort of thing. Low-SGI responses should show signatures of hesitation. That's a future experiment.
Takeaways
- First: simple geometry can reveal structure that complex learned systems miss. I spent months trying to train hallucination detectors. The thing that worked was two angles and a division. Sometimes the right abstraction is the one that exposes the phenomenon most directly, not the one with the most parameters.
- Second: predictions matter more than observations. Finding a pattern is easy. Deriving what pattern should exist from first principles, then confirming it: that's how you know you're measuring something real. The stratified analysis wasn't the most impressive number in this work, but it was the most important.
- Third: boundaries are features, not bugs. SGI fails completely on TruthfulQA. That failure taught me more about what the metric actually measures than the successes did. Any tool that claims to work everywhere probably works nowhere reliably.
Honest Conclusion
I'm not sure whether semantic laziness is a deep truth about how language models fail, or just a useful approximation that happens to work for current architectures. The history of machine learning is littered with insights that seemed fundamental and turned out to be contingent.
But for now, we have a geometric signature of disengagement: a practical hallucination detector. It's consistent across embedding models. It's predictable from mathematical first principles. And it's cheap to compute.
That feels like progress.

Note: The scientific paper with full methodology, statistical analyses, and reproducibility details is available at https://arxiv.org/abs/2512.13771.
You can cite this work in BibTeX as:
@misc{marín2025semanticgroundingindexgeometric,
  title={Semantic Grounding Index: Geometric Bounds on Context Engagement in RAG Systems},
  author={Javier Marín},
  year={2025},
  eprint={2512.13771},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.13771},
}
Javier Marín is an independent AI researcher based in Madrid, working on reliability analysis for production AI systems. He tries to be honest about what he doesn't know. You can contact Javier at [email protected]. Any contribution will be welcomed!

