    RAG Is Blind to Time — I Built a Temporal Layer to Fix It in Production

By Editor Times Featured | May 9, 2026


A learner messaged me about a wrong answer.

She had asked the tutor about a concept from one of my Generative AI tutorials. The response looked fine. But it wasn't. I had already rewritten that content two months earlier. My RAG system pulled a version from six months ago: not obviously wrong, just wrong enough to mislead.

She thought she had misunderstood. She hadn't. My own system was teaching her from lessons I had already replaced.

I'm building a RAG-powered assistant for EmiTechLogic, my tech education platform, turning a content library into a system that generates answers directly from my own articles. I wrote about the initial architecture here. The initial architecture was the manageable part. The real challenge begins when real learners hit a live system.

When I pulled the retrieval logs, I saw exactly what had happened. Both versions were in the vector store. The old one ranked first because it had more matching tokens and a higher cosine similarity score. The updated version came in second. Sometimes third.

I expected the newer document to win automatically. That's not how cosine similarity works.

The system was doing exactly what it was designed to do, which turned out to be the problem.

The pattern held across other queries too. Python tutorials I had updated, model comparison guides I had revised. Outdated versions kept surfacing first. The AI tool I was building was quietly teaching people from lessons I had already replaced.

Here's what that looked like in practice: same query, same corpus, naive RAG:

QUERY: What are the API rate limits? Will I get a 429 error?

NAIVE RAG
  1. [policy_v1]          age=540d | EXPIRED | sim=0.447
     "API rate limits are set to 100 requests per minute..."
  2. [announcement_today] age=0d   | valid   | sim=0.329
  3. [tutorial_old]       age=600d | EXPIRED | sim=0.303

A 540-day-old expired document was sitting at the top. The live announcement from 48 hours ago was ranked second. The retriever didn't care about freshness. It only matched words.

I assumed freshness would be handled somewhere in the pipeline. It wasn't. Nobody had thought to add it.

This article is about how I fixed that. I built a temporal layer, a layer that sits between the vector search results and the LLM, and makes the system care about time.


    TL;DR

If you're short on time: vector search has no concept of when something was true. I fixed this by adding a reranking step between the retriever and the LLM, one that hard-removes expired facts, boosts active time-bounded signals, and uses exponential decay to favor newer documents. The tricky part was making sure "fresh" didn't override "relevant."

The one-line version: naive RAG finds what's similar; temporal RAG finds what's still true.

    Full code: https://github.com/Emmimal/temporal-rag/


Who this is for

Any RAG system where the knowledge base changes over time. If your system has ever given a confident answer from a document you had already updated, deprecated, or replaced, this is for you.

It matters most for API and product documentation, incident and outage management, customer support knowledge bases, internal wikis and policy systems, and education platforms where content evolves.

Skip it if your knowledge base is static and never changes. Skip it if your content has no concept of expiry, versions, or time-bounded signals. Skip it if a stale answer carries no real consequence.


Why Vector Search Has No Sense of Time

The standard RAG pipeline embeds documents, embeds the query, finds the closest matches, and sends them to the model [1, 2]. That works fine if your information never changes. But if you are constantly publishing new guides and rewriting old ones, this fails silently. You might not even notice until a user complains.

The vector store just knows the angle between the vectors [10]. It has no idea which document is six months old and which one I published last week.

The usual fixes are deleting old documents or adding metadata filters. I tried both. They helped for about two weeks, then I updated my content again and the same problem returned. A document with a 20% penalty can still rank first if its word overlap is strong enough: in the rate-limit example above, 0.447 × 0.8 = 0.358, still comfortably ahead of the fresh announcement's 0.329.

When I looked closer, I realized this wasn't one big problem. It was actually three separate problems, and each needs a different fix.

I had been collapsing all three into one bucket called "stale content" and applying the same fix to all of them. That's why nothing was sticking.


Three Time Problems, Three Different Fixes

1. Expiration: a fact that is now false

Some documents have an expiry date. Showing them after that date isn't a freshness issue. It's a lie. You can't just down-rank these. You have to remove them completely before the model ever sees them.

2. Temporality: facts that are only true right now

Some information matters intensely for a short window. A live notice about a site outage or a 48-hour policy change isn't just extra context. It's the most important document in your knowledge base while its window is open. An hour after it closes, it's false.

3. Versioning: a fact that has been replaced

This was my biggest problem. When I updated a document, both versions stayed in the vector store. The old one kept winning because it had more matching words. The fix here is neither removal nor boosting. Let time decay handle it. The newer document should naturally outscore the older one when recency is part of the ranking signal.

Problem     | Nature                    | Wrong fix       | Right fix
Expiration  | Fact is now false         | Down-weight     | Hard remove before ranking
Temporality | Fact is active and urgent | Treat as normal | Boost while window is open
Versioning  | Fact is outdated          | Hard remove     | Time decay ranks newer higher

I kept seeing the same pattern: old documents, expired documents, and short-term signals were all treated like the same problem. In practice, it behaved more like a set of rules than an actual temporal retrieval model.


How This Relates to Existing Research

I looked at existing approaches: graph-based retrieval, timestamped embeddings, recency priors baked into the retriever itself. Time-aware language models bake temporal signals directly into the model weights [5], while internet-augmented approaches fetch live documents at query time [3]. RealTime QA [4] frames the problem as one of answer currency rather than retrieval ranking. All of them required rebuilding infrastructure I didn't have. I needed something I could drop into the system I already had running.

So I built a post-retrieval layer instead: a reranking step [6] applied downstream of dense passage retrieval. No retriever changes. No new embedding model. No new infrastructure. All it requires is a timestamp on each document and one reranking step at query time.

I needed something working within days on a live platform, not a rebuild. This was that.


What I Built: A Temporal Layer

What I ended up building was a temporal layer that sits between the retriever and the LLM. The retriever stays unchanged. It still pulls the top 20 candidates by cosine similarity. The temporal layer receives these candidates, reclassifies them, and reranks them before any reach the model.

The multi-stage temporal retrieval architecture, showing the transformation of raw semantic search results into time-aware context through hybrid scoring and validity filtering. Image by Author.

That gap between the retriever and the LLM is where all the real work happens.


The Core Design: Two Orthogonal Axes

The key design decision is two independent classification axes, not one.

Axis 1: Validity State (3 States)

EXPIRED  -> was true, is no longer true. Hard remove before ranking.
VALID    -> true with no active time constraint. Normal scoring.
TEMPORAL -> true within a currently active time window. Boost.

Most systems run on two states: valid and expired. What I was missing was a separate TEMPORAL state for active time-bound signals. A maintenance notice isn't the same as a permanent rule. It's urgent and needs to surface first. Once the maintenance is over, the notice moves to EXPIRED and is removed.

You can find the full code for how this works in my project folder. Here is a simplified version of the main logic:

# TEMPORAL state is gated on document kind.
# Only EVENT documents reach TEMPORAL: not VERSIONED, not STATIC.
if self.kind == DocumentKind.EVENT:
    return ValidityState.TEMPORAL

return ValidityState.VALID  # VERSIONED docs with valid_from are still just VALID

The complete implementation with all edge cases is linked in the "Run It Yourself" section.

Axis 2: Document Kind (3 Kinds)

STATIC    -> timeless fact (definitions, math, reference material)
VERSIONED -> replaced by newer information (policies, tutorials, specs)
EVENT     -> true only within a time window (announcements, outages)

This distinction matters a lot. Without it, my first version classified a new company policy as a short-lived event and boosted it to the top of every search. A policy update behaves differently from a live outage notice, even when both are recent. It needs to be ranked normally and lose points slowly over time.

The fix: I made it so only events (like news or alerts) can get the urgency boost. Normal documents are never treated that way.

policy_v2:          kind=VERSIONED  state=valid     window=supersedes policy_v1
announcement_today: kind=EVENT      state=temporal  window=42h remaining

Even with identical timestamps, these documents behave differently because they represent different kinds of information.
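
To make the two axes concrete, here is a minimal, self-contained sketch of how the classification could fit together. The enum and field names mirror the simplified snippet above; the Document shape and the classify helper are illustrative assumptions, not the repository's exact API.

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class DocumentKind(Enum):
    STATIC = "static"        # timeless facts
    VERSIONED = "versioned"  # policies, tutorials, specs
    EVENT = "event"          # announcements, outages

class ValidityState(Enum):
    EXPIRED = "expired"      # hard remove before ranking
    VALID = "valid"          # normal scoring
    TEMPORAL = "temporal"    # boost while the window is open

@dataclass
class Document:
    doc_id: str
    kind: DocumentKind
    created_at: datetime
    valid_from: Optional[datetime] = None
    valid_until: Optional[datetime] = None

def classify(doc: Document, now: datetime) -> ValidityState:
    # Past its expiry? Remove it, whatever kind it is.
    if doc.valid_until is not None and now > doc.valid_until:
        return ValidityState.EXPIRED
    # Only EVENT documents inside an active window reach TEMPORAL.
    if doc.kind == DocumentKind.EVENT and doc.valid_until is not None:
        return ValidityState.TEMPORAL
    # VERSIONED and STATIC documents are just VALID, never boosted.
    return ValidityState.VALID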

The two-axis grid that drives the temporal layer: validity state (Expired, Valid, Temporal) crossed with document kind (Static, Versioned, Event). The only cell that receives a boost is Temporal × Event; all Expired cells are hard removed; Static and Versioned documents cannot reach the Temporal state. Validity state decides what to do with a document, document kind decides why. Image by Author.

The Scoring Formula

The final score for each document combines vector similarity with temporal signals:

final_score = semantic_penalty
            × [(1 − w) × vector_score
               + w × (decay_score × recency_score
                      × validity_multiplier × event_relevance_multiplier)]

Where:

vector_score: cosine similarity, normalized to fall between 0 and 1 relative to the candidate pool.

decay_score: exponential decay based on document age, a technique applied to document freshness scoring in information retrieval [10].

decay = 0.5 ^ (age_in_days / half_life_days)

You can also change how fast the score drops based on what the document is. For example, news fades away in just 7 days, but legal documents stay strong for a full year.
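
To see what the half-life does in practice, here is a quick comparison of a 30-day-old document under two of the profiles discussed later in this article. The function is just the formula above.

def decay_score(age_in_days: float, half_life_days: float) -> float:
    # The score halves every half_life_days.
    return 0.5 ** (age_in_days / half_life_days)

print(round(decay_score(30, 7), 3))    # news profile:  0.051, nearly gone after a month
print(round(decay_score(30, 365), 3))  # legal profile: 0.945, barely moved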

recency_score: a relative comparison within the current pool. The newest document gets the top score, the oldest gets the bottom. This ensures the system always prefers the freshest option available, not just documents that are fresh in absolute terms.

validity_multiplier, applied based on validity state:

EXPIRED  -> 0.0  (safety net; should already be filtered)
VALID    -> 1.0  (normal)
TEMPORAL -> 1.2  (boost for active EVENT signals)

event_relevance_multiplier, applied to EVENT documents only:

EVENT + TEMPORAL + raw_cosine >= floor -> 1.0  (full boost)
EVENT + TEMPORAL + raw_cosine <  floor -> 0.5  (boost halved)

semantic_penalty, applied to all document kinds:

normalized_score >= min_threshold -> 1.0  (no penalty)
normalized_score <  min_threshold -> 0.3  (relevance penalty)

w is temporal_weight: the balance between semantic relevance and temporal signals. I run it at 0.40 on my platform's tutor, meaning 60% of the score still comes from meaning, 40% from time.

Every candidate document passes through this scoring pipeline before reaching the LLM: the final score splits into a semantic penalty and a temporal component, weighted 60% vector and 40% temporal, with the temporal side built from the decay score, recency score, validity multiplier, and event relevance gate. Freshness is never allowed to override relevance. Image by Author.
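
Put together, the whole formula is a few lines of Python. This is a simplified sketch that reuses the enums from the classification sketch above and hard-codes the multiplier values; treat the threshold defaults as illustrative, since the repository version reads them from a config object.

def final_score(
    vector_score: float,    # cosine similarity, normalized to [0, 1] in the pool
    raw_cosine: float,      # unnormalized similarity, used by the EVENT gate
    decay_score: float,
    recency_score: float,
    state: ValidityState,
    kind: DocumentKind,
    temporal_weight: float = 0.40,
    event_floor: float = 0.20,
    min_threshold: float = 0.10,
) -> float:
    validity = {ValidityState.EXPIRED: 0.0,
                ValidityState.VALID: 1.0,
                ValidityState.TEMPORAL: 1.2}[state]

    # The EVENT boost is halved if the raw cosine misses the relevance floor.
    event_mult = 1.0
    if kind == DocumentKind.EVENT and state == ValidityState.TEMPORAL:
        event_mult = 1.0 if raw_cosine >= event_floor else 0.5

    # Fresh-but-irrelevant documents take a flat relevance penalty.
    penalty = 1.0 if vector_score >= min_threshold else 0.3

    temporal = decay_score * recency_score * validity * event_mult
    return penalty * ((1 - temporal_weight) * vector_score
                      + temporal_weight * temporal)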

The Failure That Revealed the EVENT Relevance Gate

After the first version was working, I noticed a new problem. A user asked about "engineering team health," but the top result was a notice about site maintenance.

The notice was new. But it had nothing to do with the question. It won simply because it was the freshest thing in the system. Being new isn't enough. The document also needs to be relevant.

Without some relevance gating, fresh signals started showing up in unrelated queries.

So I added a hard requirement: an event only gets its boost if its raw cosine score clears a minimum floor. If the content doesn't talk about the actual topic, the recency advantage disappears.

def _event_relevance_multiplier(self, doc, state, raw_vector_score) -> float:
    if doc.kind != DocumentKind.EVENT:
        return 1.0
    if state != ValidityState.TEMPORAL:
        return 1.0
    floor = self.config.event_min_raw_vector_score
    return 1.0 if raw_vector_score >= floor else 0.5

Why raw cosine and not normalized? Because it acts as an absolute ruler.

Normalized scores are relative. If all your results are weak, the "least bad" one might still score 80%. That's dangerous. Raw cosine doesn't care about the other documents. If a query about "team health" has almost nothing in common with a "technical update," the score stays near zero regardless.

reason: EVENT signal present but low query relevance
        (raw sim 0.101 < 0.2) - temporal boost halved

Threshold calibration note: the number you use as your relevance floor depends on the kind of embedding model you use.

• TF-IDF / sparse embeddings: use a floor around 0.20. Word-match scores are naturally lower.
• Dense models like text-embedding-3-small or all-MiniLM-L6-v2 [7]: use 0.35 to 0.50. These models score higher by default, so the floor needs to move up.
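
If you swap embedders, it can help to keep the floor next to the embedder choice rather than buried in the reranker. A hypothetical calibration table, with values taken from the ranges above:

# Hypothetical mapping: raw-cosine floor per embedding family.
EVENT_FLOOR_BY_EMBEDDER = {
    "tfidf": 0.20,                   # sparse word-match scores run low
    "all-MiniLM-L6-v2": 0.35,        # dense models score higher by default
    "text-embedding-3-small": 0.45,
}

floor = EVENT_FLOOR_BY_EMBEDDER.get("tfidf", 0.20)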

Four Scenarios: Before and After

These are the actual outputs from running demo.py on the same queries, two ways: naive RAG and temporal RAG.

Scenario 1: API Rate Limits (an expired answer is dangerous)

QUERY: What are the API rate limits? Will I get a 429 error?

NAIVE RAG
  1. [policy_v1]          age=540d | EXPIRED | sim=0.447
  2. [announcement_today] age=0d   | valid   | sim=0.329
  3. [tutorial_old]       age=600d | EXPIRED | sim=0.303

TEMPORAL RAG
  [announcement_today]
    age          : 0.3 days  |  kind: EVENT  |  state: temporal (active)
    window       : 42h remaining
    reason       : Active EVENT signal (42h remaining), overrides static sources
    FINAL SCORE  : 1.079

  [policy_v2]
    age          : 175.0 days  |  kind: VERSIONED  |  state: ✓ valid
    reason       : Latest version, supersedes policy_v1
    FINAL SCORE  : 0.573

  [news_recent]
    age          : 30.0 days  |  kind: STATIC  |  state: ✓ valid
    reason       : Fresh, open-ended fact, high confidence
    FINAL SCORE  : 0.509

  removed  : ['policy_v1', 'tutorial_old']
  surfaced : ['policy_v2', 'news_recent']

Naive RAG tells the user they'll hit 429 errors at 100 requests per minute. The actual limit is 1,000. Temporal RAG leads with the live maintenance announcement (rate limiting is currently suspended) and follows with the current policy.

Scenario 2: LLM Scaling Research

QUERY: Do larger language models keep improving with scale?

NAIVE RAG
  1. [tutorial_old]   age=600d | EXPIRED | sim=0.226
  2. [research_2022]  age=730d | valid   | sim=0.141
  3. [research_2026]  age=120d | valid   | sim=0.136

TEMPORAL RAG
  [research_2026]  STATIC ✓ valid  score=0.662
    reason: Stale, semantically relevant but low freshness weight
  [research_2022]  STATIC ✓ valid  score=0.600
    reason: Stale, semantically relevant but low freshness weight
  [news_old]       STATIC ✓ valid  score=0.476
    reason: Stale, semantically relevant but low freshness weight

  removed : tutorial_old
  surfaced: news_old

Naive RAG ranks a dead document first on word overlap. Temporal RAG removes it and puts the 2026 research at the top, where it belongs. The corpus documents in this scenario reflect the real shift in scaling research: the earlier plateau finding [8] was later revised by compute-optimal scaling studies [9].

Scenario 3: Company Health (one story vs the full picture)

QUERY: What is the current state of the engineering team and company health?

NAIVE RAG
  1. [news_old]      age=400d | valid   | sim=0.600
  2. [tutorial_new]  age=85d  | valid   | sim=0.385
  3. [tutorial_old]  age=600d | EXPIRED | sim=0.304

TEMPORAL RAG
  [news_old]     STATIC    ✓ valid  score=0.602
    reason: Stale, semantically relevant but low freshness weight
  [news_recent]  STATIC    ✓ valid  score=0.543
    reason: Fresh, open-ended fact, high confidence
  [tutorial_new] VERSIONED ✓ valid  score=0.519
    reason: Latest version, supersedes tutorial_old

  removed  : tutorial_old
  surfaced : news_recent

The live announcement didn't appear here because it failed the relevance gate. Its raw cosine was 0.165, below the 0.20 floor. But both news articles showed up, which is exactly right. The LLM can now read both and understand how things have changed over time. Naive RAG only surfaced the old story and two unrelated guides.

Scenario 4: Live Outages (urgent signal buried)

QUERY: Are there any current API outages or limit suspensions I should know about?

NAIVE RAG
  1. [policy_v1]          age=540d | EXPIRED | sim=0.390
  2. [policy_v2]          age=175d | valid   | sim=0.267
  3. [announcement_today] age=0d   | valid   | sim=0.101

TEMPORAL RAG
  [policy_v2]           VERSIONED ✓ valid    score=0.641
    reason: Latest version, supersedes policy_v1
  [announcement_today]  EVENT     temporal  score=0.465
    reason: EVENT signal present but low query relevance
            (raw sim 0.101 < 0.2) - temporal boost halved
  [news_recent]         STATIC    ✓ valid    score=0.082
    reason: Penalized: normalized vector score 0.000 below relevance threshold

  removed : policy_v1

Naive RAG buries the live update at position 3 behind an expired policy. Temporal RAG moves it to position 2. It didn't reach first because the word overlap between "outages" and "upgrades" was low. With dense embeddings instead of TF-IDF, it would likely have taken the top spot.


What broke next, and how I fixed it

Once the core temporal layer was working, real queries surfaced more surprises. Here's what broke next.

When a document is too old to stand alone but too useful to drop

Some documents weren't wrong, just old enough that I didn't want them answering alone. So I added a third action between SOLO and DROP: weak documents get retrieved only if a fresher source comes with them. Invalid ones never reach the model.

[Invalid] research_old     decay=0.100  → DO NOT RETRIEVE
[Weak]    research_weak    decay=0.351  → PAIR WITH research_fresh (gain=+0.540)
[Good]    research_fresh   decay=0.891  → RETRIEVE
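
A sketch of that three-way decision. The two decay cutoffs (0.15 and 0.40) are illustrative boundaries chosen to match the demo output above, not the repository's exact numbers.

def retrieval_action(decay: float, has_fresher_partner: bool) -> str:
    # Below the invalid cutoff: the document never reaches the model.
    if decay < 0.15:
        return "DO NOT RETRIEVE"
    # Weak band: usable only alongside a fresher source on the same topic.
    if decay < 0.40:
        return "PAIR" if has_fresher_partner else "DO NOT RETRIEVE"
    return "RETRIEVE"

print(retrieval_action(0.100, False))  # research_old   -> DO NOT RETRIEVE
print(retrieval_action(0.351, True))   # research_weak  -> PAIR
print(retrieval_action(0.891, False))  # research_fresh -> RETRIEVE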

When the score looks good but the answer isn't certain

A high score doesn't mean high confidence. When two documents score 0.73 and 0.72 but contradict each other, the system shouldn't act certain. I added confidence tiers that check the margin and flag conflicts: a close race or a contradiction drops the result to LOW regardless of the raw score.

policy_v3 - clear winner:            confidence 0.7485  → HIGH
policy_v3 - conflict, narrow margin: confidence 0.4727  → LOW
math_theorem:                        confidence 0.6992  → MEDIUM

The second policy_v3 row is the one that matters: the score went up from the conflict boost, the confidence went down because the conflict is a warning signal.
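
A minimal version of the tier logic. The 0.05 margin and the 0.70 tier cutoff are assumptions that happen to reproduce the three demo rows; the runner-up scores below are made up, and the conflict flag is computed elsewhere.

def confidence_tier(confidence: float, runner_up: float, has_conflict: bool) -> str:
    # A contradiction or a near-tie caps the tier at LOW, whatever the score.
    if has_conflict or (confidence - runner_up) < 0.05:
        return "LOW"
    return "HIGH" if confidence >= 0.70 else "MEDIUM"

print(confidence_tier(0.7485, 0.60, False))  # HIGH
print(confidence_tier(0.4727, 0.45, True))   # LOW  (conflict flag fired)
print(confidence_tier(0.6992, 0.55, False))  # MEDIUM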

Knowing why something was rejected

When the system rejects a document I want to know exactly which rule fired and on which query. I added a failure log keyed by query_id.

Failure summary (3 rejections, query_id=d211ffdc)
  EXPIRED_VERSIONED_DOC   × 1   doc=expired_policy
  STALE_STATIC_DOC        × 1   doc=stale_reference
  BELOW_RELEVANCE_GATE    × 1   doc=fresh_irrelevant

Codes in use: EXPIRED_VERSIONED_DOC, STALE_STATIC_DOC, HARD_EXPIRED_EVENT, BELOW_RELEVANCE_GATE, OUT_OF_TIME_RANGE, PAIR_PARTNER_NOT_FOUND. This is what I open first when something surfaces the wrong document.

When the fact changed significantly between versions

Replacing "100 requests per minute" with "1,000 requests per minute" is not a wording change. I added conflict severity detection that boosts the winner's score and simultaneously lowers its confidence, so the correct answer surfaces but the model stays cautious.

'100'  → '5000'    severity=0.980   boost=+0.196   conf_pen=-0.098   (50×, severe)
'1000' → '500'     severity=0.500   boost=+0.100   conf_pen=-0.050
'1000' → '1000'    severity=0.000   boost=0        conf_pen=0
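
One formula that reproduces those numbers is severity as one minus the ratio of the smaller value to the larger, with the boost and the confidence penalty scaled from it. This is my reading of the demo output, so treat the 0.2 and 0.1 scale factors as assumptions:

def conflict_severity(old_value: float, new_value: float):
    ratio = min(old_value, new_value) / max(old_value, new_value)
    severity = 1.0 - ratio           # '100' -> '5000': 1 - 100/5000 = 0.980
    boost = 0.2 * severity           # winner's score rises
    conf_penalty = -0.1 * severity   # its confidence drops
    return round(severity, 3), round(boost, 3), round(conf_penalty, 3)

print(conflict_severity(100, 5000))  # (0.98, 0.196, -0.098)
print(conflict_severity(1000, 500))  # (0.5, 0.1, -0.05)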

When the user specifies a time range

A learner typed "show me research from 2021 to 2023." The system returned the three most recent documents, none from that range. Temporal decay made it worse, ranking newer documents higher when older ones were exactly what was asked for.

I added a time-range parser that applies a strict filter when the query signals a date window, and steps aside when it doesn't. I didn't want it to guess.

'Show me research from 2021-2023'  → kept: research_2022
'What were the findings in 2019?'  → kept: research_2019
'Latest embeddings research'       → no filter, all docs pass
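
A stripped-down version of the parser. It only acts on an explicit year or year range and returns None otherwise, so the normal scoring runs untouched; real queries need more patterns than these two.

import re
from typing import Optional, Tuple

YEAR = r"(?:19|20)\d{2}"

def parse_time_range(query: str) -> Optional[Tuple[int, int]]:
    # Explicit range: "2021-2023", "from 2021 to 2023"
    m = re.search(rf"({YEAR})\s*(?:-|to)\s*({YEAR})", query)
    if m:
        return int(m.group(1)), int(m.group(2))
    # Single year: "in 2019"
    m = re.search(rf"\b({YEAR})\b", query)
    if m:
        year = int(m.group(1))
        return year, year
    return None  # no time signal: step aside

print(parse_time_range("Show me research from 2021-2023"))  # (2021, 2023)
print(parse_time_range("What were the findings in 2019?"))  # (2019, 2019)
print(parse_time_range("Latest embeddings research"))       # None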

When the query tells you how much recency should matter

"What is the current rate limit?" needs the freshest answer available. "How does cosine similarity work?" doesn't care if I wrote it three years ago. I was applying the same temporal weight to both. The weight now adjusts based on signal phrases in the query.

'What is the current rate limit?'          → temporal_weight: 0.70
'Has the rate limit changed recently?'     → temporal_weight: 0.55
'How does cosine similarity work?'         → temporal_weight: 0.20 (baseline)
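
The adjustment is a keyword scan, nothing smarter. A sketch, where the phrase lists and the exact weights are assumptions to illustrate the shape:

RECENCY_SIGNALS = {
    0.70: ("current", "latest", "right now", "today"),  # user explicitly wants now
    0.55: ("recently", "changed", "anymore"),           # user asks about change
}

def temporal_weight_for(query: str, baseline: float = 0.20) -> float:
    q = query.lower()
    # Check the strongest signals first.
    for weight in sorted(RECENCY_SIGNALS, reverse=True):
        if any(phrase in q for phrase in RECENCY_SIGNALS[weight]):
            return weight
    return baseline

print(temporal_weight_for("What is the current rate limit?"))       # 0.70
print(temporal_weight_for("Has the rate limit changed recently?"))  # 0.55
print(temporal_weight_for("How does cosine similarity work?"))      # 0.20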

Seeing inside the system, and keeping version conflicts out of context

I wanted to know not just where each document ranked, but what to do with it. The freshness report gives kind-aware advice per document:

fresh_event    [EVENT]     grade: A   → Verify before serving, window closes soon
current_policy [VERSIONED] grade: D   → Check for a newer version
math_theorem   [STATIC]    grade: F   → May have been outdated

The final problem was subtler. Even with good reranking, the LLM produced hedged or averaged answers when v1 and v3 of the same policy both ended up in context. It doesn't know which version to trust; it tries to reconcile everything it sees. What solved it was deduplicating by version chain before documents reached the temporal layer at all.

Input: policy_v1 (v1), policy_v2 (v2), policy_v3 (v3)
  policy_v1 - EXPIRED → removed
  policy_v2 - superseded by v3 → removed
  policy_v3 - kept ✓

Result: ['policy_v3']

Policy v3 goes in. The conflict never comes up.
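
The dedup step is simple once each document carries a chain identifier and a version number. A sketch, with the chain_id and version fields as assumed metadata:

from collections import defaultdict

def dedupe_version_chains(docs: list) -> list:
    # Keep only the highest non-expired version from each chain.
    chains = defaultdict(list)
    for doc in docs:
        chains[doc["chain_id"]].append(doc)
    kept = []
    for chain in chains.values():
        live = [d for d in chain if not d["expired"]]
        if live:
            kept.append(max(live, key=lambda d: d["version"]))
    return kept

docs = [
    {"id": "policy_v1", "chain_id": "policy", "version": 1, "expired": True},
    {"id": "policy_v2", "chain_id": "policy", "version": 2, "expired": False},
    {"id": "policy_v3", "chain_id": "policy", "version": 3, "expired": False},
]
print([d["id"] for d in dedupe_version_chains(docs)])  # ['policy_v3']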


Not All Content Decays at the Same Rate

One thing became clear quickly when I applied this to the platform: a single half-life value doesn't work for all content types. A breaking update and a mathematical definition age very differently, and treating them the same way was quietly sabotaging the rankings.

breaking_news:  half_life=1d,     temporal_weight=0.70
news:           half_life=7d,     temporal_weight=0.55
policy:         half_life=90d,    temporal_weight=0.45
research:       half_life=180d,   temporal_weight=0.35
legal:          half_life=365d,   temporal_weight=0.25
reference:      half_life=1825d,  temporal_weight=0.10
mathematics:    half_life=36500d, temporal_weight=0.01
Exponential decay over 365 days for the seven content types: breaking news drops to near zero within days, news within weeks; policy, research, legal, and reference decay progressively slower; mathematics stays nearly flat across the whole year. A breaking news post and a math theorem are both "old" after a year, but only one of them is wrong. Image by Author.

For breaking news, being new is basically the whole point. For a math proof, age doesn't matter: a theorem from 70 years ago is just as valid as one from last week. On EmiTechLogic I group my content into bands: tutorials use the "policy" setting since newer is usually better, and reference material uses the "reference" setting since the facts don't expire. Getting this distinction right is what actually made the whole thing work.

There is one more constraint layered on top of the half-life: a decay floor. Without it, a math theorem from 1954 gets a decay score near zero, not because it's wrong but simply because it's old. The temporal component then drags its final score down even when the semantic match is strong. The floor prevents that. In the implementation, DECAY_FLOORS maps a (doc_type, kind) pair to a minimum decay value: mathematics/STATIC floors at 0.95, reference/STATIC at 0.70, research/STATIC at 0.10. Documents with no floor entry decay freely; documents with one never drop below their minimum. A cosine-similarity winner that happens to be old still competes on meaning rather than losing automatically on age.
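
Applied after the decay computation, the floor is a one-line clamp. The dictionary below holds the three floor values mentioned above; the lookup structure itself is a sketch, not the repository's exact table.

# (doc_type, kind) -> minimum decay value. No entry means decay freely.
DECAY_FLOORS = {
    ("mathematics", "STATIC"): 0.95,
    ("reference", "STATIC"): 0.70,
    ("research", "STATIC"): 0.10,
}

def floored_decay(raw_decay: float, doc_type: str, kind: str) -> float:
    return max(raw_decay, DECAY_FLOORS.get((doc_type, kind), 0.0))

# A decades-old theorem keeps competing on meaning, not age.
print(floored_decay(0.02, "mathematics", "STATIC"))  # 0.95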

The implementation cost is lower than you'd expect. The temporal reranking step adds roughly 15 to 30 milliseconds per search, negligible next to the 1 to 4 seconds LLM inference typically takes. You don't need to change your search engine, your data, or your embedding model. The entire temporal layer is a pure Python post-processing step that runs downstream of whatever vector search you're already using.

The only real upfront requirement is metadata on your documents. At minimum, every document needs a created_at timestamp. valid_from, valid_until, and kind give the best results, but they're optional: documents without any metadata fall back to STATIC/VALID with standard time-decay scoring, which is already better than nothing. On my platform I automated the tagging entirely. The system now distinguishes between an update, an alert, and a permanent fact without me labeling anything manually.
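
The metadata contract can stay tiny. A sketch of that fallback behavior, assuming raw documents arrive as plain dicts:

def with_defaults(raw: dict) -> dict:
    # Only created_at is mandatory; everything else degrades gracefully
    # to STATIC/VALID with plain time-decay scoring.
    return {
        "created_at": raw["created_at"],        # required
        "kind": raw.get("kind", "STATIC"),      # unknown kind -> STATIC
        "valid_from": raw.get("valid_from"),    # optional window start
        "valid_until": raw.get("valid_until"),  # optional expiry
    }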


What This Does Not Solve

A few honest caveats before you build this.

Implicit expiration is the one I still haven't fully solved. Most documents don't announce when they go stale: a tutorial for a deprecated endpoint has no expiry date, so the system can't know it's rotting. My heuristic rules catch the obvious cases, but edge cases slip through, and I find them the same way I found the original problem: a learner gets an answer that's quietly wrong.

Conflicting sources are outside the temporal layer's scope entirely. It surfaces the most recent and relevant documents; resolving disagreements between them is the LLM's problem, not the retriever's.

Calibration is model-specific in ways that can bite you. The 0.20 raw cosine floor is tuned for TF-IDF. Dense models like text-embedding-3-small score higher in absolute terms, so that floor needs to move to 0.35–0.50. Test against your own queries before you trust any threshold I've listed.

The half-life profiles are starting points, not constants. What "stale" means for a legal team is not what it means for a news site. Run the system on real queries from your domain and tune from there.


The Takeaway

The problem isn't that RAG systems retrieve wrong documents; it's that they have no concept of when a document was true, only how similar it is to the query.

Two axes drove the whole design, and the kind axis was the one I almost missed entirely. Validity state: whether a document is expired (remove it), valid (score normally), or temporal (boost it while its window is active). Document kind: whether it's a timeless fact (STATIC), something that has been replaced (VERSIONED), or something that's only true within a time window (EVENT).

Without the kind axis, a versioned policy with an effective date looks just like a time-bounded event and gets mislabeled. The system produces the wrong result for a right-sounding reason. That's the hardest class of bug to catch in production, because nothing looks broken.

The semantic threshold closes the last gap. Fresh-but-irrelevant documents can take over rankings when temporal scores are high. A minimum raw cosine floor for EVENT documents makes sure freshness never fully overrides relevance.

Similarity alone wasn't enough anymore. I needed the retriever to care about whether the information was still valid.


Run It Yourself

The full implementation (temporal_rag.py, demo.py, and advanced.py) is available at:

GitHub: https://github.com/Emmimal/temporal-rag/

The repository includes the complete validity_state implementation, all decay profiles, the SequenceAwareRetriever, and the freshness report API. The demo runs without any API key using a deterministic TF-IDF embedder, so you can reproduce the exact output shown above on any machine.

git clone https://github.com/Emmimal/temporal-rag/
cd temporal-rag
pip install numpy
python demo.py

References

Foundational RAG

[1] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-T., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459–9474. https://arxiv.org/abs/2005.11401

[2] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., & Wang, H. (2024). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997. https://doi.org/10.48550/arXiv.2312.10997

Temporal Reasoning in Language Models

[3] Lazaridou, A., Gribovskaya, E., Stokowiec, W., & Grigorev, N. (2022). Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115. https://doi.org/10.48550/arXiv.2203.05115

[4] Kasai, J., Sakaguchi, K., Takahashi, Y., Le Bras, R., Asai, A., Yu, X., Radev, D., Smith, N. A., Choi, Y., & Inui, K. (2022). RealTime QA: What's the answer right now? arXiv preprint arXiv:2207.13332. https://doi.org/10.48550/arXiv.2207.13332

[5] Dhingra, B., Cole, J. R., Eisenschlos, J. M., Gillick, D., Eisenstein, J., & Cohen, W. W. (2022). Time-aware language models as temporal knowledge bases. Transactions of the Association for Computational Linguistics, 10, 257–273. https://doi.org/10.1162/tacl_a_00459

Dense Retrieval and Reranking

[6] Nogueira, R., & Cho, K. (2019). Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. https://doi.org/10.48550/arXiv.1901.04085

[7] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3982–3992. https://doi.org/10.18653/v1/D19-1410

Scaling Laws (referenced in Scenario 2)

[8] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. https://doi.org/10.48550/arXiv.2001.08361

[9] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., & Sifre, L. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556. https://doi.org/10.48550/arXiv.2203.15556

Information Retrieval Fundamentals

[10] Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/


Disclosure

All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers and retrieval outputs are from actual demo runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running demo.py and advanced.py. The temporal layer, scoring formula, document classification system, and all design decisions are independent implementations not derived from any cited codebase. The demo runs without any API key using a deterministic TF-IDF embedder; numpy is the only external dependency required to reproduce all outputs shown. I have no financial relationship with any tool, library, or company mentioned in this article.


