
    RAG Isn’t Enough — I Built the Missing Context Layer That Makes LLM Systems Work

    By Editor Times Featured | April 14, 2026 | 15 Mins Read


    TL;DR

    A full working implementation in pure Python, with real benchmark numbers.

    RAG systems break when context grows beyond a few turns.

    The real problem is not retrieval; it is what actually enters the context window.

    A context engine controls memory, compression, re-ranking, and token limits explicitly.

    This isn’t a concept. It is a working system with measurable behaviour.


    The Breaking Point of RAG Systems

    I built a RAG system that worked perfectly, until it didn’t.

    The moment I added conversation history, everything started breaking. Relevant documents were getting dropped. The prompt overflowed. The model started forgetting things it had said two turns ago. Not because retrieval failed. Not because the prompt was badly written. But because I had zero control over what actually entered the context window.

    That’s the problem nobody talks about. Most RAG tutorials stop at: retrieve some documents, stuff them into a prompt, call the model. What happens when your retrieved context is 6,000 characters but your remaining budget is 1,800? What happens when three of your five retrieved documents are near-duplicates, crowding out the one useful one? What happens when turn one of a twenty-turn conversation is still sitting in the prompt, taking up space, long after it stopped being relevant?

    These aren’t rare edge cases. This is what happens by default, and it starts breaking within the first few turns.

    All results below are from real runs of the system (Python 3.12, CPU-only, no GPU), except where noted as calculated.

    The answer is a layer most tutorials skip entirely. Between raw retrieval and prompt construction, there’s a deliberate architectural step: deciding what the model actually sees, how much of it, and in what order. In 2025, Andrej Karpathy gave this a name: context engineering [2]. I’d been building it for months without calling it that.

    This is the system I built, from retrieval to memory to compression, with real numbers and code you can run.

    Full code: https://github.com/Emmimal/context-engine/


    What Context Engineering Actually Is

    It’s worth being precise, because the terms get muddled.

    Prompt engineering is the craft of what you say to the model: your system prompt, your few-shot examples, your output format instructions. It shapes how the model reasons.

    RAG is a technique for fetching relevant external documents and including them before generation. It grounds the model in facts it wasn’t trained on [1].

    Context engineering is the layer in between: the architectural decisions about what information flows into the context window, how much of it, and in what form. It answers: given everything that could go into this prompt, what should actually go in?

    All three are complementary. In a well-designed system they every have a definite job.


    Who This Is For

    This architecture is worth building if you are working on multi-turn chatbots where context accumulates across turns, RAG systems with large knowledge bases where retrieval noise is a real problem, or AI copilots and agents that need memory to stay coherent.

    Skip it for single-turn queries with a small knowledge base: the pipeline overhead doesn’t justify a marginal quality gain. Skip it for latency-critical services under 50ms: embedding generation alone adds ~85ms on CPU. Skip it for fully deterministic domains like legal contract analysis, where keyword-only retrieval is often sufficient and more auditable.

    If you have unlimited context windows and unlimited latency, plain RAG works fine. In production, neither is unlimited.


    Full Pipeline Architecture

    [Figure: A complete context engineering pipeline for RAG systems, combining retrieval, memory management, compression, and token budget control to build efficient and scalable LLM applications. Image by Author.]

    Component 1: The Retriever

    Most RAG implementations pick one retrieval method and call it done. The problem is that no single method dominates across all query types. Keyword matching is fast and precise for exact terms. TF-IDF handles term weighting. Dense vector embeddings catch semantic relationships that keywords miss entirely.

    Keyword vs. TF-IDF: Same Query, Different Behaviour

    For the query: “how does memory work in AI agents”

    Both methods agree on mem-001 as the top document. But there’s a critical difference: TF-IDF provides more nuanced scoring by weighting term rarity, while keyword retrieval only counts raw overlap. On this query they converge, but they diverge badly on conceptual queries with different wording. This is precisely why hybrid retrieval becomes necessary.

    The Retriever supports three modes: keyword, tfidf, and hybrid. Hybrid mode scores each document with both embeddings and TF-IDF and blends the two scores with a single tunable weight:

    hybrid_score = alpha * emb_score + (1 - alpha) * tf_score

    The alpha=0.65 default weights embeddings slightly more than TF-IDF. The value is empirical, not principled, but it was tested across different query styles. Keyword-heavy queries perform better around alpha=0.4; paraphrase-style queries benefit from alpha=0.8 or higher.
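
    The blending formula can be sketched in a few lines. This is a minimal illustration rather than the repository's code: the function names hybrid_score and rank_hybrid are hypothetical, the example scores are made up, and treating a document that one scorer did not return as 0.0 is an assumption.

```python
def hybrid_score(emb_score: float, tf_score: float, alpha: float = 0.65) -> float:
    """Blend an embedding similarity and a TF-IDF similarity with one weight."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * emb_score + (1 - alpha) * tf_score


def rank_hybrid(emb_scores: dict[str, float], tf_scores: dict[str, float],
                alpha: float = 0.65, top_k: int = 4) -> list[tuple[str, float]]:
    """Rank documents by blended score, highest first.

    Assumption: a document missing from one scorer counts as 0.0 there.
    """
    doc_ids = set(emb_scores) | set(tf_scores)
    blended = {
        doc_id: hybrid_score(emb_scores.get(doc_id, 0.0),
                             tf_scores.get(doc_id, 0.0), alpha)
        for doc_id in doc_ids
    }
    return sorted(blended.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Made-up scores, for illustration only (not taken from the benchmark runs).
emb = {"mem-001": 0.62, "tfidf-001": 0.55, "vec-001": 0.40}
tf = {"mem-001": 0.42, "vec-001": 0.29, "tfidf-001": 0.02}

ranking = rank_hybrid(emb, tf)                      # default alpha=0.65
tfidf_only = rank_hybrid(emb, tf, alpha=0.0)        # pure TF-IDF ordering
print([doc_id for doc_id, _ in ranking])
print([doc_id for doc_id, _ in tfidf_only])
```

    With these illustrative inputs, a document with a weak TF-IDF score but a strong embedding score moves up once alpha gives the embedding more weight, which is the effect the alpha parameter is meant to control.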

    What Hybrid Retrieval Fixes That TF-IDF Misses

    For the query: “how do embeddings compare to TF-IDF for memory in AI agents”

    Mode   | Documents Retrieved                  | Why
    TF-IDF | mem-001, vec-001, ctx-001            | Only keyword-overlapping documents surface
    Hybrid | mem-001, vec-001, tfidf-001, ctx-001 | Conceptually relevant tfidf-001 now surfaces

    tfidf-001 doesn’t appear in TF-IDF results because it shares few query tokens. Hybrid mode surfaces it because the embedding recognises its conceptual relevance. This is the exact failure mode of traditional RAG at scale.

    One implementation note: sentence-transformers is optional. Without it, the system falls back to random embeddings with a warning. Production gets real semantics; development gets a functional stub.


    Component 2: The Re-ranker

    Retrieval gives you candidates. Re-ranking decides the final order.

    The re-ranker applies a two-factor weighted sum mixing retrieval score with a tag-based importance value. Documents tagged with memory, context, rag, or embedding receive a tag_importance of 1.4; all others receive 1.0. Both feed into the same formula:

    final_score = base_score * 0.68 + tag_importance * 0.32

    A tagged document with tag_importance=1.4 contributes 0.448 from that term alone, versus 0.32 for an untagged one: a fixed bonus of 0.128 regardless of retrieval score. The weights reflect a specific prior: retrieval signal is primary, domain relevance is a meaningful secondary signal.

    Scores Before and After Re-ranking

    Document  | Before Re-ranking | After Re-ranking | Change
    mem-001   | 0.4161            | 0.7309           | +75.7%
    rag-001   | outside top 4     | 0.5280           | promoted
    vec-001   | 0.2880            | 0.5158           | +79.1%
    tfidf-001 | 0.2164            | 0.4672           | +115.9%

    rag-001 jumps from outside the top 4 to second place entirely due to its tag boost. These reorderings change which documents survive compression; they are not cosmetic.
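
    The re-ranking formula is easy to check by hand. The sketch below applies it to three rows of the table; the helper names are hypothetical, and the tag_importance assigned to each row (1.4 for mem-001, 1.0 for vec-001 and tfidf-001) is inferred from the arithmetic rather than stated in the article.

```python
BOOSTED_TAGS = {"memory", "context", "rag", "embedding"}


def tag_importance(tags: set[str]) -> float:
    """1.4 if the document carries any boosted tag, otherwise 1.0."""
    return 1.4 if BOOSTED_TAGS & tags else 1.0


def final_score(base_score: float, importance: float) -> float:
    """Two-factor weighted sum: retrieval score plus tag-based importance."""
    return base_score * 0.68 + importance * 0.32


# (doc_id, score before re-ranking, tag_importance inferred from the table)
rows = [
    ("mem-001", 0.4161, 1.4),
    ("vec-001", 0.2880, 1.0),
    ("tfidf-001", 0.2164, 1.0),
]

for doc_id, base, importance in rows:
    after = final_score(base, importance)
    change_pct = (after - base) / base * 100
    print(f"{doc_id}: {base:.4f} -> {after:.4f} ({change_pct:+.1f}%)")
```

    Because the importance term adds a constant 0.32 or 0.448 regardless of the retrieval score, documents with low base scores see the largest relative change, which matches the +115.9% on tfidf-001.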

    Is the heuristic principled? Not entirely. A cross-encoder re-ranker, which scores each query-document pair with a neural model [7], would be more accurate. But cross-encoders cost one model call per document. At 5 documents, the heuristic runs in microseconds. At 500+, a cross-encoder becomes worth the cost.


    Component 3: Memory with Exponential Decay

    This is the component most tutorials miss entirely, and the one where naive systems collapse fastest.

    Conversational memory has two failure modes: forgetting too fast (losing context that’s still relevant) and forgetting too slow (accumulating noise that crowds out useful information). A sliding window drops old turns abruptly: turn 10 is fully present, turn 11 is gone. That’s not how useful information works.

    The solution is exponential decay, where turns fade continuously based on three factors.

    The scoring formula:

    effective = importance * recency * freshness + relevance_boost

    Where each term is:

    • recency = e^(−decay_rate × age_seconds): older turns carry less weight
    • freshness = e^(−0.01 × time_since_last_access): recently referenced turns get a boost
    • relevance_boost = (|query ∩ turn| / |query|) × 0.35: turns with high query-token overlap are retained longer

    This mirrors how working memory actually prioritises information [4]: high-importance turns survive longer; off-topic turns fade quickly regardless of when they occurred.
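
    The scoring formula above can be written out directly. This is a sketch under stated assumptions: the function name effective_score is hypothetical, time_since_last_access is assumed to be in seconds, the default decay_rate of 0.001 is taken from the decay chart's description, and the lowercase alphanumeric tokenisation is a guess at what "query ∩ turn" means in practice.

```python
import math
import re


def _tokens(text: str) -> set[str]:
    """Assumed tokenisation: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def effective_score(importance: float, age_seconds: float,
                    seconds_since_access: float, query: str, turn_text: str,
                    decay_rate: float = 0.001) -> float:
    """effective = importance * recency * freshness + relevance_boost"""
    recency = math.exp(-decay_rate * age_seconds)
    freshness = math.exp(-0.01 * seconds_since_access)
    query_tokens = _tokens(query)
    overlap = (len(query_tokens & _tokens(turn_text)) / len(query_tokens)
               if query_tokens else 0.0)
    relevance_boost = overlap * 0.35
    return importance * recency * freshness + relevance_boost


turn = "Explain how memory decay prevents context bloat."
fresh = effective_score(2.5, 0, 0, "memory decay", turn)
aged = effective_score(2.5, 600, 60, "memory decay", turn)
off_topic = effective_score(2.5, 600, 60, "weather forecast", turn)
print(round(fresh, 3), round(aged, 3), round(off_topic, 3))
```

    Note that the relevance boost is additive, so a turn that matches the current query keeps a floor of up to 0.35 even after the multiplicative terms have decayed towards zero.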

    Auto-Importance Scoring

    Auto-importance scoring makes this practical without manual annotation. The system scores each turn based on content length, domain keywords, and query overlap:

    Turn Content                                            | Role | Auto-Scored Importance
    “What is context engineering and why is it important?” | user | 2.33
    “Explain how memory decay prevents context bloat.”     | user | 2.50
    “What is the weather in Chennai today?”                | user | 1.10

    A weather question scores 1.10, barely above the floor. A domain question about memory decay scores 2.50 and survives far longer before decaying. In a long conversation, high-importance domain turns stay in memory while low-importance small-talk turns fade first, which is exactly the ordering you want.

    Deduplication

    Deduplication runs before any turn is stored, as a three-tier check: exact containment (if the new turn is a substring of an existing one, reject), strong prefix overlap (if the first half of both turns match, reject), and token-overlap similarity >= 0.72 (if token overlap is high enough, reject as a paraphrase).

    At 0.72, you catch paraphrases without falsely rejecting related-but-distinct questions on the same topic. A follow-up like “Can you explain context engineering and its role in RAG?” after “What is context engineering and how does it help RAG systems?” scores ~72% overlap: deduplication fires, one memory slot is saved, and room is made for genuinely new information.
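
    A three-tier check of this shape might look like the sketch below. It is not the repository's implementation: the function name is_duplicate is hypothetical, "first half" is interpreted as half the length of the shorter turn, and Jaccard similarity over lowercase tokens stands in for the token-overlap measure, which the article does not define. The engine's own measure evidently differs, so this sketch should not be expected to reproduce the ~72% figure quoted above.

```python
import re


def _tokens(text: str) -> set[str]:
    """Assumed tokenisation: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_duplicate(new_turn: str, existing_turns: list[str],
                 threshold: float = 0.72) -> bool:
    """Return True if new_turn should be rejected as a duplicate."""
    new_norm = " ".join(new_turn.lower().split())
    if not new_norm:
        return True  # assumption: an empty turn carries nothing worth storing
    new_tokens = _tokens(new_norm)

    for old in existing_turns:
        old_norm = " ".join(old.lower().split())

        # Tier 1: exact containment.
        if new_norm in old_norm:
            return True

        # Tier 2: strong prefix overlap (first half of the shorter turn).
        half = min(len(new_norm), len(old_norm)) // 2
        if half > 0 and new_norm[:half] == old_norm[:half]:
            return True

        # Tier 3: token-overlap similarity (Jaccard, as an assumption).
        old_tokens = _tokens(old_norm)
        union = new_tokens | old_tokens
        if union and len(new_tokens & old_tokens) / len(union) >= threshold:
            return True

    return False


stored = ["What is context engineering?",
          "Explain how memory decay prevents bloat"]
print(is_duplicate("context engineering", stored))
print(is_duplicate("What is the weather in Chennai today?", stored))
```

    The tiers run from cheapest to most expensive, so most exact repeats are rejected before any token sets are compared.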


    Token Budget Under Pressure

    [Figure: How the token budget is distributed across turns in a context-aware RAG system, balancing system prompts, memory history, and retrieved documents. Image by Author.]

    Component 4: Context Compression

    You have 810 characters of retrieved context. Your remaining token budget allows 800. That 10-character gap means something either gets truncated badly or the whole thing overflows.

    The Compressor implements three strategies. Truncate is the fastest: it cuts each chunk proportionally. Sentence uses greedy sentence-boundary selection. Extractive is query-aware: every sentence across all retrieved documents gets scored by token overlap with the query, ranked by relevance, and greedily selected within budget. Then the selected sentences are served back in their original document order, not relevance rank order [5]. Relevance rank order produces incoherent context. Original order preserves the logical flow of the source material.

    Compression Strategy Trade-offs: Same 810-Character Input, 800-Character Budget

    Strategy   | Output Size | Compression Ratio | What It Optimises
    Truncate   | 744 chars   | 91.9%             | Speed
    Sentence   | 684 chars   | 84.4%             | Clean boundaries
    Extractive | 762 chars   | 94.1%             | Relevance

    Extractive compression preserves meaning better, but saves fewer raw characters. Under tight budgets, it gives you the right content, not just less content.
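
    The extractive strategy (score, rank, select greedily, then restore source order) can be sketched as follows. The function name extractive_compress, the regex-based sentence splitter, and the tie-breaking rule are assumptions made for this illustration; only the overall procedure comes from the description above.

```python
import re


def _tokens(text: str) -> set[str]:
    """Assumed tokenisation: lowercase alphanumeric runs."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extractive_compress(docs: list[str], query: str, max_chars: int) -> str:
    """Keep the most query-relevant sentences that fit, in original order."""
    query_tokens = _tokens(query)

    # Score every sentence by the fraction of query tokens it contains.
    sentences = []
    for doc_idx, doc in enumerate(docs):
        for sent_idx, sent in enumerate(re.split(r"(?<=[.!?])\s+", doc.strip())):
            if not sent:
                continue
            score = (len(query_tokens & _tokens(sent)) / len(query_tokens)
                     if query_tokens else 0.0)
            sentences.append((score, doc_idx, sent_idx, sent))

    # Greedy selection by relevance; ties fall back to source position.
    chosen, used = [], 0
    for score, doc_idx, sent_idx, sent in sorted(
            sentences, key=lambda s: (-s[0], s[1], s[2])):
        cost = len(sent) + (1 if chosen else 0)  # +1 for the joining space
        if used + cost <= max_chars:
            chosen.append((doc_idx, sent_idx, sent))
            used += cost

    # Serve sentences back in original document order, not rank order.
    chosen.sort()
    return " ".join(sent for _, _, sent in chosen)


docs = [
    "Memory decay removes stale turns. The weather is nice.",
    "Embeddings capture meaning. TF-IDF weights rare terms in memory retrieval.",
]
compressed = extractive_compress(docs, "embeddings meaning decay", max_chars=70)
print(compressed)
```

    In this example the highest-scoring sentence comes from the second document, yet the output still begins with the sentence from the first document, because selection order and presentation order are kept separate.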


    Component 5: The Token Budget Enforcer

    Everything feeds into the TokenBudget, a slot-based allocator that tracks usage across named context regions. Token estimation uses the 1 token ≈ 4 characters heuristic for English prose, in line with OpenAI’s documentation [6].

    The order of reservation is the whole design:

    def build(self, query: str) -> ContextPacket:
        budget = TokenBudget(total=self.total_token_budget)
        budget.reserve_text("system_prompt", self.system_prompt)          # 1. Fixed

        scored_docs = self._rerank(self._retriever.retrieve(query, ...), query)

        memory_turns = self._memory.get_weighted(query=query)
        budget.reserve_text("history", " ".join(t.content for t in memory_turns))  # 2. Reserved

        remaining_chars = budget.remaining_chars()
        compressor = Compressor(max_chars=remaining_chars, strategy=self.compression_strategy)
        result = compressor.compress([sd.document.content for sd in scored_docs], query=query)

        budget.reserve_text("retrieved_docs", result.text)                # 3. What's left
        return ContextPacket(...)

    The system prompt is fixed overhead you can’t negotiate away. Memory is what makes multi-turn coherent. Documents are the variable: useful, but the first thing to compress when space runs out. Reserve in the wrong order and documents silently overflow the budget before history is even accounted for. The orchestrator enforces the right order explicitly.
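
    For readers who want to see what a slot-based allocator of this kind could look like, here is a minimal sketch. The names TokenBudget, reserve_text, remaining_chars, and the total argument appear in the excerpt above; everything inside the class (the slots dictionary, rounding up, raising on overflow) is an assumption for illustration and may differ from the repository.

```python
import math


class TokenBudget:
    """Hypothetical slot-based allocator using 1 token ~ 4 characters."""

    CHARS_PER_TOKEN = 4

    def __init__(self, total: int):
        self.total = total
        self.slots: dict[str, int] = {}  # slot name -> tokens reserved

    @classmethod
    def estimate_tokens(cls, text: str) -> int:
        # Round up so a partial token still counts against the budget.
        return math.ceil(len(text) / cls.CHARS_PER_TOKEN)

    def used_tokens(self) -> int:
        return sum(self.slots.values())

    def remaining_tokens(self) -> int:
        return max(self.total - self.used_tokens(), 0)

    def remaining_chars(self) -> int:
        return self.remaining_tokens() * self.CHARS_PER_TOKEN

    def reserve_text(self, slot: str, text: str) -> int:
        """Reserve space for text under a named slot; fail loudly on overflow."""
        needed = self.estimate_tokens(text)
        if needed > self.remaining_tokens():
            raise ValueError(f"slot {slot!r} needs {needed} tokens, "
                             f"only {self.remaining_tokens()} left")
        self.slots[slot] = needed
        return needed


budget = TokenBudget(total=800)
budget.reserve_text("system_prompt", "x" * 800)   # 800 chars -> 200 tokens
budget.reserve_text("history", "y" * 400)         # 400 chars -> 100 tokens
print(budget.remaining_tokens(), budget.remaining_chars())
```

    Whatever remains after the first two reservations is the character limit handed to the compressor, which is why the reservation order matters.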


    What Happens Under Real Token Pressure

    This is where naive systems fail, and this engine adapts.

    Setup: 5 documents (810 chars total), 200 tokens reserved for system prompt, 800-token total budget. Query: “How do embeddings and TF-IDF compare for memory in agents?”

    Turn 1 (no conversation history yet): Documents retrieved: 5, re-ranked. Memory turns: 0. Compression applied: 48% reduction. Result: fits within budget.

    Turn 2 (after conversation starts): Documents retrieved: 5, re-ranked. Memory turns: 2, now competing for space. Compression becomes more aggressive: 45% reduction. Result: still fits within budget.

    What changed? The system didn’t fail; it adapted. Memory turns consumed part of the budget, so compression on retrieved documents tightened automatically. That’s the point of context engineering: the model always receives something coherent, never a random overflow.


    Measuring What It Actually Buys You

    The table below compares four approaches on the same query and 800-token total budget. The first three rows are calculated from known inputs using the same 810-character document set; the fourth row reflects actual engine output verified against demo runs.

    Approach                | Docs Retrieved | After Compression                | Memory                   | Fits Budget?
    Naive RAG               | 5              | 810 chars (full, no compression) | None                     | No: 10 chars over
    RAG + Truncate          | 5              | 360 chars (43%)                  | None                     | Yes, but tail content lost
    RAG + Memory (no decay) | 5              | 810 chars (full)                 | 3 turns, unfiltered      | No: history pushes it over
    Full Context Engine     | 5, reranked    | 400 chars (50%)                  | 2 turns, decay-filtered  | Yes: all constraints met

    Naive RAG overflows immediately. Truncation fits but blindly cuts the tail. Memory without decay adds noise rather than signal: older turns never fade, and conversation history becomes bloat. The full system re-ranks, compresses intelligently, and includes only turns that still carry information.


    Memory Decay by Importance Score

    [Figure: Effective score decay over 24 hours (decay_rate 0.001, min_importance threshold 0.1). High-importance context engineering turns (importance 2.50 and 2.33) survive the full session window, while low-importance turns like the weather query (importance 1.10) fall below the 0.1 threshold at ~12 hr and are dropped. Relevance boost from query-token overlap can temporarily revive aged turns.]

    Performance Characteristics

    Measured on Python 3.12.6, CPU only, no GPU, 5-document knowledge base:

    Operation                | Latency | Notes
    Keyword retrieval        | ~0.8ms  | Simple token matching
    TF-IDF retrieval         | ~2.1ms  | Vectorisation + cosine similarity
    Hybrid retrieval         | ~85ms   | Embedding generation dominates
    Re-ranking (5 docs)      | ~0.3ms  | Tag-weighted scoring
    Memory decay + filtering | ~0.6ms  | Exponential decay calculation
    Compression (extractive) | ~4.2ms  | Sentence scoring + selection
    Full engine.build()      | ~92ms   | Hybrid mode dominates

    Hybrid retrieval is the bottleneck. If you need sub-50ms response time, use TF-IDF or keyword mode instead. At 100 requests/sec in hybrid mode you need roughly 9 concurrent workers; with embedding caching, subsequent queries drop to ~2ms per request after the first.
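
    The worker estimate follows from multiplying arrival rate by per-request latency (Little's law). The sketch below reproduces that arithmetic with the figures from the table; the function name is hypothetical, and it assumes each worker handles one request at a time with no queueing headroom.

```python
def in_flight_requests(requests_per_sec: float, latency_ms: float) -> float:
    """Average number of requests being processed at once (Little's law)."""
    return requests_per_sec * latency_ms / 1000


hybrid = in_flight_requests(100, 92)   # full engine.build() in hybrid mode
cached = in_flight_requests(100, 2)    # after embedding caching
print(f"hybrid: {hybrid:.1f} busy workers on average")
print(f"cached: {cached:.1f} busy workers on average")
```

    The hybrid figure comes out at about 9.2, which is where "roughly 9" comes from; in practice you would round up and add headroom for bursts.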


    Honest Design Decisions

    alpha=0.65 is empirical, not principled. I tested across a small query set from my knowledge base. For a different domain (legal documents, medical literature, dense code) the right alpha will be different. Keyword-heavy queries do better around 0.4; conceptual or paraphrased queries benefit from 0.8 or higher.

    The re-ranking weights (0.68/0.32) are a heuristic. A cross-encoder re-ranker would be more principled [7] but costs one model call per document. For 5 documents, the heuristic runs in microseconds. For 500+ documents, a cross-encoder becomes worth the cost.

    Token estimation (1 token ≈ 4 chars) is an approximation. It is within ~15% of actual token counts for English prose [6], but misfires for code and non-Latin scripts. For production, swap in tiktoken [8]; it’s a one-line change in compressor.py.
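
    One way to make the token counter swappable is sketched below. The factory name make_token_counter is hypothetical and the choice of the cl100k_base encoding is an assumption; the article does not show how compressor.py wires this in. The sketch falls back to the 4-characters-per-token heuristic whenever tiktoken is missing or cannot load its encoding.

```python
import math
from typing import Callable


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return an exact counter if tiktoken is usable, else the heuristic."""
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding(encoding_name)
        return lambda text: len(encoding.encode(text))
    except Exception:
        # tiktoken not installed, or its encoding files could not be loaded.
        return lambda text: math.ceil(len(text) / 4)


count_tokens = make_token_counter()
sample = "Context engineering decides what the model gets to think about."
print(count_tokens(sample), "tokens vs heuristic", math.ceil(len(sample) / 4))
```

    Printing both numbers side by side on your own corpus is a quick way to see how far the heuristic drifts for code or non-Latin text.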

    The extractive compressor scores by query-token recall overlap: how many query tokens appear in the sentence, as a fraction of the query length. This is fast and dependency-free but misses semantic similarity: a sentence that paraphrases the query without sharing any tokens scores zero. Embedding-based sentence scoring would fix that at the cost of an additional model call per compression pass.


    Trade-offs and What’s Missing

    Cross-encoder re-ranking. The _rerank() interface is already designed to be swapped out. Drop in a BERT-based cross-encoder for meaningfully better pair-wise rankings.

    Embedding-based compression. Replace the token-overlap sentence scorer in _extractive() with a small embedding model. It catches semantic relevance that keyword overlap misses. Probably worth it for 100+ document systems.

    Adaptive alpha. Classify the query type dynamically and adjust alpha rather than using a fixed 0.65. A short query with rare domain terms probably wants more TF-IDF weight; a long natural-language question wants more embedding weight.
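
    As a rough idea of what adaptive alpha could look like, here is a rule-based sketch. Nothing in it comes from the repository: the function name, the length thresholds, the 50% domain-term ratio, and the caller-supplied domain_terms set are all hypothetical. Only the three alpha values (0.4, 0.65, 0.8) are taken from the article.

```python
import re


def adaptive_alpha(query: str, domain_terms: set[str],
                   default: float = 0.65) -> float:
    """Pick an embedding weight from simple surface features of the query."""
    words = re.findall(r"[a-z0-9\-]+", query.lower())
    if not words:
        return default

    domain_ratio = sum(word in domain_terms for word in words) / len(words)

    # Short and dense with domain terms: lean on TF-IDF.
    if len(words) <= 4 and domain_ratio >= 0.5:
        return 0.4

    # Long natural-language question: lean on embeddings.
    if len(words) >= 8 and query.strip().endswith("?"):
        return 0.8

    return default


terms = {"tf-idf", "cosine", "bm25"}
print(adaptive_alpha("tf-idf cosine similarity", terms))
print(adaptive_alpha(
    "how do embeddings compare to tf-idf for memory in ai agents?", terms))
print(adaptive_alpha("memory decay in long conversations", terms))
```

    A learned query classifier would likely do better than fixed rules, but a heuristic like this is cheap to try and easy to evaluate against a held-out query set.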

    Persistent memory. The current Memory class is in-process only. A lightweight SQLite backend with the same add() / get_weighted() interface would survive restarts and enable cross-session continuity.


    Closing

    RAG gets you the right documents. Prompt engineering gets you the right instructions. Context engineering gets you the right context.

    Prompt engineering decides how the model thinks. Context engineering decides what it gets to think about.

    Most systems optimise the former and ignore the latter. That’s why they break.

    The full source code with all seven demos is at: https://github.com/Emmimal/context-engine/


    References

    [1] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 33, 9459–9474. https://arxiv.org/abs/2005.11401

    [2] Karpathy, A. (2025). Context Engineering. https://x.com/karpathy/status/1937902205765607626

    [3] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830. https://jmlr.org/papers/v12/pedregosa11a.html

    [4] Baddeley, A. (2000). The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11), 417–423.

    [5] Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. EMNLP 2004. https://aclanthology.org/W04-3252/

    [6] OpenAI. (2023). Counting tokens with tiktoken. https://github.com/openai/tiktoken

    [7] Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085

    [8] OpenAI. (2023). tiktoken: Fast BPE tokeniser for use with OpenAI’s models. https://github.com/openai/tiktoken

    [9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. https://arxiv.org/abs/1908.10084


    Disclosure

    All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers are from actual demo runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running demo.py, except where the article explicitly notes numbers are calculated from known inputs. The sentence-transformers library is used as an optional dependency for embedding generation in hybrid retrieval mode. All other functionality runs on the Python standard library and numpy only. I have no financial relationship with any tool, library, or company mentioned in this article.


