
    Prompt Caching with the OpenAI API: A Full Hands-On Python Tutorial

    By Editor Times Featured · March 22, 2026 · 9 Mins Read


    In my previous post, I covered Prompt Caching: what it is, how it works, and how it can save you a lot of money and time when running AI-powered apps with high traffic. In today's post, I walk you through implementing Prompt Caching specifically with OpenAI's API, and we discuss some common pitfalls.


    A brief reminder on Prompt Caching

    Before getting our hands dirty, let's briefly revisit what exactly Prompt Caching is. Prompt Caching is functionality offered by frontier model API services, like the OpenAI API or Claude's API, that allows caching and reusing the parts of the LLM's input that are repeated frequently. Such repeated parts may be system prompts or instructions that are passed to the model every time an AI app runs, alongside variable content like the user's query or information retrieved from a knowledge base. To be able to hit the cache, the repeated part of the prompt has to sit at its very beginning, that is, it must be a prompt prefix. In addition, for prompt caching to be activated, this prefix must exceed a certain threshold (e.g., for OpenAI the prefix should be more than 1,024 tokens, while Claude has different minimum cache lengths for different models). As long as these two conditions are satisfied (repeated tokens forming a prefix that exceeds the size threshold defined by the API service and model), caching can kick in and yield economies of scale when running AI apps.
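    To make the two conditions concrete, here is a minimal sketch. The helper name is mine, and whitespace-separated words are used as a crude stand-in for tokens (a real tokenizer would count differently):

```python
def shared_prefix_words(a: str, b: str) -> int:
    """Length (in whitespace-delimited words) of the common leading prefix."""
    count = 0
    for wa, wb in zip(a.split(), b.split()):
        if wa != wb:
            break
        count += 1
    return count

# A long, stable system prompt shared by every request:
SYSTEM = "You are a meticulous ML tutor. Explain concepts step by step. " * 120
p1 = SYSTEM + "What is overfitting?"
p2 = SYSTEM + "What is regularization?"

# Both conditions hold: the prompts share a prefix, and it exceeds ~1,024 "tokens".
print(shared_prefix_words(p1, p2) > 1024)  # True
```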

    Unlike caching in other parts of a RAG or other AI app, prompt caching operates at the token level, within the internal procedures of the LLM. Specifically, LLM inference takes place in two steps:

    • Pre-fill, that is, the LLM processes the user prompt to generate the first token, and
    • Decoding, that is, the LLM recursively generates the tokens of the output one by one

    In short, prompt caching stores the computations that take place in the pre-fill stage, so the model doesn't have to redo them when the same prefix reappears. Any computations taking place in the decoding phase, even if repeated, are never cached.

    For the rest of this post, I will focus solely on the use of prompt caching in the OpenAI API.


    What about the OpenAI API?

    In OpenAI's API, prompt caching was initially launched on October 1, 2024. At first it offered a 50% discount on cached tokens, but nowadays this discount goes up to 90%. On top of that, hitting the prompt cache can yield additional latency savings of up to 80%.

    When prompt caching is activated, the API service attempts to hit the cache for a submitted request by routing the prompt to an appropriate machine, where the respective cache is expected to exist. This is called cache routing, and to do it, the API service typically uses a hash of the first 256 tokens of the prompt.
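    A toy sketch of the idea follows. The hashing scheme, the bucket count, and the word-level "tokens" are all illustrative assumptions, not OpenAI's actual implementation:

```python
import hashlib

NUM_MACHINES = 8  # hypothetical pool of cache-holding machines

def route(prompt: str, num_machines: int = NUM_MACHINES) -> int:
    """Route a prompt to a machine based on a hash of its first 256 'tokens'."""
    prefix = " ".join(prompt.split()[:256])  # crude word-based stand-in for tokens
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_machines

# Two prompts sharing the same long prefix land on the same machine,
# so the second can reuse the first one's cached pre-fill work.
shared = "instructions " * 300
print(route(shared + "question A") == route(shared + "question B"))  # True
```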

    Beyond this, their API also allows explicitly setting a prompt_cache_key parameter in the API request. That is a single key identifying which cache we are referring to, aiming to further improve the chances of our prompt being routed to the right machine and hitting the cache.

    In addition, the OpenAI API provides two distinct types of caching with regard to duration, defined through the prompt_cache_retention parameter. These are:

    • In-memory prompt cache retention: This is essentially the default type of caching, available for all models that support prompt caching. With the in-memory cache, cached data remain active for a period of 5-10 minutes between requests.
    • Extended prompt cache retention: This is available for particular models. The extended cache allows keeping data cached for longer, up to a maximum of 24 hours.

    Now, with regard to how much all this costs, OpenAI charges the same per non-cached input token whether we have prompt caching activated or not. If we manage to hit the cache successfully, we are billed for the cached tokens at a heavily discounted price, with a discount of up to 90%. Moreover, the price per input token stays the same for both in-memory and extended cache retention.
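    As a rough back-of-the-envelope sketch, here is how the discount plays out. The per-token price and the flat 90% discount are illustrative assumptions; check OpenAI's pricing page for current numbers:

```python
def request_cost(input_tokens: int, cached_tokens: int,
                 price_per_token: float = 0.40 / 1_000_000,
                 cached_discount: float = 0.90) -> float:
    """Cost of one request: cached tokens are billed at a discounted rate."""
    uncached = input_tokens - cached_tokens
    return (uncached * price_per_token
            + cached_tokens * price_per_token * (1 - cached_discount))

# 4,616 input tokens with no cache hit vs. ~4,480 of them served from cache:
cold = request_cost(4616, 0)
warm = request_cost(4616, 4480)
print(warm < cold)  # True: the warm request is far cheaper
```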


    Prompt Caching in Practice

    So, let's see how prompt caching actually works with a simple Python example using OpenAI's API service. More specifically, we are going to walk through a realistic scenario where a long system prompt (the prefix) is reused across multiple requests. If you are here, I assume you already have your OpenAI API key in place and have installed the required libraries. The first thing to do is to import the OpenAI library, as well as time for capturing latency, and initialize an instance of the OpenAI client:

    from openai import OpenAI
    import time
    
    client = OpenAI(api_key="your_api_key_here")

    Then we can define our prefix (the tokens that will be repeated and that we aim to cache):

    long_prefix = """
    You are a highly knowledgeable assistant specialized in machine learning.
    Answer questions with detailed, structured explanations, including examples when relevant.
    
    """ * 200

    Notice how we artificially increase the length (multiplying by 200) to make sure the 1,024-token caching threshold is met. We also set up a timer so as to measure our latency savings, and then we are finally ready to make our call:

    start = time.time()
    
    response1 = client.responses.create(
        model="gpt-4.1-mini",
        input=long_prefix + "What is overfitting in machine learning?"
    )
    
    end = time.time()
    
    print("First response time:", round(end - start, 2), "seconds")
    print(response1.output[0].content[0].text)

    So, what do we expect to happen here? For models from gpt-4o onwards, prompt caching is activated by default, and since our 4,616 input tokens are well above the 1,024 prefix-token threshold, we are good to go. What this request does is first check whether the input is a cache hit (it isn't, since this is the first time we make a request with this prefix), and since it isn't, it processes the full input and then caches it. The next time we send an input whose initial tokens match the cached input, we will get a cache hit. Let's check this in practice by making a second request with the same prefix:

    start = time.time()
    
    response2 = client.responses.create(
        model="gpt-4.1-mini",
        input=long_prefix + "What is regularization?"
    )
    
    end = time.time()
    
    print("Second response time:", round(end - start, 2), "seconds")
    print(response2.output[0].content[0].text)

    Indeed! The second request runs significantly faster (23.31 seconds for the first vs. 15.37 for the second). This is because the model has already done the computations for the cached prefix and only needs to process from scratch the new part, "What is regularization?". As a result, by using prompt caching, we get significantly lower latency and reduced cost, since cached tokens are discounted.
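    Rather than relying on wall-clock time, you can also verify the hit by inspecting the usage details on the response. For the Responses API, the SDK reports the cached token count under usage.input_tokens_details.cached_tokens (field name as of the time of writing; treat it as an assumption). A minimal helper, shown here against a mock usage object since checking it for real requires a live API call:

```python
from types import SimpleNamespace

def cached_tokens(response) -> int:
    """Number of input tokens served from the prompt cache, falling back to 0."""
    details = getattr(response.usage, "input_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0

# Mock standing in for a real Responses API object:
mock = SimpleNamespace(usage=SimpleNamespace(
    input_tokens=4616,
    input_tokens_details=SimpleNamespace(cached_tokens=4480),
))
print(cached_tokens(mock))  # 4480
```

    In a real run, you would call cached_tokens(response2) and expect a large number for the second request and 0 for the first.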


    Another thing mentioned in the OpenAI documentation is the prompt_cache_key parameter discussed above. According to the documentation, we can explicitly set a prompt cache key when making a request, and in this way define which requests should use the same cache. However, when I tried to include it in my example by adjusting the request parameters accordingly, I didn't have much luck:

    response1 = client.responses.create(
        prompt_cache_key="prompt_cache_test1",
        model="gpt-5.1",
        input=long_prefix + "What is overfitting in machine learning?"
    )

    🤔

    It seems that while prompt_cache_key exists among the API's capabilities, it isn't yet exposed in the Python SDK. In other words, we cannot explicitly control cache reuse yet; it remains fairly automatic and best-effort.


    So, what can go wrong?

    Activating prompt caching and actually hitting the cache seems rather straightforward from what we've discussed so far. So, what could go wrong and cause us to miss the cache? Unfortunately, quite a few things. As straightforward as it is, prompt caching requires several different conditions to be in place, and missing even one of them results in a cache miss. Let's take a closer look!

    One obvious miss is having a prefix shorter than the threshold for activating prompt caching, namely, less than 1,024 tokens. However, this is very easily solvable: we can always artificially increase the prefix token count by simply multiplying by an appropriate value, as shown in the example above.

    Another pitfall is silently breaking the prefix. Even when we use persistent instructions and system prompts of appropriate size across all requests, we need to be exceptionally careful not to break the prefix by adding any variable content at the beginning of the model's input, before the prefix. That is a guaranteed way to break the cache, no matter how long and repeated the subsequent prefix is. The usual suspects here are dynamic data, for instance, user IDs or timestamps prepended to the prompt. Thus, a best practice across all AI app development is that any dynamic content should always be appended at the end of the prompt, never at the beginning.
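    A minimal sketch of the pitfall and the fix (the function names and the timestamp field are illustrative):

```python
import datetime

INSTRUCTIONS = "You are a helpful ML assistant. " * 300  # long, stable prefix

def bad_prompt(user_id: str, query: str) -> str:
    """Breaks the cache: dynamic data sits before the stable prefix."""
    stamp = datetime.datetime.now().isoformat()
    return f"[user={user_id} ts={stamp}]\n" + INSTRUCTIONS + query

def good_prompt(user_id: str, query: str) -> str:
    """Cache-friendly: the stable prefix comes first, dynamic data last."""
    return INSTRUCTIONS + f"{query}\n[user={user_id}]"

# The good layout gives every request an identical prefix:
p1, p2 = good_prompt("u1", "What is overfitting?"), good_prompt("u2", "What is dropout?")
print(p1[:len(INSTRUCTIONS)] == p2[:len(INSTRUCTIONS)])  # True

# The bad layout never shares a prefix across users:
b1, b2 = bad_prompt("u1", "q"), bad_prompt("u2", "q")
print(b1[:40] == b2[:40])  # False: the leading bytes already differ
```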

    Finally, it's worth highlighting that prompt caching only concerns the pre-fill phase; decoding is never cached. This means that even if we force the model to generate responses following a specific template that begins with certain fixed tokens, those tokens are not going to be cached, and we will be billed for their processing as usual.

    Conversely, for certain use cases it doesn't really make sense to use prompt caching at all. Examples are highly dynamic prompts, like chatbots with little repetition, one-off requests, or real-time personalized systems.

    . . .

    On my mind

    Prompt caching can significantly improve the performance of AI applications in terms of both cost and time. Especially when scaling AI apps, prompt caching comes in extremely handy for keeping cost and latency within acceptable ranges.

    In OpenAI's API, prompt caching is activated by default, and prices for non-cached input tokens are the same whether prompt caching is active or not. Thus, one can only win by activating prompt caching and aiming to hit it on every request, even if it doesn't always succeed.

    Claude also provides extensive prompt caching functionality through their API, which we are going to explore in detail in a future post.

    Thanks for reading! 🙂

    . . .

    Loved this post? Let's be friends! Join me on:

    📰Substack 💌 Medium 💼LinkedIn ☕Buy me a coffee!

    All images are by the author, unless mentioned otherwise.


