    4 Techniques to Optimize Your LLM Prompts for Cost, Latency and Performance

By Editor Times Featured | October 29, 2025 | 9 min read


LLMs are capable of automating a wide variety of tasks. Since the launch of ChatGPT in 2022, we have seen more and more AI products on the market utilizing LLMs. However, there is still a lot of room for improvement in the way we use them. Improving your prompt with an LLM prompt improver and taking advantage of cached tokens are, for example, two simple techniques you can use to greatly improve the performance of your LLM application.

In this article, I'll discuss several specific techniques you can apply to the way you create and structure your prompts, which will reduce latency and cost and also improve the quality of your responses. The goal is to present these techniques so you can immediately implement them in your own LLM application.

This infographic highlights the main contents of this article. I'll discuss four different techniques to drastically improve the performance of your LLM application with regard to cost, latency, and output quality: using cached tokens, placing the user question at the end, using prompt optimizers, and building your own customized LLM benchmarks. Image by Gemini.

Why you should optimize your prompt

In many cases, you may have a prompt that works with a given LLM and yields adequate results. However, if you haven't spent much time optimizing the prompt, you are likely leaving a lot of potential on the table.

I argue that by using the specific techniques I present in this article, you can easily both improve the quality of your responses and reduce costs without much effort. Just because a prompt and LLM work doesn't mean they are performing optimally, and in many cases you can see great improvements with very little effort.

Specific techniques to optimize

In this section, I'll cover the specific techniques you can use to optimize your prompts.

Always keep static content early

The first technique I'll cover is to always keep static content early in your prompt. By static content, I mean content that remains the same across multiple API calls.

The reason you should keep static content early is that all the large LLM providers, such as Anthropic, Google, and OpenAI, make use of cached tokens. Cached tokens are tokens that have already been processed in a previous API request and can therefore be processed more cheaply and quickly. It varies from provider to provider, but cached input tokens are usually priced at around 10% of normal input tokens.

Cached tokens are tokens that have already been processed in a previous API request and can be processed more cheaply and quickly than normal tokens.

That means if you send the same prompt twice in a row, the input tokens of the second request will cost only a tenth of the input tokens of the first. This works because the LLM providers cache the processing of those input tokens, which makes your new request cheaper and faster.
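To get a feel for the savings, here is a quick back-of-envelope calculation. The per-token price and the 10% cache discount below are illustrative assumptions, not any particular provider's actual rates:

```python
def input_cost(total_tokens, cached_tokens, price_per_token, cached_discount=0.10):
    """Estimate input cost when part of the prompt hits the provider's cache."""
    uncached = total_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * cached_discount

# Illustrative price: $2.50 per million input tokens (check your provider's pricing).
price = 2.50 / 1_000_000

first_call = input_cost(10_000, 0, price)       # nothing cached yet
second_call = input_cost(10_000, 9_000, price)  # 9,000 static prefix tokens cached
print(f"${first_call:.5f} vs ${second_call:.5f}")  # → $0.02500 vs $0.00475
```

With 90% of a 10,000-token prompt cached, the input cost of the second call drops to under a fifth of the first.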


In practice, you take advantage of input-token caching by keeping variables at the end of the prompt.

For example, if you have a long system prompt with a question that varies from request to request, you should do something like this:

prompt = f"""
{long_static_system_prompt}

{user_prompt}
"""

For example:

prompt = f"""
You are a document expert ...
You should always answer in this format ...
If a user asks about ... you should answer ...

{user_question}
"""

Here we put the static content of the prompt first, before the variable content (the user question) at the end.


In some scenarios, you want to feed in document contents. If you're processing a lot of different documents, you should keep the document content toward the end of the prompt:

# if processing different documents
prompt = f"""
{static_system_prompt}
{variable_prompt_instruction_1}
{document_content}
{variable_prompt_instruction_2}
{user_question}
"""

However, suppose you're processing the same document multiple times. In that case, you can make sure the document's tokens are also cached by ensuring no variables appear in the prompt before it:

# if processing the same documents multiple times
prompt = f"""
{static_system_prompt}
{document_content}  # keep this before any variable instructions
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""

Note that caching usually only kicks in if the first 1024 tokens are identical between two requests. For example, if the static system prompt in the example above is shorter than 1024 tokens, you won't benefit from cached tokens at all.

# do NOT do this
prompt = f"""
{variable_content}  # <-- this prevents any use of cached tokens
{static_system_prompt}
{document_content}
{variable_prompt_instruction_1}
{variable_prompt_instruction_2}
{user_question}
"""

Your prompts should always be built up with the most static content first (the content that varies the least from request to request), followed by the most dynamic content (the content that varies the most from request to request):

1. If you have a long system and user prompt without any variables, keep that first and add the variables at the end of the prompt.
2. If you are fetching text from documents, for example, and processing the same document more than once, keep the document content before any variable instructions so its tokens are cached too.
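These ordering rules can be wrapped in a small helper. This is a minimal sketch; the function and argument names are my own, not a standard API:

```python
def build_prompt(static_system_prompt, document_content, variable_instructions, user_question):
    """Assemble a prompt with the most static parts first, so that
    repeated requests share the longest possible cached prefix."""
    parts = [static_system_prompt, document_content, *variable_instructions, user_question]
    return "\n\n".join(part for part in parts if part)

prompt = build_prompt(
    "You are a document expert ...",
    "<contents of the reused document>",
    ["Summarize section 2.", "Answer in bullet points."],
    "What are the key risks?",
)
```

Because the document content comes right after the static system prompt, two requests over the same document share a long identical prefix even when the instructions and question change.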

Question at the end

Another technique you should use to improve LLM performance is to always put the user question at the end of your prompt. Ideally, you organize it so that your system prompt contains all the general instructions, and the user prompt consists only of the user question, as below:

system_prompt = "General instructions, response format, document context ..."

user_prompt = f"{user_question}"

In Anthropic's prompt engineering docs, they state that putting the user question at the end can improve performance by up to 30%, especially when you are using long contexts. Placing the question at the end makes it clearer to the model which task it's trying to achieve, and will in many cases lead to better results.
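In chat-style APIs, the same idea maps naturally onto the message list: long, stable context goes in the system message, and the question is the final user message. A minimal sketch (the message dicts follow the common chat-completion shape; the helper name is my own):

```python
def build_messages(system_prompt, context, user_question):
    """Put general instructions and long context in the system message,
    and keep the user question as the last message the model sees."""
    return [
        {"role": "system", "content": f"{system_prompt}\n\n{context}"},
        {"role": "user", "content": user_question},
    ]

messages = build_messages(
    "You are a document expert. Always answer concisely.",
    "<long document context>",
    "What does section 3 say about refunds?",
)
```

The resulting list can be passed as the `messages` argument of most chat APIs; the key point is simply that the question is the final entry.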

Using a prompt optimizer

Often, when humans write prompts, they become messy, inconsistent, full of redundant content, and lacking in structure. Thus, you should always feed your prompt through a prompt optimizer.

The simplest prompt optimizer is to ask an LLM to "improve this prompt: {prompt}", which will give you a more structured prompt with less redundant content, and so on.

An even better approach, however, is to use a dedicated prompt optimizer, such as the ones you can find in OpenAI's or Anthropic's consoles. These optimizers are LLMs specifically prompted and built to optimize your prompts, and they will usually yield better results. Additionally, you should make sure to include:

• Details about the task you're trying to achieve
• Examples of tasks the prompt succeeded at, with the input and output
• Examples of tasks the prompt failed at, with the input and output

Providing this extra information will usually yield far better results, and you'll end up with a significantly better prompt. In many cases, you'll spend only around 10-15 minutes and end up with a far more performant prompt. This makes using a prompt optimizer one of the lowest-effort approaches to improving LLM performance.
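A simple way to package that information is a meta-prompt template you send to the optimizer model. The template below is a sketch of the structure, not the wording OpenAI's or Anthropic's built-in optimizers actually use:

```python
OPTIMIZER_TEMPLATE = """Improve the prompt below: restructure it, remove redundancy, \
and make the instructions unambiguous.

Task description:
{task}

Current prompt:
{prompt}

Example the prompt handled well (input -> output):
{good_example}

Example the prompt failed on (input -> output):
{bad_example}

Return only the improved prompt."""

def make_optimizer_request(task, prompt, good_example, bad_example):
    """Fill the meta-prompt with the task details and examples."""
    return OPTIMIZER_TEMPLATE.format(
        task=task, prompt=prompt,
        good_example=good_example, bad_example=bad_example,
    )
```

The returned string is what you would send to the optimizer model as a single request.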

    Benchmark LLMs

The LLM you use will also significantly affect the performance of your application. Different LLMs are good at different tasks, so you need to try out different LLMs in your specific application area. I recommend at least setting up access to the biggest LLM providers, such as Google Gemini, OpenAI, and Anthropic. Setting this up is quite simple, and switching providers takes a matter of minutes once you have credentials in place. Additionally, you can consider testing open-source LLMs as well, though they usually require more effort.

You then need to set up a specific benchmark for the task you're trying to achieve and see which LLM works best. Additionally, you should regularly check model performance, since the big LLM providers sometimes upgrade their models without necessarily releasing a new version. You should, of course, also be ready to try out any new models the big providers release.
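A benchmark can be as simple as a fixed set of (input, expected output) cases scored against each model. In this sketch the "models" are plain callables so the example runs offline; in practice each would wrap a real API client:

```python
def run_benchmark(models, cases):
    """Score each model on (prompt, expected) pairs; returns accuracy per model."""
    scores = {}
    for name, model in models.items():
        correct = sum(model(prompt) == expected for prompt, expected in cases)
        scores[name] = correct / len(cases)
    return scores

# Stub models standing in for real provider clients.
models = {
    "model-a": lambda prompt: "positive" if "great" in prompt else "negative",
    "model-b": lambda prompt: "positive",
}
cases = [("This is great!", "positive"), ("Awful experience.", "negative")]
print(run_benchmark(models, cases))  # model-a scores 1.0, model-b scores 0.5
```

Rerunning the same harness after a provider silently upgrades a model is also the easiest way to catch regressions.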

    Conclusion

In this article, I've covered four different techniques you can use to improve the performance of your LLM application: using cached tokens, placing the question at the end of the prompt, using prompt optimizers, and creating specific LLM benchmarks. These are all relatively simple to set up and can lead to a significant performance boost. I believe many similar simple techniques exist, and you should always be on the lookout for them. These topics are usually described in various blog posts, and Anthropic's is one of the blogs that has helped me improve LLM performance the most.

👉 Find me on socials:

    📩 Subscribe to my newsletter

    🧑‍💻 Get in touch

    🔗 LinkedIn

    🐦 X / Twitter

    ✍️ Medium

You can also read some of my other articles:



    Source link
