
    How to Make Your AI App Faster and More Interactive with Response Streaming

    By Editor Times Featured | March 26, 2026 | 9 Mins Read


    In my latest posts, I talked quite a bit about prompt caching, as well as caching in general, and how it can improve your AI app in terms of cost and latency. However, even for a fully optimized AI app, some responses are simply going to take a while to generate, and there is nothing we can do about it. When we request large outputs from the model, or require reasoning or deep thinking, the model naturally takes longer to respond. As reasonable as this is, waiting longer for an answer can be frustrating for the user and degrade their overall experience of the app. Fortunately, a simple and straightforward way to improve this is response streaming.

    Streaming means getting the model's response incrementally, bit by bit, as it is generated, rather than waiting for the entire response to be generated and only then displaying it to the user. Normally (without streaming), we send a request to the model's API, wait for the model to generate the response, and once the response is complete, we get it back from the API in one step. With streaming, however, the API sends back partial outputs while the response is being generated. This is a rather familiar concept, because most user-facing AI apps like ChatGPT have used streaming to show responses to their users from the moment they first appeared. But beyond ChatGPT and LLMs, streaming is used essentially everywhere on the web and in modern applications, for instance in live notifications, multiplayer games, or live news feeds. In this post, we are going to explore how we can integrate streaming into our own requests to model APIs and achieve a similar effect in custom AI apps.

    There are several different mechanisms for implementing streaming in an application. For AI applications, however, two types of streaming are widely used:

    • HTTP Streaming Over Server-Sent Events (SSE): a relatively simple, one-way type of streaming that allows live communication only from server to client.
    • Streaming with WebSockets: a more advanced and complex type of streaming that allows two-way live communication between server and client.

    In the context of AI applications, HTTP streaming over SSE supports simple AI applications where we just need to stream the model's response for latency and UX reasons. As we move beyond simple request–response patterns into more advanced setups, WebSockets become particularly useful, as they allow live, bidirectional communication between our application and the model's API. For example, in code assistants, multi-agent systems, or tool-calling workflows, the client may need to send intermediate updates, user interactions, or feedback back to the server while the model is still generating a response. For most simple AI apps, where we just need the model to provide a response, WebSockets are usually overkill and SSE is sufficient.

    In the rest of this post, we'll take a closer look at streaming for simple AI apps using HTTP streaming over SSE.

    . . .

    What about HTTP Streaming Over SSE?

    HTTP Streaming Over Server-Sent Events (SSE) is based on HTTP streaming.

    . . .

    HTTP streaming means that the server can send whatever it has to send in parts, rather than all at once. This is achieved by the server not closing the connection to the client after sending a response, but rather leaving it open and sending each additional piece of data to the client as soon as it becomes available.

    For example, instead of getting the response in a single chunk:

    Hello world!

    we could get it in parts using raw HTTP streaming:

    Hello

    world

    !

    If we were to implement HTTP streaming from scratch, we would need to handle everything ourselves, including parsing the streamed text, managing errors, and reconnecting to the server. In our example, using raw HTTP streaming, we would have to somehow make clear to the client that 'Hello world!' is conceptually one event, and that everything after it is a separate event. Fortunately, there are several frameworks and wrappers that simplify HTTP streaming, one of which is HTTP Streaming Over Server-Sent Events (SSE).
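To make the boundary problem concrete, here is a minimal sketch of what a client would have to do by hand with raw streaming. It assumes a made-up convention in which the server terminates each message with a newline; the function name `split_messages` and the sample chunks are hypothetical, and this is not the SSE format itself.

```python
from typing import Iterable, Iterator


def split_messages(chunks: Iterable[bytes], delimiter: bytes = b"\n") -> Iterator[str]:
    """Reassemble delimiter-terminated messages from arbitrarily split chunks."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # A single chunk may complete zero, one, or several messages.
        while delimiter in buffer:
            message, buffer = buffer.split(delimiter, 1)
            yield message.decode("utf-8")
    # Whatever is left when the stream ends is treated as a final message.
    if buffer:
        yield buffer.decode("utf-8")


# Chunk boundaries do not line up with message boundaries (assumed sample data).
chunks = [b"Hello", b" world", b"!\nSecond ", b"message\n"]
messages = list(split_messages(chunks))
print(messages)  # ['Hello world!', 'Second message']
```

Even this toy version needs buffering and an agreed delimiter, which is the kind of bookkeeping a standardized event format takes off our hands.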

    . . .

    So, Server-Sent Events (SSE) provide a standardized way to implement HTTP streaming by structuring server output into clearly defined events. This structure makes it much easier to parse and process streamed responses on the client side.

    Each event typically consists of:

    • an id
    • an event type
    • a data payload

    or, more precisely:

    id: <id>
    event: <event type>
    data: <payload>

    Our example using SSE could look something like this:

    id: 1
    event: message
    data: Hello world!
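On the server side, an event like the one above can be produced with a few lines of string formatting. The helper below is only an illustrative sketch: `format_sse` is a hypothetical name, not part of any library, and it relies on two details of the event-stream format, namely that a multi-line payload is sent as one `data:` line per line of text and that a blank line ends the event.

```python
from typing import Optional


def format_sse(data: str, event: Optional[str] = None, event_id: Optional[str] = None) -> str:
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # Each line of the payload gets its own "data:" field.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # The trailing blank line tells the client that the event is complete.
    return "\n".join(lines) + "\n\n"


frame = format_sse("Hello world!", event="message", event_id="1")
print(frame, end="")
```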

    But what is an event? Anything can qualify as an event: a single word, a sentence, or thousands of words. What actually counts as an event in a particular implementation is defined by the API or server we are connected to.

    On top of this, SSE comes with various other conveniences, such as clients (for example, the browser's EventSource) automatically reconnecting to the server if the connection is dropped. In addition, the stream is clearly labeled with the text/event-stream content type, allowing the client to handle it appropriately and avoid errors.
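To show why this structure is easy to consume, here is a rough sketch of a client-side parser for the id, event, and data fields. It is deliberately simplified (for instance, it ignores the retry field and keeps no state between events), `parse_sse` is a hypothetical helper, and the sample lines are made up. In practice a library or the browser's EventSource would usually do this for us.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Group text/event-stream lines into events (simplified sketch)."""
    event: Dict[str, str] = {}
    data_lines: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line marks the end of an event.
            if data_lines:
                event["data"] = "\n".join(data_lines)
                yield event
            event, data_lines = {}, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "data":
            data_lines.append(value)
        elif field in ("id", "event"):
            event[field] = value


sample = [
    "id: 1\n", "event: message\n", "data: Hello\n", "\n",
    ": keep-alive\n",
    "id: 2\n", "event: message\n", "data: world!\n", "\n",
]
events = list(parse_sse(sample))
for e in events:
    print(e)
```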

    . . .

    Roll up your sleeves

    Frontier LLM APIs like OpenAI's API or the Claude API natively support HTTP streaming over SSE. As a result, integrating streaming into your requests is relatively simple, as it can be achieved by changing a parameter in the request (e.g., setting stream=true).

    Once streaming is enabled, the API no longer waits for the full response before replying. Instead, it sends back small parts of the model's output as they are generated. On the client side, we can iterate over these chunks and display them progressively to the user, creating the familiar ChatGPT typing effect.
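The effect on perceived speed can be illustrated without calling any API. The sketch below uses a simulated stream: `fake_model_stream`, its token list, and the 20 ms delay are all stand-ins invented for illustration. It records when the first chunk arrives and when the whole response is done.

```python
import time
from typing import Iterator, List


def fake_model_stream(tokens: List[str], delay_s: float = 0.02) -> Iterator[str]:
    """Simulated model: yields one token at a time after a fixed delay."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


tokens = ["Streaming ", "feels ", "faster ", "to ", "users."]
start = time.perf_counter()
first_chunk_at = None
pieces = []

for piece in fake_model_stream(tokens):
    if first_chunk_at is None:
        # Time until the user sees the first piece of text.
        first_chunk_at = time.perf_counter() - start
    pieces.append(piece)
    print(piece, end="", flush=True)

total = time.perf_counter() - start
print(f"\nfirst chunk after {first_chunk_at:.2f}s, full response after {total:.2f}s")
```

The total time is the same as it would be without streaming, but the user starts reading after the first delay rather than after the last one.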

    Let's do a minimal example of this using, as usual, OpenAI's API:

    from openai import OpenAI

    # Replace with your own key (or load it from an environment variable).
    client = OpenAI(api_key="your_api_key")

    # With stream=True the call returns an iterator of events
    # instead of a single completed response.
    stream = client.responses.create(
        model="gpt-4.1-mini",
        input="Explain response streaming in 3 short paragraphs.",
        stream=True,
    )

    full_text = ""

    for event in stream:
        # only print text deltas as text parts arrive
        if event.type == "response.output_text.delta":
            print(event.delta, end="", flush=True)
            full_text += event.delta

    print("\n\nFinal collected response:")
    print(full_text)

    In this example, instead of receiving a single completed response, we iterate over a stream of events and print each text fragment as it arrives. At the same time, we also accumulate the chunks into a full response, full_text, to use later if we want to.
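If the same filtering is needed in several places, one option is to wrap the loop in a small generator so that UI code only ever sees text. This is just one possible arrangement: `iter_text_deltas` is a hypothetical helper, and the `SimpleNamespace` objects below are offline stand-ins for the events the API would return, so that the sketch runs without a network call.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_text_deltas(stream: Iterable) -> Iterator[str]:
    """Yield only the text fragments from a stream of events."""
    for event in stream:
        # Skip every event that is not a text delta.
        if getattr(event, "type", None) == "response.output_text.delta":
            yield event.delta


# Stand-in events for illustration; a real stream comes from the API call above.
fake_stream = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Hello"),
    SimpleNamespace(type="response.output_text.delta", delta=" world!"),
    SimpleNamespace(type="response.completed"),
]

collected = ""
for delta in iter_text_deltas(fake_stream):
    print(delta, end="", flush=True)
    collected += delta
print()
```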

    . . .

    So, should I just slap stream=True on every request?

    The short answer is no. As useful as it is, with great potential for significantly improving the user experience, streaming is not a one-size-fits-all solution for AI apps, and we should use our judgment about where it should be implemented and where it should not.

    More specifically, adding streaming to an AI app is very effective in setups where we expect long responses and value the user experience and responsiveness of the app above all. One such case is consumer-facing chatbots.

    On the flip side, for simple apps where we expect responses to be short, adding streaming is unlikely to bring significant gains to the user experience and doesn't make much sense. On top of this, streaming mostly makes sense in cases where the model's output is free text rather than structured output (e.g., JSON data), since a partial JSON document generally can't be parsed or used until it is complete.

    Most importantly, the biggest drawback of streaming is that we are not able to review the full response before displaying it to the user. Remember, LLMs generate tokens one by one, and the meaning of the response is formed as the response is generated, not in advance. If we make 100 requests to an LLM with the exact same input, we are likely to get 100 different responses. In other words, no one knows what a response is going to say before it is complete. As a result, with streaming enabled it is much more difficult to review the model's output before displaying it to the user and to enforce any guarantees on the produced content. We can always try to evaluate partial completions, but partial completions are harder to evaluate, as we have to guess where the model is going. On top of that, this evaluation has to be performed in real time, and not just once but repeatedly on successive partial responses, which makes the process even more challenging. In practice, in such cases, validation is run on the entire output after the response is complete. The trouble is that, at this point, it may already be too late, as we may have already shown the user inappropriate content that does not pass our validations.
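One possible middle ground, sketched below under loud assumptions, is to hold text back until a sentence is complete, check that sentence, and only then show it. The helper `stream_with_checks`, the `is_allowed` check, the "." boundary, and the sample fragments are all hypothetical. The approach also has clear limits: it delays output, and anything released earlier stays visible even if a later sentence fails.

```python
from typing import Callable, Iterable, Iterator


def stream_with_checks(
    deltas: Iterable[str],
    is_allowed: Callable[[str], bool],
    boundary: str = ".",
) -> Iterator[str]:
    """Release text one sentence at a time, stopping at the first failed check."""
    buffer = ""
    for delta in deltas:
        buffer += delta
        while boundary in buffer:
            cut = buffer.index(boundary) + len(boundary)
            sentence, buffer = buffer[:cut], buffer[cut:]
            if not is_allowed(sentence):
                yield "[response withheld]"
                return  # stop consuming the stream
            yield sentence
    # Text after the last boundary is checked once the stream ends.
    if buffer:
        yield buffer if is_allowed(buffer) else "[response withheld]"


# Made-up fragments and a toy check, purely for illustration.
deltas = ["Stream", "ing is fine.", " This part is ", "forbidden.", " Never shown."]
shown = list(stream_with_checks(deltas, lambda s: "forbidden" not in s))
print(shown)  # ['Streaming is fine.', '[response withheld]']
```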

    . . .

    On my mind

    Streaming is a feature that has no actual impact on an AI app's capabilities, or on its cost and overall latency. Nevertheless, it can have a great impact on the way users perceive and experience an AI app. Streaming makes AI systems feel faster, more responsive, and more interactive, even when the time needed to generate the complete response stays exactly the same. That said, streaming is not a silver bullet. Different applications and contexts may benefit more or less from introducing streaming. Like many decisions in AI engineering, it is less about what is possible and more about what makes sense for your specific use case.



