How to Stream LLM Responses in Your App

Make AI chat feel instant by streaming tokens as they generate — enabling it on the API, piping it to the browser, and handling errors mid-stream.

Written by
BSH Technologies
Published on 2026-04-12

How do you stream LLM responses in an app?

You stream LLM responses by enabling the streaming option on your API call and then forwarding each chunk of generated text to the browser as it arrives, instead of waiting for the whole reply. The model produces text token by token; with streaming on, the API sends those tokens as a continuous event stream, your backend relays them to the frontend over a streaming connection, and the UI appends each piece so words appear in real time. The total generation time is the same, but the app feels dramatically faster because the user sees output immediately.

Both the OpenAI and Claude APIs support streaming, and the pattern is consistent: turn it on, iterate the stream, relay the chunks. The reason it matters so much is psychological. A user staring at a blank screen for eight seconds assumes something is broken; the same user watching words appear after half a second happily reads along for those same eight seconds. You are not making the model faster; you are removing the dead silence that makes waiting feel intolerable.

Step one: turn on streaming at the API

In either provider's SDK, you set the streaming flag and then iterate over the response as an async stream rather than awaiting a single object.

  • Enable the stream option (or the SDK's streaming method) on your request.
  • Loop over the incoming events, each carrying a small delta of text.
  • Concatenate the deltas on the server if you need the full text for logging, while still relaying each one onward.

The model is generating as you read, so your loop runs for the lifetime of the response, not in a single step.
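The loop described above can be sketched roughly as follows. This is a minimal illustration, not either provider's actual API: the `TextDelta` shape, `relayAndCollect`, and `fakeModelStream` are hypothetical names, and real SDK events carry more fields and differ between providers.

```typescript
// Hypothetical event shape; real SDK events differ by provider.
interface TextDelta {
  text: string;
}

// Forwards each delta as soon as it arrives while building the full text on the side.
async function* relayAndCollect(
  events: AsyncIterable<TextDelta>,
  onComplete: (fullText: string) => void,
): AsyncGenerator<string> {
  let fullText = "";
  for await (const event of events) {
    fullText += event.text; // server-side copy for logging or storage
    yield event.text; // relay the delta onward immediately
  }
  onComplete(fullText); // reached only if the stream finished normally
}

// Stand-in for a model stream so the sketch runs without a network call.
async function* fakeModelStream(): AsyncGenerator<TextDelta> {
  for (const text of ["Hel", "lo, ", "world"]) {
    yield { text };
  }
}

async function demo(): Promise<void> {
  const stream = relayAndCollect(fakeModelStream(), (full) => {
    console.log("full text for logging:", full);
  });
  for await (const delta of stream) {
    console.log("relay to client:", delta);
  }
}
```

Writing the relay as an async generator keeps it independent of the transport: whatever sends data to the browser can simply iterate it.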

Step two: get the chunks to the browser

Your backend has the stream; the frontend needs it. There are two common transports.

  1. Server-Sent Events (SSE) — a simple one-way stream from server to browser, well suited to pushing tokens and natively supported by browsers through the EventSource API.
  2. A streamed HTTP response — return a readable stream from your endpoint and read it on the client with the Fetch API's streaming body, appending each chunk as it lands.

Pick the transport your framework supports most naturally. Many modern full-stack frameworks have a streaming helper that handles the plumbing so you focus on the tokens.
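If you go the SSE route without a framework helper, the wire format is simple enough to produce by hand. The sketch below shows one way to frame a chunk; the function names and the `done` event are assumptions for illustration, not part of any particular framework.

```typescript
// Formats one chunk of text as a Server-Sent Events message.
// Every line of the payload gets its own "data:" field, and a blank line ends the event.
function toSseEvent(chunk: string): string {
  const dataLines = chunk.split(/\r\n|\r|\n/).map((line) => `data: ${line}`);
  return dataLines.join("\n") + "\n\n";
}

// Hypothetical named event telling the client the reply is finished.
function sseDoneEvent(): string {
  return "event: done\ndata: end\n\n";
}

console.log(JSON.stringify(toSseEvent("Hello")));
console.log(JSON.stringify(toSseEvent("line one\nline two")));
```

The response carrying these events should be served with the `text/event-stream` content type. An explicit end-of-stream event is worth having because EventSource reconnects automatically when the connection closes; the client can listen for it and call `close()` itself.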

On the client, you append each received chunk to the message currently being rendered. The effect is the familiar typing animation, except it is real generation, not a fake delay.
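For the streamed HTTP response option, the client-side read loop might look like the sketch below. It uses the standard Fetch and Streams APIs; the `/api/chat` endpoint and the `messageEl` element in the usage comment are hypothetical.

```typescript
// Reads a streamed response body piece by piece and hands each decoded piece to a callback.
async function readTextStream(
  response: Response,
  onChunk: (text: string) => void,
): Promise<void> {
  if (!response.body) {
    throw new Error("Response has no body to stream");
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact when they are split across chunks
    onChunk(decoder.decode(value, { stream: true }));
  }
  const tail = decoder.decode(); // flush any bytes still buffered in the decoder
  if (tail) onChunk(tail);
}

// Usage in a browser (hypothetical endpoint and element):
//   const response = await fetch("/api/chat", { method: "POST", body: prompt });
//   await readTextStream(response, (text) => { messageEl.textContent += text; });
```

Network chunks do not line up with tokens or even with character boundaries, which is why the decoder runs in streaming mode rather than decoding each chunk in isolation.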

Handle the things that go wrong mid-stream

Streaming introduces failure modes a single request does not have, because the connection is open and live.

  • The stream breaks partway — handle a dropped connection so the UI shows a clear state rather than a half-message frozen forever.
  • The user navigates away — cancel the request so you stop paying for tokens nobody will read.
  • An error arrives after streaming starts — you have already shown partial text, so surface the error gracefully instead of discarding what was rendered.
  • You still need the full text — accumulate chunks server-side for logging, moderation, or storage, since the client only ever saw fragments.
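One way to cover these cases in a single place is to have the consuming loop report how the stream ended and always return the text received so far. The sketch below assumes the chunks arrive as an async iterable of strings; `StreamOutcome` and `consumeWithRecovery` are illustrative names.

```typescript
type StreamOutcome =
  | { status: "complete"; text: string }
  | { status: "cancelled"; text: string }
  | { status: "failed"; text: string; error: unknown };

// Consumes text chunks and keeps whatever arrived, even if the stream ends badly.
async function consumeWithRecovery(
  chunks: AsyncIterable<string>,
  signal: AbortSignal,
  onChunk: (text: string) => void,
): Promise<StreamOutcome> {
  let text = "";
  try {
    for await (const chunk of chunks) {
      if (signal.aborted) {
        // Returning here also ends the iteration, letting the source clean up.
        return { status: "cancelled", text };
      }
      text += chunk;
      onChunk(chunk);
    }
    return { status: "complete", text };
  } catch (error) {
    if (signal.aborted) {
      return { status: "cancelled", text };
    }
    // Partial text is preserved so the UI can show it next to an error notice.
    return { status: "failed", text, error };
  }
}
```

In a browser, the same `AbortSignal` can be passed to `fetch` through its `signal` option so that a stop button or a page unload handler calling `abort()` on the controller tears down the underlying request rather than only ignoring its output.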

Small touches that make streaming feel right

Once the tokens are flowing, a few details separate a rough implementation from one that feels deliberate and calm.

  • Show a thinking state instantly — render a cursor or subtle indicator the moment the request starts, before the first token, so there is never a dead pause.
  • Auto-scroll thoughtfully — keep the latest text in view as it grows, but stop fighting the user if they scroll up to re-read something.
  • Let users stop — a visible stop button that cancels the request gives people control and saves tokens when an answer is clearly going the wrong way.
  • Render markdown progressively — if your model emits markdown, format it as it streams so code blocks and lists appear styled rather than as raw symbols.

None of these change the underlying mechanics, but together they are the difference between a feature that feels native and one that feels bolted on.
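As an example of one of these touches, the auto-scroll rule can be reduced to a small check: follow the text only if the user was already at the bottom before the new chunk was appended. This is a sketch under assumptions; the 40-pixel threshold and the function names are arbitrary choices, and `ScrollMetrics` lists only the scroll properties a DOM element already exposes.

```typescript
// The subset of a DOM element's scroll properties this logic needs.
interface ScrollMetrics {
  scrollTop: number;
  clientHeight: number;
  scrollHeight: number;
}

// True when the view is at or near the bottom, meaning the user has not scrolled up to re-read.
function isPinnedToBottom(el: ScrollMetrics, thresholdPx = 40): boolean {
  const distanceFromBottom = el.scrollHeight - el.scrollTop - el.clientHeight;
  return distanceFromBottom <= thresholdPx;
}

// Measure first, append the new text, then scroll only if the user was pinned beforehand.
function appendAndMaybeScroll(el: ScrollMetrics, append: () => void): void {
  const wasPinned = isPinnedToBottom(el);
  append();
  if (wasPinned) {
    el.scrollTop = el.scrollHeight; // browsers clamp this to the maximum scroll offset
  }
}
```

The measurement has to happen before the append; afterwards the content is taller and a user who was at the bottom would look as if they had scrolled up.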

When not to stream

Streaming is ideal for conversational, user-facing output. It is the wrong tool when you need the complete, validated result before doing anything — for example when the model returns JSON you must parse, or a function call you must execute. In those cases, partial output is useless until it is whole, so wait for the full response. Match the technique to the interaction: stream what a human reads as it appears, and buffer what your code has to consume in one piece.
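The buffering side of that rule can be sketched as below: collect everything, parse once, validate, and only then act. `collectJson`, `WeatherCall`, and `isWeatherCall` are hypothetical names used for illustration.

```typescript
// Collects every chunk before parsing, because a JSON fragment is not valid JSON.
async function collectJson<T>(
  chunks: AsyncIterable<string>,
  isValid: (value: unknown) => value is T,
): Promise<T> {
  let buffer = "";
  for await (const chunk of chunks) {
    buffer += chunk;
  }
  const parsed: unknown = JSON.parse(buffer); // throws SyntaxError on malformed output
  if (!isValid(parsed)) {
    throw new Error("Model output did not match the expected shape");
  }
  return parsed;
}

// Hypothetical structured result the application expects from the model.
interface WeatherCall {
  city: string;
}

function isWeatherCall(value: unknown): value is WeatherCall {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as { city?: unknown }).city === "string"
  );
}
```

The same helper works whether the chunks come from a streaming connection or a single response wrapped as one chunk; what matters is that nothing downstream sees the data until it has parsed and validated as a whole.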

Prefer it built for you?

Streaming is what makes AI features feel polished, but the backpressure, cancellation, and mid-stream error handling are easy to get subtly wrong. Talk to BSH Technologies about our software engineering services and we will build smooth, resilient streaming into your app.

Frequently asked questions

Does streaming make the LLM respond faster?

It does not change total generation time, but it makes the app feel much faster because the user sees text appear immediately instead of waiting for the entire reply. This improvement in perceived speed is why streaming is standard for conversational, user-facing AI features.

What transport should I use to stream tokens to the browser?

The two common options are Server-Sent Events (SSE), a simple one-way stream the browser supports via EventSource, and a streamed HTTP response read on the client with the Fetch streaming body. Pick whichever your framework supports most naturally; many full-stack frameworks include a streaming helper that handles the plumbing.

Should I stream when the model returns JSON?

No. Streaming suits text a human reads as it appears. When you need a complete, validated result before acting, such as JSON you must parse or a function call you must execute, wait for the full response. Partial structured output is unusable until it is whole, so buffer it instead.

What happens if a user leaves mid-stream?

Cancel the request so you stop generating and stop paying for tokens nobody will read. Streaming keeps the connection open and live, so your code should detect navigation or disconnection and abort the call, and handle a broken stream by showing a clear UI state rather than a frozen half-message.
