Why do LLM responses stream?
It's not for show. The model literally generates one token at a time, and forcing it to buffer the full answer before sending would make every chat app feel broken. Streaming is the network shape of an autoregressive process.
Why it exists
Open any chat app built on a modern LLM and the answer arrives typewriter-style, not as a finished paragraph. It’s tempting to read that as a UX flourish — a fancy loading animation. It isn’t. It’s the wire shape of how the model actually produces text.
An LLM is autoregressive: to produce token N+1 it has to look at tokens 1 through N, including the ones it just produced. There is no parallel “compute the whole reply at once” mode. The fiftieth token is ready after fifty forward passes, in sequence. Each pass takes some milliseconds.
If the server waited for the whole reply before responding, two bad things would happen:
- The user stares at a spinner for the full generation time. A 500-token reply at, say, 50 tokens per second is a ten-second wait with nothing on screen, even though the first token was ready well before the rest.
- You’d be paying for a buffer you didn’t need. The server holds the text. The network does nothing. The client does nothing. Three idle layers, on purpose.
Streaming is the obvious move once you see this: emit each token (or small group of tokens) as soon as it pops out of the decoder. The user sees motion much sooner than the full reply could possibly arrive, and the same wall-clock generation feels dramatically faster because the time-to-first-token dominates how a chat UI feels — even if it’s not the only latency signal.
Why it matters now
Most major LLM provider APIs ship streaming as a first-class option, and production chat UIs typically default to it. As an engineer in 2026 you’ll hit it from at least three sides:
- Client code. You’re not parsing a JSON response anymore — you’re reading a stream of events and concatenating tokens as they arrive. The bug you’ll meet first is “the stream looked fine but my JSON parser exploded,” because nothing about the wire format guarantees a complete JSON document until the stream ends.
- Server code. Your handler can’t be the usual “compute response, return it” function. It has to yield chunks, flush them, and survive client disconnects mid-stream. Frameworks that assume a single response body fight you here.
- Infrastructure. Reverse proxies, load balancers, and CDNs that buffer the response body to “optimize” delivery turn streaming into non-streaming. So does any middlebox that compresses the whole body before forwarding. Debugging “why doesn’t my stream stream?” usually ends at one of these.
And it shows up beyond LLMs: agent frameworks stream tool-call deltas and partial structured outputs (some also stream reasoning traces, when the provider exposes them). The pattern is generalizing fast enough that “knows how streaming works at the HTTP layer” is becoming a baseline skill, not a specialty.
The short answer
streaming response = open-ended HTTP body + message framing + chunks flushed as the model emits them
Three layers stack here, and confusing them is half the bugs. The model generates incrementally. HTTP carries an open-ended body — via chunked transfer-encoding on HTTP/1.1, via DATA frames on HTTP/2 and HTTP/3. On top of that, an app-level framing — SSE or NDJSON — carves the byte stream into discrete messages the client can parse. The interesting part is what’s causing the chunks: an autoregressive decoder that has no choice but to produce text in order, one step at a time.
How it works
Three things have to line up: the model produces tokens incrementally, the HTTP layer can carry an open-ended body, and the client knows how to read it.
1. The decoder is sequential by construction
Inside the model, generating a reply is a loop:
prompt → forward pass → token_1
prompt + token_1 → forward pass → token_2
prompt + token_1 + token_2 → forward pass → token_3
...
Each forward pass is one trip through the network’s layers on a GPU. Modern serving stacks reuse most of the work between passes via the KV cache, which is why “tokens per second” is a meaningful steady-state number after the first one. But sequential it remains: token N+1 cannot start until token N exists.
This is the load-bearing fact for the whole post. Streaming isn’t a delivery optimization layered on top of a fundamentally batch computation — it’s the natural output shape of the computation itself. A non-streaming API is the special case, where the server volunteers to buffer.
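If the shape is easier to see in code, here’s a minimal sketch of that loop as an async generator; forwardPass and EOS are stand-ins for whatever the serving stack actually does, not a real API:
// A sketch of the decode loop as an async generator. forwardPass() and EOS
// are stand-ins for the real serving stack; the point is the shape: each
// step consumes everything generated so far and yields exactly one token.
async function* generate(promptTokens, maxTokens = 256) {
  const context = [...promptTokens];
  for (let i = 0; i < maxTokens; i++) {
    const next = await forwardPass(context); // one sequential forward pass
    if (next === EOS) return;                // model decided it's done
    context.push(next);                      // token N+1 needs token N...
    yield next;                              // ...which is why it can stream
  }
}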
2. HTTP can carry an open-ended body
You don’t need a new protocol for this. Plain HTTP/1.1 has had chunked transfer-encoding since the 1990s (currently specified in RFC 9112). The body is a sequence of chunks, each prefixed with its length in hex, ending with a zero-length chunk:
HTTP/1.1 200 OK
Content-Type: text/event-stream
Transfer-Encoding: chunked

1a
data: {"token": "Hello"}\n\n
1c
data: {"token": ", world"}\n\n
0
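On the producing side you don’t have to think about those hex prefixes at all. Here’s a minimal sketch with Node’s built-in http module; the deltas array is a stand-in for a real decoder, and because no Content-Length is set, Node emits the body with chunked transfer-encoding on its own:
import http from "node:http";

// Each write() goes out as soon as it exists; with no Content-Length set,
// Node frames the body with chunked transfer-encoding for HTTP/1.1 clients.
// The deltas array is a stand-in for a real decoder.
http.createServer(async (req, res) => {
  res.writeHead(200, { "Content-Type": "text/event-stream" });
  const deltas = ["Hello", ", world"];
  for (const token of deltas) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`); // one SSE event per delta
    await new Promise((r) => setTimeout(r, 50));         // pretend decode time
  }
  res.end(); // Node sends the terminating zero-length chunk here
}).listen(8080);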
Two common application-level framings sit on top of an open-ended HTTP body:
- Server-Sent Events (SSE). A simple text format where each message is one or more "field: value" lines ending with a blank line. The browser exposes it via EventSource. Many LLM provider APIs use SSE or an SSE-shaped variant.
- Newline-delimited JSON (NDJSON / JSON Lines). One JSON object per line. Less ceremony than SSE; works fine for non-browser clients. Both framings are sketched concretely just below.
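To make the difference concrete, here’s the same delta framed both ways; the payload shape is illustrative, not any particular provider’s:
const delta = { token: ", world" };

// SSE: the event ends at the blank line.
const sseFrame = `data: ${JSON.stringify(delta)}\n\n`;

// NDJSON: the event ends at the newline.
const ndjsonFrame = `${JSON.stringify(delta)}\n`;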
HTTP/2 and HTTP/3 don’t use Transfer-Encoding: chunked at all — that mechanism is HTTP/1.1-specific. They carry the response body as a sequence of DATA frames on a multiplexed stream, and the server can flush each frame as soon as it’s ready. From the application’s point of view the contract is the same: write some bytes, the other side eventually reads them, and the layer in between doesn’t have to know the total length up front. See QUIC for why the lower layer still matters on flaky links.
3. The client reads incrementally
The client side is where naïve code falls down. You can’t await response.json() — there is no complete JSON until the server closes the stream. You have to consume the body as it arrives:
const res = await fetch(url, { method: "POST", body: ... });
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buf = "";
while (true) {
const { value, done } = await reader.read();
if (done) break;
buf += decoder.decode(value, { stream: true });
// split buf on the framing delimiter, parse complete events,
// keep the unfinished tail for the next iteration.
}
The detail that bites everyone once: a single read() is not aligned to a logical message. You’ll get half an event, then the other half plus two more events, then nothing for 200 ms. The client has to buffer until it sees a complete message boundary (blank line for SSE, newline for NDJSON) before parsing. Decoding bytes-to-text needs the streaming flag too, or you’ll corrupt multi-byte UTF-8 characters whose bytes happened to span a chunk boundary.
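Here’s a minimal sketch of that buffering step for an SSE-shaped stream. It assumes the simple one-line data: {...} form many APIs use; real SSE allows multiple fields per event:
// Split the accumulated text into complete events, parse them, and hand back
// the unfinished tail so the read loop can keep appending to it.
function drainSSE(buf) {
  const parts = buf.split("\n\n");
  const tail = parts.pop(); // incomplete event, possibly ""
  const events = [];
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (line.startsWith("data: ")) events.push(JSON.parse(line.slice(6)));
    }
  }
  return { events, tail };
}

// Inside the read loop above:
//   const { events, tail } = drainSSE(buf);
//   buf = tail;
//   for (const e of events) { /* append e.token to the UI */ }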
Show the seams
A few things the typewriter effect hides:
- The chunks aren’t characters or words. Providers stream small text deltas, often (but not always) aligned to single tokens. You’ll see fragments like "Hello", ",", " world", "!" arrive as separate events. The chat UI concatenates them; if you log them raw you’ll see the shape of the tokenizer’s vocabulary, not English words.
- “Tokens per second” is throughput, not latency. The number that governs how a chat feels is time-to-first-token, which is dominated by prompt length, model size, and however much queueing the server is doing — not by the per-token rate. A model that emits 200 tok/s but takes 4 seconds to emit the first one feels broken.
- Streaming complicates errors. A non-streaming response can return a 500 with a clean error body. A streaming response that fails after sending 200 OK and 50 tokens has to deliver the error inside the stream — usually as a special event the client must recognize. Plenty of clients don’t, and silently truncate.
- Buffering kills it. If you don’t see chunks arriving live in the browser, suspect, in order: a reverse proxy (nginx’s proxy_buffering on is the classic), a response-compression implementation that accumulates output before flushing, or a framework middleware that reads the whole body before forwarding it. The transport itself almost never breaks streaming; some box in the middle does. (The sketch after this list shows a header-level mitigation nginx honors.)
- Cancellation is real money. When a user closes the tab mid-generation, you want the server to stop generating immediately — every further token is wasted GPU time. That requires the handler to actually observe client disconnect and cancel the inference job. Frameworks vary in how easy this is to wire up; the sketch after this list shows one way in plain Node.
- WebSockets exist; they’re often overkill. WebSockets give you bidirectional messaging and a different framing protocol. For one-shot “send prompt, receive token stream” the asymmetry of HTTP streaming (request body finite, response body open-ended) is a natural fit, and SSE-shaped APIs have ended up common for this niche — though plenty of agent and voice systems do reach for WebSockets when they actually need bidirectional traffic.
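Two of those seams, buffering and cancellation, are easiest to see in one place. Here’s a minimal sketch of a Node handler that sets an anti-buffering header nginx honors and aborts generation on client disconnect. generateTokens() is a stand-in for whatever actually produces the token stream and is assumed to honor an AbortSignal:
import http from "node:http";

// X-Accel-Buffering asks nginx not to buffer this response (other proxies
// need their own config); the AbortController wiring stops generation the
// moment the client goes away. generateTokens() is a stand-in.
http.createServer((req, res) => {
  const controller = new AbortController();
  // 'close' fires when the connection drops; it also fires after a normal
  // finish, in which case aborting is a harmless no-op.
  res.on("close", () => controller.abort());

  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",
  });

  (async () => {
    try {
      for await (const token of generateTokens(req, controller.signal)) {
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
      }
    } catch (err) {
      if (!controller.signal.aborted) throw err; // real failures still surface
    } finally {
      res.end();
    }
  })();
}).listen(8080);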
Famous related terms
- Server-Sent Events (SSE) — SSE = HTTP response + text framing of "data: …\n\n" events — a common framing for LLM streaming endpoints, browser-readable via EventSource.
- Chunked transfer-encoding — chunked = length-prefixed body chunks + a zero-length terminator — the HTTP/1.1 mechanism for open-ended bodies. HTTP/2 and HTTP/3 don’t use it; they stream via DATA frames instead.
- NDJSON / JSON Lines — NDJSON ≈ "one JSON object per line" — the simplest streaming framing; common in non-browser APIs.
- WebSocket — WebSocket = HTTP upgrade + bidirectional message frames — when you need the client to also stream messages back, not just receive them.
- Time-to-first-token (TTFT) — TTFT = time from request sent to first token received — the latency metric that correlates most with how a streaming UI feels.
- KV cache — what keeps per-step decode cost from re-doing work for every previous token; per-step cost still grows with context length, just much more slowly.
- QUIC — the transport that HTTP/3 streams ride on. Independent QUIC streams avoid the connection-wide head-of-line blocking TCP+TLS suffers from on lossy links.
Going deeper
- RFC 9112 §7 — chunked transfer-encoding, the HTTP/1.1 mechanism. Drier than the topic deserves but the source of truth for the framing.
- The HTML Living Standard’s “Server-sent events” section — the normative spec for SSE, including the EventSource API and the exact rules for parsing data: lines.
- Any LLM provider’s streaming docs (OpenAI, Anthropic, etc.) — five minutes with their reference client is the fastest way to internalize the event shapes. Read the cancellation and error-event sections, not just the happy path.
- nginx’s proxy_buffering documentation — required reading the first time you deploy a streaming endpoint behind a reverse proxy and discover it isn’t streaming anymore.