
Why do LLM responses stream?

It's not for show. The model literally generates one token at a time, and forcing it to buffer the full answer before sending would make every chat app feel broken. Streaming is the network shape of an autoregressive process.

Networking intro · Apr 29, 2026

Why it exists

Open any chat app built on a modern LLM and the answer arrives as a typewriter, not a paragraph. It’s tempting to read that as a UX flourish — a fancy loading animation. It isn’t. It’s the wire shape of how the model actually produces text.

An LLM is autoregressive: to produce token N+1 it has to look at tokens 1 through N, including the ones it just produced. There is no parallel “compute the whole reply at once” mode. The fiftieth token is ready after fifty forward passes, in sequence. Each pass takes some milliseconds.

If the server waited for the whole reply before responding, two bad things would happen:

  1. The user stares at a spinner for the full generation time. A 500-token reply at, say, 50 tokens per second is a ten-second wait with nothing on screen, even though the first token was ready well before the rest.
  2. You’d be paying for a buffer you didn’t need. The server holds the text. The network does nothing. The client does nothing. Three idle layers, on purpose.

Streaming is the obvious move once you see this: emit each token (or small group of tokens) as soon as it pops out of the decoder. The user sees motion much sooner than the full reply could possibly arrive, and the same wall-clock generation feels dramatically faster because the time-to-first-token dominates how a chat UI feels — even if it’s not the only latency signal.
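
To put rough numbers on the difference, here is a back-of-the-envelope sketch; the 500-token reply and 50 tokens-per-second rate are the illustrative figures from above, not measurements, and real time-to-first-token also includes prompt processing.

// Illustrative figures from the example above, not measurements.
const replyTokens = 500;
const tokensPerSecond = 50;

// Buffered: nothing renders until the whole reply exists.
const bufferedWait = replyTokens / tokensPerSecond; // 10 seconds of blank screen

// Streamed: the first token shows up after roughly one decode step.
// (Prompt processing / prefill adds to both cases and is ignored here.)
const firstTokenWait = 1 / tokensPerSecond; // ~0.02 seconds

console.log({ bufferedWait, firstTokenWait });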

Why it matters now

Most major LLM provider APIs ship streaming as a first-class option, and production chat UIs typically default to it. As an engineer in 2026 you'll hit it from more than one side.

And it shows up beyond LLMs: agent frameworks stream tool-call deltas and partial structured outputs (some also stream reasoning traces, when the provider exposes them). The pattern is generalizing fast enough that “knows how streaming works at the HTTP layer” is becoming a baseline skill, not a specialty.

The short answer

streaming response = open-ended HTTP body + message framing + chunks flushed as the model emits them

Three layers stack here, and confusing them is half the bugs. The model generates incrementally. HTTP carries an open-ended body — via chunked transfer-encoding on HTTP/1.1, via DATA frames on HTTP/2 and HTTP/3. On top of that, an app-level framing — SSE or NDJSON — carves the byte stream into discrete messages the client can parse. The interesting part is what’s causing the chunks: an autoregressive decoder that has no choice but to produce text in order, one step at a time.

How it works

Three things have to line up: the model produces tokens incrementally, the HTTP layer can carry an open-ended body, and the client knows how to read it.

1. The decoder is sequential by construction

Inside the model, generating a reply is a loop:

prompt → forward pass → token_1
prompt + token_1 → forward pass → token_2
prompt + token_1 + token_2 → forward pass → token_3
...

Each forward pass is one trip through the network’s layers on a GPU. Modern serving stacks reuse most of the work between passes via the KV cache, which is why “tokens per second” is a meaningful steady-state number after the first one. But sequential it remains: token N+1 cannot start until token N exists.
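
As a sketch of that loop in code: model, forwardPass, and EOS below are stand-ins for whatever your inference stack actually exposes, not a real API; the point is the shape of the control flow.

// Hypothetical decode loop. `model.forwardPass` and `model.EOS` are
// stand-ins, not a real inference API.
async function* generate(model, promptTokens) {
  const context = [...promptTokens];
  while (true) {
    const next = await model.forwardPass(context); // one trip through the layers
    if (next === model.EOS) return;                // the model decided it's done
    context.push(next);                            // token N becomes input for token N+1
    yield next;                                    // emit immediately: this is the stream
  }
}

Every consumer of this generator, from the serving stack down to the browser, inherits that one-token-at-a-time shape.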

This is the load-bearing fact for the whole post. Streaming isn’t a delivery optimization layered on top of a fundamentally batch computation — it’s the natural output shape of the computation itself. A non-streaming API is the special case, where the server volunteers to buffer.

2. HTTP can carry an open-ended body

You don’t need a new protocol for this. Plain HTTP/1.1 has had chunked transfer-encoding since the 1990s (currently specified in RFC 9112). The body is a sequence of chunks, each prefixed with its length in hex, ending with a zero-length chunk:

HTTP/1.1 200 OK
Content-Type: text/event-stream
Transfer-Encoding: chunked

1a
data: {"token": "Hello"}\n\n
1c
data: {"token": ", world"}\n\n
0

Two common application-level framings sit on top of an open-ended HTTP body:

  1. SSE (Server-Sent Events): each message is one or more data: lines terminated by a blank line, which is what the text/event-stream Content-Type above signals.
  2. NDJSON: newline-delimited JSON, one complete JSON object per line.
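
For comparison, the two messages from the chunked example above framed as NDJSON are just one JSON object per line:

{"token": "Hello"}
{"token": ", world"}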

HTTP/2 and HTTP/3 don’t use Transfer-Encoding: chunked at all — that mechanism is HTTP/1.1-specific. They carry the response body as a sequence of DATA frames on a multiplexed stream, and the server can flush each frame as soon as it’s ready. From the application’s point of view the contract is the same: write some bytes, the other side eventually reads them, and the layer in between doesn’t have to know the total length up front. See QUIC for why the lower layer still matters on flaky links.
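
As a server-side sketch, here is roughly what the flushing side looks like with Node's built-in http module; tokenStream is a stand-in for a real inference client and just yields a few hard-coded tokens so the example runs.

// Minimal server-side sketch using Node's built-in http module.
// tokenStream() is a stand-in for a real inference client; it yields
// a few hard-coded tokens with a delay so the example is runnable.
import http from "node:http";

async function* tokenStream() {
  for (const t of ["Hello", ",", " world"]) {
    await new Promise((r) => setTimeout(r, 100)); // pretend each forward pass takes 100 ms
    yield t;
  }
}

http.createServer(async (req, res) => {
  res.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
  });
  for await (const token of tokenStream()) {
    // Each write goes out as soon as it's ready: a chunk on HTTP/1.1,
    // a DATA frame on HTTP/2/3. No total length is ever declared.
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.end(); // ends the body (the zero-length chunk on HTTP/1.1)
}).listen(3000);

Node applies chunked transfer-encoding automatically on HTTP/1.1 when no Content-Length is set, which is exactly the situation here.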

3. The client reads incrementally

The client side is where naïve code falls down. You can’t await response.json() — there is no complete JSON until the server closes the stream. You have to consume the body as it arrives:

const res = await fetch(url, { method: "POST", body: ... });
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buf = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // split buf on the framing delimiter, parse complete events,
  // keep the unfinished tail for the next iteration.
}

The detail that bites everyone once: a single read() is not aligned to a logical message. You’ll get half an event, then the other half plus two more events, then nothing for 200 ms. The client has to buffer until it sees a complete message boundary (blank line for SSE, newline for NDJSON) before parsing. Decoding bytes-to-text needs the streaming flag too, or you’ll corrupt multi-byte UTF-8 characters whose bytes happened to span a chunk boundary.
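
A minimal version of that buffer-and-split step for SSE framing might look like this; the {"token": ...} payload matches the simplified wire example above, not any particular provider's event schema.

// Accumulate decoded text, emit only complete SSE events.
let pending = "";
function feed(textChunk, onEvent) {
  pending += textChunk;
  const parts = pending.split("\n\n"); // an SSE event ends with a blank line
  pending = parts.pop();               // the tail may be incomplete; keep it for next time
  for (const part of parts) {
    for (const line of part.split("\n")) {
      if (line.startsWith("data: ")) {
        onEvent(JSON.parse(line.slice(6))); // 6 = "data: ".length
      }
    }
  }
}

In the read loop above, each decoder.decode(value, { stream: true }) result would go through feed instead of being appended to buf directly.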

Show the seams

A few things the typewriter effect hides:

Going deeper