Users expect streaming. In 2026, an LLM call that doesn't stream feels broken, like a webpage that loads all at once after a 3-second blank screen. Making streaming work end-to-end is more involved than it looks: your server-side buffering, edge proxies, client framework, and mobile browsers all have opinions. This post covers the architecture and the specific gotchas we hit most often.

End-to-end streaming

Client (SSE/fetch) → edge (no buffering) → your API (ReadableStream) → transform layer (token→delta) → LLM (stream=true). Any buffering at any hop breaks the perception of streaming.

The transport choice

Two real options in 2026: Server-Sent Events (SSE) or streamed fetch (a ReadableStream as the response body). WebSockets are overkill for unidirectional streaming, and HTTP/3 push isn't widely available. Underneath, both SSE and streamed fetch are an ordinary HTTP response delivered incrementally: chunked transfer encoding on HTTP/1.1, DATA frames on HTTP/2 and HTTP/3.

Our default is SSE: you get reconnection semantics, `event:` and `id:` fields, and easy debugging with `curl`. Streamed fetch works too; prefer it when you need custom framing (binary chunks, multiplexed streams). Don't overthink this choice; either works. Go with what your framework has better support for.
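To make the `event:` and `id:` fields concrete, here is a minimal sketch of how one SSE frame can be serialized on the server. The `SseEvent` type and `formatSseEvent` name are our own illustrative assumptions, not part of any framework; only the wire format (field lines, blank-line terminator) comes from the SSE spec.

```typescript
// One SSE event. The field names mirror the wire format; the type itself
// is a hypothetical helper for this sketch.
interface SseEvent {
  data: string;
  event?: string;
  id?: string;
}

function formatSseEvent({ data, event, id }: SseEvent): string {
  const lines: string[] = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // A payload containing newlines must be sent as one data: line per line.
  for (const line of data.split("\n")) lines.push(`data: ${line}`);
  // A blank line terminates the event.
  return lines.join("\n") + "\n\n";
}

// Example: a JSON delta tagged with an id the client can resume from.
const exampleFrame = formatSseEvent({
  id: "7",
  event: "delta",
  data: JSON.stringify({ text: "Hel" }),
});
```

Because each frame is plain text, `curl -N` against the endpoint shows exactly what the browser receives.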

Server-side: what to stream

You have three choices: raw LLM tokens, semantic chunks, or structured deltas.

Raw tokens are simplest — pass through whatever the LLM SDK emits. Works for pure text responses. Breaks down when the model is generating anything that needs to be parsed incrementally — JSON, HTML, Markdown with code blocks.

Semantic chunks: wait until a natural boundary (end of sentence, end of JSON field) and stream whole chunks. The output feels smoother because there's less jitter, at the cost of a small delay while each chunk fills; it's harder to implement because you need a streaming parser.
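As a rough illustration of semantic chunking, the sketch below buffers raw tokens and releases text only at sentence boundaries. The `SentenceChunker` class is a hypothetical helper, and the boundary rule (`.`, `!` or `?` followed by whitespace) is a deliberately naive assumption; it will split on abbreviations like "e.g. ", so treat it as a starting point.

```typescript
class SentenceChunker {
  private buffer = "";

  // Feed one raw token; returns zero or more complete sentences.
  push(token: string): string[] {
    this.buffer += token;
    const out: string[] = [];
    // Naive boundary: sentence punctuation followed by whitespace.
    const boundary = /[.!?]\s+/g;
    let start = 0;
    let match: RegExpExecArray | null;
    while ((match = boundary.exec(this.buffer)) !== null) {
      const end = match.index + match[0].length;
      out.push(this.buffer.slice(start, end));
      start = end;
    }
    // Keep the unfinished tail for the next token.
    this.buffer = this.buffer.slice(start);
    return out;
  }

  // Call once at stream end to emit whatever is left.
  flush(): string {
    const rest = this.buffer;
    this.buffer = "";
    return rest;
  }
}
```

Note that a sentence is only released once the whitespace after its punctuation arrives, which is the small added delay this approach trades for smoother output.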

Structured deltas: for JSON outputs, stream parsed partial objects ("here's what the model has produced so far: {title: 'hello', body: 'partial...'}"). Use a streaming JSON parser (partial-json, clarinet). This is what complex UIs — copilots that update forms live, dashboards that populate sections progressively — actually need.
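To show the idea without a dependency, here is a naive stand-in for a streaming JSON parser: it closes any open string and brackets, then tries `JSON.parse`. The function names are hypothetical and this is not how partial-json or clarinet work internally; in production, a real streaming parser is the safer choice. When the repaired text still isn't valid JSON (for example, the stream stopped mid-key), it returns `undefined` and the caller keeps the previous snapshot.

```typescript
// Append the closers needed to balance an incomplete JSON document.
function closePartialJson(text: string): string {
  const closers: string[] = [];
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") closers.push("}");
    else if (ch === "[") closers.push("]");
    else if (ch === "}" || ch === "]") closers.pop();
  }
  let repaired = text;
  if (inString) {
    // Drop a dangling backslash so the closing quote is not escaped.
    if (escaped) repaired = repaired.slice(0, -1);
    repaired += '"';
  }
  return repaired + closers.slice().reverse().join("");
}

// Returns the best-effort parsed object, or undefined if not parseable yet.
function parsePartialJson(text: string): unknown {
  try {
    return JSON.parse(closePartialJson(text));
  } catch {
    return undefined;
  }
}

// Accumulate tokens and collect every snapshot that parses.
function snapshots(tokens: string[]): unknown[] {
  const out: unknown[] = [];
  let text = "";
  for (const token of tokens) {
    text += token;
    const parsed = parsePartialJson(text);
    if (parsed !== undefined) out.push(parsed);
  }
  return out;
}
```

Each snapshot is a complete object, so the UI can render it directly instead of patching strings together.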

The infrastructure gotchas

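Node.js stacks often buffer by default. Streaming in Express works: call `res.flushHeaders()` to send headers immediately, then `res.write()` each chunk. If `compression` middleware is in the chain it will hold chunks back, so call the `res.flush()` it adds after each write, or disable compression for the streaming route. Next.js App Router streams cleanly when you return a ReadableStream from a route handler. The Pages Router does not stream well; if you're on Pages, migrate or use an external API.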
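A minimal sketch of the route-handler shape, using only web-standard APIs (`ReadableStream`, `Response`). The `fakeTokens` generator is a hypothetical stand-in for an LLM SDK stream, and `sseResponse` is our own helper name. The `X-Accel-Buffering: no` header is a hint that nginx honors; other proxies may ignore it.

```typescript
function sseResponse(tokens: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();
  const body = new ReadableStream<Uint8Array>({
    // pull() runs only when the consumer wants more, so backpressure is respected.
    async pull(controller) {
      const { done, value } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        const frame = `data: ${JSON.stringify({ text: value })}\n\n`;
        controller.enqueue(encoder.encode(frame));
      }
    },
    // Called when the client goes away; stop the token source too.
    async cancel() {
      await iterator.return?.();
    },
  });
  return new Response(body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
      "X-Accel-Buffering": "no",
    },
  });
}

// Hypothetical token source standing in for an LLM SDK stream.
async function* fakeTokens(): AsyncGenerator<string> {
  for (const t of ["Hel", "lo", "!"]) yield t;
}

// In an App Router route file this function would be exported as POST.
async function POST(_req: Request): Promise<Response> {
  return sseResponse(fakeTokens());
}
```

Using `pull` rather than pushing everything from `start` means a slow client slows the read from the model instead of filling server memory.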

Edge platforms (Cloudflare Workers, Vercel Edge) stream natively; this is one area where they're categorically better than Node origin servers. Responses stream over HTTP/1.1 chunked encoding and over HTTP/2, which has no chunked encoding and streams via DATA frames instead. If you're on Cloudflare, the path of least resistance is a Worker that proxies to your LLM provider and passes through the stream.

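CDN and proxy layers break streaming frequently. Cloudflare's default settings pass streams; some corporate proxies don't. On HTTP/1.1 hops, check for `Transfer-Encoding: chunked` in the response headers; if it disappears and a `Content-Length` shows up in its place, you found the offender. HTTP/2 hops never carry that header, so check timing there instead: `curl -N` should print chunks as they arrive rather than all at once.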
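Headers only tell part of the story, so it can help to measure when bytes actually arrive. The sketch below records chunk arrival times for a URL and applies a crude heuristic; `recordArrivals`, `looksBuffered`, and the 50 ms window are all assumptions of ours, and a genuinely short response will also look "buffered" under this rule.

```typescript
interface Arrival {
  atMs: number; // milliseconds since the request started
  bytes: number;
}

// Read a response chunk by chunk and note when each chunk lands.
async function recordArrivals(url: string): Promise<Arrival[]> {
  const started = Date.now();
  const response = await fetch(url);
  if (!response.body) throw new Error("response has no body");
  const reader = response.body.getReader();
  const arrivals: Arrival[] = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    arrivals.push({ atMs: Date.now() - started, bytes: value.byteLength });
  }
  return arrivals;
}

// Heuristic: a buffered hop delivers one chunk, or all chunks in one burst.
function looksBuffered(arrivals: Arrival[], windowMs = 50): boolean {
  if (arrivals.length <= 1) return true;
  const first = arrivals[0].atMs;
  const last = arrivals[arrivals.length - 1].atMs;
  return last - first < windowMs;
}
```

Running `recordArrivals` against the origin directly and then through each proxy layer narrows down which hop collapses the stream into a burst.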

Client-side

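Use the browser's `EventSource` for SSE; it handles reconnection. Note that `EventSource` only issues GET requests and cannot set custom headers, so if the prompt travels in a POST body, use streamed fetch and parse the SSE frames yourself. For streamed fetch, use `response.body.getReader()` and decode chunks manually with a `TextDecoder` in streaming mode. React's `useEffect` with an async iterator works well.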
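For the streamed-fetch path, a rough sketch of manual decoding might look like this. `SseParser` and `readSse` are hypothetical names, and the parser is simplified on purpose: it assumes `\n` line endings, returns only `data:` payloads, and ignores `event:`, `id:`, and `retry:` fields.

```typescript
class SseParser {
  private buffer = "";

  // Feed decoded text; returns the data payload of each complete event.
  feed(text: string): string[] {
    this.buffer += text;
    const events: string[] = [];
    let sep: number;
    // Events are separated by a blank line.
    while ((sep = this.buffer.indexOf("\n\n")) !== -1) {
      const block = this.buffer.slice(0, sep);
      this.buffer = this.buffer.slice(sep + 2);
      const data = block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice(5).replace(/^ /, ""))
        .join("\n");
      // Comment-only blocks (lines starting with ":") produce no event.
      if (data !== "") events.push(data);
    }
    return events;
  }
}

// Async iterator over event payloads from a fetch Response.
async function* readSse(response: Response): AsyncGenerator<string> {
  if (!response.body) throw new Error("response has no body");
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const parser = new SseParser();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact.
    yield* parser.feed(decoder.decode(value, { stream: true }));
  }
}
```

In a React component, a `for await (const payload of readSse(response))` loop inside `useEffect` can append each payload to state; returning a cleanup function that aborts the fetch stops the loop on unmount.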

iOS Safari pre-2025 hard-buffered the first 256 bytes of any response before firing progress events. Pad your first chunk with whitespace if you need compatibility with older iOS. Modern iOS and Android browsers are fine.
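If you are streaming SSE, one way to pad without polluting the visible output is an SSE comment: a line that starts with `:` is ignored by SSE clients. The helper name and the default of 256 are taken from the figure above and are otherwise our own assumption.

```typescript
// Build an SSE comment line of at least minBytes bytes.
// Spaces are single-byte in UTF-8, so character count equals byte count here.
function iosPaddingComment(minBytes = 256): string {
  return ":" + " ".repeat(minBytes) + "\n";
}
```

Send it once, before the first real event. For plain streamed fetch there is no comment syntax, so leading whitespace that the client trims plays the same role.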

User-facing polish

A blinking cursor during streaming feels alive. Render a light-gray cursor character at the insertion point and remove it when the stream ends. Scroll the response into view as it grows; stop auto-scrolling if the user scrolls away. Show an estimated progress indicator (token count vs expected) only for long responses where users actually want to know. Avoid percent bars; streaming isn't a bounded task.
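The "stop auto-scrolling if the user scrolls away" rule can be sketched as a check made just before each chunk is appended. The names and the 40 px threshold are illustrative assumptions; `ScrollMetrics` is shaped so that a DOM element can be passed in directly.

```typescript
interface ScrollMetrics {
  scrollTop: number;
  clientHeight: number;
  scrollHeight: number;
}

// True when the viewport is at (or within thresholdPx of) the bottom.
function isPinnedToBottom(m: ScrollMetrics, thresholdPx = 40): boolean {
  return m.scrollHeight - (m.scrollTop + m.clientHeight) <= thresholdPx;
}

// Append content, following the output only if the user had not scrolled away.
function appendChunk(container: ScrollMetrics, append: () => void): void {
  const follow = isPinnedToBottom(container); // measure before content grows
  append();
  if (follow) {
    container.scrollTop = container.scrollHeight - container.clientHeight;
  }
}
```

Measuring before the append matters: once the new text is in, the distance to the bottom has already grown and a pinned user would look like one who scrolled away.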

Handle cancellation properly. If the user closes the tab, aborts, or navigates away, close the upstream connection to the LLM provider. Leaked connections accumulate and burn tokens (and money). An AbortController propagated from the client through your API to the SDK call handles this cleanly.
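One possible shape for that propagation on the API side is sketched below. `cancellableStream` and the `UpstreamCall` type are hypothetical; the assumption is that your SDK call (or a raw `fetch` to the provider) accepts an `AbortSignal` and exposes tokens as an async iterable.

```typescript
// Hypothetical signature for "start the LLM call with this signal".
type UpstreamCall = (signal: AbortSignal) => AsyncIterable<string>;

function cancellableStream(
  clientSignal: AbortSignal,
  callUpstream: UpstreamCall,
): ReadableStream<Uint8Array> {
  // A separate controller lets both client aborts and stream cancels stop the call.
  const upstream = new AbortController();
  if (clientSignal.aborted) upstream.abort();
  else clientSignal.addEventListener("abort", () => upstream.abort(), { once: true });

  const encoder = new TextEncoder();
  const iterator = callUpstream(upstream.signal)[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { done, value } = await iterator.next();
      if (done) controller.close();
      else controller.enqueue(encoder.encode(value));
    },
    // Runs when the response consumer disappears (tab closed, navigation).
    async cancel() {
      upstream.abort();
      await iterator.return?.();
    },
  });
}
```

In a fetch-style handler, `request.signal` is a natural choice for `clientSignal`; on the browser side, the same `AbortController` passed to `fetch` covers the user pressing stop.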

