eazyware
Engineering·July 22, 2024·10 min read

WebSockets for AI: when SSE is not enough

Streaming LLM responses via SSE is simple but one-way. For bidirectional AI UX (voice, interruptions, tool calls), WebSockets unlock patterns SSE can't.

Kushal R.
Engineering lead

Server-sent events (SSE) handle most AI streaming needs: the server streams tokens and the client displays them. That is simple and effective for standard chat. But some AI UX patterns (voice interactions, interruptible generation, agent loops with concurrent tool results) require bidirectional communication. That's where WebSockets earn their extra complexity. This post covers the decision framework and concrete implementation patterns for WebSocket-based AI UX.

SSE vs WebSockets for AI

Server-Sent Events (SSE): one-way (server to client); HTTP-native and simpler; works through proxies easily; fits standard LLM streaming.

WebSockets: bidirectional; interruptions possible; voice and tool results mid-stream; more infra complexity.

Pick WebSockets for: voice AI (user speaks while assistant talks), interruptible generation (user cancels mid-response), and agent loops with real-time tool result injection.

SSE: one-way, simpler, fits standard LLM streaming. WebSockets: bidirectional, enables voice, interruptions, concurrent tool results.

When SSE fits

Chat UI where user types, hits send, watches response stream back, then types again. SSE is exactly this pattern: unidirectional from server to client during the response.

Streaming generation for any read-only user experience. Document generation, code completion, long-form content. User doesn't need to interject mid-stream.

Simpler infrastructure. HTTP-native. Works through most proxies and load balancers without special configuration. Connection reuses standard HTTP keepalive.

Most AI products start with SSE and should. Default to SSE unless you have a specific reason for WebSockets.
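
For reference, here is a minimal sketch of what the SSE side looks like on the wire. It only builds the text frames a server would write to the response body; the `token` field and the `done` event name are assumptions made for illustration, not part of any particular API.

```python
import json
from typing import Iterable, Iterator


def sse_frames(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then a closing event."""
    for token in tokens:
        # JSON-encode the payload so a newline inside a token cannot
        # break SSE's line-oriented framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # "done" is a hypothetical event name that tells the client to stop reading.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_frames(["Hel", "lo\n"]))
```

Each frame is a `data:` line followed by a blank line, and everything flows server to client. There is no frame the client can send back on the same stream, which is the limitation the next section is about.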

When WebSockets become necessary

Voice AI. The user speaks while the assistant is talking, so audio flows in both directions at once. Voice activity detection means the server has to keep receiving input audio while it is sending output audio. SSE can't do this; of the two transports compared here, only WebSockets can.

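Interruptible generation. The user sees the response going off-track and wants to redirect it: a stop button that actually takes effect, or a clarification message sent mid-response. An SSE stream can't carry user input while the server is streaming; the client would have to send a separate HTTP request, and the server would have to match it to the in-flight stream.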
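
One way to wire up an interrupt on the server, sketched with plain asyncio and no real socket: generation runs as a task, and an event that the receive loop would set on an interrupt message cancels it. `fake_llm_tokens` and the 50 ms interrupt are stand-ins invented for the demo.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_llm_tokens(n: int) -> AsyncIterator[str]:
    """Stand-in for a streaming model call (hypothetical)."""
    for i in range(n):
        await asyncio.sleep(0.01)
        yield f"tok{i}"


async def generate(sent: List[str]) -> None:
    """Stream tokens into `sent` until finished or cancelled."""
    async for token in fake_llm_tokens(1000):
        sent.append(token)  # a real server would write to the socket here


async def handle_turn(interrupt: asyncio.Event) -> List[str]:
    sent: List[str] = []
    gen_task = asyncio.create_task(generate(sent))
    stop_task = asyncio.create_task(interrupt.wait())
    # Whichever finishes first wins: generation completing or an interrupt.
    await asyncio.wait({gen_task, stop_task}, return_when=asyncio.FIRST_COMPLETED)
    for task in (gen_task, stop_task):
        task.cancel()  # no-op for the task that already finished
    await asyncio.gather(gen_task, stop_task, return_exceptions=True)
    return sent


async def demo() -> List[str]:
    interrupt = asyncio.Event()
    # Simulate the client sending an interrupt message after 50 ms.
    asyncio.get_running_loop().call_later(0.05, interrupt.set)
    return await handle_turn(interrupt)


tokens_before_stop = asyncio.run(demo())
```

The point of the design is that the stop takes effect on the server, so no further tokens are produced once the interrupt lands, rather than the client simply hiding them.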

Real-time tool result injection. Agent is generating output; new information arrives (web search completes, database query returns, another agent responds). Inject it into the generation without restarting. Bidirectional channel makes this clean.
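
A rough sketch of the injection idea, assuming the agent generates in steps and can fold new context in between them. Tool results land on a queue as they complete; the loop drains it before each step. The step structure and the `tool:` prefix are illustrative assumptions.

```python
import asyncio
from typing import List


def drain(queue: "asyncio.Queue[str]") -> List[str]:
    """Return every tool result that has already arrived, without waiting."""
    items: List[str] = []
    while True:
        try:
            items.append(queue.get_nowait())
        except asyncio.QueueEmpty:
            return items


async def agent_loop(tool_results: "asyncio.Queue[str]", steps: int) -> List[str]:
    context: List[str] = []
    for step in range(steps):
        # Fold in anything that arrived since the previous step.
        for result in drain(tool_results):
            context.append(f"tool:{result}")
        # Stand-in for generating the next chunk from `context` (hypothetical).
        await asyncio.sleep(0.01)
        context.append(f"chunk{step}")
    return context


async def demo() -> List[str]:
    results: "asyncio.Queue[str]" = asyncio.Queue()
    # Simulate a web search finishing partway through generation.
    asyncio.get_running_loop().call_later(0.015, results.put_nowait, "search finished")
    return await agent_loop(results, steps=4)


transcript = asyncio.run(demo())
```

The generation never restarts: the tool result shows up in the context partway through, and later chunks can use it.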

Long-lived connections with low overhead. For continuously interactive UIs, a single WebSocket connection avoids the setup cost of opening a new SSE stream for every turn.

Implementation patterns

Message framing. Each WebSocket message is a JSON object with a type field. Typed messaging supports many interaction patterns over one connection: user text, assistant tokens, tool calls, interrupts, acknowledgments.
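
A minimal sketch of such an envelope. The type names, the `seq` counter, and the `payload` field are assumptions chosen for the example; the only thing the pattern requires is the `type` field.

```python
import json
from typing import Any, Dict

# Hypothetical type names covering the categories listed above.
MESSAGE_TYPES = {"user_text", "assistant_token", "tool_call", "interrupt", "ack"}


def encode(msg_type: str, payload: Dict[str, Any], seq: int) -> str:
    """Serialize one message; `seq` lets the receiver spot gaps and acknowledge."""
    if msg_type not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {msg_type}")
    return json.dumps({"type": msg_type, "seq": seq, "payload": payload})


def decode(raw: str) -> Dict[str, Any]:
    """Parse and validate one message; raises ValueError on anything unexpected."""
    msg = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(msg, dict):
        raise ValueError("message must be a JSON object")
    msg_type = msg.get("type")
    if not isinstance(msg_type, str) or msg_type not in MESSAGE_TYPES:
        raise ValueError("missing or unknown message type")
    return msg


raw = encode("interrupt", {"reason": "user_stop"}, seq=7)
message = decode(raw)
```

Rejecting unknown types at the boundary keeps the handlers behind it simple: each one can assume it only ever sees its own message shape.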

Session state. WebSocket connections are stateful. Decide whether session state lives on the server holding the connection or in a distributed store. Server-held is simpler but brittle if the connection drops and reconnects to a different instance.
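
To make the trade-off concrete, here is a small sketch in which two server instances either share a store or each keep their own. `InMemoryStore` stands in for whatever you would actually use; a distributed store would implement the same two methods. All names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass
class Session:
    session_id: str
    history: List[str] = field(default_factory=list)


class SessionStore(Protocol):
    def load(self, session_id: str) -> Optional[Session]: ...
    def save(self, session: Session) -> None: ...


class InMemoryStore:
    """Process-local store; stands in for a shared store in this demo."""

    def __init__(self) -> None:
        self._sessions: Dict[str, Session] = {}

    def load(self, session_id: str) -> Optional[Session]:
        return self._sessions.get(session_id)

    def save(self, session: Session) -> None:
        self._sessions[session.session_id] = session


class Instance:
    """One server process handling messages for whichever client connects."""

    def __init__(self, store: SessionStore) -> None:
        self.store = store

    def on_message(self, session_id: str, text: str) -> int:
        session = self.store.load(session_id) or Session(session_id)
        session.history.append(text)
        self.store.save(session)
        return len(session.history)  # how much context this instance can see


# Shared store: the client reconnects to instance B and its context is intact.
shared = InMemoryStore()
a, b = Instance(shared), Instance(shared)
a.on_message("s1", "hello")
shared_len = b.on_message("s1", "still there?")

# Private stores: instance D has never seen the session.
c, d = Instance(InMemoryStore()), Instance(InMemoryStore())
c.on_message("s1", "hello")
private_len = d.on_message("s1", "still there?")
```

With the shared store the second instance sees two messages of history; with private stores it sees one, which is the brittleness described above.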

Reconnection handling. Networks drop WebSocket connections occasionally. Client should reconnect with session ID; server should restore context. Include heartbeat messages to detect dead connections faster than TCP timeouts.
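
A sketch of the client-side pieces. The post doesn't prescribe a retry policy; exponential backoff with jitter is an assumption added here so that many clients don't reconnect in lockstep after an outage. The `resume` message shape is likewise hypothetical.

```python
import random
from typing import Dict, Iterator, Optional, Union


def backoff_delays(base: float = 0.5, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield reconnect delays: exponential growth with full jitter, capped."""
    rng = rng or random.Random()
    attempt = 0
    while True:
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)
        attempt = min(attempt + 1, 16)  # bound the exponent so it cannot grow forever


def resume_message(session_id: str, last_seq: int) -> Dict[str, Union[str, int]]:
    """First message after reconnecting: which session, and the last sequence
    number the client saw, so the server can restore context from there."""
    return {"type": "resume", "session_id": session_id, "last_seq": last_seq}


delays = backoff_delays(rng=random.Random(0))
first_five = [next(delays) for _ in range(5)]
hello_again = resume_message("s1", last_seq=41)
```

A client loop would sleep for the next delay, attempt the connection, send the resume message on success, and start a fresh delay sequence once the connection has been stable for a while.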

Backpressure. If the client can't consume messages fast enough, the server should slow down or pause sending rather than buffer unsent messages without bound. WebSocket libraries have varying backpressure support; check yours.
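
Independent of any particular WebSocket library, the usual shape is a bounded outbox between the generator and the socket writer. A sketch with asyncio, where the queue size of 8 and the empty-string end marker are arbitrary choices for the demo:

```python
import asyncio
from typing import List, Tuple


async def producer(outbox: "asyncio.Queue[str]", n: int, high_water: List[int]) -> None:
    for i in range(n):
        # put() suspends when the queue is full, which pauses generation
        # instead of letting unsent messages pile up in memory.
        await outbox.put(f"tok{i}")
        high_water[0] = max(high_water[0], outbox.qsize())
    await outbox.put("")  # empty string marks end of stream (demo convention)


async def slow_consumer(outbox: "asyncio.Queue[str]", received: List[str]) -> None:
    while True:
        msg = await outbox.get()
        if msg == "":
            return
        await asyncio.sleep(0.001)  # stand-in for a slow socket write
        received.append(msg)


async def demo() -> Tuple[int, int]:
    outbox: "asyncio.Queue[str]" = asyncio.Queue(maxsize=8)
    received: List[str] = []
    high_water = [0]
    await asyncio.gather(producer(outbox, 50, high_water),
                         slow_consumer(outbox, received))
    return len(received), high_water[0]


delivered, max_queued = asyncio.run(demo())
```

Every message is still delivered, but the backlog never exceeds the queue bound, so a slow client costs the server a fixed amount of memory.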

Infrastructure considerations

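Load balancers need WebSocket support. Most modern LBs (AWS ALB, Cloudflare, nginx) handle this, though some need explicit configuration; nginx, for example, has to be told to forward the upgrade handshake when it proxies. Session affinity may be needed if server state is local.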

WebSocket connections consume server resources while open. Capacity planning looks different from HTTP: concurrent open connections, not requests per second. Monitor open connection counts; plan capacity accordingly.
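
The arithmetic behind that shift is Little's law: average open connections equal the rate of new sessions times the average session length. A small worked example, with every traffic number and limit made up for illustration:

```python
import math


def concurrent_connections(new_sessions_per_s: float, avg_session_s: float) -> float:
    """Little's law: average open connections = arrival rate x average duration."""
    return new_sessions_per_s * avg_session_s


def instances_needed(concurrent: float, per_instance_limit: int,
                     headroom: float = 0.75) -> int:
    """Instances required if each one is only filled to `headroom` of its limit."""
    return math.ceil(concurrent / (per_instance_limit * headroom))


# Hypothetical traffic: 2 new sessions per second, 15-minute average session.
open_connections = concurrent_connections(2, 15 * 60)
# Hypothetical limit: 1000 connections per instance, filled to 75%.
fleet_size = instances_needed(open_connections, per_instance_limit=1000)
```

Two new sessions per second sounds trivial as a request rate, yet at 15 minutes each it means 1800 connections held open, or three instances under these assumed limits.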

Timeouts. Long-lived connections run into idle timeouts at intermediate proxies (often 60-120s by default). Application-level heartbeats, sent more often than the shortest idle timeout on the path, keep connections alive.
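
A sketch of how the heartbeat numbers might be derived. The safety factor and the two-missed-beats rule are assumptions, and the timeout values are examples rather than measurements of any real proxy.

```python
from typing import Iterable


def heartbeat_interval(idle_timeouts_s: Iterable[float], safety: float = 0.5) -> float:
    """Pick an interval comfortably below the shortest idle timeout on the path."""
    return min(idle_timeouts_s) * safety


def is_dead(last_seen_s: float, now_s: float, interval_s: float,
            missed_allowed: int = 2) -> bool:
    """Treat the peer as gone after `missed_allowed` intervals of silence."""
    return now_s - last_seen_s > interval_s * missed_allowed


# Example path: three hops with idle timeouts of 60s, 100s, and 120s.
interval = heartbeat_interval([60, 100, 120])
still_alive = not is_dead(last_seen_s=0.0, now_s=45.0, interval_s=interval)
gone = is_dead(last_seen_s=0.0, now_s=61.0, interval_s=interval)
```

The same heartbeat does double duty: it keeps proxies from closing an idle connection, and its absence tells each side the other has gone away.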

Voice AI specifics

Audio streaming in both directions. User's voice arrives as audio chunks (typically 20ms PCM frames). Server transcribes incrementally; LLM generates; TTS streams response audio.
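
For a sense of scale, here is the frame arithmetic, assuming 16 kHz, 16-bit, mono PCM (the post specifies only the 20ms frame length, so the audio format is an assumption).

```python
from typing import Iterator, List


def frame_bytes(sample_rate_hz: int, frame_ms: int,
                bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Size in bytes of one PCM frame of the given duration."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample * channels


def split_frames(pcm: bytes, size: int) -> Iterator[bytes]:
    """Yield full frames; a trailing partial frame is left for the caller to buffer."""
    for start in range(0, len(pcm) - size + 1, size):
        yield pcm[start:start + size]


size = frame_bytes(16_000, 20)          # 320 samples x 2 bytes
one_second_plus = bytes(32_000 + 100)   # 1s of silence plus a partial frame
frames: List[bytes] = list(split_frames(one_second_plus, size))
```

Under these assumptions each WebSocket message carries 640 bytes and the client sends 50 of them per second, in each direction if the response audio uses the same format.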

Voice activity detection (VAD) on the server detects user interruptions. When the user starts talking, the server sends a stop signal to the TTS stream. The user hears the assistant stop almost immediately after they start talking.
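
To show the control flow only, here is a deliberately crude stand-in: an energy threshold on little-endian 16-bit PCM frames instead of a real VAD model. The threshold and the three-frame requirement are arbitrary assumptions; production systems often use a trained detector in place of `rms`.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a little-endian 16-bit PCM frame."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


class BargeInDetector:
    """Signals a stop once the user has been loud for several frames in a row."""

    def __init__(self, threshold: float = 500.0, frames_required: int = 3) -> None:
        self.threshold = threshold
        self.frames_required = frames_required
        self._streak = 0

    def feed(self, frame: bytes, assistant_speaking: bool) -> bool:
        """Return True exactly when a stop signal should be sent to TTS."""
        if rms(frame) >= self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        # Fire once, on the frame that completes the streak, and only if
        # there is assistant audio to interrupt.
        return assistant_speaking and self._streak == self.frames_required


quiet = bytes(640)                                # 20ms of silence
loud = struct.pack("<320h", *([3000] * 320))      # 20ms of constant signal
detector = BargeInDetector()
signals = [detector.feed(f, assistant_speaking=True)
           for f in (quiet, loud, loud, loud, loud)]
```

Requiring a few consecutive frames trades roughly 60ms of reaction time for fewer false stops on a cough or a click.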

See the voice AI buildout post for the full architecture. WebSockets are the transport layer; voice AI adds the stack on top.

Debugging WebSocket AI

WebSocket traffic is harder to debug than HTTP. Browser devtools show messages; server logs need structured logging of every message type.
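
A sketch of what one such log line might contain. The field names are assumptions; the idea is one JSON record per message, keyed by connection and message type, without the payload itself.

```python
import json
import time
from typing import Optional


def log_record(conn_id: str, direction: str, raw: str,
               now: Optional[float] = None) -> str:
    """Build one JSON log line for a WebSocket message."""
    try:
        parsed = json.loads(raw)
        msg_type = parsed.get("type", "missing") if isinstance(parsed, dict) else "non_object"
    except json.JSONDecodeError:
        msg_type = "unparseable"
    return json.dumps({
        "ts": time.time() if now is None else now,
        "conn_id": conn_id,
        "direction": direction,  # "in" = client to server, "out" = server to client
        "type": msg_type,
        # Log the size, not the body: payloads may contain user text or audio.
        "bytes": len(raw.encode("utf-8")),
    })


line = log_record("conn-1", "in", '{"type": "interrupt"}', now=1.0)
```

With records like this you can filter a single connection's log and read the conversation as a sequence of typed events, which is most of what devtools gives you on the client side.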

Replay: to reproduce issues, capture WebSocket message streams and replay them in dev. Tooling ranges from libraries such as websocket-replay to short custom scripts.
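
A custom script can be quite small. This sketch assumes a capture format of (seconds since connect, direction, raw message) stored as JSON lines, and replays only the client-to-server messages with their original spacing; `fake_send` stands in for a real socket write.

```python
import asyncio
import json
from typing import Awaitable, Callable, List, Tuple

# (seconds since connection start, "in" or "out", raw message text)
Capture = List[Tuple[float, str, str]]


def to_jsonl(capture: Capture) -> str:
    return "\n".join(json.dumps(list(entry)) for entry in capture)


def from_jsonl(text: str) -> Capture:
    entries = (json.loads(line) for line in text.splitlines() if line.strip())
    return [(float(t), str(d), str(raw)) for t, d, raw in entries]


async def replay(capture: Capture, send: Callable[[str], Awaitable[None]],
                 speed: float = 1.0) -> None:
    """Re-send captured client messages, keeping their gaps (divided by `speed`)."""
    previous = 0.0
    for offset, direction, raw in capture:
        if direction != "in":
            continue  # only client-to-server messages are re-sent
        await asyncio.sleep(max(0.0, (offset - previous) / speed))
        previous = offset
        await send(raw)


capture: Capture = [(0.0, "in", '{"type": "user_text"}'),
                    (0.01, "out", '{"type": "assistant_token"}'),
                    (0.02, "in", '{"type": "interrupt"}')]
sent: List[str] = []


async def fake_send(raw: str) -> None:
    sent.append(raw)  # a real harness would write to a dev-server socket


asyncio.run(replay(from_jsonl(to_jsonl(capture)), fake_send, speed=10.0))
```

Preserving the timing matters for AI traffic in particular, since bugs around interrupts usually depend on when the message arrived relative to the token stream.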

Observability: track per-connection metrics (messages per second, latency per message type, connection duration). These diverge from request-oriented observability.
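
A sketch of a per-connection metrics object covering those three measurements. The class and method names are hypothetical; in practice you would export these values to whatever metrics system you already run.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import DefaultDict


@dataclass
class ConnectionMetrics:
    opened_at: float
    counts: Counter = field(default_factory=Counter)
    latency_sum: DefaultDict[str, float] = field(
        default_factory=lambda: defaultdict(float))

    def record(self, msg_type: str, latency_s: float) -> None:
        """Count one message and accumulate its handling latency by type."""
        self.counts[msg_type] += 1
        self.latency_sum[msg_type] += latency_s

    def messages_per_second(self, now: float) -> float:
        return sum(self.counts.values()) / max(now - self.opened_at, 1e-9)

    def mean_latency(self, msg_type: str) -> float:
        n = self.counts[msg_type]
        return self.latency_sum[msg_type] / n if n else 0.0

    def duration(self, now: float) -> float:
        return now - self.opened_at


metrics = ConnectionMetrics(opened_at=0.0)
metrics.record("assistant_token", 0.02)
metrics.record("assistant_token", 0.02)
metrics.record("interrupt", 0.10)
rate = metrics.messages_per_second(now=3.0)
token_latency = metrics.mean_latency("assistant_token")
```

Breaking latency out by message type is the part request-oriented dashboards miss: a slow interrupt path can hide behind thousands of fast token messages in an overall average.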

When not to use WebSockets

Simple chat UI. Don't overengineer. SSE is enough.

Stateless short interactions. HTTP works fine.

When your infrastructure doesn't support long-lived connections well. Some managed platforms limit WebSocket duration or concurrent connections; know your constraints.
