How to Stream AI API Responses: SSE Tutorial in Python and JavaScript for Every Provider

TokenMix Research Lab · 2026-04-13


How to Stream AI API Responses: SSE Streaming Tutorial for OpenAI, Anthropic, and Google SDKs (2026)

Streaming an AI API response reduces perceived latency by 50-80%. Without streaming, users stare at a blank screen for 2-10 seconds. With streaming, they see tokens arrive word by word in real time. This tutorial covers how to stream LLM responses using Server-Sent Events (SSE) in both Python and JavaScript, with complete working code for OpenAI, Anthropic, and Google SDKs. All code tested against current API versions as of April 2026, with performance benchmarks tracked by [TokenMix.ai](https://tokenmix.ai).

---

Quick Comparison: Streaming Across Providers

| Dimension | OpenAI | Anthropic | Google Gemini |
| --- | --- | --- | --- |
| **Streaming Protocol** | SSE | SSE | SSE |
| **Python SDK Stream** | `stream=True` | `.stream()` method | `stream=True` |
| **JS/TS SDK Stream** | `stream: true` | `.stream()` method | `generateContentStream()` |
| **Event Format** | `data: {json}\n\n` | `event: type\ndata: {json}\n\n` | `data: {json}\n\n` |
| **Done Signal** | `data: [DONE]` | `event: message_stop` | Stream ends |
| **Token-by-Token** | Yes | Yes | Yes (chunks vary) |
| **Abort Support** | AbortController | AbortController | AbortController |

---

Why Streaming Improves UX: The Perceived Latency Effect

The human brain processes streaming text differently than a block of text that appears all at once. Streaming exploits this by delivering information incrementally.

**Without streaming (blocking):**

- User sends message at T=0
- Blank screen for 3 seconds
- Full response appears at T=3s
- Perceived wait: 3 seconds

**With streaming:**

- User sends message at T=0
- First token appears at T=0.3s (TTFT)
- Tokens flow in at 120 tok/s for 2.7 seconds
- Perceived wait: 0.3 seconds

Same total response time. But the user perceives a 10x improvement because the first visible feedback arrives in 0.3 seconds instead of 3.

TokenMix.ai benchmarks show streaming reduces bounce rates on AI chat interfaces by 30-40% compared to blocking responses. For a detailed speed comparison across providers, see our [AI API response time comparison](https://tokenmix.ai/blog/ai-api-response-time-comparison).

**When streaming matters most:**

| Use Case | Streaming Value | Why |
| --- | --- | --- |
| Chat interfaces | Critical | Users expect instant feedback |
| Code generation | High | Developers read code as it appears |
| Content writing | High | Writers review output in real time |
| API pipelines (no user) | Low | No one is watching |
| Batch processing | None | Responses are collected, not displayed |

---

How AI API Streaming Works: SSE Explained

All major AI APIs use Server-Sent Events (SSE) for streaming. SSE is a simple, one-way protocol where the server pushes data to the client over a single HTTP connection.

**The SSE format:**

Each message is prefixed with `data: ` and terminated with two newlines (`\n\n`). The client reads these messages as they arrive, parses the JSON, and extracts the text content.
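To make the framing concrete, here is a minimal Python sketch of an SSE parser for the `data: {json}\n\n` format described above (the payload shape and the `[DONE]` sentinel follow OpenAI's convention; other providers differ slightly):

```python
import json

def parse_sse_events(raw: str):
    """Split a raw SSE payload into parsed JSON messages.

    Assumes simple `data: {json}\\n\\n` framing; a `[DONE]`
    sentinel (OpenAI-style) terminates the stream.
    """
    events = []
    for block in raw.split("\n\n"):
        for line in block.splitlines():
            if not line.startswith("data: "):
                continue  # skip comments, `event:` lines, etc.
            payload = line[len("data: "):]
            if payload == "[DONE]":
                return events
            events.append(json.loads(payload))
    return events

raw = 'data: {"delta": "Hello"}\n\ndata: {"delta": " world"}\n\ndata: [DONE]\n\n'
print(parse_sse_events(raw))  # [{'delta': 'Hello'}, {'delta': ' world'}]
```

In production you would parse incrementally as bytes arrive (as the frontend section below does) rather than on a complete string, but the framing logic is the same.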

**The request-response flow:**

1. Client sends POST request with `stream: true`
2. Server responds with `Content-Type: text/event-stream`
3. Server sends tokens one at a time as SSE messages
4. Client reads and processes each message incrementally
5. Server sends a termination signal (`[DONE]` or stream end)

**Key difference from WebSocket:** SSE is one-directional (server to client). This is fine for AI streaming because the client only needs to receive. WebSocket (bidirectional) is used by OpenAI's Realtime API for voice, but standard text streaming uses SSE.

---

Streaming with the OpenAI SDK

Python

```python
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how neural networks learn."},
    ],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()  # Final newline
```

**With usage tracking** (pass `stream_options={"include_usage": True}` to `create`):

```python
full_response = ""
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)
    if chunk.usage:  # Usage info comes in the final chunk
        print(f"\nTokens: {chunk.usage.prompt_tokens} in, {chunk.usage.completion_tokens} out")
```

JavaScript / TypeScript

```javascript
import OpenAI from 'openai';

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: 'gpt-4.1-mini',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain how neural networks learn.' },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}

console.log(); // Final newline
```

---

Streaming with the Anthropic SDK

Anthropic uses a different event structure than OpenAI. Messages are wrapped in event types.

Python

```python
import anthropic

client = anthropic.Anthropic()

with client.messages.stream(
    model="claude-haiku-3.5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain how neural networks learn."}
    ],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

print()
```

**With full event handling:** instead of `text_stream`, you can iterate the stream object directly to observe every event type (`message_start`, `content_block_delta`, `message_stop`, and so on), then call `stream.get_final_message()` afterward to access the final message with usage counts.

JavaScript / TypeScript

```javascript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

const stream = client.messages.stream({
  model: 'claude-haiku-3.5',
  max_tokens: 1024,
  messages: [
    { role: 'user', content: 'Explain how neural networks learn.' },
  ],
});

stream.on('text', (text) => {
  process.stdout.write(text);
});

const finalMessage = await stream.finalMessage();
console.log(`\nTokens: ${finalMessage.usage.input_tokens} in, ${finalMessage.usage.output_tokens} out`);
```

---

Streaming with the Google Gemini SDK

Google's Gemini SDK uses a slightly different streaming interface.

Python

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

response = model.generate_content(
    "Explain how neural networks learn.",
    stream=True,
)

for chunk in response:
    print(chunk.text, end="", flush=True)

print()
```

**With generation configuration:**

```python
generation_config = genai.GenerationConfig(  # values here are illustrative
    temperature=0.7,
    max_output_tokens=1024,
)

response = model.generate_content(
    "Explain how neural networks learn.",
    generation_config=generation_config,
    stream=True,
)

for chunk in response:
    if chunk.text:
        print(chunk.text, end="", flush=True)

# Access usage metadata after the stream completes
print(f"\nTotal tokens: {response.usage_metadata.total_token_count}")
```

JavaScript / TypeScript

```javascript
import { GoogleGenerativeAI } from '@google/generative-ai';

const genAI = new GoogleGenerativeAI('YOUR_API_KEY');
const model = genAI.getGenerativeModel({ model: 'gemini-2.0-flash' });

const result = await model.generateContentStream('Explain how neural networks learn.');

for await (const chunk of result.stream) {
  const text = chunk.text();
  process.stdout.write(text);
}

console.log();
```

**Google's streaming chunk behavior:** Unlike OpenAI and Anthropic, which send roughly one token per event, Gemini may batch multiple tokens into a single chunk. This means the visual "typing" effect may appear less smooth. To normalize this, buffer chunks and release them character by character on the frontend.
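One way to sketch that smoothing logic (shown in Python for clarity; the same pattern ports directly to a JS frontend, and the class name is our own invention):

```python
from collections import deque

class TypewriterBuffer:
    """Accept variable-size chunks, release a fixed number of
    characters per UI tick so large chunks still render smoothly."""

    def __init__(self, chars_per_tick: int = 3):
        self.chars_per_tick = chars_per_tick
        self.pending = deque()

    def feed(self, chunk: str) -> None:
        self.pending.extend(chunk)  # enqueue one character at a time

    def tick(self) -> str:
        """Return the next few characters to append to the display."""
        out = []
        for _ in range(min(self.chars_per_tick, len(self.pending))):
            out.append(self.pending.popleft())
        return "".join(out)

buf = TypewriterBuffer(chars_per_tick=4)
buf.feed("Hello, world!")   # one large Gemini-style chunk arrives
print(buf.tick())           # 'Hell'
print(buf.tick())           # 'o, w'
```

On a real frontend you would call `tick()` from a `setInterval` or `requestAnimationFrame` loop and append its output to the DOM.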

---

Building a Streaming Backend in Node.js

A production backend needs to proxy AI streams to your frontend securely. Here is a complete Express.js streaming endpoint.

```javascript
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json());

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

app.post('/api/stream', async (req, res) => {
  const { messages } = req.body;

  // Set SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // Disable nginx buffering

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4.1-mini',
      messages,
      stream: true,
      stream_options: { include_usage: true },
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        res.write(`data: ${JSON.stringify({ type: 'token', content })}\n\n`);
      }
      if (chunk.usage) {
        res.write(`data: ${JSON.stringify({ type: 'usage', ...chunk.usage })}\n\n`);
      }
    }

    res.write('data: [DONE]\n\n');
    res.end();
  } catch (error) {
    res.write(`data: ${JSON.stringify({ type: 'error', message: error.message })}\n\n`);
    res.end();
  }
});

app.listen(3001, () => console.log('Streaming server on port 3001'));
```

**Important: the `X-Accel-Buffering: no` header.** Without this, nginx (and many reverse proxies) buffer the entire response before sending it to the client, defeating the purpose of streaming. Add this header or configure your proxy to disable response buffering for SSE endpoints.
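If you manage the nginx config directly, the equivalent per-location settings look roughly like this (a sketch; adjust the route and upstream to match your own deployment):

```nginx
location /api/stream {
    proxy_pass http://localhost:3001;
    proxy_buffering off;           # stream bytes as they arrive
    proxy_cache off;               # never cache SSE responses
    proxy_http_version 1.1;        # keep-alive to the upstream
    proxy_set_header Connection '';
}
```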

---

Building a Streaming Frontend in JavaScript

The frontend reads the SSE stream from your backend and updates the UI in real time.

**Using the Fetch API with ReadableStream:**

```javascript
async function streamChat(messages, onToken, onDone, onError) {
  const response = await fetch('/api/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });

  if (!response.ok) {
    onError(new Error(`HTTP ${response.status}`));
    return;
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n\n');
    buffer = lines.pop(); // Keep incomplete chunk in buffer

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6);

      if (data === '[DONE]') {
        onDone();
        return;
      }

      try {
        const parsed = JSON.parse(data);
        if (parsed.type === 'token') {
          onToken(parsed.content);
        } else if (parsed.type === 'error') {
          onError(new Error(parsed.message));
        }
      } catch (e) {
        // Skip malformed chunks
      }
    }
  }

  onDone();
}

// Usage:
let fullResponse = '';
streamChat(
  [{ role: 'user', content: 'Hello' }],
  (token) => {
    fullResponse += token;
    document.getElementById('response').textContent = fullResponse;
  },
  () => console.log('Stream complete'),
  (err) => console.error('Stream error:', err)
);
```

**Adding abort support:**

```javascript
const controller = new AbortController();

const response = await fetch('/api/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ messages }),
  signal: controller.signal,
});

// Cancel the stream:
document.getElementById('stop-btn').addEventListener('click', () => {
  controller.abort();
});
```

---

Streaming Performance Comparison

How streaming latency breaks down across providers (TokenMix.ai data, April 2026):

| Provider + Model | TTFT | TPS (Streaming) | 100-Token Stream | 500-Token Stream |
| --- | --- | --- | --- | --- |
| Groq Llama 70B | 0.20s | 250 tok/s | 0.60s | 2.20s |
| OpenAI GPT-4.1 mini | 0.30s | 120 tok/s | 1.13s | 4.47s |
| Google Gemini Flash | 0.40s | 150 tok/s | 1.07s | 3.73s |
| Anthropic Haiku 3.5 | 0.50s | 90 tok/s | 1.61s | 6.06s |
| OpenAI GPT-5.4 | 0.50s | 80 tok/s | 1.75s | 6.75s |
| DeepSeek V4 | 2.00s | 60 tok/s | 3.67s | 10.33s |

**For short responses (< 100 tokens):** TTFT dominates total time. Groq wins.

**For long responses (500+ tokens):** TPS matters more. Groq still wins, but Gemini Flash's higher TPS (150 vs 120) makes it faster than GPT-4.1 mini for long outputs.
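The stream-time columns in the table follow a simple identity, total time = TTFT + tokens / TPS, which you can sanity-check yourself (figures below are taken from the table above):

```python
def stream_time(ttft_s: float, tps: float, tokens: int) -> float:
    """Total wall-clock time for a streamed response:
    time to first token plus generation time at tokens/second."""
    return ttft_s + tokens / tps

# GPT-4.1 mini: TTFT 0.30s at 120 tok/s
print(round(stream_time(0.30, 120, 100), 2))  # 1.13
print(round(stream_time(0.30, 120, 500), 2))  # 4.47

# Groq Llama 70B: TTFT 0.20s at 250 tok/s
print(round(stream_time(0.20, 250, 500), 2))  # 2.2
```

The formula also shows why TTFT dominates short responses (the `tokens / tps` term is small) while TPS dominates long ones.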

**Chunk consistency matters for UX.** OpenAI sends very consistent 1-2 token chunks, creating a smooth typing effect. Anthropic's chunks are also consistent. Gemini sometimes batches 5-10 tokens per chunk, creating a choppier visual. Frontend buffering smooths this out.

---

Common Streaming Pitfalls and Fixes

**Pitfall 1: Reverse proxy buffering.** Nginx, Cloudflare, and AWS ALB all buffer responses by default. Your stream appears to arrive in large bursts instead of token by token.

Fix: Add `X-Accel-Buffering: no` header for nginx. In Cloudflare, disable response buffering in the dashboard. For AWS ALB, streaming works natively.

**Pitfall 2: Missing error handling.** If the AI provider returns a 429 or 500 after streaming starts, your frontend hangs.

Fix: Wrap the stream reader in a try/catch and implement a timeout. If no data arrives for 30 seconds, abort and show an error.

**Pitfall 3: Memory leaks from unclosed streams.** If the user navigates away mid-stream, the reader stays open.

Fix: Use `AbortController` and call `abort()` in a React `useEffect` cleanup or on page navigation.

**Pitfall 4: Markdown rendering flickers.** Rendering markdown on every token causes layout shifts and flickers.

Fix: Debounce markdown rendering. Update the raw text on every token, but re-render the markdown only every 100ms.

**Pitfall 5: Token counting mismatch.** Streaming and non-streaming requests may return slightly different token counts due to rounding.

Fix: Use the `usage` object from the final streaming chunk (OpenAI) or final message (Anthropic) for accurate counts. Do not count tokens client-side.

---

How to Choose Your Streaming Stack

| Your Situation | Backend | Frontend | Provider |
| --- | --- | --- | --- |
| Next.js app, fast prototype | Vercel AI SDK | `useChat` hook | Any (one-line swap) |
| React + Express, full control | Express SSE proxy | Fetch + ReadableStream | OpenAI or TokenMix.ai |
| Python backend, web frontend | FastAPI + StreamingResponse | Fetch + ReadableStream | Any |
| Python CLI tool | Direct SDK streaming | N/A (print to terminal) | Any |
| Multi-provider with failover | TokenMix.ai API | Fetch + ReadableStream | Auto-routed |
| Speed-critical chat | Express SSE + Groq | Fetch + ReadableStream | Groq |

For apps that need to switch providers without changing streaming code, [TokenMix.ai](https://tokenmix.ai) offers an OpenAI-compatible streaming endpoint that routes to 300+ models. Your frontend code stays identical regardless of the backend model.

---

Conclusion

Streaming AI API responses is non-negotiable for any user-facing AI feature. The implementation is straightforward: set `stream: true` in your API call, parse SSE events on the backend, and update the DOM incrementally on the frontend.

All three major providers (OpenAI, Anthropic, Google) support SSE streaming with similar patterns. OpenAI has the simplest streaming interface. Anthropic has the most structured event types. Google has inconsistent chunk sizes that need frontend buffering.

Start with the code examples in this guide. For a complete React integration, see our [AI API for React guide](https://tokenmix.ai/blog/ai-api-for-react-apps). For Next.js, see our [AI API for Next.js guide](https://tokenmix.ai/blog/ai-api-for-nextjs). Track streaming performance across providers with [TokenMix.ai](https://tokenmix.ai).

---

FAQ

How does AI API streaming work?

AI API streaming uses Server-Sent Events (SSE) to send tokens one at a time from the server to the client. The client sends a request with `stream: true`, and the server responds with a `text/event-stream` content type. Each token arrives as a `data:` message followed by two newlines. The client parses each message and updates the display incrementally.

Does streaming cost more than non-streaming API calls?

No. Streaming and non-streaming requests cost the same number of tokens. The total token count is identical. The only difference is delivery timing -- streaming sends tokens as they are generated instead of waiting for the complete response. There is no price premium for streaming.

Why does my streaming response arrive in chunks instead of token by token?

This is typically caused by reverse proxy buffering. Nginx, Cloudflare, and AWS load balancers buffer responses by default. Add the header `X-Accel-Buffering: no` for nginx, or disable response buffering in your CDN/proxy settings. Google Gemini also batches tokens into larger chunks natively, which requires frontend-side character-by-character rendering.

How do I cancel a streaming response mid-generation?

Use `AbortController` in JavaScript. Create a controller before the fetch call, pass `signal: controller.signal` to the fetch options, and call `controller.abort()` when the user clicks a stop button. On the backend, handle the aborted connection gracefully to avoid error logs.

Which AI provider has the smoothest streaming experience?

OpenAI delivers the most consistent token-by-token streaming with 1-2 tokens per SSE event. Anthropic is similarly consistent. Google Gemini sends variable-size chunks (1-10 tokens per event), which can feel choppy without frontend buffering. Groq has the fastest streaming speed at 250-350 tokens per second, making the text appear almost instantly. TokenMix.ai normalizes streaming behavior across all providers.

Can I stream AI responses in a Python backend?

Yes. All three major SDKs support Python streaming. OpenAI uses `for chunk in stream:`, Anthropic uses `with client.messages.stream() as stream:`, and Google uses `for chunk in response:`. For web backends, use FastAPI's `StreamingResponse` or Flask's response streaming to pipe tokens to the frontend.
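For the web-backend case, the generator you would hand to FastAPI's `StreamingResponse(..., media_type="text/event-stream")` is plain Python. A minimal sketch, with a list standing in for a real SDK stream:

```python
import json
from typing import Iterable, Iterator

def sse_event_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE-framed message per token, then a [DONE] sentinel.

    In FastAPI you would write:
        return StreamingResponse(sse_event_stream(stream),
                                 media_type="text/event-stream")
    where `stream` yields text deltas from your provider SDK.
    """
    for token in tokens:
        yield f"data: {json.dumps({'type': 'token', 'content': token})}\n\n"
    yield "data: [DONE]\n\n"

# Stand-in for an SDK stream:
events = list(sse_event_stream(["Hello", " world"]))
assert events[-1] == "data: [DONE]\n\n"
```

Because it is a generator, FastAPI writes each yielded string to the socket as it is produced, giving the same incremental delivery as the Express proxy shown earlier.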

---

*Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: [OpenAI API Docs](https://platform.openai.com/docs), [Anthropic API Docs](https://docs.anthropic.com), [Google AI SDK](https://ai.google.dev/docs), [TokenMix.ai](https://tokenmix.ai)*