TokenMix Research Lab · 2026-04-10

AI API Streaming Guide: SSE Streaming for LLMs in Python and Node.js (2026)
Streaming API responses reduce perceived latency by 60-80% compared to waiting for a complete response. TokenMix.ai latency monitoring across 300+ models shows that time-to-first-token with streaming averages 200-500ms, while non-streaming responses take 2-15 seconds for typical completions. This guide covers SSE streaming implementation for every major LLM provider, with production-ready code in Python and Node.js, latency benchmarks, and cost analysis.
If your AI application makes users wait for a full response before displaying anything, you are leaving significant UX improvement on the table.
Table of Contents
- [Quick Comparison: Streaming Support Across Providers]
- [Why LLM Streaming Matters: Latency Data]
- [How SSE Streaming Works for LLM APIs]
- [OpenAI Streaming Implementation]
- [Anthropic Claude Streaming Implementation]
- [Google Gemini Streaming Implementation]
- [DeepSeek and Open-Source Model Streaming]
- [Full Comparison Table: Streaming Performance]
- [Streaming vs Non-Streaming: When to Use Each]
- [Production Streaming: Error Handling and Reconnection]
- [Cost Analysis: Does Streaming Cost More?]
- [How to Choose a Streaming Strategy]
- [Conclusion]
- [FAQ]
Quick Comparison: Streaming Support Across Providers
| Feature | OpenAI | Anthropic Claude | Google Gemini | DeepSeek | Groq |
|---|---|---|---|---|---|
| Streaming Protocol | SSE | SSE | SSE / WebSocket | SSE | SSE |
| Time-to-First-Token (avg) | 300-600ms | 400-800ms | 250-500ms | 500-1,200ms | 100-200ms |
| Token Throughput (streaming) | 50-80 tok/s | 40-70 tok/s | 60-100 tok/s | 30-50 tok/s | 300-500 tok/s |
| Streaming + Function Calling | Yes | Yes | Yes | Yes | Yes |
| Streaming + Structured Output | Yes | Partial | Yes | Yes | No |
| Usage Stats in Stream | Final chunk | message_delta event |
Final chunk | Final chunk | Final chunk |
| Backpressure Handling | Client-side | Client-side | Client-side | Client-side | Client-side |
Why LLM Streaming Matters: Latency Data
The difference between streaming and non-streaming is not about total generation time. It is about perceived latency -- how long the user waits before seeing the first token of a response.
TokenMix.ai monitors time-to-first-token (TTFT) and total generation time across all major providers. Here is what the data shows for a typical 500-token response.
| Provider / Model | Non-Streaming Total | Streaming TTFT | Streaming Total | TTFT Improvement |
|---|---|---|---|---|
| OpenAI GPT-4o | 3.2-5.5s | 300-600ms | 3.5-6.0s | 82-89% faster |
| Claude Sonnet 4.6 | 4.0-7.0s | 400-800ms | 4.5-7.5s | 86-90% faster |
| Gemini 3.1 Pro | 2.5-4.5s | 250-500ms | 2.8-5.0s | 88-90% faster |
| DeepSeek V4 | 5.0-10.0s | 500-1,200ms | 5.5-11.0s | 88-90% faster |
| Groq Llama 4 | 0.5-1.5s | 100-200ms | 0.6-1.8s | 80-87% faster |
Key insight: Streaming adds a small overhead to total generation time (5-10% longer) because of the SSE framing and network overhead per chunk. But the perceived latency drops by 80-90%. For user-facing applications, this tradeoff is always worth it.
When streaming does not help: Batch processing, background tasks, and any workflow where no human is waiting for the response. In these cases, non-streaming is simpler to implement and avoids the SSE parsing overhead.
How SSE Streaming Works for LLM APIs
Server-Sent Events (SSE) is the standard protocol for LLM streaming. It is a simple, one-directional HTTP protocol where the server sends a stream of events to the client over a single long-lived connection.
The SSE format:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}
data: [DONE]
Each chunk contains a small piece of the model's response. Your client accumulates these chunks to build the full response. The [DONE] event signals the end of the stream.
Why SSE over WebSockets: SSE works over standard HTTP, requires no special server infrastructure, and is supported by all LLM providers. WebSockets add bidirectional communication capability that LLM streaming does not need. Google Gemini supports both, but SSE is the recommended approach for simplicity.
Connection lifecycle:
- Client sends a standard HTTP POST request with
stream: true - Server responds with
Content-Type: text/event-stream - Server sends data events as tokens are generated
- Connection closes after the final
[DONE]event or on error
OpenAI Streaming Implementation
Python
from openai import OpenAI
client = OpenAI()
# Basic streaming
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain quantum computing in 200 words"}],
stream=True
)
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content is not None:
content = chunk.choices[0].delta.content
full_response += content
print(content, end="", flush=True)
Python with Usage Tracking
stream = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": "Explain quantum computing"}],
stream=True,
stream_options={"include_usage": True}
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
if chunk.usage: # Final chunk contains usage
print(f"\nTokens: {chunk.usage.total_tokens}")
Node.js
import OpenAI from "openai";
const client = new OpenAI();
const stream = await client.chat.completions.create({
model: "gpt-4o",
messages: [{ role: "user", content: "Explain quantum computing" }],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
process.stdout.write(content);
}
}
Anthropic Claude Streaming Implementation
Claude's streaming uses a different event structure than OpenAI. Instead of chat.completion.chunk, Claude sends typed events: message_start, content_block_start, content_block_delta, content_block_stop, and message_delta.
Python
import anthropic
client = anthropic.Anthropic()
# Basic streaming
with client.messages.stream(
model="claude-sonnet-4-6-20260401",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain quantum computing in 200 words"}]
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
# Get final message with usage
message = stream.get_final_message()
print(f"\nInput tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")
Python with Event-Level Control
with client.messages.stream(
model="claude-sonnet-4-6-20260401",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain quantum computing"}]
) as stream:
for event in stream:
if event.type == "content_block_delta":
if event.delta.type == "text_delta":
print(event.delta.text, end="")
elif event.type == "message_delta":
print(f"\nStop reason: {event.delta.stop_reason}")
print(f"Output tokens: {event.usage.output_tokens}")
Node.js
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
const stream = client.messages.stream({
model: "claude-sonnet-4-6-20260401",
max_tokens: 1024,
messages: [{ role: "user", content: "Explain quantum computing" }],
});
stream.on("text", (text) => {
process.stdout.write(text);
});
const finalMessage = await stream.finalMessage();
console.log(`\nTokens: ${finalMessage.usage.output_tokens}`);
Google Gemini Streaming Implementation
Gemini supports streaming through its native SDK and the OpenAI-compatible endpoint.
Python (Native SDK)
import google.generativeai as genai
model = genai.GenerativeModel("gemini-3.1-pro")
response = model.generate_content(
"Explain quantum computing in 200 words",
stream=True
)
for chunk in response:
print(chunk.text, end="", flush=True)
Python (OpenAI-Compatible Endpoint)
from openai import OpenAI
client = OpenAI(
base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
api_key="your-google-api-key"
)
stream = client.chat.completions.create(
model="gemini-3.1-pro",
messages=[{"role": "user", "content": "Explain quantum computing"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Gemini Streaming Performance
Gemini has the fastest time-to-first-token among full-size models (250-500ms), primarily because of Google's inference infrastructure. Token throughput during streaming is also the highest at 60-100 tokens per second for Gemini 3.1 Pro.
DeepSeek and Open-Source Model Streaming
DeepSeek (OpenAI-Compatible)
from openai import OpenAI
client = OpenAI(
base_url="https://api.deepseek.com/v1",
api_key="your-deepseek-key"
)
stream = client.chat.completions.create(
model="deepseek-chat",
messages=[{"role": "user", "content": "Explain quantum computing"}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Self-Hosted Models (vLLM, Ollama)
Most self-hosted inference servers support OpenAI-compatible streaming. The code is identical, with only the base_url changed.
# vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
# Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
# Streaming works identically
stream = client.chat.completions.create(
model="llama-4-maverick",
messages=[{"role": "user", "content": "Explain quantum computing"}],
stream=True
)
Self-Hosted Streaming Performance
Streaming performance for self-hosted models depends entirely on your hardware. TokenMix.ai benchmarks on common configurations:
| Hardware | Model | TTFT | Throughput |
|---|---|---|---|
| A100 80GB | Llama 4 Maverick | 150-300ms | 80-120 tok/s |
| RTX 4090 | Llama 4 Scout | 200-400ms | 40-60 tok/s |
| M3 Max 64GB | Llama 3.3 70B (quantized) | 500-1,000ms | 15-25 tok/s |
Full Comparison Table: Streaming Performance
| Provider / Model | TTFT (p50) | TTFT (p95) | Throughput | Protocol | Streaming + Tools | Streaming Cost Premium |
|---|---|---|---|---|---|---|
| OpenAI GPT-4o | 350ms | 800ms | 50-80 tok/s | SSE | Full | None |
| OpenAI GPT-5.4 | 400ms | 900ms | 45-70 tok/s | SSE | Full | None |
| Claude Sonnet 4.6 | 500ms | 1,200ms | 40-70 tok/s | SSE | Full | None |
| Claude Haiku 4 | 200ms | 500ms | 80-120 tok/s | SSE | Full | None |
| Gemini 3.1 Pro | 300ms | 700ms | 60-100 tok/s | SSE | Full | None |
| Gemini 3.1 Flash | 150ms | 400ms | 100-150 tok/s | SSE | Full | None |
| DeepSeek V4 | 700ms | 2,000ms | 30-50 tok/s | SSE | Basic | None |
| Groq Llama 4 | 120ms | 300ms | 300-500 tok/s | SSE | Yes | None |
Streaming vs Non-Streaming: When to Use Each
Use Streaming When:
- User-facing chat interfaces: The 80-90% reduction in perceived latency is critical for user experience
- Long-form generation: Documents, articles, and reports that take 10+ seconds to generate
- Real-time collaboration: Multiple users watching the same AI output in real time
- Progressive rendering: Building UIs that display partial results (search results, recommendations)
Use Non-Streaming When:
- Batch processing: Processing thousands of requests where no human is waiting
- Structured output: Some structured output methods (Anthropic tool use) return data in the final chunk only
- Simple API integrations: Background services that just need the final result
- Cost-sensitive retry logic: Easier to implement retries with complete responses
Performance Comparison
| Metric | Streaming | Non-Streaming |
|---|---|---|
| Time-to-first-token | 200-800ms | Same as total time |
| Total generation time | 5-10% longer | Baseline |
| Client complexity | Higher (event parsing) | Lower (single response) |
| Error handling | More complex (mid-stream errors) | Simpler (request/response) |
| Memory usage | Lower (process chunks) | Higher (full response in memory) |
| Network overhead | Higher (SSE framing) | Lower (single JSON payload) |
Production Streaming: Error Handling and Reconnection
Production streaming requires handling mid-stream errors, connection drops, and timeouts. Here is a production-ready pattern.
Python Production Pattern
import time
from openai import OpenAI, APIError, APIConnectionError
client = OpenAI(
base_url="https://api.tokenmix.ai/v1",
api_key="your-tokenmix-key"
)
def stream_with_retry(messages, max_retries=3):
"""Production-ready streaming with retry logic."""
for attempt in range(max_retries):
try:
stream = client.chat.completions.create(
model="gpt-4o",
messages=messages,
stream=True,
timeout=30
)
full_response = ""
for chunk in stream:
if chunk.choices[0].delta.content:
content = chunk.choices[0].delta.content
full_response += content
yield content
return # Success
except APIConnectionError:
if attempt < max_retries - 1:
time.sleep(2 ** attempt)
continue
raise
except APIError as e:
if e.status_code == 429: # Rate limited
time.sleep(5)
continue
raise
Node.js Production Pattern
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.tokenmix.ai/v1",
apiKey: "your-tokenmix-key",
});
async function* streamWithRetry(
messages: OpenAI.ChatCompletionMessageParam[],
maxRetries = 3
) {
for (let attempt = 0; attempt < maxRetries; attempt++) {
try {
const stream = await client.chat.completions.create({
model: "gpt-4o",
messages,
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) yield content;
}
return;
} catch (error) {
if (attempt < maxRetries - 1) {
await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
continue;
}
throw error;
}
}
}
Key Production Considerations
Timeout handling: Set a reasonable timeout (30-60 seconds) for the stream connection. TokenMix.ai data shows that streams that do not produce a chunk within 10 seconds have a 90% chance of failing completely.
Partial response recovery: If a stream fails mid-generation, you have a partial response. Decide whether to display the partial result, retry with context, or fail gracefully. TokenMix.ai's API handles failover automatically, switching to a backup model if the primary provider's stream drops.
Client-side buffering: For web applications, buffer a few tokens before rendering to avoid flickering. Displaying each individual token creates a jittery experience. A 3-5 token buffer provides smooth rendering.
Cost Analysis: Does Streaming Cost More?
No. Streaming does not cost more tokens. The input and output token count is identical whether you stream or not. The total cost per request is the same.
Where costs differ:
| Factor | Streaming | Non-Streaming |
|---|---|---|
| Token cost | Same | Same |
| Network bandwidth | 5-15% higher (SSE overhead) | Baseline |
| Server connections | Longer-lived | Short-lived |
| Infrastructure cost | Slightly higher (connection management) | Slightly lower |
The only meaningful cost difference is infrastructure. Long-lived SSE connections consume more server resources than short request-response cycles. For most applications, this is negligible.
TokenMix.ai pricing: Streaming and non-streaming requests are priced identically. No premium for streaming. TokenMix.ai handles connection management and provider failover, reducing your infrastructure overhead.
How to Choose a Streaming Strategy
| Your Scenario | Recommendation | Why |
|---|---|---|
| User-facing chatbot | Always stream | 80-90% perceived latency reduction |
| API backend service | Do not stream | Simpler code, no UX benefit |
| Real-time dashboard | Stream | Progressive data display |
| Batch data extraction | Do not stream | Easier error handling and retries |
| Mobile application | Stream with buffering | Smooth UX, manage network variability |
| Multi-provider setup | Stream via TokenMix.ai | Unified SSE format across all providers |
| Lowest TTFT required | Use Groq or Gemini Flash | 100-200ms TTFT, highest throughput |
Conclusion
Streaming is essential for any user-facing AI application. The 80-90% reduction in perceived latency is too significant to ignore. Every major LLM provider supports SSE streaming, and the implementation is straightforward with modern SDKs.
The provider choice matters for streaming performance. Groq leads in raw speed (100-200ms TTFT, 300-500 tok/s) but has limited model selection. Gemini 3.1 Flash offers the best balance of speed and capability among full-featured models. Claude Sonnet 4.6 has the highest TTFT but compensates with superior response quality.
For production systems, route through TokenMix.ai's unified API. You get consistent SSE streaming across all providers, automatic failover if a provider's stream drops mid-generation, and identical pricing for streaming and non-streaming. Write your streaming client once, access 300+ models through a single endpoint.
FAQ
Does streaming an LLM API cost more than non-streaming?
No. Streaming and non-streaming requests consume the same number of input and output tokens. The cost per request is identical. The only difference is slightly higher network bandwidth (5-15%) due to SSE framing overhead, which is negligible in practice. TokenMix.ai charges the same rate for both.
What is SSE streaming and how does it work with LLM APIs?
Server-Sent Events (SSE) is a one-directional HTTP protocol where the server pushes events to the client over a long-lived connection. For LLM APIs, each event contains a small chunk of the generated response (typically 1-3 tokens). The client accumulates chunks to build the full response. All major providers (OpenAI, Anthropic, Google, DeepSeek) use SSE for streaming.
Which LLM provider has the fastest streaming response?
Groq has the fastest time-to-first-token at 100-200ms and the highest throughput at 300-500 tokens per second, but is limited to open-source models. Among full-featured providers, Google Gemini 3.1 Flash leads with 150-400ms TTFT and 100-150 tok/s throughput. TokenMix.ai monitors streaming performance across all providers in real time.
How do I handle errors during an LLM streaming response?
Implement a retry wrapper with exponential backoff. If a stream fails mid-generation, decide whether to display the partial response or retry. Set a timeout (30-60 seconds) for the connection. For production systems, use TokenMix.ai's API which automatically fails over to backup models if a stream drops, maintaining continuity.
Can I use streaming with function calling and tool use?
Yes. OpenAI, Anthropic, and Gemini all support streaming with function calling. Tool calls are streamed as they are generated, allowing you to start processing the function call before the model finishes its response. However, Anthropic's tool use returns the structured data in a single content block, so you cannot parse tool arguments progressively.
Should I use streaming for batch processing?
No. Streaming adds complexity (event parsing, connection management, error handling) without UX benefit when no human is waiting. For batch processing, use non-streaming requests or batch APIs (OpenAI Batch API offers 50% cost savings). Reserve streaming for user-facing applications where perceived latency matters.
Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Streaming Documentation, Anthropic Streaming Guide, MDN Server-Sent Events + TokenMix.ai