TokenMix Research Lab · 2026-04-10

AI API Streaming Guide 2026: SSE Implementation in Python and Node.js for Every Provider

AI API Streaming Guide: SSE Streaming for LLMs in Python and Node.js (2026)

Streaming API responses reduce perceived latency by 60-80% compared to waiting for a complete response. TokenMix.ai latency monitoring across 300+ models shows that time-to-first-token with streaming averages 200-500ms, while non-streaming responses take 2-15 seconds for typical completions. This guide covers SSE streaming implementation for every major LLM provider, with production-ready code in Python and Node.js, latency benchmarks, and cost analysis.

If your AI application makes users wait for a full response before displaying anything, you are leaving significant UX improvement on the table.

Table of Contents


Quick Comparison: Streaming Support Across Providers

Feature OpenAI Anthropic Claude Google Gemini DeepSeek Groq
Streaming Protocol SSE SSE SSE / WebSocket SSE SSE
Time-to-First-Token (avg) 300-600ms 400-800ms 250-500ms 500-1,200ms 100-200ms
Token Throughput (streaming) 50-80 tok/s 40-70 tok/s 60-100 tok/s 30-50 tok/s 300-500 tok/s
Streaming + Function Calling Yes Yes Yes Yes Yes
Streaming + Structured Output Yes Partial Yes Yes No
Usage Stats in Stream Final chunk message_delta event Final chunk Final chunk Final chunk
Backpressure Handling Client-side Client-side Client-side Client-side Client-side

Why LLM Streaming Matters: Latency Data

The difference between streaming and non-streaming is not about total generation time. It is about perceived latency -- how long the user waits before seeing the first token of a response.

TokenMix.ai monitors time-to-first-token (TTFT) and total generation time across all major providers. Here is what the data shows for a typical 500-token response.

Provider / Model Non-Streaming Total Streaming TTFT Streaming Total TTFT Improvement
OpenAI GPT-4o 3.2-5.5s 300-600ms 3.5-6.0s 82-89% faster
Claude Sonnet 4.6 4.0-7.0s 400-800ms 4.5-7.5s 86-90% faster
Gemini 3.1 Pro 2.5-4.5s 250-500ms 2.8-5.0s 88-90% faster
DeepSeek V4 5.0-10.0s 500-1,200ms 5.5-11.0s 88-90% faster
Groq Llama 4 0.5-1.5s 100-200ms 0.6-1.8s 80-87% faster

Key insight: Streaming adds a small overhead to total generation time (5-10% longer) because of the SSE framing and network overhead per chunk. But the perceived latency drops by 80-90%. For user-facing applications, this tradeoff is always worth it.

When streaming does not help: Batch processing, background tasks, and any workflow where no human is waiting for the response. In these cases, non-streaming is simpler to implement and avoids the SSE parsing overhead.

How SSE Streaming Works for LLM APIs

Server-Sent Events (SSE) is the standard protocol for LLM streaming. It is a simple, one-directional HTTP protocol where the server sends a stream of events to the client over a single long-lived connection.

The SSE format:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"delta":{"content":" world"}}]}

data: [DONE]

Each chunk contains a small piece of the model's response. Your client accumulates these chunks to build the full response. The [DONE] event signals the end of the stream.

Why SSE over WebSockets: SSE works over standard HTTP, requires no special server infrastructure, and is supported by all LLM providers. WebSockets add bidirectional communication capability that LLM streaming does not need. Google Gemini supports both, but SSE is the recommended approach for simplicity.

Connection lifecycle:

  1. Client sends a standard HTTP POST request with stream: true
  2. Server responds with Content-Type: text/event-stream
  3. Server sends data events as tokens are generated
  4. Connection closes after the final [DONE] event or on error

OpenAI Streaming Implementation

Python

from openai import OpenAI
client = OpenAI()

# Basic streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing in 200 words"}],
    stream=True
)

full_response = ""
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        content = chunk.choices[0].delta.content
        full_response += content
        print(content, end="", flush=True)

Python with Usage Tracking

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
    stream_options={"include_usage": True}
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
    if chunk.usage:  # Final chunk contains usage
        print(f"\nTokens: {chunk.usage.total_tokens}")

Node.js

import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Explain quantum computing" }],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

Anthropic Claude Streaming Implementation

Claude's streaming uses a different event structure than OpenAI. Instead of chat.completion.chunk, Claude sends typed events: message_start, content_block_start, content_block_delta, content_block_stop, and message_delta.

Python

import anthropic
client = anthropic.Anthropic()

# Basic streaming
with client.messages.stream(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quantum computing in 200 words"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

# Get final message with usage
message = stream.get_final_message()
print(f"\nInput tokens: {message.usage.input_tokens}")
print(f"Output tokens: {message.usage.output_tokens}")

Python with Event-Level Control

with client.messages.stream(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quantum computing"}]
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "text_delta":
                print(event.delta.text, end="")
        elif event.type == "message_delta":
            print(f"\nStop reason: {event.delta.stop_reason}")
            print(f"Output tokens: {event.usage.output_tokens}")

Node.js

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const stream = client.messages.stream({
  model: "claude-sonnet-4-6-20260401",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Explain quantum computing" }],
});

stream.on("text", (text) => {
  process.stdout.write(text);
});

const finalMessage = await stream.finalMessage();
console.log(`\nTokens: ${finalMessage.usage.output_tokens}`);

Google Gemini Streaming Implementation

Gemini supports streaming through its native SDK and the OpenAI-compatible endpoint.

Python (Native SDK)

import google.generativeai as genai

model = genai.GenerativeModel("gemini-3.1-pro")

response = model.generate_content(
    "Explain quantum computing in 200 words",
    stream=True
)

for chunk in response:
    print(chunk.text, end="", flush=True)

Python (OpenAI-Compatible Endpoint)

from openai import OpenAI

client = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="your-google-api-key"
)

stream = client.chat.completions.create(
    model="gemini-3.1-pro",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Gemini Streaming Performance

Gemini has the fastest time-to-first-token among full-size models (250-500ms), primarily because of Google's inference infrastructure. Token throughput during streaming is also the highest at 60-100 tokens per second for Gemini 3.1 Pro.

DeepSeek and Open-Source Model Streaming

DeepSeek (OpenAI-Compatible)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key="your-deepseek-key"
)

stream = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Self-Hosted Models (vLLM, Ollama)

Most self-hosted inference servers support OpenAI-compatible streaming. The code is identical, with only the base_url changed.

# vLLM
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Ollama
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Streaming works identically
stream = client.chat.completions.create(
    model="llama-4-maverick",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True
)

Self-Hosted Streaming Performance

Streaming performance for self-hosted models depends entirely on your hardware. TokenMix.ai benchmarks on common configurations:

Hardware Model TTFT Throughput
A100 80GB Llama 4 Maverick 150-300ms 80-120 tok/s
RTX 4090 Llama 4 Scout 200-400ms 40-60 tok/s
M3 Max 64GB Llama 3.3 70B (quantized) 500-1,000ms 15-25 tok/s

Full Comparison Table: Streaming Performance

Provider / Model TTFT (p50) TTFT (p95) Throughput Protocol Streaming + Tools Streaming Cost Premium
OpenAI GPT-4o 350ms 800ms 50-80 tok/s SSE Full None
OpenAI GPT-5.4 400ms 900ms 45-70 tok/s SSE Full None
Claude Sonnet 4.6 500ms 1,200ms 40-70 tok/s SSE Full None
Claude Haiku 4 200ms 500ms 80-120 tok/s SSE Full None
Gemini 3.1 Pro 300ms 700ms 60-100 tok/s SSE Full None
Gemini 3.1 Flash 150ms 400ms 100-150 tok/s SSE Full None
DeepSeek V4 700ms 2,000ms 30-50 tok/s SSE Basic None
Groq Llama 4 120ms 300ms 300-500 tok/s SSE Yes None

Streaming vs Non-Streaming: When to Use Each

Use Streaming When:

Use Non-Streaming When:

Performance Comparison

Metric Streaming Non-Streaming
Time-to-first-token 200-800ms Same as total time
Total generation time 5-10% longer Baseline
Client complexity Higher (event parsing) Lower (single response)
Error handling More complex (mid-stream errors) Simpler (request/response)
Memory usage Lower (process chunks) Higher (full response in memory)
Network overhead Higher (SSE framing) Lower (single JSON payload)

Production Streaming: Error Handling and Reconnection

Production streaming requires handling mid-stream errors, connection drops, and timeouts. Here is a production-ready pattern.

Python Production Pattern

import time
from openai import OpenAI, APIError, APIConnectionError

client = OpenAI(
    base_url="https://api.tokenmix.ai/v1",
    api_key="your-tokenmix-key"
)

def stream_with_retry(messages, max_retries=3):
    """Production-ready streaming with retry logic."""
    for attempt in range(max_retries):
        try:
            stream = client.chat.completions.create(
                model="gpt-4o",
                messages=messages,
                stream=True,
                timeout=30
            )

            full_response = ""
            for chunk in stream:
                if chunk.choices[0].delta.content:
                    content = chunk.choices[0].delta.content
                    full_response += content
                    yield content

            return  # Success

        except APIConnectionError:
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue
            raise

        except APIError as e:
            if e.status_code == 429:  # Rate limited
                time.sleep(5)
                continue
            raise

Node.js Production Pattern

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.tokenmix.ai/v1",
  apiKey: "your-tokenmix-key",
});

async function* streamWithRetry(
  messages: OpenAI.ChatCompletionMessageParam[],
  maxRetries = 3
) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const stream = await client.chat.completions.create({
        model: "gpt-4o",
        messages,
        stream: true,
      });

      for await (const chunk of stream) {
        const content = chunk.choices[0]?.delta?.content;
        if (content) yield content;
      }
      return;
    } catch (error) {
      if (attempt < maxRetries - 1) {
        await new Promise((r) => setTimeout(r, 2 ** attempt * 1000));
        continue;
      }
      throw error;
    }
  }
}

Key Production Considerations

Timeout handling: Set a reasonable timeout (30-60 seconds) for the stream connection. TokenMix.ai data shows that streams that do not produce a chunk within 10 seconds have a 90% chance of failing completely.

Partial response recovery: If a stream fails mid-generation, you have a partial response. Decide whether to display the partial result, retry with context, or fail gracefully. TokenMix.ai's API handles failover automatically, switching to a backup model if the primary provider's stream drops.

Client-side buffering: For web applications, buffer a few tokens before rendering to avoid flickering. Displaying each individual token creates a jittery experience. A 3-5 token buffer provides smooth rendering.

Cost Analysis: Does Streaming Cost More?

No. Streaming does not cost more tokens. The input and output token count is identical whether you stream or not. The total cost per request is the same.

Where costs differ:

Factor Streaming Non-Streaming
Token cost Same Same
Network bandwidth 5-15% higher (SSE overhead) Baseline
Server connections Longer-lived Short-lived
Infrastructure cost Slightly higher (connection management) Slightly lower

The only meaningful cost difference is infrastructure. Long-lived SSE connections consume more server resources than short request-response cycles. For most applications, this is negligible.

TokenMix.ai pricing: Streaming and non-streaming requests are priced identically. No premium for streaming. TokenMix.ai handles connection management and provider failover, reducing your infrastructure overhead.

How to Choose a Streaming Strategy

Your Scenario Recommendation Why
User-facing chatbot Always stream 80-90% perceived latency reduction
API backend service Do not stream Simpler code, no UX benefit
Real-time dashboard Stream Progressive data display
Batch data extraction Do not stream Easier error handling and retries
Mobile application Stream with buffering Smooth UX, manage network variability
Multi-provider setup Stream via TokenMix.ai Unified SSE format across all providers
Lowest TTFT required Use Groq or Gemini Flash 100-200ms TTFT, highest throughput

Conclusion

Streaming is essential for any user-facing AI application. The 80-90% reduction in perceived latency is too significant to ignore. Every major LLM provider supports SSE streaming, and the implementation is straightforward with modern SDKs.

The provider choice matters for streaming performance. Groq leads in raw speed (100-200ms TTFT, 300-500 tok/s) but has limited model selection. Gemini 3.1 Flash offers the best balance of speed and capability among full-featured models. Claude Sonnet 4.6 has the highest TTFT but compensates with superior response quality.

For production systems, route through TokenMix.ai's unified API. You get consistent SSE streaming across all providers, automatic failover if a provider's stream drops mid-generation, and identical pricing for streaming and non-streaming. Write your streaming client once, access 300+ models through a single endpoint.

FAQ

Does streaming an LLM API cost more than non-streaming?

No. Streaming and non-streaming requests consume the same number of input and output tokens. The cost per request is identical. The only difference is slightly higher network bandwidth (5-15%) due to SSE framing overhead, which is negligible in practice. TokenMix.ai charges the same rate for both.

What is SSE streaming and how does it work with LLM APIs?

Server-Sent Events (SSE) is a one-directional HTTP protocol where the server pushes events to the client over a long-lived connection. For LLM APIs, each event contains a small chunk of the generated response (typically 1-3 tokens). The client accumulates chunks to build the full response. All major providers (OpenAI, Anthropic, Google, DeepSeek) use SSE for streaming.

Which LLM provider has the fastest streaming response?

Groq has the fastest time-to-first-token at 100-200ms and the highest throughput at 300-500 tokens per second, but is limited to open-source models. Among full-featured providers, Google Gemini 3.1 Flash leads with 150-400ms TTFT and 100-150 tok/s throughput. TokenMix.ai monitors streaming performance across all providers in real time.

How do I handle errors during an LLM streaming response?

Implement a retry wrapper with exponential backoff. If a stream fails mid-generation, decide whether to display the partial response or retry. Set a timeout (30-60 seconds) for the connection. For production systems, use TokenMix.ai's API which automatically fails over to backup models if a stream drops, maintaining continuity.

Can I use streaming with function calling and tool use?

Yes. OpenAI, Anthropic, and Gemini all support streaming with function calling. Tool calls are streamed as they are generated, allowing you to start processing the function call before the model finishes its response. However, Anthropic's tool use returns the structured data in a single content block, so you cannot parse tool arguments progressively.

Should I use streaming for batch processing?

No. Streaming adds complexity (event parsing, connection management, error handling) without UX benefit when no human is waiting. For batch processing, use non-streaming requests or batch APIs (OpenAI Batch API offers 50% cost savings). Reserve streaming for user-facing applications where perceived latency matters.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI Streaming Documentation, Anthropic Streaming Guide, MDN Server-Sent Events + TokenMix.ai