TokenMix Research Lab · 2026-04-10

AI API Async Processing and Webhooks: How to Handle LLM Batch Jobs Across Providers (2026 Guide)
Last Updated: 2026-04-29
Author: TokenMix Research Lab
Batch APIs cut standard pricing by 50% in exchange for a 24h SLA. At 500K requests/month, that saves $6,875/month ($82,500/year). OpenAI and Anthropic offer them; no provider has native webhooks (build them with polling + a queue). The savings outweigh the implementation effort from about 10K requests/month, roughly from day one.
AI API async processing is the most overlooked cost and performance lever in LLM engineering. Synchronous API calls hold connections idle while waiting for responses and hit rate limits sooner. Asynchronous patterns (batch APIs, polling, and webhooks) address both problems: provider batch APIs cut costs by 50%, concurrent requests can raise throughput by roughly 10x over sequential calls, and together they make large-scale LLM workloads viable. This guide compares async processing across OpenAI Batch API, Anthropic Message Batches, and custom webhook architectures, with real implementation patterns and cost analysis from TokenMix.ai as of April 2026.
Table of Contents
- Quick Comparison: AI API Async Methods
- Why Async Matters for LLM Workloads
- OpenAI Batch API: The 50% Discount Engine
- Anthropic Message Batches: Structured Batch Processing
- Polling vs Webhooks: Architecture Patterns
- Building a Webhook-Based LLM Pipeline
- Provider Async Support Comparison
- Full Comparison Table
- Cost Impact: Sync vs Async Processing
- Which Async Pattern Should You Pick?
- What's the Bottom Line on Async Processing?
- FAQ
Quick Comparison: AI API Async Methods
OpenAI Batch: 50K requests, file upload, 50% off, 24h SLA. Anthropic Batch: 10K requests, direct API, 50% off, 24h SLA. Polling: real-time + standard pricing. Webhooks: event-driven + standard pricing, requires custom infra.
| Dimension | OpenAI Batch API | Anthropic Batches | Custom Polling | LLM Webhooks |
|---|---|---|---|---|
| Cost Savings | 50% off standard | 50% off standard | 0% (standard pricing) | 0% (standard pricing) |
| Completion Time | Up to 24 hours | Up to 24 hours | Real-time + polling interval | Near real-time |
| Max Batch Size | 50,000 requests | 10,000 requests | Unlimited (self-managed) | Unlimited (self-managed) |
| Setup Complexity | Low (file upload) | Low (API call) | Medium (queue + workers) | High (endpoint + auth) |
| Status Tracking | Built-in | Built-in | Self-managed | Event-driven |
| Error Handling | Per-request errors | Per-request errors | Custom retry logic | Custom retry logic |
| Best For | Non-urgent bulk jobs | Non-urgent Claude jobs | Real-time with monitoring | Event-driven architectures |
Why Async Matters for LLM Workloads
10K-doc sync GPT-5.4 pipeline: 8-22 hours sequential, full price. Same load on Batch API: 4-6 hours typical, 50% off. For non-real-time workloads, batch is strictly superior — half the cost, zero rate-limit headaches, simpler error handling.
Synchronous LLM API calls are the default pattern. Send a request, wait for the response, process the next one. This works for chatbots and interactive applications. It does not work for batch processing, content generation pipelines, or any workload involving hundreds or thousands of requests.
The Sync Bottleneck
A typical synchronous pipeline processing 10,000 documents with GPT-5.4:
- Average request latency: 3-8 seconds
- Sequential processing time: 8-22 hours
- With 10 concurrent connections: 1-2 hours
- Rate limit: 10,000 RPM on Tier 5
- Cost: full price ($2.50/$15.00 per M tokens)
The same workload with OpenAI's Batch API:
- Processing time: up to 24 hours (usually 4-6 hours)
- No rate limit management needed
- Cost: 50% off ($1.25/$7.50 per M tokens)
For non-time-sensitive workloads, the batch approach is strictly superior. Half the cost, zero rate limit headaches, and simpler error handling.
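The timing figures above follow from simple arithmetic. The sketch below reproduces them under idealized assumptions (no retries, no rate limiting, perfectly even concurrency); the function name is illustrative only.

```python
# Back-of-envelope timing for the 10,000-document example above.
DOCS = 10_000
LATENCY_RANGE_S = (3, 8)   # average request latency range from the list above
CONCURRENCY = 10


def wall_clock_hours(docs: int, latency_s: float, concurrency: int = 1) -> float:
    """Idealized wall-clock time: ignores retries, rate limits, and jitter."""
    return docs * latency_s / concurrency / 3600


sequential = [wall_clock_hours(DOCS, s) for s in LATENCY_RANGE_S]
concurrent = [wall_clock_hours(DOCS, s, CONCURRENCY) for s in LATENCY_RANGE_S]

print(f"sequential: {sequential[0]:.1f}-{sequential[1]:.1f} h")                # 8.3-22.2 h
print(f"{CONCURRENCY} connections: {concurrent[0]:.1f}-{concurrent[1]:.1f} h")  # 0.8-2.2 h
```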
Three Async Patterns
The AI API ecosystem offers three distinct async patterns, each suited to different use cases:
Provider Batch APIs (OpenAI, Anthropic): Upload a file of requests, get results within 24 hours at 50% discount. Simplest implementation, biggest cost savings.
Polling: Send requests asynchronously, poll for completion status. Works with any provider. Adds complexity but gives real-time control.
Webhooks: Register a callback URL, receive results when ready. Most architecturally elegant but requires infrastructure. No major LLM provider offers native webhook support as of April 2026 -- you build this yourself.
OpenAI Batch API: The 50% Discount Engine
OpenAI Batch API: 50,000 requests max, JSONL upload, 50% discount, 24-hour SLA (typically 4-6h). GPT-5.4 batch pricing: $1.25/$7.50 per M tokens vs sync $2.50/$15.00. Most teams aren't using it — the biggest cost win in OpenAI's stack.
OpenAI's Batch API is the most impactful cost optimization feature most teams are not using. It processes chat completion requests at half price with a 24-hour SLA.
How It Works
- Create a JSONL file with one request per line, each containing a custom_id and the standard chat completion parameters.
- Upload the file via the Files API.
- Create a batch referencing the uploaded file.
- Poll for completion or wait for the batch to finish.
- Download results as a JSONL file with responses keyed by custom_id.
Implementation
import json
import time

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
documents = ["First document text.", "Second document text."]  # placeholder inputs

# Step 1: Create JSONL request file, one request per line
with open("batch_input.jsonl", "w") as f:
    for i, doc in enumerate(documents):
        request = {
            "custom_id": f"doc-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-5.4",
                "messages": [
                    {"role": "system", "content": "Summarize this document."},
                    {"role": "user", "content": doc},
                ],
                "max_tokens": 500,
            },
        }
        f.write(json.dumps(request) + "\n")

# Step 2: Upload file (the context manager closes the handle afterwards)
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Step 3: Create batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Step 4: Poll until the batch reaches a terminal state
while True:
    status = client.batches.retrieve(batch.id)
    if status.status == "completed":
        break
    if status.status in ("failed", "expired", "cancelled"):
        raise RuntimeError(f"Batch ended as {status.status}: {status.errors}")
    time.sleep(60)

# Step 5: Download results (one JSON object per line, keyed by custom_id)
result_file = client.files.content(status.output_file_id)
results = [json.loads(line) for line in result_file.text.splitlines() if line.strip()]
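Because errors are reported per request, the downloaded results usually need to be split into successes and failures before further processing. The following is a minimal sketch, assuming each output line carries `custom_id`, `error`, and a `response` object with `status_code` and `body`; verify that layout against the provider's current documentation. The function name `split_batch_output` and the sample data are illustrative, not part of any SDK.

```python
import json


def split_batch_output(jsonl_text: str):
    """Split batch output lines into successes and failures keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        custom_id = record["custom_id"]
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            # Keep whatever error detail is available for later inspection.
            failed[custom_id] = record.get("error") or response.get("body")
        else:
            body = response["body"]
            succeeded[custom_id] = body["choices"][0]["message"]["content"]
    return succeeded, failed


# Hand-written sample lines in the assumed layout (not real API output).
sample = "\n".join([
    json.dumps({"custom_id": "doc-0", "error": None, "response": {
        "status_code": 200,
        "body": {"choices": [{"message": {"content": "A short summary."}}]}}}),
    json.dumps({"custom_id": "doc-1", "error": None, "response": {
        "status_code": 400,
        "body": {"error": {"message": "bad request"}}}}),
])
ok, bad = split_batch_output(sample)
print(sorted(ok), sorted(bad))
```

The failed IDs can then be collected for resubmission in a later batch rather than retried one by one.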
Key Specifications
| Parameter | Value |
|---|---|
| Max requests per batch | 50,000 |
| Max file size | 100 MB |
| Completion SLA | 24 hours (typically 4-6 hours) |
| Cost discount | 50% off standard pricing |
| Supported endpoints | Chat completions, embeddings |
| Concurrent batches | Unlimited |
| Result retention | 30 days |
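Workloads larger than these limits have to be split across several batch files. Here is a rough sketch of one way to do that, using the request and file-size limits from the table above. Treating "100 MB" as 100,000,000 bytes is a conservative assumption, and `chunk_requests` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator, List

MAX_REQUESTS = 50_000        # per-batch request limit from the table above
MAX_BYTES = 100_000_000      # assumption: "100 MB" read conservatively as 10^8 bytes


def chunk_requests(
    requests: Iterable[dict],
    max_requests: int = MAX_REQUESTS,
    max_bytes: int = MAX_BYTES,
) -> Iterator[List[str]]:
    """Yield lists of JSONL lines, each list small enough for one batch file."""
    chunk, size = [], 0
    for req in requests:
        line = json.dumps(req) + "\n"
        line_bytes = len(line.encode("utf-8"))
        if line_bytes > max_bytes:
            raise ValueError(f"request {req.get('custom_id')} exceeds the file size limit")
        # Start a new chunk when adding this line would break either limit.
        if chunk and (len(chunk) >= max_requests or size + line_bytes > max_bytes):
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += line_bytes
    if chunk:
        yield chunk


demo = list(chunk_requests(({"custom_id": f"doc-{i}"} for i in range(7)), max_requests=3))
print([len(c) for c in demo])  # [3, 3, 1]
```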
When to Use OpenAI Batch
The Batch API is optimal when three conditions are met: (1) results are not needed in real-time, (2) you have 100+ requests to process, and (3) cost matters. Content generation, document summarization, data extraction, embedding generation, and evaluation pipelines all qualify.
What it does well:
- 50% cost reduction with minimal code complexity
- Handles rate limiting automatically
- Per-request error handling in results
- Scales to 50,000 requests per batch
- Results available for 30 days
Trade-offs:
- Up to 24 hours for completion (no SLA on faster delivery)
- No streaming support
- Cannot cancel individual requests within a batch
- Limited to chat completions and embeddings
- No webhook notification on completion
Anthropic Message Batches: Structured Batch Processing
Anthropic Batch: 10,000 requests max (one-fifth of OpenAI's limit), direct API call (no file upload), 50% off, request-level cancellation. Cleaner SDK for batches under 10K; OpenAI scales further but Anthropic's developer experience is tighter.
Anthropic's Message Batches API follows a similar philosophy to OpenAI's Batch API: submit a collection of requests, get results at 50% discount within 24 hours.
How It Works
Unlike OpenAI's file-based approach, Anthropic batches are created directly via API with an array of request objects. Each request includes a custom_id and standard message parameters.
Implementation
import time

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
documents = ["First document text.", "Second document text."]  # placeholder inputs

# Create batch with requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-6-20260401",
                "max_tokens": 500,
                "messages": [
                    {"role": "user", "content": f"Summarize: {doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]
)

# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch.id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Stream results; report any request that did not succeed
for result in client.messages.batches.results(batch.id):
    if result.result.type == "succeeded":
        summary = result.result.message.content[0].text
        print(f"{result.custom_id}: {summary}")
    else:
        print(f"{result.custom_id}: {result.result.type}")
Key Specifications
| Parameter | Value |
|---|---|
| Max requests per batch | 10,000 |
| Completion SLA | 24 hours |
| Cost discount | 50% off standard pricing |
| Supported endpoints | Messages API only |
| Result streaming | Yes (iterate over results) |
| Cancellation | Yes (cancel remaining requests) |
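Since a batch can end with a mix of outcomes, it can help to group results by type before deciding what to resubmit. The sketch below works on any objects shaped like the entries iterated in the implementation above (`custom_id` plus `result.type`). The stand-in objects and the non-success type names in the sample data are assumptions for illustration; check the provider documentation for the exact set of result types.

```python
from collections import defaultdict
from types import SimpleNamespace


def bucket_results(results):
    """Group custom_ids by result type, preserving the order they arrive in."""
    buckets = defaultdict(list)
    for entry in results:
        buckets[entry.result.type].append(entry.custom_id)
    return dict(buckets)


# Stand-in objects shaped like batch result entries (not real SDK output).
fake_results = [
    SimpleNamespace(custom_id="doc-0", result=SimpleNamespace(type="succeeded")),
    SimpleNamespace(custom_id="doc-1", result=SimpleNamespace(type="errored")),
    SimpleNamespace(custom_id="doc-2", result=SimpleNamespace(type="expired")),
    SimpleNamespace(custom_id="doc-3", result=SimpleNamespace(type="succeeded")),
]

buckets = bucket_results(fake_results)
# Everything that did not succeed is a candidate for a follow-up batch.
retry_ids = [cid for kind, ids in buckets.items() if kind != "succeeded" for cid in ids]
print(buckets["succeeded"], retry_ids)
```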
OpenAI Batch vs Anthropic Batch
| Feature | OpenAI Batch | Anthropic Batch |
|---|---|---|
| Max batch size | 50,000 | 10,000 |
| Input method | JSONL file upload | Direct API call |
| Discount | 50% | 50% |
| Cancellation | Batch-level | Request-level |
| Result retrieval | File download | Streaming iteration |
| Endpoints | Chat + embeddings | Messages only |
Anthropic's approach is cleaner for smaller batches (under 10,000 requests) because it avoids the file upload step. OpenAI's approach scales better for massive batches and supports embeddings.
What it does well:
- 50% cost reduction on Claude models
- Direct API creation (no file upload step)
- Request-level cancellation
- Streaming result retrieval
- Clean SDK integration
Trade-offs:
- 10,000 request limit per batch (one-fifth of OpenAI's limit)
- No embedding support
- 24-hour completion window
- Newer API with less community documentation
Polling vs Webhooks: Architecture Patterns
No major LLM provider offers native webhook support as of April 2026. Polling at 60s intervals × 100 concurrent batches = 144,000 status checks/day, most of them wasted. A custom webhook layer (queue + worker + endpoint) starts to pay off above 10 concurrent batches.
Neither OpenAI nor Anthropic offers native webhook notifications for batch completion. You have two choices: poll for status or build a webhook layer yourself.
Polling Pattern
Polling is the simplest approach. Check batch status at regular intervals until completion.
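import asyncio

async def poll_batch(client, batch_id, interval=60):
    """Poll until the batch reaches a terminal state; `client` is an AsyncOpenAI instance."""
    while True:
        # Await the async client so the event loop is not blocked during the request
        status = await client.batches.retrieve(batch_id)
        if status.status in ("completed", "failed", "expired", "cancelled"):
            return status
        await asyncio.sleep(interval)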
Advantages: Simple, no infrastructure, works with any provider.
Disadvantages: Wastes compute on status checks. At 60-second intervals across 100 concurrent batches, you make 144,000 status checks per day (1,440 per batch), and most return "still processing."
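The 144,000 figure, and one common way to reduce it, can be worked through in a few lines. This sketch compares fixed-interval polling with a capped exponential backoff. The 15-minute cap and the doubling factor are arbitrary illustrative choices, and backoff trades fewer checks for slower detection of completion (up to one cap interval).

```python
from itertools import islice


def checks_per_day(interval_s: int, concurrent_batches: int) -> int:
    """Status checks per day at a fixed polling interval."""
    return (24 * 60 * 60 // interval_s) * concurrent_batches


def backoff_intervals(start_s: int = 60, factor: int = 2, cap_s: int = 900):
    """Yield polling delays that grow geometrically up to a cap."""
    delay = start_s
    while True:
        yield delay
        delay = min(delay * factor, cap_s)


def polls_in_window(window_s: int, intervals) -> int:
    """Count polls that fit inside the window for a given delay sequence."""
    elapsed = count = 0
    for delay in intervals:
        elapsed += delay
        if elapsed > window_s:
            break
        count += 1
    return count


print(checks_per_day(60, 100))                 # 144000
print(list(islice(backoff_intervals(), 6)))    # [60, 120, 240, 480, 900, 900]
print(polls_in_window(86_400, backoff_intervals()))  # polls per batch over 24h with backoff
```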
Webhook Pattern
Webhooks invert the control flow. Instead of asking "are you done?", the system tells you when it is done. Since providers do not offer native LLM webhooks, you build this with a message queue.
[Batch Submitted] -> [Queue Job] -> [Worker Polls Provider]
                                               |
                                     [Completion Detected]
                                               |
                                [Webhook POST to your endpoint]
                                               |
                                       [Process Results]
Building Webhook Infrastructure
A practical webhook pipeline for LLM batch jobs uses three components:
- Job queue (Redis, SQS, or RabbitMQ): tracks submitted batches and their callback URLs.
- Polling worker: checks batch status at intervals, fires webhooks on completion.
- Webhook endpoint: your application receives results via HTTP POST.
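The polling worker in the list above could look roughly like the following sketch. It is deliberately provider-agnostic: `check` is a hypothetical callable you supply that returns a dict with `status` and `results_url` for a batch ID, and the in-memory `jobs` dict stands in for a real queue such as Redis or SQS. The status names in `TERMINAL` are assumptions to adapt per provider.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

TERMINAL = {"completed", "failed", "expired", "cancelled"}  # assumed terminal states


def post_webhook(url: str, payload: dict) -> None:
    """POST a JSON payload to the callback URL using only the standard library."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def run_worker(
    jobs: Dict[str, str],                 # batch_id -> callback URL
    check: Callable[[str], dict],         # hypothetical status lookup you provide
    notify: Callable[[str, dict], None] = post_webhook,
    interval_s: int = 300,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll every tracked batch and fire one webhook per finished batch."""
    while jobs:
        for batch_id, callback_url in list(jobs.items()):
            info = check(batch_id)
            if info["status"] in TERMINAL:
                notify(callback_url, {"batch_id": batch_id, **info})
                del jobs[batch_id]        # removed only after the webhook was sent
        if jobs:
            sleep(interval_s)
```

Injecting `notify` and `sleep` keeps the loop testable without network access or real delays.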
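# Webhook endpoint (Flask example)
from flask import Flask, request

app = Flask(__name__)

def process_batch_results(batch_id, results_url):
    """Placeholder: download the results and hand them to your application."""

@app.route("/webhook/batch-complete", methods=["POST"])
def batch_complete():
    payload = request.get_json()
    batch_id = payload["batch_id"]
    status = payload["status"]
    results_url = payload.get("results_url")
    # Process results only for batches that completed successfully
    if status == "completed":
        process_batch_results(batch_id, results_url)
    return {"received": True}, 200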
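A webhook endpoint that accepts unauthenticated POSTs can be triggered by anyone who knows the URL, which is part of why the comparison table lists "endpoint + auth" under setup complexity. One common approach is an HMAC signature over the raw request body. The sketch below assumes a shared secret between the polling worker and the endpoint; the header name `X-Signature` mentioned in the comment is hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))


# The sender computes the signature and sends it in a header, e.g. X-Signature
# (hypothetical name); the receiver recomputes it over the raw body bytes.
secret = b"shared-secret-for-illustration"
body = b'{"batch_id": "b1", "status": "completed"}'
signature = sign(secret, body)

print(is_valid_signature(secret, body, signature))         # True
print(is_valid_signature(secret, body + b" ", signature))  # False
```

In a Flask handler the raw bytes would come from `request.get_data()`, checked before the JSON is parsed.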
When to Use Each Pattern
| Scenario | Use Polling | Use Webhooks |
|---|---|---|
| Less than 10 concurrent batches | Yes | Overkill |
| 10-100 concurrent batches | Borderline | Yes |
| 100+ concurrent batches | No | Yes |
| Simple script/notebook | Yes | No |
| Production microservices | No | Yes |
| Event-driven architecture | No | Yes |
Building a Webhook-Based LLM Pipeline
Production batch pipeline = 3 components: job tracker DB + 5-min polling worker + webhook POST receiver. Optimal batch size: 1,000-5,000 requests (median 2.5h completion vs 24h for 50K batches). Smaller batches finish faster — five 10K batches beat one 50K batch.
For teams running LLM batch jobs at scale, here is a production-ready architecture using TokenMix.ai's unified API with webhook-style notifications.
Architecture Overview
[Document Queue] -> [Batch Creator]
                           |
[Submit to OpenAI/Anthropic Batch API via TokenMix.ai]
                           |
                   [Job Tracker DB]
                           |
            [Polling Worker (every 5 min)]
                           |
             [On Completion: POST webhook]
                           |
                  [Result Processor] -> [Output DB]
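The Job Tracker DB in this diagram can start out very small. The sketch below uses SQLite from the standard library; the table name, column names, and helper functions are all hypothetical choices, and a production deployment would more likely use the team's existing database.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS batch_jobs (
    batch_id     TEXT PRIMARY KEY,
    provider     TEXT NOT NULL,
    callback_url TEXT NOT NULL,
    status       TEXT NOT NULL DEFAULT 'submitted',
    submitted_at REAL NOT NULL,
    notified     INTEGER NOT NULL DEFAULT 0
)
"""


def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the tracker database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_submission(conn, batch_id: str, provider: str, callback_url: str) -> None:
    """Store a newly submitted batch so the polling worker can find it."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO batch_jobs (batch_id, provider, callback_url, submitted_at) "
            "VALUES (?, ?, ?, ?)",
            (batch_id, provider, callback_url, time.time()),
        )


def pending_jobs(conn):
    """Batches whose completion webhook has not been sent yet."""
    return conn.execute(
        "SELECT batch_id, provider, callback_url FROM batch_jobs WHERE notified = 0"
    ).fetchall()


def mark_notified(conn, batch_id: str, status: str) -> None:
    """Record the final status once the webhook has been delivered."""
    with conn:
        conn.execute(
            "UPDATE batch_jobs SET status = ?, notified = 1 WHERE batch_id = ?",
            (status, batch_id),
        )
```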
Key Design Decisions
Batch size optimization. Smaller batches (1,000-5,000 requests) complete faster on average. A 50,000-request batch may take the full 24 hours. Five batches of 10,000 requests each often complete within 6-8 hours total. TokenMix.ai's monitoring data shows batches under 5,000 requests complete in a median of 2.5 hours.
Error handling. Both OpenAI and Anthropic return per-request errors within batch results. Your pipeline must handle: (1) batch-level failures (entire batch rejected), (2) request-level failures (individual requests fail), and (3) timeout failures (batch exceeds 24-hour window).
Retry strategy. Failed requests should be collected and resubmitted in a new batch, not retried individually via the synchronous API. Individual retries forfeit the 50% discount.
Multi-provider routing. Using TokenMix.ai's unified API, you can route batch jobs to different providers based on cost, speed, and availability. If OpenAI's batch queue is congested (visible via longer completion times), route to Anthropic's batch API, or vice versa.
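The retry strategy described above (collect failures and resubmit them as a new batch) can be sketched without any provider SDK. The attempt cap of 3 and the helper name `build_retry_batch` are illustrative assumptions; requests that keep failing are set aside instead of looping forever.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_ATTEMPTS = 3  # hypothetical cap on total attempts per request


def build_retry_batch(
    original_requests: Iterable[dict],
    failed_ids: Iterable[str],
    attempts: Counter,
    max_attempts: int = MAX_ATTEMPTS,
) -> Tuple[List[dict], List[str]]:
    """Return (requests to resubmit in a new batch, custom_ids to give up on)."""
    failed = set(failed_ids)
    retry, abandoned = [], []
    for req in original_requests:
        custom_id = req["custom_id"]
        if custom_id not in failed:
            continue
        attempts[custom_id] += 1           # one more failed attempt recorded
        if attempts[custom_id] < max_attempts:
            retry.append(req)
        else:
            abandoned.append(custom_id)
    return retry, abandoned


requests = [{"custom_id": f"doc-{i}"} for i in range(4)]
attempts = Counter({"doc-3": 2})           # doc-3 has already failed twice
retry, abandoned = build_retry_batch(requests, ["doc-1", "doc-3"], attempts)
print([r["custom_id"] for r in retry], abandoned)  # ['doc-1'] ['doc-3']
```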
Provider Async Support Comparison
Only OpenAI and Anthropic offer first-party batch APIs with 50% discounts. Gemini uses Vertex Batch Predict (varies); DeepSeek/Together/Groq have async SDK but zero cost savings. For non-OpenAI/Anthropic providers, async = throughput only, not savings.
| Provider | Batch API | Batch Discount | Webhook Support | Max Batch Size | Async SDK |
|---|---|---|---|---|---|
| OpenAI | Yes | 50% | No (polling only) | 50,000 | AsyncOpenAI |
| Anthropic | Yes | 50% | No (polling only) | 10,000 | AsyncAnthropic |
| Google Gemini | No (Vertex Batch Predict) | Varies | Pub/Sub integration | N/A | Native async |
| DeepSeek | No | N/A | No | N/A | AsyncOpenAI compatible |
| Together | No | N/A | No | N/A | AsyncTogether |
| Groq | No | N/A | No | N/A | AsyncGroq |
| TokenMix.ai | Via providers | Passed through | Planned | Provider limits | OpenAI compatible |
Only OpenAI and Anthropic offer first-party batch APIs with cost discounts. For other providers, async processing means concurrent requests with async SDK clients -- useful for throughput but no cost savings.
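For providers without a batch API, the concurrent-request pattern usually amounts to bounding parallelism with a semaphore. The sketch below uses a stand-in coroutine (`fake_completion`) in place of a real async SDK call so that it runs without credentials; the concurrency limit of 10 is an arbitrary example value.

```python
import asyncio


async def fake_completion(prompt: str) -> str:
    """Stand-in for an async SDK call; replace with a real client request."""
    await asyncio.sleep(0.01)
    return prompt.upper()


async def run_concurrently(prompts, call=fake_completion, max_concurrency: int = 10):
    """Run `call` over all prompts with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(prompt):
        async with semaphore:
            return await call(prompt)

    # gather returns results in the same order as the input prompts
    return await asyncio.gather(*(bounded(p) for p in prompts))


results = asyncio.run(run_concurrently([f"doc {i}" for i in range(25)]))
print(len(results), results[0])
```

The semaphore keeps the request rate predictable, which makes it easier to stay under a provider's rate limit than launching every request at once.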
Full Comparison Table
5 patterns side-by-side: Sync (real-time, full price), OpenAI/Anthropic Batch (50% off, 24h SLA), Custom Async+Polling (real-time, infra needed), Custom Webhooks (near-real-time, highest setup effort). Batch APIs are the only path to provider-side cost savings.
| Feature | Sync API | OpenAI Batch | Anthropic Batch | Custom Async + Polling | Custom Webhooks |
|---|---|---|---|---|---|
| Cost | Standard | 50% off | 50% off | Standard | Standard |
| Latency | 2-10 sec | 2-24 hours | 2-24 hours | 2-10 sec | 2-10 sec + delivery |
| Throughput | Rate-limited | Unlimited in batch | Unlimited in batch | Rate-limited | Rate-limited |
| Rate Limit Mgmt | Manual | Automatic | Automatic | Manual | Manual |
| Setup Effort | None | Low | Low | Medium | High |
| Infrastructure | None | None | None | Queue + workers | Queue + workers + endpoint |
| Real-Time | Yes | No | No | Yes | Near real-time |
| Error Handling | Immediate | Per-request in results | Per-request in results | Custom | Custom |
| Status Tracking | N/A | Built-in | Built-in | Custom | Event-driven |
| Provider Lock-In | Per provider | OpenAI only | Anthropic only | None | None |
Cost Impact: Sync vs Async Processing
At 500K requests/month (GPT-5.4, 5K input + 1K output): sync = $13,750/mo, batch = $6,875/mo. Annual savings: $82,500. At 1M requests/mo savings hit $165,000/year. Breakeven for engineering effort: roughly day one.
Monthly Cost Comparison: 500,000 Requests
Assumptions: GPT-5.4, average 5,000 input tokens and 1,000 output tokens per request.
| Method | Input Cost | Output Cost | Infrastructure | Total Monthly |
|---|---|---|---|---|
| Sync API | $6,250 | $7,500 | $0 | $13,750 |
| Batch API (50% off) | $3,125 | $3,750 | $0 | $6,875 |
| Async + Polling | $6,250 | $7,500 | ~$100 | $13,850 |
| Custom Webhooks | $6,250 | $7,500 | ~$200 | $13,950 |
The Batch API saves $6,875/month -- $82,500/year -- on a 500,000 request/month workload. No other optimization delivers this magnitude of savings with so little engineering effort.
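The table's figures can be reproduced with a short calculation, which also makes it easy to plug in your own token counts. This sketch uses the per-million-token prices and the workload assumptions quoted in this article; the function name is illustrative.

```python
def monthly_cost(
    requests: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    discount: float = 0.0,
) -> float:
    """Monthly API cost in dollars for a uniform workload."""
    input_cost = requests * input_tokens / 1_000_000 * input_price_per_m
    output_cost = requests * output_tokens / 1_000_000 * output_price_per_m
    return (input_cost + output_cost) * (1 - discount)


# Workload from the table: 500,000 requests, 5,000 input + 1,000 output tokens each.
sync = monthly_cost(500_000, 5_000, 1_000, 2.50, 15.00)
batch = monthly_cost(500_000, 5_000, 1_000, 2.50, 15.00, discount=0.5)

print(f"sync ${sync:,.0f}/mo, batch ${batch:,.0f}/mo, "
      f"annual savings ${(sync - batch) * 12:,.0f}")
# sync $13,750/mo, batch $6,875/mo, annual savings $82,500
```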
When Async Pays for Itself
| Monthly Requests | Sync Cost | Batch Cost | Annual Savings |
|---|---|---|---|
| 10,000 | $275 | $137.50 | $1,650 |
| 100,000 | $2,750 | $1,375 | $16,500 |
| 500,000 | $13,750 | $6,875 | $82,500 |
| 1,000,000 | $27,500 | $13,750 | $165,000 |
At any volume above 10,000 requests/month, the batch API discount exceeds the cost of the engineering time to implement it. The breakeven is roughly day one.
Which Async Pattern Should You Pick?
Default: OpenAI/Anthropic Batch API for non-urgent bulk (50% savings, zero infra). Real-time + parallelism: Async SDK. 100+ concurrent jobs: custom webhooks. Multi-provider: TokenMix.ai unified routing. Mixed traffic: batch first, async fallback for time-sensitive.
| Your Situation | Choose | Why |
|---|---|---|
| Non-urgent bulk processing, using OpenAI | OpenAI Batch API | 50% savings, zero infrastructure |
| Non-urgent bulk processing, using Claude | Anthropic Batch API | 50% savings, clean API |
| Need results within seconds | Async SDK (concurrent) | Real-time with parallelism |
| 10+ concurrent batch jobs | Polling + queue | Centralized tracking |
| 100+ concurrent jobs, event-driven arch | Custom webhooks | Scalable, decoupled |
| Multi-provider batch processing | TokenMix.ai + provider batches | Unified API, route to cheapest batch |
| Simple scripts or notebooks | Sync with retry | Minimal complexity |
| High-volume, cost-sensitive | Batch API first, async fallback | Maximum savings on eligible traffic |
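For teams that want this decision encoded in a routing layer, the table can be approximated as a small function. This is a simplified sketch that covers only the real-time, concurrency, and provider rows; the parameter names and returned labels are illustrative and not part of any API.

```python
def choose_pattern(
    needs_realtime: bool,
    concurrent_jobs: int = 1,
    event_driven: bool = False,
    provider: str = "openai",
) -> str:
    """Map a workload description to one of the patterns in the table above."""
    if needs_realtime:
        return "async SDK (concurrent requests)"
    if concurrent_jobs >= 100 and event_driven:
        return "provider batch API + custom webhooks"
    if concurrent_jobs >= 10:
        return "provider batch API + polling queue"
    if provider == "anthropic":
        return "Anthropic Batch API"
    return "OpenAI Batch API"


print(choose_pattern(needs_realtime=False))
print(choose_pattern(needs_realtime=False, concurrent_jobs=150, event_driven=True))
print(choose_pattern(needs_realtime=True))
```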
What's the Bottom Line on Async Processing?
Above $500/month LLM spend, async is mandatory — Batch API delivers 50% cost reduction with zero infra. Priority order: (1) batch all non-real-time workloads, (2) async SDK for real-time throughput, (3) webhooks only at 100+ concurrent jobs. Day-one ROI on every dollar batched.
AI API async processing is not optional for teams spending more than $500/month on LLM APIs. The OpenAI and Anthropic Batch APIs deliver 50% cost reduction with minimal implementation effort. For every dollar you spend on synchronous API calls that could have been batched, you are leaving 50 cents on the table.
The priority order is clear. First, move all non-real-time workloads to batch APIs (OpenAI or Anthropic). Second, implement async SDK patterns for real-time workloads that need higher throughput. Third, build LLM webhook infrastructure only if you have 100+ concurrent jobs and an event-driven architecture.
Through TokenMix.ai's unified API, you can route eligible requests to whichever provider's batch API offers the best combination of cost, speed, and availability. The platform tracks batch queue times across providers in real-time, helping you avoid congested queues and minimize completion latency.
Start with the batch API. The 50% discount pays for itself from the first request.
FAQ
Does OpenAI have a webhook for batch API completion?
No. As of April 2026, OpenAI does not offer native webhook notifications for batch completion. You must poll the batch status endpoint at regular intervals. Typical polling intervals are 60-300 seconds. For production systems with many concurrent batches, building a custom webhook layer with a job queue reduces wasted polling calls.
How much does the OpenAI Batch API save compared to the standard API?
The OpenAI Batch API offers a flat 50% discount on all requests. For GPT-5.4, this reduces pricing from $2.50/$15.00 to $1.25/$7.50 per million tokens. On a workload of 500,000 requests per month (averaging 5,000 input and 1,000 output tokens per request), this saves approximately $82,500 annually. The main tradeoff is a completion window of up to 24 hours.
What is the difference between polling and webhooks for LLM APIs?
Polling repeatedly checks the provider's status endpoint until a job completes. Webhooks receive an HTTP POST notification when the job finishes. Polling is simpler but wastes compute on status checks. Webhooks are more efficient but require you to build and host an endpoint. No major LLM provider offers native webhook support -- you must build the webhook layer yourself.
Can I use Anthropic's Batch API with the OpenAI SDK?
Not directly. Anthropic's batch API uses the Anthropic SDK with its own message format. However, through unified API providers like TokenMix.ai that support the OpenAI SDK format, you can access batch-eligible pricing across providers with a single integration. The provider handles the translation between formats.
How long does a batch API request take to complete?
Both OpenAI and Anthropic have a 24-hour SLA for batch completion. In practice, TokenMix.ai's monitoring shows median completion times of 2-4 hours for batches under 5,000 requests and 6-12 hours for larger batches. Completion times vary based on queue load and request complexity.
Should I use async or batch processing for AI API calls?
Use batch processing (OpenAI Batch API or Anthropic Batches) for non-time-sensitive workloads to get 50% cost savings. Use async processing (concurrent API calls with AsyncOpenAI) for real-time workloads where you need results within seconds but want higher throughput than sequential calls. The two patterns solve different problems and can be combined in the same application.
Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: OpenAI, Anthropic, TokenMix.ai