Async AI API Processing 2026: Batch API, Webhooks, and Polling Patterns for Production
TokenMix Research Lab · 2026-04-10

AI API Async Processing and Webhooks: How to Handle LLM Batch Jobs Across Providers (2026 Guide)
AI API async processing is the most overlooked cost and performance lever in LLM engineering. Synchronous API calls waste money on idle connections and hit [rate limits](https://tokenmix.ai/blog/ai-api-rate-limits-guide) faster. Asynchronous patterns -- batch APIs, polling, and webhooks -- cut costs by 50%, increase throughput by 10x, and make large-scale LLM workloads viable. This guide compares async processing across OpenAI [Batch API](https://tokenmix.ai/blog/openai-batch-api-pricing), Anthropic Message Batches, and custom webhook architectures, with real implementation patterns and cost analysis from [TokenMix.ai](https://tokenmix.ai) as of April 2026.
Table of Contents
- [Quick Comparison: AI API Async Methods]
- [Why Async Matters for LLM Workloads]
- [OpenAI Batch API: The 50% Discount Engine]
- [Anthropic Message Batches: Structured Batch Processing]
- [Polling vs Webhooks: Architecture Patterns]
- [Building a Webhook-Based LLM Pipeline]
- [Provider Async Support Comparison]
- [Full Comparison Table]
- [Cost Impact: Sync vs Async Processing]
- [Decision Guide: Which Async Pattern to Choose]
- [Conclusion]
- [FAQ]
---
Quick Comparison: AI API Async Methods
| Dimension | OpenAI Batch API | Anthropic Batches | Custom Polling | LLM Webhooks |
| --- | --- | --- | --- | --- |
| **Cost Savings** | 50% off standard | 50% off standard | 0% (standard pricing) | 0% (standard pricing) |
| **Completion Time** | Up to 24 hours | Up to 24 hours | Real-time + polling interval | Near real-time |
| **Max Batch Size** | 50,000 requests | 10,000 requests | Unlimited (self-managed) | Unlimited (self-managed) |
| **Setup Complexity** | Low (file upload) | Low (API call) | Medium (queue + workers) | High (endpoint + auth) |
| **Status Tracking** | Built-in | Built-in | Self-managed | Event-driven |
| **Error Handling** | Per-request errors | Per-request errors | Custom retry logic | Custom retry logic |
| **Best For** | Non-urgent bulk jobs | Non-urgent Claude jobs | Real-time with monitoring | Event-driven architectures |
---
Why Async Matters for LLM Workloads
Synchronous LLM API calls are the default pattern. Send a request, wait for the response, process the next one. This works for chatbots and interactive applications. It does not work for batch processing, content generation pipelines, or any workload involving hundreds or thousands of requests.
The Sync Bottleneck
A typical synchronous pipeline processing 10,000 documents with [GPT-5.4](https://tokenmix.ai/blog/gpt-5-api-pricing):

- Average request latency: 3-8 seconds
- Sequential processing time: 8-22 hours
- With 10 concurrent connections: 1-2 hours
- Rate limit: 10,000 RPM on Tier 5
- Cost: full price ($2.50/$15.00 per M tokens)
The same workload with OpenAI's Batch API:

- Processing time: up to 24 hours (usually 4-6 hours)
- No rate limit management needed
- Cost: 50% off ($1.25/$7.50 per M tokens)
For non-time-sensitive workloads, the batch approach is strictly superior. Half the cost, zero rate limit headaches, and simpler error handling.
Three Async Patterns
The AI API ecosystem offers three distinct async patterns, each suited to different use cases:
1. **Provider Batch APIs** (OpenAI, Anthropic): Upload a file of requests, get results within 24 hours at 50% discount. Simplest implementation, biggest cost savings.
2. **Polling**: Send requests asynchronously, poll for completion status. Works with any provider. Adds complexity but gives real-time control.
3. **Webhooks**: Register a callback URL, receive results when ready. Most architecturally elegant but requires infrastructure. No major LLM provider offers native webhook support as of April 2026 -- you build this yourself.
---
OpenAI Batch API: The 50% Discount Engine
OpenAI's Batch API is the most impactful cost optimization feature most teams are not using. It processes chat completion requests at half price with a 24-hour SLA.
How It Works
1. **Create a JSONL file** with one request per line, each containing a `custom_id` and the standard chat completion parameters.
2. **Upload the file** via the Files API.
3. **Create a batch** referencing the uploaded file.
4. **Poll for completion** or wait for the batch to finish.
5. **Download results** as a JSONL file with responses keyed by `custom_id`.
Implementation
```python
import json
import time

from openai import OpenAI

client = OpenAI()

# Step 1: Create JSONL request file (assumes `requests` is a list of dicts,
# each with a `custom_id`, `method`, `url`, and chat completion `body`)
with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Step 2: Upload file
batch_file = client.files.create(
    file=open("batch_input.jsonl", "rb"), purpose="batch"
)

# Step 3: Create batch
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Step 4: Poll for completion
while batch.status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)
    batch = client.batches.retrieve(batch.id)

# Step 5: Download results
output = client.files.content(batch.output_file_id)
results = [json.loads(line) for line in output.text.splitlines()]
```
Key Specifications
| Parameter | Value |
| --- | --- |
| **Max requests per batch** | 50,000 |
| **Max file size** | 100 MB |
| **Completion SLA** | 24 hours (typically 4-6 hours) |
| **Cost discount** | 50% off standard pricing |
| **Supported endpoints** | Chat completions, embeddings |
| **Concurrent batches** | Unlimited |
| **Result retention** | 30 days |
When to Use OpenAI Batch
The Batch API is optimal when three conditions are met: (1) results are not needed in real-time, (2) you have 100+ requests to process, and (3) cost matters. Content generation, document summarization, data extraction, embedding generation, and evaluation pipelines all qualify.
**What it does well:**

- 50% cost reduction with zero code complexity
- Handles rate limiting automatically
- Per-request error handling in results
- Scales to 50,000 requests per batch
- Results available for 30 days

**Trade-offs:**

- Up to 24 hours for completion (no SLA on faster delivery)
- No [streaming](https://tokenmix.ai/blog/ai-api-streaming-guide) support
- Cannot cancel individual requests within a batch
- Limited to chat completions and embeddings
- No webhook notification on completion
---
Anthropic Message Batches: Structured Batch Processing
Anthropic's Message Batches API follows a similar philosophy to OpenAI's Batch API: submit a collection of requests, get results at 50% discount within 24 hours.
How It Works
Unlike OpenAI's file-based approach, Anthropic batches are created directly via API with an array of request objects. Each request includes a `custom_id` and standard message parameters.
Implementation
```python
import time

from anthropic import Anthropic

client = Anthropic()

# Create batch with requests (assumes `documents` and `model` are defined)
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": doc}],
            },
        }
        for i, doc in enumerate(documents)
    ]
)

# Poll for completion
while batch.processing_status != "ended":
    time.sleep(60)
    batch = client.messages.batches.retrieve(batch.id)

# Stream results
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content)
    else:
        print(entry.custom_id, "failed:", entry.result.type)
```
Key Specifications
| Parameter | Value |
| --- | --- |
| **Max requests per batch** | 10,000 |
| **Completion SLA** | 24 hours |
| **Cost discount** | 50% off standard pricing |
| **Supported endpoints** | Messages API only |
| **Result streaming** | Yes (iterate over results) |
| **Cancellation** | Yes (cancel remaining requests) |
OpenAI Batch vs Anthropic Batch
| Feature | OpenAI Batch | Anthropic Batch |
| --- | --- | --- |
| **Max batch size** | 50,000 | 10,000 |
| **Input method** | JSONL file upload | Direct API call |
| **Discount** | 50% | 50% |
| **Cancellation** | Batch-level | Request-level |
| **Result retrieval** | File download | Streaming iteration |
| **Endpoints** | Chat + embeddings | Messages only |
Anthropic's approach is cleaner for smaller batches (under 10,000 requests) because it avoids the file upload step. OpenAI's approach scales better for massive batches and supports embeddings.
**What it does well:**

- 50% cost reduction on Claude models
- Direct API creation (no file upload step)
- Request-level cancellation
- Streaming result retrieval
- Clean SDK integration

**Trade-offs:**

- 10,000 request limit per batch (5x smaller than OpenAI)
- No embedding support
- 24-hour completion window
- Newer API with less community documentation
---
Polling vs Webhooks: Architecture Patterns
Neither OpenAI nor Anthropic offers native webhook notifications for batch completion. You have two choices: poll for status or build a webhook layer yourself.
Polling Pattern
Polling is the simplest approach. Check batch status at regular intervals until completion.
```python
import asyncio

async def poll_batch(client, batch_id, interval=60):
    while True:
        status = client.batches.retrieve(batch_id)
        if status.status in ("completed", "failed", "expired"):
            return status
        await asyncio.sleep(interval)
```
**Advantages:** Simple, no infrastructure, works with any provider. **Disadvantages:** Wastes compute on status checks. At 60-second intervals across 100 concurrent batches, you make 144,000 status checks per day. Most return "still processing."
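One way to cut wasted status checks without meaningfully delaying completion detection is to poll on a growing interval. A minimal sketch, assuming a fixed-interval poller like the one above -- the function name, defaults, and jitter scheme here are illustrative, not from any provider SDK:

```python
import random

def backoff_intervals(base=60, factor=1.5, cap=900, jitter=0.1):
    """Yield polling intervals that grow geometrically up to a cap.

    Early checks are frequent (a batch may finish in minutes); later
    checks back off toward the cap to avoid wasted status calls.
    """
    interval = base
    while True:
        # Add +/- jitter so many workers don't poll in lockstep
        yield interval * (1 + random.uniform(-jitter, jitter))
        interval = min(interval * factor, cap)
```

Feeding these intervals into the `asyncio.sleep` call keeps the first checks at roughly one minute apart while long-running batches get checked every fifteen minutes instead of every sixty seconds.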
Webhook Pattern
Webhooks invert the control flow. Instead of asking "are you done?", the system tells you when it is done. Since providers do not offer native LLM webhooks, you build this with a message queue.
Building Webhook Infrastructure
A practical webhook pipeline for LLM batch jobs uses three components:
1. **Job queue** (Redis, SQS, or RabbitMQ): tracks submitted batches and their callback URLs.
2. **Polling worker**: checks batch status at intervals, fires webhooks on completion.
3. **Webhook endpoint**: your application receives results via HTTP POST.
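The polling worker in step 2 can be sketched as a small provider-agnostic function. Here `get_status` and `post_webhook` are injected callables (hypothetical names, not a real library API), so the same loop works against any provider's batch status endpoint:

```python
def dispatch_completed(jobs, get_status, post_webhook):
    """Fire webhooks for batches that reached a terminal state.

    `jobs` maps batch_id -> callback URL. `get_status(batch_id)` returns
    the provider's status string; `post_webhook(url, payload)` delivers
    the HTTP POST. Returns the jobs that are still pending.
    """
    TERMINAL = {"completed", "failed", "expired", "cancelled"}
    still_pending = {}
    for batch_id, callback_url in jobs.items():
        status = get_status(batch_id)
        if status in TERMINAL:
            post_webhook(callback_url, {"batch_id": batch_id, "status": status})
        else:
            still_pending[batch_id] = callback_url
    return still_pending
```

A scheduler (cron, Celery beat, or a plain loop) would call this every minute with the pending set loaded from the job queue, writing the returned dict back afterward.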
```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook/batch-complete", methods=["POST"])
def batch_complete():
    payload = request.json
    batch_id = payload["batch_id"]
    status = payload["status"]
    results_url = payload["results_url"]

    # Process results
    ...  # e.g. enqueue a job to download and parse results_url
    return {"received": True}, 200
```
When to Use Each Pattern
| Scenario | Use Polling | Use Webhooks |
| --- | --- | --- |
| Less than 10 concurrent batches | Yes | Overkill |
| 10-100 concurrent batches | Borderline | Yes |
| 100+ concurrent batches | No | Yes |
| Simple script/notebook | Yes | No |
| Production microservices | No | Yes |
| Event-driven architecture | No | Yes |
---
Building a Webhook-Based LLM Pipeline
For teams running LLM batch jobs at scale, here is a production-ready architecture using TokenMix.ai's unified API with webhook-style notifications.
Architecture Overview
The pipeline runs in four stages: your application submits batch jobs through the unified API, a job queue records each batch ID alongside its callback URL, a polling worker checks batch status with the provider, and your webhook endpoint receives results when each batch completes.
Key Design Decisions
**Batch size optimization.** Smaller batches (1,000-5,000 requests) complete faster on average. A 50,000-request batch may take the full 24 hours. Five batches of 10,000 requests each often complete within 6-8 hours total. TokenMix.ai's monitoring data shows batches under 5,000 requests complete in a median of 2.5 hours.
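Splitting a large job into sub-batches is a one-liner. The helper below is an illustrative sketch, with the 5,000-request default taken from the completion-time observation above:

```python
def split_into_batches(requests, max_size=5000):
    """Split a flat list of batch requests into sub-batches of at most
    `max_size`, each submitted as its own batch job."""
    return [requests[i:i + max_size] for i in range(0, len(requests), max_size)]
```

Each sub-batch then gets its own JSONL file (OpenAI) or API call (Anthropic), and its own entry in the job queue.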
**Error handling.** Both OpenAI and Anthropic return per-request errors within batch results. Your pipeline must handle: (1) batch-level failures (entire batch rejected), (2) request-level failures (individual requests fail), and (3) timeout failures (batch exceeds 24-hour window).
**Retry strategy.** Failed requests should be collected and resubmitted in a new batch, not retried individually via the synchronous API. Individual retries forfeit the 50% discount.
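Collecting failures for resubmission can look like the sketch below. It assumes each parsed result line carries a `custom_id`, an optional `error` field, and a `response.status_code`, consistent with the per-request error handling described above -- treat the exact field names as an assumption to verify against your provider's output format:

```python
def collect_failures(results):
    """Return the custom_ids of failed requests from parsed batch
    result lines, ready to be bundled into a fresh retry batch.

    A line counts as failed when it has a non-null `error` field or a
    non-200 response status code.
    """
    failed = []
    for line in results:
        if line.get("error") is not None:
            failed.append(line["custom_id"])
        elif line.get("response", {}).get("status_code", 200) != 200:
            failed.append(line["custom_id"])
    return failed
```

The returned IDs map back to your original request objects, which you resubmit together as a new batch so the retries keep the 50% discount.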
**Multi-provider routing.** Using [TokenMix.ai](https://tokenmix.ai)'s unified API, you can route batch jobs to different providers based on cost, speed, and availability. If OpenAI's batch queue is congested (visible via longer completion times), route to Anthropic's batch API, or vice versa.
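A minimal routing heuristic might pick whichever provider's batch queue has been completing fastest recently. Everything here is illustrative -- the provider names, the shape of `recent_hours`, and the fallback are assumptions for the sketch, not TokenMix.ai API behavior:

```python
import statistics

def pick_batch_provider(recent_hours):
    """Pick the provider whose recent batches completed fastest.

    `recent_hours` maps provider name -> list of recent batch
    completion times in hours (e.g. pulled from your monitoring).
    Falls back to "openai" when no data is available.
    """
    scored = {p: statistics.median(t) for p, t in recent_hours.items() if t}
    if not scored:
        return "openai"
    return min(scored, key=scored.get)
```

Using the median rather than the mean keeps one pathological 24-hour batch from masking an otherwise fast queue.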
---
Provider Async Support Comparison
| Provider | Batch API | Batch Discount | Webhook Support | Max Batch Size | Async SDK |
| --- | --- | --- | --- | --- | --- |
| **OpenAI** | Yes | 50% | No (polling only) | 50,000 | AsyncOpenAI |
| **Anthropic** | Yes | 50% | No (polling only) | 10,000 | AsyncAnthropic |
| **Google Gemini** | No (Vertex Batch Predict) | Varies | Pub/Sub integration | N/A | Native async |
| **DeepSeek** | No | N/A | No | N/A | AsyncOpenAI compatible |
| **Together** | No | N/A | No | N/A | AsyncTogether |
| **Groq** | No | N/A | No | N/A | AsyncGroq |
| **TokenMix.ai** | Via providers | Passed through | Planned | Provider limits | OpenAI compatible |
Only OpenAI and Anthropic offer first-party batch APIs with cost discounts. For other providers, async processing means concurrent requests with async SDK clients -- useful for throughput but no cost savings.
---
Full Comparison Table
| Feature | Sync API | OpenAI Batch | Anthropic Batch | Custom Async + Polling | Custom Webhooks |
| --- | --- | --- | --- | --- | --- |
| **Cost** | Standard | 50% off | 50% off | Standard | Standard |
| **Latency** | 2-10 sec | 2-24 hours | 2-24 hours | 2-10 sec | 2-10 sec + delivery |
| **Throughput** | Rate-limited | Unlimited in batch | Unlimited in batch | Rate-limited | Rate-limited |
| **Rate Limit Mgmt** | Manual | Automatic | Automatic | Manual | Manual |
| **Setup Effort** | None | Low | Low | Medium | High |
| **Infrastructure** | None | None | None | Queue + workers | Queue + workers + endpoint |
| **Real-Time** | Yes | No | No | Yes | Near real-time |
| **Error Handling** | Immediate | Per-request in results | Per-request in results | Custom | Custom |
| **Status Tracking** | N/A | Built-in | Built-in | Custom | Event-driven |
| **Provider Lock-In** | Per provider | OpenAI only | Anthropic only | None | None |
---
Cost Impact: Sync vs Async Processing
Monthly Cost Comparison: 500,000 Requests
Assumptions: GPT-5.4, average 5,000 input tokens and 1,000 output tokens per request.
| Method | Input Cost | Output Cost | Infrastructure | Total Monthly |
| --- | --- | --- | --- | --- |
| **Sync API** | $6,250 | $7,500 | $0 | **$13,750** |
| **Batch API (50% off)** | $3,125 | $3,750 | $0 | **$6,875** |
| **Async + Polling** | $6,250 | $7,500 | ~$100 | **$13,850** |
| **Custom Webhooks** | $6,250 | $7,500 | ~$200 | **$13,950** |
The Batch API saves $6,875/month -- $82,500/year -- on a 500,000 request/month workload. No other optimization delivers this magnitude of savings with so little engineering effort.
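The arithmetic behind these figures can be checked in a few lines; the prices and per-request token counts are the assumptions stated above:

```python
def monthly_cost(requests, in_tokens, out_tokens, in_price, out_price, discount=0.0):
    """Monthly token cost in dollars. Prices are per million tokens;
    `discount` is the batch discount fraction (0.5 for 50% off)."""
    input_cost = requests * in_tokens / 1e6 * in_price
    output_cost = requests * out_tokens / 1e6 * out_price
    return (input_cost + output_cost) * (1 - discount)

# 500,000 requests/month at 5,000 input and 1,000 output tokens each
sync = monthly_cost(500_000, 5_000, 1_000, 2.50, 15.00)                 # $13,750
batch = monthly_cost(500_000, 5_000, 1_000, 2.50, 15.00, discount=0.5)  # $6,875
```

Plugging in your own token averages is usually the fastest way to see whether a workload justifies the batch migration.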
When Async Pays for Itself
| Monthly Requests | Sync Cost | Batch Cost | Annual Savings |
| --- | --- | --- | --- |
| 10,000 | $275 | $137.50 | $1,650 |
| 100,000 | $2,750 | $1,375 | $16,500 |
| 500,000 | $13,750 | $6,875 | $82,500 |
| 1,000,000 | $27,500 | $13,750 | $165,000 |
At 10,000 requests/month and above, the batch API discount exceeds the cost of the engineering time to implement it. The breakeven is roughly day one.
---
Decision Guide: Which Async Pattern to Choose
| Your Situation | Choose | Why |
| --- | --- | --- |
| Non-urgent bulk processing, using OpenAI | **OpenAI Batch API** | 50% savings, zero infrastructure |
| Non-urgent bulk processing, using Claude | **Anthropic Batch API** | 50% savings, clean API |
| Need results within seconds | **Async SDK (concurrent)** | Real-time with parallelism |
| 10+ concurrent batch jobs | **Polling + queue** | Centralized tracking |
| 100+ concurrent jobs, event-driven arch | **Custom webhooks** | Scalable, decoupled |
| Multi-provider batch processing | **TokenMix.ai + provider batches** | Unified API, route to cheapest batch |
| Simple scripts or notebooks | **Sync with retry** | Minimal complexity |
| High-volume, cost-sensitive | **Batch API first, async fallback** | Maximum savings on eligible traffic |
---
Conclusion
AI API async processing is not optional for teams spending more than $500/month on LLM APIs. The OpenAI and Anthropic Batch APIs deliver 50% cost reduction with minimal implementation effort. For every dollar you spend on synchronous API calls that could have been batched, you are leaving 50 cents on the table.
The priority order is clear. First, move all non-real-time workloads to batch APIs (OpenAI or Anthropic). Second, implement async SDK patterns for real-time workloads that need higher throughput. Third, build LLM webhook infrastructure only if you have 100+ concurrent jobs and an event-driven architecture.
Through [TokenMix.ai](https://tokenmix.ai)'s unified API, you can route eligible requests to whichever provider's batch API offers the best combination of cost, speed, and availability. The platform tracks batch queue times across providers in real-time, helping you avoid congested queues and minimize completion latency.
Start with the batch API. The 50% discount pays for itself from the first request.
---
FAQ
Does OpenAI have a webhook for batch API completion?
No. As of April 2026, OpenAI does not offer native webhook notifications for batch completion. You must poll the batch status endpoint at regular intervals. Typical polling intervals are 60-300 seconds. For production systems with many concurrent batches, building a custom webhook layer with a job queue reduces wasted polling calls.
How much does the OpenAI Batch API save compared to the standard API?
The OpenAI Batch API offers a flat 50% discount on all requests. For GPT-5.4, this reduces pricing from $2.50/$15.00 to $1.25/$7.50 per million tokens. On a workload of 500,000 requests per month, this saves approximately $82,500 annually. The only tradeoff is a 24-hour completion window.
What is the difference between polling and webhooks for LLM APIs?
Polling repeatedly checks the provider's status endpoint until a job completes. Webhooks receive an HTTP POST notification when the job finishes. Polling is simpler but wastes compute on status checks. Webhooks are more efficient but require you to build and host an endpoint. No major LLM provider offers native webhook support -- you must build the webhook layer yourself.
Can I use Anthropic's Batch API with the OpenAI SDK?
Not directly. Anthropic's batch API uses the Anthropic SDK with its own message format. However, through unified API providers like TokenMix.ai that support the OpenAI SDK format, you can access batch-eligible pricing across providers with a single integration. The provider handles the translation between formats.
How long does a batch API request take to complete?
Both OpenAI and Anthropic have a 24-hour SLA for batch completion. In practice, TokenMix.ai's monitoring shows median completion times of 2-4 hours for batches under 5,000 requests and 6-12 hours for larger batches. Completion times vary based on queue load and request complexity.
Should I use async or batch processing for AI API calls?
Use batch processing (OpenAI Batch API or Anthropic Batches) for non-time-sensitive workloads to get 50% cost savings. Use async processing (concurrent API calls with AsyncOpenAI) for real-time workloads where you need results within seconds but want higher throughput than sequential calls. The two patterns solve different problems and can be combined in the same application.
---
*Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: [OpenAI](https://platform.openai.com/docs/guides/batch), [Anthropic](https://docs.anthropic.com/en/docs/build-with-claude/batch-processing), [TokenMix.ai](https://tokenmix.ai)*