TokenMix Research Lab · 2026-04-24
Anthropic Messages API Documentation: Real Examples 2026
Anthropic's Messages API is the primary entry point for calling Claude models. This reference covers the full request/response schema, rate limits by tier (1-4), max output token limits per model, streaming setup, tool use (function calling), vision input, prompt caching, and the Anthropic-specific quirks that catch new developers. Unlike OpenAI's chat completion API, Anthropic uses a simpler messages structure and requires explicit max_tokens on every request. All code examples verified against Anthropic Python SDK 0.40+, TypeScript SDK 0.35+, and direct curl as of April 24, 2026. TokenMix.ai exposes the same API surface via OpenAI-compatible endpoint if you prefer OpenAI SDK — or use Anthropic SDK directly as shown below.
Table of Contents
- Confirmed vs Speculation
- Request Schema Essentials
- Rate Limits by Tier
- Max Tokens Per Model
- Streaming Setup
- Tool Use (Function Calling)
- Vision Input
- Prompt Caching
- FAQ
Confirmed vs Speculation
| Claim | Status | Source |
|---|---|---|
| Base URL api.anthropic.com/v1/messages | Confirmed | Anthropic API docs |
| Python SDK anthropic==0.40.0+ current | Confirmed | PyPI |
| Messages differ from OpenAI chat API | Confirmed | Schema inspection |
| max_tokens required | Confirmed | API will error without it |
| Rate limits Tier 1: 50 req/min Sonnet, 20 Opus | Confirmed | Anthropic rate limits docs |
| Max output tokens: 8,192 for most models | Confirmed | Per-model |
| Claude 4.7 tokenizer differs from 4.6 | Confirmed | Tokenizer tax analysis |
Request Schema Essentials
Minimum viable Messages API request:
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from env
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[
{"role": "user", "content": "Hello, Claude."}
]
)
print(response.content[0].text)
Required fields:
- model — exact model ID such as claude-opus-4-7, claude-sonnet-4-6, or claude-haiku-4-5
- max_tokens — output token limit (REQUIRED, unlike OpenAI's optional parameter)
- messages — array of {role, content} objects (alternating user/assistant)
Common optional fields:
- system — system prompt as a top-level param (NOT a message entry)
- temperature (0-1.0)
- top_p (0-1.0)
- stop_sequences — list of strings
- metadata.user_id — for abuse tracking
- stream — boolean
- tools — array for function calling
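The optional fields can be combined into a single request payload. A minimal sketch (values are illustrative; with the official SDK you would pass the dict via client.messages.create(**params), not imported here):

```python
# Illustrative request payload combining the common optional fields.
# Note the system prompt sits at the top level, never inside messages.
params = {
    "model": "claude-sonnet-4-6",
    "max_tokens": 1024,
    "system": "You are a terse assistant.",   # top-level, NOT a message entry
    "temperature": 0.3,
    "stop_sequences": ["END_OF_ANSWER"],      # hypothetical stop string
    "metadata": {"user_id": "user-1234"},     # hypothetical user id
    "stream": False,
    "messages": [{"role": "user", "content": "Hello, Claude."}],
}
```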
Rate Limits by Tier
Anthropic tiers ramp based on spend. Current limits as of April 2026:
| Tier | Min spend | Opus 4.7 RPM | Sonnet 4.6 RPM | Haiku 4.5 RPM |
|---|---|---|---|---|
| Tier 1 | $0 (signup) | 20 | 50 | 100 |
| Tier 2 | $40 spent | 40 | 100 | 200 |
| Tier 3 | $200 spent | 80 | 200 | 400 |
| Tier 4 | $400 spent, 7 days | 400 | 1,000 | 2,000 |
| Custom | Enterprise contract | Custom | Custom | Custom |
Token-per-minute limits scale similarly. Tier upgrades are granted automatically based on cumulative spend. For enterprise-scale needs, contact Anthropic for custom SLAs.
Max Tokens Per Model
Output token ceiling per request:
| Model | Max output tokens | Max context |
|---|---|---|
| claude-haiku-4-5 | 8,192 | 200K |
| claude-sonnet-4-6 | 8,192 (64K with beta header) | 200K / 1M beta |
| claude-opus-4-7 | 8,192 (extended thinking higher) | 200K / 1M beta |
| claude-sonnet-3-7 | 8,192 | 200K |
For outputs >8192 tokens, use the extended output beta:
client.messages.create(
model="claude-sonnet-4-6",
max_tokens=64000,
extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
...
)
Streaming Setup
Stream for interactive chat UX:
with client.messages.stream(
model="claude-sonnet-4-6",
max_tokens=1024,
messages=[{"role": "user", "content": "Explain quantum tunneling."}]
) as stream:
for text in stream.text_stream:
print(text, end="", flush=True)
TypeScript equivalent:
const stream = await client.messages.stream({
model: "claude-sonnet-4-6",
max_tokens: 1024,
messages: [{role: "user", content: "..."}]
})
for await (const event of stream) {
if (event.type === "content_block_delta") {
process.stdout.write(event.delta.text);
}
}
Tool Use (Function Calling)
Define tools, let Claude decide when to call:
tools = [{
"name": "get_weather",
"description": "Get current weather for a location.",
"input_schema": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}]
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
tools=tools,
messages=[{"role": "user", "content": "Weather in Tokyo?"}]
)
# Response contains content blocks, check for tool_use
for block in response.content:
if block.type == "tool_use":
print(f"Call {block.name} with {block.input}")
Multi-turn tool loop: execute the tool, append the result as a user message containing a tool_result block, then call the API again.
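That loop can be sketched as follows, assuming a stop_reason of "tool_use" signals a pending call; execute is a hypothetical helper mapping a tool name and input to a result string:

```python
def tool_result_message(tool_use_id, result_text):
    # Tool results go back as a *user* message containing a
    # tool_result content block that references the tool_use id.
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": tool_use_id,
            "content": result_text,
        }],
    }

def run_tool_loop(client, model, messages, tools, execute):
    # `execute` is a hypothetical helper: (tool_name, tool_input) -> str.
    # Loop until Claude stops asking for tool calls.
    while True:
        resp = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages)
        if resp.stop_reason != "tool_use":
            return resp
        # Echo the assistant turn back, then answer each tool_use block.
        messages.append({"role": "assistant", "content": resp.content})
        for block in resp.content:
            if block.type == "tool_use":
                messages.append(
                    tool_result_message(block.id, execute(block.name, block.input)))
```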
Vision Input
Base64-encoded image or URL (the URL must be publicly accessible):
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=1024,
messages=[{
"role": "user",
"content": [
{"type": "image", "source": {
"type": "base64",
"media_type": "image/png",
"data": base64_encoded_image
}},
{"type": "text", "text": "What's in this image?"}
]
}]
)
Opus 4.7 supports images up to 3.75 megapixels; Sonnet 4.6 up to 3.0 MP; Haiku 4.5 up to 2.0 MP. Larger images are downsampled.
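The base64_encoded_image value in the example above is just a standard base64 string of the raw file bytes. A small helper sketch for building the content block (the block shape mirrors the example; the helper name is ours):

```python
import base64

def image_block(data: bytes, media_type: str = "image/png"):
    # Base64-encode raw image bytes into the image content-block
    # shape used by the Messages API example above.
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("ascii"),
        },
    }
```

Read the file with open(path, "rb").read() and pass the bytes in; the SDK expects the encoded string, not raw bytes.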
Prompt Caching
Cache expensive system prompts / long context to save 90% on repeated calls:
response = client.messages.create(
model="claude-sonnet-4-6",
max_tokens=1024,
system=[
{"type": "text", "text": "You are a helpful assistant."},
{
"type": "text",
"text": large_document_content, # e.g., 50K tokens
"cache_control": {"type": "ephemeral"}
}
],
messages=[{"role": "user", "content": "Summarize section 3."}]
)
The cache is valid for 5 minutes. Subsequent calls within that window pay 10% of the normal input-token cost for the cached content. Essential for RAG and long-context Q&A workflows.
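To sanity-check the savings, a back-of-envelope cost helper. The 10% multiplier comes from the note above; the $/MTok rate is a placeholder, so substitute your model's actual input price:

```python
def cached_input_cost(cache_read_tokens, fresh_tokens, price_per_mtok):
    # Cache reads bill at ~10% of the normal input rate (per the note above);
    # fresh (uncached) input tokens bill at the full rate.
    return (0.10 * cache_read_tokens + fresh_tokens) * price_per_mtok / 1_000_000

# e.g. 50K cached + 1K fresh tokens at a hypothetical $3/MTok input rate
```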
FAQ
Why does Anthropic require max_tokens when OpenAI doesn't?
It is a deliberate design choice that forces an explicit output budget and prevents runaway generation costs. When migrating code from the OpenAI SDK, add max_tokens to every request.
How do I handle rate limits programmatically?
Check response headers for anthropic-ratelimit-requests-remaining and retry-after. Implement exponential backoff with jitter. TokenMix.ai gateway handles this automatically with multi-provider fallback.
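A sketch of the backoff schedule itself ("full jitter": each delay is drawn uniformly from zero up to a capped exponential ceiling). A real client should prefer the server's retry-after header when present; the defaults here are arbitrary:

```python
import random

def backoff_delays(retries=5, base=1.0, cap=60.0, rng=random.random):
    # Full-jitter exponential backoff: delay_i ~ U(0, min(cap, base * 2**i)).
    # rng is injectable so the schedule can be tested deterministically.
    return [rng() * min(cap, base * 2 ** i) for i in range(retries)]
```

Sleep for each delay between attempts, and stop retrying once the request succeeds or the retry budget is exhausted.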
What's the difference between claude-3-5-sonnet-20241022-v2:0 and claude-sonnet-4-6?
First is versioned AWS Bedrock model ID, second is Anthropic's current canonical name. Use Anthropic's direct names for direct API; Bedrock/Vertex have their own prefixed naming conventions.
Can I use Anthropic's API with OpenAI SDK?
Not directly — different schemas. Via TokenMix.ai or similar gateway, yes — the gateway translates OpenAI SDK calls to Anthropic's format. Useful for code portability.
Does prompt caching work across requests from different users?
No — cache is per-account. If your app has 1000 users each querying the same doc, each user hits cold cache on first call. Consider caching strategically at your own layer.
How do I pass a system prompt?
Use the top-level system parameter, NOT a message. A common mistake is adding {role: "system"} to the messages array; Anthropic doesn't support that and the API returns an error.
What about structured output (JSON mode)?
Anthropic doesn't have an explicit JSON mode like OpenAI. Use tool use with input_schema to force structured outputs, or use prompt-level instruction: "Respond with valid JSON only." See structured output guide.
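A sketch of the tool-use approach: define a tool whose input_schema is the JSON shape you want, then require Claude to call it via tool_choice. The tool name and fields here are illustrative:

```python
# Force structured output by requiring a specific tool call.
extract_tool = {
    "name": "record_result",
    "description": "Record the extracted fields.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "year": {"type": "integer"},
        },
        "required": ["title", "year"],
    },
}

# Extra kwargs for client.messages.create(...): a tool_choice of type
# "tool" requires Claude to call exactly this tool, so the tool_use
# block's input conforms to the schema above.
request_extras = {
    "tools": [extract_tool],
    "tool_choice": {"type": "tool", "name": "record_result"},
}
```

Read the structured result from the tool_use block's input field rather than from response text.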
Sources
- Anthropic Messages API Docs
- Anthropic Rate Limits
- Anthropic Python SDK
- Anthropic TypeScript SDK
- Claude Opus 4.7 Review — TokenMix
- Prompt Caching Guide — TokenMix