TokenMix Research Lab · 2026-04-24

Anthropic Messages API Documentation: Real Examples 2026

Anthropic's Messages API is the primary entry point for calling Claude models. This reference covers the full request/response schema, rate limits by tier (1-4), max output token limits per model, streaming setup, tool use (function calling), vision input, prompt caching, and the Anthropic-specific quirks that catch new developers. Unlike OpenAI's chat completions API, Anthropic uses a simpler messages structure and requires an explicit max_tokens on every request. All code examples were verified against Anthropic Python SDK 0.40+, TypeScript SDK 0.35+, and direct curl as of April 24, 2026. TokenMix.ai exposes the same API surface via an OpenAI-compatible endpoint if you prefer the OpenAI SDK — or use the Anthropic SDK directly as shown below.

Confirmed vs Speculation

Claim Status Source
Base URL api.anthropic.com/v1/messages Confirmed Anthropic API docs
Python SDK anthropic==0.40.0+ current Confirmed PyPI
Messages differ from OpenAI chat API Confirmed Schema inspection
max_tokens required Confirmed API will error without it
Rate limits Tier 1: 50 req/min Sonnet, 20 Opus Confirmed Anthropic rate limits docs
Max output tokens: 8192 for most models Confirmed Per-model
Claude 4.7 tokenizer differs from 4.6 Confirmed Tokenizer tax analysis

Snapshot note (2026-04-24): SDK version numbers (Python 0.40+, TypeScript 0.35+), rate-limit tier thresholds, and the output-128k-2025-02-19 beta header reflect Anthropic's configuration at snapshot. Beta headers rotate periodically — check the current values on platform.claude.com/docs before copy-pasting into production.

Request Schema Essentials

Minimum viable Messages API request:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude."}
    ]
)

print(response.content[0].text)

Required fields:

- model — model ID string (e.g., claude-sonnet-4-6)
- max_tokens — maximum output tokens; required on every request
- messages — list of {role, content} turns (user / assistant only)

Common optional fields:

- system — top-level system prompt (string, or list of content blocks)
- temperature, top_p, top_k — sampling controls
- stop_sequences — custom strings that end generation early
- stream — set true for server-sent events
- tools / tool_choice — function-calling definitions
- metadata — e.g., an opaque end-user ID for abuse tracking
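A request combining the common optional fields — system, temperature, stop_sequences — might look like the sketch below. The values are illustrative, and the payload is built separately so it can be inspected before spending tokens:

```python
def build_request(question: str) -> dict:
    """Assemble a Messages API payload using common optional fields."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,                      # required on every request
        "system": "You are a terse assistant.",  # top-level, NOT a message
        "temperature": 0.2,                      # lower = more deterministic
        "stop_sequences": ["\n\nHuman:"],        # cut generation early
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str) -> str:
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from env
    client = anthropic.Anthropic()
    response = client.messages.create(**build_request(question))
    return response.content[0].text
```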

Rate Limits by Tier

Anthropic's rate-limit tiers ramp up with cumulative spend. Current limits as of April 2026:

Tier Min spend Opus 4.7 RPM Sonnet 4.6 RPM Haiku 4.5 RPM
Tier 1 $0 (signup) 20 50 100
Tier 2 $40 spent 40 100 200
Tier 3 $200 spent 80 200 400
Tier 4 $400 spent, 7 days 400 1,000 2,000
Custom Enterprise contract Custom Custom Custom

Token-per-minute limits scale similarly. Tier upgrades happen automatically as your spend crosses each threshold. For enterprise-scale needs, contact Anthropic for custom SLAs.

Max Tokens Per Model

Output token ceiling per request:

Model Max output tokens Max context
claude-haiku-4-5 8,192 200K
claude-sonnet-4-6 8,192 (64K with beta header) 200K / 1M beta
claude-opus-4-7 8,192 (extended thinking higher) 200K / 1M beta
claude-sonnet-3-7 8,192 200K

For outputs >8192 tokens, use the extended output beta:

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=64000,
    extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
    messages=[{"role": "user", "content": "Write the full report."}],
)

Streaming Setup

Stream for interactive chat UX:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

TypeScript equivalent:

const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{role: "user", content: "..."}]
})

for await (const event of stream) {
    if (event.type === "content_block_delta") {
        process.stdout.write(event.delta.text);
    }
}

Tool Use (Function Calling)

Define tools, let Claude decide when to call:

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"}
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Weather in Tokyo?"}]
)

# Response contains content blocks, check for tool_use
for block in response.content:
    if block.type == "tool_use":
        print(f"Call {block.name} with {block.input}")

Multi-turn tool loop: execute the tool locally, append the result as a user message containing a tool_result block, then call the API again.
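That loop can be sketched as follows. The run_tool dispatcher is a hypothetical stand-in for your real tools; the tool_use / tool_result block shapes and the "tool_use" stop_reason match the schema shown above:

```python
def make_tool_result(tool_use_id: str, content: str) -> dict:
    # Shape of the tool_result block Claude expects back, sent inside
    # a *user* message and referencing the tool_use block's id.
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}

def run_tool(name: str, tool_input: dict) -> str:
    # Hypothetical local dispatcher -- replace with your real tools.
    if name == "get_weather":
        return f"Sunny, 21C in {tool_input['location']}"
    raise ValueError(f"unknown tool: {name}")

def tool_loop(client, model: str, tools: list, messages: list) -> str:
    # Keep calling until Claude stops asking for tools.
    while True:
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text
        # Echo the assistant turn, then answer every tool_use block.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            make_tool_result(block.id, run_tool(block.name, block.input))
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```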

Vision Input

Base64 image or URL (the URL must be publicly accessible):

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_encoded_image
            }},
            {"type": "text", "text": "What's in this image?"}
        ]
    }]
)

Opus 4.7 supports 3.75 megapixel images; Sonnet 4.6 supports 3.0 MP; Haiku 4.5 supports 2.0 MP. Above these, images are downsampled.
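A quick pre-flight check against those ceilings. The per-model limits are the snapshot values quoted above and may change:

```python
# Megapixel ceilings quoted above (snapshot values, may change).
MP_LIMIT = {
    "claude-opus-4-7": 3.75,
    "claude-sonnet-4-6": 3.0,
    "claude-haiku-4-5": 2.0,
}

def will_downsample(width: int, height: int, model: str) -> bool:
    """True if the image exceeds the model's megapixel ceiling."""
    megapixels = (width * height) / 1_000_000
    return megapixels > MP_LIMIT[model]
```

Checking before upload lets you resize client-side instead of paying to transmit pixels the API will discard anyway.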

Prompt Caching

Cache expensive system prompts / long context to save 90% on repeated calls:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {
            "type": "text",
            "text": large_document_content,  # e.g., 50K tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}]
)

Cache entries are valid for 5 minutes. Subsequent calls within that window pay 10% of the normal input-token price for the cached portion. Essential for RAG and long-context Q&A workflows.
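Back-of-envelope savings using the 10%-of-input-price figure above. The $3/MTok base price here is a hypothetical placeholder, not a quoted rate, and any cache-write premium is ignored:

```python
def input_cost(prefix_tokens: int, calls: int, price_per_mtok: float,
               cached: bool) -> float:
    """Input-token cost for `calls` requests sharing one prompt prefix."""
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    if not cached:
        return calls * per_call
    # First call writes the cache at the base input price (write
    # premiums, if any, ignored); later hits within the window pay 10%.
    return per_call + (calls - 1) * per_call * 0.10

# 50K-token document, 10 calls, hypothetical $3/MTok base price:
cold = input_cost(50_000, 10, 3.0, cached=False)  # $1.50
warm = input_cost(50_000, 10, 3.0, cached=True)   # $0.285
```

At 10 calls the cached path already costs roughly a fifth of the cold path, and the gap widens with every additional hit inside the window.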

FAQ

Why does Anthropic require max_tokens when OpenAI doesn't?

It's a deliberate design choice: forcing an explicit output budget prevents runaway generation costs. When migrating code from the OpenAI SDK, add max_tokens to every request.

How do I handle rate limits programmatically?

Check response headers for anthropic-ratelimit-requests-remaining and retry-after. Implement exponential backoff with jitter. TokenMix.ai gateway handles this automatically with multi-provider fallback.
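A minimal sketch of that backoff policy. The header names are as above; the base delay and cap are arbitrary choices, and `send` stands in for whatever function issues the request:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 32.0) -> float:
    """Full-jitter exponential backoff: uniform over [0, min(cap, base*2^n)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(send, max_attempts: int = 5):
    # `send` is any zero-arg callable that raises on HTTP 429.
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception as err:  # narrow to your SDK's RateLimitError in practice
            if attempt == max_attempts - 1:
                raise
            # Honor a server-provided retry-after when present, else back off.
            retry_after = getattr(err, "retry_after", None)
            time.sleep(retry_after if retry_after else backoff_delay(attempt))
```

Jitter matters: without it, every client that hit the same 429 retries in lockstep and re-triggers the limit together.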

What's the difference between claude-3-5-sonnet-20241022-v2:0 and claude-sonnet-4-6?

The first is a versioned AWS Bedrock model ID; the second is Anthropic's current canonical name. Use Anthropic's names against the direct API; Bedrock and Vertex have their own prefixed naming conventions.

Can I use Anthropic's API with OpenAI SDK?

Not directly — different schemas. Via TokenMix.ai or similar gateway, yes — the gateway translates OpenAI SDK calls to Anthropic's format. Useful for code portability.
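The translation such a gateway performs can be sketched for the two quirks that matter most: the system message moves to a top-level field, and max_tokens becomes mandatory (the 1024 default here is an arbitrary choice):

```python
def to_anthropic(openai_payload: dict, default_max_tokens: int = 1024) -> dict:
    """Sketch of an OpenAI chat-completions -> Anthropic Messages translation."""
    out = {
        "model": openai_payload["model"],
        # Anthropic rejects requests without an explicit max_tokens.
        "max_tokens": openai_payload.get("max_tokens", default_max_tokens),
        "messages": [],
    }
    for msg in openai_payload["messages"]:
        if msg["role"] == "system":
            # OpenAI keeps system in the messages array; Anthropic
            # takes it as a top-level parameter instead.
            out["system"] = msg["content"]
        else:
            out["messages"].append(msg)
    return out
```

A real gateway also remaps streaming event shapes, tool-call formats, and error codes; this covers only the request schema.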

Does prompt caching work across requests from different users?

No — the cache is scoped to your account and keyed on the exact prompt prefix, so it never leaks across Anthropic customers. Within your own app, though, 1000 users sending the same cached prefix through your key share one warm cache entry; only requests whose prefix differs, or that arrive after the 5-minute window, pay the cold price. Consider caching strategically at your own layer for anything the API cache can't cover.

How do I pass a system prompt?

Use the top-level system parameter, NOT a message. Common mistake: adding {role: "system"} to the messages array — Anthropic doesn't support that, and the API will return an error.

What about structured output (JSON mode)?

Anthropic doesn't have an explicit JSON mode like OpenAI. Use tool use with input_schema to force structured outputs, or use prompt-level instruction: "Respond with valid JSON only." See structured output guide.
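One common pattern: define a single tool whose input_schema is your desired JSON shape, then force it with tool_choice. The record_summary tool name and schema below are invented for this example:

```python
# Hypothetical tool whose *input* IS the structured output you want.
summary_tool = {
    "name": "record_summary",
    "description": "Record a structured summary of the text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "key_points": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "key_points"],
    },
}

def summarize(text: str) -> dict:
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=[summary_tool],
        # Force Claude to call this tool, guaranteeing schema-shaped output.
        tool_choice={"type": "tool", "name": "record_summary"},
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    block = next(b for b in response.content if b.type == "tool_use")
    return block.input  # already a dict matching input_schema
```

Because the model must emit arguments matching input_schema, you get parsed JSON without string-level post-processing.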

