TokenMix Research Lab · 2026-04-24

Anthropic Messages API Documentation: Real Examples 2026

Anthropic's Messages API is the primary entry point for calling Claude models. This reference covers the full request/response schema, rate limits by tier (1-4), max output token limits per model, streaming setup, tool use (function calling), vision input, prompt caching, and the Anthropic-specific quirks that catch new developers. Unlike OpenAI's chat completions API, Anthropic uses a simpler messages structure and requires an explicit max_tokens on every request. All code examples were verified against Anthropic Python SDK 0.40+, TypeScript SDK 0.35+, and direct curl as of April 24, 2026. TokenMix.ai exposes the same API surface via an OpenAI-compatible endpoint if you prefer the OpenAI SDK — or use the Anthropic SDK directly as shown below.

Confirmed vs Speculation

Claim Status Source
Base URL api.anthropic.com/v1/messages Confirmed Anthropic API docs
Python SDK anthropic==0.40.0+ current Confirmed PyPI
Messages differ from OpenAI chat API Confirmed Schema inspection
max_tokens required Confirmed API will error without it
Rate limits Tier 1: 50 req/min Sonnet, 20 Opus Confirmed Anthropic rate limits docs
Max output tokens: 8192 for most models Confirmed Per-model
Claude 4.7 tokenizer differs from 4.6 Confirmed Tokenizer tax analysis

Snapshot note (2026-04-24): SDK version numbers (Python 0.40+, TypeScript 0.35+), rate-limit tier thresholds, and the output-128k-2025-02-19 beta header reflect Anthropic's configuration at snapshot. Beta headers rotate periodically — check the current values on platform.claude.com/docs before copy-pasting into production.

Request Schema Essentials

Minimum viable Messages API request:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Hello, Claude."}
    ]
)

print(response.content[0].text)

Required fields:

- model — model ID string (e.g., claude-sonnet-4-6)
- max_tokens — maximum output tokens; required on every request
- messages — list of {role, content} turns (user / assistant only)

Common optional fields:

- system — top-level system prompt (string, or list of content blocks)
- temperature, top_p, top_k — sampling controls
- stop_sequences — custom strings that end generation early
- stream — set true for server-sent events
- tools / tool_choice — function-calling definitions
- metadata — e.g., an opaque end-user ID for abuse tracking
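A request combining the common optional fields — system, temperature, stop_sequences — might look like the sketch below. The values are illustrative, and the payload is built separately so it can be inspected before spending tokens:

```python
def build_request(question: str) -> dict:
    """Assemble a Messages API payload using common optional fields."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,                      # required on every request
        "system": "You are a terse assistant.",  # top-level, NOT a message
        "temperature": 0.2,                      # lower = more deterministic
        "stop_sequences": ["\n\nHuman:"],        # cut generation early
        "messages": [{"role": "user", "content": question}],
    }

def ask(question: str) -> str:
    import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from env
    client = anthropic.Anthropic()
    response = client.messages.create(**build_request(question))
    return response.content[0].text
```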

Rate Limits by Tier

Anthropic's rate-limit tiers ramp up with cumulative spend. Current limits as of April 2026:

Tier Min spend Opus 4.7 RPM Sonnet 4.6 RPM Haiku 4.5 RPM
Tier 1 $0 (signup) 20 50 100
Tier 2 $40 spent 40 100 200
Tier 3 $200 spent 80 200 400
Tier 4 $400 spent, 7 days 400 1,000 2,000
Custom Enterprise contract Custom Custom Custom

Token-per-minute limits scale similarly. Tier upgrades happen automatically as your spend crosses each threshold. For enterprise-scale needs, contact Anthropic for custom SLAs.

Max Tokens Per Model

Output token ceiling per request:

Model Max output tokens Max context
claude-haiku-4-5 8,192 200K
claude-sonnet-4-6 8,192 (64K with beta header) 200K / 1M beta
claude-opus-4-7 8,192 (extended thinking higher) 200K / 1M beta
claude-sonnet-3-7 8,192 200K

For outputs >8192 tokens, use the extended output beta:

client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=64000,
    extra_headers={"anthropic-beta": "output-128k-2025-02-19"},
    messages=[{"role": "user", "content": "Write the full report."}],
)

Streaming Setup

Stream for interactive chat UX:

with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain quantum tunneling."}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

TypeScript equivalent:

const stream = await client.messages.stream({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    messages: [{role: "user", content: "..."}]
})

for await (const event of stream) {
    if (event.type === "content_block_delta") {
        process.stdout.write(event.delta.text);
    }
}

Tool Use (Function Calling)

Define tools, let Claude decide when to call:

tools = [{
    "name": "get_weather",
    "description": "Get current weather for a location.",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {"type": "string"}
        },
        "required": ["location"]
    }
}]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Weather in Tokyo?"}]
)

# Response contains content blocks, check for tool_use
for block in response.content:
    if block.type == "tool_use":
        print(f"Call {block.name} with {block.input}")

Multi-turn tool loop: execute the tool locally, append the result as a user message containing a tool_result block, then call the API again.
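That loop can be sketched as follows. The run_tool dispatcher is a hypothetical stand-in for your real tools; the tool_use / tool_result block shapes and the "tool_use" stop_reason match the schema shown above:

```python
def make_tool_result(tool_use_id: str, content: str) -> dict:
    # Shape of the tool_result block Claude expects back, sent inside
    # a *user* message and referencing the tool_use block's id.
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}

def run_tool(name: str, tool_input: dict) -> str:
    # Hypothetical local dispatcher -- replace with your real tools.
    if name == "get_weather":
        return f"Sunny, 21C in {tool_input['location']}"
    raise ValueError(f"unknown tool: {name}")

def tool_loop(client, model: str, tools: list, messages: list) -> str:
    # Keep calling until Claude stops asking for tools.
    while True:
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text
        # Echo the assistant turn, then answer every tool_use block.
        messages.append({"role": "assistant", "content": response.content})
        results = [
            make_tool_result(block.id, run_tool(block.name, block.input))
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```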

Vision Input

Base64 image or URL (the URL must be publicly accessible):

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64_encoded_image
            }},
            {"type": "text", "text": "What's in this image?"}
        ]
    }]
)

Opus 4.7 supports 3.75 megapixel images; Sonnet 4.6 supports 3.0 MP; Haiku 4.5 supports 2.0 MP. Above these, images are downsampled.
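A quick pre-flight check against those ceilings. The per-model limits are the snapshot values quoted above and may change:

```python
# Megapixel ceilings quoted above (snapshot values, may change).
MP_LIMIT = {
    "claude-opus-4-7": 3.75,
    "claude-sonnet-4-6": 3.0,
    "claude-haiku-4-5": 2.0,
}

def will_downsample(width: int, height: int, model: str) -> bool:
    """True if the image exceeds the model's megapixel ceiling."""
    megapixels = (width * height) / 1_000_000
    return megapixels > MP_LIMIT[model]
```

Checking before upload lets you resize client-side instead of paying to transmit pixels the API will discard anyway.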

Prompt Caching

Cache expensive system prompts / long context to save 90% on repeated calls:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You are a helpful assistant."},
        {
            "type": "text",
            "text": large_document_content,  # e.g., 50K tokens
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": "Summarize section 3."}]
)

Cache entries are valid for 5 minutes. Subsequent calls within that window pay 10% of the normal input-token price for the cached portion. Essential for RAG and long-context Q&A workflows.
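Back-of-envelope savings using the 10%-of-input-price figure above. The $3/MTok base price here is a hypothetical placeholder, not a quoted rate, and any cache-write premium is ignored:

```python
def input_cost(prefix_tokens: int, calls: int, price_per_mtok: float,
               cached: bool) -> float:
    """Input-token cost for `calls` requests sharing one prompt prefix."""
    per_call = prefix_tokens / 1_000_000 * price_per_mtok
    if not cached:
        return calls * per_call
    # First call writes the cache at the base input price (write
    # premiums, if any, ignored); later hits within the window pay 10%.
    return per_call + (calls - 1) * per_call * 0.10

# 50K-token document, 10 calls, hypothetical $3/MTok base price:
cold = input_cost(50_000, 10, 3.0, cached=False)  # $1.50
warm = input_cost(50_000, 10, 3.0, cached=True)   # $0.285
```

At 10 calls the cached path already costs roughly a fifth of the cold path, and the gap widens with every additional hit inside the window.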

FAQ

Why does Anthropic require max_tokens when OpenAI doesn't?

It's a deliberate design choice: forcing an explicit output budget prevents runaway generation costs. When migrating code from the OpenAI SDK, add max_tokens to every request.

How do I handle rate limits programmatically?

Check response headers for anthropic-ratelimit-requests-remaining and retry-after. Implement exponential backoff with jitter. TokenMix.ai gateway handles this automatically with multi-provider fallback.
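A minimal sketch of that backoff policy. The header names are as above; the base delay and cap are arbitrary choices, and `send` stands in for whatever function issues the request:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 32.0) -> float:
    """Full-jitter exponential backoff: uniform over [0, min(cap, base*2^n)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(send, max_attempts: int = 5):
    # `send` is any zero-arg callable that raises on HTTP 429.
    for attempt in range(max_attempts):
        try:
            return send()
        except Exception as err:  # narrow to your SDK's RateLimitError in practice
            if attempt == max_attempts - 1:
                raise
            # Honor a server-provided retry-after when present, else back off.
            retry_after = getattr(err, "retry_after", None)
            time.sleep(retry_after if retry_after else backoff_delay(attempt))
```

Jitter matters: without it, every client that hit the same 429 retries in lockstep and re-triggers the limit together.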

What's the difference between claude-3-5-sonnet-20241022-v2:0 and claude-sonnet-4-6?

The first is a versioned AWS Bedrock model ID; the second is Anthropic's current canonical name. Use Anthropic's names against the direct API; Bedrock and Vertex have their own prefixed naming conventions.

Can I use Anthropic's API with OpenAI SDK?

Not directly — different schemas. Via TokenMix.ai or similar gateway, yes — the gateway translates OpenAI SDK calls to Anthropic's format. Useful for code portability.
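The translation such a gateway performs can be sketched for the two quirks that matter most: the system message moves to a top-level field, and max_tokens becomes mandatory (the 1024 default here is an arbitrary choice):

```python
def to_anthropic(openai_payload: dict, default_max_tokens: int = 1024) -> dict:
    """Sketch of an OpenAI chat-completions -> Anthropic Messages translation."""
    out = {
        "model": openai_payload["model"],
        # Anthropic rejects requests without an explicit max_tokens.
        "max_tokens": openai_payload.get("max_tokens", default_max_tokens),
        "messages": [],
    }
    for msg in openai_payload["messages"]:
        if msg["role"] == "system":
            # OpenAI keeps system in the messages array; Anthropic
            # takes it as a top-level parameter instead.
            out["system"] = msg["content"]
        else:
            out["messages"].append(msg)
    return out
```

A real gateway also remaps streaming event shapes, tool-call formats, and error codes; this covers only the request schema.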

Does prompt caching work across requests from different users?

No — the cache is scoped to your account and keyed on the exact prompt prefix, so it never leaks across Anthropic customers. Within your own app, though, 1000 users sending the same cached prefix through your key share one warm cache entry; only requests whose prefix differs, or that arrive after the 5-minute window, pay the cold price. Consider caching strategically at your own layer for anything the API cache can't cover.

How do I pass a system prompt?

Use the top-level system parameter, NOT a message. Common mistake: adding {role: "system"} to the messages array — Anthropic doesn't support that, and the API will return an error.

What about structured output (JSON mode)?

Anthropic doesn't have an explicit JSON mode like OpenAI. Use tool use with input_schema to force structured outputs, or use prompt-level instruction: "Respond with valid JSON only." See structured output guide.
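One common pattern: define a single tool whose input_schema is your desired JSON shape, then force it with tool_choice. The record_summary tool name and schema below are invented for this example:

```python
# Hypothetical tool whose *input* IS the structured output you want.
summary_tool = {
    "name": "record_summary",
    "description": "Record a structured summary of the text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "key_points": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["title", "key_points"],
    },
}

def summarize(text: str) -> dict:
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=[summary_tool],
        # Force Claude to call this tool, guaranteeing schema-shaped output.
        tool_choice={"type": "tool", "name": "record_summary"},
        messages=[{"role": "user", "content": f"Summarize:\n\n{text}"}],
    )
    block = next(b for b in response.content if b.type == "tool_use")
    return block.input  # already a dict matching input_schema
```

Because the model must emit arguments matching input_schema, you get parsed JSON without string-level post-processing.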

