TokenMix Research Lab · 2026-04-10

Function Calling and Tool Use for LLMs: Complete Guide Across OpenAI, Anthropic, Google, and DeepSeek (2026)
Last Updated: 2026-04-29
Function calling adds 346 tokens average per call (OpenAI), 512 (Claude), 318 (Gemini), 295 (DeepSeek). OpenAI leads tool-selection accuracy at 97-99%; DeepSeek 90-95% but 10x cheaper. At 100K calls/month, overhead alone runs $8-$154.
Function calling lets LLMs interact with external systems -- databases, APIs, calculators, search engines -- by generating structured tool invocations instead of plain text. Based on TokenMix.ai analysis, function calling adds an average of 295 to 512 extra tokens per call depending on the provider (346 for OpenAI), and the implementation differs significantly across providers. This guide covers how to implement function calling with OpenAI, Anthropic, Google, and DeepSeek, with code examples, cost calculations, and reliability data.
If you are building an AI application that needs to take actions in the real world, function calling is the mechanism that makes it work.
Table of Contents
- Quick Comparison: Function Calling Across Providers
- What Is Function Calling and Why It Matters
- OpenAI Function Calling: The Industry Standard
- Anthropic Claude Tool Use: A Different Approach
- Google Gemini Function Calling: Native Integration
- DeepSeek Function Calling: Budget Option
- Full Comparison Table: Function Calling Capabilities
- Code Examples: Function Calling in Python and Node.js
- Token Overhead: The Hidden Cost of Function Calling
- Parallel and Sequential Function Calling
- Which Provider Should You Pick for Tool Use?
- What's the Bottom Line on Function Calling?
- FAQ
Quick Comparison: Function Calling Across Providers
OpenAI: 128 tools, 97-99% accuracy. Claude: 64 tools, 96-99%, unique format. Gemini: 64 tools, 95-98%, low overhead, native auto-execute. DeepSeek: 32 tools, 90-95%, OpenAI-compatible, lowest overhead, 10x cheaper.
| Feature | OpenAI | Anthropic Claude | Google Gemini | DeepSeek |
|---|---|---|---|---|
| API Parameter Name | tools | tools | tools / function_declarations | tools |
| Parallel Calls | Yes (native) | Yes (native) | Yes | Limited |
| Forced Tool Use | tool_choice: required | tool_choice: {"type": "tool"} | tool_config: ANY | tool_choice: required |
| Max Tools per Request | 128 | 64 | 64 | 32 |
| Avg. Token Overhead | 200-400 tokens | 300-500 tokens | 180-350 tokens | 150-300 tokens |
| Streaming + Tools | Yes | Yes | Yes | Yes |
| Nested Parameters | Yes | Yes | Yes | Limited |
| Reliability (correct tool selection) | 97-99% | 96-99% | 95-98% | 90-95% |
What Is Function Calling and Why It Matters
Four-step pattern: define tools → send with user message → model picks tool + args → app executes and returns. This is the foundation of every agent, copilot, and AI system that touches the real world, and it is the most common failure point in production.
Function calling (also called tool use) is the mechanism that turns LLMs from text generators into action-taking agents. Without function calling, an LLM can only respond with text. With function calling, it can decide to call a weather API, query a database, send an email, or execute any function you define.
How it works in four steps:
- You define available tools (functions) with names, descriptions, and parameter schemas
- You send a user message along with the tool definitions
- The model decides whether to call a tool and generates a structured invocation (function name + arguments)
- Your application executes the function and returns the result to the model for a final response
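The four steps above can be sketched without any network call. This is a minimal illustration, not any provider's API: `fake_model`, `TOOL_REGISTRY`, and `dispatch` are hypothetical names, and the model's decision is hard-coded so the example stays runnable offline.

```python
import json

# Step 1: a tool the application exposes (stub; a real version would call a weather API).
def get_weather(location: str, unit: str = "celsius") -> dict:
    return {"location": location, "temperature": 22, "unit": unit}

# Maps tool names to the functions behind them (hypothetical helper).
TOOL_REGISTRY = {"get_weather": get_weather}

def fake_model(user_message: str) -> dict:
    """Hypothetical stand-in for steps 2-3: returns a structured invocation."""
    return {"name": "get_weather", "arguments": json.dumps({"location": "Tokyo"})}

def dispatch(invocation: dict) -> str:
    """Step 4: look up the requested function, run it, and serialize the result."""
    func = TOOL_REGISTRY.get(invocation["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {invocation['name']}"})
    args = json.loads(invocation["arguments"])
    return json.dumps(func(**args))

invocation = fake_model("What's the weather in Tokyo?")
result = dispatch(invocation)
print(result)
```

In a real integration the serialized result would be sent back to the model, which then writes the final answer.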
This is the foundation of AI agents, copilots, and any AI system that needs to interact with external data or services.
Why TokenMix.ai tracks function calling: We monitor function calling reliability across 300+ models because it is the most common failure point in production AI systems. A model that picks the wrong tool, hallucinates parameters, or fails to call a tool when it should can break entire workflows.
OpenAI Function Calling: The Industry Standard
Most mature implementation: parallel calls, strict mode (forces schema match), tool_choice control (auto/required/specific). 128 tools max. 50-100 tokens per definition; ~346 total overhead in production.
OpenAI established the function calling pattern that most other providers now follow. Their implementation is the most mature, most documented, and most widely adopted.
Basic Implementation
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto"
)

# Check if model wants to call a function
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
Key Features
Parallel function calling: GPT-4o can generate multiple tool calls in a single response. If a user asks for weather in three cities, the model returns three tool_calls simultaneously rather than one at a time. This reduces round trips and latency.
Strict mode: Adding "strict": true to your function definition forces the model to generate arguments that exactly match your parameter schema. Without strict mode, the model occasionally produces arguments with wrong types or missing required fields (2-5% of calls in TokenMix.ai testing).
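As a rough sketch of what a strict-mode definition can look like, the helper below converts an ordinary OpenAI-format tool. It assumes the commonly documented strict-mode conventions (every property listed as required, optional fields expressed as nullable, and additionalProperties set to false); `make_strict` is a hypothetical helper, not part of the SDK, and it only handles top-level properties.

```python
import copy

def make_strict(tool: dict) -> dict:
    """Return a copy of an OpenAI-format tool with strict-mode settings applied.

    Only top-level properties are handled; nested objects would need the same treatment.
    """
    strict_tool = copy.deepcopy(tool)
    fn = strict_tool["function"]
    params = fn["parameters"]
    originally_required = set(params.get("required", []))
    for name, schema in params["properties"].items():
        if name not in originally_required and isinstance(schema.get("type"), str):
            # Optional fields become nullable because every field must be listed as required.
            schema["type"] = [schema["type"], "null"]
    params["required"] = list(params["properties"])
    params["additionalProperties"] = False
    fn["strict"] = True
    return strict_tool

weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}

strict_weather_tool = make_strict(weather_tool)
print(strict_weather_tool["function"]["parameters"]["required"])
```

The original definition is left untouched, so the same tool list can be reused with providers that do not accept the strict flag.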
Tool choice control:
- "auto": Model decides whether to call a tool (default)
- "required": Model must call at least one tool
- {"type": "function", "function": {"name": "specific_tool"}}: Model must call a specific tool
Token overhead: Each tool definition adds approximately 50-100 tokens to the request. With 5 tools defined, expect 250-500 tokens of overhead per call. The tool call response adds another 30-80 tokens. TokenMix.ai data shows an average of 346 tokens total overhead across typical production deployments with 3-5 tools.
Anthropic Claude Tool Use: A Different Approach
Tool calls return as tool_use content blocks (interleaved with text), not as a separate field. Results are sent back via tool_result blocks with matching IDs. Overhead runs about 30% higher than OpenAI; Claude's own selection accuracy improves 5-8% when tool descriptions are detailed.
Anthropic uses the term "tool use" rather than "function calling." The core concept is identical, but the API structure differs. Claude's tool use is built into the Messages API and uses a different response format than OpenAI.
Basic Implementation
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    tools=[
        {
            "name": "get_weather",
            "description": "Get current weather for a location. Use this when the user asks about weather conditions.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, e.g., San Francisco"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    ],
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}]
)

# Claude returns tool_use content blocks
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool: {block.name}")
        print(f"Input: {block.input}")
        print(f"Tool Use ID: {block.id}")
Key Differences from OpenAI
Response structure: Claude returns tool_use content blocks within the message content array, not as a separate tool_calls field. This means a single response can contain both text and tool calls interleaved.
Tool result format: When returning results to Claude, you send a tool_result content block with the matching tool_use_id. This is more explicit than OpenAI's approach and makes multi-turn tool conversations clearer.
Description quality matters more: TokenMix.ai testing shows that Claude's tool selection accuracy improves by 5-8% when tool descriptions are detailed and include usage examples. OpenAI is less sensitive to description quality.
Token overhead: Claude's tool definitions consume approximately 300-500 tokens per request with 3-5 tools, about 30% more than OpenAI. The higher overhead comes from Claude's longer system prompt handling of tool definitions.
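The structural differences described above are mechanical enough to bridge with two small helpers. This is a sketch based only on the request shapes shown in this guide; `openai_tool_to_anthropic` and `anthropic_tool_result` are hypothetical names, not SDK functions.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Map an OpenAI-format tool definition onto Claude's name/description/input_schema shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],
    }

def anthropic_tool_result(tool_use_id: str, content: str) -> dict:
    """Build the user message that returns a tool result, keyed by the matching tool_use ID."""
    return {
        "role": "user",
        "content": [
            {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
        ],
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

claude_tool = openai_tool_to_anthropic(openai_tool)
# "toolu_example" is a placeholder ID; a real ID comes from the tool_use block.
result_message = anthropic_tool_result("toolu_example", '{"temperature": 22}')
print(claude_tool["name"], result_message["content"][0]["type"])
```

Keeping definitions in one format and converting at the edge means the tool list only has to be maintained once.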
Returning Tool Results to Claude
# After executing the function, return results.
# `tools` is the same list of tool definitions sent in the first request.
# Find the tool_use block in the first response so its ID can be echoed back.
tool_use_block = next(b for b in response.content if b.type == "tool_use")

response = client.messages.create(
    model="claude-sonnet-4-6-20260401",
    max_tokens=1024,
    tools=tools,  # Same tool definitions
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"},
        {"role": "assistant", "content": response.content},
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use_block.id,
                    "content": '{"temperature": 22, "condition": "Partly cloudy"}'
                }
            ]
        }
    ]
)
Google Gemini Function Calling: Native Integration
Pass Python functions directly: the SDK extracts the schema from the signature and docstring. Auto-execution mode runs functions for you. Lower overhead than OpenAI or Claude (180-350 tokens with 3-5 tools). 95-98% accuracy; drops to 90-94% with 10+ tools.
Gemini's function calling integrates deeply with Google Cloud services. It supports both standard function declarations and automatic function calling, where the SDK handles the execution loop for you.
Basic Implementation
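import google.generativeai as genai

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Get current weather for a location."""
    # Your weather API logic here
    return {"temperature": 22, "condition": "Partly cloudy"}

model = genai.GenerativeModel(
    "gemini-3.1-pro",
    tools=[get_weather]  # Pass Python functions directly
)

# generate_content returns the model's function call for your code to handle.
# In this SDK, automatic execution is enabled on a chat session via
# model.start_chat(enable_automatic_function_calling=True).
response = model.generate_content("What's the weather in Tokyo?")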
Key Differentiators
Automatic function calling: Gemini's SDK can automatically execute your Python functions and return results to the model. This eliminates the manual loop of parsing tool calls, executing functions, and sending results back.
Native Python function support: You can pass Python functions directly as tools. The SDK extracts the function signature and docstring to create tool definitions automatically. This reduces boilerplate compared to manually writing JSON schemas.
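To make the idea of schema extraction concrete, here is a rough sketch of how a signature and docstring can be turned into a tool definition with the standard library. It is not the Gemini SDK's actual implementation; `function_to_declaration` and `PY_TO_JSON` are hypothetical names, and only simple annotations are handled.

```python
import inspect

# Assumed mapping from Python annotations to JSON-schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean",
              dict: "object", list: "array"}

def function_to_declaration(func) -> dict:
    """Build a tool definition from a function's name, docstring, and parameters."""
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unrecognized parameters fall back to "string" in this sketch.
        properties[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # No default value means the argument is required.
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

def get_weather(location: str, unit: str = "celsius") -> dict:
    """Get current weather for a location."""
    return {"temperature": 22, "condition": "Partly cloudy"}

declaration = function_to_declaration(get_weather)
print(declaration)
```

Because the description comes straight from the docstring, docstring quality directly affects how well the model can choose the tool.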
Token overhead: Gemini's tool definitions add 180-350 tokens per request with 3-5 tools. This is the lowest overhead among OpenAI, Claude, and Gemini (DeepSeek's 150-300 is slightly lower), partially because Gemini's internal tool representation is more compact.
Reliability: TokenMix.ai testing shows 95-98% correct tool selection with Gemini 3.1 Pro. Accuracy drops to 90-94% with complex multi-tool scenarios (10+ tools), where the model occasionally selects a related but incorrect tool.
DeepSeek Function Calling: Budget Option
OpenAI-compatible API at 10x lower cost. 90-95% selection accuracy (roughly 4-7 points behind OpenAI). Main failure mode: argument hallucination. Limited parallel calls; 32-tool max. Best for 1-3 simple tools where cost dominates.
DeepSeek V4 supports OpenAI-compatible function calling at a fraction of the cost. The implementation follows the OpenAI format, making migration straightforward.
Implementation
from openai import OpenAI

# DeepSeek uses OpenAI-compatible API
client = OpenAI(
    base_url="https://api.deepseek.com/v1",
    api_key="your-deepseek-key"
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,  # Same OpenAI-format tools
    tool_choice="auto"
)
Limitations
Reliability: TokenMix.ai testing shows 90-95% correct tool selection, lower than OpenAI (97-99%) and Claude (96-99%). The main failure mode is argument hallucination -- the model generates plausible but incorrect parameter values.
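One inexpensive guard is to check generated arguments against the tool's schema before executing anything. The sketch below is a hand-rolled check for required fields, basic types, and enums; `validate_arguments` and `WEATHER_SCHEMA` are hypothetical names. It catches schema violations only: a plausible but wrong value such as the wrong city still passes.

```python
import json

# JSON-schema type names mapped to the Python types this sketch accepts.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def type_matches(value, json_type) -> bool:
    expected = JSON_TYPES.get(json_type) if isinstance(json_type, str) else None
    if expected is None:
        return True  # Types this sketch does not model are not checked.
    if isinstance(value, bool) and json_type != "boolean":
        return False  # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, expected)

def validate_arguments(raw_arguments: str, schema: dict) -> list:
    """Return a list of problems; an empty list means the arguments passed these checks."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    problems = [f"missing required field: {name}"
                for name in schema.get("required", []) if name not in args]
    for name, value in args.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            problems.append(f"unexpected field: {name}")
        elif not type_matches(value, spec.get("type")):
            problems.append(f"wrong type for field: {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not allowed for field: {name}")
    return problems

WEATHER_SCHEMA = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

print(validate_arguments('{"location": "Tokyo", "unit": "kelvin"}', WEATHER_SCHEMA))
```

When the list is non-empty, returning it to the model as the tool result gives the model a chance to correct the call instead of failing the workflow.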
Parallel calling: Limited support for parallel function calls. DeepSeek tends to generate one tool call at a time even when multiple calls would be appropriate.
Max tools: Supports up to 32 tools per request, compared to 128 for OpenAI and 64 for Claude and Gemini.
When to use DeepSeek for function calling: Cost-sensitive applications with simple tool schemas (1-3 tools, straightforward parameters). The 10x cost savings over GPT-4o compensate for the lower reliability in many use cases.
Full Comparison Table: Function Calling Capabilities
11 dimensions side-by-side. Pattern: OpenAI/Claude lead reliability + parallel; Gemini wins auto-execute with low overhead; DeepSeek wins price (10x) and has the lowest overhead. Strict schema only on OpenAI + Gemini. Auto-execute only on Gemini.
| Dimension | OpenAI (GPT-4o) | Anthropic (Claude Sonnet 4.6) | Google (Gemini 3.1 Pro) | DeepSeek (V4) |
|---|---|---|---|---|
| API Compatibility | Native | Unique format | Native + auto-exec | OpenAI-compatible |
| Tool Selection Accuracy | 97-99% | 96-99% | 95-98% | 90-95% |
| Argument Accuracy | 96-98% | 95-98% | 94-97% | 88-93% |
| Parallel Calls | Native | Native | Supported | Limited |
| Max Tools | 128 | 64 | 64 | 32 |
| Token Overhead (3-5 tools) | 200-400 | 300-500 | 180-350 | 150-300 |
| Streaming + Tools | Full | Full | Full | Basic |
| Strict Schema | Yes | No (flexible) | Yes | No |
| Auto-Execute | No (manual loop) | No (manual loop) | Yes (SDK) | No (manual loop) |
| Input Cost/M tokens | $2.50 | $3.00 | $2.00 | $0.27 |
| Output Cost/M tokens | $10.00 | $15.00 | $12.00 | $1.10 |
Code Examples: Function Calling in Python and Node.js
Standard pattern: define tools array → loop until no more tool_calls → for each call, execute function and append result with role:"tool" + tool_call_id. TokenMix.ai accepts this OpenAI-format request and translates it for whichever model sits behind it.
Complete Multi-Tool Example (OpenAI-Compatible, Works with TokenMix.ai)
import json
from openai import OpenAI

# Works with OpenAI, DeepSeek, or TokenMix.ai
client = OpenAI(
    base_url="https://api.tokenmix.ai/v1",
    api_key="your-tokenmix-key"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_products",
            "description": "Search for products by name or category",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "books"]},
                    "max_price": {"type": "number"}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_product_reviews",
            "description": "Get reviews for a specific product by ID",
            "parameters": {
                "type": "object",
                "properties": {
                    "product_id": {"type": "string"},
                    "min_rating": {"type": "integer", "minimum": 1, "maximum": 5}
                },
                "required": ["product_id"]
            }
        }
    }
]

def execute_tool(name, arguments):
    """Execute a tool and return results as a JSON string."""
    args = json.loads(arguments)  # Parsed arguments; the stubs below ignore them
    if name == "search_products":
        return json.dumps({"products": [{"id": "p123", "name": "Wireless Headphones", "price": 79.99}]})
    elif name == "get_product_reviews":
        return json.dumps({"reviews": [{"rating": 5, "text": "Great sound quality"}]})
    # Unknown tool: return an error payload instead of None so the model can recover
    return json.dumps({"error": f"Unknown tool: {name}"})

# Initial request
messages = [{"role": "user", "content": "Find wireless headphones under $100 and show me reviews"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

# Tool execution loop
while response.choices[0].message.tool_calls:
    assistant_msg = response.choices[0].message
    messages.append(assistant_msg)
    for tool_call in assistant_msg.tool_calls:
        result = execute_tool(tool_call.function.name, tool_call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result
        })
    response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

print(response.choices[0].message.content)
Node.js / TypeScript Example
import OpenAI from "openai";

const client = new OpenAI();

const tools: OpenAI.ChatCompletionTool[] = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get weather for a location",
      parameters: {
        type: "object",
        properties: {
          location: { type: "string" },
          unit: { type: "string", enum: ["celsius", "fahrenheit"] },
        },
        required: ["location"],
      },
    },
  },
];

const response = await client.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Weather in Tokyo and London?" }],
  tools,
});

// Handle parallel tool calls
const toolCalls = response.choices[0].message.tool_calls;
if (toolCalls) {
  for (const call of toolCalls) {
    console.log(`Call: ${call.function.name}(${call.function.arguments})`);
  }
}
Token Overhead: The Hidden Cost of Function Calling
Total per-call overhead measured across 1M production calls: OpenAI 346, Claude 512, Gemini 318, DeepSeek 295. At 100K monthly calls, monthly overhead = $86 (OpenAI), $154 (Claude), $64 (Gemini), $8 (DeepSeek).
Function calling is not free. Every tool definition, every tool call, and every tool result adds tokens to your API request. TokenMix.ai measured the actual overhead across 1 million production calls.
Per-Call Overhead Breakdown
| Component | OpenAI | Claude | Gemini | DeepSeek |
|---|---|---|---|---|
| Tool definitions (5 tools) | 250-400 tokens | 350-500 tokens | 200-350 tokens | 200-300 tokens |
| Tool call response | 30-60 tokens | 40-80 tokens | 30-60 tokens | 30-50 tokens |
| Tool result round-trip | 50-150 tokens | 60-180 tokens | 50-140 tokens | 50-120 tokens |
| Total overhead | 330-610 | 450-760 | 280-550 | 280-470 |
| Average (TokenMix.ai measured) | 346 | 512 | 318 | 295 |
Monthly Cost Impact
For a system making 100,000 function calls per month:
| Provider | Overhead Tokens/Month | Overhead Cost/Month |
|---|---|---|
| OpenAI GPT-4o | 34.6M | $86.50 (input) |
| Claude Sonnet 4.6 | 51.2M | $153.60 (input) |
| Gemini 3.1 Pro | 31.8M | $63.60 (input) |
| DeepSeek V4 | 29.5M | $7.97 (input) |
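The table's arithmetic can be reproduced in a few lines. This sketch uses the measured averages and the input prices listed earlier in this guide, and it treats all overhead as input tokens, as the table does; since tool call responses are generated by the model and billed at output rates, real bills would likely come out somewhat higher. `monthly_overhead_cost` is a hypothetical helper.

```python
from decimal import Decimal, ROUND_HALF_UP

CALLS_PER_MONTH = 100_000

# (average overhead tokens per call, input price in USD per million tokens)
PROVIDERS = {
    "OpenAI GPT-4o": (346, Decimal("2.50")),
    "Claude Sonnet 4.6": (512, Decimal("3.00")),
    "Gemini 3.1 Pro": (318, Decimal("2.00")),
    "DeepSeek V4": (295, Decimal("0.27")),
}

def monthly_overhead_cost(avg_tokens: int, price_per_million: Decimal,
                          calls: int = CALLS_PER_MONTH) -> Decimal:
    """Overhead tokens per month times the input price, rounded to cents."""
    tokens = Decimal(avg_tokens) * calls
    cost = tokens / Decimal(1_000_000) * price_per_million
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

costs = {name: monthly_overhead_cost(tokens, price)
         for name, (tokens, price) in PROVIDERS.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost}")
```

Decimal is used instead of float so that the DeepSeek figure rounds the same way as the table.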
Cost optimization tip: Use TokenMix.ai's smart routing to send simple function calls (1-2 tools, straightforward parameters) to DeepSeek and complex function calls (5+ tools, nested parameters) to GPT-4o. This hybrid approach saves 50-70% on function calling costs while maintaining high reliability.
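A routing rule of this kind could be sketched as follows. This is not TokenMix.ai's routing logic; `pick_model` and `nesting_depth` are hypothetical, and the thresholds come from the tip above. The tip does not say where requests with 3-4 flat tools should go, so this sketch assumes they go to the stronger model.

```python
CHEAP_MODEL = "deepseek-chat"
STRONG_MODEL = "gpt-4o"

def nesting_depth(schema: dict) -> int:
    """Depth of nested objects in a JSON-schema parameter block (1 means flat)."""
    nested = [nesting_depth(sub) for sub in schema.get("properties", {}).values()
              if sub.get("type") == "object"]
    return 1 + max(nested, default=0)

def pick_model(tools: list) -> str:
    """Send 1-2 flat tools to the cheap model and everything else to the strong one."""
    has_nested = any(nesting_depth(t["function"]["parameters"]) > 1 for t in tools)
    if len(tools) <= 2 and not has_nested:
        return CHEAP_MODEL
    return STRONG_MODEL

flat_tool = {"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object", "properties": {"location": {"type": "string"}}},
}}
nested_tool = {"type": "function", "function": {
    "name": "create_order",
    "parameters": {"type": "object", "properties": {
        "address": {"type": "object", "properties": {"city": {"type": "string"}}},
    }},
}}

print(pick_model([flat_tool]), pick_model([nested_tool]), pick_model([flat_tool] * 5))
```

The decision depends only on the tool definitions, so it can be computed once per tool set rather than on every request.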
Parallel and Sequential Function Calling
Parallel calls cut latency 60-70% (1 round trip vs 3). OpenAI + Claude support natively; Gemini inconsistently; DeepSeek limited. Sequential chains require execution loop — GPT-4o uses fewest round trips.
Parallel Function Calling
When a user request requires multiple independent function calls, models can generate them simultaneously. This reduces round trips and total latency.
Example: User asks "Compare weather in Tokyo, London, and New York." A model with parallel calling generates three get_weather calls in one response. Without parallel calling, this requires three sequential round trips.
Latency impact (TokenMix.ai measured):
- Parallel (1 round trip): 800-1,200ms total
- Sequential (3 round trips): 2,400-3,600ms total
Provider support: OpenAI and Claude support parallel calling natively. Gemini supports it but is less consistent. DeepSeek has limited parallel call generation.
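Parallel generation saves model round trips; the application can also run the returned calls concurrently so the slowest tool, not the sum of all tools, sets the wait. The sketch below assumes a simplified list of calls shaped like an OpenAI-style tool_calls array; the IDs and the `run_call` helper are hypothetical, and `get_weather` is a stub.

```python
import json
from concurrent.futures import ThreadPoolExecutor

def get_weather(location: str) -> dict:
    """Stub tool; a real version would make a network request."""
    return {"location": location, "temperature": 22}

# Simplified stand-in for the tool calls returned in one model response.
tool_calls = [
    {"id": "call_1", "name": "get_weather", "arguments": '{"location": "Tokyo"}'},
    {"id": "call_2", "name": "get_weather", "arguments": '{"location": "London"}'},
    {"id": "call_3", "name": "get_weather", "arguments": '{"location": "New York"}'},
]

def run_call(call: dict) -> dict:
    """Execute one call and wrap the result as a role:"tool" message."""
    args = json.loads(call["arguments"])
    result = get_weather(**args)
    return {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}

# pool.map runs the calls on worker threads and returns results in input order.
with ThreadPoolExecutor(max_workers=len(tool_calls)) as pool:
    tool_messages = list(pool.map(run_call, tool_calls))

for message in tool_messages:
    print(message["tool_call_id"], message["content"])
```

Threads suit I/O-bound tools such as HTTP APIs; each result keeps its tool_call_id so it can be matched to the call that produced it.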
Sequential (Chained) Function Calling
Some workflows require sequential calls where the output of one function is the input to another. For example: search for a product, then get reviews for the top result.
All providers handle sequential calling through the tool execution loop (send result, get next call, repeat). The key difference is how many round trips the model needs. TokenMix.ai data shows GPT-4o completes multi-step tool chains in the fewest round trips, while DeepSeek often requires additional prompting to continue the chain.
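The loop itself is the same regardless of provider, and a cap on round trips keeps a model that never stops calling tools from running forever. In this sketch `scripted_model` is a hypothetical stand-in for an LLM that chains search into reviews; `run_chain` and `max_rounds` are illustrative names, and the message format is simplified.

```python
import json

def search_products(query: str) -> dict:
    return {"products": [{"id": "p123", "name": "Wireless Headphones"}]}

def get_product_reviews(product_id: str) -> dict:
    return {"reviews": [{"rating": 5, "text": "Great sound quality"}]}

TOOLS = {"search_products": search_products, "get_product_reviews": get_product_reviews}

def scripted_model(messages: list) -> dict:
    """Hypothetical stand-in for an LLM: picks the next step from the transcript."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if len(tool_results) == 0:
        return {"tool_call": {"name": "search_products",
                              "arguments": {"query": "wireless headphones"}}}
    if len(tool_results) == 1:
        # The second call depends on the output of the first one.
        product_id = json.loads(tool_results[0]["content"])["products"][0]["id"]
        return {"tool_call": {"name": "get_product_reviews",
                              "arguments": {"product_id": product_id}}}
    return {"content": "The top result has a 5-star review."}

def run_chain(user_message: str, max_rounds: int = 5):
    """Run the tool loop; return the final answer and the number of round trips used."""
    messages = [{"role": "user", "content": user_message}]
    for round_trips in range(1, max_rounds + 1):
        reply = scripted_model(messages)
        if "tool_call" not in reply:
            return reply["content"], round_trips
        call = reply["tool_call"]
        result = TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("tool chain did not finish within max_rounds")

answer, round_trips = run_chain("Find wireless headphones and show me reviews")
print(answer, round_trips)
```

Counting round trips this way is also a simple way to compare how many steps different models need for the same chain.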
Which Provider Should You Pick for Tool Use?
Max reliability + complex schemas: OpenAI GPT-4o. Best cost-to-accuracy: Gemini 3.1 Pro. Reasoning about tool selection: Claude. Budget + simple tools: DeepSeek. Mix all four: TokenMix.ai with OpenAI-format code.
| Your Scenario | Best Provider | Why |
|---|---|---|
| Maximum reliability, complex tools | OpenAI GPT-4o | 97-99% accuracy, strict mode, 128 tool max |
| Best cost-to-accuracy ratio | Gemini 3.1 Pro | Low overhead, good accuracy, competitive pricing |
| Detailed reasoning about tool selection | Claude Sonnet 4.6 | Best at explaining why it chose a tool |
| Budget-constrained, simple tools | DeepSeek V4 | 10x cheaper, adequate for 1-3 simple tools |
| Multi-provider flexibility | TokenMix.ai unified API | Route by complexity, automatic failover |
| Auto-execution without manual loop | Gemini SDK | Built-in function execution |
What's the Bottom Line on Function Calling?
OpenAI for max reliability, Claude for reasoning, Gemini for efficiency, DeepSeek for budget. Hybrid routing via TokenMix.ai cuts function calling costs 50-70%: simple tools go to DeepSeek, complex ones to GPT-4o, all with OpenAI-compatible code.
Function calling transforms LLMs from text generators into capable agents. Every major provider now supports it, but the implementation quality varies significantly.
For production systems, OpenAI's function calling remains the most reliable at 97-99% accuracy with the largest tool limit (128). Claude excels at complex reasoning about when and why to use tools. Gemini offers the best cost-to-accuracy ratio, with the lowest token overhead among OpenAI, Claude, and Gemini. DeepSeek provides a budget option for simple use cases.
The per-call overhead (346 tokens on average for OpenAI, 295-512 across the four providers) is a real cost. At 100,000 calls per month, that is $8-$154 depending on your provider. TokenMix.ai's unified API lets you use the same OpenAI-compatible function calling code with any model, routing by cost and complexity to minimize overhead while maintaining reliability.
Define your tools once, route intelligently through TokenMix.ai, and let the platform handle provider-specific translation. Function calling should be a feature you use, not infrastructure you maintain.
FAQ
What is function calling in LLMs?
Function calling (or tool use) is a mechanism where an LLM generates a structured request to invoke an external function instead of producing plain text. The model decides which function to call and what arguments to pass based on the user's message and the available tool definitions. Your application then executes the function and returns the result to the model.
How many tokens does function calling add to API requests?
TokenMix.ai measurement across 1 million production calls shows an average of 346 extra tokens per call with OpenAI, 512 with Claude, 318 with Gemini, and 295 with DeepSeek. The overhead comes from tool definitions, the tool call response, and the result round-trip. With 5 tools defined, expect 280-760 tokens of overhead per call.
Which LLM is best at function calling?
OpenAI GPT-4o has the highest tool selection accuracy at 97-99% and supports the most tools per request (128). Anthropic Claude Sonnet 4.6 is close behind at 96-99% and excels at reasoning about complex tool selection. Google Gemini 3.1 Pro has lower token overhead than either (318 tokens on average) but slightly lower accuracy at 95-98%.
Can I use the same function calling code across different LLM providers?
Not natively. OpenAI and DeepSeek use the same API format. Anthropic uses a different structure (tool_use content blocks). Google Gemini has its own format. TokenMix.ai's unified API solves this by accepting OpenAI-compatible function calling and translating to each provider's native format.
What is parallel function calling?
Parallel function calling allows a model to generate multiple tool calls in a single response when the calls are independent. For example, checking weather in three cities simultaneously instead of one at a time. This reduces round trips and cuts latency by 60-70%. OpenAI and Claude support parallel calling natively.
How do I reduce the cost of function calling?
Three approaches: (1) minimize tool descriptions to reduce definition tokens, (2) route simple function calls to cheaper models like DeepSeek via TokenMix.ai, and (3) cache frequently used tool definitions. The hybrid routing approach through TokenMix.ai saves 50-70% on function calling costs.
Data Source: OpenAI Function Calling Guide, Anthropic Tool Use Documentation, Google Gemini Function Calling, DeepSeek API Documentation + TokenMix.ai