TokenMix Research Lab · 2026-04-24

LiteLLM Gemini 3 Integration 2026: Setup, Cost, Routing
Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30
Use LiteLLM with Gemini 3 when you need OpenAI-style code, local routing rules, and Google model access through one proxy. Do not use the retired gemini-3-pro-preview model string.
Google's current Gemini model page lists Gemini 3.1 Pro, Gemini 3 Flash, and Gemini 3.1 Flash-Lite, and says Gemini 3 Pro Preview was shut down on March 9, 2026. Google's Gemini 3 guide says Gemini 3 models support a 1 million token input context window and up to 64k output tokens. The pricing page lists Gemini 3.1 Pro Preview at $2.00 per 1M input tokens and $12.00 per 1M output tokens on Standard under 200k prompt tokens, while Gemini 3 Flash is $0.50 / $3.00 and Gemini 3.1 Flash-Lite is $0.25 / $1.50. LiteLLM's Gemini provider docs confirm the gemini/ route, OpenAI-style /chat/completions, streaming, tools, response_format, and Gemini 3 reasoning mappings. This guide is the cleaned-up 2026 integration path.
Table of Contents
- Quick Answer
- Confirmed vs Caveat
- Correct Gemini 3 Model Strings
- Three Integration Paths
- LiteLLM Proxy Setup
- OpenAI SDK Example
- Reasoning And Thinking Settings
- Tools, Streaming, Vision, And Structured Output
- Pricing And Cost Math
- Routing Policy For Production
- When To Use TokenMix.ai Instead
- Common Errors
- Final Recommendation
- FAQ
- Related Articles
- Sources
Quick Answer
For Google AI Studio API keys, use the LiteLLM gemini/ provider prefix:
model_list:
  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY
Then call LiteLLM through an OpenAI SDK client:
from openai import OpenAI

# Point the OpenAI client at the local LiteLLM proxy instead of api.openai.com.
client = OpenAI(
    api_key="sk-your-litellm-key",
    base_url="http://localhost:4000/v1",
)

response = client.chat.completions.create(
    model="gemini-3-flash",  # the alias from model_list, not the provider string
    messages=[{"role": "user", "content": "Summarize this API migration plan."}],
    reasoning_effort="low",
)
print(response.choices[0].message.content)
The decision is simple. LiteLLM is useful if your team wants a self-managed proxy. TokenMix.ai is cleaner if you want a hosted OpenAI-compatible gateway across Gemini, Claude, GPT, DeepSeek, and open models without maintaining proxy infrastructure.
Confirmed vs Caveat
| Claim | Status | Source / note |
|---|---|---|
| LiteLLM supports Google AI Studio Gemini through gemini/ | Confirmed | LiteLLM Gemini provider docs |
| LiteLLM exposes Gemini through OpenAI-style chat completions | Confirmed | LiteLLM lists /chat/completions under supported OpenAI endpoints |
| Google also has its own OpenAI-compatible Gemini endpoint | Confirmed | Google OpenAI compatibility docs |
| Gemini 3 Pro Preview should still be used | False | Google says it was shut down March 9, 2026 |
| gemini-3.1-pro-preview is the current Pro target | Confirmed | Google model and pricing pages |
| Gemini 3 Flash and 3.1 Flash-Lite have API free tiers | Confirmed | Google Gemini 3 guide and pricing page |
| LiteLLM removes all Gemini-specific behavior | False | Thinking, service tiers, thought signatures, and model quirks still matter |
| LiteLLM is a managed multi-provider API product | False | It is software you operate unless you buy a hosted LiteLLM setup |
Correct Gemini 3 Model Strings
The model string is the easiest place to make a costly mistake.
| Use case | LiteLLM provider model | Local alias in config | Status |
|---|---|---|---|
| Best Gemini reasoning | gemini/gemini-3.1-pro-preview | gemini-3.1-pro | Current preview |
| General agent work | gemini/gemini-3-flash-preview | gemini-3-flash | Current preview |
| High-volume low-cost tasks | gemini/gemini-3.1-flash-lite-preview | gemini-3.1-flash-lite | Current preview |
| Legacy Gemini 3 Pro | gemini/gemini-3-pro-preview | Do not use | Shut down March 9, 2026 |
| Stable fallback | gemini/gemini-2.5-flash | gemini-2.5-flash | Safer fallback |
If an older sample still shows gemini-3-pro-preview, read it as an illustration of the pattern, not as a working model string. For production, check Google's active model list first.
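One way to keep retired strings out of production is to resolve every alias through a single lookup. The sketch below copies the aliases and provider strings from the table above; the helper name resolve_model and the RETIRED_MODELS set are hypothetical, not part of LiteLLM.

```python
# Alias -> LiteLLM provider model string, copied from the table above.
GEMINI_ALIASES = {
    "gemini-3.1-pro": "gemini/gemini-3.1-pro-preview",
    "gemini-3-flash": "gemini/gemini-3-flash-preview",
    "gemini-3.1-flash-lite": "gemini/gemini-3.1-flash-lite-preview",
    "gemini-2.5-flash": "gemini/gemini-2.5-flash",
}

# Strings the article says must not be used any more.
RETIRED_MODELS = {"gemini/gemini-3-pro-preview"}


def resolve_model(name: str) -> str:
    """Hypothetical helper: return the provider string for an alias.

    Accepts an alias, a bare model id, or a full gemini/ string, and
    raises ValueError for retired or unknown models.
    """
    candidate = GEMINI_ALIASES.get(name, name)
    if not candidate.startswith("gemini/"):
        candidate = f"gemini/{candidate}"
    if candidate in RETIRED_MODELS:
        raise ValueError(f"{candidate} is retired; pick a current model")
    if candidate not in GEMINI_ALIASES.values():
        raise ValueError(f"{candidate} is not in the known model table")
    return candidate
```

Calling resolve_model at startup, rather than on each request, surfaces a stale string before any traffic is served.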
Three Integration Paths
| Path | Best for | Strength | Weakness |
|---|---|---|---|
| Google native Gemini SDK | New Gemini-first apps | Full Google feature access | Not OpenAI-compatible |
| Google OpenAI compatibility endpoint | Fast migration from OpenAI SDK | No local proxy needed | Gemini-specific gaps still exist |
| LiteLLM proxy | Teams with local routing, budgets, logging | One proxy across 100+ providers | You operate the proxy |
| TokenMix.ai | Production multi-model access | Hosted OpenAI-compatible gateway | External service dependency |
LiteLLM is a bridge you run yourself. TokenMix.ai is a hosted product layer. The difference is who operates the infrastructure.
LiteLLM Proxy Setup
Install LiteLLM proxy:
uv tool install "litellm[proxy]"
Set your key:
export GEMINI_API_KEY="your-google-ai-studio-key"
Create litellm_config.yaml:
model_list:
  - model_name: gemini-3.1-pro
    litellm_params:
      model: gemini/gemini-3.1-pro-preview
      api_key: os.environ/GEMINI_API_KEY
  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY
  - model_name: gemini-3.1-flash-lite
    litellm_params:
      model: gemini/gemini-3.1-flash-lite-preview
      api_key: os.environ/GEMINI_API_KEY

# The proxy master key is read from general_settings, not litellm_settings.
general_settings:
  master_key: sk-your-litellm-master-key
Start the proxy:
litellm --config litellm_config.yaml --port 4000
The proxy now accepts OpenAI-style requests at:
http://localhost:4000/v1/chat/completions
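To see what an OpenAI-style request to that endpoint looks like without any SDK, the sketch below builds the raw HTTP request with the Python standard library. It assumes the proxy from this section is listening on localhost:4000 and uses the placeholder master key from the config; build_chat_request is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str):
    """Hypothetical helper: build an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # the alias from litellm_config.yaml
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request(
    "http://localhost:4000/v1",
    "sk-your-litellm-master-key",
    "gemini-3-flash",
    "ping",
)

# Sending it requires the proxy to be running:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The send step is left commented out so the snippet runs without a live proxy. Uncomment it for a quick smoke test after starting LiteLLM.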
OpenAI SDK Example
Python:
from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-master-key",
    base_url="http://localhost:4000/v1",
)

response = client.chat.completions.create(
    model="gemini-3-flash",
    messages=[
        {"role": "system", "content": "You are a precise API migration assistant."},
        {"role": "user", "content": "Convert this OpenAI-only app plan to a Gemini-ready plan."},
    ],
    reasoning_effort="low",
)
print(response.choices[0].message.content)
Node:
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-litellm-master-key",
  baseURL: "http://localhost:4000/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-3-flash",
  messages: [
    { role: "user", content: "Return a short Gemini 3 integration checklist." },
  ],
  reasoning_effort: "low",
});
console.log(response.choices[0].message.content);
Direct Google OpenAI-compatible endpoint:
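import os

from openai import OpenAI

# Read the Google AI Studio key from the environment instead of hardcoding it.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
)

response = client.chat.completions.create(
    model="gemini-3-flash-preview",  # no gemini/ prefix when calling Google directly
    messages=[{"role": "user", "content": "Hello Gemini 3."}],
)
print(response.choices[0].message.content)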
Use direct Google compatibility for one-provider apps. Use LiteLLM or a unified AI API gateway when you need provider routing.
Reasoning And Thinking Settings
Gemini 3 is not just another chat model. Thinking settings affect latency, output quality, and cost.
| Setting | Gemini 3 behavior | LiteLLM handling | Use it when |
|---|---|---|---|
| reasoning_effort="low" | Lower reasoning depth | Maps toward low thinking level | Chat, extraction, simple agents |
| reasoning_effort="medium" | Model-dependent support | May map to medium or high | Balanced workflows |
| reasoning_effort="high" | Maximum reasoning depth | Maps to high thinking level | Coding, planning, hard analysis |
| reasoning_effort="none" | Cannot fully disable Gemini 3 thinking | LiteLLM maps to minimal or low where possible | Cost control, not zero reasoning |
| thinking_level | Native Gemini-style control | Can be passed through advanced fields | Gemini-specific tuning |
| Thought signatures | Preserve reasoning continuity | Must be handled carefully in multi-turn flows | Function calling and agent loops |
Google's Gemini 3 guide says thought signatures must be returned exactly in some strict tool and image cases. If you build long-running agents, test multi-turn tool loops before you ship.
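The "Use it when" column above can be turned into a small lookup so that callers never hardcode an effort level. This is a minimal sketch: the task labels and both helper names are hypothetical, and the mapping only restates the table. It does not describe how LiteLLM translates the value internally.

```python
# Task label -> reasoning_effort, following the "Use it when" column above.
# The task labels themselves are illustrative assumptions.
EFFORT_BY_TASK = {
    "chat": "low",
    "extraction": "low",
    "simple_agent": "low",
    "balanced": "medium",
    "coding": "high",
    "planning": "high",
    "hard_analysis": "high",
}


def pick_reasoning_effort(task_type: str, default: str = "low") -> str:
    """Return the effort level for a task, falling back to the cheap default."""
    return EFFORT_BY_TASK.get(task_type, default)


def chat_kwargs(model: str, prompt: str, task_type: str) -> dict:
    """Build keyword arguments for client.chat.completions.create(**kwargs)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": pick_reasoning_effort(task_type),
    }
```

Defaulting unknown tasks to low keeps latency and cost down. Raise the default only if quality regressions show up in testing.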
Tools, Streaming, Vision, And Structured Output
| Feature | Works through LiteLLM? | Notes |
|---|---|---|
| Streaming | Yes | Use stream=True in OpenAI-style calls |
| Function calling | Yes | LiteLLM accepts OpenAI tool shape and translates to Gemini |
| Vision input | Yes | Gemini is natively multimodal; test image payload size |
| Structured outputs | Yes | Use response_format, but validate schema behavior |
| Embeddings | Yes | LiteLLM lists /embeddings for Google AI Studio |
| Videos | Supported endpoint class | Use model-specific docs before production |
| Image edits | Supported endpoint class | Separate from normal text chat |
| Service tiers | Partially | LiteLLM maps OpenAI service_tier to Gemini tiers |
The practical rule: text chat and streaming are straightforward. Tool loops, structured output, and multimodal payloads need integration tests.
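For the streaming case, the usual pattern is to concatenate the content deltas as chunks arrive. The sketch below assumes chunks shaped like OpenAI SDK streaming chunks (choices[0].delta.content) and uses stand-in objects so it runs without a proxy; collect_stream_text and fake_chunk are hypothetical helpers.

```python
from types import SimpleNamespace


def collect_stream_text(chunks) -> str:
    """Join the text deltas from an OpenAI-style chat completion stream."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices at all
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:  # skip empty or None deltas, e.g. role-only chunks
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streaming chunk, used only for this demo."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_stream = [
    fake_chunk("Hel"),
    fake_chunk(None),
    fake_chunk("lo"),
    SimpleNamespace(choices=[]),
]
print(collect_stream_text(demo_stream))  # prints Hello
```

With a real client, pass the object returned by client.chat.completions.create(..., stream=True) in place of demo_stream.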
Pricing And Cost Math
Google's pricing page separates Standard, Batch, Flex, and Priority. Most teams should estimate Standard first, then use Batch or Flex for offline jobs.
| Model | Standard input | Standard output | Free API tier | Best use |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / 1M under 200k prompt tokens | $12.00 / 1M under 200k prompt tokens | No API free tier | Highest reasoning |
| Gemini 3 Flash Preview | $0.50 / 1M text/image/video input | $3.00 / 1M output | Yes | General agents |
| Gemini 3.1 Flash-Lite Preview | $0.25 / 1M text/image/video input | $1.50 / 1M output | Yes | High-volume automation |
| Gemini 2.5 Flash | Check current pricing page | Check current pricing page | Check current pricing page | Stable fallback |
Scenario 1: one heavy analysis request.
| Model | Input tokens | Output tokens | Estimated cost |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 100,000 | 5,000 | $0.26 |
| Gemini 3 Flash Preview | 100,000 | 5,000 | $0.065 |
| Gemini 3.1 Flash-Lite Preview | 100,000 | 5,000 | $0.0325 |
Scenario 2: one million small agent calls per month, each with 2,000 input tokens and 500 output tokens.
| Model | Monthly input | Monthly output | Estimated monthly cost |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 2B tokens | 500M tokens | $10,000 |
| Gemini 3 Flash Preview | 2B tokens | 500M tokens | $2,500 |
| Gemini 3.1 Flash-Lite Preview | 2B tokens | 500M tokens | $1,250 |
That is why routing matters. If every request goes to Pro, the bill can be 8x the Flash-Lite path for simple workloads.
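The arithmetic behind both scenarios is input tokens times the input rate plus output tokens times the output rate, divided by one million. A minimal sketch using the Standard prices quoted in the pricing table (the Pro rate applies only under 200k prompt tokens); the PRICES table and estimate_cost name are illustrative:

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), Standard tier, from the table above.
PRICES = {
    "gemini-3.1-pro-preview": (Decimal("2.00"), Decimal("12.00")),
    "gemini-3-flash-preview": (Decimal("0.50"), Decimal("3.00")),
    "gemini-3.1-flash-lite-preview": (Decimal("0.25"), Decimal("1.50")),
}
MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int,
                  calls: int = 1) -> Decimal:
    """Estimate USD cost for `calls` identical requests at Standard pricing."""
    input_rate, output_rate = PRICES[model]
    per_call = (Decimal(input_tokens) * input_rate
                + Decimal(output_tokens) * output_rate) / MILLION
    return per_call * calls


# Scenario 1: one heavy analysis request on Pro.
scenario_1 = estimate_cost("gemini-3.1-pro-preview", 100_000, 5_000)

# Scenario 2: one million small agent calls on Pro versus Flash-Lite.
pro_monthly = estimate_cost("gemini-3.1-pro-preview", 2_000, 500, calls=1_000_000)
lite_monthly = estimate_cost("gemini-3.1-flash-lite-preview", 2_000, 500, calls=1_000_000)
print(scenario_1, pro_monthly, lite_monthly, pro_monthly / lite_monthly)
```

Decimal is used instead of float so the totals match the tables without rounding noise. Batch, Flex, and Priority rates are not modeled here.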
Routing Policy For Production
Start with a cheap model. Escalate only when the task needs it.
| Task type | First model | Escalate to | Reason |
|---|---|---|---|
| Classification | Gemini 3.1 Flash-Lite | Gemini 3 Flash | Low-cost, low-risk |
| Summaries under 20k tokens | Gemini 3 Flash | Gemini 3.1 Pro | Better latency and price |
| Long document reasoning | Gemini 3 Flash | Gemini 3.1 Pro | Pro only when answer quality matters |
| Coding agent planning | Gemini 3.1 Pro | Claude or GPT fallback | Hard reasoning is worth the premium |
| Tool-heavy agents | Gemini 3 Flash | Gemini 3.1 Pro | Test thought signatures |
| Bulk offline enrichment | Gemini 3.1 Flash-Lite Batch | Gemini 3 Flash Batch | Batch pricing can cut cost |
LiteLLM can handle local routing. TokenMix.ai's LLM API gateway can handle hosted routing when you do not want to run the proxy yourself.
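The "start cheap, escalate when needed" policy can be expressed in application code as a lookup plus one retry. This sketch covers only the Gemini-to-Gemini rows of the table and uses the local aliases from the proxy config; the task labels, the call_model and is_good_enough callables, and both function names are assumptions.

```python
# task type -> (first model alias, escalation alias), from the table above.
# The coding-agent and Batch rows are left out because they need aliases
# (Claude, GPT, Batch tiers) that this guide does not define.
ROUTES = {
    "classification": ("gemini-3.1-flash-lite", "gemini-3-flash"),
    "short_summary": ("gemini-3-flash", "gemini-3.1-pro"),
    "long_document": ("gemini-3-flash", "gemini-3.1-pro"),
    "tool_agent": ("gemini-3-flash", "gemini-3.1-pro"),
}


def route(task_type: str, escalate: bool = False) -> str:
    """Return the model alias to try for a task."""
    first, fallback = ROUTES[task_type]
    return fallback if escalate else first


def answer_with_escalation(task_type, call_model, is_good_enough):
    """Try the cheap model first; retry once on the escalation model.

    call_model(alias) -> answer and is_good_enough(answer) -> bool are
    supplied by the caller, for example a wrapper around the OpenAI client
    and a schema or confidence check.
    """
    answer = call_model(route(task_type))
    if is_good_enough(answer):
        return answer
    return call_model(route(task_type, escalate=True))
```

The quality check is the hard part in practice. A schema validator or a confidence field in a structured output is a common starting point.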
When To Use TokenMix.ai Instead
Use TokenMix.ai when the problem is not just "call Gemini." Use it when the problem is provider access, fallbacks, cost-efficient routing, unified billing, and one OpenAI-compatible endpoint.
| Requirement | LiteLLM proxy | TokenMix.ai |
|---|---|---|
| Local self-hosted control | Strong | Not the main point |
| No proxy operations | Weak | Strong |
| Multi-provider OpenAI-compatible access | Good if configured | Built in |
| Fast experimentation | Good | Strong |
| Centralized hosted gateway | Requires deployment | Built in |
| Gemini plus Claude/GPT/DeepSeek fallback | Possible | Built in |
| Internal infra ownership | Your team | TokenMix.ai |
If you only need Gemini, Google direct compatibility is enough. If you need an internal proxy, use LiteLLM. If you need a hosted multi-model layer, use TokenMix.ai or compare LiteLLM alternatives.
Common Errors
| Error | Likely cause | Fix |
|---|---|---|
| Model not found | Using gemini-3-pro-preview | Move to gemini-3.1-pro-preview, gemini-3-flash-preview, or gemini-3.1-flash-lite-preview |
| Auth failure | Missing GEMINI_API_KEY | Set the environment variable and restart proxy |
| LiteLLM tries Vertex AI | Missing gemini/ prefix | Use gemini/gemini-3-flash-preview for Google AI Studio API keys |
| Higher cost than expected | Pro used for all requests | Add Flash-Lite and Flash routing |
| Tool loop breaks | Missing thought signatures or model-specific tool behavior | Test multi-turn tool calls before release |
| Streaming works locally but fails behind proxy | Reverse proxy buffering | Disable buffering and test chunked responses |
| Structured output is inconsistent | Schema too loose or unsupported field | Tighten schema and add validation |
| Latency spikes | High reasoning level or Pro model | Lower reasoning effort or route simple tasks to Flash-Lite |
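Two of the errors above, the missing key and the missing gemini/ prefix, can be caught before the proxy starts. The following is a rough preflight sketch; the preflight function is hypothetical and takes the environment as a plain mapping so it can be exercised without touching real settings.

```python
import os


def preflight(model_strings, env=None):
    """Return a list of configuration problems (an empty list means OK).

    model_strings: the `model:` values from litellm_config.yaml.
    env: mapping of environment variables; defaults to os.environ.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    for model in model_strings:
        if not model.startswith("gemini/"):
            problems.append(f"{model}: missing gemini/ prefix")
    return problems


issues = preflight(
    ["gemini/gemini-3-flash-preview", "gemini-3.1-pro-preview"],
    env={"GEMINI_API_KEY": "example-key"},
)
print(issues)
```

Running a check like this in CI or a container entrypoint turns a runtime auth or routing failure into a startup failure, which is easier to diagnose.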
Final Recommendation
Use LiteLLM + Gemini 3 if you want a self-managed OpenAI-compatible proxy. Use Gemini 3.1 Pro only for hard reasoning. Put Flash or Flash-Lite first for normal agent traffic.
For production teams, the clean routing stack is:
| Layer | Recommended choice |
|---|---|
| Single Gemini app | Google OpenAI-compatible endpoint |
| Internal proxy | LiteLLM |
| Hosted multi-provider gateway | TokenMix.ai |
| Heavy reasoning | Gemini 3.1 Pro Preview |
| Everyday agent tasks | Gemini 3 Flash Preview |
| High-volume cheap automation | Gemini 3.1 Flash-Lite Preview |
FAQ
Does LiteLLM support Gemini 3?
Yes. LiteLLM supports Gemini through the gemini/ provider route and OpenAI-style endpoints. Use current Google model strings, not retired examples.
What model string should I use for Gemini 3 Pro?
Use gemini/gemini-3.1-pro-preview in LiteLLM config. Google says gemini-3-pro-preview was shut down on March 9, 2026.
Is Gemini 3 free through the API?
Some Gemini 3 models have free API tiers. Google says Gemini 3 Flash and Gemini 3.1 Flash-Lite have API free tiers, but Gemini 3.1 Pro Preview does not.
Can I use the OpenAI SDK directly with Gemini 3?
Yes. Google provides an OpenAI-compatible endpoint at https://generativelanguage.googleapis.com/v1beta/openai/. LiteLLM is optional; add it only if you need routing, budgets, logging, or multiple providers.
Does reasoning_effort="none" turn off Gemini 3 thinking?
No. Gemini 3 thinking cannot be fully disabled, so LiteLLM maps "none" to minimal or low where possible. Treat it as cost control, not zero reasoning.
Should I use LiteLLM or TokenMix.ai?
Use LiteLLM if you want to operate your own proxy. Use TokenMix.ai if you want hosted OpenAI-compatible access across many models without maintaining proxy infrastructure.
Is Gemini 3.1 Pro too expensive for agents?
It can be. In a workload of one million calls per month with 2,000 input and 500 output tokens per call, Pro costs roughly $10,000 per month at Standard pricing under 200k prompt tokens, while Flash-Lite costs about $1,250.
Can LiteLLM route Gemini to Claude or GPT fallbacks?
Yes, LiteLLM can route across configured providers. If you want that capability as a hosted service instead of an internal deployment, compare TokenMix.ai with OpenRouter and other gateways.
Related Articles
- Gemini OpenAI-Compatible API 2026: Direct Setup Guide
- OpenAI-Compatible API Guide 2026: SDK, Providers, Pricing
- AI API Gateway 2026: Routing, Fallbacks, Cost Control
- Best Unified AI API Gateways 2026: 7 Tools, Scores, Costs
- LiteLLM Alternative Managed: When A Hosted Gateway Wins
- OpenRouter Alternatives: Cheaper API Routing Options
Sources
- Google Gemini models: https://ai.google.dev/gemini-api/docs/models
- Google Gemini 3 developer guide: https://ai.google.dev/gemini-api/docs/gemini-3
- Google Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing
- Google OpenAI compatibility: https://ai.google.dev/gemini-api/docs/openai
- LiteLLM Gemini provider docs: https://docs.litellm.ai/docs/providers/gemini
- LiteLLM proxy configuration: https://docs.litellm.ai/docs/proxy/configs
- LiteLLM routing docs: https://docs.litellm.ai/docs/routing