TokenMix Research Lab · 2026-04-24

LiteLLM Gemini 3 Integration 2026: Setup, Cost, Routing

Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30

Use LiteLLM with Gemini 3 when you need OpenAI-style code, local routing rules, and Google model access through one proxy. Do not use the retired gemini-3-pro-preview model string.

Google's current Gemini models page lists Gemini 3.1 Pro, Gemini 3 Flash, and Gemini 3.1 Flash-Lite. It also says Gemini 3 Pro Preview was shut down on March 9, 2026. Google's Gemini 3 guide says Gemini 3 models support a 1 million token input context window and up to 64k output tokens. On the Standard tier, for prompts under 200k tokens, the pricing page lists these rates:

- Gemini 3.1 Pro Preview: $2.00 per 1M input tokens and $12.00 per 1M output tokens.
- Gemini 3 Flash: $0.50 / $3.00.
- Gemini 3.1 Flash-Lite: $0.25 / $1.50.

LiteLLM's Gemini provider docs confirm the gemini/ route, OpenAI-style /chat/completions, streaming, tools, response_format, and Gemini 3 reasoning mappings. This guide is the cleaned-up 2026 integration path.

Quick Answer
Confirmed vs Caveat
Correct Gemini 3 Model Strings
Three Integration Paths
LiteLLM Proxy Setup
OpenAI SDK Example
Reasoning And Thinking Settings
Tools, Streaming, Vision, And Structured Output
Pricing And Cost Math
Routing Policy For Production
When To Use TokenMix.ai Instead
Common Errors
Final Recommendation
FAQ
Sources

Quick Answer

For Google AI Studio API keys, use the LiteLLM gemini/ provider prefix:

model_list:
  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY

Then call LiteLLM through an OpenAI SDK client:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-key",
    base_url="http://localhost:4000"
)

response = client.chat.completions.create(
    model="gemini-3-flash",
    messages=[{"role": "user", "content": "Summarize this API migration plan."}],
    reasoning_effort="low",
)

print(response.choices[0].message.content)

The decision is simple. LiteLLM is useful if your team wants a self-managed proxy. TokenMix.ai is cleaner if you want a hosted OpenAI-compatible gateway across Gemini, Claude, GPT, DeepSeek, and open models without maintaining proxy infrastructure.

Confirmed vs Caveat

Claim	Status	Source / note
LiteLLM supports Google AI Studio Gemini through `gemini/`	Confirmed	LiteLLM Gemini provider docs
LiteLLM exposes Gemini through OpenAI-style chat completions	Confirmed	LiteLLM lists `/chat/completions` under supported OpenAI endpoints
Google also has its own OpenAI-compatible Gemini endpoint	Confirmed	Google OpenAI compatibility docs
Gemini 3 Pro Preview should still be used	False	Google says it was shut down March 9, 2026
`gemini-3.1-pro-preview` is the current Pro target	Confirmed	Google model and pricing pages
Gemini 3 Flash and 3.1 Flash-Lite have API free tiers	Confirmed	Google Gemini 3 guide and pricing page
LiteLLM removes all Gemini-specific behavior	No	Thinking, service tiers, thought signatures, and model quirks still matter
LiteLLM is a managed multi-provider API product	No	It is software you operate unless you buy a hosted LiteLLM setup

Correct Gemini 3 Model Strings

The model string is the easiest place to make a costly mistake.

Use case	LiteLLM provider model	Local alias in config	Status
Best Gemini reasoning	`gemini/gemini-3.1-pro-preview`	`gemini-3.1-pro`	Current preview
General agent work	`gemini/gemini-3-flash-preview`	`gemini-3-flash`	Current preview
High-volume low-cost tasks	`gemini/gemini-3.1-flash-lite-preview`	`gemini-3.1-flash-lite`	Current preview
Legacy Gemini 3 Pro	`gemini/gemini-3-pro-preview`	Do not use	Shut down March 9, 2026
Stable fallback	`gemini/gemini-2.5-flash`	`gemini-2.5-flash`	Safer fallback

If a code sample still uses gemini-3-pro-preview, treat it as a pattern to adapt, not a string to copy. Before you deploy, check every model string against Google's current models page.
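A small pre-deploy lint can catch two problems: this retired string and a missing `gemini/` prefix. The sketch assumes the LiteLLM config is already loaded into a Python dict, for example with PyYAML's `yaml.safe_load`. `lint_model_list` and `RETIRED_MODELS` are hypothetical names. The set holds only the one shutdown this guide confirms.

```python
# Hypothetical pre-deploy lint for a LiteLLM model_list that has already been
# parsed into a Python dict (for example with yaml.safe_load).

# Assumption: only the retirement confirmed in this guide is listed here.
# Extend it from Google's models page as more previews are shut down.
RETIRED_MODELS = {"gemini-3-pro-preview"}


def lint_model_list(config: dict) -> list[str]:
    """Return human-readable problems found in the model_list."""
    problems = []
    for entry in config.get("model_list", []):
        alias = entry.get("model_name", "?")
        model = entry.get("litellm_params", {}).get("model", "")
        # Strip a provider prefix such as "gemini/" before comparing names.
        bare_name = model.split("/", 1)[-1]
        if bare_name in RETIRED_MODELS:
            problems.append(f"{alias}: {model} is retired")
        # Without the gemini/ prefix, LiteLLM may not route to Google AI Studio.
        if model.startswith("gemini-"):
            problems.append(f"{alias}: add the gemini/ prefix for AI Studio keys")
    return problems


config = {
    "model_list": [
        {"model_name": "gemini-3-pro",
         "litellm_params": {"model": "gemini/gemini-3-pro-preview"}},
        {"model_name": "gemini-3-flash",
         "litellm_params": {"model": "gemini-3-flash-preview"}},
        {"model_name": "gemini-3.1-flash-lite",
         "litellm_params": {"model": "gemini/gemini-3.1-flash-lite-preview"}},
    ]
}

for problem in lint_model_list(config):
    print(problem)
```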

Three Integration Paths

Path	Best for	Strength	Weakness
Google native Gemini SDK	New Gemini-first apps	Full Google feature access	Not OpenAI-compatible
Google OpenAI compatibility endpoint	Fast migration from OpenAI SDK	No local proxy needed	Gemini-specific gaps still exist
LiteLLM proxy	Teams with local routing, budgets, logging	One proxy across 100+ providers	You operate the proxy
TokenMix.ai	Production multi-model access	Hosted OpenAI-compatible gateway	External service dependency

LiteLLM is a good bridge that you run yourself. TokenMix.ai is a hosted product layer. The difference is who operates the infrastructure.

LiteLLM Proxy Setup

Install LiteLLM proxy:

uv tool install "litellm[proxy]"

Set your key:

export GEMINI_API_KEY="your-google-ai-studio-key"

Create litellm_config.yaml:

model_list:
  - model_name: gemini-3.1-pro
    litellm_params:
      model: gemini/gemini-3.1-pro-preview
      api_key: os.environ/GEMINI_API_KEY

  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY

  - model_name: gemini-3.1-flash-lite
    litellm_params:
      model: gemini/gemini-3.1-flash-lite-preview
      api_key: os.environ/GEMINI_API_KEY

general_settings:
  master_key: sk-your-litellm-master-key  # clients send this as their API key

Start the proxy:

litellm --config litellm_config.yaml --port 4000

The proxy now accepts OpenAI-style requests at:

http://localhost:4000/v1/chat/completions
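The sketch below shows what the proxy receives on the wire, independent of any SDK. It uses only Python's standard library. It builds the POST request with a Bearer master key and an OpenAI-style JSON body. The placeholder key and the `gemini-3-flash` alias are assumed to match the config above. Whether LiteLLM forwards `reasoning_effort` exactly as shown depends on your LiteLLM version.

```python
import json
import urllib.request

# Sketch: build (but do not send) the raw HTTP request that an OpenAI SDK
# client would send to the local LiteLLM proxy. Key and alias are placeholders.
PROXY_URL = "http://localhost:4000/v1/chat/completions"
MASTER_KEY = "sk-your-litellm-master-key"  # placeholder, not a real key

payload = {
    "model": "gemini-3-flash",  # the model_name alias from litellm_config.yaml
    "messages": [{"role": "user", "content": "Hello from raw HTTP."}],
    "reasoning_effort": "low",
}

request = urllib.request.Request(
    PROXY_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {MASTER_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# To actually send it once the proxy is running:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
print(request.get_method(), request.full_url)
```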

OpenAI SDK Example

Python:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-master-key",
    base_url="http://localhost:4000/v1"
)

response = client.chat.completions.create(
    model="gemini-3-flash",
    messages=[
        {"role": "system", "content": "You are a precise API migration assistant."},
        {"role": "user", "content": "Convert this OpenAI-only app plan to a Gemini-ready plan."}
    ],
    reasoning_effort="low",
)

print(response.choices[0].message.content)

Node:

// Run as an ES module (for example app.mjs) so top-level await works.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-litellm-master-key",
  baseURL: "http://localhost:4000/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-3-flash",
  messages: [
    { role: "user", content: "Return a short Gemini 3 integration checklist." },
  ],
  reasoning_effort: "low",
});

console.log(response.choices[0].message.content);

Direct Google OpenAI-compatible endpoint:

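import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],  # read the AI Studio key from the environment
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-3-flash-preview",  # bare Google model ID, no "gemini/" prefix
    messages=[{"role": "user", "content": "Hello Gemini 3."}]
)

print(response.choices[0].message.content)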

Use direct Google compatibility for one-provider apps. Use LiteLLM or a unified AI API gateway when you need provider routing.
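Both OpenAI-SDK paths use the same calling code. Three settings change: the base URL, the key, and the model name.

The sketch below assumes a `LITELLM_MASTER_KEY` environment variable, a name this guide does not define. `client_settings` is a hypothetical helper. It only builds the constructor arguments, so it runs without the `openai` package installed.

```python
import os

# Hypothetical helper: the same OpenAI-SDK code can target either the LiteLLM
# proxy or Google's OpenAI-compatible endpoint. What changes is the base URL,
# the key, and the model name (proxy alias vs. bare Google model ID).
PATHS = {
    "litellm": {
        "base_url": "http://localhost:4000/v1",
        "key_env": "LITELLM_MASTER_KEY",  # assumed env var name
        "model": "gemini-3-flash",  # alias defined in litellm_config.yaml
    },
    "google": {
        "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/",
        "key_env": "GEMINI_API_KEY",
        "model": "gemini-3-flash-preview",  # bare Google model ID, no "gemini/"
    },
}


def client_settings(path: str) -> tuple[dict, str]:
    """Return (kwargs for openai.OpenAI(...), model name) for one path."""
    cfg = PATHS[path]
    kwargs = {
        "api_key": os.environ.get(cfg["key_env"], ""),
        "base_url": cfg["base_url"],
    }
    return kwargs, cfg["model"]


# Usage (requires the openai package and a reachable endpoint):
# from openai import OpenAI
# kwargs, model = client_settings("google")
# client = OpenAI(**kwargs)
# client.chat.completions.create(model=model, messages=[...])
```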

Reasoning And Thinking Settings

Gemini 3 is not just another chat model. Thinking settings affect latency, output quality, and cost.

Setting	Gemini 3 behavior	LiteLLM handling	Use it when
`reasoning_effort="low"`	Lower reasoning depth	Maps toward low thinking level	Chat, extraction, simple agents
`reasoning_effort="medium"`	Model-dependent support	May map to medium or high	Balanced workflows
`reasoning_effort="high"`	Maximum reasoning depth	Maps to high thinking level	Coding, planning, hard analysis
`reasoning_effort="none"`	Cannot fully disable Gemini 3 thinking	LiteLLM maps to minimal or low where possible	Cost control, not zero reasoning
`thinking_level`	Native Gemini-style control	Can be passed through advanced fields	Gemini-specific tuning
Thought signatures	Preserve reasoning continuity	Must be handled carefully in multi-turn flows	Function calling and agent loops

Google's Gemini 3 guide says thought signatures must be passed back exactly as received in some strict cases, notably function calling and image generation or editing. If you build long-running agents, test multi-turn tool loops before you ship.
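In a tool loop, append the assistant message exactly as it came back. Do not rebuild it from `content` and `tool_calls`.

The sketch below assumes a hypothetical `call_model` function that returns the assistant message as a dict. Where the thought signature lives inside that dict depends on your LiteLLM version. That is why the loop copies the whole message.

```python
import json
from typing import Callable

# Sketch of a multi-turn tool loop that keeps each assistant turn intact.
# Assumption: `call_model` is a hypothetical function that takes the message
# history and returns the assistant message as a plain dict, for example
# response.choices[0].message.model_dump(exclude_none=True) with the OpenAI SDK.


def run_tool_loop(
    call_model: Callable[[list[dict]], dict],
    tools: dict[str, Callable[..., str]],
    messages: list[dict],
    max_turns: int = 5,
) -> str:
    for _ in range(max_turns):
        assistant = call_model(messages)
        # Append the assistant message exactly as returned, including any
        # provider-specific fields that may carry thought signatures.
        messages.append(assistant)
        tool_calls = assistant.get("tool_calls") or []
        if not tool_calls:
            return assistant.get("content") or ""
        for call in tool_calls:
            fn = tools[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"] or "{}")
            # Tool results go back as "tool" messages keyed by tool_call_id.
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": fn(**args),
            })
    raise RuntimeError("tool loop did not finish within max_turns")
```

To test it without a network, pass a fake `call_model` that first returns a tool call and then a final answer. Then check that the history still contains the first assistant message unchanged.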

Tools, Streaming, Vision, And Structured Output

Feature	Works through LiteLLM?	Notes
Streaming	Yes	Use `stream=True` in OpenAI-style calls
Function calling	Yes	LiteLLM accepts OpenAI tool shape and translates to Gemini
Vision input	Yes	Gemini is natively multimodal; test image payload size
Structured outputs	Yes	Use `response_format`, but validate schema behavior
Embeddings	Yes	LiteLLM lists `/embeddings` for Google AI Studio
Videos	Supported endpoint class	Use model-specific docs before production
Image edits	Supported endpoint class	Separate from normal text chat
Service tiers	Partially	LiteLLM maps OpenAI `service_tier` to Gemini tiers

The practical rule: text chat and streaming are straightforward. Tool loops, structured output, and multimodal payloads need integration tests.
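The validation half of that rule can stay small. The sketch below assumes you asked for JSON with `response_format` and expect three hypothetical fields. A production setup might use a full JSON Schema validator instead.

```python
import json

# Sketch: validate a structured-output reply before trusting it.
# Assumption: the schema below is a hypothetical example, not a default
# from LiteLLM or Gemini. Replace it with the fields your app requests.
REQUIRED_FIELDS = {"title": str, "risk_level": str, "steps": list}


def parse_structured_reply(text: str) -> dict:
    """Parse model output as JSON and check required keys and types."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} should be {expected_type.__name__}")
    return data


good = '{"title": "Migrate to Gemini", "risk_level": "low", "steps": ["a", "b"]}'
print(parse_structured_reply(good)["risk_level"])  # low
```

Raising `ValueError` on every failure gives a single exception to catch. Your caller can then retry, tighten the prompt, or escalate to a stronger model.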

Pricing And Cost Math

Google's pricing page separates Standard, Batch, Flex, and Priority. Most teams should estimate Standard first, then use Batch or Flex for offline jobs.

Model	Standard input	Standard output	Free API tier	Best use
Gemini 3.1 Pro Preview	$2.00 / 1M under 200k prompt tokens	$12.00 / 1M under 200k prompt tokens	No API free tier	Highest reasoning
Gemini 3 Flash Preview	$0.50 / 1M text/image/video input	$3.00 / 1M output	Yes	General agents
Gemini 3.1 Flash-Lite Preview	$0.25 / 1M text/image/video input	$1.50 / 1M output	Yes	High-volume automation
Gemini 2.5 Flash	Check current pricing page	Check current pricing page	Check current pricing page	Stable fallback

Scenario 1: one heavy analysis request.

Model	Input tokens	Output tokens	Estimated cost
Gemini 3.1 Pro Preview	100,000	5,000	$0.26
Gemini 3 Flash Preview	100,000	5,000	$0.065
Gemini 3.1 Flash-Lite Preview	100,000	5,000	$0.0325

Scenario 2: one million small agent calls per month, each with 2,000 input tokens and 500 output tokens.

Model	Monthly input	Monthly output	Estimated monthly cost
Gemini 3.1 Pro Preview	2B tokens	500M tokens	$10,000
Gemini 3 Flash Preview	2B tokens	500M tokens	$2,500
Gemini 3.1 Flash-Lite Preview	2B tokens	500M tokens	$1,250

That is why routing matters. If every request goes to Pro, the bill can be 8x the Flash-Lite path for simple workloads.
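Both scenarios use one formula: tokens times the price per 1M tokens, divided by 1M. The sketch below uses the Standard prices quoted in this guide. It does not model Pro's higher rate above 200k prompt tokens, Batch, Flex, or Priority pricing, or free-tier usage.

```python
# Sketch of the cost math above, using the Standard prices quoted in this
# guide (USD per 1M tokens, prompts under 200k tokens). Prices change, so
# re-check Google's pricing page before relying on these numbers.
PRICES = {
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "gemini-3-flash-preview": (0.50, 3.00),
    "gemini-3.1-flash-lite-preview": (0.25, 1.50),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a single request."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def monthly_cost(model: str, calls: int, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for `calls` identical requests."""
    return calls * request_cost(model, input_tokens, output_tokens)


for model in PRICES:
    heavy = request_cost(model, 100_000, 5_000)         # Scenario 1
    agents = monthly_cost(model, 1_000_000, 2_000, 500)  # Scenario 2
    print(f"{model}: ${heavy:.4f} per heavy request, ${agents:,.0f} per month")
```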

Routing Policy For Production

Start with a cheap model. Escalate only when the task needs it.

Task type	First model	Escalate to	Reason
Classification	Gemini 3.1 Flash-Lite	Gemini 3 Flash	Low-cost, low-risk
Summaries under 20k tokens	Gemini 3 Flash	Gemini 3.1 Pro	Better latency and price
Long document reasoning	Gemini 3 Flash	Gemini 3.1 Pro	Pro only when answer quality matters
Coding agent planning	Gemini 3.1 Pro	Claude or GPT fallback	Hard reasoning is worth the premium
Tool-heavy agents	Gemini 3 Flash	Gemini 3.1 Pro	Test thought signatures
Bulk offline enrichment	Gemini 3.1 Flash-Lite Batch	Gemini 3 Flash Batch	Batch pricing can cut cost

LiteLLM can handle local routing. TokenMix.ai's LLM API gateway can handle hosted routing when you do not want to run the proxy yourself.
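The routing table fits in a few lines of plain Python. This is a sketch of quality-based escalation, not LiteLLM's own router or fallback configuration (see the LiteLLM routing docs for those).

- `call`, `good_enough`, and the `claude-fallback` alias are hypothetical.
- The Batch row is left out because Batch is a pricing mode, not a per-request model alias.

```python
from typing import Callable

# Sketch of the cheap-first routing table above. Model names are the local
# aliases from litellm_config.yaml; "claude-fallback" is a hypothetical alias
# you would have to define yourself. `call(model, prompt)` is a hypothetical
# hook that sends the request (for example through the LiteLLM proxy), and
# `good_enough(answer)` is your own quality check.
ROUTES = {
    "classification": ("gemini-3.1-flash-lite", "gemini-3-flash"),
    "summary": ("gemini-3-flash", "gemini-3.1-pro"),
    "long_document": ("gemini-3-flash", "gemini-3.1-pro"),
    "coding_plan": ("gemini-3.1-pro", "claude-fallback"),
    "tool_agent": ("gemini-3-flash", "gemini-3.1-pro"),
}


def route(
    task_type: str,
    prompt: str,
    call: Callable[[str, str], str],
    good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the first model; escalate once if the answer fails the check."""
    first, escalate_to = ROUTES[task_type]
    answer = call(first, prompt)
    if good_enough(answer):
        return first, answer
    return escalate_to, call(escalate_to, prompt)
```

Escalating at most once keeps the worst-case cost predictable. A request costs either one cheap call, or one cheap call plus one expensive call.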

When To Use TokenMix.ai Instead

Use TokenMix.ai when the problem is not just "call Gemini." Use it when the problem is provider access, fallbacks, cost-efficient routing, unified billing, and one OpenAI-compatible endpoint.

Requirement	LiteLLM proxy	TokenMix.ai
Local self-hosted control	Strong	Not the main point
No proxy operations	Weak	Strong
Multi-provider OpenAI-compatible access	Good if configured	Built in
Fast experimentation	Good	Strong
Centralized hosted gateway	Requires deployment	Built in
Gemini plus Claude/GPT/DeepSeek fallback	Possible	Built in
Internal infra ownership	Your team	TokenMix.ai

If you only need Gemini, Google direct compatibility is enough. If you need an internal proxy, use LiteLLM. If you need a hosted multi-model layer, use TokenMix.ai or compare LiteLLM alternatives.

Common Errors

Error	Likely cause	Fix
Model not found	Using `gemini-3-pro-preview`	Move to `gemini-3.1-pro-preview`, `gemini-3-flash-preview`, or `gemini-3.1-flash-lite-preview`
Auth failure	Missing `GEMINI_API_KEY`	Set the environment variable and restart proxy
LiteLLM tries Vertex AI	Missing `gemini/` prefix	Use `gemini/gemini-3-flash-preview` for Google AI Studio API keys
Higher cost than expected	Pro used for all requests	Add Flash-Lite and Flash routing
Tool loop breaks	Missing thought signatures or model-specific tool behavior	Test multi-turn tool calls before release
Streaming works locally but fails behind proxy	Reverse proxy buffering	Disable buffering and test chunked responses
Structured output is inconsistent	Schema too loose or unsupported field	Tighten schema and add validation
Latency spikes	High reasoning level or Pro model	Lower reasoning effort or route simple tasks to Flash-Lite
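For the reverse-proxy buffering row, a client-side timing check helps. The sketch below assumes OpenAI SDK stream chunks (`chunk.choices[0].delta.content`). `collect_stream` is a hypothetical helper.

Chunk count alone does not prove streaming works. A buffered response can still arrive as many events, just all at once. The timing is the useful signal.

```python
import time
from typing import Iterable


def collect_stream(chunks: Iterable) -> dict:
    """Join streamed deltas and time the first content chunk vs. the total."""
    start = time.monotonic()
    parts: list[str] = []
    first_at = None
    for chunk in chunks:
        # Some chunks (for example a final usage chunk) may carry no choices.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_at is None:
                first_at = time.monotonic() - start
            parts.append(delta)
    return {
        "text": "".join(parts),
        "chunks": len(parts),
        "time_to_first": first_at,
        "total_time": time.monotonic() - start,
    }


# Usage with an OpenAI SDK client pointed at the proxy:
# stream = client.chat.completions.create(
#     model="gemini-3-flash", messages=[...], stream=True)
# stats = collect_stream(stream)
# If time_to_first is close to total_time on a long answer, something
# between the client and LiteLLM is probably buffering the response.
```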

Final Recommendation

Use LiteLLM + Gemini 3 if you want a self-managed OpenAI-compatible proxy. Use Gemini 3.1 Pro only for hard reasoning. Put Flash or Flash-Lite first for normal agent traffic.

For production teams, the clean routing stack is:

Layer	Recommended choice
Single Gemini app	Google OpenAI-compatible endpoint
Internal proxy	LiteLLM
Hosted multi-provider gateway	TokenMix.ai
Heavy reasoning	Gemini 3.1 Pro Preview
Everyday agent tasks	Gemini 3 Flash Preview
High-volume cheap automation	Gemini 3.1 Flash-Lite Preview

FAQ

Does LiteLLM support Gemini 3?

Yes. LiteLLM supports Gemini through the gemini/ provider route and OpenAI-style endpoints. Use current Google model strings, not retired examples.

What model string should I use for Gemini 3 Pro?

Use gemini/gemini-3.1-pro-preview in LiteLLM config. Google says gemini-3-pro-preview was shut down on March 9, 2026.

Is Gemini 3 free through the API?

Some Gemini 3 models have free API tiers. Google says Gemini 3 Flash and Gemini 3.1 Flash-Lite have API free tiers, but Gemini 3.1 Pro Preview does not.

Can I use the OpenAI SDK directly with Gemini 3?

Yes. Google provides an OpenAI-compatible endpoint at https://generativelanguage.googleapis.com/v1beta/openai/. LiteLLM is optional. Add it when you need routing, budgets, logging, or multiple providers.

Does `reasoning_effort="none"` turn off Gemini 3 thinking?

No. LiteLLM maps it to minimal or low where possible, but Gemini 3 thinking cannot be fully disabled in the same way. Treat it as cost control, not zero reasoning.

Should I use LiteLLM or TokenMix.ai?

Use LiteLLM if you want to operate your own proxy. Use TokenMix.ai if you want hosted OpenAI-compatible access across many models without maintaining proxy infrastructure.

Is Gemini 3.1 Pro too expensive for agents?

It can be. In a one-million-call workload with 2,000 input and 500 output tokens per call, Pro is roughly $10,000 at Standard pricing under 200k prompt tokens, while Flash-Lite is about $1,250.

Can LiteLLM route Gemini to Claude or GPT fallbacks?

Yes, LiteLLM can route across configured providers. If you want that capability as a hosted service instead of an internal deployment, compare TokenMix.ai with OpenRouter and other gateways.

Sources

Google Gemini models: https://ai.google.dev/gemini-api/docs/models
Google Gemini 3 developer guide: https://ai.google.dev/gemini-api/docs/gemini-3
Google Gemini API pricing: https://ai.google.dev/gemini-api/docs/pricing
Google OpenAI compatibility: https://ai.google.dev/gemini-api/docs/openai
LiteLLM Gemini provider docs: https://docs.litellm.ai/docs/providers/gemini
LiteLLM proxy configuration: https://docs.litellm.ai/docs/proxy/configs
LiteLLM routing docs: https://docs.litellm.ai/docs/routing