TokenMix Research Lab · 2026-04-24

LiteLLM Gemini 3 Integration 2026: Setup, Cost, Routing

Last Updated: 2026-04-30
Author: TokenMix Research Lab
Data checked: 2026-04-30

Use LiteLLM with Gemini 3 when you need OpenAI-style code, local routing rules, and Google model access through one proxy. Do not use the retired gemini-3-pro-preview model string.

Google's current Gemini model page lists Gemini 3.1 Pro, Gemini 3 Flash, and Gemini 3.1 Flash-Lite, and says Gemini 3 Pro Preview was shut down on March 9, 2026. Google's Gemini 3 guide says Gemini 3 models support a 1 million token input context window and up to 64k output tokens. The pricing page lists Gemini 3.1 Pro Preview at $2.00 per 1M input tokens and $12.00 per 1M output tokens on Standard under 200k prompt tokens, while Gemini 3 Flash is $0.50 / $3.00 and Gemini 3.1 Flash-Lite is $0.25 / $1.50. LiteLLM's Gemini provider docs confirm the gemini/ route, OpenAI-style /chat/completions, streaming, tools, response_format, and Gemini 3 reasoning mappings. This guide is the cleaned-up 2026 integration path.

Quick Answer

For Google AI Studio API keys, use the LiteLLM gemini/ provider prefix:

model_list:
  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY

Then call LiteLLM through an OpenAI SDK client:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-master-key",
    base_url="http://localhost:4000/v1"
)

response = client.chat.completions.create(
    model="gemini-3-flash",
    messages=[{"role": "user", "content": "Summarize this API migration plan."}],
    reasoning_effort="low",
)

print(response.choices[0].message.content)

The decision is simple. LiteLLM is useful if your team wants a self-managed proxy. TokenMix.ai is cleaner if you want a hosted OpenAI-compatible gateway across Gemini, Claude, GPT, DeepSeek, and open models without maintaining proxy infrastructure.

Confirmed vs Caveat

| Claim | Status | Source / note |
|---|---|---|
| LiteLLM supports Google AI Studio Gemini through gemini/ | Confirmed | LiteLLM Gemini provider docs |
| LiteLLM exposes Gemini through OpenAI-style chat completions | Confirmed | LiteLLM lists /chat/completions under supported OpenAI endpoints |
| Google also has its own OpenAI-compatible Gemini endpoint | Confirmed | Google OpenAI compatibility docs |
| Gemini 3 Pro Preview should still be used | False | Google says it was shut down March 9, 2026 |
| gemini-3.1-pro-preview is the current Pro target | Confirmed | Google model and pricing pages |
| Gemini 3 Flash and 3.1 Flash-Lite have API free tiers | Confirmed | Google Gemini 3 guide and pricing page |
| LiteLLM removes all Gemini-specific behavior | No | Thinking, service tiers, thought signatures, and model quirks still matter |
| LiteLLM is a managed multi-provider API product | No | It is software you operate unless you buy a hosted LiteLLM setup |

Correct Gemini 3 Model Strings

The model string is the easiest place to make a costly mistake.

| Use case | LiteLLM provider model | Local alias in config | Status |
|---|---|---|---|
| Best Gemini reasoning | gemini/gemini-3.1-pro-preview | gemini-3.1-pro | Current preview |
| General agent work | gemini/gemini-3-flash-preview | gemini-3-flash | Current preview |
| High-volume low-cost tasks | gemini/gemini-3.1-flash-lite-preview | gemini-3.1-flash-lite | Current preview |
| Legacy Gemini 3 Pro | gemini/gemini-3-pro-preview | Do not use | Shut down March 9, 2026 |
| Stable fallback | gemini/gemini-2.5-flash | gemini-2.5-flash | Safer fallback |

If a sample still shows gemini-3-pro-preview, treat it only as an illustration of the request pattern; the model string itself no longer works. For production, check Google's active model list first.
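A small lint step can catch both mistakes before a config reaches production. The sketch below is illustrative only: check_model_string and the RETIRED map are hypothetical names, and the retired entry is taken from the table above rather than fetched from Google's live model list.

```python
# Illustrative lint for LiteLLM model strings. The helper and constants are
# hypothetical; the retired entry mirrors the table in this guide.
RETIRED = {"gemini-3-pro-preview": "gemini-3.1-pro-preview"}
AI_STUDIO_PREFIX = "gemini/"


def check_model_string(model: str) -> list[str]:
    """Return a list of problems; an empty list means the string looks fine."""
    problems = []
    # Strip the provider prefix (if present) so retired names are caught either way.
    bare = model.removeprefix(AI_STUDIO_PREFIX)
    if bare in RETIRED:
        problems.append(f"'{bare}' is shut down; use '{RETIRED[bare]}'")
    if not model.startswith(AI_STUDIO_PREFIX):
        problems.append(f"missing '{AI_STUDIO_PREFIX}' prefix for Google AI Studio keys")
    return problems


if __name__ == "__main__":
    for candidate in ("gemini/gemini-3-flash-preview", "gemini-3-pro-preview"):
        print(candidate, "->", check_model_string(candidate) or "ok")
```

In practice the RETIRED map would need to be kept in sync with Google's model page by hand; this sketch does not discover shutdowns on its own.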

Four Integration Paths

| Path | Best for | Strength | Weakness |
|---|---|---|---|
| Google native Gemini SDK | New Gemini-first apps | Full Google feature access | Not OpenAI-compatible |
| Google OpenAI compatibility endpoint | Fast migration from OpenAI SDK | No local proxy needed | Gemini-specific gaps still exist |
| LiteLLM proxy | Teams with local routing, budgets, logging | One proxy across 100+ providers | You operate the proxy |
| TokenMix.ai | Production multi-model access | Hosted OpenAI-compatible gateway | External service dependency |

LiteLLM is a good bridge. TokenMix.ai is a good product layer. The difference matters.

LiteLLM Proxy Setup

Install LiteLLM proxy:

uv tool install "litellm[proxy]"

Set your key:

export GEMINI_API_KEY="your-google-ai-studio-key"

Create litellm_config.yaml:

model_list:
  - model_name: gemini-3.1-pro
    litellm_params:
      model: gemini/gemini-3.1-pro-preview
      api_key: os.environ/GEMINI_API_KEY

  - model_name: gemini-3-flash
    litellm_params:
      model: gemini/gemini-3-flash-preview
      api_key: os.environ/GEMINI_API_KEY

  - model_name: gemini-3.1-flash-lite
    litellm_params:
      model: gemini/gemini-3.1-flash-lite-preview
      api_key: os.environ/GEMINI_API_KEY

general_settings:
  master_key: sk-your-litellm-master-key

Start the proxy:

litellm --config litellm_config.yaml --port 4000

The proxy now accepts OpenAI-style requests at:

http://localhost:4000/v1/chat/completions

OpenAI SDK Example

Python:

from openai import OpenAI

client = OpenAI(
    api_key="sk-your-litellm-master-key",
    base_url="http://localhost:4000/v1"
)

response = client.chat.completions.create(
    model="gemini-3-flash",
    messages=[
        {"role": "system", "content": "You are a precise API migration assistant."},
        {"role": "user", "content": "Convert this OpenAI-only app plan to a Gemini-ready plan."}
    ],
    reasoning_effort="low",
)

print(response.choices[0].message.content)

Node:

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "sk-your-litellm-master-key",
  baseURL: "http://localhost:4000/v1",
});

const response = await client.chat.completions.create({
  model: "gemini-3-flash",
  messages: [
    { role: "user", content: "Return a short Gemini 3 integration checklist." },
  ],
  reasoning_effort: "low",
});

console.log(response.choices[0].message.content);

Direct Google OpenAI-compatible endpoint:

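import os

from openai import OpenAI

# Read the Google AI Studio key from the environment instead of hard-coding it.
client = OpenAI(
    api_key=os.environ["GEMINI_API_KEY"],
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)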

response = client.chat.completions.create(
    model="gemini-3-flash-preview",
    messages=[{"role": "user", "content": "Hello Gemini 3."}]
)

Use direct Google compatibility for one-provider apps. Use LiteLLM or a unified AI API gateway when you need provider routing.

Reasoning And Thinking Settings

Gemini 3 is not just another chat model. Thinking settings affect latency, output quality, and cost.

| Setting | Gemini 3 behavior | LiteLLM handling | Use it when |
|---|---|---|---|
| reasoning_effort="low" | Lower reasoning depth | Maps toward low thinking level | Chat, extraction, simple agents |
| reasoning_effort="medium" | Model-dependent support | May map to medium or high | Balanced workflows |
| reasoning_effort="high" | Maximum reasoning depth | Maps to high thinking level | Coding, planning, hard analysis |
| reasoning_effort="none" | Cannot fully disable Gemini 3 thinking | LiteLLM maps to minimal or low where possible | Cost control, not zero reasoning |
| thinking_level | Native Gemini-style control | Can be passed through advanced fields | Gemini-specific tuning |
| Thought signatures | Preserve reasoning continuity | Must be handled carefully in multi-turn flows | Function calling and agent loops |

Google's Gemini 3 guide says thought signatures must be returned exactly in some strict tool and image cases. If you build long-running agents, test multi-turn tool loops before you ship.
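One way to keep reasoning settings consistent is to choose reasoning_effort from the task type rather than setting it by hand in every call. The sketch below follows the "Use it when" column of the table above; the task labels, the build_request helper, and the default of "low" for unknown tasks are assumptions made for illustration, not LiteLLM behavior.

```python
# Hypothetical task labels; the effort values follow the table in this section.
EFFORT_BY_TASK = {
    "chat": "low",
    "extraction": "low",
    "simple_agent": "low",
    "balanced": "medium",
    "coding": "high",
    "planning": "high",
    "analysis": "high",
}


def build_request(model_alias: str, prompt: str, task: str) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    # Assumption: unknown task types fall back to the cheapest listed level.
    effort = EFFORT_BY_TASK.get(task, "low")
    return {
        "model": model_alias,
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,
    }


if __name__ == "__main__":
    print(build_request("gemini-3-flash", "Plan the migration.", "planning"))
```

The returned dictionary can be unpacked into the OpenAI SDK call, for example client.chat.completions.create(**build_request("gemini-3-flash", prompt, "coding")).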

Tools, Streaming, Vision, And Structured Output

| Feature | Works through LiteLLM? | Notes |
|---|---|---|
| Streaming | Yes | Use stream=True in OpenAI-style calls |
| Function calling | Yes | LiteLLM accepts OpenAI tool shape and translates to Gemini |
| Vision input | Yes | Gemini is natively multimodal; test image payload size |
| Structured outputs | Yes | Use response_format, but validate schema behavior |
| Embeddings | Yes | LiteLLM lists /embeddings for Google AI Studio |
| Videos | Supported endpoint class | Use model-specific docs before production |
| Image edits | Supported endpoint class | Separate from normal text chat |
| Service tiers | Partially | LiteLLM maps OpenAI service_tier to Gemini tiers |

The practical rule: text chat and streaming are straightforward. Tool loops, structured output, and multimodal payloads need integration tests.
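For structured output, a cheap safeguard is to validate every reply on the client side instead of trusting response_format alone. The following is a minimal sketch using only the standard library; the REQUIRED_FIELDS schema and the parse_structured_reply helper are hypothetical and stand in for whatever schema your application expects.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"title": str, "steps": list}


def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply as JSON and check required fields and types.

    Raises ValueError with a short reason if the reply does not match.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return data


if __name__ == "__main__":
    reply = '{"title": "Migration", "steps": ["swap base_url", "update model"]}'
    print(parse_structured_reply(reply)["steps"])
```

A failed validation is a reasonable trigger for a retry or for escalating to a stronger model, which keeps schema drift from reaching downstream code.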

Pricing And Cost Math

Google's pricing page separates Standard, Batch, Flex, and Priority. Most teams should estimate Standard first, then use Batch or Flex for offline jobs.

| Model | Standard input | Standard output | Free API tier | Best use |
|---|---|---|---|---|
| Gemini 3.1 Pro Preview | $2.00 / 1M under 200k prompt tokens | $12.00 / 1M under 200k prompt tokens | No API free tier | Highest reasoning |
| Gemini 3 Flash Preview | $0.50 / 1M text/image/video input | $3.00 / 1M output | Yes | General agents |
| Gemini 3.1 Flash-Lite Preview | $0.25 / 1M text/image/video input | $1.50 / 1M output | Yes | High-volume automation |
| Gemini 2.5 Flash | Check current pricing page | Check current pricing page | Usually safer fallback | Stable fallback |

Scenario 1: one heavy analysis request.

| Model | Input tokens | Output tokens | Estimated cost |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 100,000 | 5,000 | $0.26 |
| Gemini 3 Flash Preview | 100,000 | 5,000 | $0.065 |
| Gemini 3.1 Flash-Lite Preview | 100,000 | 5,000 | $0.0325 |

Scenario 2: one million small agent calls per month, each with 2,000 input tokens and 500 output tokens.

| Model | Monthly input | Monthly output | Estimated monthly cost |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 2B tokens | 500M tokens | $10,000 |
| Gemini 3 Flash Preview | 2B tokens | 500M tokens | $2,500 |
| Gemini 3.1 Flash-Lite Preview | 2B tokens | 500M tokens | $1,250 |

That is why routing matters. If every request goes to Pro, the bill can be 8x the Flash-Lite path for simple workloads.
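The arithmetic behind both scenarios is short enough to script, which makes it easy to rerun with your own token counts. This is a sketch under the Standard prices quoted above; estimate_cost and the PRICES table are hypothetical names, and the figures should be rechecked against Google's pricing page before budgeting.

```python
# USD per 1M tokens (input, output), Standard tier, as quoted in this guide.
# The Pro figures apply to prompts under 200k tokens.
PRICES = {
    "gemini-3.1-pro-preview": (2.00, 12.00),
    "gemini-3-flash-preview": (0.50, 3.00),
    "gemini-3.1-flash-lite-preview": (0.25, 1.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int, calls: int = 1) -> float:
    """Estimate USD cost for `calls` requests with the given per-call token counts."""
    input_price, output_price = PRICES[model]
    total_input = input_tokens * calls
    total_output = output_tokens * calls
    return (total_input * input_price + total_output * output_price) / 1_000_000


if __name__ == "__main__":
    for name in PRICES:
        single = estimate_cost(name, 100_000, 5_000)            # Scenario 1
        monthly = estimate_cost(name, 2_000, 500, 1_000_000)    # Scenario 2
        print(f"{name}: ${single:.4f} per heavy request, ${monthly:,.0f} per month")
```

The sketch ignores the free tier, Batch and Flex discounts, and any higher rate for prompts above 200k tokens, so treat the result as an upper-level Standard estimate.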

Routing Policy For Production

Start with a cheap model. Escalate only when the task needs it.

| Task type | First model | Escalate to | Reason |
|---|---|---|---|
| Classification | Gemini 3.1 Flash-Lite | Gemini 3 Flash | Low-cost, low-risk |
| Summaries under 20k tokens | Gemini 3 Flash | Gemini 3.1 Pro | Better latency and price |
| Long document reasoning | Gemini 3 Flash | Gemini 3.1 Pro | Pro only when answer quality matters |
| Coding agent planning | Gemini 3.1 Pro | Claude or GPT fallback | Hard reasoning is worth the premium |
| Tool-heavy agents | Gemini 3 Flash | Gemini 3.1 Pro | Test thought signatures |
| Bulk offline enrichment | Gemini 3.1 Flash-Lite Batch | Gemini 3 Flash Batch | Batch pricing can cut cost |

LiteLLM can handle local routing. TokenMix.ai's LLM API gateway can handle hosted routing when you do not want to run the proxy yourself.
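The routing table can also live in application code as a plain lookup, with the proxy aliases as values. The sketch below is one possible shape, not a LiteLLM feature: the task keys, the Route class, pick_model, and the default route are hypothetical, and the aliases assume the model_name values from the litellm_config.yaml shown earlier. The coding-planning and batch rows are left out because their escalation targets are not defined in that config.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    first: str        # alias tried on the first attempt
    escalate_to: str  # alias used when the first answer is not good enough


# Task keys are hypothetical; aliases match the proxy config in this guide.
ROUTES = {
    "classification": Route("gemini-3.1-flash-lite", "gemini-3-flash"),
    "short_summary": Route("gemini-3-flash", "gemini-3.1-pro"),
    "long_document": Route("gemini-3-flash", "gemini-3.1-pro"),
    "tool_agent": Route("gemini-3-flash", "gemini-3.1-pro"),
}

# Assumption: unknown tasks start on Flash and escalate to Pro.
DEFAULT_ROUTE = Route("gemini-3-flash", "gemini-3.1-pro")


def pick_model(task: str, attempt: int = 0) -> str:
    """Return the alias for this attempt: 0 is the first model, 1+ is the escalation."""
    route = ROUTES.get(task, DEFAULT_ROUTE)
    return route.first if attempt == 0 else route.escalate_to


if __name__ == "__main__":
    print(pick_model("classification"))             # first attempt
    print(pick_model("classification", attempt=1))  # after escalation
```

Deciding when to escalate (a failed schema check, a low-confidence answer, a user retry) is application-specific and deliberately outside this sketch.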

When To Use TokenMix.ai Instead

Use TokenMix.ai when the problem is not just "call Gemini." Use it when the problem is provider access, fallbacks, cost-efficient routing, unified billing, and one OpenAI-compatible endpoint.

| Requirement | LiteLLM proxy | TokenMix.ai |
|---|---|---|
| Local self-hosted control | Strong | Not the main point |
| No proxy operations | Weak | Strong |
| Multi-provider OpenAI-compatible access | Good if configured | Built in |
| Fast experimentation | Good | Strong |
| Centralized hosted gateway | Requires deployment | Built in |
| Gemini plus Claude/GPT/DeepSeek fallback | Possible | Built in |
| Internal infra ownership | Your team | TokenMix.ai |

If you only need Gemini, Google direct compatibility is enough. If you need an internal proxy, use LiteLLM. If you need a hosted multi-model layer, use TokenMix.ai or compare LiteLLM alternatives.

Common Errors

| Error | Likely cause | Fix |
|---|---|---|
| Model not found | Using gemini-3-pro-preview | Move to gemini-3.1-pro-preview, gemini-3-flash-preview, or gemini-3.1-flash-lite-preview |
| Auth failure | Missing GEMINI_API_KEY | Set the environment variable and restart proxy |
| LiteLLM tries Vertex AI | Missing gemini/ prefix | Use gemini/gemini-3-flash-preview for Google AI Studio API keys |
| Higher cost than expected | Pro used for all requests | Add Flash-Lite and Flash routing |
| Tool loop breaks | Missing thought signatures or model-specific tool behavior | Test multi-turn tool calls before release |
| Streaming works locally but fails behind proxy | Reverse proxy buffering | Disable buffering and test chunked responses |
| Structured output is inconsistent | Schema too loose or unsupported field | Tighten schema and add validation |
| Latency spikes | High reasoning level or Pro model | Lower reasoning effort or route simple tasks to Flash-Lite |

Final Recommendation

Use LiteLLM + Gemini 3 if you want a self-managed OpenAI-compatible proxy. Use Gemini 3.1 Pro only for hard reasoning. Put Flash or Flash-Lite first for normal agent traffic.

For production teams, the clean routing stack is:

| Layer | Recommended choice |
|---|---|
| Single Gemini app | Google OpenAI-compatible endpoint |
| Internal proxy | LiteLLM |
| Hosted multi-provider gateway | TokenMix.ai |
| Heavy reasoning | Gemini 3.1 Pro Preview |
| Everyday agent tasks | Gemini 3 Flash Preview |
| High-volume cheap automation | Gemini 3.1 Flash-Lite Preview |

FAQ

Does LiteLLM support Gemini 3?

Yes. LiteLLM supports Gemini through the gemini/ provider route and OpenAI-style endpoints. Use current Google model strings, not retired examples.

What model string should I use for Gemini 3 Pro?

Use gemini/gemini-3.1-pro-preview in LiteLLM config. Google says gemini-3-pro-preview was shut down on March 9, 2026.

Is Gemini 3 free through the API?

Some Gemini 3 models have free API tiers. Google says Gemini 3 Flash and Gemini 3.1 Flash-Lite have API free tiers, but Gemini 3.1 Pro Preview does not.

Can I use the OpenAI SDK directly with Gemini 3?

Yes. Google provides an OpenAI-compatible endpoint at https://generativelanguage.googleapis.com/v1beta/openai/. LiteLLM is optional; add it only if you need routing, budgets, logging, or multiple providers.

Does reasoning_effort="none" turn off Gemini 3 thinking?

No. LiteLLM maps it to minimal or low where possible, but Gemini 3 thinking cannot be fully disabled. Treat it as cost control, not zero reasoning.

Should I use LiteLLM or TokenMix.ai?

Use LiteLLM if you want to operate your own proxy. Use TokenMix.ai if you want hosted OpenAI-compatible access across many models without maintaining proxy infrastructure.

Is Gemini 3.1 Pro too expensive for agents?

It can be. In a workload of one million calls per month with 2,000 input and 500 output tokens per call, Pro costs roughly $10,000 per month at Standard pricing (under 200k prompt tokens), while Flash-Lite costs about $1,250.

Can LiteLLM route Gemini to Claude or GPT fallbacks?

Yes, LiteLLM can route across configured providers. If you want that capability as a hosted service instead of an internal deployment, compare TokenMix.ai with OpenRouter and other gateways.
