TokenMix Team · 2026-03-21

# Building a Multi-Model AI Application with One API
Most production AI systems today rely on a single model. That works until it does not: the model has an outage, a new version regresses on your specific use case, or you discover that a different model handles certain queries better. Multi-model architectures solve these problems and unlock optimization strategies that are impossible with a single model.
This guide covers four production patterns for multi-model systems, all using the TokenMix API as the unified access layer.
## Pattern 1: Intelligent Fallback Chain
The simplest multi-model pattern, and the one every production system should have: if your primary model fails, fall back to an alternative instead of returning an error.
```python
import openai

client = openai.OpenAI(
    base_url="https://api.tokenmix.ai/v1",
    api_key="your-tokenmix-api-key",
    timeout=30.0
)

class FallbackChain:
    def __init__(self, models: list[str]):
        self.models = models

    def complete(self, messages: list, **kwargs) -> tuple[str, str]:
        """Try each model in order. Returns (response, model_used)."""
        last_error = None
        for model in self.models:
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    **kwargs
                )
                return response.choices[0].message.content, model
            except openai.RateLimitError:
                last_error = f"{model}: rate limited"
                continue
            except openai.APITimeoutError:
                last_error = f"{model}: timeout"
                continue
            except openai.APIError as e:
                last_error = f"{model}: {e.message}"
                continue
        raise RuntimeError(f"All models failed. Last error: {last_error}")

# Usage
chain = FallbackChain([
    "claude-sonnet-4",    # Primary: best quality
    "gpt-4o",             # Fallback 1: different provider
    "gemini-2.0-flash"    # Fallback 2: fast and reliable
])

result, model_used = chain.complete(
    messages=[{"role": "user", "content": "Analyze this error log..."}],
    max_tokens=500
)
print(f"Response from {model_used}: {result}")
```
Key design decisions:
- Order models by preference, not by cost. Your primary model should be the one that gives the best results for your use case.
- Include models from different providers. If one provider has an outage, another likely will not.
- Set a reasonable timeout. 30 seconds is usually enough; waiting 60 seconds before trying the fallback defeats the purpose.
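A related refinement: if you ever retry the whole chain after every model fails, space those retries out rather than hammering already rate-limited providers. A minimal sketch of exponential backoff with full jitter (the function name and defaults are illustrative choices, not part of any SDK):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Delay before retry number `attempt` (0-based): a random value in
    [0, min(cap, base * 2**attempt)], so concurrent clients do not
    retry in lockstep. Illustrative helper, not part of the openai SDK."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Full jitter (randomizing over the whole window rather than adding a small offset) is a common choice because it spreads retry bursts most evenly.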
## Pattern 2: A/B Testing Between Models
Before committing to a model change, run both models in parallel and compare results with real traffic. This pattern routes a percentage of traffic to a challenger model and collects quality metrics.
```python
import random
import time
from dataclasses import dataclass, field

@dataclass
class ABTestResult:
    model: str
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int

@dataclass
class ABTest:
    control_model: str
    challenger_model: str
    challenger_pct: float = 0.1  # 10% traffic to challenger
    results: dict = field(default_factory=lambda: {"control": [], "challenger": []})

    def route(self) -> str:
        if random.random() < self.challenger_pct:
            return self.challenger_model
        return self.control_model

    def complete_and_record(self, messages: list, **kwargs) -> ABTestResult:
        model = self.route()
        variant = "challenger" if model == self.challenger_model else "control"
        start = time.time()
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        latency = (time.time() - start) * 1000
        result = ABTestResult(
            model=model,
            response=response.choices[0].message.content,
            latency_ms=latency,
            input_tokens=response.usage.prompt_tokens,
            output_tokens=response.usage.completion_tokens
        )
        self.results[variant].append(result)
        return result

    def summary(self) -> dict:
        summary = {}
        for variant in ["control", "challenger"]:
            results = self.results[variant]
            if not results:
                continue
            summary[variant] = {
                "model": results[0].model,
                "count": len(results),
                "avg_latency_ms": sum(r.latency_ms for r in results) / len(results),
                "avg_output_tokens": sum(r.output_tokens for r in results) / len(results)
            }
        return summary

# Usage
test = ABTest(
    control_model="gpt-4o",
    challenger_model="claude-sonnet-4",
    challenger_pct=0.2  # Send 20% to challenger
)

result = test.complete_and_record(
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
    max_tokens=300
)
print(test.summary())
```
To make A/B testing actionable, you need quality metrics beyond just latency. Consider logging responses and running periodic evaluation with a judge model or human review.
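Even for the latency numbers the summary already gives you, eyeballing two averages can mislead. A rough significance heuristic, requiring the gap to exceed two combined standard errors (roughly a 95% bar; a sketch, not a substitute for a proper statistical test):

```python
from math import sqrt
from statistics import mean, stdev

def challenger_wins_latency(control: list[float], challenger: list[float],
                            min_samples: int = 30) -> bool:
    """True if the challenger's mean latency beats the control's by more than
    two combined standard errors. Refuses to decide on too few samples.
    Illustrative heuristic; thresholds are choices, not requirements."""
    if len(control) < min_samples or len(challenger) < min_samples:
        return False
    se = sqrt(stdev(control) ** 2 / len(control)
              + stdev(challenger) ** 2 / len(challenger))
    return mean(challenger) < mean(control) - 2 * se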
## Pattern 3: Quality Scoring and Automatic Selection
For tasks where output quality varies significantly between models, use a lightweight judge to score responses and automatically select the best one.
```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_candidates(messages: list, models: list[str], **kwargs) -> list[dict]:
    """Generate responses from multiple models in parallel."""
    def call_model(model):
        start = time.time()
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            return {
                "model": model,
                "response": response.choices[0].message.content,
                "latency_ms": (time.time() - start) * 1000,
                "tokens": response.usage.completion_tokens
            }
        except Exception as e:
            return {"model": model, "error": str(e)}

    with ThreadPoolExecutor(max_workers=len(models)) as executor:
        results = list(executor.map(call_model, models))
    return [r for r in results if "error" not in r]

def score_response(original_prompt: str, response: str) -> float:
    """Score a response using a fast judge model."""
    judge_response = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {
                "role": "system",
                "content": (
                    "Rate this AI response on a scale of 1-10. Consider:\n"
                    "- Accuracy and correctness\n"
                    "- Completeness (addresses the full question)\n"
                    "- Clarity and conciseness\n"
                    "Respond with only a number."
                )
            },
            {
                "role": "user",
                "content": f"Prompt: {original_prompt}\n\nResponse: {response}"
            }
        ],
        max_tokens=5,
        temperature=0
    )
    try:
        return float(judge_response.choices[0].message.content.strip())
    except ValueError:
        return 5.0  # Default middle score

def best_of_n(messages: list, models: list[str], **kwargs) -> dict:
    """Generate from multiple models, return the highest-scored response."""
    candidates = generate_candidates(messages, models, **kwargs)
    if not candidates:
        raise RuntimeError("All models failed")
    prompt = messages[-1]["content"] if messages else ""
    for candidate in candidates:
        candidate["score"] = score_response(prompt, candidate["response"])
    return max(candidates, key=lambda c: c["score"])

# Usage: get the best response from three models
best = best_of_n(
    messages=[{"role": "user", "content": "Write a technical summary of WebSockets."}],
    models=["claude-sonnet-4", "gpt-4o", "gemini-2.0-flash"]
)
print(f"Best response from {best['model']} (score: {best['score']}):")
print(best["response"])
```
This pattern costs more per request (you are calling multiple models plus a judge), so use it selectively for high-value outputs where quality matters more than cost.
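One practical note on the judge call above: even with "Respond with only a number" in the prompt, judge models occasionally reply with something like "Score: 8" or "8/10", which a bare `float()` rejects. A slightly more defensive parser (a sketch; the regex and the clamping bounds are choices, not requirements):

```python
import re

def parse_judge_score(text: str, default: float = 5.0) -> float:
    """Pull the first number out of the judge's reply and clamp it to [1, 10];
    fall back to a neutral score if no number is present.
    Illustrative helper, not part of any SDK."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if not match:
        return default
    return min(10.0, max(1.0, float(match.group())))
```

Dropping this in for the `float(...)` call keeps a sloppy judge reply from silently collapsing to the default score.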
## Pattern 4: Graceful Degradation with Circuit Breakers
In production, you need more than simple fallbacks. A circuit breaker pattern prevents cascading failures by detecting when a model is consistently failing and temporarily removing it from the rotation.
```python
import time
from dataclasses import dataclass
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Model disabled
    HALF_OPEN = "half_open"  # Testing recovery

@dataclass
class CircuitBreaker:
    model: str
    failure_threshold: int = 5      # Failures before opening
    recovery_timeout: float = 60.0  # Seconds before testing recovery
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        return True  # HALF_OPEN: allow a test request

    def record_success(self):
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

class ResilientMultiModel:
    def __init__(self, models: list[str]):
        self.breakers = {m: CircuitBreaker(model=m) for m in models}

    def complete(self, messages: list, **kwargs) -> tuple[str, str]:
        for model, breaker in self.breakers.items():
            if not breaker.can_execute():
                continue
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    **kwargs
                )
                breaker.record_success()
                return response.choices[0].message.content, model
            except (openai.RateLimitError, openai.APITimeoutError, openai.APIError):
                breaker.record_failure()
                continue
        raise RuntimeError("All models failed or are in circuit-open state")

    def status(self) -> dict:
        return {
            model: {
                "state": breaker.state.value,
                "failures": breaker.failure_count
            }
            for model, breaker in self.breakers.items()
        }

# Usage
system = ResilientMultiModel([
    "claude-sonnet-4",
    "gpt-4o",
    "gemini-2.0-flash"
])

# In your request handler
result, model = system.complete(
    messages=[{"role": "user", "content": "Hello!"}]
)

# Monitor circuit breaker states
print(system.status())
```
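One limitation of the breaker above: while HALF_OPEN it admits every request, so under load a still-failing model gets a burst of traffic the moment the recovery timeout elapses. A variant that admits exactly one probe at a time (a standalone sketch, not thread-safe; the `probe_in_flight` flag is an addition of this sketch, not part of the class above):

```python
import time
from dataclasses import dataclass
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class ProbingCircuitBreaker:
    failure_threshold: int = 5
    recovery_timeout: float = 60.0
    state: CircuitState = CircuitState.CLOSED
    failure_count: int = 0
    last_failure_time: float = 0.0
    probe_in_flight: bool = False  # True while a single recovery probe is out

    def can_execute(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN and \
                time.time() - self.last_failure_time > self.recovery_timeout:
            self.state = CircuitState.HALF_OPEN
        if self.state == CircuitState.HALF_OPEN and not self.probe_in_flight:
            self.probe_in_flight = True  # admit exactly one probe request
            return True
        return False

    def record_success(self):
        self.failure_count = 0
        self.probe_in_flight = False
        self.state = CircuitState.CLOSED

    def record_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        self.probe_in_flight = False
        if self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
```

A production version would guard the flag with a lock, since `can_execute` and the record methods race under concurrent request handlers.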
## Choosing the Right Pattern
| Pattern | When to Use | Cost Impact | Complexity |
|---|---|---|---|
| Fallback Chain | Every production system | None (only pays for successful call) | Low |
| A/B Testing | Before model migrations | Low (small traffic percentage) | Medium |
| Quality Scoring | High-value outputs | High (3x+ per request) | Medium |
| Circuit Breaker | High-availability systems | None | Medium |
Start with the Fallback Chain. It is the highest-value, lowest-effort pattern. Add Circuit Breakers when you need high availability. Use A/B Testing before any model change. Reserve Quality Scoring for your most important outputs.
## Why a Unified API Matters
All four patterns above depend on one thing: being able to call different models with the same code. Without a unified API, each pattern would require managing multiple SDKs, API keys, authentication methods, and response formats. TokenMix eliminates this complexity by providing a single OpenAI-compatible endpoint for every model.
This is not just a convenience: it is an architectural enabler. Multi-model patterns become practical when switching models is a one-line change instead of a week-long integration project.
## Production Checklist
Before deploying a multi-model system:
- Log which model handled each request (essential for debugging)
- Monitor per-model latency and error rates separately
- Set up alerts for circuit breaker state changes
- Test fallback behavior by intentionally using an invalid model name
- Ensure your system prompts work well across all models in your pool
- Verify that response parsing handles format differences between models
- Run A/B tests before promoting any model change to 100% traffic
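On the fourth checklist item: you can exercise the fallback logic entirely offline by parameterizing the chain on a callable and driving it with a stub. A minimal sketch (`FakeRateLimitError` and `flaky_call` are test stand-ins invented here, not SDK types):

```python
class FakeRateLimitError(Exception):
    """Stand-in for the SDK's rate-limit error in this offline sketch."""

class FallbackChain:
    """Simplified version of the Pattern 1 chain, parameterized on a
    callable so tests can inject failures without any network calls."""
    def __init__(self, models: list[str], call):
        self.models = models
        self.call = call

    def complete(self, prompt: str) -> tuple[str, str]:
        last_error = None
        for model in self.models:
            try:
                return self.call(model, prompt), model
            except FakeRateLimitError:
                last_error = f"{model}: rate limited"
        raise RuntimeError(f"All models failed. Last error: {last_error}")

def flaky_call(model: str, prompt: str) -> str:
    # Simulate the primary model being rate limited.
    if model == "claude-sonnet-4":
        raise FakeRateLimitError()
    return f"ok from {model}"

chain = FallbackChain(["claude-sonnet-4", "gpt-4o"], flaky_call)
```

The same injection trick works for the circuit-breaker class: drive `record_failure` directly and assert on the resulting states.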