TokenMix Research Lab · 2026-03-21

Building a Multi-Model AI Application with One API
Last Updated: 2026-04-29
Author: TokenMix Research Lab
Use four production patterns — Fallback Chain, A/B Testing, Quality Scoring, Circuit Breakers — over a single TokenMix endpoint to ship resilient multi-model systems without juggling SDKs.
Most production AI systems today rely on a single model. That works until it does not: the model has an outage, a new version regresses on your specific use case, or you discover that a different model handles certain queries better. Multi-model architectures solve these problems and unlock optimization strategies that are impossible with a single model.
This guide covers four production patterns for multi-model systems, all using the TokenMix API as the unified access layer.
Table of Contents
- Pattern 1: Intelligent Fallback Chain
- Pattern 2: A/B Testing Between Models
- Pattern 3: Quality Scoring and Automatic Selection
- Pattern 4: Graceful Degradation with Circuit Breakers
- Which Pattern Should You Pick?
- Why Does a Unified API Matter Here?
- Production Checklist
- Authoritative References
Pattern 1: Intelligent Fallback Chain
The simplest multi-model pattern, and the one every production system should have, is a cross-provider fallback: if your primary model fails, fall back to an alternative instead of returning an error. A single-model deployment has no such escape hatch, so one provider outage takes the whole feature down.
    import openai

    # One OpenAI-compatible client serves every model behind the TokenMix endpoint.
    client = openai.OpenAI(
        base_url="https://api.tokenmix.ai/v1",
        api_key="your-tokenmix-api-key",
        timeout=30.0,
    )

    class FallbackChain:
        def __init__(self, models: list[str]):
            self.models = models

        def complete(self, messages: list, **kwargs) -> tuple[str, str]:
            """Try each model in order. Returns (response, model_used)."""
            last_error = None
            for model in self.models:
                try:
                    response = client.chat.completions.create(
                        model=model,
                        messages=messages,
                        **kwargs
                    )
                    return response.choices[0].message.content, model
                except openai.RateLimitError:
                    last_error = f"{model}: rate limited"
                    continue
                except openai.APITimeoutError:
                    last_error = f"{model}: timeout"
                    continue
                except openai.APIError as e:
                    last_error = f"{model}: {e.message}"
                    continue
            raise RuntimeError(f"All models failed. Last error: {last_error}")

    # Usage
    chain = FallbackChain([
        "claude-sonnet-4",   # Primary: best quality
        "gpt-4o",            # Fallback 1: different provider
        "gemini-2.0-flash",  # Fallback 2: fast and reliable
    ])
    result, model_used = chain.complete(
        messages=[{"role": "user", "content": "Analyze this error log..."}],
        max_tokens=500
    )
    print(f"Response from {model_used}: {result}")
Key design decisions:
- Order models by preference, not by cost. Your primary model should be the one that gives the best results for your use case.
- Include models from different providers. If one provider has an outage, another likely will not.
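- Set a reasonable timeout. 30 seconds is a sensible ceiling for long generations; waiting 60 seconds before trying the fallback defeats the purpose. For interactive traffic, a tighter per-attempt timeout keeps a hung primary from stacking into a long user-facing wait.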
Pattern 2: A/B Testing Between Models
Before committing to a model change, route 10-20% of real production traffic to a challenger model and compare its quality and latency against the control. Validating a migration on real traffic before flipping fully is much safer than switching outright. The pattern below sends a configurable percentage of requests to the challenger and records metrics for both variants.
    import random
    import time
    from dataclasses import dataclass, field

    # Reuses the `client` defined in Pattern 1.

    @dataclass
    class ABTestResult:
        model: str
        response: str
        latency_ms: float
        input_tokens: int
        output_tokens: int

    @dataclass
    class ABTest:
        control_model: str
        challenger_model: str
        challenger_pct: float = 0.1  # 10% traffic to challenger
        results: dict = field(default_factory=lambda: {"control": [], "challenger": []})

        def route(self) -> str:
            """Pick a model per request; the challenger gets challenger_pct of traffic."""
            if random.random() < self.challenger_pct:
                return self.challenger_model
            return self.control_model

        def complete_and_record(self, messages: list, **kwargs) -> ABTestResult:
            model = self.route()
            variant = "challenger" if model == self.challenger_model else "control"
            # perf_counter is monotonic, so the measurement survives clock adjustments.
            start = time.perf_counter()
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                **kwargs
            )
            latency = (time.perf_counter() - start) * 1000
            result = ABTestResult(
                model=model,
                response=response.choices[0].message.content,
                latency_ms=latency,
                input_tokens=response.usage.prompt_tokens,
                output_tokens=response.usage.completion_tokens
            )
            self.results[variant].append(result)
            return result

        def summary(self) -> dict:
            """Average latency and output length per variant."""
            summary = {}
            for variant in ["control", "challenger"]:
                results = self.results[variant]
                if not results:
                    continue
                summary[variant] = {
                    "model": results[0].model,
                    "count": len(results),
                    "avg_latency_ms": sum(r.latency_ms for r in results) / len(results),
                    "avg_output_tokens": sum(r.output_tokens for r in results) / len(results)
                }
            return summary

    # Usage
    test = ABTest(
        control_model="gpt-4o",
        challenger_model="claude-sonnet-4",
        challenger_pct=0.2  # Send 20% to challenger
    )
To make A/B testing actionable, you need quality metrics beyond just latency. Consider logging responses and running periodic evaluation with a judge model or human review.
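One way to make responses available for later judge-model or human review is to append each result to a JSON Lines log. The sketch below is only an illustration: the `LoggedResult` fields and the `append_jsonl` / `load_jsonl` helpers are hypothetical names, not part of the TokenMix API, and an in-memory buffer stands in for a real log file.

```python
import io
import json
import time
from dataclasses import asdict, dataclass
from typing import TextIO


@dataclass
class LoggedResult:
    """Hypothetical record shape; adapt the fields to your own metrics."""
    variant: str
    model: str
    prompt: str
    response: str
    latency_ms: float


def append_jsonl(result: LoggedResult, sink: TextIO) -> None:
    """Write one result as a single JSON line so logs can be streamed and appended."""
    record = asdict(result)
    record["logged_at"] = time.time()  # wall-clock timestamp for later filtering
    sink.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(source: TextIO) -> list[dict]:
    """Read the log back for an offline evaluation pass."""
    return [json.loads(line) for line in source if line.strip()]


# Demo with an in-memory buffer; in production this would be a file or log pipeline.
sink = io.StringIO()
append_jsonl(LoggedResult("control", "gpt-4o", "Explain TCP.", "TCP is ...", 812.4), sink)
append_jsonl(LoggedResult("challenger", "claude-sonnet-4", "Explain TCP.", "TCP provides ...", 934.1), sink)
sink.seek(0)
records = load_jsonl(sink)
print([r["variant"] for r in records])  # ['control', 'challenger']
```

Storing the prompt next to the response lets a periodic evaluation job score both variants on identical inputs without calling the models again.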
Pattern 3: Quality Scoring and Automatic Selection
For tasks where output quality varies significantly between models, generate candidates from several models in parallel and let a lightweight judge model score them, then return the highest-scoring one. With three candidates plus the judge calls, this costs roughly 3.5x a single request, so it fits high-value outputs where the quality lift justifies the spend.
    import time
    from concurrent.futures import ThreadPoolExecutor

    # Reuses the `client` defined in Pattern 1.

    def generate_candidates(messages: list, models: list[str], **kwargs) -> list[dict]:
        """Generate responses from multiple models in parallel."""
        def call_model(model):
            start = time.time()
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                    **kwargs
                )
                return {
                    "model": model,
                    "response": response.choices[0].message.content,
                    "latency_ms": (time.time() - start) * 1000,
                    "tokens": response.usage.completion_tokens
                }
            except Exception as e:
                return {"model": model, "error": str(e)}

        # One worker thread per model so all calls run concurrently.
        with ThreadPoolExecutor(max_workers=len(models)) as executor:
            results = list(executor.map(call_model, models))
        # Drop failed candidates; the caller only scores successful responses.
        return [r for r in results if "error" not in r]

    def score_response(original_prompt: str, response: str) -> float:
        """Score a response using a fast judge model."""
        judge_response = client.chat.completions.create(
            model="gemini-2.0-flash",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rate this AI response on a scale of 1-10. Consider:\n"
                        "- Accuracy and correctness\n"
                        "- Completeness (addresses the full question)\n"
                        "- Clarity and conciseness\n"
                        "Respond with only a number."
                    )
                },
                {
                    "role": "user",
                    "content": f"Prompt: {original_prompt}\n\nResponse: {response}"
                }
            ],
            max_tokens=5,
            temperature=0
        )
        try:
            # content can be None, so guard before stripping.
            return float((judge_response.choices[0].message.content or "").strip())
        except ValueError:
            return 5.0  # Default middle score when the judge reply is not a number

    def best_of_n(messages: list, models: list[str], **kwargs) -> dict:
        """Generate from multiple models, return the highest-scored response."""
        candidates = generate_candidates(messages, models, **kwargs)
        if not candidates:
            raise RuntimeError("All models failed")
        prompt = messages[-1]["content"] if messages else ""
        for candidate in candidates:
            candidate["score"] = score_response(prompt, candidate["response"])
        return max(candidates, key=lambda c: c["score"])

    # Usage: get the best response from three models
    best = best_of_n(
        messages=[{"role": "user", "content": "Write a technical summary of WebSockets."}],
        models=["claude-sonnet-4", "gpt-4o", "gemini-2.0-flash"]
    )
    print(f"Best response from {best['model']} (score: {best['score']}):")
    print(best["response"])
This pattern costs more per request (you are calling multiple models plus a judge), so use it selectively for high-value outputs where quality matters more than cost.
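Judge models do not always follow a "respond with only a number" instruction, and a reply such as "7.5/10" would fall through to the default score. A more tolerant parser is one possible refinement. The sketch below uses a hypothetical `parse_score` helper and returns `None` when no number is present, so the caller can decide whether to retry, skip the candidate, or apply a default.

```python
import re
from typing import Optional

# Matches the first integer or decimal number in the judge's reply.
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_score(raw: Optional[str], low: float = 1.0, high: float = 10.0) -> Optional[float]:
    """Extract the first number from a judge reply and clamp it to [low, high]."""
    if not raw:
        return None
    match = _NUMBER.search(raw)
    if match is None:
        return None  # no usable score in the reply
    return min(high, max(low, float(match.group())))


print(parse_score("8"))                    # 8.0
print(parse_score("Score: 7.5/10"))        # 7.5
print(parse_score("I cannot rate this."))  # None
```

Clamping keeps an out-of-range reply from dominating the `max()` selection across candidates.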
Pattern 4: Graceful Degradation with Circuit Breakers
In production, you need more than simple fallbacks. A circuit breaker prevents cascading failures by detecting when a model is consistently failing, taking it out of rotation after N consecutive failures, and automatically testing recovery after a cooldown. High-availability systems generally need this on top of a fallback chain.
    import time
    from dataclasses import dataclass
    from enum import Enum

    # Reuses `openai` and the `client` defined in Pattern 1.

    class CircuitState(Enum):
        CLOSED = "closed"        # Normal operation
        OPEN = "open"            # Model disabled
        HALF_OPEN = "half_open"  # Testing recovery

    @dataclass
    class CircuitBreaker:
        model: str
        failure_threshold: int = 5      # Failures before opening
        recovery_timeout: float = 60.0  # Seconds before testing recovery
        state: CircuitState = CircuitState.CLOSED
        failure_count: int = 0
        last_failure_time: float = 0

        def can_execute(self) -> bool:
            if self.state == CircuitState.CLOSED:
                return True
            if self.state == CircuitState.OPEN:
                # After the cooldown, let traffic through again to probe recovery.
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = CircuitState.HALF_OPEN
                    return True
                return False
            # HALF_OPEN: requests pass until one succeeds (closes) or fails (re-opens).
            return True

        def record_success(self):
            self.failure_count = 0
            self.state = CircuitState.CLOSED

        def record_failure(self):
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

    class ResilientMultiModel:
        def __init__(self, models: list[str]):
            # Dict order is insertion order, so models are tried in the order given.
            self.breakers = {m: CircuitBreaker(model=m) for m in models}

        def complete(self, messages: list, **kwargs) -> tuple[str, str]:
            for model, breaker in self.breakers.items():
                if not breaker.can_execute():
                    continue
                try:
                    response = client.chat.completions.create(
                        model=model,
                        messages=messages,
                        **kwargs
                    )
                    breaker.record_success()
                    return response.choices[0].message.content, model
                except openai.APIError:
                    # RateLimitError and APITimeoutError are subclasses of APIError.
                    breaker.record_failure()
                    continue
            raise RuntimeError("All models failed or are in circuit-open state")

        def status(self) -> dict:
            return {
                model: {
                    "state": breaker.state.value,
                    "failures": breaker.failure_count
                }
                for model, breaker in self.breakers.items()
            }

    # Usage
    system = ResilientMultiModel([
        "claude-sonnet-4",
        "gpt-4o",
        "gemini-2.0-flash"
    ])

    # In your request handler
    result, model = system.complete(
        messages=[{"role": "user", "content": "Hello!"}]
    )

    # Monitor circuit breaker states
    print(system.status())
Which Pattern Should You Pick?
The table below compares the four patterns by use case, cost impact, and complexity.
| Pattern | When to Use | Cost Impact | Complexity |
|---|---|---|---|
| Fallback Chain | Every production system | None (you pay only for the successful call) | Low |
| A/B Testing | Before model migrations | Low (small traffic percentage) | Medium |
| Quality Scoring | High-value outputs | High (3x+ per request) | Medium |
| Circuit Breaker | High-availability systems | None | Medium |
Start with the Fallback Chain. It is the highest-value, lowest-effort pattern. Add Circuit Breakers when you need high availability. Use A/B Testing before any model change. Reserve Quality Scoring for your most important outputs.
Why Does a Unified API Matter Here?
All four patterns depend on one thing: calling different models through the same code path. Without a unified API, each pattern would require managing multiple SDKs, API keys, authentication methods, and response formats, and that integration cost multiplies with every pattern you add. TokenMix removes this complexity by providing a single OpenAI-compatible endpoint for every model.
This is not just a convenience — it is an architectural enabler. Multi-model patterns become practical when switching models is a one-line change instead of a week-long integration project.
Production Checklist
Verify these seven items before deploying a multi-model system; skipping any of them leaves the system fragile:
- Log which model handled each request (essential for debugging)
- Monitor per-model latency and error rates separately
- Set up alerts for circuit breaker state changes
- Test fallback behavior by intentionally using an invalid model name
- Ensure your system prompts work well across all models in your pool
- Verify that response parsing handles format differences between models
- Run A/B tests before promoting any model change to 100% traffic
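The first two checklist items (recording which model handled each request, and tracking latency and error rates per model) can be covered by a small in-process registry. This is a minimal sketch with hypothetical `ModelStats` and `MetricsRegistry` names. It is not thread-safe, and a real deployment would more likely export these counters to an existing metrics system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelStats:
    calls: int = 0
    errors: int = 0
    total_latency_ms: float = 0.0  # summed over successful calls only

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def avg_latency_ms(self) -> float:
        successes = self.calls - self.errors
        return self.total_latency_ms / successes if successes else 0.0


class MetricsRegistry:
    """Keeps separate counters per model name."""

    def __init__(self) -> None:
        self._stats: dict[str, ModelStats] = defaultdict(ModelStats)

    def record_success(self, model: str, latency_ms: float) -> None:
        stats = self._stats[model]
        stats.calls += 1
        stats.total_latency_ms += latency_ms

    def record_error(self, model: str) -> None:
        stats = self._stats[model]
        stats.calls += 1
        stats.errors += 1

    def snapshot(self) -> dict[str, dict[str, float]]:
        return {
            model: {
                "calls": s.calls,
                "error_rate": s.error_rate,
                "avg_latency_ms": s.avg_latency_ms,
            }
            for model, s in self._stats.items()
        }


# Demo with made-up latencies.
registry = MetricsRegistry()
registry.record_success("gpt-4o", 820.0)
registry.record_success("gpt-4o", 1180.0)
registry.record_error("gpt-4o")
registry.record_success("claude-sonnet-4", 950.0)
print(registry.snapshot())
```

Calling `record_success` or `record_error` at the same points where the circuit breaker records its outcomes keeps the two views consistent.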
FAQ
Which multi-model pattern should I implement first?
Start with the Fallback Chain. It ships in under an hour, immediately protects you from single-provider outages, and gives you the operational data (latency, error rate per model) you need to build the other patterns confidently. Add monitoring next, then A/B testing or quality scoring depending on what your business needs.
How much latency overhead does a fallback chain add?
Zero on the happy path — the primary model serves the request normally. On failure, latency adds one extra round trip per fallback attempt (~1-3s depending on the next model's TTFT). Set a 5-10s timeout per attempt so a hung primary does not stack into a 30+ second user-facing wait.
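A per-attempt timeout can be combined with an overall deadline so the total wait stays bounded however many fallbacks exist. The sketch below is illustrative: `complete_within_budget` is a hypothetical helper, and each attempt is a plain callable that accepts a timeout in seconds. With the OpenAI Python SDK such a callable would typically wrap something like `client.with_options(timeout=t).chat.completions.create(...)`, which is an assumption about your setup rather than something shown in this guide.

```python
import time
from typing import Callable, Sequence


def complete_within_budget(
    attempts: Sequence[tuple[str, Callable[[float], str]]],
    per_attempt_timeout: float = 8.0,
    total_budget: float = 20.0,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[str, str]:
    """Try each (name, call) pair in order without exceeding total_budget seconds."""
    deadline = clock() + total_budget
    last_error = None
    for name, call in attempts:
        remaining = deadline - clock()
        if remaining <= 0:
            break  # budget spent; stop instead of starting another attempt
        try:
            # Never give an attempt more time than is left in the overall budget.
            return call(min(per_attempt_timeout, remaining)), name
        except Exception as exc:  # a real system would catch specific SDK errors
            last_error = exc
    raise RuntimeError(f"No model answered within budget; last error: {last_error!r}")


# Demo with stand-in callables instead of real API calls.
def hung_primary(timeout: float) -> str:
    raise TimeoutError(f"no reply within {timeout}s")


def healthy_backup(timeout: float) -> str:
    return "pong"


text, used = complete_within_budget([("primary", hung_primary), ("backup", healthy_backup)])
print(used, text)  # backup pong
```

With the defaults shown, an 8-second attempt timeout and a 20-second budget, the user-facing wait never exceeds 20 seconds even with three or more models in the chain.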
Is best-of-N quality scoring worth the 3x cost?
Worth it for high-stakes outputs (legal, medical, financial summaries) where one bad response costs more than the inference. Not worth it for chat, casual content, or anything cheap to re-prompt. Math: if a bad output costs $10 in human review and best-of-3 adds $0.06 per request, it pays for itself once it cuts the bad-output rate by more than 0.6 percentage points ($0.06 / $10 = 0.006).
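The break-even arithmetic generalizes to any pair of costs. As a small sketch (the function name is illustrative), the threshold is the extra spend per request divided by the cost of one bad output:

```python
def break_even_improvement(extra_cost_per_request: float, cost_of_bad_output: float) -> float:
    """Minimum absolute drop in bad-output rate at which best-of-N pays for itself."""
    if cost_of_bad_output <= 0:
        raise ValueError("cost_of_bad_output must be positive")
    return extra_cost_per_request / cost_of_bad_output


# Figures from the example above: $0.06 extra per request, $10 per bad output.
threshold = break_even_improvement(0.06, 10.0)
print(f"{threshold:.2%}")  # 0.60%
```

If measured quality gains from an A/B test fall below this threshold, the extra calls cost more than the review time they save.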
When should I use a circuit breaker vs a simple retry?
Use a circuit breaker once you have 2+ providers and want to avoid hammering an outage-affected endpoint. Simple retries work for transient 429s and 500s on a single provider. Combine them: retry 2-3 times on transient errors, trip the breaker after 5-10 consecutive failures, fail over to the next provider.
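The retry half of that combination might look like the sketch below. `TransientError` is a hypothetical stand-in for whatever your SDK raises on retryable 429 and 500 responses, and `retry_transient` is an illustrative helper. The idea is that only an exhausted retry sequence is reported to the circuit breaker as one failure.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for retryable errors such as HTTP 429 or 500."""


def retry_transient(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: let the caller record a breaker failure
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
    raise AssertionError("unreachable")


# Demo: a call that fails twice, then succeeds. Sleeps are captured, not executed.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("simulated 429")
    return "ok"


waits: list[float] = []
print(retry_transient(flaky, max_attempts=3, base_delay=0.5, sleep=waits.append))  # ok
```

Injecting `sleep` keeps the helper testable without real delays; in production the default `time.sleep` applies.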
How do I A/B test models without introducing user-facing bias?
Hash user IDs and assign to model groups stably — the same user always sees the same model during the test. Track per-user metrics (satisfaction, retention, task completion) instead of per-call metrics. Run for at least 7 days to capture weekday/weekend behavior differences before drawing conclusions.
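Stable assignment can be sketched with a salted hash of the user ID. The `assign_variant` helper and the experiment name below are hypothetical; the technique maps each user to one of 10,000 buckets and sends the lowest buckets to the challenger.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, challenger_pct: float) -> str:
    """Deterministically map a user to 'control' or 'challenger'."""
    # Salting with the experiment name gives each test an independent split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return "challenger" if bucket < challenger_pct * 10_000 else "control"


# The same user always lands in the same group for a given experiment.
first = assign_variant("user-42", "sonnet-vs-4o", 0.2)
second = assign_variant("user-42", "sonnet-vs-4o", 0.2)
print(first == second)  # True

# Across many users, the challenger share approaches the configured percentage.
users = [f"user-{i}" for i in range(10_000)]
share = sum(assign_variant(u, "sonnet-vs-4o", 0.2) == "challenger" for u in users) / len(users)
print(f"challenger share: {share:.3f}")
```

Unlike the per-request `random.random()` routing in Pattern 2, this keeps each user's experience consistent for the duration of the test, which per-user metrics require.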
Can I combine all four patterns in one application?
Yes, and most mature production systems do. Typical stack: primary model wrapped in a circuit breaker, fallback chain for outages, A/B testing on a 5-10% user slice for new model evaluation, and best-of-N quality scoring on a defined subset of high-value endpoints. Add each pattern only after the previous one is stable.
What's the cheapest way to prototype these patterns?
Use a unified OpenAI-compatible endpoint like TokenMix — all major models share one API key and SDK, so the only code change per pattern is the model parameter. No new accounts, no separate billing setup per provider. Implement the pattern once, swap models freely, and ship the rollback in one line.
Authoritative References
- OpenAI Platform · Text Generation Guide
- Anthropic Docs · Building with Claude
- Artificial Analysis · LLM Comparison