How to Save Up to 80% on AI API Costs
TokenMix Team · 2026-03-09

Most teams overspend on AI APIs not because they use too many tokens, but because they send every request to the most expensive model. After optimizing AI costs across multiple production systems, we have found that the biggest wins come from three areas: routing requests to the right model, caching intelligently, and compressing prompts. Here is a practical breakdown of each.
1. Intelligent Model Routing
The core idea is simple: not every request needs your most capable (and expensive) model. A customer asking "What are your business hours?" does not need the same reasoning power as "Analyze this contract for liability risks."
Building a Request Classifier
The most effective approach is a lightweight classifier that categorizes requests before routing them:
```python
import openai

client = openai.OpenAI(
    base_url="https://api.tokenmix.ai/v1",
    api_key="your-tokenmix-api-key"
)

def classify_request(user_message: str) -> str:
    """Classify request complexity using a fast, cheap model."""
    response = client.chat.completions.create(
        model="gemini-2.0-flash",  # Fast and inexpensive
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify this user request into exactly one category:\n"
                    "SIMPLE - greetings, FAQs, factual lookups\n"
                    "MODERATE - summarization, translation, basic analysis\n"
                    "COMPLEX - multi-step reasoning, code generation, creative writing\n"
                    "Respond with only the category name."
                )
            },
            {"role": "user", "content": user_message}
        ],
        max_tokens=10,
        temperature=0
    )
    return response.choices[0].message.content.strip()

def route_to_model(user_message: str) -> str:
    """Route to the most cost-effective model for the task."""
    complexity = classify_request(user_message)
    model_map = {
        "SIMPLE": "gemini-2.0-flash",
        "MODERATE": "gpt-4o",
        "COMPLEX": "claude-sonnet-4"
    }
    return model_map.get(complexity, "gpt-4o")
```
In production, you will want to refine the classifier with logged examples. After a week of traffic, export misrouted requests and fine-tune the classification prompt. Some teams build a small embedding-based classifier instead, which avoids the LLM call entirely.
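The embedding-based variant can be sketched as a nearest-centroid classifier. This is illustrative only: `build_centroids`, `classify_by_embedding`, and the `embed` callable are hypothetical names, and `embed` stands in for whatever embedding call you already use.

```python
import numpy as np

def build_centroids(labeled_examples: dict, embed) -> dict:
    """Average the embeddings of logged examples per category."""
    return {
        label: np.mean([embed(text) for text in texts], axis=0)
        for label, texts in labeled_examples.items()
    }

def classify_by_embedding(message: str, centroids: dict, embed) -> str:
    """Pick the category whose centroid is most similar to the message."""
    vec = np.asarray(embed(message))

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

Once the centroids are computed from a week of labeled traffic, each classification is a single embedding lookup plus a few dot products, with no LLM call in the hot path.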
Cost Impact
From real deployments, roughly 40-50% of requests in a typical chatbot are SIMPLE, 30-35% are MODERATE, and only 15-25% are COMPLEX. Routing alone often cuts costs by 50-60% with no perceptible quality loss.
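You can sanity-check that claim with back-of-envelope arithmetic. The per-1K-token prices below are purely illustrative placeholders (check the pricing page for real rates); the point is the blended-cost calculation:

```python
# Hypothetical per-1K-token prices -- illustrative only, not real rates
PRICE = {"gemini-2.0-flash": 0.0004, "gpt-4o": 0.01, "claude-sonnet-4": 0.015}

def blended_cost(mix: dict, tokens_per_request: int = 1000) -> float:
    """Expected cost per request given a traffic mix {model: share}."""
    return sum(
        share * PRICE[model] * tokens_per_request / 1000
        for model, share in mix.items()
    )

# Traffic mix from the distribution above: 45% SIMPLE, 33% MODERATE, 22% COMPLEX
routed = blended_cost({"gemini-2.0-flash": 0.45, "gpt-4o": 0.33, "claude-sonnet-4": 0.22})
everything_premium = blended_cost({"claude-sonnet-4": 1.0})
savings = 1 - routed / everything_premium  # roughly 55% with these placeholder prices
```

With these placeholder prices the routed mix comes out around 55% cheaper than sending everything to the premium model, consistent with the 50-60% range above.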
2. Semantic Caching
Traditional caching matches exact strings. But users ask the same question in dozens of ways: "How do I reset my password?", "I forgot my password", "Password reset help". Semantic caching matches by meaning.
Implementation Architecture
```python
from typing import Optional

import numpy as np

class SemanticCache:
    def __init__(self, client, similarity_threshold=0.92):
        self.client = client
        self.threshold = similarity_threshold
        self.cache = {}  # In production, use Redis + vector DB

    def get_embedding(self, text: str) -> list:
        """Generate embedding for semantic matching."""
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def cosine_similarity(self, a: list, b: list) -> float:
        a, b = np.array(a), np.array(b)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def lookup(self, query: str) -> Optional[str]:
        """Find a semantically similar cached response."""
        query_embedding = self.get_embedding(query)
        best_match = None
        best_score = 0

        for cached_query, (embedding, response) in self.cache.items():
            score = self.cosine_similarity(query_embedding, embedding)
            if score > best_score:
                best_score = score
                best_match = response

        if best_score >= self.threshold:
            return best_match
        return None

    def store(self, query: str, response: str):
        embedding = self.get_embedding(query)
        self.cache[query] = (embedding, response)
```
A few practical notes:
- Set the similarity threshold carefully. 0.92+ works well for factual queries. For creative tasks, you may want to disable caching entirely.
- Add a TTL (time-to-live) to cached entries. Stale responses are worse than expensive fresh ones.
- Cache at the conversation level, not just individual messages. The same question with different conversation history may need different answers.
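The TTL from the second note can be added with a small expiry wrapper around the cached entries. A minimal sketch (the `TTLStore` name and one-hour default are illustrative):

```python
import time

class TTLStore:
    """Dict-like store whose entries expire after a fixed TTL."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self.entries = {}

    def store(self, query: str, value):
        # Record the absolute expiry time alongside the value
        self.entries[query] = (time.monotonic() + self.ttl, value)

    def get(self, query: str):
        entry = self.entries.get(query)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self.entries[query]  # evict the stale entry
            return None
        return value
```

Swapping this in for the plain dict in `SemanticCache` means a stale hit simply falls through to a fresh LLM call, which is the failure mode you want.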
Cache Hit Rates
In customer support bots, expect 20-35% cache hit rates. In documentation Q&A systems, this can reach 40-60%. Each cache hit saves the full cost of an LLM call.
3. Prompt Compression
Long system prompts and conversation histories are the silent cost killer. Here are three techniques that work:
Trim Conversation History
Instead of sending the entire conversation, keep only the last N turns plus a summary of earlier context:
```python
def trim_history(system_msgs: list, conversation: list, max_turns: int = 6) -> list:
    """Keep the last N turns verbatim; summarize earlier context cheaply."""
    # Helper name and max_turns default are illustrative
    if len(conversation) <= max_turns:
        return [*system_msgs, *conversation]

    old_messages = conversation[:-max_turns]
    recent_messages = conversation[-max_turns:]

    # Summarize old context with a cheap model
    summary = client.chat.completions.create(
        model="gemini-2.0-flash",
        messages=[
            {"role": "system", "content": "Summarize this conversation in 2-3 sentences."},
            *old_messages
        ],
        max_tokens=150
    ).choices[0].message.content

    return [
        *system_msgs,
        {"role": "system", "content": f"Previous conversation summary: {summary}"},
        *recent_messages
    ]
```
Optimize System Prompts
Most system prompts are 2-3x longer than they need to be. Run this exercise: take your system prompt, remove every sentence, and add each back only if output quality measurably drops. You will typically cut 40-60% of tokens.
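That ablation exercise can be automated once you have an eval harness. A greedy sketch, where `quality_fn` is a hypothetical stand-in for your own quality metric (e.g. pass rate on a golden set):

```python
def ablate_prompt(sentences: list, quality_fn, tolerance: float = 0.02) -> list:
    """Greedily drop each sentence unless quality measurably falls."""
    kept = list(sentences)
    baseline = quality_fn(kept)
    for sentence in list(kept):
        trial = [s for s in kept if s != sentence]
        if quality_fn(trial) >= baseline - tolerance:
            kept = trial  # sentence didn't pull its weight; drop it
    return kept
```

Greedy ablation is order-dependent and `quality_fn` calls cost money, so run it offline against a fixed eval set rather than in production.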
4. Monitoring and Alerting for Cost Spikes
Cost optimization is not a one-time effort. Without monitoring, a single bad deployment or prompt change can erase weeks of savings.
What to Track
- **Cost per request** by model and endpoint. This is your primary metric.
- **Token-to-value ratio**: tokens consumed vs. task completion rate. If you are spending more tokens but not getting better results, something is wrong.
- **Cache hit rate trends**. A sudden drop means your user patterns changed or your cache is misconfigured.
- **P95 token counts per endpoint**. Outliers reveal runaway prompts or recursive calls.
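For the P95 metric, the nearest-rank method is enough if you are not already using a stats library. A minimal sketch (function name is illustrative):

```python
import math

def p95(token_counts: list) -> int:
    """Nearest-rank 95th percentile of per-request token counts."""
    ranked = sorted(token_counts)
    rank = math.ceil(0.95 * len(ranked))  # 1-based nearest rank
    return ranked[rank - 1]
```

Compute it per endpoint over a rolling window; a P95 that drifts upward while the median stays flat is the classic signature of a runaway prompt.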
Setting Up Alerts
```python
import time
from collections import defaultdict

class CostMonitor:
    def __init__(self, alert_callback):
        self.hourly_costs = defaultdict(float)
        self.alert_callback = alert_callback
        self.hourly_budget = 10.0  # USD per hour threshold

    def record(self, model: str, input_tokens: int, output_tokens: int):
        hour_key = time.strftime("%Y-%m-%d-%H")
        # Estimate cost (check TokenMix pricing page for current rates)
        estimated_cost = (input_tokens + output_tokens) * 0.00001
        self.hourly_costs[hour_key] += estimated_cost

        if self.hourly_costs[hour_key] > self.hourly_budget:
            self.alert_callback(
                f"Cost spike detected: ${self.hourly_costs[hour_key]:.2f} "
                f"in hour {hour_key} (budget: ${self.hourly_budget})"
            )
```
TokenMix provides usage analytics in the dashboard that track per-key and per-model spending. Use these as your source of truth and set up alerts on daily spending thresholds.
5. Platform-Level Savings
Beyond code-level optimizations, your choice of API provider matters:
- **Unified access eliminates overhead.** With TokenMix, you access all major models through a single API key and endpoint. No managing separate accounts, billing, and SDKs for each provider.
- **Pay-as-you-go with no minimums.** You only pay for what you use. Check the pricing page for current per-model rates.
- **Switch models without code changes.** When a cheaper model becomes good enough for your use case, switching is a one-line change.
Putting It All Together
The highest-impact optimizations in order:
1. **Model routing** (50-60% savings, moderate effort)
2. **Prompt compression** (20-30% savings, low effort)
3. **Semantic caching** (15-35% savings depending on use case, higher effort)
4. **Monitoring** (prevents regression, essential for all of the above)
Start with routing. It has the best effort-to-savings ratio, and you can implement a basic version in an afternoon. Then add monitoring before you optimize further, because you cannot improve what you cannot measure.
The real lesson from optimizing AI costs in production: the goal is not to spend less, but to spend smarter. Every dollar saved on a simple query is a dollar you can invest in using the best model where it actually matters.