TokenMix Research Lab · 2026-04-24
Cerebras API Key: Access, Pricing, Speed Tests 2026
Cerebras is the fastest LLM inference provider in 2026, delivering 1,800+ tokens/second on Llama 3.3 70B thanks to its wafer-scale CS-3 chip architecture. Compared with typical GPU inference at 50-200 tok/s, that is roughly 9-36× faster. This guide covers how to get a Cerebras API key (5-minute signup), pricing ($0.10-$3.90 per MTok depending on model size), supported models (Llama 3.3 70B, Llama 3.1 405B and 8B, Qwen3, DeepSeek distills), speed benchmarks, and when the speed is worth the price premium, plus a Cerebras vs Groq comparison. TokenMix.ai routes Cerebras alongside 300+ other models.
Table of Contents
- Confirmed vs Speculation
- Cerebras Signup (5 Minutes)
- Pricing by Model
- Speed Benchmarks
- Cerebras vs Groq
- When Speed Is Worth The Cost
- FAQ
Confirmed vs Speculation
| Claim | Status |
|---|---|
| Cerebras produces 1800+ tok/s on 70B | Confirmed |
| CS-3 wafer-scale chip architecture | Confirmed |
| Llama 3.3 70B / Llama 3.1 405B available | Confirmed |
| OpenAI-compatible API | Yes |
| Free tier available | Yes (rate-limited) |
| Faster than Groq | Usually yes for same model |
| Price premium vs standard GPU inference | Yes ~2-3× |
Snapshot note (2026-04-24): Throughput numbers (1800 tok/s on Llama 3.3 70B, etc.) are Cerebras-reported plus community reproductions. Actual throughput varies with prompt length, concurrency, and Cerebras cluster state. Pricing is current per the cerebras.ai pricing page; Cerebras has revised tier pricing twice in 2026 as it scaled capacity. Competitor comparison numbers (Groq 550 tok/s, Together 150 tok/s) are midpoint observations and drift as each provider upgrades infrastructure.
Cerebras Signup (5 Minutes)
- Go to cloud.cerebras.ai
- Sign up with email
- Add billing details (accounts start on the free tier; a card is required for production use)
- Navigate to API Keys → create new
- Copy key (shown once)
First API call:
```bash
curl https://api.cerebras.ai/v1/chat/completions \
  -H "Authorization: Bearer $CEREBRAS_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b",
    "messages": [{"role":"user","content":"Hello"}]
  }'
```
The response returns in under 1 second thanks to the high throughput.
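The same first call can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: it assumes the endpoint and model name from the curl example, that the key is stored in the `CEREBRAS_KEY` environment variable, and that the response follows the OpenAI chat completions shape (`choices[0].message.content`), which is an inference from the API being described as OpenAI-compatible. The helper names `build_request` and `send_chat` are hypothetical.

```python
import json
import os
import urllib.request

# Endpoint taken from the curl example above.
API_URL = "https://api.cerebras.ai/v1/chat/completions"


def build_request(prompt, model="llama-3.3-70b", api_key=None):
    """Build (but do not send) the chat completion request."""
    # Assumption: the key lives in the CEREBRAS_KEY environment variable.
    key = api_key or os.environ.get("CEREBRAS_KEY", "")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # Assumption: OpenAI-style response shape.
    return payload["choices"][0]["message"]["content"]
```

Calling `print(send_chat("Hello"))` with a valid key set should print the model's reply. Building the request separately from sending it keeps the payload easy to inspect without touching the network.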
Pricing by Model
| Model | Input $/MTok | Output $/MTok | Throughput |
|---|---|---|---|
| llama-3.3-70b | $0.60 | $0.60 | 1800 tok/s |
| llama-3.3-70b-instruct | $0.60 | $0.60 | 1800 tok/s |
| llama-3.1-405b | $3.00 | $3.90 | 969 tok/s |
| llama-3.1-8b | $0.10 | $0.10 | 2200 tok/s |
| qwen-3-32b | $0.50 | $0.50 | 1200 tok/s |
| deepseek-r1-distill-70b | $0.60 | $2.40 | 1500 tok/s |
Free tier: limited requests/day on smaller models. Check dashboard for current limits.
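To turn the table into a budget figure, the per-MTok prices can be multiplied out for a given workload. The sketch below copies the prices from the table above; the workload (10,000 requests at 1,500 input and 500 output tokens each) is a hypothetical example, and `estimate_cost` is an illustrative helper rather than anything Cerebras provides.

```python
# Prices in USD per million tokens (input, output), copied from the pricing table.
PRICES = {
    "llama-3.3-70b": (0.60, 0.60),
    "llama-3.3-70b-instruct": (0.60, 0.60),
    "llama-3.1-405b": (3.00, 3.90),
    "llama-3.1-8b": (0.10, 0.10),
    "qwen-3-32b": (0.50, 0.50),
    "deepseek-r1-distill-70b": (0.60, 2.40),
}


def estimate_cost(model, input_tokens, output_tokens):
    """Return the USD cost of one request (or one aggregate token count)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10,000 requests, 1,500 input + 500 output tokens each.
    for model in PRICES:
        total = 10_000 * estimate_cost(model, 1_500, 500)
        print(f"{model:26s} ${total:,.2f}")
```

Because the table prices change (see the snapshot note), the `PRICES` dictionary should be refreshed from the current pricing page before relying on the totals.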
Speed Benchmarks
Measured April 2026 with Llama 3.3 70B on each provider, except the OpenAI and Anthropic rows, which use the model named in the row:
| Provider | Throughput (tok/s) | TTFT | Hardware |
|---|---|---|---|
| Cerebras | 1,800+ | <100ms | CS-3 wafer |
| Groq | 550 | <100ms | LPU |
| Together.ai | 150 | 200ms | H100 |
| Fireworks | 180 | 150ms | H100 |
| OpenAI (GPT-5.4) | ~80 | 300ms | H100 |
| Anthropic (Opus 4.7) | ~60 | 400ms | varies |
Cerebras is about 3× faster than Groq and 10-12× faster than the H100-based Llama hosts (Together.ai, Fireworks).
For use cases where throughput matters (real-time voice, multi-agent loops with many turns, batch processing), the speed advantage compounds across calls.
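The ratios and per-response times implied by the benchmark table can be checked with a few lines of arithmetic. This sketch models response time as TTFT plus output tokens divided by throughput, which is a simplification (it ignores network and queueing time), and it treats the table's "<100ms" TTFT as 0.1 s, an upper-bound assumption.

```python
# (throughput in tok/s, TTFT in seconds) from the benchmark table.
# Assumption: "<100ms" is treated as 0.10 s.
BENCH = {
    "Cerebras": (1800, 0.10),
    "Groq": (550, 0.10),
    "Together.ai": (150, 0.20),
    "Fireworks": (180, 0.15),
}


def speedup(provider, baseline):
    """Throughput ratio of provider over baseline."""
    return BENCH[provider][0] / BENCH[baseline][0]


def response_time(provider, output_tokens):
    """Estimated seconds to stream a full response: TTFT + generation time."""
    throughput, ttft = BENCH[provider]
    return ttft + output_tokens / throughput


if __name__ == "__main__":
    for name in BENCH:
        print(f"{name:12s} {speedup('Cerebras', name):5.1f}x  "
              f"500 tokens in {response_time(name, 500):.2f} s")
```

Under these assumptions a 500-token response takes roughly 0.4 s on Cerebras and about 3.5 s on Together.ai.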
Cerebras vs Groq
| Dimension | Cerebras | Groq |
|---|---|---|
| Speed (70B model) | 1800 tok/s | 550 tok/s |
| Pricing (70B, input/output $/MTok) | $0.60/$0.60 | $0.59/$0.79 |
| Supported models | Llama, Qwen, DeepSeek distills | Llama, Mixtral, Qwen |
| API maturity | Production | Production |
| Free tier | Yes | Yes (generous) |
| Custom silicon | CS-3 wafer | LPU |
| First-token latency | <100ms | <100ms |
Pick Cerebras if absolute speed is critical. Pick Groq if you want a slightly larger model catalog and a generous free tier and can accept roughly a third of the throughput. Both are strong choices for speed-sensitive workloads.
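The 70B prices in the comparison are close enough that which provider is cheaper depends on the input-to-output token mix. The sketch below uses only the two price pairs from the table and solves for the break-even ratio; the function names are hypothetical, and the break-even formula assumes the two input prices differ.

```python
# (input, output) USD per million tokens for the 70B model, from the comparison table.
CEREBRAS = (0.60, 0.60)
GROQ = (0.59, 0.79)


def request_cost(prices, input_tokens, output_tokens):
    """USD cost of a request at the given (input, output) prices."""
    return (input_tokens * prices[0] + output_tokens * prices[1]) / 1_000_000


def breakeven_input_output_ratio(a, b):
    """Input:output token ratio at which providers a and b cost the same.

    Solves a_in*i + a_out*o == b_in*i + b_out*o for i/o.
    Assumes the input prices differ (otherwise this divides by zero).
    """
    return (b[1] - a[1]) / (a[0] - b[0])


if __name__ == "__main__":
    ratio = breakeven_input_output_ratio(CEREBRAS, GROQ)
    print(f"Groq is cheaper only above ~{ratio:.0f} input tokens per output token")
```

With these prices the break-even is about 19 input tokens per output token, so Cerebras is the cheaper of the two for most chat-style traffic, while very prompt-heavy requests with short answers tip slightly toward Groq.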
When Speed Is Worth The Cost
Cerebras costs more than GPU-based inference (2-3× the price of hosted DeepSeek). The premium is worth it for:
- Real-time voice agents: a sub-300ms full loop needs fast inference
- Multi-agent workflows: latency compounds across 5+ LLM calls
- Interactive RAG: users expect a fast response despite retrieval overhead
- Live coding assistance: inline completion speed directly affects UX
- High-concurrency batch: 1000 queries finish in less wall-clock time thanks to the higher throughput
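How latency compounds in the voice and multi-agent cases can be sketched with the benchmark figures. The model below is a simplification: each call costs TTFT plus generation time, calls run strictly one after another, and speech recognition, retrieval, and network overhead are ignored. The loop shape (5 calls of 200 output tokens) is a hypothetical example.

```python
def loop_latency(calls, tokens_per_call, throughput_tps, ttft_s):
    """Total seconds for `calls` sequential LLM calls of equal size."""
    return calls * (ttft_s + tokens_per_call / throughput_tps)


def tokens_within_budget(budget_s, throughput_tps, ttft_s):
    """Output tokens that fit in a latency budget after paying TTFT."""
    return max(0.0, (budget_s - ttft_s) * throughput_tps)


if __name__ == "__main__":
    # Hypothetical agent loop: 5 sequential calls, 200 output tokens each.
    fast = loop_latency(5, 200, throughput_tps=1800, ttft_s=0.10)
    slow = loop_latency(5, 200, throughput_tps=150, ttft_s=0.20)
    print(f"1800 tok/s: {fast:.2f} s, 150 tok/s: {slow:.2f} s")

    # Tokens that fit inside a 300 ms budget on each setup.
    print(tokens_within_budget(0.300, 1800, 0.10))
    print(tokens_within_budget(0.300, 150, 0.20))
```

Under these assumptions the five-call loop takes about 1.1 s at 1800 tok/s versus about 7.7 s at 150 tok/s, and a 300 ms budget leaves room for roughly 360 tokens versus roughly 15.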
Not worth it for:
- Async batch jobs (latency doesn't matter)
- Cost-first products (hosted DeepSeek or Qwen is much cheaper)
- Proprietary models (Claude, GPT), which are not available on Cerebras
FAQ
Can Cerebras run Claude or GPT-4?
No. Cerebras hosts only open-weight models (Llama family, Qwen, DeepSeek distills, some Mistral). For Claude or GPT, use Anthropic or OpenAI directly, or go through an aggregator.
Is Cerebras's speed consistent under load?
Generally yes; the wafer-scale architecture handles concurrency well. Slight degradation has been observed at peak hours, but throughput stays at 1500+ tok/s, which is more stable than some GPU-based providers.
What's the free tier limit?
Generous but rate-limited: ~1M free tokens/day on Llama 3.3 70B, enough to prototype and test. For production, use the paid tier.
Does Cerebras support function calling / tool use?
Yes, on Llama 3.3 70B and newer models. It uses the standard OpenAI tool schema through the OpenAI-compatible API.
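A tool-calling request in the OpenAI tool schema might look like the sketch below. The `tools` and `tool_choice` fields follow the OpenAI chat completions format; that Cerebras accepts them unchanged is an assumption based on the compatibility claim above. The `get_weather` tool and the `build_tool_request` helper are hypothetical and exist only for illustration.

```python
import json


def build_tool_request(question, model="llama-3.3-70b"):
    """Build a chat completion payload that offers the model one tool."""
    # Hypothetical tool definition in the OpenAI function-tool schema.
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }


if __name__ == "__main__":
    print(json.dumps(build_tool_request("What's the weather in Paris?"), indent=2))
```

The resulting dictionary would be sent as the JSON body of the same `/v1/chat/completions` call shown in the signup section.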
How does speed feel subjectively in a chat app?
Instant. At 1,800 tok/s, a 500-token response finishes in ~0.3 seconds, versus 2.5-10 seconds at typical GPU speeds of 50-200 tok/s. Users can't read as fast as Cerebras generates.
Is 405B faster on Cerebras than 70B on others?
Yes. 405B at 969 tok/s beats most competitors' 70B speeds. Use Cerebras's 405B as the "quality + speed" combo when budget allows ($3.00/$3.90 per MTok).
Any Cerebras downsides beyond cost?
A smaller model catalog that excludes the proprietary Anthropic and OpenAI models. If you need Claude Opus 4.7 or GPT-5.4 specifically, Cerebras doesn't help. For speed-sensitive routing across open-weight models, Cerebras is the strongest option.