TokenMix Research Lab · 2026-06-04

Groq API Access 2026: Free Tier, Rate Limits, Key Setup
Last Updated: 2026-06-04 | Author: TokenMix Research Lab | Data verified: 2026-06-04 (Groq quickstart, OpenAI compatibility, rate limits, pricing, supported models, Batch API, Flex Processing, error docs, and spend-limit docs)
Groq API access is easy: create a GroqCloud key, set GROQ_API_KEY, and call the OpenAI-compatible endpoint. The hard part is free-tier math, because model-specific RPM, RPD, TPM, and TPD caps decide whether your workload actually fits.
Groq publishes concrete free limits by model, not one universal free bucket. As of June 4, 2026, llama-3.3-70b-versatile is listed at 30 RPM, 1K RPD, 12K TPM, and 100K TPD on the Free Plan, while openai/gpt-oss-120b is listed at 30 RPM, 1K RPD, 8K TPM, and 200K TPD (Groq rate limits). The same GPT OSS 120B model is priced at $0.15 input and $0.60 output per 1M tokens on Groq's official pricing page, with prompt cache hits listed at $0.075 input (Groq pricing). Groq also supports OpenAI-compatible calls through https://api.groq.com/openai/v1, but the compatibility is "mostly compatible," not feature-complete (OpenAI compatibility). Batch Processing gives a 50% cost discount for async jobs, while Flex Processing gives paid customers higher rate limits but can fail fast with status 498 when flex capacity is unavailable (Batch API, Flex Processing).
Table of Contents
- Quick Verdict
- Access Checklist
- API Key Setup
- Free Tier Rate Limits
- Free Tier Cost Math
- Official Pricing
- Workload Cost Projections
- Error Codes and Fixes
- Batch Flex and Caching
- Use Case Matrix
- Risks and Caveats
- Final Recommendation
- FAQ
- Sources
- Related Articles
Quick Verdict
| Claim | Status | Source |
|---|---|---|
| Groq API access starts with a GroqCloud API key and GROQ_API_KEY | Confirmed | Groq quickstart |
| Groq exposes an OpenAI-compatible base URL at https://api.groq.com/openai/v1 | Confirmed | OpenAI compatibility |
| Groq's OpenAI compatibility is complete for every OpenAI parameter | False | OpenAI compatibility lists unsupported fields |
| Free plan limits are model-specific, not a single global quota | Confirmed | Groq rate limits |
| llama-3.3-70b-versatile free plan limit is 30 RPM, 1K RPD, 12K TPM, 100K TPD | Confirmed | Groq rate limits |
| openai/gpt-oss-120b free plan limit is 30 RPM, 1K RPD, 8K TPM, 200K TPD | Confirmed | Groq rate limits |
| openai/gpt-oss-120b costs $0.15 input and $0.60 output per 1M tokens | Confirmed | Groq pricing |
| Prompt caching cuts cached input token cost for GPT OSS 120B to $0.075 per 1M | Confirmed | Groq pricing, Prompt caching |
| Batch API gives 50% lower cost than synchronous APIs | Confirmed | Batch API |
| Batch discount and prompt caching discount stack on the same token | False | Batch API, Prompt caching |
| Flex Processing is available to free-tier users | False | Flex Processing says paid customers only |
| Flex Processing can fail with status 498 and capacity_exceeded | Confirmed | Flex Processing, Error codes |
| Free plan is enough for a production app with real traffic | Likely false | Groq positions higher limits, Batch, and Flex behind Developer or paid usage |
| Groq speed numbers on the pricing page should be treated as current official product numbers, not a universal SLA | Confirmed | Groq pricing, Performance tier |
| Future free limits may tighten if GPT OSS demand rises | Speculation | No public Groq commitment found for permanent free-limit stability |
Access Checklist
The correct starting point is simple. The risky part is assuming the free tier means "free production inference."
| Requirement | What to do | Why it matters | Status |
|---|---|---|---|
| GroqCloud account | Sign in to GroqCloud and create an API key | Required for API calls | Confirmed |
| Environment variable | Set GROQ_API_KEY locally or in your secret manager | Keeps the key out of source code | Confirmed |
| SDK path | Use groq SDK or OpenAI-compatible SDK | Groq documents both paths | Confirmed |
| Base URL | Use https://api.groq.com/openai/v1 for OpenAI-compatible clients | Lets existing OpenAI code migrate with minimal changes | Confirmed |
| Model ID | Start with openai/gpt-oss-120b, openai/gpt-oss-20b, llama-3.3-70b-versatile, or llama-3.1-8b-instant | These are listed production models | Confirmed |
| Free limit check | Check the account Limits page before launch | Docs say account-specific exceptions may exist | Confirmed |
| Spend limit | Add a monthly spend limit before paid traffic | Prevents runaway bills after upgrade | Confirmed |
| Fallback route | Add a second provider if user traffic matters | 429, 498, 5xx, and model changes need graceful degradation | Likely |
If you are comparing Groq against broader free API options, the companion overview is Free LLM API 2026. For production routing, pair this with AI API Gateway 2026, because free-tier access without fallback is fragile.
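The fallback row in the checklist can be sketched as a small ordered-provider loop. This is a minimal illustration, not a gateway implementation: `ProviderError`, `call_with_fallback`, and the provider callables are hypothetical names, and the retryable status set simply mirrors the 429, 498, and 5xx codes named in the checklist.

```python
class ProviderError(Exception):
    """Hypothetical error type that carries an HTTP status code."""

    def __init__(self, status_code: int):
        super().__init__(f"provider returned {status_code}")
        self.status_code = status_code


# Status codes from the checklist that justify trying the next provider.
RETRYABLE = {429, 498, 500, 502, 503, 504}


def call_with_fallback(providers, prompt):
    """Try each (name, callable) pair in order.

    Moves to the next provider only on a retryable status; any other
    error (for example 401 or 400) is re-raised immediately because a
    second provider will not fix a bad key or a bad payload.
    """
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as err:
            if err.status_code not in RETRYABLE:
                raise
            last_error = err
    raise RuntimeError("all providers failed") from last_error
```

In practice each callable would wrap a real client call and translate its HTTP errors into the hypothetical `ProviderError`; the ordering of the list is the routing policy.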
API Key Setup
Groq's native SDK path is the fastest way to start. The OpenAI-compatible path is better if you already have a client wrapper, gateway, eval harness, or routing layer.
| Step | Command or action | Expected result | Status |
|---|---|---|---|
| 1 | Create an API key in GroqCloud | Secret token for API calls | Confirmed |
| 2 | Store it as GROQ_API_KEY | SDK can authenticate without hardcoding | Confirmed |
| 3 | Install SDK | pip install groq | Confirmed |
| 4 | Send a chat completion | HTTP 200 if key, model, and payload are valid | Confirmed |
| 5 | Add error handling | Catch 401, 404, 429, 498, 5xx | Confirmed |
| 6 | Add throttling | Respect retry-after and rate-limit headers | Confirmed |
| 7 | Add spend cap after upgrade | Monthly budget guardrail | Confirmed |
Native Python SDK:
```python
import os

from groq import Groq

# Read the key from the environment rather than hardcoding it.
client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Answer in one paragraph."},
        {"role": "user", "content": "Explain Groq API rate limits in plain English."},
    ],
)
print(completion.choices[0].message.content)
```
OpenAI-compatible Python client:
```python
import os

from openai import OpenAI

# Same OpenAI client, pointed at Groq's OpenAI-compatible base URL.
client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me a 5-item API launch checklist."}],
)
print(response.choices[0].message.content)
```
cURL:
```shell
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "What is the difference between RPM and TPM?"}
    ]
  }'
```
The practical migration note: if your app already points to OpenAI, changing the base_url may be enough for basic chat completions. It is not enough for every OpenAI feature. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name, and says n must equal 1 if supplied (OpenAI compatibility).
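One way to handle that gap during migration is a payload guard that runs before the request leaves your wrapper. The sketch below assumes an OpenAI-style chat payload held as a plain dict; `sanitize_for_groq` is a hypothetical helper name, and the field list covers only the unsupported fields named above, not the full compatibility page.

```python
# Fields the compatibility notes above list as unsupported.
UNSUPPORTED_TOP_LEVEL = {"logprobs", "logit_bias", "top_logprobs"}


def sanitize_for_groq(payload: dict) -> dict:
    """Return a copy of an OpenAI-style chat payload without unsupported fields."""
    clean = {k: v for k, v in payload.items() if k not in UNSUPPORTED_TOP_LEVEL}
    # messages[].name is unsupported, so drop it from each message copy.
    clean["messages"] = [
        {k: v for k, v in msg.items() if k != "name"}
        for msg in payload.get("messages", [])
    ]
    # n must equal 1 if supplied; fail loudly instead of silently changing it.
    if clean.get("n", 1) != 1:
        raise ValueError("Groq requires n == 1 when n is supplied")
    return clean
```

Raising on `n != 1` is a design choice: silently rewriting it would change how many completions the caller gets back, which is harder to debug than an early error.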
Free Tier Rate Limits
Groq's free plan is useful, but the limit that bites first depends on your average request size.
| Model ID | Free RPM | Free RPD | Free TPM | Free TPD | Practical bottleneck | Status |
|---|---|---|---|---|---|---|
| llama-3.1-8b-instant | 30 | 14.4K | 6K | 500K | TPM for bursts, TPD for high-volume chat | Confirmed |
| llama-3.3-70b-versatile | 30 | 1K | 12K | 100K | TPD for medium prompts | Confirmed |
| openai/gpt-oss-120b | 30 | 1K | 8K | 200K | TPM for bursts, TPD for long sessions | Confirmed |
| openai/gpt-oss-20b | 30 | 1K | 8K | 200K | TPM for bursts | Confirmed |
| meta-llama/llama-4-scout-17b-16e-instruct | 30 | 1K | 30K | 500K | RPD before tokens for short calls | Confirmed |
| qwen/qwen3-32b | 60 | 1K | 6K | 500K | TPM before RPD for larger prompts | Confirmed |
| groq/compound | 30 | 250 | 70K | Not listed | RPD | Confirmed |
| whisper-large-v3-turbo | 20 | 2K | Not token-based | Not token-based | Audio seconds per hour/day | Confirmed |
Groq also publishes response headers for limits, remaining quota, reset windows, and retry-after. The important confirmed detail: x-ratelimit-limit-requests refers to RPD, while x-ratelimit-limit-tokens refers to TPM in the documented header examples (Groq rate limits). Treat headers as the runtime source of truth.
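A small parser keeps the RPD-versus-TPM distinction explicit in your own code. This is a sketch under assumptions: `read_rate_limit_headers` is a hypothetical helper, the `x-ratelimit-remaining-requests` and `x-ratelimit-remaining-tokens` names are assumed to follow the same pattern as the two limit headers named above, and reset-window headers are left out because their value format is not covered here. Verify the exact names against a live response before relying on them.

```python
def read_rate_limit_headers(headers: dict) -> dict:
    """Pull request (RPD) and token (TPM) counters out of response headers."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}

    def as_int(name):
        value = lowered.get(name)
        return int(value) if value is not None else None

    retry_after = lowered.get("retry-after")
    return {
        # Request headers describe the daily bucket (RPD).
        "rpd_limit": as_int("x-ratelimit-limit-requests"),
        "rpd_remaining": as_int("x-ratelimit-remaining-requests"),  # assumed name
        # Token headers describe the per-minute bucket (TPM).
        "tpm_limit": as_int("x-ratelimit-limit-tokens"),
        "tpm_remaining": as_int("x-ratelimit-remaining-tokens"),  # assumed name
        "retry_after_s": float(retry_after) if retry_after is not None else None,
    }
```

Naming the dictionary keys after the bucket they measure, rather than after the header, makes it harder to compare a daily request count against a per-minute token count by accident.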
Free Tier Cost Math
Free-tier planning should start with tokens per call, not requests per day. A 1K RPD number can be irrelevant if TPD is lower.
| Scenario | Model | Average tokens per call | Published daily cap that binds | Calls/day before cap | Status |
|---|---|---|---|---|---|
| Short FAQ bot | openai/gpt-oss-120b | 500 total | 200K TPD (binds before 1K RPD) | 400 calls/day | Confirmed math |
| Medium chat | openai/gpt-oss-120b | 2,000 total | 200K TPD | 100 calls/day | Confirmed math |
| Long RAG answer | openai/gpt-oss-120b | 8,000 total | 8K TPM and 200K TPD | 1 call/minute burst, 25 calls/day | Confirmed math |
| Fast extraction | llama-3.1-8b-instant | 500 total | 500K TPD (binds before 14.4K RPD) | 1,000 calls/day | Confirmed math |
| Llama 3.3 medium chat | llama-3.3-70b-versatile | 1,000 total | 100K TPD | 100 calls/day | Confirmed math |
| Qwen short coding helper | qwen/qwen3-32b | 500 total | 1K RPD (500K TPD ties) | 1,000 calls/day | Confirmed math |
Three concrete takeaways:
- openai/gpt-oss-120b at 2,000 total tokens per call is not a 1,000-call/day free backend. It is about 100 calls/day before 200K TPD.
- llama-3.3-70b-versatile has a stronger 12K TPM burst window than GPT OSS 120B, but its 100K TPD means the day can end faster for medium chat.
- qwen/qwen3-32b has 60 RPM but only 6K TPM, so twelve 500-token calls can exhaust one minute of token capacity even though request capacity looks larger.
This is the same hidden math developers miss when they compare free providers. A free model with impressive RPM can still lose to a lower-RPM provider if the token/day cap is tight. The broader provider comparison sits in Cheapest AI API Providers 2026.
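The daily-cap arithmetic is simple enough to script before you pick a model. The sketch below is illustrative: `calls_per_day` is a hypothetical helper, and the limits passed in are the published free-plan figures quoted in this article, which should be re-checked against your account's Limits page.

```python
def calls_per_day(tokens_per_call: int, rpd: int, tpd: int) -> tuple[int, str]:
    """Return the daily call ceiling and which published cap produces it."""
    # How many calls fit inside the daily token budget.
    by_tokens = tpd // tokens_per_call
    if by_tokens < rpd:
        return by_tokens, "TPD"
    if by_tokens > rpd:
        return rpd, "RPD"
    return rpd, "RPD and TPD"


# openai/gpt-oss-120b free plan: 1K RPD, 200K TPD, 2,000 tokens per call.
limit, cap = calls_per_day(2_000, rpd=1_000, tpd=200_000)
print(limit, cap)  # 100 TPD
```

The function ignores per-minute caps on purpose; RPM and TPM decide burst behaviour, while this answers only the question of how far a day of traffic can go.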
Official Pricing
Groq pricing is pay-per-token for listed LLMs, with separate units for speech and built-in tools. The table below uses only official published prices visible on Groq pricing or supported-model pages as of this verification date.
| Model or tool | Input price | Cached input | Output price | Speed listed by Groq | Context | Status |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | $0.075 / 1M | $0.0375 / 1M | $0.30 / 1M | 1,000 TPS | 131,072 | Confirmed |
| openai/gpt-oss-120b | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | 500 TPS | 131,072 | Confirmed |
| llama-3.1-8b-instant | $0.05 / 1M | Not listed in pricing cache table | $0.08 / 1M | 840 TPS on pricing page, 560 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| llama-3.3-70b-versatile | $0.59 / 1M | Not listed in pricing cache table | $0.79 / 1M | 394 TPS on pricing page, 280 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| meta-llama/llama-4-scout-17b-16e-instruct | $0.11 / 1M | Not listed in pricing cache table | $0.34 / 1M | 594 TPS on pricing page, 750 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| qwen/qwen3-32b | $0.29 / 1M | Not listed in pricing cache table | $0.59 / 1M | 662 TPS on pricing page, 400 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| whisper-large-v3 | Audio | N/A | $0.111 / hour | 217x speed factor | Audio | Confirmed |
| whisper-large-v3-turbo | Audio | N/A | $0.04 / hour | 228x speed factor | Audio | Confirmed |
| Built-in web search basic | N/A | N/A | $5 / 1,000 requests | N/A | Tool call | Confirmed |
| Built-in code execution | N/A | N/A | $0.18 / hour | N/A | Tool runtime | Confirmed |
The speed mismatch between Groq's pricing and model pages is not a contradiction worth over-reading. It is a reminder to label speed as page-reported product data, not as a guaranteed SLA. Groq's enterprise Performance Tier separately describes a 99.9% availability SLA and latency guarantee aligned to enterprise agreements (Performance Tier).
Workload Cost Projections
The paid math is straightforward. Multiply input tokens by input price and output tokens by output price. The tricky part is choosing the model that fits the job.
| Monthly workload | Model | Input tokens | Output tokens | Monthly cost | Batch eligible cost | Status |
|---|---|---|---|---|---|---|
| Small support bot | openai/gpt-oss-20b | 10M | 2M | $1.35 | $0.675 | Confirmed math |
| Small support bot | openai/gpt-oss-120b | 10M | 2M | $2.70 | $1.35 | Confirmed math |
| RAG assistant | openai/gpt-oss-120b | 100M | 20M | $27.00 | $13.50 | Confirmed math |
| RAG assistant with 50% cached input (50M of 100M input tokens at the cached rate) | openai/gpt-oss-120b | 100M | 20M | $23.25 | Not stackable with batch | Confirmed math |
| Lightweight extraction | llama-3.1-8b-instant | 100M | 10M | $5.80 | $2.90 | Confirmed math |
| Qwen coding helper | qwen/qwen3-32b | 50M | 25M | $29.25 | $14.625 | Confirmed math |
| Llama 3.3 long-form chat | llama-3.3-70b-versatile | 50M | 25M | $49.25 | $24.625 | Confirmed math |
| 1,000 hours transcription | whisper-large-v3-turbo | Audio | Audio | $40.00 | $20.00 if batch-supported | Confirmed math |
Cost example 1: 50 support tickets/day, 30 days, 2K input and 500 output tokens each equals 3M input and 0.75M output tokens/month. On GPT OSS 20B, that is 3M x $0.075 + 0.75M x $0.30 = $0.45/month. The free tier may cover it if daily TPD and burst caps fit.
Cost example 2: 2,000 RAG calls/day, 30 days, 1,500 input and 500 output tokens each equals 90M input and 30M output tokens/month. On GPT OSS 120B, that is 90M x $0.15 + 30M x $0.60 = $31.50/month. Batch would cut eligible async work to $15.75, but live chat needs synchronous routing.
Cost example 3: 100 developer-agent runs/day, 30 days, 20K input and 4K output tokens each equals 60M input and 12M output tokens/month. On Llama 3.3 70B, that is 60M x $0.59 + 12M x $0.79 = $44.88/month. On GPT OSS 120B, the same token shape is $16.20/month. Whether quality is good enough is task-specific; the price gap is confirmed.
For multi-provider fallback and budget caps, compare this with OpenRouter Alternatives 2026. The cost math changes once a gateway adds markup, routing fees, or provider-specific billing rules.
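The three cost examples all follow one formula, so a short script can reproduce them and test other traffic shapes. This is a sketch: `monthly_cost` and `PRICES_PER_1M` are hypothetical names, and the prices are the per-1M figures quoted in the Official Pricing table, not values fetched from Groq.

```python
# (input, output) USD per 1M tokens, copied from the pricing table in this article.
PRICES_PER_1M = {
    "openai/gpt-oss-20b": (0.075, 0.30),
    "openai/gpt-oss-120b": (0.15, 0.60),
    "llama-3.3-70b-versatile": (0.59, 0.79),
}


def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Synchronous on-demand cost in USD, with no caching or batch discount."""
    input_price, output_price = PRICES_PER_1M[model]
    total_in = calls_per_day * days * input_tokens
    total_out = calls_per_day * days * output_tokens
    return (total_in * input_price + total_out * output_price) / 1_000_000


# Cost example 2: 2,000 RAG calls/day, 1,500 input and 500 output tokens each.
print(round(monthly_cost("openai/gpt-oss-120b", 2_000, 1_500, 500), 2))  # 31.5
```

Keeping the price table as data rather than constants in the formula makes it easy to rerun the same workload when a price or a candidate model changes.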
Error Codes and Fixes
The most common Groq setup failure is not model quality. It is a small auth, payload, model, rate-limit, or tier mistake.
| HTTP code or surface error | What it means | First fix | Production fix | Status |
|---|---|---|---|---|
| 400 Bad Request | Invalid syntax or unsupported parameter | Remove unsupported fields, validate JSON | Provider-specific schema guard | Confirmed |
| 401 Unauthorized | Missing or invalid authentication | Check Authorization: Bearer $GROQ_API_KEY | Secret manager, key rotation, startup health check | Confirmed |
| 403 Forbidden | Permission restriction | Check project/org permissions | Use model permissions and least-privilege keys | Confirmed |
| 404 Not Found | Wrong URL or resource/model not found | Check endpoint and model ID | Pull active model list from /openai/v1/models | Confirmed |
| 413 Request Entity Too Large | Request body too large | Reduce prompt, file, or request size | Chunking and context budget gate | Confirmed |
| 422 Unprocessable Entity | Semantic failure or model hallucination class error | Retry if safe, validate inputs | Typed retry policy by task | Confirmed |
| 429 Too Many Requests | Too many requests in timeframe | Throttle and respect retry-after | Token-aware queue with per-model buckets | Confirmed |
| 498 Flex Tier Capacity Exceeded | Flex capacity unavailable | Retry later or fall back to on-demand | Jittered backoff and non-flex fallback | Confirmed |
| 500/502/503/504 | Server-side failure class | Retry later | Circuit breaker and fallback provider | Confirmed |
A minimal 429 handler should inspect headers, not blindly sleep:
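```python
import random
import time


def groq_retry_sleep(response_headers, attempt):
    # Prefer the server-provided retry-after value when it is present.
    retry_after = response_headers.get("retry-after")
    if retry_after:
        return float(retry_after)
    # Otherwise fall back to capped exponential backoff with jitter.
    base = min(2 ** attempt, 30)
    jitter = random.uniform(0, 0.5 * base)
    return base + jitter


def should_retry(status_code):
    # 429 rate limit, 498 flex capacity, and 5xx server errors are retryable.
    return status_code in {429, 498, 500, 502, 503, 504}
```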
That function is deliberately conservative. For a real launch, track RPM, RPD, TPM, and TPD per model. A request can be legal by RPM and still illegal by TPM.
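The RPM-versus-TPM point can be illustrated with a client-side budget that checks both before a request is sent. This is a sketch under assumptions: `MinuteBudget` is a hypothetical class, it models a simple sliding 60-second window (Groq's server-side window semantics may differ), and it covers only the per-minute caps, so daily RPD and TPD tracking would need a second counter.

```python
import time
from collections import deque


class MinuteBudget:
    """Sliding 60-second window that tracks both request and token usage."""

    def __init__(self, rpm: int, tpm: int, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock  # injectable so the window can be tested
        self.events = deque()  # (timestamp, tokens) pairs, oldest first

    def _prune(self, now: float) -> None:
        # Drop events that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Record the request and return True only if both caps allow it."""
        now = self.clock()
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm:
            return False  # request count would exceed RPM
        if used_tokens + tokens > self.tpm:
            return False  # legal by RPM, but illegal by TPM
        self.events.append((now, tokens))
        return True
```

With the free-plan figures quoted earlier for openai/gpt-oss-120b (30 RPM, 8K TPM), a 5,000-token request followed by a 4,000-token request is rejected on the second call even though only two of thirty requests have been used.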
Batch Flex and Caching
Groq now has several distinct cost and capacity levers that developers often mix up.
| Lever | What it does | Discount or capacity effect | Best for | Caveat | Status |
|---|---|---|---|---|---|
| Prompt caching | Reuses matching prompt prefixes | 50% discount on cached input tokens for eligible models | Repeated system prompts, tool definitions, RAG templates | Exact prefix shape matters; cache hit not guaranteed | Confirmed |
| Batch API | Async batch jobs | 50% lower cost than synchronous APIs | Evaluations, extraction, offline generation, transcription | 24-hour to 7-day processing window; no real-time response | Confirmed |
| Flex Processing | Paid service tier | 10x higher rate limits than on-demand while capacity exists | High-throughput workloads that can retry | Paid only; 498 capacity_exceeded possible | Confirmed |
| Performance Tier | Enterprise provisioned capacity | SLA and latency guarantee by agreement | Production-critical paths | Enterprise only and provisioned capacity pricing | Confirmed |
| Spend limits | Monthly budget cap | Blocks calls after cap is reached | Cost control | 10-15 minute tracking delay can allow small overrun | Confirmed |
The discount rule matters: Batch and prompt caching do not stack. If a 100M input-token GPT OSS 120B batch is eligible for Batch, the input cost is 100M x $0.15 x 50% = $7.50. It is not 100M x $0.075 x 50% = $3.75. Groq says batch tokens are billed at the 50% batch rate regardless of cache status (Batch API, Prompt caching).
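The non-stacking rule is easy to encode so that cost estimates never double-count discounts. The sketch below is illustrative only: `input_cost` is a hypothetical helper, the prices are the GPT OSS 120B figures quoted in this article, and it models input tokens only.

```python
def input_cost(tokens: int, price_per_1m: float, cached_price_per_1m: float,
               cached_fraction: float = 0.0, batch: bool = False) -> float:
    """Input-token cost in USD under the non-stacking discount rule."""
    if batch:
        # Batch tokens bill at 50% of the standard rate regardless of cache status.
        return tokens * price_per_1m * 0.5 / 1_000_000
    # Synchronous calls: cache hits bill at the cached rate, the rest at full price.
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (uncached * price_per_1m + cached * cached_price_per_1m) / 1_000_000


# 100M GPT OSS 120B input tokens in a batch, even if every token would hit cache.
print(round(input_cost(100_000_000, 0.15, 0.075, cached_fraction=1.0, batch=True), 2))  # 7.5
```

The `batch` branch ignores `cached_fraction` entirely, which is the whole point: a batch job cannot drop below the 50% batch rate by also hitting the prompt cache.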
Use Case Matrix
| Use case | Start with | Why | Avoid | Status |
|---|---|---|---|---|
| First API test | openai/gpt-oss-120b on Free Plan | Strong flagship open-weight model, official quickstart path | Assuming free tier supports real launch volume | Confirmed |
| Cheapest paid text extraction | llama-3.1-8b-instant | $0.05/$0.08 per 1M and fast | Complex reasoning tasks | Confirmed |
| Cheap reasoning experiment | openai/gpt-oss-20b | $0.075/$0.30 with 128K context | High-quality production reasoning without eval | Likely |
| Stronger open-weight assistant | openai/gpt-oss-120b | $0.15/$0.60, 128K context, GPT OSS tool capabilities in Groq docs | Treating speed page as SLA | Confirmed |
| Long context RAG | GPT OSS 120B or Llama 3.3 70B | 131,072 context listed | Free-tier TPD exhaustion | Confirmed |
| Short coding helper | qwen/qwen3-32b | Low output price and 60 RPM free limit | Large bursts over 6K TPM | Confirmed |
| Offline evaluation | Batch API | 50% lower cost, separate from standard limits | User-facing chat | Confirmed |
| Spiky high-throughput jobs | Flex Processing | 10x higher paid limits while capacity exists | Workloads that cannot handle 498 | Confirmed |
| Enterprise SLA path | Performance Tier | 99.9% availability SLA described for enterprise agreements | Free/developer assumptions | Confirmed |
Risks and Caveats
| Risk | What can go wrong | Mitigation | Status |
|---|---|---|---|
| Free-tier overconfidence | TPD binds before RPD | Calculate tokens per call before launch | Confirmed |
| OpenAI compatibility mismatch | Unsupported fields cause 400 errors | Strip provider-specific unsupported fields | Confirmed |
| Header misread | Request limit header is RPD, token limit header is TPM in docs | Build per-header parsing correctly | Confirmed |
| Flex retry storm | 498 capacity_exceeded causes aggressive retries | Jittered backoff and on-demand fallback | Confirmed |
| Batch expiration | Long batch window expires under load | Split batches, use longer windows, resubmit failed items | Confirmed |
| Spend limit lag | Tracking delay can exceed monthly cap slightly | Lower cap buffer and monitor first week | Confirmed |
| Model changes | Preview models can be discontinued at short notice | Prefer production models for customer traffic | Confirmed |
| Speed-page drift | Pricing and model pages can show different speed numbers | Do not cite speed as SLA unless enterprise agreement covers it | Likely |
| Future free-limit changes | Popular models may tighten quotas | Monitor docs and keep fallback provider | Speculation |
| Provider concentration | Single-provider outage breaks app | Route through multi-provider gateway | Likely |
Final Recommendation
Use Groq free access for prototypes, evals, and fast short calls. For production, start paid with spend limits, token-aware throttling, Batch for async jobs, and fallback routing. The best default test path is GPT OSS 120B; the best cheap paid path is workload-dependent.
FAQ
How do I get Groq API access?
Create a GroqCloud account, generate an API key, and set it as GROQ_API_KEY. Then use either the Groq SDK or an OpenAI-compatible client pointed at https://api.groq.com/openai/v1.
Is Groq API free in 2026?
Yes, Groq has a Free Plan, but it is rate limited by model. Treat it as a prototype and evaluation lane, not a guaranteed production backend.
What are Groq free tier rate limits?
They vary by model. For example, Groq lists llama-3.3-70b-versatile at 30 RPM, 1K RPD, 12K TPM, and 100K TPD, while openai/gpt-oss-120b is 30 RPM, 1K RPD, 8K TPM, and 200K TPD.
Which Groq model should I start with?
Start with openai/gpt-oss-120b if you want a strong general test model. Start with llama-3.1-8b-instant if the task is simple extraction and cost matters more than reasoning depth.
Is Groq fully OpenAI-compatible?
No. Groq is mostly OpenAI-compatible, but not every OpenAI field is supported. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name.
How do I fix Groq 429 errors?
Throttle requests, respect retry-after, and track both request and token buckets. A 429 can be caused by RPM, RPD, TPM, or TPD depending on model and traffic shape.
Does Groq Batch API reduce cost?
Yes. Groq says Batch Processing has 50% lower cost than synchronous APIs. It is best for offline jobs because the processing window is 24 hours to 7 days.
Can I use Groq for production?
Yes, but not casually on the Free Plan. Use production models, paid limits, spend caps, monitoring, retries, and fallback routing before sending real user traffic.
Sources
- Groq Quickstart - official API key and SDK setup path
- Groq OpenAI Compatibility - official base URL and unsupported OpenAI fields
- Groq Rate Limits - official RPM, RPD, TPM, TPD, headers, and 429 behavior
- Groq Pricing - official model pricing, prompt caching prices, speech pricing, and tool prices
- Groq Supported Models - official production and preview model table
- Groq Batch API - official async batch API and 50% cost discount
- Groq Flex Processing - official paid Flex tier behavior and 498 capacity_exceeded
- Groq Error Codes - official HTTP status code reference
- Groq Prompt Caching - official cache behavior and non-stacking batch rule
- Groq Spend Limits - official monthly spend cap behavior and tracking delay
- Groq Performance Tier - official enterprise SLA and provisioned-capacity framing
Related Articles
- Groq API Pricing 2026: Free Tier, 315 TPS, $0.05/M Paid Models
- Free LLM API 2026: 15 Limits, No-Card Picks, Real Costs
- Cheapest AI API Providers 2026: Every Provider Ranked by $/M
- 8 OpenRouter Alternatives 2026: Free or Below-Market Pricing
- AI API Gateway 2026: Routing, Fallbacks, Observability, and Cost Control