TokenMix Research Lab · 2026-06-08

Groq API Access 2026: Free Tier, Rate Limits, Key Setup

Last Updated: 2026-06-08 · Author: TokenMix Research Lab · Data verified: 2026-06-04 (Groq quickstart, OpenAI compatibility, rate limits, pricing, supported models, Batch API, Flex Processing, error docs, and spend-limit docs)

Groq API access is easy: create a GroqCloud key, set GROQ_API_KEY, and call the OpenAI-compatible endpoint. The hard part is free-tier math, because model-specific RPM, RPD, TPM, and TPD caps decide whether your workload actually fits.

Groq publishes concrete free limits by model, not one universal free bucket. As of June 4, 2026, llama-3.3-70b-versatile is listed at 30 RPM, 1K RPD, 12K TPM, and 100K TPD on the Free Plan, while openai/gpt-oss-120b is listed at 30 RPM, 1K RPD, 8K TPM, and 200K TPD (Groq rate limits). The same GPT OSS 120B model is priced at $0.15 input and $0.60 output per 1M tokens on Groq's official pricing page, with prompt cache hits listed at $0.075 input (Groq pricing). Groq also supports OpenAI-compatible calls through https://api.groq.com/openai/v1, but the compatibility is "mostly compatible," not feature-complete (OpenAI compatibility). Batch Processing gives a 50% cost discount for async jobs, while Flex Processing gives paid customers higher rate limits but can fail fast with status 498 when flex capacity is unavailable (Batch API, Flex Processing).

Quick Verdict
Access Checklist
API Key Setup
Free Tier Rate Limits
Free Tier Cost Math
Official Pricing
Workload Cost Projections
Error Codes and Fixes
Batch Flex and Caching
Use Case Matrix
Risks and Caveats
Final Recommendation
FAQ
Sources
Related Articles

Quick Verdict

Claim	Status	Source
Groq API access starts with a GroqCloud API key and `GROQ_API_KEY`	Confirmed	Groq quickstart
Groq exposes an OpenAI-compatible base URL at `https://api.groq.com/openai/v1`	Confirmed	OpenAI compatibility
Groq's OpenAI compatibility is complete for every OpenAI parameter	False	OpenAI compatibility lists unsupported fields
Free plan limits are model-specific, not a single global quota	Confirmed	Groq rate limits
`llama-3.3-70b-versatile` free plan limit is 30 RPM, 1K RPD, 12K TPM, 100K TPD	Confirmed	Groq rate limits
`openai/gpt-oss-120b` free plan limit is 30 RPM, 1K RPD, 8K TPM, 200K TPD	Confirmed	Groq rate limits
`openai/gpt-oss-120b` costs $0.15 input and $0.60 output per 1M tokens	Confirmed	Groq pricing
Prompt caching cuts cached input token cost for GPT OSS 120B to $0.075 per 1M	Confirmed	Groq pricing, Prompt caching
Batch API gives 50% lower cost than synchronous APIs	Confirmed	Batch API
Batch discount and prompt caching discount stack on the same token	False	Batch API, Prompt caching
Flex Processing is available to free-tier users	False	Flex Processing says paid customers only
Flex Processing can fail with status `498` and `capacity_exceeded`	Confirmed	Flex Processing, Error codes
Free plan is enough for a production app with real traffic	Likely false	Groq positions higher limits, Batch, and Flex behind Developer or paid usage
Groq speed numbers on the pricing page should be treated as current official product numbers, not a universal SLA	Confirmed	Groq pricing, Performance tier
Future free limits may tighten if GPT OSS demand rises	Speculation	No public Groq commitment found for permanent free-limit stability

Access Checklist

The correct starting point is simple. The risky part is assuming the free tier means "free production inference."

Requirement	What to do	Why it matters	Status
GroqCloud account	Sign in to GroqCloud and create an API key	Required for API calls	Confirmed
Environment variable	Set `GROQ_API_KEY` locally or in your secret manager	Keeps the key out of source code	Confirmed
SDK path	Use `groq` SDK or OpenAI-compatible SDK	Groq documents both paths	Confirmed
Base URL	Use `https://api.groq.com/openai/v1` for OpenAI-compatible clients	Lets existing OpenAI code migrate with minimal changes	Confirmed
Model ID	Start with `openai/gpt-oss-120b`, `openai/gpt-oss-20b`, `llama-3.3-70b-versatile`, or `llama-3.1-8b-instant`	These are listed production models	Confirmed
Free limit check	Check the account Limits page before launch	Docs say account-specific exceptions may exist	Confirmed
Spend limit	Add a monthly spend limit before paid traffic	Prevents runaway bills after upgrade	Confirmed
Fallback route	Add a second provider if user traffic matters	429, 498, 5xx, and model changes need graceful degradation	Likely

If you are comparing Groq against broader free API options, the companion overview is Free LLM API 2026. For production routing, pair this with AI API Gateway 2026, because free-tier access without fallback is fragile.

API Key Setup

Groq's native SDK path is the fastest way to start. The OpenAI-compatible path is better if you already have a client wrapper, gateway, eval harness, or routing layer.

Step	Command or action	Expected result	Status
1	Create an API key in GroqCloud	Secret token for API calls	Confirmed
2	Store it as `GROQ_API_KEY`	SDK can authenticate without hardcoding	Confirmed
3	Install the SDK with `pip install groq`	`groq` package is importable	Confirmed
4	Send a chat completion	HTTP 200 if key, model, and payload are valid	Confirmed
5	Add error handling	Catch 401, 404, 429, 498, 5xx	Confirmed
6	Add throttling	Respect `retry-after` and rate-limit headers	Confirmed
7	Add spend cap after upgrade	Monthly budget guardrail	Confirmed

Native Python SDK:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Answer in one paragraph."},
        {"role": "user", "content": "Explain Groq API rate limits in plain English."},
    ],
)

print(completion.choices[0].message.content)

OpenAI-compatible Python client:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me a 5-item API launch checklist."}],
)

print(response.choices[0].message.content)

cURL:

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "What is the difference between RPM and TPM?"}
    ]
  }'

The practical migration note: if your app already points to OpenAI, changing the base_url may be enough for basic chat completions. It is not enough for every OpenAI feature. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name, and says n must equal 1 if supplied (OpenAI compatibility).
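The migration note above can be turned into a small payload guard. This is a minimal sketch, not Groq's own tooling: `to_groq_payload` and `UNSUPPORTED_TOP_LEVEL` are hypothetical names, and the field list only covers the examples named in this article, so it may be incomplete.

```python
from copy import deepcopy

# Fields this article lists as unsupported on Groq's OpenAI-compatible endpoint.
# Assumption: the list is illustrative; check the compatibility docs for the full set.
UNSUPPORTED_TOP_LEVEL = ("logprobs", "logit_bias", "top_logprobs")


def to_groq_payload(openai_payload: dict) -> dict:
    """Return a copy of an OpenAI-style chat payload without known-unsupported fields."""
    payload = deepcopy(openai_payload)

    for field in UNSUPPORTED_TOP_LEVEL:
        payload.pop(field, None)

    # messages[].name is listed as unsupported, so drop it from each message.
    for message in payload.get("messages", []):
        message.pop("name", None)

    # n must equal 1 if supplied; fail loudly instead of silently changing behavior.
    if payload.get("n", 1) != 1:
        raise ValueError("Groq requires n == 1 when n is supplied")

    return payload
```

Running existing OpenAI payloads through a guard like this before sending keeps unsupported fields out of the request without touching the caller's original dictionary.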

Free Tier Rate Limits

Groq's free plan is useful, but the limit that bites first depends on your average request size.

Model ID	Free RPM	Free RPD	Free TPM	Free TPD	Practical bottleneck	Status
`llama-3.1-8b-instant`	30	14.4K	6K	500K	TPM for bursts, TPD for high-volume chat	Confirmed
`llama-3.3-70b-versatile`	30	1K	12K	100K	TPD for medium prompts	Confirmed
`openai/gpt-oss-120b`	30	1K	8K	200K	TPM for bursts, TPD for long sessions	Confirmed
`openai/gpt-oss-20b`	30	1K	8K	200K	TPM for bursts	Confirmed
`meta-llama/llama-4-scout-17b-16e-instruct`	30	1K	30K	500K	RPD before tokens for short calls	Confirmed
`qwen/qwen3-32b`	60	1K	6K	500K	TPM before RPD for larger prompts	Confirmed
`groq/compound`	30	250	70K	Not listed	RPD	Confirmed
`whisper-large-v3-turbo`	20	2K	Not token-based	Not token-based	Audio seconds per hour/day	Confirmed

Groq also publishes response headers for limits, remaining quota, reset windows, and retry-after. The important confirmed detail: x-ratelimit-limit-requests refers to RPD, while x-ratelimit-limit-tokens refers to TPM in the documented header examples (Groq rate limits). Treat headers as the runtime source of truth.
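A minimal sketch of reading those headers into named fields follows. `read_rate_limit_headers` is a hypothetical helper; the two `x-ratelimit-remaining-*` header names are assumptions by analogy with the two limit headers named above, and the sketch assumes the values are plain integers, so verify both against the rate-limit docs before relying on it.

```python
def read_rate_limit_headers(headers: dict) -> dict:
    """Map rate-limit response headers to the window each one describes."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}

    def as_int(name):
        value = lowered.get(name)
        return int(value) if value is not None else None

    retry_after = lowered.get("retry-after")
    return {
        # Request headers describe the daily window (RPD).
        "rpd_limit": as_int("x-ratelimit-limit-requests"),
        "rpd_remaining": as_int("x-ratelimit-remaining-requests"),  # assumed name
        # Token headers describe the per-minute window (TPM).
        "tpm_limit": as_int("x-ratelimit-limit-tokens"),
        "tpm_remaining": as_int("x-ratelimit-remaining-tokens"),  # assumed name
        "retry_after_s": float(retry_after) if retry_after is not None else None,
    }
```

Naming the fields by window (`rpd_*`, `tpm_*`) rather than by header makes the RPD-versus-TPM distinction hard to misread downstream.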

Free Tier Cost Math

Free-tier planning should start with tokens per call, not requests per day. A 1K RPD number is irrelevant when TPD divided by tokens per call allows fewer than 1,000 calls.

Scenario	Model	Average tokens per call	Published daily cap that binds	Calls/day before cap	Status
Short FAQ bot	`openai/gpt-oss-120b`	500 total	200K TPD	400 calls/day	Confirmed math
Medium chat	`openai/gpt-oss-120b`	2,000 total	200K TPD	100 calls/day	Confirmed math
Long RAG answer	`openai/gpt-oss-120b`	8,000 total	8K TPM and 200K TPD	1 call/minute burst, 25 calls/day	Confirmed math
Fast extraction	`llama-3.1-8b-instant`	500 total	500K TPD (binds before 14.4K RPD)	1,000 calls/day	Confirmed math
Llama 3.3 medium chat	`llama-3.3-70b-versatile`	1,000 total	100K TPD	100 calls/day	Confirmed math
Qwen short coding helper	`qwen/qwen3-32b`	500 total	1K RPD (500K TPD is reached at the same count)	1,000 calls/day	Confirmed math

Three concrete takeaways:

openai/gpt-oss-120b at 2,000 total tokens per call is not a 1,000-call/day free backend. It is about 100 calls/day before 200K TPD.
llama-3.3-70b-versatile has a stronger 12K TPM burst window than GPT OSS 120B, but its 100K TPD means the day can end faster for medium chat.
qwen/qwen3-32b has 60 RPM but only 6K TPM, so twelve 500-token calls can exhaust one minute of token capacity even though request capacity looks larger.
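The takeaways above reduce to one rule: the usable call count is whichever cap runs out first. A minimal sketch, with hypothetical function names and the free-plan limits quoted in this article passed in as plain numbers:

```python
def calls_per_day(tokens_per_call: int, rpd: int, tpd: int) -> int:
    """Calls that fit in one day: the smaller of the request cap and the token cap."""
    return min(rpd, tpd // tokens_per_call)


def calls_per_minute(tokens_per_call: int, rpm: int, tpm: int) -> int:
    """Calls that fit in one minute under both the request and token caps."""
    return min(rpm, tpm // tokens_per_call)


# openai/gpt-oss-120b at 2,000 tokens per call: 1K RPD, 200K TPD
print(calls_per_day(2_000, rpd=1_000, tpd=200_000))    # 100

# llama-3.3-70b-versatile at 1,000 tokens per call: 1K RPD, 100K TPD
print(calls_per_day(1_000, rpd=1_000, tpd=100_000))    # 100

# qwen/qwen3-32b at 500 tokens per call: 60 RPM, 6K TPM
print(calls_per_minute(500, rpm=60, tpm=6_000))        # 12
```

Integer division is deliberate: a call that only partly fits under the token cap does not fit.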

This is the same hidden math developers miss when they compare free providers. A free model with impressive RPM can still lose to a lower-RPM provider if the token/day cap is tight. The broader provider comparison sits in Cheapest AI API Providers 2026.

Official Pricing

Groq pricing is pay-per-token for listed LLMs, with separate units for speech and built-in tools. The table below uses only official published prices visible on Groq pricing or supported-model pages as of this verification date.

Model or tool	Input price	Cached input	Output price	Speed listed by Groq	Context	Status
`openai/gpt-oss-20b`	$0.075 / 1M	$0.0375 / 1M	$0.30 / 1M	1,000 TPS	131,072	Confirmed
`openai/gpt-oss-120b`	$0.15 / 1M	$0.075 / 1M	$0.60 / 1M	500 TPS	131,072	Confirmed
`llama-3.1-8b-instant`	$0.05 / 1M	Not listed in pricing cache table	$0.08 / 1M	840 TPS on pricing page, 560 T/sec on models page	131,072	Confirmed, with page mismatch
`llama-3.3-70b-versatile`	$0.59 / 1M	Not listed in pricing cache table	$0.79 / 1M	394 TPS on pricing page, 280 T/sec on models page	131,072	Confirmed, with page mismatch
`meta-llama/llama-4-scout-17b-16e-instruct`	$0.11 / 1M	Not listed in pricing cache table	$0.34 / 1M	594 TPS on pricing page, 750 T/sec on models page	131,072	Confirmed, with page mismatch
`qwen/qwen3-32b`	$0.29 / 1M	Not listed in pricing cache table	$0.59 / 1M	662 TPS on pricing page, 400 T/sec on models page	131,072	Confirmed, with page mismatch
`whisper-large-v3`	Audio	N/A	$0.111 / hour	217x speed factor	Audio	Confirmed
`whisper-large-v3-turbo`	Audio	N/A	$0.04 / hour	228x speed factor	Audio	Confirmed
Built-in web search basic	N/A	N/A	$5 / 1,000 requests	N/A	Tool call	Confirmed
Built-in code execution	N/A	N/A	$0.18 / hour	N/A	Tool runtime	Confirmed

The speed mismatch between Groq's pricing and model pages is not a contradiction worth over-reading. It is a reminder to label speed as page-reported product data, not as a guaranteed SLA. Groq's enterprise Performance Tier separately describes a 99.9% availability SLA and latency guarantee aligned to enterprise agreements (Performance Tier).

Workload Cost Projections

The paid math is straightforward. Multiply input tokens by input price and output tokens by output price. The tricky part is choosing the model that fits the job.

Monthly workload	Model	Input tokens	Output tokens	Monthly cost	Batch eligible cost	Status
Small support bot	`openai/gpt-oss-20b`	10M	2M	$1.35	$0.675	Confirmed math
Small support bot	`openai/gpt-oss-120b`	10M	2M	$2.70	$1.35	Confirmed math
RAG assistant	`openai/gpt-oss-120b`	100M	20M	$27.00	$13.50	Confirmed math
RAG assistant with 50% cached input	`openai/gpt-oss-120b`	100M	20M	$23.25	Not stackable with batch	Confirmed math
Lightweight extraction	`llama-3.1-8b-instant`	100M	10M	$5.80	$2.90	Confirmed math
Qwen coding helper	`qwen/qwen3-32b`	50M	25M	$29.25	$14.625	Confirmed math
Llama 3.3 long-form chat	`llama-3.3-70b-versatile`	50M	25M	$49.25	$24.625	Confirmed math
1,000 hours transcription	`whisper-large-v3-turbo`	Audio	Audio	$40.00	$20.00 if batch-supported	Confirmed math

Cost example 1: 50 support tickets/day, 30 days, 2K input and 500 output tokens each equals 3M input and 0.75M output tokens/month. On GPT OSS 20B, that is 3M x $0.075 + 0.75M x $0.30 = $0.45/month. The free tier may cover it: 50 calls x 2,500 tokens is 125K tokens/day, under the 200K TPD cap, as long as bursts stay within 8K TPM.

Cost example 2: 2,000 RAG calls/day, 30 days, 1,500 input and 500 output tokens each equals 90M input and 30M output tokens/month. On GPT OSS 120B, that is 90M x $0.15 + 30M x $0.60 = $31.50/month. Batch would cut eligible async work to $15.75, but live chat needs synchronous routing.

Cost example 3: 100 developer-agent runs/day, 30 days, 20K input and 4K output tokens each equals 60M input and 12M output tokens/month. On Llama 3.3 70B, that is 60M x $0.59 + 12M x $0.79 = $44.88/month. On GPT OSS 120B, the same token shape is $16.20/month. Whether quality is good enough is task-specific; the price gap is confirmed.
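The three cost examples above can be reproduced with a short calculator. This is a sketch under stated assumptions: `PRICES_PER_1M` and `monthly_cost` are hypothetical names, the prices are the ones quoted in this article's pricing table, and a 30-day month is assumed.

```python
# (input price, output price) in USD per 1M tokens, as quoted in this article.
PRICES_PER_1M = {
    "openai/gpt-oss-20b": (0.075, 0.30),
    "openai/gpt-oss-120b": (0.15, 0.60),
    "llama-3.3-70b-versatile": (0.59, 0.79),
}


def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Synchronous (non-batch, uncached) monthly cost in USD for a fixed call shape."""
    input_price, output_price = PRICES_PER_1M[model]
    total_input = calls_per_day * days * input_tokens
    total_output = calls_per_day * days * output_tokens
    return (total_input * input_price + total_output * output_price) / 1_000_000


for model, calls, tokens_in, tokens_out in [
    ("openai/gpt-oss-20b", 50, 2_000, 500),          # cost example 1
    ("openai/gpt-oss-120b", 2_000, 1_500, 500),      # cost example 2
    ("llama-3.3-70b-versatile", 100, 20_000, 4_000), # cost example 3
]:
    print(f"{model}: ${monthly_cost(model, calls, tokens_in, tokens_out):.2f}/month")
```

Swapping the model key while holding the call shape fixed is the quickest way to see the price gap described in cost example 3.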

For multi-provider fallback and budget caps, compare this with OpenRouter Alternatives 2026. The cost math changes once a gateway adds markup, routing fees, or provider-specific billing rules.

Error Codes and Fixes

The most common Groq setup failure is not model quality. It is a small auth, payload, model, rate-limit, or tier mistake.

HTTP code or surface error	What it means	First fix	Production fix	Status
400 Bad Request	Invalid syntax or unsupported parameter	Remove unsupported fields, validate JSON	Provider-specific schema guard	Confirmed
401 Unauthorized	Missing or invalid authentication	Check `Authorization: Bearer $GROQ_API_KEY`	Secret manager, key rotation, startup health check	Confirmed
403 Forbidden	Permission restriction	Check project/org permissions	Use model permissions and least-privilege keys	Confirmed
404 Not Found	Wrong URL or resource/model not found	Check endpoint and model ID	Pull active model list from `/openai/v1/models`	Confirmed
413 Request Entity Too Large	Request body too large	Reduce prompt, file, or request size	Chunking and context budget gate	Confirmed
422 Unprocessable Entity	Semantic failure or model hallucination class error	Retry if safe, validate inputs	Typed retry policy by task	Confirmed
429 Too Many Requests	Too many requests in timeframe	Throttle and respect `retry-after`	Token-aware queue with per-model buckets	Confirmed
498 Flex Tier Capacity Exceeded	Flex capacity unavailable	Retry later or fall back to on-demand	Jittered backoff and non-flex fallback	Confirmed
500/502/503/504	Server-side failure class	Retry later	Circuit breaker and fallback provider	Confirmed

A minimal 429 handler should inspect headers, not blindly sleep:

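import random
import time


def groq_retry_sleep(response_headers, attempt):
    """Return the number of seconds to wait before the next retry."""
    # Prefer the server's own hint when it is present and numeric.
    retry_after = response_headers.get("retry-after")
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Not a plain number of seconds; fall through to backoff.

    # Otherwise use exponential backoff capped at 30 seconds, plus jitter.
    base = min(2 ** attempt, 30)
    jitter = random.uniform(0, 0.5 * base)
    return base + jitter


def should_retry(status_code):
    """Retry only rate limits, Flex capacity errors, and server-side failures."""
    return status_code in {429, 498, 500, 502, 503, 504}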

That function is deliberately conservative. For a real launch, track RPM, RPD, TPM, and TPD per model. A request can be legal by RPM and still illegal by TPM.
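One way to catch the "legal by RPM, illegal by TPM" case before the request leaves your process is a client-side sliding window. The sketch below is an assumption-laden illustration, not Groq's accounting: `MinuteBudget` is a hypothetical class, it covers only the per-minute caps, and the server's own counters remain the source of truth.

```python
import time
from collections import deque


class MinuteBudget:
    """Sliding 60-second window over requests and tokens for one model."""

    def __init__(self, rpm: int, tpm: int, clock=time.monotonic):
        self.rpm = rpm
        self.tpm = tpm
        self.clock = clock      # injectable so the window can be tested without waiting
        self.events = deque()   # (timestamp, tokens) for each admitted request

    def try_acquire(self, tokens: int) -> bool:
        """Admit the request only if both the request and token caps still have room."""
        now = self.clock()
        # Drop requests that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

        used_tokens = sum(event_tokens for _, event_tokens in self.events)
        if len(self.events) >= self.rpm or used_tokens + tokens > self.tpm:
            return False

        self.events.append((now, tokens))
        return True


# qwen/qwen3-32b free limits from this article: 60 RPM, 6K TPM
budget = MinuteBudget(rpm=60, tpm=6_000)
admitted = sum(budget.try_acquire(500) for _ in range(20))
print(admitted)  # 12: the token cap binds long before the 60-request cap
```

Keep one budget per model ID, and add a second window of the same shape with a 24-hour span if RPD and TPD also need local enforcement.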

Batch Flex and Caching

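Groq now has three cost/capacity levers that developers often mix up: prompt caching, Batch, and Flex. The table also lists Performance Tier and spend limits, which are an enterprise capacity option and a budget guardrail rather than discounts.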

Lever	What it does	Discount or capacity effect	Best for	Caveat	Status
Prompt caching	Reuses matching prompt prefixes	50% discount on cached input tokens for eligible models	Repeated system prompts, tool definitions, RAG templates	Exact prefix shape matters; cache hit not guaranteed	Confirmed
Batch API	Async batch jobs	50% lower cost than synchronous APIs	Evaluations, extraction, offline generation, transcription	24-hour to 7-day processing window; no real-time response	Confirmed
Flex Processing	Paid service tier	10x higher rate limits than on-demand while capacity exists	High-throughput workloads that can retry	Paid only; `498 capacity_exceeded` possible	Confirmed
Performance Tier	Enterprise provisioned capacity	SLA and latency guarantee by agreement	Production-critical paths	Enterprise only and provisioned capacity pricing	Confirmed
Spend limits	Monthly budget cap	Blocks calls after cap is reached	Cost control	10-15 minute tracking delay can allow small overrun	Confirmed

The discount rule matters: Batch and prompt caching do not stack. If a 100M input-token GPT OSS 120B batch is eligible for Batch, the input cost is 100M x $0.15 x 50% = $7.50. It is not 100M x $0.075 x 50% = $3.75. Groq says batch tokens are billed at the 50% batch rate regardless of cache status (Batch API, Prompt caching).
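The non-stacking rule can be encoded so a cost model never applies both discounts to the same token. A minimal sketch with a hypothetical function name, using the 50% batch rate and the cached-input price described above:

```python
def input_cost_usd(tokens, price_per_1m, cached_price_per_1m,
                   cached_fraction=0.0, batch=False):
    """Input-token cost in USD under either Batch or prompt caching, never both."""
    if batch:
        # Batch tokens are billed at the 50% batch rate regardless of cache status.
        return tokens * price_per_1m * 0.5 / 1_000_000

    # Synchronous path: cached tokens get the cached price, the rest pay full price.
    cached = tokens * cached_fraction
    uncached = tokens - cached
    return (uncached * price_per_1m + cached * cached_price_per_1m) / 1_000_000


# 100M GPT OSS 120B input tokens in a batch, even with a fully cached prefix
print(input_cost_usd(100_000_000, 0.15, 0.075, cached_fraction=1.0, batch=True))  # 7.5
```

Making `batch` short-circuit the function is the design choice that matters: the cached fraction is ignored on the batch path, which mirrors the billing rule instead of multiplying the two discounts.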

Use Case Matrix

Use case	Start with	Why	Avoid	Status
First API test	`openai/gpt-oss-120b` on Free Plan	Strong flagship open-weight model, official quickstart path	Assuming free tier supports real launch volume	Confirmed
Cheapest paid text extraction	`llama-3.1-8b-instant`	$0.05/$0.08 per 1M and fast	Complex reasoning tasks	Confirmed
Cheap reasoning experiment	`openai/gpt-oss-20b`	$0.075/$0.30 with 128K context	High-quality production reasoning without eval	Likely
Stronger open-weight assistant	`openai/gpt-oss-120b`	$0.15/$0.60, 128K context, GPT OSS tool capabilities in Groq docs	Treating speed page as SLA	Confirmed
Long context RAG	GPT OSS 120B or Llama 3.3 70B	131,072 context listed	Free-tier TPD exhaustion	Confirmed
Short coding helper	`qwen/qwen3-32b`	60 RPM free limit; $0.29/$0.59 per 1M when paid	Large bursts over 6K TPM	Confirmed
Offline evaluation	Batch API	50% lower cost, separate from standard limits	User-facing chat	Confirmed
Spiky high-throughput jobs	Flex Processing	10x higher paid limits while capacity exists	Workloads that cannot handle `498`	Confirmed
Enterprise SLA path	Performance Tier	99.9% availability SLA described for enterprise agreements	Free/developer assumptions	Confirmed

Risks and Caveats

Risk	What can go wrong	Mitigation	Status
Free-tier overconfidence	TPD binds before RPD	Calculate tokens per call before launch	Confirmed
OpenAI compatibility mismatch	Unsupported fields cause 400 errors	Strip provider-specific unsupported fields	Confirmed
Header misread	Request limit header is RPD, token limit header is TPM in docs	Build per-header parsing correctly	Confirmed
Flex retry storm	`498 capacity_exceeded` causes aggressive retries	Jittered backoff and on-demand fallback	Confirmed
Batch expiration	Long batch window expires under load	Split batches, use longer windows, resubmit failed items	Confirmed
Spend limit lag	Tracking delay can exceed monthly cap slightly	Lower cap buffer and monitor first week	Confirmed
Model changes	Preview models can be discontinued at short notice	Prefer production models for customer traffic	Confirmed
Speed-page drift	Pricing and model pages can show different speed numbers	Do not cite speed as SLA unless enterprise agreement covers it	Likely
Future free-limit changes	Popular models may tighten quotas	Monitor docs and keep fallback provider	Speculation
Provider concentration	Single-provider outage breaks app	Route through multi-provider gateway	Likely
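The "Flex retry storm" mitigation in the table above, jittered backoff followed by an on-demand fallback, might look like the sketch below. Everything here is illustrative: `send` is a hypothetical callable you supply that performs one request for a given tier and returns `(status_code, body)`, and the tier labels `"flex"` and `"on_demand"` are assumptions to check against the Flex Processing docs.

```python
import random
import time


def call_with_flex_fallback(send, max_flex_attempts=3, sleep=time.sleep):
    """Try Flex a bounded number of times, then fall back to on-demand.

    send(tier) -> (status_code, body) is a caller-supplied, hypothetical function.
    """
    for attempt in range(max_flex_attempts):
        status, body = send("flex")
        if status != 498:
            # Success or a non-capacity error: hand it back to the caller as-is.
            return status, body

        # 498 means Flex capacity is unavailable; back off with jitter before retrying.
        if attempt < max_flex_attempts - 1:
            base = min(2 ** attempt, 30)
            sleep(base + random.uniform(0, 0.5 * base))

    # Flex stayed unavailable: make one on-demand attempt instead of retrying forever.
    return send("on_demand")
```

Bounding the Flex attempts is what prevents the retry storm; the on-demand call is subject to the regular rate limits, so it still needs the usual 429 handling.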

Final Recommendation

Use Groq free access for prototypes, evals, and fast short calls. For production, start paid with spend limits, token-aware throttling, Batch for async jobs, and fallback routing. The best default test path is GPT OSS 120B; the best cheap paid path is workload-dependent.

FAQ

How do I get Groq API access?

Create a GroqCloud account, generate an API key, and set it as GROQ_API_KEY. Then use either the Groq SDK or an OpenAI-compatible client pointed at https://api.groq.com/openai/v1.

Is Groq API free in 2026?

Yes, Groq has a Free Plan, but it is rate limited by model. Treat it as a prototype and evaluation lane, not a guaranteed production backend.

What are Groq free tier rate limits?

They vary by model. For example, Groq lists llama-3.3-70b-versatile at 30 RPM, 1K RPD, 12K TPM, and 100K TPD, while openai/gpt-oss-120b is 30 RPM, 1K RPD, 8K TPM, and 200K TPD.

Which Groq model should I start with?

Start with openai/gpt-oss-120b if you want a strong general test model. Start with llama-3.1-8b-instant if the task is simple extraction and cost matters more than reasoning depth.

Is Groq fully OpenAI-compatible?

No. Groq is mostly OpenAI-compatible, but not every OpenAI field is supported. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name.

How do I fix Groq 429 errors?

Throttle requests, respect retry-after, and track both request and token buckets. A 429 can be caused by RPM, RPD, TPM, or TPD depending on model and traffic shape.

Does Groq Batch API reduce cost?

Yes. Groq says Batch Processing has 50% lower cost than synchronous APIs. It is best for offline jobs because the processing window is 24 hours to 7 days.

Can I use Groq for production?

Yes, but not casually on the Free Plan. Use production models, paid limits, spend caps, monitoring, retries, and fallback routing before sending real user traffic.

Sources

Groq Quickstart - official API key and SDK setup path
Groq OpenAI Compatibility - official base URL and unsupported OpenAI fields
Groq Rate Limits - official RPM, RPD, TPM, TPD, headers, and 429 behavior
Groq Pricing - official model pricing, prompt caching prices, speech pricing, and tool prices
Groq Supported Models - official production and preview model table
Groq Batch API - official async batch API and 50% cost discount
Groq Flex Processing - official paid Flex tier behavior and 498 capacity_exceeded
Groq Error Codes - official HTTP status code reference
Groq Prompt Caching - official cache behavior and non-stacking batch rule
Groq Spend Limits - official monthly spend cap behavior and tracking delay
Groq Performance Tier - official enterprise SLA and provisioned-capacity framing

2026 Traffic Cluster Update

New or refreshed page	Status	Why it matters
Groq AI Learning 2026	Confirmed	Deeper LPU, Compound, Batch, and Flex cost guide.
Free AI API No Limit 2026	Confirmed	Places Groq free access inside broader free API limits.
Node.js AI API 2026	Confirmed	OpenAI-compatible Node streaming pattern.
AI API Gateway 2026	Confirmed	Fallback and routing architecture after Groq limits.
AI Frameworks 2026	Confirmed	Framework choices for Groq-backed agents.