TokenMix Research Lab · 2026-06-04

Groq API Access 2026: Free Tier, Rate Limits, Key Setup

Last Updated: 2026-06-04
Author: TokenMix Research Lab
Data verified: 2026-06-04 against Groq's quickstart, OpenAI compatibility, rate limits, pricing, supported models, Batch API, Flex Processing, error, and spend-limit docs

Groq API access is easy: create a GroqCloud key, set GROQ_API_KEY, and call the OpenAI-compatible endpoint. The hard part is free-tier math, because model-specific RPM, RPD, TPM, and TPD caps decide whether your workload actually fits.

Groq publishes concrete free limits by model, not one universal free bucket. As of June 4, 2026, llama-3.3-70b-versatile is listed at 30 RPM, 1K RPD, 12K TPM, and 100K TPD on the Free Plan, while openai/gpt-oss-120b is listed at 30 RPM, 1K RPD, 8K TPM, and 200K TPD (Groq rate limits). The same GPT OSS 120B model is priced at $0.15 input and $0.60 output per 1M tokens on Groq's official pricing page, with prompt cache hits listed at $0.075 input (Groq pricing). Groq also supports OpenAI-compatible calls through https://api.groq.com/openai/v1, but the compatibility is "mostly compatible," not feature-complete (OpenAI compatibility). Batch Processing gives a 50% cost discount for async jobs, while Flex Processing gives paid customers higher rate limits but can fail fast with status 498 when flex capacity is unavailable (Batch API, Flex Processing).

Quick Verdict

| Claim | Status | Source |
|---|---|---|
| Groq API access starts with a GroqCloud API key and GROQ_API_KEY | Confirmed | Groq quickstart |
| Groq exposes an OpenAI-compatible base URL at https://api.groq.com/openai/v1 | Confirmed | OpenAI compatibility |
| Groq's OpenAI compatibility is complete for every OpenAI parameter | False | OpenAI compatibility lists unsupported fields |
| Free plan limits are model-specific, not a single global quota | Confirmed | Groq rate limits |
| llama-3.3-70b-versatile free plan limit is 30 RPM, 1K RPD, 12K TPM, 100K TPD | Confirmed | Groq rate limits |
| openai/gpt-oss-120b free plan limit is 30 RPM, 1K RPD, 8K TPM, 200K TPD | Confirmed | Groq rate limits |
| openai/gpt-oss-120b costs $0.15 input and $0.60 output per 1M tokens | Confirmed | Groq pricing |
| Prompt caching cuts cached input token cost for GPT OSS 120B to $0.075 per 1M | Confirmed | Groq pricing, Prompt caching |
| Batch API gives 50% lower cost than synchronous APIs | Confirmed | Batch API |
| Batch discount and prompt caching discount stack on the same token | False | Batch API, Prompt caching |
| Flex Processing is available to free-tier users | False | Flex Processing says paid customers only |
| Flex Processing can fail with status 498 and capacity_exceeded | Confirmed | Flex Processing, Error codes |
| Free plan is enough for a production app with real traffic | Likely false | Groq positions higher limits, Batch, and Flex behind Developer or paid usage |
| Groq speed numbers on the pricing page should be treated as current official product numbers, not a universal SLA | Confirmed | Groq pricing, Performance tier |
| Future free limits may tighten if GPT OSS demand rises | Speculation | No public Groq commitment found for permanent free-limit stability |

Access Checklist

The correct starting point is simple. The risky part is assuming the free tier means "free production inference."

| Requirement | What to do | Why it matters | Status |
|---|---|---|---|
| GroqCloud account | Sign in to GroqCloud and create an API key | Required for API calls | Confirmed |
| Environment variable | Set GROQ_API_KEY locally or in your secret manager | Keeps the key out of source code | Confirmed |
| SDK path | Use groq SDK or OpenAI-compatible SDK | Groq documents both paths | Confirmed |
| Base URL | Use https://api.groq.com/openai/v1 for OpenAI-compatible clients | Lets existing OpenAI code migrate with minimal changes | Confirmed |
| Model ID | Start with openai/gpt-oss-120b, openai/gpt-oss-20b, llama-3.3-70b-versatile, or llama-3.1-8b-instant | These are listed production models | Confirmed |
| Free limit check | Check the account Limits page before launch | Docs say account-specific exceptions may exist | Confirmed |
| Spend limit | Add a monthly spend limit before paid traffic | Prevents runaway bills after upgrade | Confirmed |
| Fallback route | Add a second provider if user traffic matters | 429, 498, 5xx, and model changes need graceful degradation | Likely |

If you are comparing Groq against broader free API options, the companion overview is Free LLM API 2026. For production routing, pair this with AI API Gateway 2026, because free-tier access without fallback is fragile.

API Key Setup

Groq's native SDK path is the fastest way to start. The OpenAI-compatible path is better if you already have a client wrapper, gateway, eval harness, or routing layer.

| Step | Action | Command or expected result | Status |
|---|---|---|---|
| 1 | Create an API key in GroqCloud | Secret token for API calls | Confirmed |
| 2 | Store it as GROQ_API_KEY | SDK can authenticate without hardcoding | Confirmed |
| 3 | Install SDK | pip install groq | Confirmed |
| 4 | Send a chat completion | HTTP 200 if key, model, and payload are valid | Confirmed |
| 5 | Add error handling | Catch 401, 404, 429, 498, 5xx | Confirmed |
| 6 | Add throttling | Respect retry-after and rate-limit headers | Confirmed |
| 7 | Add spend cap after upgrade | Monthly budget guardrail | Confirmed |

Native Python SDK:

import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "Answer in one paragraph."},
        {"role": "user", "content": "Explain Groq API rate limits in plain English."},
    ],
)

print(completion.choices[0].message.content)

OpenAI-compatible Python client:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GROQ_API_KEY"],
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me a 5-item API launch checklist."}],
)

print(response.choices[0].message.content)

cURL:

curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "user", "content": "What is the difference between RPM and TPM?"}
    ]
  }'

The practical migration note: if your app already points to OpenAI, changing the base_url may be enough for basic chat completions. It is not enough for every OpenAI feature. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name, and says n must equal 1 if supplied (OpenAI compatibility).
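
One way to guard against those 400 errors is to clean payloads before they leave your wrapper. The following is a minimal sketch, assuming requests are built as plain dictionaries. The helper name strip_unsupported_fields is hypothetical, and the field set covers only the fields named above, so check Groq's compatibility page for the current list.

```python
import copy

# Fields this article cites as unsupported on Groq's OpenAI-compatible endpoint.
UNSUPPORTED_TOP_LEVEL = {"logprobs", "logit_bias", "top_logprobs"}

def strip_unsupported_fields(payload):
    """Return a cleaned copy of an OpenAI-style chat payload (hypothetical helper)."""
    cleaned = copy.deepcopy(payload)
    for field in UNSUPPORTED_TOP_LEVEL:
        cleaned.pop(field, None)
    # messages[].name is also listed as unsupported.
    for message in cleaned.get("messages", []):
        message.pop("name", None)
    # n must equal 1 if supplied, so fail loudly instead of silently changing behavior.
    if cleaned.get("n", 1) != 1:
        raise ValueError("Groq requires n == 1 when n is supplied")
    return cleaned
```

Raising on n != 1 rather than rewriting it is a design choice: code that expects several choices back would otherwise break later in a less obvious place.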

Free Tier Rate Limits

Groq's free plan is useful, but the limit that bites first depends on your average request size.

| Model ID | Free RPM | Free RPD | Free TPM | Free TPD | Practical bottleneck | Status |
|---|---|---|---|---|---|---|
| llama-3.1-8b-instant | 30 | 14.4K | 6K | 500K | TPM for bursts, TPD for high-volume chat | Confirmed |
| llama-3.3-70b-versatile | 30 | 1K | 12K | 100K | TPD for medium prompts | Confirmed |
| openai/gpt-oss-120b | 30 | 1K | 8K | 200K | TPM for bursts, TPD for long sessions | Confirmed |
| openai/gpt-oss-20b | 30 | 1K | 8K | 200K | TPM for bursts | Confirmed |
| meta-llama/llama-4-scout-17b-16e-instruct | 30 | 1K | 30K | 500K | RPD before tokens for short calls | Confirmed |
| qwen/qwen3-32b | 60 | 1K | 6K | 500K | TPM before RPD for larger prompts | Confirmed |
| groq/compound | 30 | 250 | 70K | Not listed | RPD | Confirmed |
| whisper-large-v3-turbo | 20 | 2K | Not token-based | Not token-based | Audio seconds per hour/day | Confirmed |

Groq also publishes response headers for limits, remaining quota, reset windows, and retry-after. The important confirmed detail: x-ratelimit-limit-requests refers to RPD, while x-ratelimit-limit-tokens refers to TPM in the documented header examples (Groq rate limits). Treat headers as the runtime source of truth.
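
Reading those headers at runtime can be sketched as follows. Only the two limit header names come from the text above. The remaining and reset header names, and the duration format for reset values (strings like 2m59.56s), are assumptions modeled on Groq's documented examples, so verify them against a real response. The function names are hypothetical.

```python
import re

# Matches the parts of a duration string such as "2m59.56s" or "7.66s" (assumed format).
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}

def parse_reset(value):
    """Convert a reset duration string to seconds."""
    return sum(float(n) * _UNIT_SECONDS[u] for n, u in _DURATION_PART.findall(value))

def read_rate_limit_headers(headers):
    """Collect request (RPD) and token (TPM) quota state from lowercase header keys."""
    return {
        "rpd_limit": int(headers["x-ratelimit-limit-requests"]),
        "rpd_remaining": int(headers["x-ratelimit-remaining-requests"]),
        "rpd_reset_s": parse_reset(headers["x-ratelimit-reset-requests"]),
        "tpm_limit": int(headers["x-ratelimit-limit-tokens"]),
        "tpm_remaining": int(headers["x-ratelimit-remaining-tokens"]),
        "tpm_reset_s": parse_reset(headers["x-ratelimit-reset-tokens"]),
    }
```

The sketch expects a mapping with lowercase keys. HTTP client libraries usually expose case-insensitive header objects, so a plain dict copied from one may need its keys lowercased first.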

Free Tier Cost Math

Free-tier planning should start with tokens per call, not requests per day. A 1K RPD limit is irrelevant if the TPD cap runs out first.

| Scenario | Model | Average tokens per call | Published cap that binds | Calls/day before cap | Status |
|---|---|---|---|---|---|
| Short FAQ bot | openai/gpt-oss-120b | 500 total | 200K TPD | 400 calls/day | Confirmed math |
| Medium chat | openai/gpt-oss-120b | 2,000 total | 200K TPD | 100 calls/day | Confirmed math |
| Long RAG answer | openai/gpt-oss-120b | 8,000 total | 8K TPM and 200K TPD | 1 call/minute burst, 25 calls/day | Confirmed math |
| Fast extraction | llama-3.1-8b-instant | 500 total | 500K TPD | 1,000 calls/day (TPD binds well before 14.4K RPD) | Confirmed math |
| Llama 3.3 medium chat | llama-3.3-70b-versatile | 1,000 total | 100K TPD | 100 calls/day | Confirmed math |
| Qwen short coding helper | qwen/qwen3-32b | 500 total | 1K RPD and 500K TPD | 1,000 calls/day | Confirmed math |

Three concrete takeaways:

  1. openai/gpt-oss-120b at 2,000 total tokens per call is not a 1,000-call/day free backend. It is about 100 calls/day before 200K TPD.
  2. llama-3.3-70b-versatile has a stronger 12K TPM burst window than GPT OSS 120B, but its 100K TPD means the day can end faster for medium chat.
  3. qwen/qwen3-32b has 60 RPM but only 6K TPM, so twelve 500-token calls can exhaust one minute of token capacity even though request capacity looks larger.
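
The arithmetic behind these takeaways can be sketched in a few lines. The limits below are copied from the free-tier table in this article. The function name free_tier_capacity is hypothetical, and the sketch assumes every call uses the same total token count.

```python
# Free Plan limits copied from this article (verify against your account's Limits page).
FREE_LIMITS = {
    "openai/gpt-oss-120b": {"rpm": 30, "rpd": 1_000, "tpm": 8_000, "tpd": 200_000},
    "llama-3.3-70b-versatile": {"rpm": 30, "rpd": 1_000, "tpm": 12_000, "tpd": 100_000},
    "qwen/qwen3-32b": {"rpm": 60, "rpd": 1_000, "tpm": 6_000, "tpd": 500_000},
}

def free_tier_capacity(model, tokens_per_call):
    """Return how many equal-sized calls fit per minute and per day."""
    limits = FREE_LIMITS[model]
    # Whichever cap is hit first (requests or tokens) decides the capacity.
    per_minute = min(limits["rpm"], limits["tpm"] // tokens_per_call)
    per_day = min(limits["rpd"], limits["tpd"] // tokens_per_call)
    return {"calls_per_minute": per_minute, "calls_per_day": per_day}

print(free_tier_capacity("openai/gpt-oss-120b", 2_000))
print(free_tier_capacity("qwen/qwen3-32b", 500))
```

For openai/gpt-oss-120b at 2,000 tokens per call this gives 4 calls per minute and 100 per day. For qwen/qwen3-32b at 500 tokens it gives 12 calls per minute, well under the 60 RPM headline.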

This is the same hidden math developers miss when they compare free providers. A free model with impressive RPM can still lose to a lower-RPM provider if the token/day cap is tight. The broader provider comparison sits in Cheapest AI API Providers 2026.

Official Pricing

Groq pricing is pay-per-token for listed LLMs, with separate units for speech and built-in tools. The table below uses only official published prices visible on Groq pricing or supported-model pages as of this verification date.

| Model or tool | Input price | Cached input | Output price | Speed listed by Groq | Context | Status |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | $0.075 / 1M | $0.0375 / 1M | $0.30 / 1M | 1,000 TPS | 131,072 | Confirmed |
| openai/gpt-oss-120b | $0.15 / 1M | $0.075 / 1M | $0.60 / 1M | 500 TPS | 131,072 | Confirmed |
| llama-3.1-8b-instant | $0.05 / 1M | Not listed in pricing cache table | $0.08 / 1M | 840 TPS on pricing page, 560 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| llama-3.3-70b-versatile | $0.59 / 1M | Not listed in pricing cache table | $0.79 / 1M | 394 TPS on pricing page, 280 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| meta-llama/llama-4-scout-17b-16e-instruct | $0.11 / 1M | Not listed in pricing cache table | $0.34 / 1M | 594 TPS on pricing page, 750 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| qwen/qwen3-32b | $0.29 / 1M | Not listed in pricing cache table | $0.59 / 1M | 662 TPS on pricing page, 400 T/sec on models page | 131,072 | Confirmed, with page mismatch |
| whisper-large-v3 | Audio | N/A | $0.111 / hour | 217x speed factor | Audio | Confirmed |
| whisper-large-v3-turbo | Audio | N/A | $0.04 / hour | 228x speed factor | Audio | Confirmed |
| Built-in web search basic | N/A | N/A | $5 / 1,000 requests | N/A | Tool call | Confirmed |
| Built-in code execution | N/A | N/A | $0.18 / hour | N/A | Tool runtime | Confirmed |

The speed mismatch between Groq's pricing and model pages is not a contradiction worth over-reading. It is a reminder to label speed as page-reported product data, not as a guaranteed SLA. Groq's enterprise Performance Tier separately describes a 99.9% availability SLA and latency guarantee aligned to enterprise agreements (Performance Tier).

Workload Cost Projections

The paid math is straightforward. Multiply input tokens by input price and output tokens by output price. The tricky part is choosing the model that fits the job.

| Monthly workload | Model | Input tokens | Output tokens | Monthly cost | Batch eligible cost | Status |
|---|---|---|---|---|---|---|
| Small support bot | openai/gpt-oss-20b | 10M | 2M | $1.35 | $0.675 | Confirmed math |
| Small support bot | openai/gpt-oss-120b | 10M | 2M | $2.70 | $1.35 | Confirmed math |
| RAG assistant | openai/gpt-oss-120b | 100M | 20M | $27.00 | $13.50 | Confirmed math |
| RAG assistant with 50% cached input | openai/gpt-oss-120b | 100M | 20M | $23.25 (50M x $0.15 + 50M x $0.075 + 20M x $0.60) | Not stackable with batch | Confirmed math |
| Lightweight extraction | llama-3.1-8b-instant | 100M | 10M | $5.80 | $2.90 | Confirmed math |
| Qwen coding helper | qwen/qwen3-32b | 50M | 25M | $29.25 | $14.625 | Confirmed math |
| Llama 3.3 long-form chat | llama-3.3-70b-versatile | 50M | 25M | $49.25 | $24.625 | Confirmed math |
| 1,000 hours transcription | whisper-large-v3-turbo | Audio | Audio | $40.00 | $20.00 if batch-supported | Confirmed math |

Cost example 1: 50 support tickets/day, 30 days, 2K input and 500 output tokens each equals 3M input and 0.75M output tokens/month. On GPT OSS 20B, that is 3M x $0.075 + 0.75M x $0.30 = $0.45/month. The free tier may cover it if daily TPD and burst caps fit.

Cost example 2: 2,000 RAG calls/day, 30 days, 1,500 input and 500 output tokens each equals 90M input and 30M output tokens/month. On GPT OSS 120B, that is 90M x $0.15 + 30M x $0.60 = $31.50/month. Batch would cut eligible async work to $15.75, but live chat needs synchronous routing.

Cost example 3: 100 developer-agent runs/day, 30 days, 20K input and 4K output tokens each equals 60M input and 12M output tokens/month. On Llama 3.3 70B, that is 60M x $0.59 + 12M x $0.79 = $44.88/month. On GPT OSS 120B, the same token shape is $16.20/month. Whether quality is good enough is task-specific; the price gap is confirmed.
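
The three cost examples above follow the same formula, sketched below for synchronous, uncached traffic. Prices are copied from the pricing table in this article. The function name monthly_cost and the fixed 30-day month are assumptions for illustration.

```python
# USD per 1M tokens as (input, output), copied from the pricing table in this article.
PRICES = {
    "openai/gpt-oss-20b": (0.075, 0.30),
    "openai/gpt-oss-120b": (0.15, 0.60),
    "llama-3.3-70b-versatile": (0.59, 0.79),
}

def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Monthly cost in USD for a fixed per-call token shape, without caching or batch."""
    input_price, output_price = PRICES[model]
    total_input_m = calls_per_day * days * input_tokens / 1_000_000
    total_output_m = calls_per_day * days * output_tokens / 1_000_000
    return round(total_input_m * input_price + total_output_m * output_price, 2)

print(monthly_cost("openai/gpt-oss-20b", 50, 2_000, 500))
print(monthly_cost("openai/gpt-oss-120b", 2_000, 1_500, 500))
print(monthly_cost("llama-3.3-70b-versatile", 100, 20_000, 4_000))
```

These calls reproduce the $0.45, $31.50, and $44.88 figures from the examples. Real traffic rarely has a fixed token shape, so treat the result as a planning estimate rather than a bill forecast.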

For multi-provider fallback and budget caps, compare this with OpenRouter Alternatives 2026. The cost math changes once a gateway adds markup, routing fees, or provider-specific billing rules.

Error Codes and Fixes

The most common Groq setup failure is not model quality. It is a small auth, payload, model, rate-limit, or tier mistake.

| HTTP code or surface error | What it means | First fix | Production fix | Status |
|---|---|---|---|---|
| 400 Bad Request | Invalid syntax or unsupported parameter | Remove unsupported fields, validate JSON | Provider-specific schema guard | Confirmed |
| 401 Unauthorized | Missing or invalid authentication | Check Authorization: Bearer $GROQ_API_KEY | Secret manager, key rotation, startup health check | Confirmed |
| 403 Forbidden | Permission restriction | Check project/org permissions | Use model permissions and least-privilege keys | Confirmed |
| 404 Not Found | Wrong URL or resource/model not found | Check endpoint and model ID | Pull active model list from /openai/v1/models | Confirmed |
| 413 Request Entity Too Large | Request body too large | Reduce prompt, file, or request size | Chunking and context budget gate | Confirmed |
| 422 Unprocessable Entity | Semantic failure or model hallucination class error | Retry if safe, validate inputs | Typed retry policy by task | Confirmed |
| 429 Too Many Requests | Too many requests in timeframe | Throttle and respect retry-after | Token-aware queue with per-model buckets | Confirmed |
| 498 Flex Tier Capacity Exceeded | Flex capacity unavailable | Retry later or fall back to on-demand | Jittered backoff and non-flex fallback | Confirmed |
| 500/502/503/504 | Server-side failure class | Retry later | Circuit breaker and fallback provider | Confirmed |

A minimal 429 handler should inspect headers, not blindly sleep:

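import random

def groq_retry_sleep(response_headers, attempt):
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    # Prefer the server-provided retry-after value (in seconds) when present.
    retry_after = response_headers.get("retry-after")
    if retry_after:
        try:
            return float(retry_after)
        except ValueError:
            pass  # Not a plain number; fall back to exponential backoff.

    # Exponential backoff capped at 30 seconds, plus up to 50% random jitter.
    base = min(2 ** attempt, 30)
    jitter = random.uniform(0, 0.5 * base)
    return base + jitter

def should_retry(status_code):
    """Retry only on rate-limit, flex-capacity, and server-side errors."""
    return status_code in {429, 498, 500, 502, 503, 504}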

That function is deliberately conservative. For a real launch, track RPM, RPD, TPM, and TPD per model. A request can be legal by RPM and still illegal by TPM.
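
A per-model admission check that covers both RPM and TPM can be sketched as below. This is an illustration, not Groq's accounting: the sliding 60-second window is an assumption, the class name MinuteBudget is hypothetical, and daily caps (RPD, TPD) would need a second tracker of the same shape.

```python
from collections import deque

class MinuteBudget:
    """Sliding 60-second window that tracks request count and token count together."""

    def __init__(self, rpm, tpm):
        self.rpm = rpm
        self.tpm = tpm
        self.events = deque()  # (timestamp_seconds, tokens) for each admitted request

    def _evict(self, now):
        # Drop requests that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60:
            self.events.popleft()

    def try_admit(self, tokens, now):
        """Record the request and return True only if both budgets still have room."""
        self._evict(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) + 1 > self.rpm:
            return False  # Request budget exhausted.
        if used_tokens + tokens > self.tpm:
            return False  # Token budget exhausted, even though RPM has room.
        self.events.append((now, tokens))
        return True
```

With rpm=30 and tpm=8000, four 2,000-token requests are admitted and the fifth is refused on tokens alone, which is the RPM-legal but TPM-illegal case described above. Passing `now` explicitly, for example from time.monotonic(), keeps the class easy to test.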

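Batch, Flex, and Caching

Groq has three cost/capacity levers that developers often mix up: prompt caching, the Batch API, and Flex Processing. The table also lists two related controls, Performance Tier and spend limits.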

| Lever | What it does | Discount or capacity effect | Best for | Caveat | Status |
|---|---|---|---|---|---|
| Prompt caching | Reuses matching prompt prefixes | 50% discount on cached input tokens for eligible models | Repeated system prompts, tool definitions, RAG templates | Exact prefix shape matters; cache hit not guaranteed | Confirmed |
| Batch API | Async batch jobs | 50% lower cost than synchronous APIs | Evaluations, extraction, offline generation, transcription | 24-hour to 7-day processing window; no real-time response | Confirmed |
| Flex Processing | Paid service tier | 10x higher rate limits than on-demand while capacity exists | High-throughput workloads that can retry | Paid only; 498 capacity_exceeded possible | Confirmed |
| Performance Tier | Enterprise provisioned capacity | SLA and latency guarantee by agreement | Production-critical paths | Enterprise only and provisioned capacity pricing | Confirmed |
| Spend limits | Monthly budget cap | Blocks calls after cap is reached | Cost control | 10-15 minute tracking delay can allow small overrun | Confirmed |

The discount rule matters: Batch and prompt caching do not stack. If a 100M input-token GPT OSS 120B batch is eligible for Batch, the input cost is 100M x $0.15 x 50% = $7.50. It is not 100M x $0.075 x 50% = $3.75. Groq says batch tokens are billed at the 50% batch rate regardless of cache status (Batch API, Prompt caching).
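
The non-stacking rule can be written down as a small cost function. This is a sketch under the billing rule stated above. The function name input_cost_usd and its parameters are hypothetical, and token volumes are expressed in millions.

```python
def input_cost_usd(input_tokens_m, input_price, cached_price, cached_fraction=0.0, batch=False):
    """Input-side cost in USD for a volume given in millions of tokens."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    if batch:
        # Batch tokens are billed at 50% of the standard input price,
        # regardless of cache status, so cached_fraction is ignored here.
        return input_tokens_m * input_price * 0.5
    cached = input_tokens_m * cached_fraction * cached_price
    uncached = input_tokens_m * (1.0 - cached_fraction) * input_price
    return cached + uncached

# GPT OSS 120B prices from this article: $0.15 input, $0.075 cached input per 1M.
print(input_cost_usd(100, 0.15, 0.075, cached_fraction=1.0, batch=True))
```

The printed value is the $7.50 batch figure even though every token is marked as cached, which is the point of the rule: in batch mode the cache status does not change the price.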

Use Case Matrix

| Use case | Start with | Why | Avoid | Status |
|---|---|---|---|---|
| First API test | openai/gpt-oss-120b on Free Plan | Strong flagship open-weight model, official quickstart path | Assuming free tier supports real launch volume | Confirmed |
| Cheapest paid text extraction | llama-3.1-8b-instant | $0.05/$0.08 per 1M and fast | Complex reasoning tasks | Confirmed |
| Cheap reasoning experiment | openai/gpt-oss-20b | $0.075/$0.30 with 128K context | High-quality production reasoning without eval | Likely |
| Stronger open-weight assistant | openai/gpt-oss-120b | $0.15/$0.60, 128K context, GPT OSS tool capabilities in Groq docs | Treating speed page as SLA | Confirmed |
| Long context RAG | GPT OSS 120B or Llama 3.3 70B | 131,072 context listed | Free-tier TPD exhaustion | Confirmed |
| Short coding helper | qwen/qwen3-32b | Low output price and 60 RPM free limit | Large bursts over 6K TPM | Confirmed |
| Offline evaluation | Batch API | 50% lower cost, separate from standard limits | User-facing chat | Confirmed |
| Spiky high-throughput jobs | Flex Processing | 10x higher paid limits while capacity exists | Workloads that cannot handle 498 | Confirmed |
| Enterprise SLA path | Performance Tier | 99.9% availability SLA described for enterprise agreements | Free/developer assumptions | Confirmed |

Risks and Caveats

| Risk | What can go wrong | Mitigation | Status |
|---|---|---|---|
| Free-tier overconfidence | TPD binds before RPD | Calculate tokens per call before launch | Confirmed |
| OpenAI compatibility mismatch | Unsupported fields cause 400 errors | Strip provider-specific unsupported fields | Confirmed |
| Header misread | Request limit header is RPD, token limit header is TPM in docs | Build per-header parsing correctly | Confirmed |
| Flex retry storm | 498 capacity_exceeded causes aggressive retries | Jittered backoff and on-demand fallback | Confirmed |
| Batch expiration | Long batch window expires under load | Split batches, use longer windows, resubmit failed items | Confirmed |
| Spend limit lag | Tracking delay can exceed monthly cap slightly | Lower cap buffer and monitor first week | Confirmed |
| Model changes | Preview models can be discontinued at short notice | Prefer production models for customer traffic | Confirmed |
| Speed-page drift | Pricing and model pages can show different speed numbers | Do not cite speed as SLA unless enterprise agreement covers it | Likely |
| Future free-limit changes | Popular models may tighten quotas | Monitor docs and keep fallback provider | Speculation |
| Provider concentration | Single-provider outage breaks app | Route through multi-provider gateway | Likely |
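
The fallback mitigation can be sketched without tying it to any particular gateway. Everything here is hypothetical scaffolding: route_with_fallback is an illustrative name, and each provider is assumed to be a callable that takes a payload and returns a (status_code, body) pair.

```python
RETRYABLE = {429, 498, 500, 502, 503, 504}

def route_with_fallback(payload, providers):
    """Try each (name, send) pair in order; move on when a retryable status comes back."""
    last_status = None
    for name, send in providers:
        status, body = send(payload)
        if status == 200:
            return name, body
        if status not in RETRYABLE:
            # Errors such as 400, 401, or 404 usually point to a request or
            # configuration bug, so surface them instead of masking them.
            raise RuntimeError(f"{name} rejected the request with status {status}")
        last_status = status
    raise RuntimeError(f"all providers failed; last status was {last_status}")
```

A production version would add per-provider backoff and a circuit breaker, but even this ordering logic keeps a single 429 or 498 from becoming a user-facing failure.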

Final Recommendation

Use Groq free access for prototypes, evals, and fast short calls. For production, start paid with spend limits, token-aware throttling, Batch for async jobs, and fallback routing. The best default test path is GPT OSS 120B; the best cheap paid path is workload-dependent.

FAQ

How do I get Groq API access?

Create a GroqCloud account, generate an API key, and set it as GROQ_API_KEY. Then use either the Groq SDK or an OpenAI-compatible client pointed at https://api.groq.com/openai/v1.

Is Groq API free in 2026?

Yes, Groq has a Free Plan, but it is rate limited by model. Treat it as a prototype and evaluation lane, not a guaranteed production backend.

What are Groq free tier rate limits?

They vary by model. For example, Groq lists llama-3.3-70b-versatile at 30 RPM, 1K RPD, 12K TPM, and 100K TPD, while openai/gpt-oss-120b is 30 RPM, 1K RPD, 8K TPM, and 200K TPD.

Which Groq model should I start with?

Start with openai/gpt-oss-120b if you want a strong general test model. Start with llama-3.1-8b-instant if the task is simple extraction and cost matters more than reasoning depth.

Is Groq fully OpenAI-compatible?

No. Groq is mostly OpenAI-compatible, but not every OpenAI field is supported. Groq documents unsupported fields such as logprobs, logit_bias, top_logprobs, and messages[].name.

How do I fix Groq 429 errors?

Throttle requests, respect retry-after, and track both request and token buckets. A 429 can be caused by RPM, RPD, TPM, or TPD depending on model and traffic shape.

Does Groq Batch API reduce cost?

Yes. Groq says Batch Processing has 50% lower cost than synchronous APIs. It is best for offline jobs because the processing window is 24 hours to 7 days.

Can I use Groq for production?

Yes, but not casually on the Free Plan. Use production models, paid limits, spend caps, monitoring, retries, and fallback routing before sending real user traffic.
