TokenMix Research Lab · 2026-04-12

Groq API Tutorial: Getting Started With the Fastest AI Inference in 2026
Last Updated: 2026-04-29
Author: TokenMix Research Lab
Groq's custom LPU hardware serves Llama 3.3 70B at 200-350 tok/sec, roughly 3-10x faster than the 30-80 tok/sec typical of GPU providers. A 500-token response takes about 1.5-2.5 seconds on Groq versus 6-17 seconds on GPU. The free tier needs no credit card and allows 30 RPM, 6K TPM, and 500K tokens/day, which is enough for development and light production. The API is OpenAI-compatible: use the openai SDK with base_url="https://api.groq.com/openai/v1". The trade-off is that Groq serves open models only (Llama, Mistral, Gemma), with no GPT, Claude, or Gemini.
Groq is the fastest AI inference provider available today. Its custom LPU (Language Processing Unit) hardware delivers 200-350 tokens per second on Llama 3.3 70B, 3-10x faster than GPU-based providers. The free tier requires no credit card, and you can make your first API call in under five minutes. This Groq tutorial covers account setup, model selection, Python and Node.js examples, streaming, rate limit handling, and when to upgrade from the free tier to the Developer plan. Speed and pricing data verified by TokenMix.ai in April 2026.
Table of Contents
- Quick Reference: Groq API Models and Pricing
- Why Groq Is Different: LPU vs GPU
- Getting Started: Free Account Setup
- Your First Groq API Call in Python
- Your First Groq API Call in Node.js
- Model Selection Guide
- Streaming: Real-Time Token Delivery
- Rate Limit Handling and Free Tier Optimization
- Tool Calling and JSON Mode
- Using Groq Through TokenMix.ai
- When to Upgrade From Free to Developer Tier
- Groq vs Other Providers: Speed and Cost
- Should You Choose Groq?
- What's the Bottom Line on the Groq API?
- FAQ
Quick Reference: Groq API Models and Pricing
Groq serves 6 models. Prices below are input/output in dollars per million tokens. Llama 3.3 70B: $0.59/$0.79, 200-350 tok/s, 128K context (the default recommendation). Llama 4 Scout: $0.11/$0.34, 300-500 tok/s, 512K context (the largest). Llama 4 Maverick: $0.50/$0.77, 150-250 tok/s, 256K context. Mistral Saba: $0.20/$0.60, multilingual. DeepSeek R1 Distill: $0.75/$0.99, reasoning. Gemma 2 9B: $0.20/$0.20, the fastest at 400-600 tok/s and the cheapest on output, with an 8K context. Every model has a 6K TPM free-tier limit except Gemma 2 9B (15K).
| Model | Model ID | Input $/M | Output $/M | Speed (tok/s) | Context | Free Tier |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | llama-3.3-70b-versatile | $0.59 | $0.79 | 200-350 | 128K | 6K TPM |
| Llama 4 Scout | meta-llama/llama-4-scout-17b-16e-instruct | $0.11 | $0.34 | 300-500 | 512K | 6K TPM |
| Llama 4 Maverick | meta-llama/llama-4-maverick-17b-128e-instruct | $0.50 | $0.77 | 150-250 | 256K | 6K TPM |
| Mistral Saba | mistral-saba-24b | $0.20 | $0.60 | 250-400 | 32K | 6K TPM |
| DeepSeek R1 Distill | deepseek-r1-distill-llama-70b | $0.75 | $0.99 | 150-250 | 128K | 6K TPM |
| Gemma 2 9B | gemma2-9b-it | $0.20 | $0.20 | 400-600 | 8K | 15K TPM |
Why Groq Is Different: LPU vs GPU
Groq runs models on custom LPU hardware built for sequential token generation rather than on general-purpose GPUs adapted for AI. On Llama 3.3 70B that means 200-350 tok/s versus 30-80 tok/s on GPU, a 3-10x gap. A 500-token response takes about 1.5-2.5 seconds on Groq versus 6-17 seconds on GPU, and streamed tokens appear faster than users can read. The trade-off is open models only (Llama, Mistral, Gemma). Where speed matters more than model-specific capabilities (GPT-4.1, Claude), the trade is worth it.
Most AI inference providers run models on GPUs, typically from NVIDIA. Groq built its own hardware, the LPU (Language Processing Unit), designed specifically for sequential token generation.
The result: Groq serves Llama 3.3 70B at 200-350 tokens/second output speed. The same model on cloud GPU infrastructure typically runs at 30-80 tokens/second. That is a 3-10x speed advantage.
What this means in practice:
- A 500-token response takes about 1.5-2.5 seconds to generate on Groq vs 6-17 seconds on GPU providers
- Streaming feels instant -- tokens appear faster than most users can read
- Development iteration is faster -- you spend less time waiting for responses during testing
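The response-time figures follow directly from the quoted output speeds. The sketch below is a back-of-the-envelope check, not a benchmark: it assumes a steady output rate and ignores network latency and time-to-first-token. The function name `generation_seconds` is illustrative.

```python
def generation_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate `tokens` at a steady output speed.

    Ignores network latency and time-to-first-token.
    """
    return tokens / tokens_per_second


# Output-speed ranges quoted in this article for Llama 3.3 70B.
GROQ_TOK_S = (200, 350)
GPU_TOK_S = (30, 80)

for label, (slow, fast) in {"Groq": GROQ_TOK_S, "GPU": GPU_TOK_S}.items():
    best = generation_seconds(500, fast)
    worst = generation_seconds(500, slow)
    print(f"{label}: {best:.1f}-{worst:.1f} s for a 500-token response")
```

Real end-to-end latency will be somewhat higher, since the request still has to cross the network and wait for the first token.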
The trade-off: Groq currently runs a limited set of open models (primarily Llama, Mistral, and Gemma). You cannot use GPT-4.1, Claude, or Gemini on Groq. For applications where speed matters more than model-specific capabilities, this trade-off is worth it.
TokenMix.ai latency benchmarks confirm Groq consistently delivers the lowest time-to-first-token and fastest throughput of any API provider.
Getting Started: Free Account Setup
Signup takes three steps and under five minutes: (1) sign up at console.groq.com with Google, GitHub, or email (no credit card, no waitlist, no approval); (2) generate an API key in the dashboard (it starts with gsk_); (3) test with curl against https://api.groq.com/openai/v1/chat/completions. The free tier allows 30 RPM, 6K TPM, and 500K tokens/day, which is enough for development, prototyping, and light production. All models are accessible.
Groq's signup process is the fastest in the industry. No credit card, no waitlist, no approval process.
Step 1: Create an Account
Go to console.groq.com. Click "Sign Up." You can register with Google, GitHub, or email. The process takes under 60 seconds.
Step 2: Get Your API Key
After login, you land on the "API Keys" page. Click "Create API Key." Give it a name (e.g., "my-project"). Copy the key. It starts with gsk_.
Step 3: Test With curl
curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'
Note the base URL: https://api.groq.com/openai/v1. Groq implements the OpenAI API format, so the path includes /openai/v1, and the chat endpoint sits beneath it at /chat/completions.
What You Get on the Free Tier
- No credit card required
- Access to all models
- Rate limits: 6,000 tokens per minute (TPM), 30 requests per minute (RPM) per model
- Daily token limits: 500K tokens/day for most models
- Sufficient for development, prototyping, and light production
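To judge whether these limits fit a workload, it helps to translate them into requests. The sketch below is a rough estimate under a simplifying assumption: every request consumes the same number of tokens (prompt plus completion), and all of them count against the per-minute and daily limits. The names `FREE_TIER` and `free_tier_capacity` are illustrative.

```python
# Free-tier limits as listed in this article (per model).
FREE_TIER = {"rpm": 30, "tpm": 6_000, "tokens_per_day": 500_000}


def free_tier_capacity(tokens_per_request: int, limits: dict = FREE_TIER) -> dict:
    """Estimate how many equal-sized requests fit within the limits.

    Assumes `tokens_per_request` covers prompt + completion tokens.
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # Whichever of the two per-minute limits is hit first is the real ceiling.
    per_minute = min(limits["rpm"], limits["tpm"] // tokens_per_request)
    per_day = limits["tokens_per_day"] // tokens_per_request
    return {"requests_per_minute": per_minute, "requests_per_day": per_day}


print(free_tier_capacity(500))
print(free_tier_capacity(100))
```

With 500-token requests the token limit binds before the request limit does, which is why trimming prompts matters on the free tier.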
Your First Groq API Call in Python
Use the standard openai package; no separate Groq SDK is required. Run pip install openai, create OpenAI(api_key=..., base_url="https://api.groq.com/openai/v1"), and pass model="llama-3.3-70b-versatile". To verify speed, time a request: a typical 200-token response arrives in 0.7-1.0 seconds, or 200-300 tok/s. Use os.getenv("GROQ_API_KEY") to keep keys out of source code. The speed feels qualitatively different from cloud GPU providers, which shortens the wait between test runs during development.
Installation
pip install openai
Groq is OpenAI-compatible. You use the standard openai SDK.
Basic Chat Completion
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Answer in 2-3 sentences."},
        {"role": "user", "content": "What is a REST API?"}
    ]
)

print(response.choices[0].message.content)
# Response arrives in <1 second for short outputs
Check Response Speed
import time

# Reuses the `client` created in the previous example.
start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 200-word explanation of machine learning."}]
)
# Elapsed time includes the network round trip, so this understates raw generation speed.
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens
speed = tokens / elapsed
print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.0f} tok/s)")
# Typical output: Generated 250 tokens in 0.90s (278 tok/s)
Using Environment Variables
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

Set the variable in your shell before running the script:

export GROQ_API_KEY="gsk_your-key-here"
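Because os.getenv returns None when the variable is unset, a missing key only surfaces later as an authentication error. One option is to check it up front. The helper below is a minimal sketch; `load_groq_key` is a hypothetical name, and the prefix check relies only on the gsk_ prefix mentioned in the setup steps.

```python
import os


def load_groq_key(env=os.environ) -> str:
    """Return the Groq API key from the environment or fail with a clear message."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set; export it before running.")
    if not key.startswith("gsk_"):
        # Groq keys start with gsk_ (see Step 2), so anything else is likely a paste error.
        raise RuntimeError("GROQ_API_KEY does not start with 'gsk_'; check the copied key.")
    return key
```

The result can be passed straight to the client, for example `OpenAI(api_key=load_groq_key(), base_url="https://api.groq.com/openai/v1")`.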
Your First Groq API Call in Node.js
The pattern is the same as in Python: run npm install openai, create new OpenAI({apiKey, baseURL: "https://api.groq.com/openai/v1"}), and pass model: "llama-3.3-70b-versatile". TypeScript types come from the openai package. For error handling, check instanceof OpenAI.RateLimitError (the free tier allows 30 RPM and 6K TPM) and OpenAI.AuthenticationError. The client is compatible with edge runtimes (Cloudflare Workers, Vercel Edge). Streaming via async iterators feels like real-time text generation because of LPU speed.
Installation
npm install openai
Basic Chat Completion
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a REST API?" },
  ],
});

console.log(response.choices[0].message.content);
Error Handling
try {
  const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.RateLimitError) {
    console.error("Rate limited. Free tier: 30 RPM, 6K TPM.");
  } else if (error instanceof OpenAI.AuthenticationError) {
    console.error("Invalid API key.");
  } else {
    throw error;
  }
}
Model Selection Guide
Default to Llama 3.3 70B: it has the best overall quality on Groq and handles most tasks. Switch only when a specific requirement demands it. For long documents (100K+ tokens), use Llama 4 Scout (512K context). For fast, simple tasks, use Gemma 2 9B (400-600 tok/s, lowest cost). For reasoning and math, use DeepSeek R1 Distill 70B. For multilingual work, use Mistral Saba. For maximum context plus quality, use Llama 4 Maverick (256K). Most apps stay on Llama 3.3 70B.
| Task | Best Groq Model | Why |
|---|---|---|
| General chat, Q&A | Llama 3.3 70B | Best overall quality on Groq |
| Long documents (100K+ tokens) | Llama 4 Scout | 512K context window |
| Fast, simple tasks | Gemma 2 9B | Fastest speed, lowest cost |
| Reasoning and math | DeepSeek R1 Distill 70B | Best reasoning on Groq |
| Multilingual tasks | Mistral Saba | Strong multilingual performance |
| Code generation | Llama 3.3 70B | Best code quality on Groq |
| Maximum context + quality | Llama 4 Maverick | 256K context, strong performance |
Default recommendation: Start with Llama 3.3 70B. It handles most tasks well at a good price. Switch to specialized models only when specific requirements demand it.
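In code, this guidance reduces to a lookup with a default. The mapping below is a sketch: the model IDs come from the quick reference table, while the task labels and the names `MODEL_BY_TASK` and `pick_model` are made up for illustration.

```python
DEFAULT_MODEL = "llama-3.3-70b-versatile"

# Task labels are illustrative; model IDs are the ones listed in the quick reference table.
MODEL_BY_TASK = {
    "general": "llama-3.3-70b-versatile",
    "code": "llama-3.3-70b-versatile",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
    "fast_simple": "gemma2-9b-it",
    "reasoning": "deepseek-r1-distill-llama-70b",
    "multilingual": "mistral-saba-24b",
    "max_context_quality": "meta-llama/llama-4-maverick-17b-128e-instruct",
}


def pick_model(task: str) -> str:
    """Return the model ID for a task label, falling back to the default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


print(pick_model("reasoning"))
print(pick_model("something-else"))
```

Keeping the mapping in one place means a model swap is a one-line change rather than a search through call sites.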
Streaming: Real-Time Token Delivery
The standard OpenAI streaming patterns work unchanged: stream=True plus iteration in Python, async iterators in Node.js. The difference is speed. At 200-500 tok/s, streaming feels like rendering pre-generated text, with tokens arriving faster than users can read. On GPU providers at 30-80 tok/s, streaming visibly types out word by word. Streaming on Groq suits chat interfaces, autocomplete, and real-time assistants, where perceived speed drives engagement.
Groq's speed makes streaming feel qualitatively different from other providers. Tokens arrive so fast that streaming approaches the feel of pre-generated text.
Python Streaming
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain Docker containers in detail."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
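To put a number on how fast streaming feels, you can record when the first text arrives while consuming the stream. The helper below is a sketch rather than a benchmark harness: `consume_stream` is a hypothetical name, the clock starts when iteration begins (not when the request is sent), and the fake chunks only mimic the `choices[0].delta.content` shape used above so the example runs without an API key.

```python
import time
from types import SimpleNamespace


def consume_stream(chunks, clock=time.perf_counter):
    """Collect streamed text and record when the first text arrived."""
    start = clock()
    first_token_at = None
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:
            if first_token_at is None:
                first_token_at = clock()
            parts.append(text)
    end = clock()
    return {
        "text": "".join(parts),
        "ttft_s": None if first_token_at is None else first_token_at - start,
        "total_s": end - start,
    }


def _fake_chunk(text):
    # Stand-in with the same attribute shape as a streamed chunk.
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


result = consume_stream([_fake_chunk("Hello"), _fake_chunk(None), _fake_chunk(", world")])
print(result["text"])
```

With a real client, pass the `stream` object from the example above in place of the fake chunks.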
Node.js Streaming
const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Explain Docker containers in detail." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
Rate Limit Handling and Free Tier Optimization
Free tier limits per model are 30 RPM, 6K TPM (15K for Gemma 2 9B), and 500K tokens/day. Response headers expose the current state: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, and x-ratelimit-reset-tokens (time until reset). Four strategies stretch the free tier: (1) keep prompts short, with system prompts under 200 tokens; (2) use Gemma 2 9B for simple tasks, since it has a 15K TPM ceiling; (3) queue requests about 2 seconds apart; (4) cache responses locally for repeated queries.
Free Tier Limits
| Model | Requests/Min | Tokens/Min | Tokens/Day |
|---|---|---|---|
| Llama 3.3 70B | 30 | 6,000 | 500,000 |
| Llama 4 Scout | 30 | 6,000 | 500,000 |
| Gemma 2 9B | 30 | 15,000 | 500,000 |
| DeepSeek R1 Distill | 30 | 6,000 | 500,000 |
Rate Limit Response Headers
Groq returns detailed rate limit information in every response:
x-ratelimit-limit-requests: 30
x-ratelimit-remaining-requests: 28
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-tokens: 4500
x-ratelimit-reset-requests: 2.1s
x-ratelimit-reset-tokens: 45.5s
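These headers can drive a simple "should I wait?" decision before the next request. The sketch below works on a plain dict of lowercase header names. It assumes reset values are durations such as `2.1s`, as in the sample above, and also accepts compound forms such as `1m30s` as a precaution. The names `parse_reset` and `seconds_to_wait` and the `min_tokens` threshold are illustrative. With the openai Python SDK, headers should be reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`.

```python
import re

# "ms" is listed first so it is not mistaken for minutes followed by seconds.
_DURATION_PART = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}


def parse_reset(value: str) -> float:
    """Convert a duration such as '2.1s', '45.5s' or '1m30s' to seconds."""
    parts = _DURATION_PART.findall(value.strip())
    if not parts:
        raise ValueError(f"unrecognized duration: {value!r}")
    return sum(float(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def seconds_to_wait(headers: dict, min_requests: int = 1, min_tokens: int = 500) -> float:
    """Return how long to pause before the next call; 0.0 if there is headroom.

    `headers` is a mapping with lowercase header names.
    """
    wait = 0.0
    if int(headers.get("x-ratelimit-remaining-requests", min_requests)) < min_requests:
        wait = max(wait, parse_reset(headers["x-ratelimit-reset-requests"]))
    if int(headers.get("x-ratelimit-remaining-tokens", min_tokens)) < min_tokens:
        wait = max(wait, parse_reset(headers["x-ratelimit-reset-tokens"]))
    return wait


sample = {
    "x-ratelimit-remaining-requests": "28",
    "x-ratelimit-remaining-tokens": "4500",
    "x-ratelimit-reset-requests": "2.1s",
    "x-ratelimit-reset-tokens": "45.5s",
}
print(seconds_to_wait(sample))
```

Set `min_tokens` to roughly the size of your next request so the check fires before the API returns a 429.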
Free Tier Optimization Strategies
1. Use shorter prompts. The 6K TPM limit means long prompts eat your budget. Keep system prompts under 200 tokens.
2. Use Gemma 2 9B for simple tasks. It has a higher TPM limit (15K) and processes faster. Use Llama 70B only when you need the quality.
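3. Implement request queuing. Space requests about 2 seconds apart (60 seconds / 30 requests) to stay within 30 RPM, or cap the number of requests in any 60-second window with a simple limiter:
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: allows at most `rpm` requests per 60 seconds."""

    def __init__(self, rpm=30):
        self.requests = deque()  # timestamps of recent requests
        self.rpm = rpm

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have left the 60-second window.
        while self.requests and self.requests[0] < now - 60:
            self.requests.popleft()
        # If the window is full, sleep until the oldest request expires.
        if len(self.requests) >= self.rpm:
            wait_time = 60 - (now - self.requests[0])
            time.sleep(wait_time)
        self.requests.append(time.time())

limiter = RateLimiter(rpm=28)  # Stay slightly under limit

# `prompts` is your own list of prompt strings; `client` is the OpenAI client created earlier.
for prompt in prompts:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )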
4. Cache responses locally. For repeated identical queries, cache the response to avoid hitting the API again.
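A local cache for identical queries can be as small as a dict keyed by a hash of the request. This is a minimal in-memory sketch: the names `cache_key` and `cached_completion` are illustrative, `call_api` stands for whatever function actually calls the API, and a real application would probably want expiry and persistence as well.

```python
import hashlib
import json


def cache_key(model: str, messages: list) -> str:
    """Stable key for a (model, messages) pair."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_completion(call_api, model: str, messages: list, cache: dict):
    """Return the cached result if present; otherwise call the API once and store it."""
    key = cache_key(model, messages)
    if key not in cache:
        cache[key] = call_api(model, messages)
    return cache[key]


# Demo with a stand-in for the real API call, so it runs offline.
api_calls = []

def fake_api(model, messages):
    api_calls.append(model)
    return f"answer to: {messages[-1]['content']}"

cache = {}
question = [{"role": "user", "content": "What is a REST API?"}]
first = cached_completion(fake_api, "llama-3.3-70b-versatile", question, cache)
second = cached_completion(fake_api, "llama-3.3-70b-versatile", question, cache)
print(first == second, len(api_calls))
```

In practice `call_api` would wrap `client.chat.completions.create(...)` and return the message content. Caching only makes sense for requests where a repeated answer is acceptable.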
Tool Calling and JSON Mode
Tool calling uses the standard openai format: a tools array with JSON Schema parameters and tool_choice="auto". The model returns a tool_calls array containing the function name and arguments. JSON mode uses response_format={"type": "json_object"} plus a system prompt that specifies the schema. Both work identically to OpenAI and DeepSeek patterns because the API is OpenAI-compatible. Llama 3.3 70B function calling reliability is about 88%, adequate for most use cases but lower than GPT-4.1's 97%.
Tool Calling
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ],
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Args: {call.function.arguments}")
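The model only names the function and supplies its arguments as a JSON string; running it is up to your code. The sketch below shows one way to dispatch a tool call and build the `tool` message to send back, following the OpenAI chat format. Here `get_weather` is a hypothetical stub returning placeholder data, and the fake call object only mimics the `id` / `function.name` / `function.arguments` shape so the example runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    # Hypothetical stub: a real implementation would query a weather service.
    return {"city": city, "unit": unit, "temperature": 18}


TOOLS = {"get_weather": get_weather}


def run_tool_call(call) -> dict:
    """Execute one tool call and return the matching `tool` message."""
    handler = TOOLS.get(call.function.name)
    if handler is None:
        result = {"error": f"unknown tool: {call.function.name}"}
    else:
        try:
            # Arguments arrive as a JSON string and may be malformed.
            args = json.loads(call.function.arguments)
            result = handler(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": str(exc)}
    return {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}


fake_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Paris"}'),
)
tool_message = run_tool_call(fake_call)
print(tool_message)
```

To get a final answer, append the assistant message that contained the tool call and this `tool` message to `messages`, then call the API again. Returning errors as tool output, rather than raising, lets the model recover from a bad call.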
JSON Mode
import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Return valid JSON with keys: name, capital, continent."},
        {"role": "user", "content": "Tell me about Brazil."}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
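JSON mode is about syntax: the keys still come from the prompt, so it is worth checking them before using the data. The validator below is a minimal sketch for the three keys requested above; `parse_country` is a hypothetical name, and the sample string stands in for `response.choices[0].message.content`.

```python
import json

REQUIRED_KEYS = {"name", "capital", "continent"}


def parse_country(raw: str) -> dict:
    """Parse model output and verify that the expected keys are present."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


sample_output = '{"name": "Brazil", "capital": "Brasília", "continent": "South America"}'
print(parse_country(sample_output)["capital"])
```

When validation fails, a common response is to retry the request once with the error message appended to the prompt.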
Using Groq Through TokenMix.ai
Routing through TokenMix.ai has four benefits: (1) automatic failover to other providers when Groq rate limits hit, which happens quickly on the free tier; (2) unified billing across Groq and 300+ other models; (3) higher effective rate limits via request pooling; (4) switching between Groq and slower but higher-quality providers without code changes. The same model: "llama-3.3-70b-versatile" parameter works whether you route directly or via TokenMix.ai.
TokenMix.ai provides access to Groq models alongside 300+ other models through a single endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="tmx-your-key",
    base_url="https://api.tokenmix.ai/v1"
)

# Access Groq's Llama through TokenMix.ai
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from TokenMix.ai"}]
)
Why route through TokenMix.ai:
- Automatic failover to other providers if Groq rate limits hit
- Unified billing with other model usage
- Higher effective rate limits through request pooling
- Switch between Groq and other providers without code changes
When to Upgrade From Free to Developer Tier
Four signals say it is time to upgrade: (1) you regularly hit the 500K tokens/day limit; (2) users wait because of the 30 RPM limit; (3) you are moving to production, where user-facing apps should not rely on the free tier; (4) you need guaranteed availability, since free tier requests are deprioritized at peak. The Developer tier offers 60-120 RPM, 60-120K TPM, and unlimited daily tokens on a pay-per-use basis. At 100M tokens per month the cost is $67 for Llama 3.3 70B, $20 for Gemma 2 9B, and $24.70 for Llama 4 Scout, competitive with other providers for the same models.
Developer Tier Benefits
| Feature | Free Tier | Developer Tier |
|---|---|---|
| Requests/min | 30 | 60-120 |
| Tokens/min | 6,000 | 60,000-120,000 |
| Tokens/day | 500,000 | Unlimited (pay-per-use) |
| Credit card required | No | Yes |
| Cost | $0 | Pay per token |
Upgrade When You Hit These Signals
- Hitting daily token limits regularly. If 500K tokens/day is not enough for your workload, you need the Developer tier.
- Users waiting due to rate limits. If 30 RPM causes visible delays for users, upgrade for 60-120 RPM.
- Moving to production. Any user-facing application should not rely on free tier limits.
- Need guaranteed availability. Free tier requests are deprioritized during peak usage.
Cost After Upgrading
| Monthly Volume | Llama 3.3 70B Cost | Gemma 2 9B Cost | Llama 4 Scout Cost |
|---|---|---|---|
| 10M tokens | $6.70 | $2.00 | $2.47 |
| 50M tokens | $33.50 | $10.00 | $12.35 |
| 100M tokens | $67.00 | $20.00 | $24.70 |
| 500M tokens | $335.00 | $100.00 | $123.50 |
Even on the paid tier, Groq's pricing is competitive with other providers for the same open models.
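Monthly cost depends on how your traffic splits between input and output tokens, which the table does not state. The estimator below is a sketch that uses the per-million prices from the quick reference table. The 60/40 input/output split is an assumption, chosen because it reproduces the Llama 3.3 70B and Gemma 2 9B columns; other columns may reflect a different mix. The names `PRICES` and `monthly_cost` are illustrative.

```python
# (input $/M tokens, output $/M tokens), from the quick reference table.
PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "gemma2-9b-it": (0.20, 0.20),
    "meta-llama/llama-4-scout-17b-16e-instruct": (0.11, 0.34),
}


def monthly_cost(model: str, total_tokens: int, input_share: float = 0.6) -> float:
    """Estimate cost in dollars for `total_tokens`, split by `input_share`.

    `input_share` is the assumed fraction of tokens that are input (prompt) tokens.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    input_price, output_price = PRICES[model]
    blended = input_price * input_share + output_price * (1.0 - input_share)
    return total_tokens / 1_000_000 * blended


for volume in (10_000_000, 100_000_000):
    cost = monthly_cost("llama-3.3-70b-versatile", volume)
    print(f"{volume:,} tokens: ${cost:.2f}")
```

Adjust `input_share` to match your own traffic; output-heavy workloads cost more on models where the output price is higher than the input price.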
Groq vs Other Providers: Speed and Cost
On Llama 3.3 70B, Groq delivers 200-350 tok/s at $67 per 100M tokens with the lowest TTFT (100-300ms). Together AI runs at 50-100 tok/s ($88) and Fireworks at 40-80 tok/s ($90). That makes Groq 3-5x faster than the next fastest provider for the same model, and cheaper. For latency-sensitive applications, this gap is significant. OpenAI's GPT-4.1 mini costs $88 per 100M tokens but runs at only 40-80 tok/s. Anthropic's Claude Haiku 3.5 runs at 30-60 tok/s for $208, about 3x the cost at roughly one-fifth the speed.
| Provider | Model | Output Speed (tok/s) | Cost per 100M tokens | Latency (TTFT) |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | 200-350 | $67 | 100-300ms |
| Together AI | Llama 3.3 70B | 50-100 | $88 | 300-600ms |
| Fireworks | Llama 3.3 70B | 40-80 | $90 | 300-700ms |
| OpenAI | GPT-4.1 mini | 40-80 | $88 | 200-500ms |
| Anthropic | Claude Haiku 3.5 | 30-60 | $208 | 300-800ms |
Groq is 3-5x faster than the next fastest provider for Llama models. For applications where latency directly impacts user experience, this gap is significant.
Should You Choose Groq?
Choose Groq for the fastest possible inference (3-10x faster than GPU providers), real-time chat where sub-second responses feel instant, prototyping with no budget (free tier, no card), high-volume open model workloads (fast and cost-competitive), and 500K+ context (Llama 4 Scout, 512K). Look elsewhere if you need GPT-4.1 or Claude, since Groq serves open models only, or frontier reasoning, since DeepSeek R1 Distill is good but not at o3 or Opus level. For multi-provider flexibility, route Groq and other providers through the TokenMix.ai unified endpoint.
| Situation | Choose Groq? | Reason |
|---|---|---|
| Need fastest possible inference | Yes | 3-10x faster than GPU providers |
| Real-time chat interface | Yes | Sub-second responses feel instant |
| Prototyping with no budget | Yes | Free tier, no credit card |
| Need GPT-4.1 or Claude | No | Groq only serves open models |
| Need frontier reasoning | No | DeepSeek R1 Distill is good but not o3/Opus level |
| High-volume open model workloads | Yes | Fast + cost-competitive |
| Need 500K+ context | Yes | Llama 4 Scout has 512K context |
| Want multi-provider flexibility | Use TokenMix.ai | Access Groq + others through one endpoint |
What's the Bottom Line on the Groq API?
Groq combines the fastest AI inference with the easiest onboarding: no card, setup in under five minutes, and tokens that arrive faster than reading speed. The free tier is generous enough to build complete apps, and the Developer tier is cost-competitive with other Llama providers while being 3-10x faster. The limitation is open models only, so pair Groq with another provider for GPT-4.1, Claude, or Gemini access. The TokenMix.ai unified endpoint can route latency-sensitive queries to Groq and complex reasoning to other providers automatically.
Groq offers the fastest AI inference available and the easiest onboarding in the industry. No credit card, sub-five-minute setup, and tokens that arrive faster than you can read them.
For development and prototyping, the free tier is generous enough to build complete applications. For production, the Developer tier is cost-competitive with other Llama hosting providers while delivering 3-10x better speed.
The limitation is model selection: Groq serves open models only. For applications that need GPT-4.1, Claude, or Gemini, pair Groq with another provider. TokenMix.ai makes this easy by routing all requests through a single endpoint -- send latency-sensitive queries to Groq and complex reasoning queries to other providers automatically.
Start now. The free tier is waiting.
FAQ
How do I get started with Groq API?
Go to console.groq.com, sign up (no credit card needed), create an API key, and make your first call. The entire process takes under five minutes. Use the openai Python or Node.js SDK with base_url="https://api.groq.com/openai/v1". Groq's free tier gives you 30 requests/minute and 6,000 tokens/minute on most models.
Is Groq API free?
Yes, Groq offers a free tier with no credit card required. Free limits include 30 requests/minute, 6,000 tokens/minute (15,000 for Gemma), and 500,000 tokens/day per model. This is sufficient for development and light production use. Paid Developer tier is available when you need higher limits.
Why is Groq so fast?
Groq uses custom LPU (Language Processing Unit) hardware designed specifically for sequential token generation. Unlike GPUs that are general-purpose processors adapted for AI, LPUs are purpose-built for inference. This hardware advantage delivers 200-500 tokens/second on large models -- 3-10x faster than any GPU-based provider.
What models are available on Groq?
Groq serves open-weight models: Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Mistral Saba, DeepSeek R1 Distill, and Gemma 2 9B. Groq does not offer proprietary models like GPT, Claude, or Gemini. For access to all models including Groq, use TokenMix.ai as a unified gateway.
How do I handle Groq rate limits?
Monitor response headers: x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. Implement exponential backoff on 429 errors. On the free tier, space requests 2 seconds apart to stay within 30 RPM. For production, upgrade to the Developer tier for 60-120 RPM and 60K-120K TPM.
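Exponential backoff can be sketched as a small wrapper around any call. In the version below, the name `call_with_backoff`, the delay values, and the jitter are illustrative choices, and the `is_rate_limit` predicate is supplied by the caller; with the openai Python SDK it could be `lambda e: isinstance(e, openai.RateLimitError)`. The SDK also has its own `max_retries` setting, so treat this as a pattern rather than a required layer.

```python
import random
import time


def call_with_backoff(fn, is_rate_limit, max_retries=5, base_delay=1.0,
                      max_delay=30.0, sleep=time.sleep):
    """Call `fn()`, retrying with exponentially growing delays on rate-limit errors."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, or when retries are exhausted.
            if not is_rate_limit(exc) or attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            # Up to 10% jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Offline demo: fail twice with a stand-in error, then succeed.
class FakeRateLimitError(Exception):
    pass

attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise FakeRateLimitError("429")
    return "ok"

recorded_sleeps = []
outcome = call_with_backoff(
    flaky_call,
    is_rate_limit=lambda e: isinstance(e, FakeRateLimitError),
    sleep=recorded_sleeps.append,  # record delays instead of actually sleeping
)
print(outcome, len(recorded_sleeps))
```

Injecting `sleep` keeps the wrapper testable without real delays; in production, leave it at the default `time.sleep`.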
Can I use Groq with the OpenAI SDK?
Yes. Groq implements the OpenAI chat completions API format. Use the standard openai Python package or npm package with base_url="https://api.groq.com/openai/v1" (Python) or baseURL (Node.js). Your existing OpenAI-compatible code works with only the base URL, API key, and model name changed.
Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: Groq API Documentation, Groq Pricing, GroqCloud Console + TokenMix.ai