TokenMix Research Lab · 2026-04-12

Groq API Tutorial: Getting Started With the Fastest AI Inference in 2026
Groq is the fastest AI inference provider available today. Its custom LPU (Language Processing Unit) hardware delivers 200-350 tokens per second on Llama 3.3 70B -- and up to 500+ tokens per second on smaller models -- making it 3-10x faster than GPU-based providers. The free tier requires no credit card, and you can make your first API call in under five minutes. This Groq tutorial covers everything: account setup, model selection, Python and Node.js examples, streaming, rate limit handling, and when to upgrade from the free tier to the Developer plan. Speed and pricing data verified by TokenMix.ai in April 2026.
Table of Contents
- [Quick Reference: Groq API Models and Pricing]
- [Why Groq Is Different: LPU vs GPU]
- [Getting Started: Free Account Setup]
- [Your First Groq API Call in Python]
- [Your First Groq API Call in Node.js]
- [Model Selection Guide]
- [Streaming: Real-Time Token Delivery]
- [Rate Limit Handling and Free Tier Optimization]
- [Tool Calling and JSON Mode]
- [Using Groq Through TokenMix.ai]
- [When to Upgrade From Free to Developer Tier]
- [Groq vs Other Providers: Speed and Cost]
- [Decision Guide: When to Choose Groq]
- [Conclusion]
- [FAQ]
Quick Reference: Groq API Models and Pricing
| Model | Model ID | Input $/M | Output $/M | Speed (tok/s) | Context | Free Tier |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | llama-3.3-70b-versatile | $0.59 | $0.79 | 200-350 | 128K | 6K TPM |
| Llama 4 Scout | meta-llama/llama-4-scout-17b-16e-instruct | $0.11 | $0.34 | 300-500 | 512K | 6K TPM |
| Llama 4 Maverick | meta-llama/llama-4-maverick-17b-128e-instruct | $0.50 | $0.77 | 150-250 | 256K | 6K TPM |
| Mistral Saba | mistral-saba-24b | $0.20 | $0.60 | 250-400 | 32K | 6K TPM |
| DeepSeek R1 Distill | deepseek-r1-distill-llama-70b | $0.75 | $0.99 | 150-250 | 128K | 6K TPM |
| Gemma 2 9B | gemma2-9b-it | $0.20 | $0.20 | 400-600 | 8K | 15K TPM |
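To budget from the table above, per-request cost follows directly from the per-million-token prices. A small illustrative helper (the `PRICES` dict and `estimate_cost` function are our own, using the April 2026 figures listed):

```python
# Estimate per-request cost from the per-million-token prices above.
PRICES = {  # model_id: (input $/M tokens, output $/M tokens)
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "meta-llama/llama-4-scout-17b-16e-instruct": (0.11, 0.34),
    "gemma2-9b-it": (0.20, 0.20),
}

def estimate_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request."""
    in_price, out_price = PRICES[model_id]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# A 1,000-token prompt with a 500-token answer on Llama 3.3 70B:
print(f"${estimate_cost('llama-3.3-70b-versatile', 1000, 500):.6f}")  # → $0.000985
```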
Why Groq Is Different: LPU vs GPU
Every other AI inference provider runs models on NVIDIA GPUs. Groq built its own hardware -- the LPU (Language Processing Unit) -- designed specifically for sequential token generation.
The result: Groq serves Llama 3.3 70B at 200-350 tokens/second output speed. The same model on cloud GPU infrastructure typically runs at 30-80 tokens/second. That is a 3-10x speed advantage.
What this means in practice:
- A 500-token response takes 1-2 seconds on Groq vs 6-16 seconds on GPU providers
- Streaming feels instant -- tokens appear faster than most users can read
- Development iteration is faster -- you spend less time waiting for responses during testing
The trade-off: Groq currently runs a limited set of open models (primarily Llama, Mistral, and Gemma). You cannot use GPT-4.1, Claude, or Gemini on Groq. For applications where speed matters more than model-specific capabilities, this trade-off is worth it.
TokenMix.ai latency benchmarks confirm Groq consistently delivers the lowest time-to-first-token and fastest throughput of any API provider.
Getting Started: Free Account Setup
Groq's signup process is the fastest in the industry. No credit card, no waitlist, no approval process.
Step 1: Create an Account
Go to console.groq.com. Click "Sign Up." You can register with Google, GitHub, or email. The process takes under 60 seconds.
Step 2: Get Your API Key
After login, you land on the "API Keys" page. Click "Create API Key." Give it a name (e.g., "my-project"). Copy the key. It starts with gsk_.
Step 3: Test With curl
curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'
Note the endpoint: https://api.groq.com/openai/v1. Groq implements the OpenAI API format, so the path includes /openai/v1.
What You Get on the Free Tier
- No credit card required
- Access to all models
- Rate limits: 6,000 tokens per minute (TPM), 30 requests per minute (RPM) per model
- Daily token limits: 500K tokens/day for most models
- Sufficient for development, prototyping, and light production
Your First Groq API Call in Python
Installation
pip install openai
Groq is OpenAI-compatible. You use the standard openai SDK.
Basic Chat Completion
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Answer in 2-3 sentences."},
        {"role": "user", "content": "What is a REST API?"}
    ]
)

print(response.choices[0].message.content)
# Response arrives in <1 second for short outputs
Check Response Speed
import time

start = time.time()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 200-word explanation of machine learning."}]
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
speed = tokens / elapsed
print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.0f} tok/s)")
# Typical output: Generated 250 tokens in 0.9s (278 tok/s)
Using Environment Variables
Set the key in your shell first (note the gsk_ prefix):
export GROQ_API_KEY="gsk_your-key-here"
Then read it in Python:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)
Your First Groq API Call in Node.js
Installation
npm install openai
Basic Chat Completion
import OpenAI from "openai";
const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a REST API?" },
  ],
});
console.log(response.choices[0].message.content);
Error Handling
try {
  const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.RateLimitError) {
    console.error("Rate limited. Free tier: 30 RPM, 6K TPM.");
  } else if (error instanceof OpenAI.AuthenticationError) {
    console.error("Invalid API key.");
  } else {
    throw error;
  }
}
Model Selection Guide
| Task | Best Groq Model | Why |
|---|---|---|
| General chat, Q&A | Llama 3.3 70B | Best overall quality on Groq |
| Long documents (100K+ tokens) | Llama 4 Scout | 512K context window |
| Fast, simple tasks | Gemma 2 9B | Fastest speed, lowest cost |
| Reasoning and math | DeepSeek R1 Distill 70B | Best reasoning on Groq |
| Multilingual tasks | Mistral Saba | Strong multilingual performance |
| Code generation | Llama 3.3 70B | Best code quality on Groq |
| Maximum context + quality | Llama 4 Maverick | 256K context, strong performance |
Default recommendation: Start with Llama 3.3 70B. It handles most tasks well at a good price. Switch to specialized models only when specific requirements demand it.
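The table above can be encoded as a tiny routing helper. This is a sketch, not an official Groq API: the task names and the fallback choice are our own, and the model IDs come from the quick-reference table.

```python
# Map task categories from the guide above to Groq model IDs.
TASK_TO_MODEL = {
    "chat": "llama-3.3-70b-versatile",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
    "simple": "gemma2-9b-it",
    "reasoning": "deepseek-r1-distill-llama-70b",
    "multilingual": "mistral-saba-24b",
    "code": "llama-3.3-70b-versatile",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to Llama 3.3 70B, per the default recommendation.
    return TASK_TO_MODEL.get(task, "llama-3.3-70b-versatile")

print(pick_model("reasoning"))  # deepseek-r1-distill-llama-70b
print(pick_model("unknown"))    # llama-3.3-70b-versatile
```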
Streaming: Real-Time Token Delivery
Groq's speed makes streaming feel qualitatively different from other providers. Tokens arrive so fast that streaming approaches the feel of pre-generated text.
Python Streaming
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain Docker containers in detail."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Node.js Streaming
const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Explain Docker containers in detail." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
Rate Limit Handling and Free Tier Optimization
Free Tier Limits
| Model | Requests/Min | Tokens/Min | Tokens/Day |
|---|---|---|---|
| Llama 3.3 70B | 30 | 6,000 | 500,000 |
| Llama 4 Scout | 30 | 6,000 | 500,000 |
| Gemma 2 9B | 30 | 15,000 | 500,000 |
| DeepSeek R1 Distill | 30 | 6,000 | 500,000 |
Rate Limit Response Headers
Groq returns detailed rate limit information in every response:
x-ratelimit-limit-requests: 30
x-ratelimit-remaining-requests: 28
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-tokens: 4500
x-ratelimit-reset-requests: 2.1s
x-ratelimit-reset-tokens: 45.5s
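These headers can drive client-side pacing. The helper below parses the reset values (which Groq returns as durations like `2.1s`) and decides how long to wait before the next request. This is a sketch under the assumption that reset values use the simple `<float>s` form shown above; `should_wait` and `seconds_until_reset` are our own names.

```python
def seconds_until_reset(value: str) -> float:
    """Parse a reset header like '2.1s' into seconds.
    Assumes the simple '<float>s' form shown in the example above."""
    return float(value.rstrip("s"))

def should_wait(headers: dict) -> float:
    """Return how long to sleep before the next request, 0.0 if no wait is needed."""
    remaining = int(headers.get("x-ratelimit-remaining-requests", "1"))
    if remaining > 0:
        return 0.0
    return seconds_until_reset(headers.get("x-ratelimit-reset-requests", "0s"))

# With the openai SDK, headers come from a raw response (untested sketch):
# raw = client.chat.completions.with_raw_response.create(model=..., messages=...)
# delay = should_wait(raw.headers)
print(should_wait({"x-ratelimit-remaining-requests": "0",
                   "x-ratelimit-reset-requests": "2.1s"}))  # 2.1
```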
Free Tier Optimization Strategies
1. Use shorter prompts. The 6K TPM limit means long prompts eat your budget. Keep system prompts under 200 tokens.
2. Use Gemma 2 9B for simple tasks. It has a higher TPM limit (15K) and processes faster. Use Llama 70B only when you need the quality.
3. Implement request queuing. Space requests 2 seconds apart to stay within 30 RPM. Use a simple queue:
import time
from collections import deque

class RateLimiter:
    def __init__(self, rpm=30):
        self.requests = deque()
        self.rpm = rpm

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps older than the 60-second window
        while self.requests and self.requests[0] < now - 60:
            self.requests.popleft()
        if len(self.requests) >= self.rpm:
            wait_time = 60 - (now - self.requests[0])
            time.sleep(wait_time)
        self.requests.append(time.time())

limiter = RateLimiter(rpm=28)  # Stay slightly under the limit

for prompt in prompts:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )
4. Cache responses locally. For repeated identical queries, cache the response to avoid hitting the API again.
Tool Calling and JSON Mode
Tool Calling
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ],
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Args: {call.function.arguments}")
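After the model returns a tool call, you execute the function yourself and send the result back in a `tool` message. A sketch of the dispatch step -- the `get_weather` implementation here is hypothetical (a real one would query a weather service):

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Hypothetical implementation -- you would call a real weather API here.
    return {"city": city, "temp": 18, "unit": unit}

TOOLS = {"get_weather": get_weather}

def run_tool_call(call) -> str:
    """Execute one tool call and return a JSON string for the tool message."""
    args = json.loads(call.function.arguments)
    result = TOOLS[call.function.name](**args)
    return json.dumps(result)

# Then append the result and let the model finish (untested sketch):
# messages.append(response.choices[0].message)
# messages.append({"role": "tool", "tool_call_id": call.id,
#                  "content": run_tool_call(call)})
# final = client.chat.completions.create(model="llama-3.3-70b-versatile",
#                                        messages=messages)
```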
JSON Mode
import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Return valid JSON with keys: name, capital, continent."},
        {"role": "user", "content": "Tell me about Brazil."}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
Using Groq Through TokenMix.ai
TokenMix.ai provides access to Groq models alongside 300+ other models through a single endpoint.
from openai import OpenAI

client = OpenAI(
    api_key="tmx-your-key",
    base_url="https://api.tokenmix.ai/v1"
)

# Access Groq's Llama through TokenMix.ai
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from TokenMix.ai"}]
)
Why route through TokenMix.ai:
- Automatic failover to other providers if Groq rate limits hit
- Unified billing with other model usage
- Higher effective rate limits through request pooling
- Switch between Groq and other providers without code changes
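Failover can also be approximated client-side: try Groq first and retry on the TokenMix.ai endpoint when rate-limited. A sketch under the assumption that both clients expose the same OpenAI-compatible interface; `RateLimited` is a stand-in exception class, and with the real SDK you would pass `openai.RateLimitError` instead.

```python
class RateLimited(Exception):
    """Stand-in for openai.RateLimitError; pass that class with the real SDK."""

def complete_with_failover(primary, fallback, retry_on=RateLimited, **kwargs):
    """Try the primary client; on a rate-limit error, retry on the fallback.
    Both clients must expose OpenAI-compatible .chat.completions.create()."""
    try:
        return primary.chat.completions.create(**kwargs)
    except retry_on:
        return fallback.chat.completions.create(**kwargs)

# Usage with real clients (untested sketch):
# groq = OpenAI(api_key=..., base_url="https://api.groq.com/openai/v1")
# tmx = OpenAI(api_key=..., base_url="https://api.tokenmix.ai/v1")
# response = complete_with_failover(
#     groq, tmx, retry_on=openai.RateLimitError,
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "Hi"}],
# )
```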
When to Upgrade From Free to Developer Tier
Developer Tier Benefits
| Feature | Free Tier | Developer Tier |
|---|---|---|
| Requests/min | 30 | 60-120 |
| Tokens/min | 6,000 | 60,000-120,000 |
| Tokens/day | 500,000 | Unlimited (pay-per-use) |
| Credit card required | No | Yes |
| Cost | $0 | Pay per token |
Upgrade When You Hit These Signals
- Hitting daily token limits regularly. If 500K tokens/day is not enough for your workload, you need the Developer tier.
- Users waiting due to rate limits. If 30 RPM causes visible delays for users, upgrade for 60-120 RPM.
- Moving to production. Any user-facing application should not rely on free tier limits.
- Need guaranteed availability. Free tier requests are deprioritized during peak usage.
Cost After Upgrading
Estimates below assume a mixed input/output token split.
| Monthly Volume | Llama 3.3 70B Cost | Gemma 2 9B Cost | Llama 4 Scout Cost |
|---|---|---|---|
| 10M tokens | $6.70 | $2.00 | $2.47 |
| 50M tokens | $33.50 |