TokenMix Research Lab · 2026-04-12

Groq API Tutorial 2026: Free, 315 TPS, First Call in 3 Minutes

Groq API Tutorial: Getting Started With the Fastest AI Inference in 2026

Last Updated: 2026-04-29
Author: TokenMix Research Lab

Groq's custom LPU hardware serves Llama 3.3 70B at 200-350 tok/sec, 3-10x faster than the 30-80 tok/sec typical of GPU providers. A 500-token response takes roughly 1.5-2.5 sec on Groq vs 6-17 sec on GPU. Free tier, no credit card: 30 RPM + 6K TPM + 500K tokens/day, sufficient for development and light production. OpenAI-compatible API (use the openai SDK with base_url="https://api.groq.com/openai/v1"). Trade-off: open models only (Llama/Mistral/Gemma); no GPT/Claude/Gemini.

Groq is the fastest AI inference provider available today. Its custom LPU (Language Processing Unit) hardware delivers 200-350 tokens per second on Llama 3.3 70B, 3-10x faster than typical GPU-based providers. The free tier requires no credit card, and you can make your first API call in under five minutes. This tutorial covers account setup, model selection, Python and Node.js examples, streaming, rate limit handling, and when to upgrade from the free tier to the Developer plan. Speed and pricing data verified by TokenMix.ai in April 2026.

Quick Reference: Groq API Models and Pricing

6 models on Groq. Llama 3.3 70B: $0.59/$0.79, 200-350 tok/s, 128K context (default recommendation). Llama 4 Scout: $0.11/$0.34, 300-500 tok/s, 512K context (largest). Llama 4 Maverick: $0.50/$0.77, 256K context. Mistral Saba: $0.20/$0.60, multilingual. DeepSeek R1 Distill: $0.75/$0.99, reasoning. Gemma 2 9B: $0.20/$0.20 (fastest 400-600 tok/s, cheapest, 8K context). All free tier 6K TPM (Gemma 15K).

| Model | Model ID | Input $/M | Output $/M | Speed (tok/s) | Context | Free Tier |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | llama-3.3-70b-versatile | $0.59 | $0.79 | 200-350 | 128K | 6K TPM |
| Llama 4 Scout | meta-llama/llama-4-scout-17b-16e-instruct | $0.11 | $0.34 | 300-500 | 512K | 6K TPM |
| Llama 4 Maverick | meta-llama/llama-4-maverick-17b-128e-instruct | $0.50 | $0.77 | 150-250 | 256K | 6K TPM |
| Mistral Saba | mistral-saba-24b | $0.20 | $0.60 | 250-400 | 32K | 6K TPM |
| DeepSeek R1 Distill | deepseek-r1-distill-llama-70b | $0.75 | $0.99 | 150-250 | 128K | 6K TPM |
| Gemma 2 9B | gemma2-9b-it | $0.20 | $0.20 | 400-600 | 8K | 15K TPM |

Why Groq Is Different: LPU vs GPU

Custom LPU hardware purpose-built for sequential token generation vs general-purpose GPUs adapted for AI. Llama 3.3 70B speed: Groq 200-350 tok/s vs GPU 30-80 tok/s = 3-10x faster. 500-token response: roughly 1.5-2.5 sec on Groq vs 6-17 sec on GPU. Streaming feels instant: tokens appear faster than users can read. Trade-off: open models only (Llama/Mistral/Gemma). For applications where speed matters more than model-specific capabilities (GPT-4.1, Claude), the trade is worth it.

Most AI inference providers run models on GPUs, typically from NVIDIA. Groq built its own hardware, the LPU (Language Processing Unit), designed specifically for sequential token generation.

The result: Groq serves Llama 3.3 70B at 200-350 tokens/second output speed. The same model on cloud GPU infrastructure typically runs at 30-80 tokens/second. That is a 3-10x speed advantage.

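What this means in practice: a 500-token response finishes in roughly 1.5-2.5 seconds on Groq versus 6-17 seconds on a typical GPU provider, and streamed tokens appear faster than users can read.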
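As a quick check on the speed figures, generation time is roughly the number of output tokens divided by throughput. The sketch below is illustrative only: the helper name `generation_seconds` is hypothetical, and it ignores time-to-first-token and network latency.

```python
def generation_seconds(tokens: int, tok_per_sec: float) -> float:
    """Approximate generation time, ignoring network latency and time-to-first-token."""
    if tok_per_sec <= 0:
        raise ValueError("tok_per_sec must be positive")
    return tokens / tok_per_sec


# Throughput ranges quoted in this article for Llama 3.3 70B
scenarios = [
    ("Groq, slow end", 200),
    ("Groq, fast end", 350),
    ("GPU, slow end", 30),
    ("GPU, fast end", 80),
]

for label, speed in scenarios:
    print(f"{label}: {generation_seconds(500, speed):.1f}s for 500 tokens")
```

At the quoted ranges this works out to about 1.4-2.5 seconds on Groq and about 6-17 seconds on a GPU provider.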

The trade-off: Groq currently runs a limited set of open models (primarily Llama, Mistral, and Gemma). You cannot use GPT-4.1, Claude, or Gemini on Groq. For applications where speed matters more than model-specific capabilities, this trade-off is worth it.

TokenMix.ai latency benchmarks confirm Groq consistently delivers the lowest time-to-first-token and fastest throughput of any API provider.


Getting Started: Free Account Setup

Fastest signup in industry. Three steps under 5 min: (1) console.groq.com sign-up via Google/GitHub/email (no credit card, no waitlist, no approval). (2) Generate API key in dashboard (starts with gsk_). (3) Test with curl on https://api.groq.com/openai/v1/chat/completions. Free tier: 30 RPM + 6K TPM + 500K tokens/day = enough for development, prototyping, light production. All models accessible.

Groq's signup process is the fastest in the industry. No credit card, no waitlist, no approval process.

Step 1: Create an Account

Go to console.groq.com. Click "Sign Up." You can register with Google, GitHub, or email. The process takes under 60 seconds.

Step 2: Get Your API Key

After login, you land on the "API Keys" page. Click "Create API Key." Give it a name (e.g., "my-project"). Copy the key. It starts with gsk_.

Step 3: Test With curl

curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'

Note the endpoint: https://api.groq.com/openai/v1. Groq implements the OpenAI API format, so the path includes /openai/v1.
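For readers who want to see the same request outside curl, here is a minimal Python sketch using only the standard library. It builds the request object but does not send it. The helper name `build_chat_request` is hypothetical; the endpoint and model ID are the ones shown in the curl example.

```python
import json
import urllib.request

GROQ_CHAT_URL = "https://api.groq.com/openai/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str,
                       model: str = "llama-3.3-70b-versatile") -> urllib.request.Request:
    """Build (but do not send) a chat completion request equivalent to the curl call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        GROQ_CHAT_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("YOUR_GROQ_API_KEY", "Hello, what can you do?")
print(req.get_method(), req.full_url)
```

To send it, you would pass `req` to `urllib.request.urlopen` and decode the JSON body. If the server rejects a bare standard-library request, the SDK approach in the following sections is the more dependable route.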

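What You Get on the Free Tier

The free tier allows 30 requests per minute, 6,000 tokens per minute (15,000 for Gemma 2 9B), and 500,000 tokens per day per model, with access to all models. That is enough for development, prototyping, and light production use.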


Your First Groq API Call in Python

Use the standard openai package; no separate Groq SDK is required. pip install openai + OpenAI(api_key=..., base_url="https://api.groq.com/openai/v1") + model="llama-3.3-70b-versatile". Verify speed: a typical 200-token response arrives in 0.7-1.0 sec, i.e. 200-300 tok/s. Use os.getenv("GROQ_API_KEY") for portable code. Speed feels qualitatively different from cloud GPU providers, which shortens development iteration and means less waiting during testing.

Installation

pip install openai

Groq is OpenAI-compatible. You use the standard openai SDK.

Basic Chat Completion

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Answer in 2-3 sentences."},
        {"role": "user", "content": "What is a REST API?"}
    ]
)

print(response.choices[0].message.content)
# Response arrives in <1 second for short outputs

Check Response Speed

import time

# Reuses the `client` created in the previous example
start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 200-word explanation of machine learning."}]
)
elapsed = time.perf_counter() - start

# Elapsed time includes network latency, so this slightly understates raw generation speed
tokens = response.usage.completion_tokens
speed = tokens / elapsed

print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.0f} tok/s)")
# Typical output: Generated 250 tokens in 0.90s (278 tok/s)

Using Environment Variables

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

Set the variable in your shell before running the script:

export GROQ_API_KEY="gsk_your-key-here"
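Because `os.getenv` returns `None` when the variable is missing, a forgotten export tends to surface later as a confusing authentication error. One possible guard is sketched below; the helper name `load_groq_key` is hypothetical, and the `gsk_` prefix check relies on the key format described in Step 2.

```python
import os


def load_groq_key(env_var: str = "GROQ_API_KEY") -> str:
    """Read the API key from the environment and fail early with a clear message."""
    key = os.getenv(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running this script")
    if not key.startswith("gsk_"):
        # The article notes that Groq keys start with gsk_; anything else is likely a mistake
        raise RuntimeError(f"{env_var} does not look like a Groq key (expected a gsk_ prefix)")
    return key
```

You could then pass `api_key=load_groq_key()` when constructing the client.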

Your First Groq API Call in Node.js

Same pattern as Python: npm install openai + new OpenAI({apiKey, baseURL: "https://api.groq.com/openai/v1"}) + model: "llama-3.3-70b-versatile". TypeScript types come from the openai package. Error handling: instanceof OpenAI.RateLimitError (free tier 30 RPM/6K TPM), OpenAI.AuthenticationError. Edge runtime compatible (Cloudflare Workers, Vercel Edge). Streaming responses via async iterators feel like real-time text generation due to LPU speed.

Installation

npm install openai

Basic Chat Completion

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a REST API?" },
  ],
});

console.log(response.choices[0].message.content);

Error Handling

try {
  const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.RateLimitError) {
    console.error("Rate limited. Free tier: 30 RPM, 6K TPM.");
  } else if (error instanceof OpenAI.AuthenticationError) {
    console.error("Invalid API key.");
  } else {
    throw error;
  }
}

Model Selection Guide

Default: Llama 3.3 70B for everything (best overall quality on Groq, handles most tasks). Switch only when specific requirements demand: long docs 100K+ tokens → Llama 4 Scout (512K context). Fast simple tasks → Gemma 2 9B (fastest 400-600 tok/s, lowest cost). Reasoning/math → DeepSeek R1 Distill 70B. Multilingual → Mistral Saba. Maximum context + quality → Llama 4 Maverick (256K). Most apps stay on Llama 3.3 70B.

| Task | Best Groq Model | Why |
|---|---|---|
| General chat, Q&A | Llama 3.3 70B | Best overall quality on Groq |
| Long documents (100K+ tokens) | Llama 4 Scout | 512K context window |
| Fast, simple tasks | Gemma 2 9B | Fastest speed, lowest cost |
| Reasoning and math | DeepSeek R1 Distill 70B | Best reasoning on Groq |
| Multilingual tasks | Mistral Saba | Strong multilingual performance |
| Code generation | Llama 3.3 70B | Best code quality on Groq |
| Maximum context + quality | Llama 4 Maverick | 256K context, strong performance |

Default recommendation: Start with Llama 3.3 70B. It handles most tasks well at a good price. Switch to specialized models only when specific requirements demand it.
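The selection guide can be expressed as a small lookup, which keeps model IDs in one place instead of scattering them through application code. This is a sketch: the task labels and the `pick_model` helper are hypothetical, while the model IDs are copied from the quick reference table.

```python
# Model IDs copied from the quick reference table in this article
MODEL_BY_TASK = {
    "general": "llama-3.3-70b-versatile",
    "code": "llama-3.3-70b-versatile",
    "long_document": "meta-llama/llama-4-scout-17b-16e-instruct",
    "fast_simple": "gemma2-9b-it",
    "reasoning": "deepseek-r1-distill-llama-70b",
    "multilingual": "mistral-saba-24b",
    "long_context_quality": "meta-llama/llama-4-maverick-17b-128e-instruct",
}

DEFAULT_MODEL = "llama-3.3-70b-versatile"


def pick_model(task: str) -> str:
    """Return the suggested model ID for a task label, falling back to the default."""
    return MODEL_BY_TASK.get(task, DEFAULT_MODEL)


print(pick_model("reasoning"))
print(pick_model("something-else"))
```

Unknown labels fall back to Llama 3.3 70B, matching the default recommendation.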


Streaming: Real-Time Token Delivery

Same OpenAI streaming patterns work — Python stream=True + iterate, Node.js async iterators. Difference: Groq's 200-500 tok/s makes streaming feel like pre-generated text rendering. Tokens arrive faster than users can read. UX qualitatively different from 30-80 tok/s GPU providers where streaming visibly types out word by word. Best for chat interfaces, autocomplete, real-time assistants where speed perception drives engagement.

Groq's speed makes streaming feel qualitatively different from other providers. Tokens arrive so fast that streaming approaches the feel of pre-generated text.

Python Streaming

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain Docker containers in detail."}],
    stream=True
)

for chunk in stream:
    # Guard against chunks that arrive with an empty choices list before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
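If you need the full text after streaming (for logging or storage), the deltas can be accumulated as they arrive. The sketch below assumes chunks shaped like the ones in the loop above; `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks exist only so the example runs without a network call.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Join the text deltas from a chat completion stream into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices or no text
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in shaped like an SDK chunk, used only to demonstrate the helper offline."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [
    fake_chunk("Docker "),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk("containers"),
]
print(collect_stream(demo))  # prints "Docker containers"
```

With a real stream you would pass the object returned by `client.chat.completions.create(..., stream=True)` instead of `demo`.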

Node.js Streaming

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Explain Docker containers in detail." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}

Rate Limit Handling and Free Tier Optimization

Free tier per model: 30 RPM / 6K TPM (Gemma 15K) / 500K tokens/day. Headers expose granular state: x-ratelimit-remaining-requests, x-ratelimit-remaining-tokens, x-ratelimit-reset-tokens (seconds until reset). 4 optimization strategies: (1) Shorter prompts (<200 token system prompt). (2) Gemma 2 9B for simple tasks (15K TPM ceiling). (3) Request queuing 2 sec apart. (4) Local response cache for repeated queries.

Free Tier Limits

| Model | Requests/Min | Tokens/Min | Tokens/Day |
|---|---|---|---|
| Llama 3.3 70B | 30 | 6,000 | 500,000 |
| Llama 4 Scout | 30 | 6,000 | 500,000 |
| Gemma 2 9B | 30 | 15,000 | 500,000 |
| DeepSeek R1 Distill | 30 | 6,000 | 500,000 |

Rate Limit Response Headers

Groq returns detailed rate limit information in every response:

x-ratelimit-limit-requests: 30
x-ratelimit-remaining-requests: 28
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-tokens: 4500
x-ratelimit-reset-requests: 2.1s
x-ratelimit-reset-tokens: 45.5s
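One way to use these headers is to pause only when a budget is actually exhausted. The sketch below works on a plain dictionary of header values; `parse_duration` and `seconds_to_wait` are hypothetical helpers, and support for a minutes component such as `1m30s` is an assumption, since the example above only shows values in seconds. With the openai Python SDK, raw headers are typically reachable through `client.chat.completions.with_raw_response.create(...)`.

```python
import re


def parse_duration(value: str) -> float:
    """Convert a duration such as '2.1s' or '1m30s' to seconds.

    The minutes form is an assumption; the header example only shows seconds.
    """
    text = value.strip()
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text)
    if not text or not match:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes = int(match.group(1) or 0)
    seconds = float(match.group(2) or 0)
    return minutes * 60 + seconds


def seconds_to_wait(headers: dict) -> float:
    """Return how long to pause before the next call, or 0.0 if budget remains."""
    waits = []
    if int(headers.get("x-ratelimit-remaining-requests", 1)) <= 0:
        waits.append(parse_duration(headers["x-ratelimit-reset-requests"]))
    if int(headers.get("x-ratelimit-remaining-tokens", 1)) <= 0:
        waits.append(parse_duration(headers["x-ratelimit-reset-tokens"]))
    return max(waits, default=0.0)


# Header values from the example above: budget remains, so no wait is needed
example = {
    "x-ratelimit-remaining-requests": "28",
    "x-ratelimit-remaining-tokens": "4500",
    "x-ratelimit-reset-requests": "2.1s",
    "x-ratelimit-reset-tokens": "45.5s",
}
print(seconds_to_wait(example))
```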

Free Tier Optimization Strategies

1. Use shorter prompts. The 6K TPM limit means long prompts eat your budget. Keep system prompts under 200 tokens.

2. Use Gemma 2 9B for simple tasks. It has a higher TPM limit (15K) and processes faster. Use Llama 70B only when you need the quality.

3. Implement request queuing. Space requests 2 seconds apart to stay within 30 RPM. Use a simple queue:

import time
from collections import deque

class RateLimiter:
    def __init__(self, rpm=30):
        self.requests = deque()
        self.rpm = rpm
    
    def wait_if_needed(self):
        now = time.time()
        while self.requests and self.requests[0] < now - 60:
            self.requests.popleft()
        if len(self.requests) >= self.rpm:
            wait_time = 60 - (now - self.requests[0])
            time.sleep(wait_time)
        self.requests.append(time.time())

limiter = RateLimiter(rpm=28)  # Stay slightly under limit

# Assumes `client` from the earlier examples and `prompts`, a list of prompt strings
for prompt in prompts:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

4. Cache responses locally. For repeated identical queries, cache the response to avoid hitting the API again.
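A local cache for strategy 4 can be as simple as a dictionary keyed by a hash of the model and messages. The sketch below is one possible shape, not a prescribed implementation: `ResponseCache` and `fake_api` are hypothetical names, the cache lives in memory only, and the fake API function stands in for a real call so the example runs offline.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache keyed by a hash of the model name and message list."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model: str, messages: list) -> str:
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, messages: list, call):
        """Return the cached result, or invoke call(model, messages) once and store it."""
        key = self._key(model, messages)
        if key not in self._store:
            self._store[key] = call(model, messages)
        return self._store[key]


calls = []


def fake_api(model, messages):
    """Stand-in for a real API call; records each invocation."""
    calls.append(model)
    return f"answer to: {messages[-1]['content']}"


cache = ResponseCache()
msgs = [{"role": "user", "content": "What is a REST API?"}]
first = cache.get_or_call("llama-3.3-70b-versatile", msgs, fake_api)
second = cache.get_or_call("llama-3.3-70b-versatile", msgs, fake_api)
print(len(calls))  # prints 1: the second lookup is served from the cache
```

In real use, `call` would wrap `client.chat.completions.create` and return the message content. Caching suits queries where a repeated identical answer is acceptable.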


Tool Calling and JSON Mode

Tool calling: standard openai format with tools array + JSON Schema parameters + tool_choice="auto". Model returns tool_calls array with function name + arguments. JSON mode: response_format={"type": "json_object"} + system prompt specifying schema. Both work identically to OpenAI/DeepSeek patterns thanks to OpenAI-compatible API. Llama 3.3 70B function calling reliability ~88% — adequate for most use cases but lower than GPT-4.1's 97%.

Tool Calling

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ],
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Args: {call.function.arguments}")
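The snippet above only prints the requested call. A common next step is to decode the arguments and run a matching local function. The following is a sketch under stated assumptions: `get_weather` here is a placeholder implementation, `run_tool_call` and `TOOLS` are hypothetical names, and the stand-in object mimics the shape of `tool_calls[0]` so the example runs without a network request.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Placeholder implementation; a real one would query a weather service."""
    return {"city": city, "unit": unit, "forecast": "unknown"}


# Map tool names the model may request to local Python callables
TOOLS = {"get_weather": get_weather}


def run_tool_call(call) -> str:
    """Decode the JSON arguments, run the matching function, and return a JSON string."""
    func = TOOLS.get(call.function.name)
    if func is None:
        return json.dumps({"error": f"unknown tool {call.function.name}"})
    try:
        args = json.loads(call.function.arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "arguments were not valid JSON"})
    return json.dumps(func(**args))


# Stand-in shaped like an SDK tool call
demo_call = SimpleNamespace(
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Paris"}')
)
print(run_tool_call(demo_call))
```

In the OpenAI-style flow, the returned string would go back to the model as a message with role "tool" and the matching `tool_call_id`, followed by a second completion request. Given the reliability figure quoted above, handling malformed arguments explicitly is worth the few extra lines.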

JSON Mode

import json

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Return valid JSON with keys: name, capital, continent."},
        {"role": "user", "content": "Tell me about Brazil."}
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
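JSON mode asks for syntactically valid JSON, but the keys still come from the prompt, so it can help to check them before using the result. A minimal sketch follows; `parse_country_json` is a hypothetical helper and `sample` is a hand-written string in the expected shape, not real model output.

```python
import json

REQUIRED_KEYS = {"name", "capital", "continent"}


def parse_country_json(raw: str) -> dict:
    """Parse the model's JSON reply and confirm the expected keys are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


# Hand-written sample in the expected shape
sample = '{"name": "Brazil", "capital": "Brasilia", "continent": "South America"}'
print(parse_country_json(sample)["capital"])  # prints "Brasilia"
```

In real use you would pass `response.choices[0].message.content` instead of `sample`.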

Using Groq Through TokenMix.ai

Four benefits of routing through TokenMix.ai: (1) Automatic failover to other providers when Groq rate limits hit (free tier hits caps quickly). (2) Unified billing across Groq + 300+ other models. (3) Higher effective rate limits via request pooling. (4) Switch between Groq and slower-but-better-quality providers without code changes. Same model: "llama-3.3-70b-versatile" parameter works whether routing direct or via TokenMix.ai.

TokenMix.ai provides access to Groq models alongside 300+ other models through a single endpoint.

from openai import OpenAI

client = OpenAI(
    api_key="tmx-your-key",
    base_url="https://api.tokenmix.ai/v1"
)

# Access Groq's Llama through TokenMix.ai
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from TokenMix.ai"}]
)

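Why route through TokenMix.ai:

  1. Automatic failover. When Groq rate limits hit (the free tier reaches its caps quickly), requests fall over to other providers.
  2. Unified billing. One bill covers Groq and 300+ other models.
  3. Higher effective rate limits. Request pooling raises the limits you see in practice.
  4. Provider switching without code changes. Move between Groq and slower but higher-quality providers as needed.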
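Since both endpoints speak the OpenAI format, switching between direct Groq access and TokenMix.ai can be reduced to configuration. The sketch below is one way to do that: the base URLs are the ones given in this article, while the `TOKENMIX_API_KEY` and `LLM_PROVIDER` environment variable names and the `client_settings` helper are assumptions made for illustration.

```python
import os

# Base URLs as given in this article; TOKENMIX_API_KEY is an assumed variable name
PROVIDERS = {
    "groq": {"base_url": "https://api.groq.com/openai/v1", "key_env": "GROQ_API_KEY"},
    "tokenmix": {"base_url": "https://api.tokenmix.ai/v1", "key_env": "TOKENMIX_API_KEY"},
}


def client_settings(provider: str) -> dict:
    """Return keyword arguments suitable for OpenAI(**settings) for the chosen provider."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider}")
    entry = PROVIDERS[provider]
    return {"base_url": entry["base_url"], "api_key": os.getenv(entry["key_env"], "")}


# LLM_PROVIDER is a hypothetical switch; it defaults to direct Groq access
settings = client_settings(os.getenv("LLM_PROVIDER", "groq"))
print(settings["base_url"])
```

The rest of the application can then construct its client with `OpenAI(**settings)` and keep the same model parameter.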


When to Upgrade From Free to Developer Tier

Four upgrade signals: (1) Hitting 500K tokens/day daily limits regularly. (2) Users waiting due to 30 RPM rate limits. (3) Moving to production (user-facing apps shouldn't rely on free tier). (4) Need guaranteed availability (free tier deprioritized at peak). Developer tier: 60-120 RPM, 60-120K TPM, unlimited daily tokens, pay-per-use. Cost at 100M tokens/mo: Llama 70B $67, Gemma 2 9B $20, Llama 4 Scout $24.70 — competitive with other providers for same models.

Developer Tier Benefits

| Feature | Free Tier | Developer Tier |
|---|---|---|
| Requests/min | 30 | 60-120 |
| Tokens/min | 6,000 | 60,000-120,000 |
| Tokens/day | 500,000 | Unlimited (pay-per-use) |
| Credit card required | No | Yes |
| Cost | $0 | Pay per token |

Upgrade When You Hit These Signals

  1. Hitting daily token limits regularly. If 500K tokens/day is not enough for your workload, you need the Developer tier.
  2. Users waiting due to rate limits. If 30 RPM causes visible delays for users, upgrade for 60-120 RPM.
  3. Moving to production. Any user-facing application should not rely on free tier limits.
  4. Need guaranteed availability. Free tier requests are deprioritized during peak usage.

Cost After Upgrading

| Monthly Volume | Llama 3.3 70B Cost | Gemma 2 9B Cost | Llama 4 Scout Cost |
|---|---|---|---|
| 10M tokens | $6.70 | $2.00 | $2.47 |
| 50M tokens | $33.50 | $10.00 | $12.35 |
| 100M tokens | $67.00 | $20.00 | $24.70 |
| 500M tokens | $335.00 | $100.00 | $123.50 |

Even on the paid tier, Groq's pricing is competitive with other providers for the same open models.
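To estimate cost for your own traffic, blend the input and output prices by the share of tokens that are prompt tokens. The article does not state the split behind its cost table, so the 60% input share used below is an assumption; it happens to reproduce the Llama 3.3 70B column, while the Llama 4 Scout column appears to reflect a split closer to 40% input. The helper name `monthly_cost` is hypothetical.

```python
def monthly_cost(total_tokens_m: float, input_price: float, output_price: float,
                 input_share: float = 0.6) -> float:
    """Blended cost in dollars for a volume given in millions of tokens.

    Prices are dollars per million tokens; input_share is the fraction of
    tokens that are prompt (input) tokens.
    """
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    blended = input_price * input_share + output_price * (1.0 - input_share)
    return total_tokens_m * blended


# Llama 3.3 70B at $0.59 input / $0.79 output, 100M tokens, assumed 60% input
print(f"${monthly_cost(100, 0.59, 0.79):.2f}")  # prints $67.00
```

Output-heavy workloads such as long-form generation will land above this figure, and prompt-heavy workloads such as classification will land below it.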


Groq vs Other Providers: Speed and Cost

Llama 3.3 70B speed comparison: Groq 200-350 tok/s (cheapest at $67/100M, lowest TTFT 100-300ms) vs Together AI 50-100 tok/s ($88) vs Fireworks 40-80 tok/s ($90). Groq is 3-5x faster than next fastest provider for same Llama model + cheaper. For latency-sensitive applications, this gap is significant. OpenAI GPT-4.1 mini comparable cost ($88) but only 40-80 tok/s. Anthropic Claude Haiku 30-60 tok/s + $208 = 3x more expensive at 1/5 the speed.

| Provider | Model | Output Speed (tok/s) | Cost per 100M tokens | Latency (TTFT) |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | 200-350 | $67 | 100-300ms |
| Together AI | Llama 3.3 70B | 50-100 | $88 | 300-600ms |
| Fireworks | Llama 3.3 70B | 40-80 | $90 | 300-700ms |
| OpenAI | GPT-4.1 mini | 40-80 | $88 | 200-500ms |
| Anthropic | Claude Haiku 3.5 | 30-60 | $208 | 300-800ms |

Groq is 3-5x faster than the next fastest provider for Llama models. For applications where latency directly impacts user experience, this gap is significant.


Should You Choose Groq?

Yes for: fastest possible inference (3-10x faster), real-time chat (sub-second responses feel instant), prototyping with no budget (free tier no card), high-volume open model workloads (fast + cost-competitive), 500K+ context (Llama 4 Scout 512K). No for: GPT-4.1/Claude required (Groq open models only), frontier reasoning (DeepSeek R1 Distill good but not o3/Opus level). Multi-provider flexibility: route Groq + others through TokenMix.ai unified endpoint.

| Situation | Choose Groq? | Reason |
|---|---|---|
| Need fastest possible inference | Yes | 3-10x faster than GPU providers |
| Real-time chat interface | Yes | Sub-second responses feel instant |
| Prototyping with no budget | Yes | Free tier, no credit card |
| Need GPT-4.1 or Claude | No | Groq only serves open models |
| Need frontier reasoning | No | DeepSeek R1 Distill is good but not o3/Opus level |
| High-volume open model workloads | Yes | Fast + cost-competitive |
| Need 500K+ context | Yes | Llama 4 Scout has 512K context |
| Want multi-provider flexibility | Use TokenMix.ai | Access Groq + others through one endpoint |

What's the Bottom Line on the Groq API?

Fastest AI inference + easiest onboarding (no card, <5 min setup, tokens arrive faster than reading speed). Free tier generous enough to build complete apps. Developer tier cost-competitive vs other Llama providers + 3-10x faster. Limitation: open models only — pair Groq with another provider for GPT-4.1/Claude/Gemini access. TokenMix.ai unified endpoint routes latency-sensitive queries to Groq + complex reasoning to other providers automatically. Start with free tier today.

Groq offers the fastest AI inference available and the easiest onboarding in the industry. No credit card, sub-five-minute setup, and tokens that arrive faster than you can read them.

For development and prototyping, the free tier is generous enough to build complete applications. For production, the Developer tier is cost-competitive with other Llama hosting providers while delivering 3-10x better speed.

The limitation is model selection: Groq serves open models only. For applications that need GPT-4.1, Claude, or Gemini, pair Groq with another provider. TokenMix.ai makes this easy by routing all requests through a single endpoint -- send latency-sensitive queries to Groq and complex reasoning queries to other providers automatically.

Start now. The free tier is waiting.


FAQ

How do I get started with Groq API?

Go to console.groq.com, sign up (no credit card needed), create an API key, and make your first call. The entire process takes under five minutes. Use the openai Python or Node.js SDK with base_url="https://api.groq.com/openai/v1". Groq's free tier gives you 30 requests/minute and 6,000 tokens/minute on most models.

Is Groq API free?

Yes, Groq offers a free tier with no credit card required. Free limits include 30 requests/minute, 6,000 tokens/minute (15,000 for Gemma), and 500,000 tokens/day per model. This is sufficient for development and light production use. Paid Developer tier is available when you need higher limits.

Why is Groq so fast?

Groq uses custom LPU (Language Processing Unit) hardware designed specifically for sequential token generation. Unlike GPUs that are general-purpose processors adapted for AI, LPUs are purpose-built for inference. This hardware advantage delivers 200-500 tokens/second on large models -- 3-10x faster than any GPU-based provider.

What models are available on Groq?

Groq serves open-weight models: Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Mistral Saba, DeepSeek R1 Distill, and Gemma 2 9B. Groq does not offer proprietary models like GPT, Claude, or Gemini. For access to all models including Groq, use TokenMix.ai as a unified gateway.

How do I handle Groq rate limits?

Monitor response headers: x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. Implement exponential backoff on 429 errors. On the free tier, space requests 2 seconds apart to stay within 30 RPM. For production, upgrade to the Developer tier for 60-120 RPM and 60K-120K TPM.
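Exponential backoff can be sketched in a few lines of standard-library Python. The helper below is illustrative: `retry_with_backoff` and `FakeRateLimit` are hypothetical names, the delays and attempt count are arbitrary defaults, and the fake exception stands in for a 429 so the example runs offline. With the openai Python SDK, the predicate could be `lambda e: isinstance(e, openai.RateLimitError)`; the SDK also offers a `max_retries` client option that may be enough on its own.

```python
import random
import time


def retry_with_backoff(fn, is_retryable, max_attempts=5, base_delay=1.0,
                       max_delay=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Small random jitter keeps many clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


class FakeRateLimit(Exception):
    """Stand-in for a 429 rate limit error."""


attempts = {"n": 0}


def flaky():
    """Fails twice with a rate limit error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise FakeRateLimit()
    return "ok"


waits = []  # record delays instead of actually sleeping
result = retry_with_backoff(flaky, lambda e: isinstance(e, FakeRateLimit), sleep=waits.append)
print(result, len(waits))  # prints "ok 2"
```

Passing `sleep` as a parameter keeps the helper easy to test; in production the default `time.sleep` applies.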

Can I use Groq with the OpenAI SDK?

Yes. Groq implements the OpenAI chat completions API format. Use the standard openai Python package or npm package with base_url="https://api.groq.com/openai/v1" (Python) or baseURL (Node.js). Your existing OpenAI-compatible code works with only the base URL, API key, and model name changed.



Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: Groq API Documentation, Groq Pricing, GroqCloud Console + TokenMix.ai