TokenMix Research Lab · 2026-04-12

Groq API Tutorial: Getting Started With the Fastest AI Inference in 2026

Groq is the fastest AI inference provider available today. Its custom LPU (Language Processing Unit) hardware delivers 200-500 tokens per second on Llama 3.3 70B -- 3-10x faster than any GPU-based provider. The free tier requires no credit card, and you can make your first API call in under five minutes. This Groq tutorial covers everything: account setup, model selection, Python and Node.js examples, streaming, rate limit handling, and when to upgrade from the free tier to the Developer plan. Speed and pricing data verified by TokenMix.ai in April 2026.

Quick Reference: Groq API Models and Pricing

| Model | Model ID | Input $/M | Output $/M | Speed (tok/s) | Context | Free Tier |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | llama-3.3-70b-versatile | $0.59 | $0.79 | 200-350 | 128K | 6K TPM |
| Llama 4 Scout | meta-llama/llama-4-scout-17b-16e-instruct | $0.11 | $0.34 | 300-500 | 512K | 6K TPM |
| Llama 4 Maverick | meta-llama/llama-4-maverick-17b-128e-instruct | $0.50 | $0.77 | 150-250 | 256K | 6K TPM |
| Mistral Saba | mistral-saba-24b | $0.20 | $0.60 | 250-400 | 32K | 6K TPM |
| DeepSeek R1 Distill | deepseek-r1-distill-llama-70b | $0.75 | $0.99 | 150-250 | 128K | 6K TPM |
| Gemma 2 9B | gemma2-9b-it | $0.20 | $0.20 | 400-600 | 8K | 6K TPM |

Why Groq Is Different: LPU vs GPU

Every other AI inference provider runs models on NVIDIA GPUs. Groq built its own hardware -- the LPU (Language Processing Unit) -- designed specifically for sequential token generation.

The result: Groq serves Llama 3.3 70B at 200-350 tokens/second output speed. The same model on cloud GPU infrastructure typically runs at 30-80 tokens/second. That is a 3-10x speed advantage.

What this means in practice:

- A 250-token answer arrives in under a second instead of several seconds.
- Streaming output feels instant to users rather than typed out word by word.
- High-volume batch jobs finish in a fraction of the time they take on GPU providers.

The trade-off: Groq currently runs a limited set of open models (primarily Llama, Mistral, and Gemma). You cannot use GPT-4.1, Claude, or Gemini on Groq. For applications where speed matters more than model-specific capabilities, this trade-off is worth it.

TokenMix.ai latency benchmarks confirm Groq consistently delivers the lowest time-to-first-token and fastest throughput of any API provider.
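The throughput gap translates directly into wall-clock time. A quick back-of-the-envelope check, using representative speeds from the comparison above:

```python
# Wall-clock time to generate a 500-token answer at different output speeds.
def generation_time(tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating `tokens` at the given output speed."""
    return tokens / tok_per_s

for provider, speed in [("Groq LPU", 300), ("Typical cloud GPU", 50)]:
    print(f"{provider}: {generation_time(500, speed):.1f}s for 500 tokens")
# Groq LPU: 1.7s for 500 tokens
# Typical cloud GPU: 10.0s for 500 tokens
```

Roughly two seconds versus ten is the difference between a chat interface that feels instant and one where users watch a spinner.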


Getting Started: Free Account Setup

Groq's signup process is the fastest in the industry. No credit card, no waitlist, no approval process.

Step 1: Create an Account

Go to console.groq.com. Click "Sign Up." You can register with Google, GitHub, or email. The process takes under 60 seconds.

Step 2: Get Your API Key

After login, you land on the "API Keys" page. Click "Create API Key." Give it a name (e.g., "my-project"). Copy the key. It starts with gsk_.

Step 3: Test With curl

curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'

Note the endpoint: https://api.groq.com/openai/v1. Groq implements the OpenAI API format, so the path includes /openai/v1.

What You Get on the Free Tier

- 30 requests per minute on most models
- 6,000 tokens per minute (15,000 for Gemma 2 9B)
- 500,000 tokens per day per model
- No credit card required

Your First Groq API Call in Python

Installation

pip install openai

Groq is OpenAI-compatible. You use the standard openai SDK.

Basic Chat Completion

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Answer in 2-3 sentences."},
        {"role": "user", "content": "What is a REST API?"}
    ]
)

print(response.choices[0].message.content)
# Response arrives in <1 second for short outputs

Check Response Speed

import time

start = time.time()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 200-word explanation of machine learning."}]
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
speed = tokens / elapsed

print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.0f} tok/s)")
# Typical output: Generated 250 tokens in 0.9s (278 tok/s)

Using Environment Variables

Hardcoding keys is fine for testing, but use an environment variable for anything you commit. Set the key in your shell:

export GROQ_API_KEY="gsk_your-key-here"

Then read it from the environment:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

Your First Groq API Call in Node.js

Installation

npm install openai

Basic Chat Completion

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a REST API?" },
  ],
});

console.log(response.choices[0].message.content);

Error Handling

try {
  const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.RateLimitError) {
    console.error("Rate limited. Free tier: 30 RPM, 6K TPM.");
  } else if (error instanceof OpenAI.AuthenticationError) {
    console.error("Invalid API key.");
  } else {
    throw error;
  }
}

Model Selection Guide

| Task | Best Groq Model | Why |
|---|---|---|
| General chat, Q&A | Llama 3.3 70B | Best overall quality on Groq |
| Long documents (100K+ tokens) | Llama 4 Scout | 512K context window |
| Fast, simple tasks | Gemma 2 9B | Fastest speed, lowest cost |
| Reasoning and math | DeepSeek R1 Distill 70B | Best reasoning on Groq |
| Multilingual tasks | Mistral Saba | Strong multilingual performance |
| Code generation | Llama 3.3 70B | Best code quality on Groq |
| Maximum context + quality | Llama 4 Maverick | 256K context, strong performance |

Default recommendation: Start with Llama 3.3 70B. It handles most tasks well at a good price. Switch to specialized models only when specific requirements demand it.
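The guide above can be collapsed into a small routing helper. A minimal sketch; the task labels and lookup structure are our own convention, not a Groq API, and the fallback is the default recommendation:

```python
# Task-to-model lookup distilled from the selection guide (illustrative labels).
GROQ_MODEL_FOR_TASK = {
    "chat": "llama-3.3-70b-versatile",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
    "fast_simple": "gemma2-9b-it",
    "reasoning": "deepseek-r1-distill-llama-70b",
    "multilingual": "mistral-saba-24b",
    "code": "llama-3.3-70b-versatile",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the all-rounder.
    return GROQ_MODEL_FOR_TASK.get(task, "llama-3.3-70b-versatile")

print(pick_model("reasoning"))  # deepseek-r1-distill-llama-70b
```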


Streaming: Real-Time Token Delivery

Groq's speed makes streaming feel qualitatively different from other providers. Tokens arrive so fast that streaming approaches the feel of pre-generated text.

Python Streaming

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain Docker containers in detail."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js Streaming

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Explain Docker containers in detail." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
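To put a number on the "feels instant" claim, you can measure time-to-first-token yourself. A small sketch of a helper that works with any OpenAI-style chunk iterator, including the streams shown above:

```python
import time

def first_token_latency(stream):
    """Return (ttft_seconds, full_text) for an OpenAI-style chunk iterator."""
    start = time.time()
    ttft = None
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            if ttft is None:
                ttft = time.time() - start  # first visible token
            parts.append(content)
    return ttft, "".join(parts)

# With the Python stream from the example above:
# ttft, text = first_token_latency(stream)
# print(f"TTFT: {ttft * 1000:.0f}ms")
```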

Rate Limit Handling and Free Tier Optimization

Free Tier Limits

| Model | Requests/Min | Tokens/Min | Tokens/Day |
|---|---|---|---|
| Llama 3.3 70B | 30 | 6,000 | 500,000 |
| Llama 4 Scout | 30 | 6,000 | 500,000 |
| Gemma 2 9B | 30 | 15,000 | 500,000 |
| DeepSeek R1 Distill | 30 | 6,000 | 500,000 |

Rate Limit Response Headers

Groq returns detailed rate limit information in every response:

x-ratelimit-limit-requests: 30
x-ratelimit-remaining-requests: 28
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-tokens: 4500
x-ratelimit-reset-requests: 2.1s
x-ratelimit-reset-tokens: 45.5s
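The openai Python SDK can surface these headers through its with_raw_response accessor, which returns the HTTP headers alongside the parsed body. A sketch of checking your remaining budget before a batch run (the parsing helper is our own):

```python
# Pull Groq's rate-limit counters out of a response's headers mapping.
def remaining_budget(headers) -> dict:
    return {
        "requests_left": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_left": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# raw = client.chat.completions.with_raw_response.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "ping"}],
# )
# print(remaining_budget(raw.headers))
# response = raw.parse()  # the usual ChatCompletion object
```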

Free Tier Optimization Strategies

1. Use shorter prompts. The 6K TPM limit means long prompts eat your budget. Keep system prompts under 200 tokens.

2. Use Gemma 2 9B for simple tasks. It has a higher TPM limit (15K) and processes faster. Use Llama 70B only when you need the quality.

3. Implement request queuing. Space requests 2 seconds apart to stay within 30 RPM. Use a simple queue:

import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `rpm` requests per rolling 60 seconds."""

    def __init__(self, rpm=30):
        self.requests = deque()
        self.rpm = rpm

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window.
        while self.requests and self.requests[0] < now - 60:
            self.requests.popleft()
        if len(self.requests) >= self.rpm:
            # Sleep until the oldest request leaves the window.
            wait_time = 60 - (now - self.requests[0])
            time.sleep(wait_time)
        self.requests.append(time.time())

limiter = RateLimiter(rpm=28)  # Stay slightly under limit

for prompt in prompts:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

4. Cache responses locally. For repeated identical queries, cache the response to avoid hitting the API again.
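A minimal sketch of such a cache, keyed on model and messages; the helper names are illustrative, and a real app might persist entries to disk instead of a dict:

```python
import hashlib
import json

# In-memory response cache: identical (model, messages) pairs hit the API once.
_cache: dict = {}

def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model: str, messages: list) -> str:
    key = cache_key(model, messages)
    if key not in _cache:  # only call the API on a miss
        response = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
```

On the free tier's 500K tokens/day budget, deduplicating repeated queries this way can be the difference between staying under the daily cap and hitting it.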


Tool Calling and JSON Mode

Tool Calling

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ],
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Args: {call.function.arguments}")
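To complete the loop, you execute the function yourself and send its output back in a tool message so the model can produce a final answer. A sketch of the round trip, with get_weather as a stand-in for your real implementation:

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stand-in: call your real weather backend here.
    return {"city": city, "temp": 18, "unit": unit}

# Continuing from the tool-calling response above:
# call = response.choices[0].message.tool_calls[0]
# result = get_weather(**json.loads(call.function.arguments))
# followup = client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[
#         {"role": "user", "content": "What's the weather in Paris?"},
#         response.choices[0].message,  # the assistant turn with the tool call
#         {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
#     ],
# )
# print(followup.choices[0].message.content)
```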

JSON Mode

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Return valid JSON with keys: name, capital, continent."},
        {"role": "user", "content": "Tell me about Brazil."}
    ],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
print(data)

Using Groq Through TokenMix.ai

TokenMix.ai provides access to Groq models alongside 300+ other models through a single endpoint.

from openai import OpenAI

client = OpenAI(
    api_key="tmx-your-key",
    base_url="https://api.tokenmix.ai/v1"
)

# Access Groq's Llama through TokenMix.ai
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from TokenMix.ai"}]
)

Why route through TokenMix.ai:

- One API key and one endpoint for Groq plus 300+ other models
- Switch providers by changing only the model name
- Send latency-sensitive queries to Groq and complex reasoning queries elsewhere

When to Upgrade From Free to Developer Tier

Developer Tier Benefits

| Feature | Free Tier | Developer Tier |
|---|---|---|
| Requests/min | 30 | 60-120 |
| Tokens/min | 6,000 | 60,000-120,000 |
| Tokens/day | 500,000 | Unlimited (pay-per-use) |
| Credit card required | No | Yes |
| Cost | $0 | Pay per token |

Upgrade When You Hit These Signals

  1. Hitting daily token limits regularly. If 500K tokens/day is not enough for your workload, you need the Developer tier.
  2. Users waiting due to rate limits. If 30 RPM causes visible delays for users, upgrade for 60-120 RPM.
  3. Moving to production. Any user-facing application should not rely on free tier limits.
  4. Need guaranteed availability. Free tier requests are deprioritized during peak usage.

Cost After Upgrading

| Monthly Volume | Llama 3.3 70B Cost | Gemma 2 9B Cost | Llama 4 Scout Cost |
|---|---|---|---|
| 10M tokens | $6.70 | $2.00 | $2.47 |
| 50M tokens | $33.50 | $10.00 | $12.35 |
| 100M tokens | $67.00 | $20.00 | $24.70 |
| 500M tokens | $335.00 | $100.00 | $123.50 |

Even on the paid tier, Groq's pricing is competitive with other providers for the same open models.
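You can estimate your own bill from the per-million-token prices in the pricing table. A small sketch; the input/output split is a parameter because the mix varies by workload, and the table above assumes its own unspecified mix:

```python
# Estimate a monthly bill from per-million-token input/output prices.
def monthly_cost(total_tokens: int, input_per_m: float, output_per_m: float,
                 input_share: float = 0.5) -> float:
    millions = total_tokens / 1_000_000
    blended = input_share * input_per_m + (1 - input_share) * output_per_m
    return millions * blended

# Llama 3.3 70B, 10M tokens/month, half input / half output:
print(f"${monthly_cost(10_000_000, 0.59, 0.79):.2f}")  # $6.90
```

A 50/50 split gives $6.90 for 10M tokens on Llama 3.3 70B; a more input-heavy mix, typical of chat workloads with long prompts and short answers, pulls the figure lower.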


Groq vs Other Providers: Speed and Cost

| Provider | Model | Output Speed (tok/s) | Cost per 100M tokens | Latency (TTFT) |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | 200-350 | $67 | 100-300ms |
| Together AI | Llama 3.3 70B | 50-100 | $88 | 300-600ms |
| Fireworks | Llama 3.3 70B | 40-80 | $90 | 300-700ms |
| OpenAI | GPT-4.1 mini | 40-80 | $88 | 200-500ms |
| Anthropic | Claude Haiku 3.5 | 30-60 | $208 | 300-800ms |

Groq is 3-5x faster than the next fastest provider for Llama models. For applications where latency directly impacts user experience, this gap is significant.


Decision Guide: When to Choose Groq

| Situation | Choose Groq? | Reason |
|---|---|---|
| Need fastest possible inference | Yes | 3-10x faster than GPU providers |
| Real-time chat interface | Yes | Sub-second responses feel instant |
| Prototyping with no budget | Yes | Free tier, no credit card |
| Need GPT-4.1 or Claude | No | Groq only serves open models |
| Need frontier reasoning | No | DeepSeek R1 Distill is good but not o3/Opus level |
| High-volume open model workloads | Yes | Fast + cost-competitive |
| Need 500K+ context | Yes | Llama 4 Scout has 512K context |
| Want multi-provider flexibility | Use TokenMix.ai | Access Groq + others through one endpoint |

Conclusion

Groq offers the fastest AI inference available and the easiest onboarding in the industry. No credit card, sub-five-minute setup, and tokens that arrive faster than you can read them.

For development and prototyping, the free tier is generous enough to build complete applications. For production, the Developer tier is cost-competitive with other Llama hosting providers while delivering 3-10x better speed.

The limitation is model selection: Groq serves open models only. For applications that need GPT-4.1, Claude, or Gemini, pair Groq with another provider. TokenMix.ai makes this easy by routing all requests through a single endpoint -- send latency-sensitive queries to Groq and complex reasoning queries to other providers automatically.

Start now. The free tier is waiting.


FAQ

How do I get started with Groq API?

Go to console.groq.com, sign up (no credit card needed), create an API key, and make your first call. The entire process takes under five minutes. Use the openai Python or Node.js SDK with base_url="https://api.groq.com/openai/v1". Groq's free tier gives you 30 requests/minute and 6,000 tokens/minute on most models.

Is Groq API free?

Yes, Groq offers a free tier with no credit card required. Free limits include 30 requests/minute, 6,000 tokens/minute (15,000 for Gemma), and 500,000 tokens/day per model. This is sufficient for development and light production use. Paid Developer tier is available when you need higher limits.

Why is Groq so fast?

Groq uses custom LPU (Language Processing Unit) hardware designed specifically for sequential token generation. Unlike GPUs that are general-purpose processors adapted for AI, LPUs are purpose-built for inference. This hardware advantage delivers 200-500 tokens/second on large models -- 3-10x faster than any GPU-based provider.

What models are available on Groq?

Groq serves open-weight models: Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Mistral Saba, DeepSeek R1 Distill, and Gemma 2 9B. Groq does not offer proprietary models like GPT, Claude, or Gemini. For access to all models including Groq, use TokenMix.ai as a unified gateway.

How do I handle Groq rate limits?

Monitor response headers: x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. Implement exponential backoff on 429 errors. On the free tier, space requests 2 seconds apart to stay within 30 RPM. For production, upgrade to the Developer tier for 60-120 RPM and 60K-120K TPM.
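A minimal backoff sketch along those lines; the wrapper is generic and our own helper, and in real code you would catch openai.RateLimitError specifically:

```python
import random
import time

# Retry with exponential backoff: delays of base, 2x, 4x, ... plus jitter.
def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                 retry_on: type = Exception):
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Usage with the client from earlier:
# from openai import RateLimitError
# reply = with_backoff(
#     lambda: client.chat.completions.create(
#         model="llama-3.3-70b-versatile",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=RateLimitError,
# )
```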

Can I use Groq with the OpenAI SDK?

Yes. Groq implements the OpenAI chat completions API format. Use the standard openai Python package or npm package with base_url="https://api.groq.com/openai/v1" (Python) or baseURL (Node.js). Your existing OpenAI-compatible code works with only the base URL, API key, and model name changed.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: Groq API Documentation, Groq Pricing, GroqCloud Console + TokenMix.ai