TokenMix Research Lab · 2026-04-12

Groq API Tutorial: Getting Started With the Fastest AI Inference in 2026

Groq is the fastest AI inference provider available today. Its custom LPU (Language Processing Unit) hardware delivers 200-500 tokens per second on Llama 3.3 70B -- 3-10x faster than any GPU-based provider. The free tier requires no credit card, and you can make your first API call in under five minutes. This Groq tutorial covers everything: account setup, model selection, Python and Node.js examples, streaming, rate limit handling, and when to upgrade from the free tier to the Developer plan. Speed and pricing data verified by TokenMix.ai in April 2026.

Quick Reference: Groq API Models and Pricing

| Model | Model ID | Input $/M | Output $/M | Speed (tok/s) | Context | Free Tier |
|---|---|---|---|---|---|---|
| Llama 3.3 70B | llama-3.3-70b-versatile | $0.59 | $0.79 | 200-350 | 128K | 6K TPM |
| Llama 4 Scout | meta-llama/llama-4-scout-17b-16e-instruct | $0.11 | $0.34 | 300-500 | 512K | 6K TPM |
| Llama 4 Maverick | meta-llama/llama-4-maverick-17b-128e-instruct | $0.50 | $0.77 | 150-250 | 256K | 6K TPM |
| Mistral Saba | mistral-saba-24b | $0.20 | $0.60 | 250-400 | 32K | 6K TPM |
| DeepSeek R1 Distill | deepseek-r1-distill-llama-70b | $0.75 | $0.99 | 150-250 | 128K | 6K TPM |
| Gemma 2 9B | gemma2-9b-it | $0.20 | $0.20 | 400-600 | 8K | 6K TPM |

Why Groq Is Different: LPU vs GPU

Every other AI inference provider runs models on NVIDIA GPUs. Groq built its own hardware -- the LPU (Language Processing Unit) -- designed specifically for sequential token generation.

The result: Groq serves Llama 3.3 70B at 200-350 tokens/second output speed. The same model on cloud GPU infrastructure typically runs at 30-80 tokens/second. That is a 3-10x speed advantage.

What this means in practice:

- A 250-token answer arrives in under a second instead of several seconds.
- Streaming output feels instant to users rather than typed out word by word.
- High-volume batch jobs finish in a fraction of the time they take on GPU providers.

The trade-off: Groq currently runs a limited set of open models (primarily Llama, Mistral, and Gemma). You cannot use GPT-4.1, Claude, or Gemini on Groq. For applications where speed matters more than model-specific capabilities, this trade-off is worth it.

TokenMix.ai latency benchmarks confirm Groq consistently delivers the lowest time-to-first-token and fastest throughput of any API provider.
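The throughput gap translates directly into wall-clock time. A quick back-of-the-envelope check, using representative speeds from the comparison above:

```python
# Wall-clock time to generate a 500-token answer at different output speeds.
def generation_time(tokens: int, tok_per_s: float) -> float:
    """Seconds spent generating `tokens` at the given output speed."""
    return tokens / tok_per_s

for provider, speed in [("Groq LPU", 300), ("Typical cloud GPU", 50)]:
    print(f"{provider}: {generation_time(500, speed):.1f}s for 500 tokens")
# Groq LPU: 1.7s for 500 tokens
# Typical cloud GPU: 10.0s for 500 tokens
```

Roughly two seconds versus ten is the difference between a chat interface that feels instant and one where users watch a spinner.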


Getting Started: Free Account Setup

Groq's signup process is the fastest in the industry. No credit card, no waitlist, no approval process.

Step 1: Create an Account

Go to console.groq.com. Click "Sign Up." You can register with Google, GitHub, or email. The process takes under 60 seconds.

Step 2: Get Your API Key

After login, you land on the "API Keys" page. Click "Create API Key." Give it a name (e.g., "my-project"). Copy the key. It starts with gsk_.

Step 3: Test With curl

curl -X POST "https://api.groq.com/openai/v1/chat/completions" \
  -H "Authorization: Bearer YOUR_GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello, what can you do?"}]
  }'

Note the endpoint: https://api.groq.com/openai/v1. Groq implements the OpenAI API format, so the path includes /openai/v1.

What You Get on the Free Tier

- 30 requests per minute on most models
- 6,000 tokens per minute (15,000 for Gemma 2 9B)
- 500,000 tokens per day per model
- No credit card required

Your First Groq API Call in Python

Installation

pip install openai

Groq is OpenAI-compatible. You use the standard openai SDK.

Basic Chat Completion

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a concise assistant. Answer in 2-3 sentences."},
        {"role": "user", "content": "What is a REST API?"}
    ]
)

print(response.choices[0].message.content)
# Response arrives in <1 second for short outputs

Check Response Speed

import time

start = time.time()
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a 200-word explanation of machine learning."}]
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens
speed = tokens / elapsed

print(f"Generated {tokens} tokens in {elapsed:.2f}s ({speed:.0f} tok/s)")
# Typical output: Generated 250 tokens in 0.9s (278 tok/s)

Using Environment Variables

Hardcoding keys is fine for testing, but use an environment variable for anything you commit. Set the key in your shell:

export GROQ_API_KEY="gsk_your-key-here"

Then read it from the environment:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

Your First Groq API Call in Node.js

Installation

npm install openai

Basic Chat Completion

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.GROQ_API_KEY,
  baseURL: "https://api.groq.com/openai/v1",
});

const response = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "What is a REST API?" },
  ],
});

console.log(response.choices[0].message.content);

Error Handling

try {
  const response = await client.chat.completions.create({
    model: "llama-3.3-70b-versatile",
    messages: [{ role: "user", content: "Hello" }],
  });
  console.log(response.choices[0].message.content);
} catch (error) {
  if (error instanceof OpenAI.RateLimitError) {
    console.error("Rate limited. Free tier: 30 RPM, 6K TPM.");
  } else if (error instanceof OpenAI.AuthenticationError) {
    console.error("Invalid API key.");
  } else {
    throw error;
  }
}

Model Selection Guide

| Task | Best Groq Model | Why |
|---|---|---|
| General chat, Q&A | Llama 3.3 70B | Best overall quality on Groq |
| Long documents (100K+ tokens) | Llama 4 Scout | 512K context window |
| Fast, simple tasks | Gemma 2 9B | Fastest speed, lowest cost |
| Reasoning and math | DeepSeek R1 Distill 70B | Best reasoning on Groq |
| Multilingual tasks | Mistral Saba | Strong multilingual performance |
| Code generation | Llama 3.3 70B | Best code quality on Groq |
| Maximum context + quality | Llama 4 Maverick | 256K context, strong performance |

Default recommendation: Start with Llama 3.3 70B. It handles most tasks well at a good price. Switch to specialized models only when specific requirements demand it.
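The guide above can be collapsed into a small routing helper. A minimal sketch; the task labels and lookup structure are our own convention, not a Groq API, and the fallback is the default recommendation:

```python
# Task-to-model lookup distilled from the selection guide (illustrative labels).
GROQ_MODEL_FOR_TASK = {
    "chat": "llama-3.3-70b-versatile",
    "long_context": "meta-llama/llama-4-scout-17b-16e-instruct",
    "fast_simple": "gemma2-9b-it",
    "reasoning": "deepseek-r1-distill-llama-70b",
    "multilingual": "mistral-saba-24b",
    "code": "llama-3.3-70b-versatile",
}

def pick_model(task: str) -> str:
    # Unknown tasks fall back to the all-rounder.
    return GROQ_MODEL_FOR_TASK.get(task, "llama-3.3-70b-versatile")

print(pick_model("reasoning"))  # deepseek-r1-distill-llama-70b
```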


Streaming: Real-Time Token Delivery

Groq's speed makes streaming feel qualitatively different from other providers. Tokens arrive so fast that streaming approaches the feel of pre-generated text.

Python Streaming

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1"
)

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Explain Docker containers in detail."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js Streaming

const stream = await client.chat.completions.create({
  model: "llama-3.3-70b-versatile",
  messages: [
    { role: "user", content: "Explain Docker containers in detail." },
  ],
  stream: true,
});

for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) {
    process.stdout.write(content);
  }
}
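To put a number on the "feels instant" claim, you can measure time-to-first-token yourself. A small sketch of a helper that works with any OpenAI-style chunk iterator, including the streams shown above:

```python
import time

def first_token_latency(stream):
    """Return (ttft_seconds, full_text) for an OpenAI-style chunk iterator."""
    start = time.time()
    ttft = None
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            if ttft is None:
                ttft = time.time() - start  # first visible token
            parts.append(content)
    return ttft, "".join(parts)

# With the Python stream from the example above:
# ttft, text = first_token_latency(stream)
# print(f"TTFT: {ttft * 1000:.0f}ms")
```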

Rate Limit Handling and Free Tier Optimization

Free Tier Limits

| Model | Requests/Min | Tokens/Min | Tokens/Day |
|---|---|---|---|
| Llama 3.3 70B | 30 | 6,000 | 500,000 |
| Llama 4 Scout | 30 | 6,000 | 500,000 |
| Gemma 2 9B | 30 | 15,000 | 500,000 |
| DeepSeek R1 Distill | 30 | 6,000 | 500,000 |

Rate Limit Response Headers

Groq returns detailed rate limit information in every response:

x-ratelimit-limit-requests: 30
x-ratelimit-remaining-requests: 28
x-ratelimit-limit-tokens: 6000
x-ratelimit-remaining-tokens: 4500
x-ratelimit-reset-requests: 2.1s
x-ratelimit-reset-tokens: 45.5s
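The openai Python SDK can surface these headers through its with_raw_response accessor, which returns the HTTP headers alongside the parsed body. A sketch of checking your remaining budget before a batch run (the parsing helper is our own):

```python
# Pull Groq's rate-limit counters out of a response's headers mapping.
def remaining_budget(headers) -> dict:
    return {
        "requests_left": int(headers.get("x-ratelimit-remaining-requests", 0)),
        "tokens_left": int(headers.get("x-ratelimit-remaining-tokens", 0)),
    }

# raw = client.chat.completions.with_raw_response.create(
#     model="llama-3.3-70b-versatile",
#     messages=[{"role": "user", "content": "ping"}],
# )
# print(remaining_budget(raw.headers))
# response = raw.parse()  # the usual ChatCompletion object
```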

Free Tier Optimization Strategies

1. Use shorter prompts. The 6K TPM limit means long prompts eat your budget. Keep system prompts under 200 tokens.

2. Use Gemma 2 9B for simple tasks. It has a higher TPM limit (15K) and processes faster. Use Llama 70B only when you need the quality.

3. Implement request queuing. Space requests 2 seconds apart to stay within 30 RPM. Use a simple queue:

import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `rpm` requests per rolling 60 seconds."""

    def __init__(self, rpm=30):
        self.requests = deque()
        self.rpm = rpm

    def wait_if_needed(self):
        now = time.time()
        # Drop timestamps that have aged out of the 60-second window.
        while self.requests and self.requests[0] < now - 60:
            self.requests.popleft()
        if len(self.requests) >= self.rpm:
            # Sleep until the oldest request leaves the window.
            wait_time = 60 - (now - self.requests[0])
            time.sleep(wait_time)
        self.requests.append(time.time())

limiter = RateLimiter(rpm=28)  # Stay slightly under limit

for prompt in prompts:
    limiter.wait_if_needed()
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[{"role": "user", "content": prompt}]
    )

4. Cache responses locally. For repeated identical queries, cache the response to avoid hitting the API again.
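A minimal sketch of such a cache, keyed on model and messages; the helper names are illustrative, and a real app might persist entries to disk instead of a dict:

```python
import hashlib
import json

# In-memory response cache: identical (model, messages) pairs hit the API once.
_cache: dict = {}

def cache_key(model: str, messages: list) -> str:
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_completion(client, model: str, messages: list) -> str:
    key = cache_key(model, messages)
    if key not in _cache:  # only call the API on a miss
        response = client.chat.completions.create(model=model, messages=messages)
        _cache[key] = response.choices[0].message.content
    return _cache[key]
```

On the free tier's 500K tokens/day budget, deduplicating repeated queries this way can be the difference between staying under the daily cap and hitting it.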


Tool Calling and JSON Mode

Tool Calling

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                    },
                    "required": ["city"]
                }
            }
        }
    ],
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    call = response.choices[0].message.tool_calls[0]
    print(f"Function: {call.function.name}")
    print(f"Args: {call.function.arguments}")
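To complete the loop, you execute the function yourself and send its output back in a tool message so the model can produce a final answer. A sketch of the round trip, with get_weather as a stand-in for your real implementation:

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stand-in: call your real weather backend here.
    return {"city": city, "temp": 18, "unit": unit}

# Continuing from the tool-calling response above:
# call = response.choices[0].message.tool_calls[0]
# result = get_weather(**json.loads(call.function.arguments))
# followup = client.chat.completions.create(
#     model="llama-3.3-70b-versatile",
#     messages=[
#         {"role": "user", "content": "What's the weather in Paris?"},
#         response.choices[0].message,  # the assistant turn with the tool call
#         {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
#     ],
# )
# print(followup.choices[0].message.content)
```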

JSON Mode

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "Return valid JSON with keys: name, capital, continent."},
        {"role": "user", "content": "Tell me about Brazil."}
    ],
    response_format={"type": "json_object"}
)

import json
data = json.loads(response.choices[0].message.content)
print(data)

Using Groq Through TokenMix.ai

TokenMix.ai provides access to Groq models alongside 300+ other models through a single endpoint.

from openai import OpenAI

client = OpenAI(
    api_key="tmx-your-key",
    base_url="https://api.tokenmix.ai/v1"
)

# Access Groq's Llama through TokenMix.ai
response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello from TokenMix.ai"}]
)

Why route through TokenMix.ai:

- One API key and one endpoint for Groq plus 300+ other models
- Switch providers by changing only the model name
- Send latency-sensitive queries to Groq and complex reasoning queries elsewhere

When to Upgrade From Free to Developer Tier

Developer Tier Benefits

| Feature | Free Tier | Developer Tier |
|---|---|---|
| Requests/min | 30 | 60-120 |
| Tokens/min | 6,000 | 60,000-120,000 |
| Tokens/day | 500,000 | Unlimited (pay-per-use) |
| Credit card required | No | Yes |
| Cost | $0 | Pay per token |

Upgrade When You Hit These Signals

  1. Hitting daily token limits regularly. If 500K tokens/day is not enough for your workload, you need the Developer tier.
  2. Users waiting due to rate limits. If 30 RPM causes visible delays for users, upgrade for 60-120 RPM.
  3. Moving to production. Any user-facing application should not rely on free tier limits.
  4. Need guaranteed availability. Free tier requests are deprioritized during peak usage.

Cost After Upgrading

| Monthly Volume | Llama 3.3 70B Cost | Gemma 2 9B Cost | Llama 4 Scout Cost |
|---|---|---|---|
| 10M tokens | $6.70 | $2.00 | $2.47 |
| 50M tokens | $33.50 | $10.00 | $12.35 |
| 100M tokens | $67.00 | $20.00 | $24.70 |
| 500M tokens | $335.00 | $100.00 | $123.50 |

Even on the paid tier, Groq's pricing is competitive with other providers for the same open models.
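You can estimate your own bill from the per-million-token prices in the pricing table. A small sketch; the input/output split is a parameter because the mix varies by workload, and the table above assumes its own unspecified mix:

```python
# Estimate a monthly bill from per-million-token input/output prices.
def monthly_cost(total_tokens: int, input_per_m: float, output_per_m: float,
                 input_share: float = 0.5) -> float:
    millions = total_tokens / 1_000_000
    blended = input_share * input_per_m + (1 - input_share) * output_per_m
    return millions * blended

# Llama 3.3 70B, 10M tokens/month, half input / half output:
print(f"${monthly_cost(10_000_000, 0.59, 0.79):.2f}")  # $6.90
```

A 50/50 split gives $6.90 for 10M tokens on Llama 3.3 70B; a more input-heavy mix, typical of chat workloads with long prompts and short answers, pulls the figure lower.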


Groq vs Other Providers: Speed and Cost

| Provider | Model | Output Speed (tok/s) | Cost per 100M tokens | Latency (TTFT) |
|---|---|---|---|---|
| Groq | Llama 3.3 70B | 200-350 | $67 | 100-300ms |
| Together AI | Llama 3.3 70B | 50-100 | $88 | 300-600ms |
| Fireworks | Llama 3.3 70B | 40-80 | $90 | 300-700ms |
| OpenAI | GPT-4.1 mini | 40-80 | $88 | 200-500ms |
| Anthropic | Claude Haiku 3.5 | 30-60 | $208 | 300-800ms |

Groq is 3-5x faster than the next fastest provider for Llama models. For applications where latency directly impacts user experience, this gap is significant.


Decision Guide: When to Choose Groq

| Situation | Choose Groq? | Reason |
|---|---|---|
| Need fastest possible inference | Yes | 3-10x faster than GPU providers |
| Real-time chat interface | Yes | Sub-second responses feel instant |
| Prototyping with no budget | Yes | Free tier, no credit card |
| Need GPT-4.1 or Claude | No | Groq only serves open models |
| Need frontier reasoning | No | DeepSeek R1 Distill is good but not o3/Opus level |
| High-volume open model workloads | Yes | Fast + cost-competitive |
| Need 500K+ context | Yes | Llama 4 Scout has 512K context |
| Want multi-provider flexibility | Use TokenMix.ai | Access Groq + others through one endpoint |

Conclusion

Groq offers the fastest AI inference available and the easiest onboarding in the industry. No credit card, sub-five-minute setup, and tokens that arrive faster than you can read them.

For development and prototyping, the free tier is generous enough to build complete applications. For production, the Developer tier is cost-competitive with other Llama hosting providers while delivering 3-10x better speed.

The limitation is model selection: Groq serves open models only. For applications that need GPT-4.1, Claude, or Gemini, pair Groq with another provider. TokenMix.ai makes this easy by routing all requests through a single endpoint -- send latency-sensitive queries to Groq and complex reasoning queries to other providers automatically.

Start now. The free tier is waiting.


FAQ

How do I get started with Groq API?

Go to console.groq.com, sign up (no credit card needed), create an API key, and make your first call. The entire process takes under five minutes. Use the openai Python or Node.js SDK with base_url="https://api.groq.com/openai/v1". Groq's free tier gives you 30 requests/minute and 6,000 tokens/minute on most models.

Is Groq API free?

Yes, Groq offers a free tier with no credit card required. Free limits include 30 requests/minute, 6,000 tokens/minute (15,000 for Gemma), and 500,000 tokens/day per model. This is sufficient for development and light production use. Paid Developer tier is available when you need higher limits.

Why is Groq so fast?

Groq uses custom LPU (Language Processing Unit) hardware designed specifically for sequential token generation. Unlike GPUs that are general-purpose processors adapted for AI, LPUs are purpose-built for inference. This hardware advantage delivers 200-500 tokens/second on large models -- 3-10x faster than any GPU-based provider.

What models are available on Groq?

Groq serves open-weight models: Llama 3.3 70B, Llama 4 Scout, Llama 4 Maverick, Mistral Saba, DeepSeek R1 Distill, and Gemma 2 9B. Groq does not offer proprietary models like GPT, Claude, or Gemini. For access to all models including Groq, use TokenMix.ai as a unified gateway.

How do I handle Groq rate limits?

Monitor response headers: x-ratelimit-remaining-requests and x-ratelimit-remaining-tokens. Implement exponential backoff on 429 errors. On the free tier, space requests 2 seconds apart to stay within 30 RPM. For production, upgrade to the Developer tier for 60-120 RPM and 60K-120K TPM.
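A minimal backoff sketch along those lines; the wrapper is generic and our own helper, and in real code you would catch openai.RateLimitError specifically:

```python
import random
import time

# Retry with exponential backoff: delays of base, 2x, 4x, ... plus jitter.
def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                 retry_on: type = Exception):
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt + random.random() * base_delay)

# Usage with the client from earlier:
# from openai import RateLimitError
# reply = with_backoff(
#     lambda: client.chat.completions.create(
#         model="llama-3.3-70b-versatile",
#         messages=[{"role": "user", "content": "Hello"}],
#     ),
#     retry_on=RateLimitError,
# )
```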

Can I use Groq with the OpenAI SDK?

Yes. Groq implements the OpenAI chat completions API format. Use the standard openai Python package or npm package with base_url="https://api.groq.com/openai/v1" (Python) or baseURL (Node.js). Your existing OpenAI-compatible code works with only the base URL, API key, and model name changed.


Author: TokenMix Research Lab | Last Updated: April 2026 | Data Source: Groq API Documentation, Groq Pricing, GroqCloud Console + TokenMix.ai