Replicate Pricing Guide: Per-Prediction Costs for Image, Video, and LLM Models (2026)
Replicate pricing works differently from most AI API providers. Instead of charging per token or per request, Replicate charges per second of compute time on the GPU your model runs on. This per-prediction pricing model makes Replicate extremely cost-effective for image generation (as low as $0.003 per image with Flux Dev) but surprisingly expensive for large language model inference compared to dedicated providers. TokenMix.ai cost analysis shows Replicate is 3-5x cheaper than competitors for image workloads but 3-5x more expensive for LLM inference.
This guide breaks down Replicate API pricing for image models (Flux, SDXL), video models, and LLMs, with real cost calculations for different workloads.
Table of Contents
- Quick Comparison: Replicate vs Alternatives by Workload
- How Replicate Pricing Works: The Per-Prediction Model
- Image Model Pricing on Replicate
- Video Model Pricing on Replicate
- LLM Pricing on Replicate
- Replicate Cost vs Dedicated Providers
- Cost Optimization Strategies
- Cost Analysis for Different Workloads
- How to Choose: Decision Guide
- Conclusion
- FAQ
Quick Comparison: Replicate vs Alternatives by Workload
| Workload Type | Replicate Cost | Best Alternative | Alternative Cost | Replicate Advantage? |
|---|---|---|---|---|
| Image generation (Flux Pro) | $0.055/image | Fireworks AI | $0.04/image | No -- Fireworks is cheaper |
| Image generation (Flux Dev) | $0.003/image | Self-hosted (A100) | $0.002/image | Yes -- simpler, comparable price |
| Image generation (SDXL) | $0.002/image | RunPod | $0.001/image | Yes -- no setup, near-parity |
| Video generation (Minimax) | $0.30-0.80/clip | RunwayML | $0.50-1.00/clip | Yes -- 30-40% cheaper |
| LLM inference (Llama 70B) | ~$2.80/1M tokens | Together AI | $0.88/1M tokens | No -- 3x more expensive |
| LLM inference (Llama 8B) | ~$0.60/1M tokens | Groq | $0.05/1M tokens | No -- 12x more expensive |
| Custom model hosting | $0.00115/sec (A40) | AWS SageMaker | $1.20/hour | Yes -- pay only when running |
How Replicate Pricing Works: The Per-Prediction Model
Replicate's pricing model is unique in the AI API market. Every model on Replicate runs on a specific GPU hardware tier, and you pay per second of compute time on that hardware.
Hardware Tiers and Rates (April 2026)
| Hardware | Cost per Second | Cost per Hour | Typical Use |
|---|---|---|---|
| CPU | $0.000100 | $0.36 | Lightweight processing |
| Nvidia T4 | $0.000225 | $0.81 | Small models, inference |
| Nvidia A40 (Large) | $0.001150 | $4.14 | Medium models, image gen |
| Nvidia A100 (40GB) | $0.001150 | $4.14 | Large models |
| Nvidia A100 (80GB) | $0.001400 | $5.04 | Very large models |
| Nvidia H100 | $0.003200 | $11.52 | Largest models, fast inference |
| 2x Nvidia A40 | $0.002300 | $8.28 | Multi-GPU workloads |
| 4x Nvidia A100 | $0.006000 | $21.60 | 70B+ parameter models |
How Prediction Billing Works
When you send a request to a model on Replicate:
1. If the model is cold (no active instance), it spins up -- you are billed for boot time (typically 5-30 seconds, model-dependent).
2. The prediction runs -- you are billed for run time.
3. If no requests arrive for the idle timeout period, the model shuts down.

This means your actual cost per prediction depends on three factors, combined in the sketch below:
- Hardware tier the model runs on
- Run time per prediction (varies by model complexity and input)
- Cold start frequency -- if your traffic is bursty, you pay for cold starts repeatedly
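A back-of-the-envelope estimate, assuming the per-second rates from the table above (check Replicate's pricing page for current values before relying on these numbers):

```python
# Rough per-prediction cost estimator for Replicate's per-second billing.
# Rates are the hardware-tier prices quoted in this guide, not live values.
RATES_PER_SECOND = {
    "cpu": 0.000100,
    "t4": 0.000225,
    "a40": 0.001150,
    "a100-80gb": 0.001400,
    "h100": 0.003200,
}

def prediction_cost(hardware: str, run_seconds: float, boot_seconds: float = 0.0) -> float:
    """Cost of one prediction: billed boot time (if the model was cold) plus run time."""
    return (boot_seconds + run_seconds) * RATES_PER_SECOND[hardware]

# Flux Dev on an A40: ~3s warm, ~10s boot when the model is cold.
print(prediction_cost("a40", 3))                   # ~$0.003 warm
print(prediction_cost("a40", 3, boot_seconds=10))  # ~$0.015 including a cold start
```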
Cold Start: The Hidden Cost
Cold starts are Replicate's biggest gotcha. If a model is not already warm (running), the first request triggers a boot that can take 5-30 seconds depending on model size. During this boot time, you are billed.
TokenMix.ai tracked cold start behavior across popular Replicate models:
| Model | Cold Start Time | Cold Start Cost | Warm Prediction Time | Warm Prediction Cost |
|---|---|---|---|---|
| Flux 1.1 Pro | 12-18s | $0.014-0.021 | 3-5s | $0.035-0.058 |
| Flux 1 Dev | 8-12s | $0.009-0.014 | 2-4s | $0.002-0.005 |
| SDXL | 10-15s | $0.012-0.017 | 3-5s | $0.003-0.006 |
| Llama 3.3 70B | 25-40s | $0.150-0.240 | Varies | Varies |
| Whisper Large | 8-12s | $0.009-0.014 | 5-30s | $0.006-0.035 |
For high-frequency workloads, models stay warm and cold starts are negligible. For low-frequency or bursty workloads, cold start costs can double or triple your effective per-prediction cost.
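To estimate what a bursty traffic pattern does to your bill, blend the warm and cold costs by the fraction of requests that hit a cold model. A minimal sketch, assuming Flux Dev's timings from the table above:

```python
# Effective per-prediction cost when a fraction of requests hits a cold model.
def effective_cost(rate_per_sec: float, run_s: float, boot_s: float, cold_fraction: float) -> float:
    warm = run_s * rate_per_sec
    cold = (boot_s + run_s) * rate_per_sec
    return cold_fraction * cold + (1 - cold_fraction) * warm

A40 = 0.001150  # $/sec
# Flux Dev: ~3s run, ~10s boot.
print(effective_cost(A40, 3, 10, cold_fraction=0.0))  # steady traffic: ~$0.0035/image
print(effective_cost(A40, 3, 10, cold_fraction=0.5))  # bursty traffic: ~$0.0092/image, ~2.7x more
```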
Image Model Pricing on Replicate
Image generation is Replicate's strongest value proposition. The pay-per-prediction model means you only pay for what you generate, with no minimum commitments.
Flux Model Pricing
| Model | Hardware | Avg Run Time | Cost per Image | Quality |
|---|---|---|---|---|
| Flux 1.1 Pro | A100 80GB | 3-5s | $0.04-0.07 | Highest quality |
| Flux 1.1 Pro (Ultra) | A100 80GB | 5-8s | $0.07-0.11 | Highest, larger resolution |
| Flux 1 Dev | A40 | 2-4s | $0.002-0.005 | Good quality, fast |
| Flux 1 Schnell | A40 | 1-2s | $0.001-0.002 | Fast draft quality |
| Flux Kontext Pro | A100 80GB | 3-6s | $0.04-0.08 | Best for editing/context |
SDXL and Other Image Models
| Model | Hardware | Avg Run Time | Cost per Image |
|---|---|---|---|
| SDXL 1.0 | A40 | 3-5s | $0.003-0.006 |
| SDXL Turbo | A40 | 1-2s | $0.001-0.002 |
| Stable Diffusion 3.5 | A40 | 4-6s | $0.005-0.007 |
| Ideogram 2.0 | A100 80GB | 4-7s | $0.006-0.010 |
| Recraft V3 | A100 80GB | 3-5s | $0.004-0.007 |
Image Pricing Compared to Alternatives
| Model | Replicate | Fireworks AI | BFL API (Direct) | fal.ai |
|---|---|---|---|---|
| Flux 1.1 Pro | $0.055/image | $0.04/image | $0.04/image | $0.035/image |
| Flux 1 Dev | $0.003/image | $0.025/image | N/A | $0.025/image |
| SDXL | $0.004/image | $0.013/image | N/A | $0.010/image |
Interesting pattern: Replicate is more expensive for Flux Pro (the premium model) but dramatically cheaper for Flux Dev and SDXL. The reason is hardware tier -- Dev and SDXL run on A40 GPUs ($0.00115/sec) while Pro runs on A100 80GB ($0.0014/sec) with longer generation times.
For teams doing high-volume image generation with Flux Dev or SDXL, Replicate offers some of the lowest per-image costs available.
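For reference, generating a single image through the official replicate Python client looks like the sketch below. The model slug and input fields are assumptions based on the public Flux Dev listing; confirm them on the model page, and set REPLICATE_API_TOKEN in your environment first.

```python
# Minimal sketch: one Flux Dev image via the `replicate` Python client
# (pip install replicate). Slug and inputs are assumptions -- check the model page.
import replicate

output = replicate.run(
    "black-forest-labs/flux-dev",  # assumed slug for Flux 1 Dev
    input={"prompt": "a watercolor fox in a misty forest"},
)
# At 2-4s on an A40 ($0.00115/sec), this image costs roughly $0.002-0.005.
print(output)
```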
Video Model Pricing on Replicate
Video generation on Replicate follows the same per-second compute billing, but costs add up faster because video models run longer.
| Model | Hardware | Avg Run Time | Cost per Clip | Clip Length |
|---|---|---|---|---|
| Minimax Video (Hailuo) | H100 | 60-120s | $0.19-0.38 | 5-6 seconds |
| Kling 1.6 Pro | H100 | 90-180s | $0.29-0.58 | 5-10 seconds |
| Luma Dream Machine | A100 80GB | 30-60s | $0.04-0.08 | 4-5 seconds |
| Wan 2.1 (1080p) | H100 | 120-240s | $0.38-0.77 | 5-8 seconds |
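The per-clip figures above are simply average run time multiplied by the GPU's per-second rate:

```python
# Per-clip cost = average run time (seconds) x per-second GPU rate.
H100 = 0.003200       # $/sec
A100_80GB = 0.001400  # $/sec

print(60 * H100, 120 * H100)           # Minimax: ~$0.19-0.38 per clip
print(30 * A100_80GB, 60 * A100_80GB)  # Luma: ~$0.04-0.08 per clip
```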
Video Pricing vs Alternatives
| Provider | Minimax 5s Clip | Kling 5s Clip | Pricing Model |
|---|---|---|---|
| Replicate | $0.19-0.38 | $0.29-0.58 | Pay per second |
| RunwayML (Gen-3) | $0.50/clip | N/A | Credit-based |
| Pika Labs | $0.40/clip | N/A | Subscription |
| fal.ai | $0.25-0.45 | $0.30-0.55 | Pay per second |
Replicate is competitive on video pricing, particularly for Minimax and Luma models. The per-second billing model benefits video workloads because you do not pay subscription fees for idle capacity.
LLM Pricing on Replicate
This is where Replicate's pricing model breaks down for cost-conscious users. Running LLMs on Replicate is significantly more expensive than using dedicated inference providers.
LLM Cost Estimates on Replicate
| Model | Hardware | Per 1M Input Tokens (est.) | Per 1M Output Tokens (est.) |
|---|---|---|---|
| Llama 3.3 8B | A40 | $0.40-0.60 | $0.80-1.20 |
| Llama 3.3 70B | 4x A100 | $1.80-2.80 | $3.60-5.60 |
| Mixtral 8x22B | 4x A100 | $2.00-3.00 | $4.00-6.00 |
| Qwen 3 72B | 4x A100 | $1.80-2.80 | $3.60-5.60 |
Note: These are estimates because Replicate bills per second of compute, not per token. Actual cost depends on tokenization speed, batch size, and prompt length.
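The conversion from per-second billing to per-token cost is simple: divide the GPU rate by your effective token throughput. The throughput figures below are illustrative assumptions, not measured Replicate numbers:

```python
# Per-1M-token cost implied by per-second GPU billing.
def cost_per_million_tokens(gpu_rate_per_sec: float, tokens_per_sec: float) -> float:
    return gpu_rate_per_sec / tokens_per_sec * 1_000_000

# Llama 70B on 4x A100 ($0.006/sec) at an assumed ~1,500 tok/s effective throughput:
print(cost_per_million_tokens(0.006, 1500))    # ~$4.00 per 1M tokens
# Llama 8B on an A40 ($0.00115/sec) at the same assumed throughput:
print(cost_per_million_tokens(0.00115, 1500))  # ~$0.77 per 1M tokens
```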
LLM Pricing: Replicate vs Dedicated Providers
| Model | Replicate (est.) | Together AI | Groq | Fireworks AI |
|---|---|---|---|---|
| Llama 3.3 70B (per 1M tokens) | ~$2.80 | $0.88 | $0.59 | $0.90 |
| Llama 3.3 8B (per 1M tokens) | ~$0.60 | $0.18 | $0.05 | $0.20 |
Replicate costs 3-5x more than dedicated inference providers for LLMs. The per-second billing model is inefficient for text generation because you pay for GPU time that includes tokenization overhead, model loading, and other fixed costs that dedicated providers amortize across millions of users.
TokenMix.ai recommendation: Do not use Replicate for production LLM inference. Use dedicated providers (Together AI, Groq, Fireworks) for text generation, and reserve Replicate for image, video, and custom model workloads where its per-prediction pricing shines.
Replicate is also a poor fit for:
- Low-latency requirements: cold starts add 5-30 seconds, making it unsuitable for real-time applications.
- Predictable high-volume workloads: per-second billing loses to reserved capacity pricing.
Cost Optimization Strategies
1. Keep Models Warm
Replicate models shut down after an idle timeout (typically 5-15 minutes). For frequently used models, configure a longer idle timeout or send periodic keep-alive requests. Keeping an A40 model warm for a full hour costs $4.14, so in pure compute terms it only beats cold start charges at very high request rates; the bigger win is eliminating the 5-30 second boot delay on every first request. The break-even sketch below shows the arithmetic.
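A minimal sketch, assuming the A40 rate and a 12-second cold start:

```python
# How many cold starts per hour make keeping an A40 model warm worthwhile?
A40_RATE = 0.001150                      # $/sec
warm_hour_cost = A40_RATE * 3600         # $4.14 to keep the instance up for an hour
cold_start_cost = A40_RATE * 12          # ~$0.014 per 12-second boot

print(warm_hour_cost / cold_start_cost)  # ~300 cold starts/hour before warm wins on cost alone
```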
2. Use Webhooks Instead of Polling
Replicate supports async predictions with webhook callbacks. This reduces the need for polling and lets you batch-process results efficiently.
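A minimal sketch of an async prediction with a webhook callback, using the replicate Python client; the version ID and webhook URL are placeholders:

```python
# Async prediction with a webhook callback instead of polling.
import replicate

prediction = replicate.predictions.create(
    version="MODEL_VERSION_ID",                       # placeholder: copy from the model's API page
    input={"prompt": "a watercolor fox in a misty forest"},
    webhook="https://example.com/replicate-webhook",  # your endpoint; Replicate POSTs results here
    webhook_events_filter=["completed"],              # only notify when the prediction finishes
)
print(prediction.id, prediction.status)
```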
3. Choose the Right Hardware Tier
Many models offer multiple hardware options. Flux Dev on an A40 ($0.00115/sec) costs about $0.003/image; the same generation on an A100 80GB ($0.0014/sec) would cost roughly 20% more. Always check whether a lighter GPU tier is available for your model.
4. Batch Similar Requests
Some models support batch inputs (multiple images in one prediction). Batching spreads a single boot and cold start across many items, lowering the per-item cost, as the sketch below shows.
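A minimal sketch, assuming a 10-second boot and 3 seconds of run time per image on an A40:

```python
# Per-item cost when one cold start is spread over a batch of n items.
A40_RATE = 0.001150  # $/sec

def per_item_cost(boot_s: float, run_s_per_item: float, n_items: int) -> float:
    return (boot_s + run_s_per_item * n_items) * A40_RATE / n_items

print(per_item_cost(10, 3, 1))   # single image after a cold start: ~$0.015
print(per_item_cost(10, 3, 10))  # ten images in one prediction: ~$0.0046 each
```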
5. Use TokenMix.ai for LLM Workloads
Route LLM inference through TokenMix.ai to dedicated providers (Together AI, Groq) at 3-5x lower cost, and keep Replicate for image and video workloads where it excels.
Conclusion
Replicate pricing is excellent for image and video workloads and poor for LLM inference. That is the core takeaway.
For image generation, Replicate's per-prediction model delivers costs as low as $0.002 per image with Flux Dev or SDXL -- cheaper than most alternatives at moderate volumes. For video generation, Replicate offers competitive per-clip pricing without subscription requirements.
For LLM inference, Replicate costs 3-5x more than dedicated providers. The per-second GPU billing model is structurally inefficient for text generation, where providers like Together AI and Groq have optimized specifically for token throughput.
The optimal strategy for mixed workloads: use Replicate for image and video generation, and route LLM inference through TokenMix.ai to access Together AI, Groq, and Fireworks at the lowest available prices. TokenMix.ai provides a unified API for both workflows, simplifying billing and enabling automatic cost optimization across providers.
Check current Replicate model pricing and compare with alternatives on TokenMix.ai.
FAQ
How does Replicate pricing work compared to per-token pricing?
Replicate charges per second of GPU compute time rather than per token. You pay based on which GPU hardware your model runs on (ranging from $0.000225/sec for T4 to $0.0032/sec for H100) and how long each prediction takes. This makes Replicate cheaper for short, compute-intensive tasks like image generation and more expensive for token-heavy LLM workloads.
Is Replicate cheaper than Midjourney for image generation?
For Flux Dev and SDXL models, yes. Replicate costs $0.002-0.005 per image versus Midjourney's subscription model ($10-60/month for limited generations). At low volume, Replicate is significantly cheaper. At high volume (1,000+ images/day), Midjourney's unlimited plan may be more cost-effective depending on the model quality you need.
Why is Replicate expensive for LLM inference?
Replicate's per-second GPU billing means you pay for all compute time including model loading, tokenization overhead, and GPU idle time between token generation steps. Dedicated inference providers like Together AI and Groq optimize specifically for token throughput, amortizing fixed costs across millions of users, resulting in 3-5x lower per-token costs.
What are Replicate cold starts and how do they affect cost?
Cold starts occur when a model has no active GPU instance and must boot before processing your request. Boot time ranges from 5-30 seconds depending on model size, and you are billed for this time. For infrequent use, cold start costs can double your effective per-prediction price. Keeping models warm or batching requests mitigates this.
Can I use Replicate with TokenMix.ai?
TokenMix.ai focuses on LLM inference optimization, providing unified access to Together AI, Groq, Fireworks, and other text generation providers. For mixed workloads, the recommended approach is using Replicate directly for image and video generation while routing LLM inference through TokenMix.ai for optimal pricing.
What is the cheapest way to generate images on Replicate?
Use Flux Schnell ($0.001-0.002 per image) for draft-quality images or Flux Dev ($0.002-0.005 per image) for good-quality images. Both run on A40 GPUs at $0.00115/second. Avoid Flux Pro ($0.04-0.07 per image) unless you need the highest quality. Keep models warm to avoid cold start surcharges.