TokenMix Research Lab · 2026-06-22

Fish Audio Review 2026: TTS API Pricing & Voice Cloning

Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 (Fish Audio developer docs, Fish Audio plan page, OpenAudio S1 launch blog, fishaudio/fish-speech GitHub, Hugging Face openaudio-s1-mini card, Artificial Analysis TTS data, TTS-Arena V2 leaderboard)

Fish Audio's text-to-speech API costs $15 per 1 million UTF-8 bytes (roughly 180,000 English words), which the docs frame as about 12 hours of speech, or roughly $0.75 to $1.25 per audio hour depending on speaking rate (Fish Audio pricing docs). That price buys the model that topped TTS-Arena-V2 human ratings at launch and clones a voice from 10 to 30 seconds of reference audio (OpenAudio S1 blog). It also costs roughly 3.7x as much as Google's standard TTS, comes with no free API tier, and stretches the "open source" label: only the small S1-mini weights are downloadable, and only under a non-commercial license (Hugging Face S1-mini card).

This review separates what Fish Audio's official docs confirm from what is only vendor-reported or community-measured, and tags every claim Confirmed, Likely, or False. Pricing and rate limits come from the official developer docs; quality numbers come from Fish Audio's S1 launch post, the TTS-Arena V2 leaderboard, and Artificial Analysis. These sources do not all use the same scoring scale, so treat cross-source Elo figures as directional rather than directly comparable.

Quick Verdict
What Fish Audio Actually Is
API Pricing Breakdown
Consumer Plans: Free, Plus, Pro
Pricing vs ElevenLabs, Google, Amazon
Quality Benchmarks
Voice Cloning, Languages and Capabilities
Cost per Task
How to Access and Self-Host
Where Fish Audio Loses
Use Case Matrix
Final Recommendation
FAQ
About TokenMix
Sources
Related Articles

Quick Verdict

Fish Audio is a TTS service with top-tier voice quality and fast cloning, priced at a premium and only partially open. Pick it for expressive cloned voices, not for cheapest-possible batch narration.

Claim	Status	Source
TTS API costs $15 per 1M UTF-8 bytes	Confirmed	Fish Audio pricing docs
ASR (transcription) costs $0.36 per audio hour	Confirmed	Fish Audio pricing docs
There is no API free tier (only the web plan has one)	Confirmed	Fish Audio pricing docs
OpenAudio S1 topped TTS-Arena-V2 at launch	Confirmed	OpenAudio S1 blog
S1 reaches 0.008 WER on Seed-TTS-Eval English	Confirmed	OpenAudio S1 blog
Voice cloning works from 10-30 seconds of audio	Confirmed	Fish Audio
Fully open source and free for commercial use	False	Only S1-mini weights ship, under CC-BY-NC-SA-4.0 (HF card)
S2 Pro is the current flagship model	Confirmed	fish-speech GitHub
S2 Pro hits ~100ms time-to-first-audio	Likely (vendor README, single source)	fish-speech GitHub
Roughly 3.7x the price of Google standard TTS	Confirmed	Fish Audio blog, Google Cloud TTS pricing
TokenMix serves Fish Audio or any TTS model	False	TokenMix relays text, image and video models, not audio (TokenMix models)

The short answer: if you need a distinctive cloned voice with emotion control and you can absorb a premium per-byte rate, Fish Audio is one of the best options. If you need cheap, plain narration at scale, a standard-tier voice from Google or Amazon wins on cost.

What Fish Audio Actually Is

Fish Audio is one company running two brand surfaces and several models, which is the first thing that confuses buyers. The consumer platform lives at fish.audio, the open-research brand is OpenAudio, and the code repository is fishaudio/fish-speech with around 30,000 GitHub stars (fish-speech GitHub).

Model	Params	Open weights?	Released	Notes
S2 Pro	4B + 400M fast decoder	No (proprietary)	2026 (current)	API-only flagship, lowest latency
OpenAudio S1	4B	No (proprietary)	Jun 2025	TTS-Arena-V2 #1 at launch
OpenAudio S1-mini	0.5B (distilled)	Yes, CC-BY-NC-SA-4.0	Jun 2025	Gated download, non-commercial only
Fish Speech v1.5	not disclosed	Open code, older	May 2025	Earlier generation, still benchmarked

The practical split: S1 is the model that made Fish Audio famous in June 2025, S2 Pro is the newer flagship documented mainly through the vendor's own GitHub README, and only S1-mini gives you downloadable weights — and those are non-commercial. So the honest description is "source-available with one small open model," not "fully open source." For a broader market view, see the TTS API comparison and the realtime voice API roundup.

API Pricing Breakdown

Fish Audio bills TTS by the UTF-8 byte length of the input text, not by character count, which quietly raises the real cost for non-Latin scripts. The headline rate is $15 per 1 million UTF-8 bytes for both S1 and S2 Pro (Fish Audio pricing docs).

Service	Model id	Price	Unit
Text-to-speech	s1 / s2-pro	$15.00	per 1M UTF-8 bytes
Transcription (ASR)	transcribe-1	$0.36	per audio hour
Voice design	voice-design-1	$0.01	per successful request

The byte-billing detail matters. English is roughly one byte per character, so $15 per 1M bytes is close to $15 per 1M characters. But Chinese, Japanese and Korean characters take three bytes each in UTF-8 (four for rarer characters outside the Basic Multilingual Plane), and Arabic letters take two, so the same script costs two to three times more to synthesize than the character count suggests. Rate limits scale with prepaid balance: 5 concurrent requests under $100, 15 at $100 or more, 50 at $1,000 or more, then enterprise custom (Fish Audio pricing docs). There is no free API tier, so the cheapest real test path is the consumer plan below or a small prepaid balance.
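The byte-billing effect is easy to check before you commit a balance. The sketch below is a rough estimator, not Fish Audio client code: it measures a script the way a byte-billed service would (UTF-8 length), prices it at the $15 per 1M bytes list rate, and maps a prepaid balance to the concurrency tiers quoted above. The function names and sample strings are illustrative assumptions.

```python
# List price from the pricing table above.
TTS_USD_PER_MILLION_BYTES = 15.00


def utf8_bytes(text: str) -> int:
    """Billable size of a script: its length once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def tts_cost_usd(text: str) -> float:
    """Estimated TTS charge for one script at the list price."""
    return utf8_bytes(text) * TTS_USD_PER_MILLION_BYTES / 1_000_000


def concurrency_limit(prepaid_balance_usd: float) -> int:
    """Concurrent-request tier for a prepaid balance (enterprise tiers excluded)."""
    if prepaid_balance_usd >= 1_000:
        return 50
    if prepaid_balance_usd >= 100:
        return 15
    return 5


if __name__ == "__main__":
    # Illustrative sample strings, not taken from any vendor documentation.
    samples = {
        "English": "Hello, world",
        "Chinese": "你好，世界",
        "Arabic": "مرحبا بالعالم",
    }
    for language, text in samples.items():
        size = utf8_bytes(text)
        print(
            f"{language}: {len(text)} chars, {size} bytes, "
            f"{size / len(text):.1f} bytes/char, ${tts_cost_usd(text):.6f}"
        )
```

The Chinese sample is 5 characters but 15 billable bytes, which is the threefold gap the paragraph above describes. Running your own scripts through `utf8_bytes` gives a closer budget figure than a character count.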

Consumer Plans: Free, Plus, Pro

For non-developers, Fish Audio sells a credit-based subscription, and for heavy volume it is more expensive per minute than the API. The free web tier does allow commercial use, which is unusual (Fish Audio plans).

Plan	Price	Credits / month	Approx minutes	Chars per generation
Free	$0	8,000	up to 7 min	500
Plus	$11/mo ($132/yr)	250,000	~200 min	15,000
Pro	$75/mo ($900/yr)	2,000,000	~1,620 min	30,000

Run the per-minute math and the gap is clear. Plus works out to about $0.055 per minute, or $3.30 per hour of generated audio, and Pro to roughly $0.046 per minute, or $2.78 per hour. The pay-as-you-go API, at near $0.75 to $1.25 per audio hour, is the cheaper route for any serious volume — the subscription mainly buys a UI, voice library access, and longer per-generation limits. Credits reset monthly and do not roll over (Fish Audio plans).
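The arithmetic behind those figures can be reproduced in a few lines. This is a sketch under stated assumptions: plan prices and minutes come from the table above, the bytes-per-word figure is derived from the "1M bytes is roughly 180,000 English words" estimate, and the speaking rates of 150 and 250 words per minute are our own assumption, chosen because they reproduce the quoted $0.75 to $1.25 range.

```python
# Consumer plans from the table above: (monthly price in USD, approx minutes).
PLANS = {"Plus": (11.00, 200), "Pro": (75.00, 1_620)}

TTS_USD_PER_MILLION_BYTES = 15.00
# 1M bytes is quoted as roughly 180,000 English words (about 5.56 bytes/word).
BYTES_PER_WORD = 1_000_000 / 180_000


def plan_usd_per_hour(monthly_price: float, minutes: float) -> float:
    """Effective price per generated audio hour if the whole quota is used."""
    return monthly_price / minutes * 60


def api_usd_per_hour(words_per_minute: float) -> float:
    """Pay-as-you-go price per audio hour at an assumed speaking rate."""
    bytes_per_hour = words_per_minute * 60 * BYTES_PER_WORD
    return bytes_per_hour * TTS_USD_PER_MILLION_BYTES / 1_000_000


if __name__ == "__main__":
    for name, (price, minutes) in PLANS.items():
        print(f"{name}: ${plan_usd_per_hour(price, minutes):.2f} per hour")
    for wpm in (150, 250):  # assumed slow and fast narration rates
        print(f"API at {wpm} wpm: ${api_usd_per_hour(wpm):.2f} per hour")
```

Note that the subscription figures assume every credit is spent; unused credits raise the effective hourly price further because they do not roll over.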

Pricing vs ElevenLabs, Google, Amazon

Fish Audio sits at the premium end of the TTS market, priced near Amazon's neural voices and well above Google's standard tier. The comparison below uses each provider's published list price (Fish Audio blog, Google Cloud TTS pricing).

Provider	Tier	Price	Unit
Fish Audio	S1 / S2 Pro	$15.00	per 1M bytes
Google Cloud TTS	Standard	$4.00	per 1M chars
Google Cloud TTS	Neural2 / Studio	$16.00	per 1M chars
Amazon Polly	Standard	$4.00	per 1M chars
Amazon Polly	Neural	$16.00	per 1M chars
ElevenLabs	Creator/Pro	subscription	per-credit tiers

The takeaway: Fish Audio is roughly 3.7x the cost of Google or Amazon standard voices, and about on par with their top neural tiers. You are paying for cloning quality and expressiveness, not for cheap bulk speech. If your use case is plain IVR prompts or large undifferentiated narration, the standard tiers are hard to beat on cost; if you need a recognizable, emotional, cloned voice, the premium is the point. For the speech-to-text side, compare against Whisper API pricing.

Quality Benchmarks

On accuracy and human preference, OpenAudio S1 is genuinely top-tier, with the strongest evidence on word error rate and the TTS-Arena leaderboard. Every figure below is tagged with its source; the Elo figure comes from a different leaderboard than the TTS-Arena ranking, so the two should not be compared directly.

Benchmark	Score	Model	Status	Source
Seed-TTS-Eval WER (English)	0.008 (0.8%)	S1	Confirmed	OpenAudio S1 blog
Seed-TTS-Eval CER (English)	0.004	S1	Confirmed	OpenAudio S1 blog
Seed-TTS-Eval WER	0.011	S1-mini	Confirmed	HF S1-mini card
TTS-Arena-V2 human ranking	#1 at launch	S1	Confirmed	OpenAudio S1 blog
Artificial Analysis Elo	~1,074	S1	Likely	Artificial Analysis
Chinese WER	0.54%	S2 Pro	Likely (vendor README)	fish-speech GitHub
English WER	0.99%	S2 Pro	Likely (vendor README)	fish-speech GitHub
Time-to-first-audio	~100ms	S2 Pro	Likely (vendor README)	fish-speech GitHub

A 0.8% word error rate on Seed-TTS-Eval is close to the top of any open or commercial TTS, and a verified #1 on TTS-Arena-V2 is human-preference evidence, not a vendor self-test. The S2 Pro latency and WER figures, by contrast, come only from Fish Audio's own GitHub README, so treat them as vendor-reported until a third party replicates them. No reliable Mean Opinion Score is published, so this review does not quote one.
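For readers unfamiliar with the metric, a WER of 0.008 means about 8 word-level errors per 1,000 reference words. TTS benchmarks typically obtain the hypothesis by transcribing the synthesized audio with an ASR model and comparing it with the input text. The sketch below shows only the standard WER definition (word-level edit distance divided by reference length); the lowercase-and-split normalization is a simplifying assumption and is not the Seed-TTS-Eval pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Because the score depends on the ASR model and text normalization used, WER figures from different evaluation setups are not interchangeable.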

Voice Cloning, Languages and Capabilities

Fish Audio's differentiator is fast, expressive zero-shot voice cloning, not raw language count. It clones a usable voice from 10 to 30 seconds of reference audio and supports inline emotion and tone markers (Fish Audio).

Capability	Detail	Status
Voice cloning	Zero-shot from 10-30s reference	Confirmed
Emotion / tone control	50+ markers plus vocalizations (laugh, sob, whisper)	Confirmed
Languages (S1)	13 languages incl. EN, ZH, JA, DE, FR, ES, KO	Confirmed
Languages (S2 Pro)	"80+" claimed	Likely (vendor README)
Multi-speaker dialogue	Multi-turn, multi-speaker supported	Confirmed
Streaming	Streaming output supported	Confirmed

The S1 language list of 13 is the confirmed figure; the "80+ languages" claim attaches only to S2 Pro through the vendor README, so do not promise a customer 80 languages without testing the specific one you need. The emotion markers are the standout practical feature — most TTS APIs give you a flat read, while Fish Audio lets you script laughter, whispering, and tonal shifts inline.

Cost per Task

Modeling three representative workloads makes the premium concrete. All figures use the $15 per 1M bytes API rate and assume roughly 1 byte per English character.

Task	Volume	Fish Audio API cost	Note
Audiobook (one book)	~100,000 words	~$8.33	~556K bytes at 5.56 bytes/word
IVR / notification voice	10,000 messages x ~120 chars	~$18.00	1.2M bytes total
Podcast intro/outro pack	50 clips x ~400 chars	~$0.30	20K bytes total

A full 100,000-word audiobook at about $8.33 in API cost is cheap in absolute terms; the same job on Google standard would run near $2.22, so Fish Audio's premium is roughly $6 per book. That is trivial if the cloned narrator voice is the product, meaningful if you are mass-producing generic audio. The IVR example at $18 for 10,000 prompts is where the premium starts to compound at scale, and where a standard-tier voice may be the rational pick. For broader cost planning across providers, use the TTS API comparison.
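The table can be recomputed, and extended to your own volumes, with a short script. This is a sketch using only the list prices quoted in this review ($15 per 1M bytes for Fish Audio, $4 per 1M characters for Google standard) and the assumption of about 1 byte per English character; the workload names and the `cost_usd` helper are illustrative.

```python
FISH_USD_PER_MILLION_BYTES = 15.00
GOOGLE_STANDARD_USD_PER_MILLION_CHARS = 4.00
# 1M bytes is quoted as roughly 180,000 English words (about 5.56 bytes/word).
BYTES_PER_ENGLISH_WORD = 1_000_000 / 180_000

# Workload sizes in bytes, treated as equal to characters for English text.
WORKLOADS = {
    "Audiobook (100,000 words)": 100_000 * BYTES_PER_ENGLISH_WORD,
    "IVR (10,000 x 120 chars)": 10_000 * 120,
    "Podcast pack (50 x 400 chars)": 50 * 400,
}


def cost_usd(units: float, usd_per_million: float) -> float:
    """Cost of a workload given its size and a per-million-unit list price."""
    return units * usd_per_million / 1_000_000


if __name__ == "__main__":
    for name, size in WORKLOADS.items():
        fish = cost_usd(size, FISH_USD_PER_MILLION_BYTES)
        google = cost_usd(size, GOOGLE_STANDARD_USD_PER_MILLION_CHARS)
        print(f"{name}: Fish ${fish:.2f}, Google standard ${google:.2f}, "
              f"premium ${fish - google:.2f}")
```

For non-English scripts the byte count and character count diverge, so the Fish Audio input should be the UTF-8 byte length while the Google input stays the character count.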

How to Access and Self-Host

You can reach Fish Audio three ways. The hosted REST API is the production path; the two free routes, the S1-mini weights and the fish-speech code, are limited to research or non-commercial use. The only downloadable weights are those of the small S1-mini model.

Path	What you get	Best for	Caveat
Hosted REST API	S1, S2 Pro, ASR, voice design	Production apps	Premium price, own API schema
Hugging Face S1-mini	0.5B weights, gated	Research, local tests	CC-BY-NC-SA-4.0, non-commercial
GitHub fish-speech	Code, Docker self-host	Tinkering, evaluation	Research license, not full commercial open source

Two integration notes matter for developers. First, Fish Audio is not advertised as OpenAI-compatible, so it does not drop into an OpenAI SDK the way a chat model would — you integrate against its own API schema. Second, the repository license is a proprietary "Fish Audio Research License," and only S1-mini's weights are downloadable under a non-commercial Creative Commons license, so a commercial self-host of the flagship model is not on the table. If you are building a voice app, the typical architecture is an LLM for the text and a TTS like Fish Audio for the audio; TokenMix can serve the LLM brain through one OpenAI-compatible endpoint, but the speech synthesis itself stays with a dedicated TTS provider.
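The two-provider architecture described above can be sketched as a thin seam between a text generator and a speech synthesizer. Everything in this block is hypothetical: `GenerateText`, `Synthesize`, `VoiceReply`, and `answer_with_voice` are names invented for illustration, and the stand-in functions replace real client calls so the example runs offline. It does not reproduce Fish Audio's or TokenMix's actual API schema.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical seams for this sketch, not real SDK types.
GenerateText = Callable[[str], str]   # prompt -> reply text (the LLM side)
Synthesize = Callable[[str], bytes]   # reply text -> encoded audio (the TTS side)


@dataclass
class VoiceReply:
    text: str
    audio: bytes
    billable_bytes: int  # UTF-8 size of the text sent to a byte-billed TTS


def answer_with_voice(
    prompt: str, generate_text: GenerateText, synthesize: Synthesize
) -> VoiceReply:
    """Generate a text reply, then synthesize it with a separate TTS provider."""
    text = generate_text(prompt)
    audio = synthesize(text)
    return VoiceReply(text, audio, len(text.encode("utf-8")))


if __name__ == "__main__":
    # Offline stand-ins; swap in real LLM and TTS client calls in practice.
    def fake_llm(prompt: str) -> str:
        return f"You asked: {prompt}"

    def fake_tts(text: str) -> bytes:
        return text.encode("utf-8")

    reply = answer_with_voice("What are your opening hours?", fake_llm, fake_tts)
    print(reply.text, len(reply.audio), reply.billable_bytes)
```

Keeping the two providers behind separate callables means the TTS vendor can be swapped without touching the LLM integration, which matters when the speech API uses its own schema.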

Where Fish Audio Loses

Fish Audio loses on price, true openness, and predictability of its newer claims. None of these are dealbreakers for the right use case, but they are real.

Weak spot	Evidence	Pick instead
Premium price	$15/1M bytes vs $4 Google standard	Google / Amazon standard for plain speech
Not truly open source	Only S1-mini, non-commercial license	Fully open TTS if you need self-host rights
S2 Pro claims single-sourced	Latency/WER only in vendor README	Wait for third-party replication
Byte billing on CJK	Typically 3 bytes per character	Char-billed provider for heavy CJK volume
No OpenAI-compatible API	Custom schema integration	Provider with OpenAI-style TTS endpoint
No API free tier	Prepaid balance required	Free-tier TTS for prototyping

The pattern is consistent: Fish Audio is a quality-first, premium product. Where a project is cost-sensitive and the voice does not need to be distinctive, a standard-tier voice from Google or Amazon is the rational choice. Where the cloned voice is the differentiator, the premium and the licensing friction are usually acceptable.

Use Case Matrix

Start Fish Audio where voice quality and cloning are the product, and route plain bulk speech elsewhere.

Use case	Fish Audio fit	Better alternative	Why
Cloned brand/character voice	Strong	None comparable on quality	Best-in-class cloning + emotion
Audiobook with custom narrator	Strong	Google neural if cost-critical	Expressive long-form reads
Multilingual expressive content	Strong (test the language)	ElevenLabs for some langs	13 confirmed langs, 80+ claimed
High-volume IVR / notifications	Medium	Google / Amazon standard	Premium price compounds at scale
Cheapest possible narration	Weak	Google / Amazon standard	3.7x the cost
Fully self-hosted commercial TTS	Weak	Open-licensed TTS model	Only S1-mini, non-commercial
Speech-to-text / transcription	Medium	Dedicated ASR like Whisper	ASR exists but TTS is the strength

If your real problem is routing and cost control across many AI models rather than picking one voice, pair this with the AI API gateway guide and the voice AI API roundup.

Final Recommendation

Use Fish Audio when a distinctive, emotional, cloned voice is the point and you can absorb a premium near $15 per 1M bytes. Choose Google or Amazon standard voices for cheap, plain, high-volume speech, prototype on the commercial-use free web tier before committing a prepaid balance, and treat S2 Pro's latency and language claims as vendor-reported until independent tests land. For developers, the pay-as-you-go API is cheaper per hour than the consumer subscription, so skip the Plus plan if you are building software rather than clicking in a UI.

FAQ

How much does the Fish Audio API cost?

Text-to-speech costs $15 per 1 million UTF-8 bytes for both S1 and S2 Pro, roughly 180,000 English words or near $0.75 to $1.25 per audio hour depending on speaking rate. Transcription is $0.36 per audio hour and voice design is $0.01 per successful request.

Is Fish Audio free?

There is no free API tier. The consumer web plan has a free level with 8,000 credits per month and commercial use allowed, but API access requires a prepaid balance. The only free download is the small S1-mini model, and its license is non-commercial.

Is Fish Audio open source?

Partly. The fish-speech code is on GitHub under a proprietary "Fish Audio Research License," and only the distilled S1-mini weights are downloadable, under a non-commercial CC-BY-NC-SA-4.0 license. The flagship S1 and S2 Pro models are API-only, so this is best described as source-available, not fully open source.

How good is Fish Audio's voice quality?

Top-tier. OpenAudio S1 ranked #1 on TTS-Arena-V2 human evaluations at launch and reaches a 0.008 word error rate on Seed-TTS-Eval English. S2 Pro reports 0.99% English and 0.54% Chinese WER, but those numbers currently come only from Fish Audio's own GitHub README.

How fast can Fish Audio clone a voice?

It performs zero-shot cloning from roughly 10 to 30 seconds of reference audio, and supports more than 50 emotion and tone markers including laughing, sobbing, and whispering.

Is Fish Audio cheaper than ElevenLabs or Google?

It is not cheaper than Google or Amazon standard voices, which run about $4 per 1M characters versus Fish Audio's $15 per 1M bytes. It sits near the top neural tiers of Google and Amazon (~$16 per 1M characters). ElevenLabs uses subscription credit tiers, so direct comparison depends on volume.

Does TokenMix offer Fish Audio?

No. TokenMix is an AI API relay for text, image, and video models and does not serve Fish Audio or any text-to-speech model. For a voice application, you can route the language model through TokenMix and call Fish Audio separately for the speech synthesis.

Which Fish Audio model should I use?

Use S2 Pro for the lowest latency and the newest model, S1 if you want the well-documented and independently benchmarked model, and S1-mini only for non-commercial local experiments. Verify your specific target language before committing, since the broad language claims attach to S2 Pro.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix relays text, image, and video models and does not currently serve text-to-speech, so this review is published as independent model intelligence, not a sales page.

Sources

Fish Audio - Official site - product, voice cloning, capabilities
OpenAudio - Introducing S1 launch blog - S1 quality claims, WER, TTS-Arena-V2 ranking
Fish Audio Docs - Pricing and Rate Limits - API price, ASR price, rate limits
Fish Audio - Plans and Pricing - consumer subscription tiers and credits
fishaudio/fish-speech GitHub - S2 Pro, self-host, license, vendor benchmarks
Hugging Face - openaudio-s1-mini model card - open weights, non-commercial license, S1-mini WER
Artificial Analysis - Fish Audio TTS - third-party Elo and price/speed data
Google Cloud Text-to-Speech pricing - competitor standard and neural pricing