TokenMix Research Lab · 2026-06-22

Fish Audio Review 2026: TTS API Pricing & Voice Cloning

Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 - Fish Audio developer docs, Fish Audio plan page, OpenAudio S1 launch blog, fishaudio/fish-speech GitHub, Hugging Face openaudio-s1-mini card, Artificial Analysis TTS data, TTS-Arena V2 leaderboard

Fish Audio's text-to-speech API costs $15 per 1 million UTF-8 bytes (roughly 180,000 English words), which the docs frame as about 12 hours of speech, or near $0.75 to $1.25 per audio hour depending on speaking rate (Fish Audio pricing docs). That price buys the model that topped TTS-Arena-V2 human ratings at launch and clones a voice from 10 to 30 seconds of reference audio (OpenAudio S1 blog). It also costs roughly 3.75x as much as Google's standard TTS and has no API free tier, and the "open source" label is a stretch because only the small S1-mini weights are downloadable, under a non-commercial license (Hugging Face S1-mini card).

This review separates what Fish Audio documents itself from what is vendor-reported or community-measured, and tags every claim Confirmed, Likely, or vendor-reported. Pricing and rate limits come from the official developer docs; quality numbers come from Fish Audio's S1 launch post, the TTS-Arena V2 leaderboard, and Artificial Analysis, which do not all sit on the same scoring scale, so treat cross-source ELO figures as directional rather than directly comparable.

Quick Verdict

Fish Audio is a TTS service with top-tier voice quality and fast cloning, priced at a premium and only partially open. Pick it for expressive cloned voices, not for cheapest-possible batch narration.

| Claim | Status | Source |
|---|---|---|
| TTS API costs $15 per 1M UTF-8 bytes | Confirmed | Fish Audio pricing docs |
| ASR (transcription) costs $0.36 per audio hour | Confirmed | Fish Audio pricing docs |
| There is no API free tier (only the web plan has one) | Confirmed | Fish Audio pricing docs |
| OpenAudio S1 topped TTS-Arena-V2 at launch | Confirmed | OpenAudio S1 blog |
| S1 reaches 0.008 WER on Seed-TTS-Eval English | Confirmed | OpenAudio S1 blog |
| Voice cloning works from 10-30 seconds of audio | Confirmed | Fish Audio |
| Fully open source and free for commercial use | False | Only S1-mini weights ship, under CC-BY-NC-SA-4.0 (HF card) |
| S2 Pro is the current flagship model | Confirmed | fish-speech GitHub |
| S2 Pro hits ~100ms time-to-first-audio | Likely (vendor README, single source) | fish-speech GitHub |
| Roughly 3.75x the cost of Google standard TTS | Confirmed | Fish Audio blog, Google Cloud TTS pricing |
| TokenMix serves Fish Audio or any TTS model | False | TokenMix relays text, image and video models, not audio (TokenMix models) |

The short answer: if you need a distinctive cloned voice with emotion control and you can absorb a premium per-byte rate, Fish Audio is one of the best options. If you need cheap, plain narration at scale, Google or a budget neural voice wins on cost.

What Fish Audio Actually Is

Fish Audio is one company running two brand surfaces and several models, which is the first thing that confuses buyers. The consumer platform lives at fish.audio, the open-research brand is OpenAudio, and the code repository is fishaudio/fish-speech with around 30,000 GitHub stars (fish-speech GitHub).

| Model | Params | Open weights? | Released | Notes |
|---|---|---|---|---|
| S2 Pro | 4B + 400M fast decoder | No (proprietary) | 2026 (current) | API-only flagship, lowest latency |
| OpenAudio S1 | 4B | No (proprietary) | Jun 2025 | TTS-Arena-V2 #1 at launch |
| OpenAudio S1-mini | 0.5B (distilled) | Yes, CC-BY-NC-SA-4.0 | Jun 2025 | Gated download, non-commercial only |
| Fish Speech v1.5 | not disclosed | Open code, older | May 2025 | Earlier generation, still benchmarked |

The practical split: S1 is the model that made Fish Audio famous in June 2025, S2 Pro is the newer flagship documented mainly through the vendor's own GitHub README, and only S1-mini gives you downloadable weights — and those are non-commercial. So the honest description is "source-available with one small open model," not "fully open source." For a broader market view, see the TTS API comparison and the realtime voice API roundup.

API Pricing Breakdown

Fish Audio bills TTS by the UTF-8 byte length of the text it synthesizes, not by character count, which quietly raises the real cost for non-Latin languages. The headline rate is $15 per 1 million UTF-8 bytes for both S1 and S2 Pro (Fish Audio pricing docs).

| Service | Model id | Price | Unit |
|---|---|---|---|
| Text-to-speech | s1 / s2-pro | $15.00 | per 1M UTF-8 bytes |
| Transcription (ASR) | transcribe-1 | $0.36 | per audio hour |
| Voice design | voice-design-1 | $0.01 | per successful request |

The byte-billing detail matters. English is roughly one byte per character, so $15 per 1M bytes is close to $15 per 1M characters. But Chinese, Japanese and Korean characters take three bytes each in UTF-8 (four for rare characters), and Arabic takes two, so the same script costs two to three times more to synthesize than the character count suggests. Rate limits scale with prepaid balance: 5 concurrent requests under $100, 15 at $100 or more, 50 at $1,000 or more, then enterprise custom (Fish Audio pricing docs). There is no free API tier, so the cheapest real test path is the consumer plan below or a small prepaid balance.
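To make the byte-billing effect concrete, here is a minimal Python sketch that estimates cost from the UTF-8 byte length of a script and maps a prepaid balance to the concurrency tiers quoted above. The function names `tts_cost_usd` and `concurrency_limit` are hypothetical helpers for illustration, not part of any Fish Audio SDK, and the figures are the list prices cited in this review.

```python
PRICE_PER_MILLION_BYTES_USD = 15.00  # TTS list price quoted in this review


def tts_cost_usd(text: str) -> float:
    """Estimate TTS cost from the UTF-8 byte length of the input text."""
    byte_count = len(text.encode("utf-8"))
    return byte_count / 1_000_000 * PRICE_PER_MILLION_BYTES_USD


def concurrency_limit(prepaid_balance_usd: float) -> int:
    """Map a prepaid balance to the concurrent-request tiers listed above.

    Enterprise custom limits are not modeled here.
    """
    if prepaid_balance_usd >= 1_000:
        return 50
    if prepaid_balance_usd >= 100:
        return 15
    return 5


# Same byte count, very different character counts.
english = "Hello, world"  # 12 characters, 1 byte each
chinese = "你好世界"        # 4 characters, 3 bytes each
for sample in (english, chinese):
    chars = len(sample)
    size = len(sample.encode("utf-8"))
    print(f"{chars} chars, {size} bytes, ${tts_cost_usd(sample):.8f}")
```

Both samples bill as 12 bytes even though the Chinese string has a third as many characters, which is why a character-count budget underestimates CJK spend.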

Consumer Plans: Free, Plus, Pro

For non-developers, Fish Audio sells a credit-based subscription, and for heavy volume it is more expensive per minute than the API. The free web tier does allow commercial use, which is unusual (Fish Audio plans).

| Plan | Price | Credits / month | Approx minutes | Chars per generation |
|---|---|---|---|---|
| Free | $0 | 8,000 | up to 7 min | 500 |
| Plus | $11/mo ($132/yr) | 250,000 | ~200 min | 15,000 |
| Pro | $75/mo ($900/yr) | 2,000,000 | ~1,620 min | 30,000 |

Run the per-minute math and the gap is clear. Plus works out to about $0.055 per minute, or $3.30 per hour of generated audio, and Pro to roughly $0.046 per minute, or $2.78 per hour. The pay-as-you-go API, at near $0.75 to $1.25 per audio hour, is the cheaper route for any serious volume — the subscription mainly buys a UI, voice library access, and longer per-generation limits. Credits reset monthly and do not roll over (Fish Audio plans).
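The per-hour figures above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the plan prices and minutes come from the table in this section, the 180,000-words-per-1M-bytes ratio is the one quoted in this review, and the speaking rates of 150 and 250 words per minute are assumptions chosen because they bracket the quoted $0.75 to $1.25 range.

```python
def subscription_cost_per_hour(monthly_price_usd: float, minutes_per_month: float) -> float:
    """Cost of one hour of generated audio on a credit subscription."""
    return monthly_price_usd / minutes_per_month * 60


def api_cost_per_hour(words_per_minute: float,
                      price_per_million_bytes: float = 15.00,
                      words_per_million_bytes: float = 180_000) -> float:
    """Cost of one hour of English speech on the pay-as-you-go API."""
    words_per_hour = words_per_minute * 60
    return words_per_hour / words_per_million_bytes * price_per_million_bytes


print(f"Plus: ${subscription_cost_per_hour(11, 200):.2f}/hour")     # $3.30
print(f"Pro:  ${subscription_cost_per_hour(75, 1620):.2f}/hour")    # $2.78
print(f"API at 150 wpm: ${api_cost_per_hour(150):.2f}/hour")        # $0.75
print(f"API at 250 wpm: ${api_cost_per_hour(250):.2f}/hour")        # $1.25
```

The 250 wpm case is also what the "about 12 hours per 1M bytes" framing implies, since 180,000 words over 12 hours is 15,000 words per hour.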

Pricing vs ElevenLabs, Google, Amazon

Fish Audio sits at the premium end of the TTS market, priced near Amazon's neural voices and well above Google's standard tier. The comparison below uses each provider's published list price (Fish Audio blog, Google Cloud TTS pricing).

| Provider | Tier | Price | Unit |
|---|---|---|---|
| Fish Audio | S1 / S2 Pro | $15.00 | per 1M bytes |
| Google Cloud TTS | Standard | $4.00 | per 1M chars |
| Google Cloud TTS | Neural2 / Studio | $16.00 | per 1M chars |
| Amazon Polly | Standard | $4.00 | per 1M chars |
| Amazon Polly | Neural | $16.00 | per 1M chars |
| ElevenLabs | Creator/Pro | subscription | per-credit tiers |

The takeaway: Fish Audio costs roughly 3.75x as much as Google or Amazon standard voices ($15 versus $4), and is about on par with their top neural tiers. You are paying for cloning quality and expressiveness, not for cheap bulk speech. If your use case is plain IVR prompts or large undifferentiated narration, the standard tiers are hard to beat on cost; if you need a recognizable, emotional, cloned voice, the premium is the point. For the speech-to-text side, compare against Whisper API pricing.
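The ratios behind that takeaway are simple division over the list prices in the table. The sketch below assumes English text, where one byte is roughly one character, so Fish Audio's per-byte rate can be set against the others' per-character rates; for CJK text the per-character gap would be wider. The dictionary is illustrative and only restates the table.

```python
FISH_AUDIO_PER_MILLION = 15.00  # USD per 1M bytes (about 1M English characters)

# List prices per 1M characters, as quoted in the comparison table.
competitor_prices = {
    "Google Cloud TTS Standard": 4.00,
    "Google Cloud TTS Neural2 / Studio": 16.00,
    "Amazon Polly Standard": 4.00,
    "Amazon Polly Neural": 16.00,
}


def price_ratio(competitor_price: float) -> float:
    """How many times the competitor's price Fish Audio costs."""
    return FISH_AUDIO_PER_MILLION / competitor_price


for name, price in competitor_prices.items():
    print(f"Fish Audio vs {name}: {price_ratio(price):.2f}x")
```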

Quality Benchmarks

On accuracy and human preference, OpenAudio S1 is genuinely top-tier, with the strongest evidence on word error rate and the TTS-Arena leaderboard. The numbers below are real and sourced; the ELO figures come from different leaderboards and should not be subtracted from each other.

| Benchmark | Score | Model | Status | Source |
|---|---|---|---|---|
| Seed-TTS-Eval WER (English) | 0.008 (0.8%) | S1 | Confirmed | OpenAudio S1 blog |
| Seed-TTS-Eval CER (English) | 0.004 | S1 | Confirmed | OpenAudio S1 blog |
| Seed-TTS-Eval WER | 0.011 | S1-mini | Confirmed | HF S1-mini card |
| TTS-Arena-V2 human ranking | #1 at launch | S1 | Confirmed | OpenAudio S1 blog |
| Artificial Analysis ELO | ~1,074 | S1 | Likely | Artificial Analysis |
| Chinese WER | 0.54% | S2 Pro | Likely (vendor README) | fish-speech GitHub |
| English WER | 0.99% | S2 Pro | Likely (vendor README) | fish-speech GitHub |
| Time-to-first-audio | ~100ms | S2 Pro | Likely (vendor README) | fish-speech GitHub |

A 0.8% word error rate on Seed-TTS-Eval is close to the top of any open or commercial TTS, and a verified #1 on TTS-Arena-V2 is human-preference evidence, not a vendor self-test. The S2 Pro latency and WER figures, by contrast, come only from Fish Audio's own GitHub README, so treat them as vendor-reported until a third party replicates them. No reliable Mean Opinion Score is published, so this review does not quote one.
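For readers less familiar with the metric, word error rate is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. For TTS it is typically measured by transcribing the synthesized audio with an ASR model and comparing that transcript to the input script. The sketch below covers only the scoring step, and assumes lowercasing and whitespace tokenization; the actual Seed-TTS-Eval normalization and ASR setup are not described in this review and are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

On this scale, a WER of 0.008 corresponds to about 8 word errors per 1,000 reference words.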

Voice Cloning, Languages and Capabilities

Fish Audio's differentiator is fast, expressive zero-shot voice cloning, not raw language count. It clones a usable voice from 10 to 30 seconds of reference audio and supports inline emotion and tone markers (Fish Audio).

| Capability | Detail | Status |
|---|---|---|
| Voice cloning | Zero-shot from 10-30s reference | Confirmed |
| Emotion / tone control | 50+ markers plus vocalizations (laugh, sob, whisper) | Confirmed |
| Languages (S1) | 13 languages incl. EN, ZH, JA, DE, FR, ES, KO | Confirmed |
| Languages (S2 Pro) | "80+" claimed | Likely (vendor README) |
| Multi-speaker dialogue | Multi-turn, multi-speaker supported | Confirmed |
| Streaming | Streaming output supported | Confirmed |

The S1 language list of 13 is the confirmed figure; the "80+ languages" claim attaches only to S2 Pro through the vendor README, so do not promise a customer 80 languages without testing the specific one you need. The emotion markers are the standout practical feature — most TTS APIs give you a flat read, while Fish Audio lets you script laughter, whispering, and tonal shifts inline.

Cost per Task

Modeling three real workloads makes the premium concrete. All figures use the $15 per 1M bytes API rate and approximate 1 byte per English character.

| Task | Volume | Fish Audio API cost | Note |
|---|---|---|---|
| Audiobook (one book) | ~100,000 words | ~$8.33 | ~556K bytes at 5.56 bytes/word |
| IVR / notification voice | 10,000 messages x ~120 chars | ~$18.00 | 1.2M bytes total |
| Podcast intro/outro pack | 50 clips x ~400 chars | ~$0.30 | 20K bytes total |

A full 100,000-word audiobook at about $8.33 in API cost is cheap in absolute terms. The same job on Google standard would run near $2.22, so Fish Audio's premium is roughly $6 per book: trivial if the cloned narrator voice is the product, meaningful if you are mass-producing generic audio. The IVR example at $18 for 10,000 prompts is where the premium starts to compound at scale, and where a standard (non-neural) voice at $4 per 1M characters may be the rational pick. For broader cost planning across providers, use the TTS API comparison.
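The three workloads can be recomputed with a short script, which is handy for swapping in your own volumes. This is a sketch that assumes English text at about 1 byte per character and the review's ratio of 180,000 words per 1M bytes; the workload names and the `cost_usd` helper are illustrative.

```python
BYTES_PER_ENGLISH_WORD = 1_000_000 / 180_000  # about 5.56, from the quoted ratio

# Estimated input size in bytes for each workload in the table above.
workloads = {
    "audiobook": 100_000 * BYTES_PER_ENGLISH_WORD,  # ~100,000 words
    "ivr": 10_000 * 120,                            # 10,000 messages x ~120 chars
    "podcast_pack": 50 * 400,                       # 50 clips x ~400 chars
}


def cost_usd(byte_count: float, price_per_million: float = 15.00) -> float:
    """Cost at a per-1M-unit list price (bytes, or characters for English)."""
    return byte_count / 1_000_000 * price_per_million


for name, size in workloads.items():
    fish = cost_usd(size)                 # Fish Audio at $15 per 1M bytes
    standard = cost_usd(size, 4.00)       # Google / Amazon standard at $4 per 1M chars
    print(f"{name}: Fish Audio ${fish:.2f}, standard tier ${standard:.2f}")
```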

How to Access and Self-Host

You can reach Fish Audio three ways, but only one of them is free, and it is non-commercial. The hosted REST API is the production path; the open weights are limited to the small S1-mini model.

| Path | What you get | Best for | Caveat |
|---|---|---|---|
| Hosted REST API | S1, S2 Pro, ASR, voice design | Production apps | Premium price, own API schema |
| Hugging Face | S1-mini 0.5B weights, gated | Research, local tests | CC-BY-NC-SA-4.0, non-commercial |
| GitHub fish-speech | Code, Docker self-host | Tinkering, evaluation | Research license, not full commercial open source |

Two integration notes matter for developers. First, Fish Audio is not advertised as OpenAI-compatible, so it does not drop into an OpenAI SDK the way a chat model would — you integrate against its own API schema. Second, the repository license is a proprietary "Fish Audio Research License," and only S1-mini's weights are downloadable under a non-commercial Creative Commons license, so a commercial self-host of the flagship model is not on the table. If you are building a voice app, the typical architecture is an LLM for the text and a TTS like Fish Audio for the audio; TokenMix can serve the LLM brain through one OpenAI-compatible endpoint, but the speech synthesis itself stays with a dedicated TTS provider.

Where Fish Audio Loses

Fish Audio loses on price, true openness, and predictability of its newer claims. None of these are dealbreakers for the right use case, but they are real.

| Weak spot | Evidence | Pick instead |
|---|---|---|
| Premium price | $15/1M bytes vs $4 Google standard | Google / Amazon standard for plain speech |
| Not truly open source | Only S1-mini, non-commercial license | Fully open TTS if you need self-host rights |
| S2 Pro claims single-sourced | Latency/WER only in vendor README | Wait for third-party replication |
| Byte billing on CJK | About 3 bytes per character | Char-billed provider for heavy CJK volume |
| No OpenAI-compatible API | Custom schema integration | Provider with OpenAI-style TTS endpoint |
| No API free tier | Prepaid balance required | Free-tier TTS for prototyping |

The pattern is consistent: Fish Audio is a quality-first, premium product. Where a project is cost-sensitive and the voice does not need to be distinctive, a standard (non-neural) voice from Google or Amazon is the rational choice. Where the cloned voice is the differentiator, the premium and the licensing friction are usually acceptable.

Use Case Matrix

Use Fish Audio where voice quality and cloning are the product, and route plain bulk speech elsewhere.

| Use case | Fish Audio fit | Better alternative | Why |
|---|---|---|---|
| Cloned brand/character voice | Strong | None comparable on quality | Best-in-class cloning + emotion |
| Audiobook with custom narrator | Strong | Google / Amazon standard if cost-critical | Expressive long-form reads |
| Multilingual expressive content | Strong (test the language) | ElevenLabs for some langs | 13 confirmed langs, 80+ claimed |
| High-volume IVR / notifications | Medium | Google / Amazon standard | Premium price compounds at scale |
| Cheapest possible narration | Weak | Google / Amazon standard | About 3.75x the cost |
| Fully self-hosted commercial TTS | Weak | Open-licensed TTS model | Only S1-mini, non-commercial |
| Speech-to-text / transcription | Medium | Dedicated ASR like Whisper | ASR exists but TTS is the strength |

If your real problem is routing and cost control across many AI models rather than picking one voice, pair this with the AI API gateway guide and the voice AI API roundup.

Final Recommendation

Use Fish Audio when a distinctive, emotional, cloned voice is the point and you can absorb a premium near $15 per 1M bytes. Choose Google or Amazon standard voices for cheap, plain, high-volume speech, prototype on the commercial-use free web tier before committing a prepaid balance, and treat S2 Pro's latency and language claims as vendor-reported until independent tests land. For developers, the pay-as-you-go API is cheaper per hour than the consumer subscription, so skip the Plus plan if you are building software rather than clicking in a UI.

FAQ

How much does the Fish Audio API cost?

Text-to-speech costs $15 per 1 million UTF-8 bytes for both S1 and S2 Pro, roughly 180,000 English words or near $0.75 to $1.25 per audio hour depending on speaking rate. Transcription is $0.36 per audio hour and voice design is $0.01 per successful request.

Is Fish Audio free?

There is no free API tier. The consumer web plan has a free level with 8,000 credits per month and commercial use allowed, but API access requires a prepaid balance. The only free download is the small S1-mini model, and its license is non-commercial.

Is Fish Audio open source?

Partly. The fish-speech code is on GitHub under a proprietary "Fish Audio Research License," and only the distilled S1-mini weights are downloadable, under a non-commercial CC-BY-NC-SA-4.0 license. The flagship S1 and S2 Pro models are API-only, so this is best described as source-available, not fully open source.

How good is Fish Audio's voice quality?

Top-tier. OpenAudio S1 ranked #1 on TTS-Arena-V2 human evaluations at launch and reaches 0.008 word error rate on Seed-TTS-Eval English. S2 Pro reports 0.54% WER on Chinese and 0.99% on English, but those numbers currently come only from Fish Audio's own GitHub README.

How fast can Fish Audio clone a voice?

It performs zero-shot cloning from roughly 10 to 30 seconds of reference audio, and supports more than 50 emotion and tone markers including laughing, sobbing, and whispering.

Is Fish Audio cheaper than ElevenLabs or Google?

It is not cheaper than Google or Amazon standard voices, which run about $4 per 1M characters versus Fish Audio's $15 per 1M bytes. It sits near the top neural tiers of Google and Amazon (~$16 per 1M characters). ElevenLabs uses subscription credit tiers, so direct comparison depends on volume.

Does TokenMix offer Fish Audio?

No. TokenMix is an AI API relay for text, image, and video models and does not serve Fish Audio or any text-to-speech model. For a voice application, you can route the language model through TokenMix and call Fish Audio separately for the speech synthesis.

Which Fish Audio model should I use?

Use S2 Pro for the lowest latency and the newest quality, S1 if you want the well-documented and independently benchmarked model, and S1-mini only for non-commercial local experiments. Verify your specific target language before committing, since the broad language claims attach to S2 Pro.

About TokenMix

TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix relays text, image, and video models and does not currently serve text-to-speech, so this review is published as independent model intelligence, not a sales page.
