TokenMix Research Lab · 2026-06-22

Fish Audio Review 2026: TTS API Pricing & Voice Cloning
Last Updated: 2026-06-22
Author: TokenMix Research Lab
Data verified: 2026-06-22 - Fish Audio developer docs, Fish Audio plan page, OpenAudio S1 launch blog, fishaudio/fish-speech GitHub, Hugging Face openaudio-s1-mini card, Artificial Analysis TTS data, TTS-Arena V2 leaderboard
Fish Audio's text-to-speech API costs $15 per 1 million UTF-8 bytes (roughly 180,000 English words), which the docs frame as about 12 hours of speech, or near $0.75 to $1.25 per audio hour depending on speaking rate (Fish Audio pricing docs). That price buys the model that topped TTS-Arena-V2 human ratings at launch and clones a voice from 10 to 30 seconds of reference audio (OpenAudio S1 blog). It also costs roughly 3.7x as much as Google's standard TTS and has no API free tier, and the "open source" label is a stretch: only the small S1-mini weights are downloadable, and only under a non-commercial license (Hugging Face S1-mini card).
This review separates what Fish Audio documents itself from what is vendor-reported or community-measured, and tags every claim Confirmed, Likely, or vendor-reported. Pricing and rate limits come from the official developer docs; quality numbers come from Fish Audio's S1 launch post, the TTS-Arena V2 leaderboard, and Artificial Analysis, which do not all sit on the same scoring scale, so treat cross-source ELO figures as directional rather than directly comparable.
Table of Contents
- Quick Verdict
- What Fish Audio Actually Is
- API Pricing Breakdown
- Consumer Plans: Free, Plus, Pro
- Pricing vs ElevenLabs, Google, Amazon
- Quality Benchmarks
- Voice Cloning, Languages and Capabilities
- Cost per Task
- How to Access and Self-Host
- Where Fish Audio Loses
- Use Case Matrix
- Final Recommendation
- FAQ
- About TokenMix
- Sources
- Related Articles
Quick Verdict
Fish Audio is a TTS service with top-tier voice quality and fast cloning, priced at a premium and only partially open. Pick it for expressive cloned voices, not for cheapest-possible batch narration.
| Claim | Status | Source |
|---|---|---|
| TTS API costs $15 per 1M UTF-8 bytes | Confirmed | Fish Audio pricing docs |
| ASR (transcription) costs $0.36 per audio hour | Confirmed | Fish Audio pricing docs |
| There is no API free tier (only the web plan has one) | Confirmed | Fish Audio pricing docs |
| OpenAudio S1 topped TTS-Arena-V2 at launch | Confirmed | OpenAudio S1 blog |
| S1 reaches 0.008 WER on Seed-TTS-Eval English | Confirmed | OpenAudio S1 blog |
| Voice cloning works from 10-30 seconds of audio | Confirmed | Fish Audio |
| Fully open source and free for commercial use | False | Only S1-mini weights ship, under CC-BY-NC-SA-4.0 (HF card) |
| S2 Pro is the current flagship model | Confirmed | fish-speech GitHub |
| S2 Pro hits ~100ms time-to-first-audio | Likely (vendor README, single source) | fish-speech GitHub |
| Roughly 3.7x more expensive than Google standard TTS | Confirmed | Fish Audio blog, Google Cloud TTS pricing |
| TokenMix serves Fish Audio or any TTS model | False | TokenMix relays text, image and video models, not audio (TokenMix models) |
The short answer: if you need a distinctive cloned voice with emotion control and you can absorb a premium per-character rate, Fish Audio is one of the best. If you need cheap, plain narration at scale, Google or a budget neural voice wins on cost.
What Fish Audio Actually Is
Fish Audio is one company running two brand surfaces and several models, which is the first thing that confuses buyers. The consumer platform lives at fish.audio, the open-research brand is OpenAudio, and the code repository is fishaudio/fish-speech with around 30,000 GitHub stars (fish-speech GitHub).
| Model | Params | Open weights? | Released | Notes |
|---|---|---|---|---|
| S2 Pro | 4B + 400M fast decoder | No (proprietary) | 2026 (current) | API-only flagship, lowest latency |
| OpenAudio S1 | 4B | No (proprietary) | Jun 2025 | TTS-Arena-V2 #1 at launch |
| OpenAudio S1-mini | 0.5B (distilled) | Yes, CC-BY-NC-SA-4.0 | Jun 2025 | Gated download, non-commercial only |
| Fish Speech v1.5 | not disclosed | Open code, older | May 2025 | Earlier generation, still benchmarked |
The practical split: S1 is the model that made Fish Audio famous in June 2025, S2 Pro is the newer flagship documented mainly through the vendor's own GitHub README, and only S1-mini gives you downloadable weights — and those are non-commercial. So the honest description is "source-available with one small open model," not "fully open source." For a broader market view, see the TTS API comparison and the realtime voice API roundup.
API Pricing Breakdown
Fish Audio bills TTS by the UTF-8 byte length of the input text, not by character count, which quietly raises the real cost for non-Latin languages. The headline rate is $15 per 1 million UTF-8 bytes for both S1 and S2 Pro (Fish Audio pricing docs).
| Service | Model id | Price | Unit |
|---|---|---|---|
| Text-to-speech | s1 / s2-pro | $15.00 | per 1M UTF-8 bytes |
| Transcription (ASR) | transcribe-1 | $0.36 | per audio hour |
| Voice design | voice-design-1 | $0.01 | per successful request |
The byte-billing detail matters. English is roughly one byte per character, so $15 per 1M bytes is close to $15 per 1M characters. Chinese, Japanese and Korean characters, however, take three bytes each in UTF-8 (four for rare supplementary-plane characters and most emoji), and Arabic letters take two, so the same script costs two to three times more to synthesize than its character count suggests. Rate limits scale with prepaid balance: 5 concurrent requests under $100, 15 at $100 or more, 50 at $1,000 or more, then enterprise custom (Fish Audio pricing docs). There is no free API tier, so the cheapest real test path is the consumer plan below or a small prepaid balance.
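To make the byte-versus-character gap concrete, here is a minimal sketch that measures UTF-8 length and applies the $15 per 1M bytes rate and the balance-based concurrency tiers quoted above. The function names are hypothetical helpers for illustration, not part of any Fish Audio SDK, and the sample strings are arbitrary.

```python
# Sketch only: rate and tiers are the figures quoted in this section;
# helper names are hypothetical and not part of any Fish Audio SDK.

TTS_USD_PER_MILLION_BYTES = 15.00


def utf8_bytes(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


def tts_cost_usd(text: str) -> float:
    """Estimated synthesis cost for one piece of text at the list rate."""
    return utf8_bytes(text) * TTS_USD_PER_MILLION_BYTES / 1_000_000


def concurrency_limit(prepaid_balance_usd: float) -> int:
    """Concurrent-request cap by prepaid balance (enterprise custom tiers not modelled)."""
    if prepaid_balance_usd >= 1_000:
        return 50
    if prepaid_balance_usd >= 100:
        return 15
    return 5


# Compare character count with billed bytes for a few scripts.
for sample in ["hello", "你好", "مرحبا"]:
    chars = len(sample)
    billed = utf8_bytes(sample)
    print(f"{sample!r}: {chars} chars, {billed} bytes, {billed / chars:.1f} bytes/char")
```

Running the same measurement on a representative sample of your own scripts gives a more reliable budget than counting characters.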
Consumer Plans: Free, Plus, Pro
For non-developers, Fish Audio sells a credit-based subscription, and for heavy volume it is more expensive per minute than the API. The free web tier does allow commercial use, which is unusual (Fish Audio plans).
| Plan | Price | Credits / month | Approx minutes | Chars per generation |
|---|---|---|---|---|
| Free | $0 | 8,000 | up to 7 min | 500 |
| Plus | $11/mo ($132/yr) | 250,000 | ~200 min | 15,000 |
| Pro | $75/mo ($900/yr) | 2,000,000 | ~1,620 min | 30,000 |
Run the per-minute math and the gap is clear. Plus works out to about $0.055 per minute, or $3.30 per hour of generated audio, and Pro to roughly $0.046 per minute, or $2.78 per hour. The pay-as-you-go API, at near $0.75 to $1.25 per audio hour, is the cheaper route for any serious volume — the subscription mainly buys a UI, voice library access, and longer per-generation limits. Credits reset monthly and do not roll over (Fish Audio plans).
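The per-hour comparison can be reproduced with a few lines of arithmetic. The sketch below uses the plan prices and approximate minutes from the table and the "1M bytes is roughly 180,000 English words" figure. The 150 and 250 words-per-minute speaking rates are assumptions, back-derived so that they bracket the quoted $0.75 to $1.25 range; the function names are illustrative.

```python
# Sketch only: prices and minute estimates come from the plan table;
# the speaking rates are assumptions chosen to bracket the quoted range.

PLANS = {
    "Plus": {"usd_per_month": 11.0, "approx_minutes": 200},
    "Pro": {"usd_per_month": 75.0, "approx_minutes": 1620},
}

API_USD_PER_MILLION_BYTES = 15.0
WORDS_PER_MILLION_BYTES = 180_000  # rough English figure quoted in the review


def plan_usd_per_hour(usd_per_month: float, approx_minutes: float) -> float:
    """Subscription cost per hour of generated audio if every credit is used."""
    return usd_per_month / approx_minutes * 60


def api_usd_per_hour(words_per_minute: float) -> float:
    """Pay-as-you-go cost per hour of English speech at a given speaking rate."""
    bytes_per_word = 1_000_000 / WORDS_PER_MILLION_BYTES
    bytes_per_hour = words_per_minute * 60 * bytes_per_word
    return bytes_per_hour * API_USD_PER_MILLION_BYTES / 1_000_000


for name, plan in PLANS.items():
    print(f"{name}: ${plan_usd_per_hour(**plan):.2f} per audio hour")
for wpm in (150, 250):
    print(f"API at {wpm} wpm: ${api_usd_per_hour(wpm):.2f} per audio hour")
```

The subscription figures assume the full monthly credit allowance is consumed; unused credits make the effective hourly cost higher.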
Pricing vs ElevenLabs, Google, Amazon
Fish Audio sits at the premium end of the TTS market, priced near Amazon's neural voices and well above Google's standard tier. The comparison below uses each provider's published list price (Fish Audio blog, Google Cloud TTS pricing).
| Provider | Tier | Price | Unit |
|---|---|---|---|
| Fish Audio | S1 / S2 Pro | $15.00 | per 1M bytes |
| Google Cloud TTS | Standard | $4.00 | per 1M chars |
| Google Cloud TTS | WaveNet / Neural2 | $16.00 | per 1M chars |
| Amazon Polly | Standard | $4.00 | per 1M chars |
| Amazon Polly | Neural | $16.00 | per 1M chars |
| ElevenLabs | Creator/Pro | subscription | per-credit tiers |
The takeaway: Fish Audio is roughly 3.7x the cost of Google or Amazon standard voices, and about on par with their top neural tiers. You are paying for cloning quality and expressiveness, not for cheap bulk speech. If your use case is plain IVR prompts or large undifferentiated narration, the standard tiers are hard to beat on cost; if you need a recognizable, emotional, cloned voice, the premium is the point. For the speech-to-text side, compare against Whisper API pricing.
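Because Fish Audio bills per byte and the others bill per character, the ratio depends on the script being synthesized. The sketch below converts the per-byte rate into an effective per-character rate and divides it by the list prices from the table. The helper names are illustrative, and the 3.0 bytes-per-character value used for CJK text is an assumption about typical UTF-8 encoding, not a vendor figure.

```python
# Sketch only: list prices are the ones in the comparison table;
# avg_bytes_per_char is an assumption the caller supplies.

FISH_USD_PER_MILLION_BYTES = 15.0

COMPETITOR_USD_PER_MILLION_CHARS = {
    "Google Cloud TTS Standard": 4.0,
    "Google Cloud TTS Neural2": 16.0,
    "Amazon Polly Standard": 4.0,
    "Amazon Polly Neural": 16.0,
}


def fish_usd_per_million_chars(avg_bytes_per_char: float) -> float:
    """Effective Fish Audio price per 1M characters for a given script."""
    return FISH_USD_PER_MILLION_BYTES * avg_bytes_per_char


def price_ratios(avg_bytes_per_char: float) -> dict:
    """Fish Audio price divided by each competitor's price (above 1 means Fish costs more)."""
    fish = fish_usd_per_million_chars(avg_bytes_per_char)
    return {name: fish / price for name, price in COMPETITOR_USD_PER_MILLION_CHARS.items()}


for label, bytes_per_char in [("English (1 byte/char)", 1.0), ("CJK (3 bytes/char)", 3.0)]:
    print(label)
    for name, ratio in price_ratios(bytes_per_char).items():
        print(f"  vs {name}: {ratio:.2f}x")
```

For English the standard-tier ratio comes out at 3.75, which the review rounds to roughly 3.7x; for three-byte scripts the same comparison is three times less favorable.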
Quality Benchmarks
On accuracy and human preference, OpenAudio S1 is genuinely top-tier, with the strongest evidence on word error rate and the TTS-Arena leaderboard. Every figure below is tied to a named source; the ELO figures come from different leaderboards and should not be subtracted from each other.
| Benchmark | Score | Model | Status | Source |
|---|---|---|---|---|
| Seed-TTS-Eval WER (English) | 0.008 (0.8%) | S1 | Confirmed | OpenAudio S1 blog |
| Seed-TTS-Eval CER (English) | 0.004 | S1 | Confirmed | OpenAudio S1 blog |
| Seed-TTS-Eval WER | 0.011 | S1-mini | Confirmed | HF S1-mini card |
| TTS-Arena-V2 human ranking | #1 at launch | S1 | Confirmed | OpenAudio S1 blog |
| Artificial Analysis ELO | ~1,074 | S1 | Likely | Artificial Analysis |
| Chinese WER | 0.54% | S2 Pro | Likely (vendor README) | fish-speech GitHub |
| English WER | 0.99% | S2 Pro | Likely (vendor README) | fish-speech GitHub |
| Time-to-first-audio | ~100ms | S2 Pro | Likely (vendor README) | fish-speech GitHub |
A 0.8% word error rate on Seed-TTS-Eval is close to the top of any open or commercial TTS, and a verified #1 on TTS-Arena-V2 is human-preference evidence, not a vendor self-test. The S2 Pro latency and WER figures, by contrast, come only from Fish Audio's own GitHub README, so treat them as vendor-reported until a third party replicates them. No reliable Mean Opinion Score is published, so this review does not quote one.
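For readers unfamiliar with the metric, word error rate is the number of word-level substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of reference words. In TTS evaluation the transcript typically comes from running a speech recognizer over the synthesized audio. The sketch below is a generic implementation of that definition; it does not reproduce Seed-TTS-Eval's own text normalization or recognition pipeline, and the function name is illustrative.

```python
# Sketch only: generic word-level edit distance, not the Seed-TTS-Eval pipeline.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A score of 0.008 therefore means about 8 word errors per 1,000 reference words.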
Voice Cloning, Languages and Capabilities
Fish Audio's differentiator is fast, expressive zero-shot voice cloning, not raw language count. It clones a usable voice from 10 to 30 seconds of reference audio and supports inline emotion and tone markers (Fish Audio).
| Capability | Detail | Status |
|---|---|---|
| Voice cloning | Zero-shot from 10-30s reference | Confirmed |
| Emotion / tone control | 50+ markers plus vocalizations (laugh, sob, whisper) | Confirmed |
| Languages (S1) | 13 languages incl. EN, ZH, JA, DE, FR, ES, KO | Confirmed |
| Languages (S2 Pro) | "80+" claimed | Likely (vendor README) |
| Multi-speaker dialogue | Multi-turn, multi-speaker supported | Confirmed |
| Streaming | Streaming output supported | Confirmed |
The S1 language list of 13 is the confirmed figure; the "80+ languages" claim attaches only to S2 Pro through the vendor README, so do not promise a customer 80 languages without testing the specific one you need. The emotion markers are the standout practical feature — most TTS APIs give you a flat read, while Fish Audio lets you script laughter, whispering, and tonal shifts inline.
Cost per Task
Modeling three real workloads makes the premium concrete. All figures use the $15 per 1M bytes API rate and approximate 1 byte per English character.
| Task | Volume | Fish Audio API cost | Note |
|---|---|---|---|
| Audiobook (one book) | ~100,000 words | ~$8.33 | ~556K bytes at 5.56 bytes/word |
| IVR / notification voice | 10,000 messages x ~120 chars | ~$18.00 | 1.2M bytes total |
| Podcast intro/outro pack | 50 clips x ~400 chars | ~$0.30 | 20K bytes total |
A full 100,000-word audiobook at about $8.33 in API cost is cheap in absolute terms. The same job on Google standard would run near $2.22 (about 556K characters at $4 per 1M), so Fish Audio's premium is roughly $6 per book: trivial if the cloned narrator voice is the product, meaningful if you are mass-producing generic audio. The IVR example at $18 for 10,000 prompts is where the premium starts to compound at scale, and where a standard (non-neural) voice at $4 per 1M characters may be the rational pick. For broader cost planning across providers, use the TTS API comparison.
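The three workloads reduce to one multiplication each. This sketch recomputes them from the $15 per 1M bytes rate, the 1-byte-per-English-character approximation, and the "1M bytes is roughly 180,000 words" figure; the Google standard comparison uses the $4 per 1M characters list price. Variable and function names are illustrative.

```python
# Sketch only: all inputs are the figures quoted in this review.

FISH_USD_PER_MILLION_BYTES = 15.0
GOOGLE_STANDARD_USD_PER_MILLION_CHARS = 4.0
BYTES_PER_ENGLISH_WORD = 1_000_000 / 180_000  # about 5.56


def cost_usd(units: float, usd_per_million: float) -> float:
    """Cost of a workload measured in bytes or characters at a per-million rate."""
    return units * usd_per_million / 1_000_000


WORKLOAD_BYTES = {
    "audiobook": 100_000 * BYTES_PER_ENGLISH_WORD,  # 100,000 words
    "ivr": 10_000 * 120,                            # 10,000 messages x 120 chars
    "podcast_pack": 50 * 400,                       # 50 clips x 400 chars
}

FISH_COSTS = {
    name: cost_usd(size, FISH_USD_PER_MILLION_BYTES)
    for name, size in WORKLOAD_BYTES.items()
}

# English text: one byte per character, so bytes double as a character count.
GOOGLE_AUDIOBOOK_COST = cost_usd(
    WORKLOAD_BYTES["audiobook"], GOOGLE_STANDARD_USD_PER_MILLION_CHARS
)

for name, usd in FISH_COSTS.items():
    print(f"{name}: ${usd:.2f}")
print(f"audiobook on Google standard: ${GOOGLE_AUDIOBOOK_COST:.2f}")
```

Swapping in a measured byte count for non-English text keeps the same structure and avoids underestimating multi-byte scripts.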
How to Access and Self-Host
You can reach Fish Audio three ways, but the two free routes carry non-commercial or research-only licenses. The hosted REST API is the production path; the open weights are limited to the small S1-mini model.
| Path | What you get | Best for | Caveat |
|---|---|---|---|
| Hosted REST API | S1, S2 Pro, ASR, voice design | Production apps | Premium price, own API schema |
| Hugging Face S1-mini | 0.5B weights, gated | Research, local tests | CC-BY-NC-SA-4.0, non-commercial |
| GitHub fish-speech | Code, Docker self-host | Tinkering, evaluation | Research license, not full commercial open source |
Two integration notes matter for developers. First, Fish Audio is not advertised as OpenAI-compatible, so it does not drop into an OpenAI SDK the way a chat model would — you integrate against its own API schema. Second, the repository license is a proprietary "Fish Audio Research License," and only S1-mini's weights are downloadable under a non-commercial Creative Commons license, so a commercial self-host of the flagship model is not on the table. If you are building a voice app, the typical architecture is an LLM for the text and a TTS like Fish Audio for the audio; TokenMix can serve the LLM brain through one OpenAI-compatible endpoint, but the speech synthesis itself stays with a dedicated TTS provider.
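The LLM-plus-TTS split described above can be sketched as two pluggable steps. The code below is a shape, not an integration: `generate_text` and `synthesize` are hypothetical callables you would implement yourself, the first wrapping a chat call to whichever OpenAI-compatible endpoint you use and the second wrapping Fish Audio's own API according to its developer docs. No real request schema is shown, and the stand-in functions exist only so the example runs offline.

```python
# Sketch only: both callables are hypothetical placeholders for real clients.
from typing import Callable

TextGenerator = Callable[[str], str]        # prompt -> reply text (the LLM)
SpeechSynthesizer = Callable[[str], bytes]  # text -> audio bytes (the TTS)


def voice_reply(
    prompt: str,
    generate_text: TextGenerator,
    synthesize: SpeechSynthesizer,
) -> bytes:
    """Produce spoken audio for a prompt by chaining an LLM and a TTS provider."""
    reply_text = generate_text(prompt)
    if not reply_text.strip():
        # Nothing to speak, so skip the billed synthesis call.
        raise ValueError("LLM returned empty text")
    return synthesize(reply_text)


# Offline stand-ins so the flow can be exercised without any network access.
def fake_llm(prompt: str) -> str:
    return f"You said: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder bytes, not real audio


audio = voice_reply("hello", fake_llm, fake_tts)
print(len(audio), "bytes of placeholder audio")
```

Keeping the two providers behind separate callables means the TTS vendor can be swapped without touching the LLM side, which matters given that Fish Audio uses its own schema.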
Where Fish Audio Loses
Fish Audio loses on price, true openness, and predictability of its newer claims. None of these are dealbreakers for the right use case, but they are real.
| Weak spot | Evidence | Pick instead |
|---|---|---|
| Premium price | $15/1M bytes vs $4 Google standard | Google / Amazon standard for plain speech |
| Not truly open source | Only S1-mini, non-commercial license | Fully open TTS if you need self-host rights |
| S2 Pro claims single-sourced | Latency/WER only in vendor README | Wait for third-party replication |
| Byte billing on CJK | ~3 bytes per character in UTF-8 | Char-billed provider for heavy CJK volume |
| No OpenAI-compatible API | Custom schema integration | Provider with OpenAI-style TTS endpoint |
| No API free tier | Prepaid balance required | Free-tier TTS for prototyping |
The pattern is consistent: Fish Audio is a quality-first, premium product. Where a project is cost-sensitive and the voice does not need to be distinctive, a standard-tier voice from Google or Amazon is the rational choice. Where the cloned voice is the differentiator, the premium and the licensing friction are usually acceptable.
Use Case Matrix
Use Fish Audio where voice quality and cloning are the product, and route plain bulk speech elsewhere.
| Use case | Fish Audio fit | Better alternative | Why |
|---|---|---|---|
| Cloned brand/character voice | Strong | None comparable on quality | Best-in-class cloning + emotion |
| Audiobook with custom narrator | Strong | Google / Amazon standard if cost-critical | Expressive long-form reads |
| Multilingual expressive content | Strong (test the language) | ElevenLabs for some langs | 13 confirmed langs, 80+ claimed |
| High-volume IVR / notifications | Medium | Google / Amazon standard | Premium price compounds at scale |
| Cheapest possible narration | Weak | Google / Amazon standard | 3.7x the cost |
| Fully self-hosted commercial TTS | Weak | Open-licensed TTS model | Only S1-mini, non-commercial |
| Speech-to-text / transcription | Medium | Dedicated ASR like Whisper | ASR exists but TTS is the strength |
If your real problem is routing and cost control across many AI models rather than picking one voice, pair this with the AI API gateway guide and the voice AI API roundup.
Final Recommendation
Use Fish Audio when a distinctive, emotional, cloned voice is the point and you can absorb a premium near $15 per 1M bytes. Choose Google or Amazon standard voices for cheap, plain, high-volume speech, prototype on the commercial-use free web tier before committing a prepaid balance, and treat S2 Pro's latency and language claims as vendor-reported until independent tests land. For developers, the pay-as-you-go API is cheaper per hour than the consumer subscription, so skip the Plus plan if you are building software rather than clicking in a UI.
FAQ
How much does the Fish Audio API cost?
Text-to-speech costs $15 per 1 million UTF-8 bytes for both S1 and S2 Pro, roughly 180,000 English words or near $0.75 to $1.25 per audio hour depending on speaking rate. Transcription is $0.36 per audio hour and voice design is $0.01 per successful request.
Is Fish Audio free?
There is no free API tier. The consumer web plan has a free level with 8,000 credits per month and commercial use allowed, but API access requires a prepaid balance. The only free download is the small S1-mini model, and its license is non-commercial.
Is Fish Audio open source?
Partly. The fish-speech code is on GitHub under a proprietary "Fish Audio Research License," and only the distilled S1-mini weights are downloadable, under a non-commercial CC-BY-NC-SA-4.0 license. The flagship S1 and S2 Pro models are API-only, so this is best described as source-available, not fully open source.
How good is Fish Audio's voice quality?
Top-tier. OpenAudio S1 ranked #1 on TTS-Arena-V2 human evaluations at launch and reaches 0.008 word error rate on Seed-TTS-Eval English. S2 Pro's reported error rates (0.99% English WER, 0.54% Chinese WER) currently come only from Fish Audio's own GitHub README.
How fast can Fish Audio clone a voice?
It performs zero-shot cloning from roughly 10 to 30 seconds of reference audio, and supports more than 50 emotion and tone markers including laughing, sobbing, and whispering.
Is Fish Audio cheaper than ElevenLabs or Google?
It is not cheaper than Google or Amazon standard voices, which run about $4 per 1M characters versus Fish Audio's $15 per 1M bytes. It sits near the top neural tiers of Google and Amazon (~$16 per 1M characters). ElevenLabs uses subscription credit tiers, so direct comparison depends on volume.
Does TokenMix offer Fish Audio?
No. TokenMix is an AI API relay for text, image, and video models and does not serve Fish Audio or any text-to-speech model. For a voice application, you can route the language model through TokenMix and call Fish Audio separately for the speech synthesis.
Which Fish Audio model should I use?
Use S2 Pro for the lowest latency and the newest quality, S1 if you want the well-documented and independently benchmarked model, and S1-mini only for non-commercial local experiments. Verify your specific target language before committing, since the broad language claims attach to S2 Pro.
About TokenMix
TokenMix.ai is an AI API relay that routes Claude, OpenAI, Gemini, DeepSeek, Qwen, and other large language models through a single OpenAI-compatible endpoint at https://api.tokenmix.ai/v1. Current model availability and per-token rates are listed on the pricing page and the model catalog. Integration uses the standard OpenAI SDK; details in the OpenAI compatibility reference. TokenMix relays text, image, and video models and does not currently serve text-to-speech, so this review is published as independent model intelligence, not a sales page.
Sources
- Fish Audio - Official site - product, voice cloning, capabilities
- OpenAudio - Introducing S1 launch blog - S1 quality claims, WER, TTS-Arena-V2 ranking
- Fish Audio Docs - Pricing and Rate Limits - API price, ASR price, rate limits
- Fish Audio - Plans and Pricing - consumer subscription tiers and credits
- fishaudio/fish-speech GitHub - S2 Pro, self-host, license, vendor benchmarks
- Hugging Face - openaudio-s1-mini model card - open weights, non-commercial license, S1-mini WER
- Artificial Analysis - Fish Audio TTS - third-party ELO and price/speed data
- Google Cloud Text-to-Speech pricing - competitor standard and neural pricing