
Deepdub
End-to-end AI localization and emotional voice cloning for studio-grade global distribution.

Next-generation open-source multilingual text-to-speech with state-of-the-art zero-shot voice cloning.

Fish Speech is a leading-edge open-source text-to-speech (TTS) system developed by Fish Audio. It uses a sophisticated architecture consisting of a VQ-GAN-based acoustic tokenizer and a Large Language Model (LLM) for semantic processing, a design that treats audio as another language ("Audio-as-a-Language"). This dual-stage approach lets the model capture high-fidelity nuances of human speech, including emotional prosody and breathing patterns, without the robotic artifacts common in traditional concatenative or parametric synthesis.

By 2026, Fish Speech has solidified its market position as the primary open-source alternative to proprietary systems such as ElevenLabs, offering comparable zero-shot cloning capabilities with significantly lower latency. The model supports eight core languages (English, Chinese, Japanese, German, French, Spanish, Korean, and Arabic) and lets developers fine-tune on custom datasets or deploy via highly optimized inference engines. Its operational utility spans from real-time gaming NPCs to automated localization workflows, backed by a permissive license and a robust community-driven ecosystem that continuously improves its parameter efficiency for edge deployment.
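The dual-stage pipeline described above can be sketched conceptually: a first stage maps text to a sequence of discrete acoustic tokens, and a second stage decodes those tokens back into a waveform. The function bodies below are toy stand-ins to show the data flow, not the real Fish Speech implementation.

```python
# Conceptual sketch of the two-stage "Audio-as-a-Language" pipeline.
# All function bodies are toy stand-ins, NOT the actual Fish Speech API.

from typing import List

def llm_predict_audio_tokens(text: str) -> List[int]:
    """Stage 1 (stand-in): an autoregressive LLM maps text (plus an optional
    reference-voice prompt) to a sequence of discrete acoustic tokens."""
    # Toy stand-in: one pseudo-token per character.
    return [ord(c) % 256 for c in text]

def vqgan_decode(tokens: List[int], frame_size: int = 4) -> List[float]:
    """Stage 2 (stand-in): the VQ-GAN decoder expands each discrete token
    back into waveform samples."""
    samples: List[float] = []
    for t in tokens:
        samples.extend([t / 255.0] * frame_size)  # flat toy "frames"
    return samples

def synthesize(text: str) -> List[float]:
    """Text in, waveform samples out: the end-to-end two-stage flow."""
    return vqgan_decode(llm_predict_audio_tokens(text))

waveform = synthesize("Hello")
print(len(waveform))  # 5 tokens * 4 samples per frame = 20
```

The real system's value lies in the learned models at each stage; the sketch only shows why the intermediate representation is a token sequence an LLM can operate on.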
Generates a voice clone from as little as 10 seconds of audio using an autoregressive Transformer architecture.
Supports multiple speaker tokens in a single inference session for dialogue generation.
Compresses audio into discrete tokens for processing by the LLM backbone, preserving high-frequency details.
Uses semantic context from the text to predict appropriate emotional weight and inflection.
Implements chunked transfer encoding for real-time audio delivery during synthesis.
Provides scripts for Supervised Fine-Tuning to adapt the model to specific domain vocabularies.
Enables a voice sampled in one language to speak text in another language fluently.
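The tokenization feature above (compressing audio into discrete tokens for the LLM backbone) rests on vector quantization: each audio frame vector is snapped to its nearest codebook entry, and only that entry's index is kept. The codebook and frame values below are made up for illustration; real codebooks are learned during VQ-GAN training.

```python
# Toy illustration of vector-quantized audio tokenization. The codebook and
# frames are invented; a real VQ-GAN learns the codebook from audio data.

def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def tokenize(frames, codebook):
    """Return the index of the nearest codebook vector for each frame."""
    return [
        min(range(len(codebook)), key=lambda i: squared_dist(f, codebook[i]))
        for f in frames
    ]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # 3 "learned" acoustic vectors
frames = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]]    # per-frame audio features
print(tokenize(frames, codebook))  # [0, 1, 2]
```

The resulting integer sequence is what the LLM backbone consumes and predicts, which is what makes the "Audio-as-a-Language" framing literal rather than metaphorical.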
Register for an API key at fish.audio or clone the official GitHub repository.
Ensure your local environment has Python 3.10+ and PyTorch installed for self-hosting.
Install dependencies using 'pip install -e .' within the cloned directory.
Download the pre-trained weights (SFT and VQ-GAN models) from Hugging Face.
Upload a reference audio clip (10-30 seconds) of the target voice for zero-shot cloning.
Initialize the inference engine using the provided CLI or Python script.
Configure sampling parameters: set 'top_p' to 0.7 and 'temperature' to 0.7 for optimal naturalness.
Pass the text prompt and reference audio path to the generation function.
Review generated audio and adjust prosody marks if necessary.
Deploy as a microservice using the provided Docker configuration for production scaling.
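The setup steps above can be tied together in a short sketch. The script path and flag names below are hypothetical placeholders, not the actual fish-speech CLI; check the official repository for the exact entry points. Only the sampling values (top_p and temperature at 0.7) come from the steps themselves.

```python
# Sketch of wiring the setup steps into a generation call. Script path and
# flag names are HYPOTHETICAL placeholders; consult the repository's docs.

import shlex

# Sampling parameters recommended in the steps above for natural output.
sampling = {"top_p": 0.7, "temperature": 0.7}

# Hypothetical CLI invocation for zero-shot cloning with a reference clip.
cmd = [
    "python", "tools/infer.py",            # placeholder script path
    "--text", "Hello from Fish Speech",
    "--reference-audio", "ref_voice.wav",  # 10-30 s clip of the target voice
    "--top-p", str(sampling["top_p"]),
    "--temperature", str(sampling["temperature"]),
]
print(shlex.join(cmd))
```

For production, the same parameters would typically be passed to the Dockerized microservice's API rather than invoked per-request on the command line.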
"Users praise its industry-leading zero-shot similarity and the fact that it is open-source, though some find the local setup challenging compared to SaaS competitors."


Scale your video production with hyper-realistic AI avatars and seamless voice cloning.

Preserve your voice or create a digital voice with Acapela's My-Own-Voice.

The foundational architecture for authentic digital twins and human-centric AI.

A voice content creation platform integrating voice morphing and AI technologies for media production and real-time applications.

The #1 platform for making high quality AI covers in seconds!