A Generative Flow for Text-to-Speech via Monotonic Alignment Search, enabling fast, diverse, and controllable speech synthesis without external aligners.

Glow-TTS is a flow-based generative model for parallel text-to-speech (TTS) that eliminates the need for an external aligner. By combining the properties of normalizing flows with dynamic programming, it searches for the most probable monotonic alignment between text and the latent representation of speech. This yields robust TTS that generalizes to long utterances, while the generative flow enables fast, diverse, and controllable synthesis, achieving an order-of-magnitude speed-up over autoregressive models such as Tacotron 2. Glow-TTS also extends naturally to multi-speaker settings. The implementation is based on PyTorch and includes configurations for training, inference, and integration with the HiFi-GAN vocoder for improved audio quality.
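The alignment search described above can be sketched with a short dynamic program: given a matrix of log-likelihoods of each mel frame under each text token's latent distribution, it finds the monotonic path with maximal total likelihood. This is a minimal NumPy illustration of the idea, not the repository's Cython implementation; the function name and the assumption that the path starts at the first token and ends at the last are illustrative conventions.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most probable monotonic alignment via dynamic programming.

    log_p: (T_text, T_mel) log-likelihoods of each mel frame under
    each text token's latent distribution.
    Returns a (T_text, T_mel) 0/1 alignment matrix in which each mel
    frame is assigned to exactly one text token, monotonically.
    """
    T_text, T_mel = log_p.shape
    # Q[i, j] = best cumulative log-likelihood of any monotonic path
    # that assigns frame j to token i.
    Q = np.full((T_text, T_mel), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):
            stay = Q[i, j - 1]                       # keep the same token
            move = Q[i - 1, j - 1] if i > 0 else -np.inf  # advance one token
            Q[i, j] = log_p[i, j] + max(stay, move)
    # Backtrack from the last token at the last frame.
    A = np.zeros((T_text, T_mel), dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        A[i, j] = 1
        if i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    return A
```

Because each cell only looks at its left and upper-left neighbors, the search runs in O(T_text x T_mel) time, which is what makes aligner-free training tractable.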
Utilizes dynamic programming to find the most probable monotonic alignment between text and latent speech representations.
Employs flow-based generative models for efficient and diverse speech synthesis.
Leverages parallel processing to speed up the TTS process.
Supports integration with HiFi-GAN vocoder to reduce noise and improve audio quality.
Easily extended to a multi-speaker setting for diverse speech synthesis applications.
Inserts a blank token between any two input tokens to improve pronunciation.
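The blank-token trick above amounts to interleaving a reserved token ID with the input sequence. A minimal sketch (the function name and the choice of also padding both ends are assumptions for illustration, not necessarily the repository's exact helper):

```python
def intersperse(sequence, blank_id):
    """Insert a blank token between every pair of input tokens,
    and one at each end, e.g. [a, b] -> [blank, a, blank, b, blank]."""
    result = [blank_id] * (2 * len(sequence) + 1)
    result[1::2] = sequence  # original tokens land on the odd positions
    return result
```

The interspersed sequence is then fed to the encoder in place of the raw token IDs.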
Download and extract the LJ Speech dataset.
Rename or create a link to the dataset folder.
Initialize the WaveGlow submodule (git submodule init; git submodule update), then download the pretrained WaveGlow model and place it in the waveglow folder.
Build the monotonic alignment search code (Cython): cd monotonic_align; python setup.py build_ext --inplace.
Install the required environment: Python 3.6.9, PyTorch 1.2.0, Cython 0.29.12, librosa 0.7.1, NumPy 1.16.4, SciPy 1.3.0.
For mixed-precision training, install Apex (commit 37cdaf4).
Follow the training example: sh train_ddi.sh configs/base.json base.
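Taken together, the setup steps above can be run from the repository root roughly as follows. The dataset URL is the official LJ Speech download location; the symlink target path is a placeholder you should adjust to where you extracted the data.

```shell
# Download and extract the LJ Speech dataset (official URL).
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjf LJSpeech-1.1.tar.bz2

# Initialize the WaveGlow submodule (pretrained model must be
# downloaded separately and placed into waveglow/).
git submodule init
git submodule update

# Build the Cython monotonic alignment search extension.
(cd monotonic_align && python setup.py build_ext --inplace)

# Train with the base configuration.
sh train_ddi.sh configs/base.json base
```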