How do I perform inference with HiFi-GAN?

For inference from a WAV file, use the `inference.py` script. For end-to-end speech synthesis, use the `inference_e2e.py` script.
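As a rough sketch, the two scripts could be launched from Python as shown below. Only the script names come from the answer above. The `--checkpoint_file` flag and the checkpoint path are assumptions for illustration, so check each script's `--help` output for the options it actually accepts.

```python
import subprocess
import sys

# Script names come from the answer above; everything else is an assumption.
SCRIPTS = {"wav": "inference.py", "e2e": "inference_e2e.py"}


def build_command(mode, checkpoint):
    """Return the argv list for one of the two inference scripts."""
    if mode not in SCRIPTS:
        raise ValueError(f"mode must be one of {sorted(SCRIPTS)}")
    # --checkpoint_file is an assumed flag name; verify it with --help.
    return [sys.executable, SCRIPTS[mode], "--checkpoint_file", checkpoint]


def run_inference(mode, checkpoint):
    """Run the chosen script; expects to be called from the repository root."""
    subprocess.run(build_command(mode, checkpoint), check=True)


if __name__ == "__main__":
    # Hypothetical checkpoint path, printed rather than executed.
    print(" ".join(build_command("wav", "path/to/generator_checkpoint")))
```

Building the argument list separately from running it makes it easy to inspect the command before anything is executed.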

HiFi-GAN

Overview

HiFi-GAN is a generative adversarial network (GAN) for efficient, high-fidelity speech synthesis. Earlier GAN-based speech synthesis methods often fall short of the audio quality of autoregressive or flow-based models. HiFi-GAN addresses this by modeling the periodic patterns inherent in speech audio to improve sample quality. Its generators and discriminators are optimized for audio waveforms, which allows fast audio generation. The model is implemented in PyTorch and is aimed at researchers and developers who want to improve the speed and quality of speech synthesis systems. Pretrained models are available for several datasets, including LJ Speech and VCTK, which enables quick experimentation and deployment.

Common tasks

Speech Synthesis, Mel-spectrogram Inversion, End-to-end Speech Synthesis

FAQ

What is HiFi-GAN?

HiFi-GAN is a generative adversarial network designed for efficient and high-fidelity speech synthesis. It focuses on modeling periodic patterns in audio to improve sample quality.

What datasets can I use with HiFi-GAN?

HiFi-GAN supports various datasets including LJ Speech and VCTK. Pretrained models are available for these datasets.

What are the hardware requirements for training HiFi-GAN?

Training HiFi-GAN requires a GPU with sufficient memory (e.g., a V100 GPU). Inference can be performed on both GPUs and CPUs.
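Because inference can run on either kind of device, a script might choose the device at startup. The sketch below assumes PyTorch, which the model is built on, and falls back to the CPU when PyTorch or CUDA is unavailable. `pick_device` is a hypothetical helper, not part of the HiFi-GAN code.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed to `torch.device(...)`, and the resulting device to `.to(device)` on the model and its input tensors.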

How can I fine-tune HiFi-GAN for a specific speaker?

Generate mel-spectrograms using Tacotron2 with teacher-forcing, create a dataset with matching audio and mel-spectrogram file names, and run the training script with the `--fine_tuning True` option.
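Since the audio and mel-spectrogram files must share file names, it can help to check the pairing before starting a fine-tuning run. The following is a minimal sketch using only the standard library. The `.wav` and `.npy` extensions are assumptions, and `find_unpaired` is a hypothetical helper rather than part of the training script.

```python
from pathlib import Path


def find_unpaired(wav_dir, mel_dir, wav_ext=".wav", mel_ext=".npy"):
    """Return (audio files without a mel, mels without an audio file).

    Both results are sorted lists of base names (file names without extension).
    The default extensions are assumptions; adjust them to match your dataset.
    """
    wav_stems = {p.stem for p in Path(wav_dir).glob(f"*{wav_ext}")}
    mel_stems = {p.stem for p in Path(mel_dir).glob(f"*{mel_ext}")}
    return sorted(wav_stems - mel_stems), sorted(mel_stems - wav_stems)
```

Calling `find_unpaired("wavs", "mels")` with your own directory names (the ones here are placeholders) returns two lists. Both should be empty before training starts.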
