
Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.
VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is an open-source, end-to-end TTS model that combines variational inference with adversarial training. It aims to produce more natural-sounding audio than traditional two-stage TTS systems, which train an acoustic model and a vocoder separately. The architecture is a conditional variational autoencoder whose latent distribution is augmented with normalizing flows, improving the expressiveness of the generative model. A stochastic duration predictor enables the synthesis of diverse speech rhythms from the same input text, capturing the one-to-many relationship between text and speech. Implemented in PyTorch, VITS supports single-stage training and parallel sampling, making it efficient for research and experimentation. It is aimed at researchers and developers building high-quality, expressive speech synthesis systems.
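To illustrate why a stochastic duration predictor captures the one-to-many text-to-speech mapping, here is a minimal, simplified sketch in pure Python. It is not the VITS implementation (which models durations with a flow-based network in PyTorch); the function names, the per-phoneme mean log-durations, and the Gaussian noise model are assumptions made for illustration. The idea shown is the core one: sampling durations from a distribution, rather than predicting a single value, yields a different rhythm on every draw.

```python
import math
import random

def sample_durations(log_dur_means, noise_scale=0.6, rng=None):
    # Illustrative stand-in for a stochastic duration predictor:
    # sample a log-duration per input token, exponentiate, and round
    # up to at least one frame. Different noise draws give different
    # speech rhythms for the same text.
    rng = rng or random.Random()
    durations = []
    for mu in log_dur_means:
        z = rng.gauss(0.0, 1.0) * noise_scale  # latent noise (assumed Gaussian)
        durations.append(max(1, round(math.exp(mu + z))))
    return durations

def length_regulate(token_states, durations):
    # Expand each token's hidden state to its sampled frame count,
    # producing the frame-level sequence the decoder would consume.
    frames = []
    for state, d in zip(token_states, durations):
        frames.extend([state] * d)
    return frames

rng = random.Random(0)
# hypothetical mean log-durations (in frames) for three phonemes
means = [math.log(3), math.log(5), math.log(2)]
d1 = sample_durations(means, rng=rng)
d2 = sample_durations(means, rng=rng)
frames = length_regulate(["a", "b", "c"], d1)
print(d1, d2, len(frames))
```

Setting `noise_scale=0.0` recovers deterministic durations at the distribution means, which mirrors how inference-time noise scales trade rhythm diversity for stability.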