
LJ Speech Dataset

The industry-standard public domain dataset for neural text-to-speech synthesis and voice modeling.

LJ Speech is a foundational public-domain speech dataset released by Keith Ito in 2017, and it remains the gold-standard benchmark for evaluating single-speaker neural text-to-speech (TTS) models in 2026. The dataset consists of 13,100 short audio clips of a single female speaker reading passages from seven non-fiction books. Technically, the collection provides approximately 24 hours of audio recorded at 22,050 Hz in 16-bit mono PCM, accompanied by normalized and non-normalized transcriptions in CSV format. Its significance in the AI market lies in its role as a control variable: because the recording environment and speaker characteristics are consistent, researchers use it to isolate the performance of new architectures such as Tacotron 2, FastSpeech, and HiFi-GAN. In 2026, it serves as a primary baseline for zero-shot cross-lingual transfer learning and as a pre-training corpus for more complex multi-speaker generative models. Its public-domain (CC0) status keeps it the most legally frictionless dataset for commercial and academic AI development.
24 hours of audio from the same female narrator ensures minimal variance in pitch, tone, and recording environment.
22,050 Hz sampling rate provides the industry-standard frequency response for clear human speech synthesis.
The dataset is dedicated to the public domain, waiving all copyright interests worldwide.
Metadata includes both the original text and a normalized version (e.g., '19th' to 'nineteenth').
Native data loaders available in Torchaudio, TensorFlow Datasets (TFDS), and Hugging Face Datasets.
The text-to-audio alignment has been manually checked for higher precision than automated alignments.
Text source from non-fiction books ensures a wide range of phonemes and complex sentence structures.
Download the compressed archive (LJSpeech-1.1.tar.bz2) from the official repository or a mirror.
Verify the MD5 checksum to ensure data integrity of the 2.6 GB file.
Extract the archive to access the 'wavs' directory and 'metadata.csv' file.
Parse metadata.csv using UTF-8 encoding to map filenames to their respective transcriptions.
Pre-process audio files: apply silence trimming and peak normalization if required by your model.
Convert transcriptions to phonemes or character sequences using a G2P (Grapheme-to-Phoneme) tool.
Partition the 13,100 samples into Training (12,500), Validation (300), and Test (300) sets.
Compute Mel-spectrograms from the WAV files as features for the acoustic model.
Initialize the training loop using frameworks like PyTorch or TensorFlow with the LJSpeech data loader.
Run inference on the test set and evaluate using Mean Opinion Score (MOS) or Word Error Rate (WER).
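The metadata-parsing step above can be sketched in a few lines of Python. Each row of metadata.csv is pipe-delimited with three fields (clip ID, raw transcription, normalized transcription), and CSV quoting must be disabled because the transcripts contain literal quote characters. The sample rows below are illustrative stand-ins, not guaranteed verbatim corpus content.

```python
import csv
import io

def load_metadata(file_obj):
    """Parse LJ Speech metadata rows of the form
    file_id|raw_transcription|normalized_transcription."""
    rows = {}
    # Transcripts contain literal quotes, so disable CSV quoting entirely.
    reader = csv.reader(file_obj, delimiter="|", quoting=csv.QUOTE_NONE)
    for file_id, raw, normalized in reader:
        rows[file_id] = {"raw": raw, "normalized": normalized}
    return rows

# In practice: load_metadata(open("metadata.csv", newline="", encoding="utf-8"))
sample = io.StringIO(
    "LJ001-0001|Printing, in the only sense|Printing, in the only sense\n"
    "LJ001-0008|has never been surpassed.|has never been surpassed.\n"
)
meta = load_metadata(sample)
print(len(meta))  # 2
```

Keying the dictionary by clip ID makes it trivial to join each transcription with its WAV file in the 'wavs' directory.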
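For models that consume character sequences rather than phonemes, the conversion step reduces to a lookup table. A minimal sketch follows; the symbol inventory here is an assumption for illustration, not the dataset's prescribed alphabet.

```python
# Reserve index 0 for padding; the remaining symbols are an assumed
# lowercase-plus-punctuation inventory chosen for illustration.
SYMBOLS = ["<pad>"] + list("abcdefghijklmnopqrstuvwxyz !'(),-.:;?")
CHAR_TO_ID = {c: i for i, c in enumerate(SYMBOLS)}

def text_to_sequence(text):
    """Map normalized text to integer IDs, skipping unknown characters."""
    return [CHAR_TO_ID[c] for c in text.lower() if c in CHAR_TO_ID]

seq = text_to_sequence("Nineteenth century")
```

A real pipeline would swap this lookup for a G2P tool when training phoneme-based models, but the interface (text in, integer sequence out) stays the same.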
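The partitioning step can be made reproducible with a seeded shuffle. The sketch below uses the 12,500 / 300 / 300 split sizes from the steps above; the seed value is an arbitrary choice.

```python
import random

def split_ljspeech(file_ids, n_val=300, n_test=300, seed=1234):
    """Deterministically shuffle clip IDs and carve off validation and
    test sets, leaving the remainder (12,500 for the full corpus) for
    training."""
    ids = sorted(file_ids)            # fix ordering before shuffling
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test

# Stand-in IDs; with the real corpus these come from metadata.csv.
ids = [f"LJ{i:06d}" for i in range(13100)]
train, val, test = split_ljspeech(ids)
```

Sorting before the seeded shuffle guarantees the same split regardless of the order in which the filesystem enumerates the clips.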
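The Mel-spectrogram step can be sketched with NumPy alone: frame the waveform, window it, take the FFT, and project power spectra onto triangular mel filters. The 1024-sample frames, 256-sample hop, and 80 mel bands below are common TTS defaults, not requirements of the dataset.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=22050, n_fft=1024, n_mels=80, fmin=0.0, fmax=8000.0):
    """Triangular mel filters laid over the FFT bin frequencies."""
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        if center > left:    # rising slope of the triangle
            fb[i, left:center] = (np.arange(left, center) - left) / (center - left)
        if right > center:   # falling slope of the triangle
            fb[i, center:right] = (right - np.arange(center, right)) / (right - center)
    return fb

def mel_spectrogram(wav, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Frame, window, FFT, then project power spectra onto mel filters."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    mels = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(np.maximum(mels, 1e-10)).T  # shape: (n_mels, n_frames)

# One second of a 440 Hz tone as a stand-in for a real 22,050 Hz clip.
t = np.arange(22050) / 22050.0
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Production pipelines typically reach for torchaudio or librosa instead, but this spells out what those libraries compute under the hood.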
"Universally acclaimed as the essential dataset for entry into speech synthesis research. Praised for its cleanliness and license."
