

A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation.

VoxPopuli is a large-scale multilingual speech corpus comprising 400K hours of unlabelled speech data for 23 languages, 1.8K hours of transcribed speech data for 16 languages, 17.3K hours of speech-to-speech interpretation data for 15x15 directions, and 29 hours of transcribed speech data of non-native English. The raw data originates from European Parliament event recordings spanning 2009-2020. The corpus supports representation learning, semi-supervised learning, and speech-to-speech interpretation research. Audio is stored as Ogg Vorbis (16000Hz, 16-bit, mono-channel), a format that common audio processing libraries such as libsndfile and libsox can read. The repository provides scripts for downloading, segmenting, and aligning the data. Pre-trained wav2vec 2.0 models are also available.
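As a rough illustration of the audio format described above, the following sketch checks whether a file matches the stated sample rate and channel count. It assumes the third-party soundfile package (a Python frontend to libsndfile) is installed and that the underlying libsndfile build can read Ogg Vorbis. The function names and the file path passed to check_file are hypothetical and are not part of the VoxPopuli repository.

```python
EXPECTED_SAMPLE_RATE = 16000  # Hz, as stated in the corpus description
EXPECTED_CHANNELS = 1         # mono-channel


def matches_voxpopuli_format(sample_rate, channels):
    """Return True if the values match the format stated in the description."""
    return sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS


def check_file(path):
    """Read the header of an audio file and compare it with the expected format.

    Assumption: `soundfile` is installed; it is imported lazily so the pure
    helper above can be used without the dependency.
    """
    import soundfile as sf  # third-party frontend to libsndfile

    info = sf.info(path)  # reads metadata only, not the audio samples
    return matches_voxpopuli_format(info.samplerate, info.channels)
```

Calling check_file on a segmented .ogg file would return True when the file is 16000 Hz mono; the bit depth is not checked here.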
Offers data in 23 languages, facilitating cross-lingual research and development.
Provides 400K hours of unlabelled speech data, suitable for unsupervised and semi-supervised learning techniques.
Includes 17.3K hours of speech-to-speech interpretation data across 15x15 directions, supporting research in speech-to-speech translation and interpretation.
Offers 29 hours of transcribed accented English speech data, designed for improving ASR robustness to various accents.
Provides raw audio files together with scripts for downloading, segmenting, and aligning the data.
Clone the repository: git clone https://github.com/facebookresearch/voxpopuli.git
Install required PyPI packages (from inside the cloned voxpopuli directory): pip install -r requirements.txt
Download raw audios: python -m voxpopuli.download_audios --root [ROOT] --subset [SUBSET]
Segment audios (unlabelled data): python -m voxpopuli.get_unlabelled_data --root [ROOT] --subset [SUBSET]
Segment audios and align with transcripts (ASR data): python -m voxpopuli.get_asr_data --root [ROOT] --lang [LANGUAGE]
Download target audios (speech-to-speech interpretation data): python -m voxpopuli.download_audios --root [ROOT] --subset [TARGET_LANGUAGE]
Segment target audios and match with source: python -m voxpopuli.get_s2s_data --root [ROOT] --source-lang [SOURCE_LANGUAGE] --target-lang [TARGET_LANGUAGE]
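The steps above can be sketched as a small Python helper that assembles the commands in order and, by default, only prints them. This is an illustrative wrapper, not part of the VoxPopuli repository: build_commands and run are hypothetical names, and only the module names and flags quoted in the steps are used. Subset and language values are passed through unchanged, so valid values still need to be taken from the repository's documentation.

```python
import shlex
import subprocess


def build_commands(root, subset, lang=None, target_lang=None):
    """Return the CLI calls from the steps above as argument lists.

    Without `lang`, the unlabelled-data path is built. With `lang`, the ASR
    path is built, and `target_lang` additionally adds the interpretation steps.
    """
    base = ["python", "-m"]
    # Download raw audios for the chosen subset.
    commands = [base + ["voxpopuli.download_audios", "--root", root, "--subset", subset]]
    if lang is None:
        # Unlabelled data: segment the downloaded audios.
        commands.append(
            base + ["voxpopuli.get_unlabelled_data", "--root", root, "--subset", subset]
        )
        return commands
    # ASR data: segment audios and align them with transcripts.
    commands.append(base + ["voxpopuli.get_asr_data", "--root", root, "--lang", lang])
    if target_lang is not None:
        # Interpretation data: download target audios, then match with source.
        commands.append(
            base + ["voxpopuli.download_audios", "--root", root, "--subset", target_lang]
        )
        commands.append(
            base + ["voxpopuli.get_s2s_data", "--root", root,
                    "--source-lang", lang, "--target-lang", target_lang]
        )
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step
```

For example, run(build_commands("[ROOT]", "[SUBSET]", lang="[LANGUAGE]")) prints the download and ASR alignment commands with the placeholders left in place. Passing dry_run=False would execute them from inside the cloned repository.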
Verified feedback from other users.
"Highly regarded for its scale and multilingual support, but requires significant preprocessing."