Overview
VoxPopuli is a large-scale multilingual speech corpus. It comprises 400K hours of unlabelled speech in 23 languages, 1.8K hours of transcribed speech in 16 languages, 17.3K hours of speech-to-speech interpretation data covering 15x15 directions, and 29 hours of transcribed non-native English speech. The raw data comes from European Parliament event recordings from 2009 to 2020. The corpus supports research on representation learning, semi-supervised learning, and speech-to-speech interpretation. Audio is stored as Ogg Vorbis (16,000 Hz, 16-bit, mono), a format that common audio processing libraries such as libsndfile and libsox can read. Scripts are provided for downloading, segmenting, and aligning the data, and pre-trained wav2vec 2.0 models are also available.
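To make the audio format concrete, the following minimal sketch works out what 16,000 Hz, 16-bit, mono audio amounts to once decoded to raw PCM. It uses only the format parameters stated above; the function names `pcm_bytes` and `duration_seconds` are hypothetical helpers introduced for illustration and are not part of the VoxPopuli scripts. The figures describe decoded audio only, since Ogg Vorbis files on disk are compressed and will be smaller.

```python
# Format parameters taken from the overview above.
SAMPLE_RATE_HZ = 16_000  # samples per second
BYTES_PER_SAMPLE = 2     # 16-bit samples
CHANNELS = 1             # mono


def pcm_bytes(duration_s: float) -> int:
    """Size in bytes of decoded (uncompressed) PCM audio of the given duration."""
    num_samples = int(round(duration_s * SAMPLE_RATE_HZ))
    return num_samples * BYTES_PER_SAMPLE * CHANNELS


def duration_seconds(num_samples: int) -> float:
    """Duration in seconds of a decoded mono segment with num_samples samples."""
    return num_samples / SAMPLE_RATE_HZ


if __name__ == "__main__":
    print(pcm_bytes(1))              # bytes per second of decoded audio
    print(pcm_bytes(3600))           # bytes per hour of decoded audio
    print(duration_seconds(48_000))  # seconds covered by 48,000 samples
```

One second of decoded audio therefore takes 32,000 bytes and one hour takes 115.2 MB, which is a useful rule of thumb when budgeting memory for segments loaded through a library such as libsndfile.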
