Overview

AudioSet is a large-scale dataset of manually annotated audio events, designed to provide a common evaluation task for audio event detection and a starting point for a comprehensive vocabulary of sound events. It consists of an expanding ontology of 632 audio event classes and a collection of over 2 million human-labeled 10-second sound clips drawn from YouTube videos. The ontology is structured as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments, and common environmental sounds. The data collection process involves human annotators verifying the presence of sounds within YouTube segments nominated based on metadata and content-based search. Machine-extracted features are available for download alongside the dataset, facilitating machine learning model training and evaluation.
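Because the ontology is a hierarchical graph, a common first step is collecting every class that falls under a broader category. The sketch below illustrates that traversal on a tiny, made-up ontology. The node ids, the names, and the "child_ids" field layout are assumptions for illustration and should be checked against the actual ontology file before reuse.

```python
from collections import deque

# Hypothetical miniature ontology. The ids, names, and the "child_ids"
# field are illustrative assumptions, not entries copied from AudioSet.
TOY_ONTOLOGY = [
    {"id": "animal", "name": "Animal", "child_ids": ["pets"]},
    {"id": "pets", "name": "Domestic animals, pets", "child_ids": ["dog", "cat"]},
    {"id": "dog", "name": "Dog", "child_ids": ["bark"]},
    {"id": "bark", "name": "Bark", "child_ids": []},
    {"id": "cat", "name": "Cat", "child_ids": []},
]


def descendants(ontology, root_id):
    """Return the names of all classes below root_id, in breadth-first order."""
    by_id = {node["id"]: node for node in ontology}
    seen = set()  # guards against visiting a class twice if it has several parents
    queue = deque(by_id[root_id]["child_ids"])
    names = []
    while queue:
        node_id = queue.popleft()
        if node_id in seen:
            continue
        seen.add(node_id)
        names.append(by_id[node_id]["name"])
        queue.extend(by_id[node_id]["child_ids"])
    return names


if __name__ == "__main__":
    print(descendants(TOY_ONTOLOGY, "animal"))
```

A traversal like this can be used to map fine-grained labels onto a coarser parent category when training or evaluating a model on a subset of the vocabulary.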

Common tasks

Audio Event Detection
Sound Classification
Acoustic Scene Understanding