Overview
AudioLM is a Google Research framework that casts high-quality audio generation as a language modeling task. It maps input audio to a sequence of discrete tokens using a hybrid tokenization scheme: discretized activations of a masked language model pre-trained on audio capture long-term structure, while discrete codes from a neural audio codec enable high-fidelity synthesis. Trained on large corpora of raw audio waveforms, AudioLM generates natural, coherent continuations from short prompts. Given a speech prompt, it produces syntactically and semantically plausible continuations that preserve the speaker's identity and prosody, even for unseen speakers, without any transcripts or annotations. It can likewise generate coherent piano music continuations without any symbolic representation of music.
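The pipeline described above can be sketched in miniature. The toy below is an illustration only, not AudioLM's implementation: nearest-neighbour vector quantization stands in for the real tokenizers (a masked audio language model for semantic tokens, a neural codec for acoustic tokens), and a greedy bigram model stands in for the Transformer language model that predicts continuations over the token sequence. All names, frame sizes, and codebook sizes here are invented for the example.

```python
import numpy as np

def tokenize(wave, codebook, frame_len):
    # Toy nearest-neighbour vector quantization: one discrete token per frame.
    # AudioLM instead derives semantic tokens from a pre-trained masked audio
    # model and acoustic tokens from a neural codec; this is only a sketch.
    n = len(wave) // frame_len
    frames = wave[: n * frame_len].reshape(n, frame_len)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def continue_tokens(prompt, bigram_counts, n_new):
    # Greedy bigram continuation: a stand-in for the autoregressive
    # language model that predicts each next token from the history.
    out = list(prompt)
    for _ in range(n_new):
        out.append(int(np.argmax(bigram_counts[out[-1]])))
    return out

rng = np.random.default_rng(0)
wave = np.sin(np.linspace(0, 200 * np.pi, 16000))  # 1 s toy waveform at 16 kHz

# Coarse frames yield few tokens (long-term structure); fine frames yield many
# tokens (acoustic detail), mirroring the semantic/acoustic token split.
sem_codebook = rng.normal(size=(16, 400))   # hypothetical 25 ms frames
aco_codebook = rng.normal(size=(64, 80))    # hypothetical 5 ms frames
semantic = tokenize(wave, sem_codebook, 400)
acoustic = tokenize(wave, aco_codebook, 80)

# Fit bigram counts on the semantic stream, then continue from a short prompt.
counts = np.ones((16, 16))
for a, b in zip(semantic[:-1], semantic[1:]):
    counts[a, b] += 1
continuation = continue_tokens(semantic[:10], counts, n_new=20)
print(len(semantic), len(acoustic), len(continuation))  # → 40 200 30
```

In the full framework, generation runs in the opposite direction as well: predicted acoustic tokens are decoded back to a waveform by the codec, which is what makes language modeling over discrete tokens a route to raw audio synthesis.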
