Active · Legacy · Multimodal · Open Source

Qwen-Audio

by Alibaba · Released August 2023 · Cutoff June 2023

Qwen-Audio is a large audio-language model developed by Alibaba Cloud, designed to process and understand various types of audio inputs including speech, music, and environmental sounds. It extends the Qwen series by incorporating audio understanding capabilities, enabling tasks such as audio captioning, sound event detection, and speech recognition. The model is part of Alibaba's open-source Qwen family, offering a unified framework for audio and text interactions.

Official Site 🤗 Hugging Face 📄 Research Paper

Input cost: Free (open source)
Output cost: Free (open source)
Context window: 8192 tokens
Max output: 2048 tokens
Modalities: text, audio
Parameters: not listed
License: Apache-2.0
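
The context window and max output figures listed above imply a rough budget for the prompt. The sketch below works through that arithmetic. It assumes that prompt tokens (text plus encoded audio) and generated tokens share the same 8192-token window, which is a common convention but is not stated on this page; `prompt_budget` is a hypothetical helper, not part of any Qwen-Audio API.

```python
# Figures taken from the spec listed above.
CONTEXT_WINDOW = 8192  # tokens
MAX_OUTPUT = 2048      # tokens


def prompt_budget(requested_output: int = MAX_OUTPUT) -> int:
    """Return the tokens left for the prompt after reserving room for the reply.

    Assumption: prompt and generated tokens share one context window.
    """
    if not 0 <= requested_output <= MAX_OUTPUT:
        raise ValueError(f"requested_output must be between 0 and {MAX_OUTPUT}")
    return CONTEXT_WINDOW - requested_output


if __name__ == "__main__":
    print(prompt_budget())     # full 2048-token reply reserved
    print(prompt_budget(512))  # shorter reply leaves more room for the prompt
```

Under that assumption, reserving the full 2048-token reply leaves 6144 tokens for the prompt, and a 512-token reply leaves 7680.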

Capabilities

Audio Understanding
Speech Recognition
Audio Captioning
Sound Event Detection
Music Understanding
Multilingual Audio Processing

Best For

Audio understanding and captioning tasks, including speech, music, and environmental sound analysis.

Strengths

Strong performance on diverse audio tasks
Supports multiple audio types (speech, music, sound events)
Open-source and freely available
Unified model for audio and text
Multilingual audio processing

Limitations

Limited context window (8K tokens)
No video understanding
May not handle very long audio sequences
Primarily focused on audio, not a general multimodal model
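
One common workaround for the long-audio limitation is to split a recording into fixed-length windows and process each window separately. The sketch below only computes the window boundaries. The 30-second default is an arbitrary placeholder, not a documented limit of Qwen-Audio, and `chunk_spans` is a hypothetical helper rather than part of the model's tooling.

```python
from typing import List, Tuple


def chunk_spans(
    duration_s: float,
    window_s: float = 30.0,   # placeholder window length (assumption)
    overlap_s: float = 0.0,   # optional overlap between neighbouring windows
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover [0, duration_s]."""
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")

    step = window_s - overlap_s
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        spans.append((start, end))
        if end >= duration_s:
            break  # the last window reached the end of the recording
        start += step
    return spans


if __name__ == "__main__":
    print(chunk_spans(70.0))
    # → [(0.0, 30.0), (30.0, 60.0), (60.0, 70.0)]
```

Each span can then be cut from the source file and sent to the model as its own request. Overlapping windows reduce the chance of splitting a word or sound event at a boundary, at the cost of some repeated work.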

Use Cases

Audio captioning for accessibility
Speech recognition and transcription
Sound event detection for surveillance
Music genre classification
Voice assistant integration
Audio content moderation
Multilingual audio translation

Improvements Over Previous Model

First audio-language model in the Qwen family
Introduces audio understanding to the Qwen series
Supports multiple audio types (speech, music, sound events)
Unified architecture for audio and text tasks
