Sourcify
Effortlessly find and manage open-source dependencies for your projects.

Enterprise-grade speech recognition framework for ultra-low latency, high-accuracy multilingual transcription.

FunASR is a fundamental speech recognition toolkit developed by Alibaba DAMO Academy's Speech Lab, engineered to bridge the gap between academic research and production-grade industrial applications. Its core architecture is built on the Paraformer model, a non-autoregressive transformer that achieves state-of-the-art accuracy while significantly reducing inference latency compared to traditional RNN-T or Whisper-based models.

The framework is highly modular, integrating voice activity detection (VAD) via FSMN-VAD, punctuation restoration through CT-Transformer, and speaker diarization using the CAM++ model. FunASR is optimized for long-form audio processing and real-time streaming, and offers hotword customization (Seaco-Paraformer) to handle technical jargon and proper nouns. By supporting deployment across ONNX, TensorRT, and various edge devices, it gives enterprises a privacy-first, self-hosted alternative to proprietary APIs.

FunASR is particularly strong in the Asia-Pacific market thanks to its handling of Mandarin-English code-switching and diverse Chinese dialects, making it a critical asset for global enterprises targeting cross-border communication and localized customer service automation.
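As a minimal sketch of the basic workflow, assuming FunASR is installed via pip and that "meeting.wav" is a local audio file (the "paraformer-zh" model alias follows FunASR's documentation but may vary between releases):

```python
# Minimal transcription sketch. Assumes `pip install funasr` has been run
# and that the audio file exists locally; the "paraformer-zh" model alias
# is taken from FunASR's docs and may differ across releases.

def transcribe(path):
    # Import deferred so this file can be inspected without funasr installed.
    from funasr import AutoModel
    model = AutoModel(model="paraformer-zh")  # non-autoregressive Paraformer
    result = model.generate(input=path)       # returns a list of result dicts
    return result[0]["text"]

# Usage (requires funasr and an audio file):
#   print(transcribe("meeting.wav"))
```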
Paraformer: a non-autoregressive end-to-end speech recognition model that predicts all tokens in parallel.
Hotword customization (Seaco-Paraformer): a specialized bias mechanism that lets the model prioritize specific keywords and entities supplied at runtime.
CT-Transformer: a Controllable Time-delay Transformer for real-time punctuation restoration and inverse text normalization.
FSMN-VAD: Feed-forward Sequential Memory Network based voice activity detection.
CAM++: Context-Aware Masking based speaker-embedding extraction for "who spoke when" identification.
Deployment export: native support for exporting models to optimized inference engines such as ONNX and TensorRT.
Code-switching: unified modeling of mixed-language audio streams, particularly Mandarin and English.
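The modules above can be wired together through a single AutoModel pipeline. A hedged sketch, assuming FunASR is installed; the model aliases ("fsmn-vad", "ct-punc", "cam++") and the space-separated hotword format follow FunASR's documentation but should be verified against your installed version:

```python
# Pipeline sketch combining ASR, VAD, punctuation, diarization, and hotwords.
# Model aliases and the hotword format are assumptions based on FunASR docs.

def format_hotwords(words):
    # FunASR's documented hotword input is a single space-separated string.
    return " ".join(words)

def build_pipeline():
    from funasr import AutoModel  # deferred import; requires `pip install funasr`
    return AutoModel(
        model="paraformer-zh",   # Paraformer ASR backbone
        vad_model="fsmn-vad",    # FSMN-VAD segmentation / silence removal
        punc_model="ct-punc",    # CT-Transformer punctuation restoration
        spk_model="cam++",       # CAM++ speaker diarization
    )

# Usage (requires funasr and an audio file):
#   model = build_pipeline()
#   result = model.generate(
#       input="meeting.wav",
#       hotword=format_hotwords(["FunASR", "Paraformer"]),
#   )
#   # With spk_model set, results typically carry per-sentence speaker labels.
#   for seg in result[0].get("sentence_info", []):
#       print(seg.get("spk"), seg.get("text"))
```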
Set up a Python 3.8+ environment with PyTorch 1.12+.
Install the core library via 'pip install funasr'.
Install ModelScope to access pre-trained model weights.
Initialize the AutoModel pipeline for a specific task (e.g., ASR, VAD).
Load the Paraformer-v2 model for high-speed non-autoregressive transcription.
Integrate FSMN-VAD for silence removal and audio segmenting.
Configure CT-Transformer for intelligent punctuation and capitalization.
Apply CAM++ for multi-speaker identification and labeling.
Optimize for production using the FunASR-runtime-SDK via Docker.
Export the final model to ONNX or TensorRT for high-throughput inference.
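The final export step above can be sketched as follows, assuming the AutoModel.export API documented in recent FunASR releases (check the type and quantize arguments against your installed version):

```python
# ONNX export sketch; AutoModel.export and its arguments are assumptions
# based on FunASR documentation and may differ across releases.

def export_onnx(model_name, quantize=False):
    from funasr import AutoModel  # requires `pip install funasr`
    model = AutoModel(model=model_name)
    # Writes an ONNX graph alongside the cached model weights and
    # returns the export path(s).
    return model.export(type="onnx", quantize=quantize)

# Usage (requires funasr and downloaded model weights):
#   export_onnx("paraformer-zh", quantize=True)
```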
Verified feedback from other users.
"Users praise its speed and superior accuracy for Chinese-English mixing, though some note the documentation can be complex for beginners."