

The world's fastest CLI for OpenAI's Whisper, transcribing 150 minutes of audio in under 98 seconds.

insanely-fast-whisper is a specialized CLI and Python wrapper that maximizes the performance of OpenAI's Whisper models within the Hugging Face Transformers ecosystem. As of 2026, it remains a widely used option for high-throughput, local audio transcription. It combines Flash Attention-2 and Optimum-based optimizations with batched inference, so audio chunks are transcribed in parallel rather than one after another as in the standard sequential implementation. It is engineered for NVIDIA GPUs with the Ampere architecture (A10, A100) or newer (H100, B200), and uses half precision (float16) and batching to reach transcription speeds exceeding 30x real time. Because it builds on the Transformers 'pipeline' abstraction, it integrates speaker diarization via pyannote-audio and supports speculative decoding to further reduce latency. It suits developers who need high transcription throughput without the data privacy risks or recurring costs of proprietary SaaS APIs such as Deepgram or AssemblyAI.
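As a quick sanity check on the headline figure of 150 minutes of audio in under 98 seconds, the speed-up over real time is simply audio duration divided by wall-clock time. The snippet below is a minimal sketch of that arithmetic; `realtime_factor` is a hypothetical helper name used for illustration and is not part of the tool.

```python
def realtime_factor(audio_minutes: float, wall_seconds: float) -> float:
    """Return seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return (audio_minutes * 60.0) / wall_seconds


# Headline benchmark quoted above: 150 minutes of audio in 98 seconds.
factor = realtime_factor(150, 98)
print(f"{factor:.1f}x real time")  # prints "91.8x real time"
```

The benchmark figure therefore sits well above the 30x real-time floor, although actual throughput depends on the GPU, model size, and batch size.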
Related task categories: transcribing audio files, converting speech to text, and speaker diarization.
Flash Attention-2: Implements IO-aware exact attention to speed up self-attention in the Whisper transformer blocks.
Speculative decoding: Uses a smaller 'assistant' model (Whisper-tiny) to propose candidate tokens, which the larger 'target' model (Whisper-large) then verifies.
Speaker diarization: Integrates with pyannote.audio to identify 'who spoke when' in multi-speaker recordings.
Batched inference: Uses Hugging Face's pipeline batching to process multiple audio chunks concurrently on the GPU.
Word-level timestamps: Extracts alignment data from the attention heads to estimate per-word timing.
Quantization: Supports 4-bit and 8-bit quantization via bitsandbytes to run large models on consumer hardware.
Language detection: Identifies the spoken language from the first 30 seconds of audio and conditions the decoder on the matching language token.
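The assistant/target scheme in the list above can be illustrated with a toy example. This is a minimal sketch of greedy speculative decoding, not the tool's or Transformers' implementation: `target_model` and `draft_model` are hypothetical stand-ins for Whisper-large and Whisper-tiny, written as plain functions that map a token prefix to the next token id. The point it shows is that the output matches target-only greedy decoding, while the target is consulted only to verify drafted tokens.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token id


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference decoder: ask the model for one token at a time."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft proposes up to k tokens; target keeps the longest agreeing prefix."""
    if k < 1:
        raise ValueError("k must be at least 1")
    tokens = list(prompt)
    limit = len(prompt) + max_new
    while len(tokens) < limit:
        # 1) The cheap draft model proposes a short run of candidate tokens.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks each candidate. A real system scores all
        #    positions in one batched forward pass; here we loop for clarity.
        for i, guess in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != guess:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every candidate was accepted
    return tokens


# Toy stand-ins (assumptions for illustration only).
def target_model(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_model(tokens: List[int]) -> int:
    guess = target_model(tokens)
    # Disagree with the target at every fourth position.
    return guess if len(tokens) % 4 else (guess + 1) % 11


reference = greedy_decode(target_model, [1], max_new=12)
speculative = speculative_decode(target_model, draft_model, [1], max_new=12)
print(speculative == reference)  # prints True
```

The latency benefit in practice comes from step 2 being a single forward pass of the large model over several drafted tokens, which this sequential toy does not attempt to measure.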
Ensure Python 3.10+ is installed in a virtual environment.
Install NVIDIA drivers and CUDA Toolkit 12.x.
Install PyTorch with CUDA support (pip install torch).
Install the package via pip: pip install insanely-fast-whisper.
Install ffmpeg on the host system for media decoding.
Run 'nvidia-smi' to verify GPU availability and VRAM capacity.
Execute transcription using: insanely-fast-whisper --file-name <path_to_audio>.
Enable Flash Attention-2 with '--flash True' on Ampere or newer GPUs (requires the flash-attn package).
Optimize throughput by adjusting '--batch-size' (default is 24; 16GB VRAM recommended).
Use '--transcript-path' to choose where the JSON transcript is written (default: output.json); SRT subtitles require converting that JSON in a separate step.
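The final steps above can be scripted. The following is a minimal sketch that assembles the CLI invocation from Python; `build_command` and the file names are hypothetical, and the flag spellings (`--file-name`, `--batch-size`, `--flash`, `--transcript-path`) are assumed to match the installed version, so check them against `insanely-fast-whisper --help` before relying on them.

```python
import shlex
from typing import List, Optional


def build_command(file_name: str,
                  batch_size: int = 24,
                  flash: bool = False,
                  transcript_path: Optional[str] = None) -> List[str]:
    """Assemble an argument list for the insanely-fast-whisper CLI."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    cmd = ["insanely-fast-whisper",
           "--file-name", file_name,
           "--batch-size", str(batch_size)]
    if flash:
        # Assumed form of the Flash Attention-2 switch.
        cmd += ["--flash", "True"]
    if transcript_path is not None:
        cmd += ["--transcript-path", transcript_path]
    return cmd


# Hypothetical input and output paths, with a smaller batch for limited VRAM.
cmd = build_command("meeting.wav", batch_size=16, flash=True,
                    transcript_path="meeting.json")
print(shlex.join(cmd))

# To actually run it (needs the CLI installed, a CUDA GPU, and ffmpeg):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.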
Verified feedback from other users.
"Users praise the tool for its extreme efficiency and ease of use compared to the base Whisper implementation. Developers highlight the 'set it and forget it' nature of the CLI."
