by Microsoft · Released February 2025
Phi-4 Multimodal is a compact multimodal model from Microsoft that accepts text, image, and audio inputs. Part of the Phi-4 family, it is designed for on-device and edge scenarios, pairing strong performance on vision and speech tasks with a small footprint and low computational cost.
Input cost
Free (open source)
Output cost
Free (open source)
Context window
128K tokens
Max output
—
Modalities
Text, image, and audio input
License
MIT
Best suited to on-device and edge applications that need multimodal understanding with low latency and a small model size.