Active · frontier · multimodal · Proprietary

Gemini 2.5 Pro TTS Preview

by Google · Released May 2025

Gemini 2.5 Pro TTS Preview is a multimodal model from Google that adds text-to-speech (TTS) capabilities to the Gemini 2.5 Pro reasoning model. It can generate spoken audio responses from text, enabling natural voice interactions. This preview model is designed for applications requiring high-quality, expressive speech synthesis.

Input cost

$1.25 per 1M tokens

Output cost

$10.00 per 1M tokens

Context window

1M tokens

Max output

8192 tokens

Modalities

text, audio

License

proprietary
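
As a rough illustration of how the listed prices translate into a per-request cost, the sketch below multiplies token counts by the input and output rates shown above. The function name `estimate_cost_usd` and the sample token counts are hypothetical, and the sketch assumes billing is strictly linear in tokens at the listed rates, which this page does not confirm.

```python
# Listed prices from this page, in USD per 1M tokens.
INPUT_USD_PER_MILLION = 1.25
OUTPUT_USD_PER_MILLION = 10.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the listed rates.

    Assumes purely linear, per-token billing (an assumption, not a
    statement of the provider's actual billing rules).
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 2,000 input tokens and 8,192 output tokens
    # (the listed maximum output).
    print(f"${estimate_cost_usd(2_000, 8_192):.4f}")
```

Because the output rate is eight times the input rate, output length dominates the estimate for most speech-generation requests.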

Capabilities

Text-to-Speech, Reasoning, Code Generation, Function Calling, Streaming, JSON Mode

Best For

Generating natural, expressive speech from text for conversational AI and voice applications.

Strengths

High-quality, natural-sounding speech synthesis
Leverages Gemini 2.5 Pro's strong reasoning capabilities
Supports long context up to 1M tokens

Limitations

Preview release; language support may be limited
No vision or image input
Higher output pricing compared to text-only models

Use Cases

Voice assistants and chatbots

Audiobook and content narration

Accessibility tools for visually impaired users

Language learning and pronunciation practice

Interactive voice response (IVR) systems

Real-time voice translation

Podcast and media production

Improvements Over Previous Model

Adds native text-to-speech output capability to Gemini 2.5 Pro
Enables direct audio generation without external TTS integration
Supports expressive speech with natural prosody and intonation
Maintains the 1M token context window of Gemini 2.5 Pro
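
To make direct audio generation concrete, here is a minimal sketch of requesting speech and saving it to a WAV file. None of the specifics come from this page: the sketch assumes the `google-genai` Python SDK, the model ID `gemini-2.5-pro-preview-tts`, the prebuilt voice name `Kore`, and that the response carries raw 16-bit mono PCM at 24 kHz. Check each of these against the current API documentation before relying on them. The helper names `save_pcm_as_wav` and `synthesize` are hypothetical.

```python
import wave


def save_pcm_as_wav(path, pcm, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM bytes in a WAV container.

    The defaults (24 kHz, mono, 16-bit) are assumptions about the audio
    format returned by the API, not values stated on this page.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)


def synthesize(text, voice_name="Kore", model="gemini-2.5-pro-preview-tts"):
    """Request spoken audio for `text` and return the raw audio bytes.

    The model ID and voice name are assumed values for illustration.
    """
    # Imported lazily so the WAV helper works without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client()  # expects an API key in the environment
    response = client.models.generate_content(
        model=model,
        contents=text,
        config=types.GenerateContentConfig(
            response_modalities=["AUDIO"],
            speech_config=types.SpeechConfig(
                voice_config=types.VoiceConfig(
                    prebuilt_voice_config=types.PrebuiltVoiceConfig(
                        voice_name=voice_name
                    )
                )
            ),
        ),
    )
    # Assumes the first part of the first candidate holds the audio payload.
    return response.candidates[0].content.parts[0].inline_data.data
```

With an API key configured, a call such as `save_pcm_as_wav("out.wav", synthesize("Say cheerfully: have a wonderful day!"))` would write the generated speech to disk, with no separate TTS service in the pipeline.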
