Activefrontiermultimodal Proprietary

Pixtral Large

by Mistral AI· Released November 2024· Cutoff August 2024

Pixtral Large is Mistral AI's most advanced multimodal model, combining a 124 billion parameter decoder with a dedicated vision encoder. It excels at understanding text, images, and documents, and is designed for complex reasoning tasks that require both visual and textual understanding.

Official Site API Docs

Input cost

$2.00 per 1M tokens

Output cost

$6.00 per 1M tokens

Context window

128K tokens

Max output

—

Modalities

textimage

Parameters

124B

License

proprietary

Capabilities

Multimodal UnderstandingVisionDocument UnderstandingCode GenerationFunction CallingJSON ModeStreaming

Best For

Complex multimodal reasoning tasks involving images, documents, and text, such as chart analysis, document QA, and visual question answering.

Strengths

State-of-the-art multimodal performance on benchmarks like MathVista and DocVQA
Large 124B parameter model with dedicated vision encoder for high-quality image understanding
Supports interleaved image and text inputs for flexible prompting
128K context window allows processing of long documents and multiple images

Limitations

Very large model size may lead to higher latency and cost compared to smaller models
Not available as open-source; proprietary and API-only
May not be as optimized for pure text tasks compared to Mistral Large 2
Limited to image and text modalities; no audio or video support

Use Cases

Analyzing complex charts and graphs for financial reports

Extracting information from scanned documents and invoices

Visual question answering for educational or customer support

Multimodal code generation from screenshots or diagrams

Content moderation of images and text

Automated document summarization with visual context

Building multimodal chatbots for e-commerce or healthcare

Improvements Over Previous Model

First multimodal model in the Mistral Large family, adding native vision support
Larger parameter count (124B vs 123B of Mistral Large 2) with dedicated vision encoder
Achieves state-of-the-art results on multimodal benchmarks like MathVista and DocVQA
Supports interleaved image and text inputs, enabling more flexible prompting
128K context window matches Mistral Large 2, allowing long document processing

Back to all models

Strengths

State-of-the-art multimodal performance on benchmarks like MathVista and DocVQA

Large 124B parameter model with dedicated vision encoder for high-quality image understanding

Supports interleaved image and text inputs for flexible prompting

128K context window allows processing of long documents and multiple images

Use Cases

Analyzing complex charts and graphs for financial reports

Extracting information from scanned documents and invoices

Visual question answering for educational or customer support

Multimodal code generation from screenshots or diagrams

Content moderation of images and text

Automated document summarization with visual context

Building multimodal chatbots for e-commerce or healthcare

Improvements Over Previous Model

First multimodal model in the Mistral Large family, adding native vision support

Larger parameter count (124B vs 123B of Mistral Large 2) with dedicated vision encoder

Achieves state-of-the-art results on multimodal benchmarks like MathVista and DocVQA

Supports interleaved image and text inputs, enabling more flexible prompting

128K context window matches Mistral Large 2, allowing long document processing