# Pixtral 12B 发布公告

- 来源：Mistral AI：News（网页）
- 发布时间：2024-09-17 00:00
- AIHOT 分数：47
- AIHOT 链接：https://aihot.virxact.com/items/cmppdcr7e0e4bslv4ednhnjvr
- 原文链接：https://mistral.ai/news/pixtral-12b

## AI 摘要

Mistral AI 发布了多模态模型 Pixtral 12B，该模型现已弃用，已被更新更强的版本取代。其基于 Mistral Nemo 构建，包含一个从头训练的 400M 参数视觉编码器，总参数为 12B。模型原生支持可变图像尺寸与宽高比，并能在 128K token 的上下文窗口中处理多张图像。在 MMMU 推理基准测试中，Pixtral 12B 达到 52.5% 的分数，超越了众多更大规模的模型。它在保持强大文本能力（如指令跟随，相对前代开源模型有 20% 提升）的同时，在多模态任务上表现出色。该模型采用 Apache 2.0 许可证。

## 正文

Pixtral 12B is no longer maintained and has been replaced by our latest, more powerful vision and multimodal models.

Explore our current vision capabilities

Pixtral 12B in short:

Natively multimodal, trained with interleaved image and text data

Natively multimodal, trained with interleaved image and text data

Strong performance on multimodal tasks, excels in instruction following

Strong performance on multimodal tasks, excels in instruction following

Maintains state-of-the-art performance on text-only benchmarks

Maintains state-of-the-art performance on text-only benchmarks

Architecture:

Architecture:

New 400M parameter vision encoder trained from scratch 12B parameter multimodal decoder based on Mistral Nemo Supports variable image sizes and aspect ratios Supports multiple images in the long context window of 128k tokens

New 400M parameter vision encoder trained from scratch

New 400M parameter vision encoder trained from scratch

12B parameter multimodal decoder based on Mistral Nemo

12B parameter multimodal decoder based on Mistral Nemo

Supports variable image sizes and aspect ratios

Supports variable image sizes and aspect ratios

Supports multiple images in the long context window of 128k tokens

Supports multiple images in the long context window of 128k tokens

Use:

Use:

License: Apache 2.0 Try it on La Plateforme or on Le Chat

License: Apache 2.0

License: Apache 2.0

Try it on La Plateforme or on Le Chat

Try it on La Plateforme or on Le Chat

Pixtral is trained to understand both natural images and documents, achieving 52.5% on the MMMU reasoning benchmark, surpassing a number of larger models. The model shows strong abilities in tasks such as chart and figure understanding, document question answering, multimodal reasoning and instruction following. Pixtral is able to ingest images at their natural resolution and aspect ratio, giving the user flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Unlike previous open-source models, Pixtral does not compromise on text benchmark performance to excel in multimodal tasks.

Performance

Pixtral was trained to be a drop-in replacement for Mistral Nemo 12B. Its key distinguishing factor from existing open-source models is the delivery of best-in-class multimodal reasoning without compromising on key text capabilities such as instruction following, coding, and math.

We re-evaluate a range of open and closed models through the same evaluation harness. For each dataset, the prompt was chosen such that we could reproduce the results of leading multimodal models (GPT-4o and Claude-3.5-Sonnet). All models were then evaluated with this same prompt. Overall, Pixtral substantially outperforms all open models around its scale and, in many cases, outperforms closed models such as Claude 3 Haiku. Pixtral even outperforms or matches the performance of much larger models like LLaVa OneVision 72B on multimodal benchmarks.All prompts will be open-sourced.

Performance of Pixtral compared to closed and larger multimodal models. [All models were benchmarked through the same evaluation harness and with the same prompt. We verify that prompts reproduce the performance reported for GPT-4o and Claude 3.5 Sonnet (prompts will be provided in technical report)].

Pixtral particularly excels at both multimodal and text-only instruction following as compared to other open multimodal models. It substantially outperforms Qwen2-VL 7B, LLaVa-OneVision 7B and Phi-3.5 Vision in instruction following, with a 20% relative improvement in text IF-Eval and MT-Bench over the nearest OSS model. To further evaluate this ability for multimodal use cases, we create multimodal versions of these benchmarks: MM-IF-Eval and MM-MT-Bench. Pixtral outperforms open-source alternatives on multimodal instruction following benchmarks as well. We will open-source MM-MT-Bench to the community.

Performance of Pixtral compared to open multimodal models. All models were benchmarked through the same evaluation harness and with the same prompt.

Architecture

Variable image size: Pixtral is designed to optimize for both speed and performance. We trained a new vision encoder that natively supports variable image sizes:

We simply pass images through the vision encoder at their native resolution and aspect ratio, converting them into image tokens for each 16x16 patch in the image

We simply pass images through the vision encoder at their native resolution and aspect ratio, converting them into image tokens for each 16x16 patch in the image

These tokens are then flattened to create a sequence, with [IMG BREAK] and [IMG END] tokens added between rows and at the end of the image.

These tokens are then flattened to create a sequence, with [IMG BREAK] and [IMG END] tokens added between rows and at the end of the image.

[IMG BREAK]

[IMG END]

[IMG BREAK] tokens let the model distinguish between images of different aspect ratios with the same number of tokens.

[IMG BREAK] tokens let the model distinguish between images of different aspect ratios with the same number of tokens.

[IMG BREAK]

In this way, Pixtral can be used to accurately understand complex diagrams, charts and documents in high resolution, while providing fast inference speeds on small images like icons, clipart, and equations.

Pixtral uses a new vision encoder trained from scratch that natively supports variable image sizes.

Final architecture: Pixtral has two components: the Vision Encoder, which tokenizes images, and a Multimodal Transformer Decoder, which predicts the next text token given a sequence of text and images. The model is trained to predict the next text token on interleaved image and text data. This architecture allows Pixtral to process any number of images with arbitrary sizes in its large context window of 128K tokens.

Qualitative Examples

This table combines the data from both tables, ensuring that all models and their respective scores from each benchmark are included in a single, cohesive format.

Prompt:

List the top 5 countries in Europe with the highest GDP

Response:

The top 5 countries in Europe with the highest GDP, based on the image, are:

Germany

Germany

GDP: $3.99 trillion Percentage: 4.65%

GDP: $3.99 trillion

GDP: $3.99 trillion

Percentage: 4.65%

Percentage: 4.65%

United Kingdom

United Kingdom

GDP: $2.82 trillion Percentage: 3.29%

GDP: $2.82 trillion

GDP: $2.82 trillion

Percentage: 3.29%

Percentage: 3.29%

France

France

GDP: $2.78 trillion Percentage: 3.24%

GDP: $2.78 trillion

GDP: $2.78 trillion

Percentage: 3.24%

Percentage: 3.24%

Italy

Italy

GDP: $2.07 trillion Percentage: 2.42%

GDP: $2.07 trillion

GDP: $2.07 trillion

Percentage: 2.42%

Percentage: 2.42%

Spain

Spain

GDP: $1.43 trillion Percentage: 1.66%

GDP: $1.43 trillion

GDP: $1.43 trillion

Percentage: 1.66%

Percentage: 1.66%

These countries are highlighted in green on the diagram.

Chart Understanding and Analysis

Image:

Prompt:

When did things start to go wrong for dark dragon?

Response:

The image is a line graph titled 'train/loss' that shows the training loss over steps for four different models: dark-dragon-50, dainty-sunset-49, different-haze-34, and morning-field-33. The x-axis represents the steps, ranging from 0 to 60k, and the y-axis represents the loss on a logarithmic scale from 1e-5 to 1e+5.

To determine when things started to go wrong for the dark-dragon-50 model, we need to look at the red line representing this model. Initially, the loss decreases, indicating that the model is learning effectively. However, around the 10k step mark, the loss starts to spike significantly. This spike indicates that the model's performance is deteriorating. The loss continues to spike and fluctuate, reaching very high values, which suggests that the model is not learning effectively and is likely overfitting or encountering other issues.

Therefore, the critical point where things started to go wrong for the dark-dragon-50 model is around the 10k step mark.

Multi-image instruction following

Images:

Prompt:

Combine information in both the tables into a single markdown table

Response:

To combine the information from both tables into a single table, we need to align the models and their corresponding scores from each benchmark. Here is the combined table:

| Model | MMMU (CoT) | Mathvista (CoT) | ChartQA (CoT) | DocVQA (ANLS) | VQAv2 (VQA Match) ||------------------------|------------|-----------------|---------------|---------------|-------------------|| Pixtral 12B | 52.5 | 58.0 | 81.8 | 90.7 | 78.6 || Claude-3 Haiku | 50.4 | 44.8 | 69.6 | 74.6 | 68.4 || Gemini-1.5 Flash 8B (0827) | 50.7 | 56.9 | 78.0 | 79.5 | 65.5 || LLaVA-OV 72B | 54.4 | 57.2 | 66.9 | 91.6 | 83.8 || GPT-4o | 68.6 | 64.6 | 85.1 | 88.9 | 77.8 || Claude-3.5 Sonnet | 68.0 | 64.4 | 87.6 | 90.3 | 70.7 || Qwen2-VL 7B | 47.6 | 54.4 | 38.6 | 94.5 | 75.9 || LLaVA-OV 7B | 45.1 | 36.1 | 67.1 | 90.5 | 78.3 || Phi-3 Vision | 40.3 | 36.4 | 72.0 | 84.9 | 42.4 || Phi-3.5 Vision | 38.3 | 39.3 | 67.7 | 74.4 | 56.1 |

| Model | MMMU (CoT) | Mathvista (CoT) | ChartQA (CoT) | DocVQA (ANLS) | VQAv2 (VQA Match) ||------------------------|------------|-----------------|---------------|---------------|-------------------|| Pixtral 12B | 52.5 | 58.0 | 81.8 | 90.7 | 78.6 || Claude-3 Haiku | 50.4 | 44.8 | 69.6 | 74.6 | 68.4 || Gemini-1.5 Flash 8B (0827) | 50.7 | 56.9 | 78.0 | 79.5 | 65.5 || LLaVA-OV 72B | 54.4 | 57.2 | 66.9 | 91.6 | 83.8 || GPT-4o | 68.6 | 64.6 | 85.1 | 88.9 | 77.8 || Claude-3.5 Sonnet | 68.0 | 64.4 | 87.6 | 90.3 | 70.7 || Qwen2-VL 7B | 47.6 | 54.4 | 38.6 | 94.5 | 75.9 || LLaVA-OV 7B | 45.1 | 36.1 | 67.1 | 90.5 | 78.3 || Phi-3 Vision | 40.3 | 36.4 | 72.0 | 84.9 | 42.4 || Phi-3.5 Vision | 38.3 | 39.3 | 67.7 | 74.4 | 56.1 |

This table combines the data from both tables, ensuring that all models and their respective scores from each benchmark are included in a single, cohesive format.

Image to Code

Image:

Prompt:

Write HTML code to create a website like this

Response:

Certainly! Below is an example of HTML code to create a simple webpage that allows users to pick an ice cream flavor. This example includes a dropdown menu for selecting the flavor and a "Next" button.

```html Pick an Ice Cream Flavor

Pick An Ice Cream Flavor

Vanilla Chocolate Strawberry Cookie Dough Mint Chocolate Next

© copyright MistralAI

```html Pick an Ice Cream Flavor

Pick An Ice Cream Flavor

Vanilla Chocolate Strawberry Cookie Dough Mint Chocolate Next

© copyright MistralAI

HTML Structure: The webpage consists of a container div that holds the heading, a dropdown menu, and a button.

HTML Structure: The webpage consists of a container div that holds the heading, a dropdown menu, and a button.

CSS Styling: Basic styling is applied to center the content and style the elements.

CSS Styling: Basic styling is applied to center the content and style the elements.

JavaScript Function: The `next` function is called when the "Next" button is clicked. It retrieves the selected flavor from the dropdown menu and displays an alert with the chosen flavor.

JavaScript Function: The `next` function is called when the "Next" button is clicked. It retrieves the selected flavor from the dropdown menu and displays an alert with the chosen flavor.

You can copy and paste this code into an HTML file and open it in a web browser to see the result.

Website Generated:

Natural scene understanding

Prompt: Is this person really big, or is this building just super small?

Response:

The image shows a person holding their thumb in front of the Leaning Tower of Pisa, making it appear as if the thumb is much larger than the tower. This creates an optical illusion, making the tower seem smaller than it actually is. In reality, the Leaning Tower of Pisa is a large and iconic structure, and the person's thumb is not nearly as large as it appears in the image. The purpose of this photo is to demonstrate the impressive height and size of the tower by comparing it to a familiar object, such as a thumb.

How to run Pixtral?

You can try Pixtral easily and freely via Le Chat , our user-friendly conversational chat interface. You can choose Pixtral in the model list, upload an image, and start asking questions about the image.

Pixtral is also available on La Plateforme. You can leverage Pixtral's capabilities through API calls, enabling seamless integration with various applications and workflows. Below is a simple example. Please find more details in our docs .

curl https://api.mistral.ai/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $MISTRAL_API_KEY" \ -d '{ "model": "pixtral-12b-2409", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "What’s in this image?" }, { "type": "image_url", "image_url": "https://tripfixers.com/wp-content/uploads/2019/11/eiffel-tower-with-snow.jpeg" } ] } ], "max_tokens": 300 }'

curl https://api.mistral.ai/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $MISTRAL_API_KEY" \ -d '{ "model": "pixtral-12b-2409", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "What’s in this image?" }, { "type": "image_url", "image_url": "https://tripfixers.com/wp-content/uploads/2019/11/eiffel-tower-with-snow.jpeg" } ] } ], "max_tokens": 300 }'

mistral-inference

The easiest way to run Pixtral locally is to use mistral-inference. After installing mistral_inference, you can download the model, load the model, and run the model using the code below. For detailed information, please see here .

mistral-inference

mistral_inference

# download the modelfrom huggingface_hub import snapshot_downloadfrom pathlib import Path mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')mistral_models_path.mkdir(parents=True, exist_ok=True) snapshot_download(repo_id="mistralai/Pixtral-12B-2409", allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"], local_dir=mistral_models_path) # load the modelfrom mistral_inference.transformer import Transformerfrom mistral_inference.generate import generate from mistral_common.tokens.tokenizers.mistral import MistralTokenizerfrom mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunkfrom mistral_common.protocol.instruct.request import ChatCompletionRequest tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")model = Transformer.from_folder(mistral_models_path) # Run the modelurl = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"prompt = "Describe the image." completion_request = ChatCompletionRequest(messages=[UserMessage(content=[ImageURLChunk(image_url=url), TextChunk(text=prompt)])]) encoded = tokenizer.encode_chat_completion(completion_request) images = encoded.imagestokens = encoded.tokens out_tokens, _ = generate([tokens], model, images=[images], max_tokens=256, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)result = tokenizer.decode(out_tokens[0]) print(result)

# download the modelfrom huggingface_hub import snapshot_downloadfrom pathlib import Path mistral_models_path = Path.home().joinpath('mistral_models', 'Pixtral')mistral_models_path.mkdir(parents=True, exist_ok=True) snapshot_download(repo_id="mistralai/Pixtral-12B-2409", allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"], local_dir=mistral_models_path) # load the modelfrom mistral_inference.transformer import Transformerfrom mistral_inference.generate import generate from mistral_common.tokens.tokenizers.mistral import MistralTokenizerfrom mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunkfrom mistral_common.protocol.instruct.request import ChatCompletionRequest tokenizer = MistralTokenizer.from_file(f"{mistral_models_path}/tekken.json")model = Transformer.from_folder(mistral_models_path) # Run the modelurl = "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"prompt = "Describe the image." completion_request = ChatCompletionRequest(messages=[UserMessage(content=[ImageURLChunk(image_url=url), TextChunk(text=prompt)])]) encoded = tokenizer.encode_chat_completion(completion_request) images = encoded.imagestokens = encoded.tokens out_tokens, _ = generate([tokens], model, images=[images], max_tokens=256, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)result = tokenizer.decode(out_tokens[0]) print(result)

If you choose to serve Pixtral locally, we also recommend using Pixtral with the vLLM library as a fantastic option to reach higher serving throughput. We thank the vLLM team for their support to integrate Pixtral quickly. Below is a simple usage example. Please find more information here .

from vllm import LLMfrom vllm.sampling_params import SamplingParams model_name = "mistralai/Pixtral-12B-2409" sampling_params = SamplingParams(max_tokens=8192) llm = LLM(model=model_name, tokenizer_mode="mistral") prompt = "Describe this image in one sentence."image_url = "https://picsum.photos/id/237/200/300" messages = [ { "role": "user", "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}] },] outputs = vllm_model.model.chat(messages, sampling_params=sampling_params) print(outputs[0].outputs[0].text)

from vllm import LLMfrom vllm.sampling_params import SamplingParams model_name = "mistralai/Pixtral-12B-2409" sampling_params = SamplingParams(max_tokens=8192) llm = LLM(model=model_name, tokenizer_mode="mistral") prompt = "Describe this image in one sentence."image_url = "https://picsum.photos/id/237/200/300" messages = [ { "role": "user", "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}] },] outputs = vllm_model.model.chat(messages, sampling_params=sampling_params) print(outputs[0].outputs[0].text)

0%