# Chatbot Arena Launches a Multimodal Leaderboard

- Source: LMSYS Blog (Chatbot Arena team)
- Published: 2024-06-27 00:00
- AIHOT link: https://aihot.virxact.com/items/cmnxbjke6007xsln0489fljsb
- Original link: https://www.lmsys.org/blog/2024-06-27-multimodal

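## AI Summary

Chatbot Arena has added image battles and released a multimodal leaderboard. Based on 17,429 votes collected over two weeks across more than 60 languages, GPT-4o leads with a score of 1226, followed closely by Claude 3.5 Sonnet at 1209; the lead of these two models over the rest is more pronounced in the vision arena than in the language arena. Gemini 1.5 Pro and GPT-4 Turbo are tied for third, and the open-source model Llava 1.6 34B is ranked eighth. The platform has also renamed "Elo rating" to "Arena score" and plans to extend support to further modalities such as PDFs, video, and audio.

## Body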

The Multimodal Arena is Here!

Multimodal Chatbot Arena

We added image support to Chatbot Arena! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other.

In just two weeks, we have collected over 17,000 user preference votes across over 60 languages. In this post we show the initial leaderboard and statistics, some interesting conversations submitted to the arena, and include a short discussion on the future of the multimodal arena.

Leaderboard results

Table 1. Multimodal Arena Leaderboard (Timeframe: June 10th - June 25th, 2024). Total votes = 17,429. The latest and detailed version here.

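| Rank | Model | Arena Score | 95% CI | Votes |
|------|-------|-------------|--------|-------|
| 1 | GPT-4o | 1226 | +7/-7 | 3878 |
| 2 | Claude 3.5 Sonnet | 1209 | +5/-6 | 5664 |
| 3 | Gemini 1.5 Pro | 1171 | +10/-6 | 3851 |
| 3 | GPT-4 Turbo | 1167 | +10/-9 | 3385 |
| 5 | Claude 3 Opus | 1084 | +8/-7 | 3988 |
| 5 | Gemini 1.5 Flash | 1079 | +6/-8 | 3846 |
| 7 | Claude 3 Sonnet | 1050 | +6/-8 | 3953 |
| 8 | Llava 1.6 34B | 1014 | +11/-10 | 2222 |
| 8 | Claude 3 Haiku | 1000 | +10/-7 | 4071 |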

This multimodal leaderboard is computed only from battles that contain an image. In Figure 1 we compare the ranks of the models in the language arena vs. the vision arena. The multimodal leaderboard ranking aligns closely with the LLM leaderboard, but with a few interesting differences. Our overall findings are summarized below:

- GPT-4o and Claude 3.5 Sonnet achieve notably higher performance than Gemini 1.5 Pro and GPT-4 Turbo. This gap is much more apparent in the vision arena than in the language arena.
- Claude 3 Opus achieves significantly higher performance than Gemini 1.5 Flash on the LLM leaderboard, but on the multimodal leaderboard the two have similar performance.
- Llava-v1.6-34b, one of the best open-source VLMs, achieves slightly higher performance than claude-3-haiku.

Figure 1. Comparison of the model ranks in the language arena and the vision arena.

As a small note, you might also notice that the “Elo rating” column from earlier Arena leaderboards has been renamed to “Arena score.” Rest assured: nothing has changed in the way we compute this quantity; we just renamed it. (The reason for the change is that we were computing the Bradley-Terry coefficients, which are slightly different from the Elo score, and wanted to avoid future confusion.) You should think of the Arena score as a measure of model strength. If model A has an Arena score $s_A$ and model B has an Arena score $s_B$, you can calculate the win rate of model A over model B as

$$\mathbb{P}(A \text{ beats } B) = \frac{1}{1 + e^{\frac{s_B - s_A}{400}}},$$

where the number 400 is an arbitrary scaling factor that we chose in order to display the Arena score in a more human-readable format (as whole numbers). For additional information on how the leaderboard is computed, please see this notebook.
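To make the win-rate formula concrete, here is a minimal Python sketch that evaluates it for a few pairs of scores from Table 1. The function name `win_probability` and the constant `ARENA_SCALE` are illustrative names chosen for this example rather than anything from the Arena codebase, and the exponential base follows the formula exactly as written in this post.

```python
import math

# Scaling factor from the post; the constant name is our own choice.
ARENA_SCALE = 400.0


def win_probability(score_a: float, score_b: float, scale: float = ARENA_SCALE) -> float:
    """Return P(A beats B) = 1 / (1 + exp((s_B - s_A) / scale))."""
    return 1.0 / (1.0 + math.exp((score_b - score_a) / scale))


# Arena scores taken from Table 1.
gpt_4o = 1226
claude_35_sonnet = 1209
claude_3_haiku = 1000

p_close = win_probability(gpt_4o, claude_35_sonnet)
p_wide = win_probability(gpt_4o, claude_3_haiku)

print(f"GPT-4o vs Claude 3.5 Sonnet: {p_close:.3f}")
print(f"GPT-4o vs Claude 3 Haiku:    {p_wide:.3f}")
```

Under this formula, two models with equal scores each win half the time, and swapping the arguments gives the complementary probability, so only the score difference matters.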

Examples of Multimodal Usage

Now that we’ve looked at the leaderboard results, let’s look at some of the data! We are currently working to better group this data into categories, but these are some common tasks we have noticed so far:

- General captioning questions
- Math questions
- Document understanding
- Meme explanation
- Story writing

Note that we currently only consider conversations which pass our moderation filter. Below we have some examples from these categories as well as some other fun examples we found!

What's next?

Along with rolling out some fun new categories for the multimodal leaderboard, we look forward to adding multi-image support and supporting new modalities such as PDFs, video, and audio. Please feel free to join our Discord and give us any feedback about what feature you want next!
