Chatbot Arena Launches a Multimodal Leaderboard

2024-06-27

AI Summary

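Chatbot Arena has added image battles and released a multimodal leaderboard. Based on 17,429 votes collected over two weeks across more than 60 languages, GPT-4o leads with a score of 1226, followed closely by Claude 3.5 Sonnet at 1209; the lead of these two models is more pronounced in the vision arena than in the language-only arena. Gemini 1.5 Pro and GPT-4 Turbo are tied for third, and the open-source model Llava 1.6 34B ranks eighth. The platform has also renamed "Elo rating" to "Arena Score" and plans to extend support to further modalities such as PDF, video, and audio.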



The Multimodal Arena is Here!

Multimodal Chatbot Arena

We added image support to Chatbot Arena! You can now chat with your favorite vision-language models from OpenAI, Anthropic, Google, and most other major LLM providers to help discover how these models stack up against each other.

In just two weeks, we have collected over 17,000 user preference votes across more than 60 languages. In this post we show the initial leaderboard and statistics, share some interesting conversations submitted to the arena, and briefly discuss the future of the multimodal arena.

Leaderboard results

Table 1. Multimodal Arena Leaderboard (Timeframe: June 10th - June 25th, 2024). Total votes = 17,429. The latest and detailed version here.

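| Rank | Model             | Arena Score | 95% CI  | Votes |
|------|-------------------|-------------|---------|-------|
| 1    | GPT-4o            | 1226        | +7/-7   | 3878  |
| 2    | Claude 3.5 Sonnet | 1209        | +5/-6   | 5664  |
| 3    | Gemini 1.5 Pro    | 1171        | +10/-6  | 3851  |
| 3    | GPT-4 Turbo       | 1167        | +10/-9  | 3385  |
| 5    | Claude 3 Opus     | 1084        | +8/-7   | 3988  |
| 5    | Gemini 1.5 Flash  | 1079        | +6/-8   | 3846  |
| 7    | Claude 3 Sonnet   | 1050        | +6/-8   | 3953  |
| 8    | Llava 1.6 34B     | 1014        | +11/-10 | 2222  |
| 8    | Claude 3 Haiku    | 1000        | +10/-7  | 4071  |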
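The tied ranks in Table 1 (two models at rank 3, two at 5, two at 8) can be reproduced from the scores and confidence intervals alone. The sketch below assumes a rule that this post does not state: a model's rank is one plus the number of models whose lower confidence bound lies strictly above its own upper bound. The function name `rank_by_interval` and the tuple layout are illustrative choices, not part of any Chatbot Arena code.

```python
# Rows copied from Table 1: (model, arena_score, ci_plus, ci_minus).
TABLE_1 = [
    ("GPT-4o", 1226, 7, 7),
    ("Claude 3.5 Sonnet", 1209, 5, 6),
    ("Gemini 1.5 Pro", 1171, 10, 6),
    ("GPT-4 Turbo", 1167, 10, 9),
    ("Claude 3 Opus", 1084, 8, 7),
    ("Gemini 1.5 Flash", 1079, 6, 8),
    ("Claude 3 Sonnet", 1050, 6, 8),
    ("Llava 1.6 34B", 1014, 11, 10),
    ("Claude 3 Haiku", 1000, 10, 7),
]


def rank_by_interval(rows):
    """Assumed rule: rank = 1 + number of models that are clearly better.

    A model counts as clearly better when its lower bound (score - ci_minus)
    is strictly above this model's upper bound (score + ci_plus).
    """
    ranks = {}
    for name, score, ci_plus, _ci_minus in rows:
        upper = score + ci_plus
        clearly_better = sum(
            1
            for _other, other_score, _other_plus, other_minus in rows
            if other_score - other_minus > upper
        )
        ranks[name] = 1 + clearly_better
    return ranks


if __name__ == "__main__":
    for model, rank in rank_by_interval(TABLE_1).items():
        print(rank, model)
```

Under this assumed rule, two models share a rank whenever neither interval sits entirely above the other, which is why overlapping pairs such as Gemini 1.5 Pro and GPT-4 Turbo appear tied.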

This multimodal leaderboard is computed only from battles that contain an image. In Figure 1 we compare the ranks of the models in the language arena versus the vision arena. The multimodal leaderboard ranking aligns closely with the LLM leaderboard, with a few interesting differences. Our overall findings are summarized below:

GPT-4o and Claude 3.5 Sonnet achieve notably higher performance than Gemini 1.5 Pro and GPT-4 Turbo. This gap is much more apparent in the vision arena than in the language arena.

Claude 3 Opus achieves significantly higher performance than Gemini 1.5 Flash on the LLM leaderboard, but on the multimodal leaderboard the two have similar performance.

Llava-v1.6-34b, one of the best open-source VLMs, achieves slightly higher performance than claude-3-haiku.
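To get a rough feel for what these score gaps mean, the sketch below converts a difference in Arena Score into an expected head-to-head win rate. It assumes the conventional Elo-style logistic curve with base 10 and a 400-point scale; the post does not give this formula, so the numbers are only indicative, and ties are ignored. `expected_win_rate` is a hypothetical helper name, and the scores are copied from Table 1.

```python
def expected_win_rate(score_a, score_b, scale=400.0, base=10.0):
    """Elo-style expected win rate of model A over model B.

    The base-10, 400-point scale is an assumption, not something
    stated in the post.
    """
    return 1.0 / (1.0 + base ** ((score_b - score_a) / scale))


# Arena Scores copied from Table 1.
SCORES = {
    "GPT-4o": 1226,
    "Claude 3.5 Sonnet": 1209,
    "Gemini 1.5 Pro": 1171,
    "GPT-4 Turbo": 1167,
    "Llava 1.6 34B": 1014,
    "Claude 3 Haiku": 1000,
}

PAIRS = [
    ("GPT-4o", "Gemini 1.5 Pro"),
    ("Claude 3.5 Sonnet", "GPT-4 Turbo"),
    ("Llava 1.6 34B", "Claude 3 Haiku"),
]

if __name__ == "__main__":
    for a, b in PAIRS:
        rate = expected_win_rate(SCORES[a], SCORES[b])
        print(f"{a} vs {b}: expected win rate {rate:.3f}")
```

Even the largest gap among these pairs, 55 points, corresponds to an expected win rate under 60% on this assumed scale, so the differences between neighboring models should be read together with the confidence intervals in Table 1.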

Figure 1. Comparison of the model ranks in the language arena and the vision arena.

Source: LMSYS Blog (Chatbot Arena team)
