Artificial Analysis @ArtificialAnlys

2026-05-20 01:52 · 44 days ago

AI Summary

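Google has released a new model, Gemini 3.5 Flash, which gains 9 points on the Intelligence Index to reach 55, surpassing Grok 4.3 and Claude Sonnet 4.6, with especially notable progress on agentic tasks and knowledge accuracy (substantially fewer hallucinations). Its output speed of over 280 tokens/s places it on the leading frontier of speed and intelligence. However, the model costs 5.5x more to run than its predecessor, mainly because of higher input token usage and higher pricing. It also achieves the highest score on the multimodal evaluation MMMU-Pro and supports multimodal input, demonstrating Google's overall strength.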

Google's new Gemini 3.5 Flash is the clear leader on the Intelligence vs Speed Pareto frontier and makes large gains on GDPval-AA (real-world agentic tasks), but is 5x the cost of Gemini 3 Flash

@GoogleDeepMind gave us pre-release access to Gemini 3.5 Flash, the latest model in its Flash family, which has traditionally offered faster, lower-cost alternatives to Gemini Pro models. Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash, driven primarily by agentic performance gains and hallucination reduction. It achieves speeds of over 280 output tokens/s, but higher token usage and token pricing make it over 5x more costly to run the Intelligence Index than Gemini 3 Flash, and 75% more costly than Gemini 3.1 Pro. Gemini 3.5 Flash is priced at $1.50/1M input and $9/1M output tokens, versus $0.50/$3 per 1M input/output tokens for Gemini 3 Flash, a 3x increase. The rest of the cost increase was driven by higher token usage when running our benchmarks
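The cost arithmetic above can be sketched in a few lines. This is only an illustration, not Artificial Analysis's methodology: it assumes cost is simply tokens times list price, ignores any caching or tiered pricing, and uses 5.0 as a lower bound for "over 5x". Because input and output prices both rise by the same factor, the remainder is best read as a price-weighted token usage ratio. The function names are hypothetical.

```python
# Prices per 1M tokens in USD, as quoted in the post.
GEMINI_3_FLASH = {"input": 0.50, "output": 3.00}
GEMINI_3_5_FLASH = {"input": 1.50, "output": 9.00}


def price_ratio(new, old):
    """Per-token price multiplier for each token type."""
    return {kind: new[kind] / old[kind] for kind in old}


def implied_token_usage_ratio(total_cost_ratio, uniform_price_ratio):
    """If cost = tokens * price and every price scales by the same factor,
    the rest of the cost increase must come from token usage."""
    return total_cost_ratio / uniform_price_ratio


ratios = price_ratio(GEMINI_3_5_FLASH, GEMINI_3_FLASH)

# ASSUMPTION: 5.0 is a lower bound taken from "over 5x"; the exact
# cost ratio is not stated in the post body.
lower_bound_usage = implied_token_usage_ratio(5.0, ratios["input"])

print("price ratios:", ratios)
print("implied token usage ratio (at least):", round(lower_bound_usage, 2))
```

Under these assumptions, a 3x price increase and an overall cost increase of at least 5x imply that the benchmark run consumed at least roughly two thirds more (price-weighted) tokens than with Gemini 3 Flash.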

Key results for Gemini 3.5 Flash with 'high' thinking level:

➤ 9-point Intelligence Index improvement: Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash. This places it ahead of Grok 4.3 (high, 53) and Claude Sonnet 4.6 (max, 52). The model improves across nearly all evaluations, with the largest gains coming from agentic evaluations and AA-Omniscience (knowledge and hallucination). On AA-Omniscience, Gemini 3.5 Flash improves by 11 points, driven primarily by reduced hallucinations, with its hallucination rate falling to 61%, a 31-point decrease compared to Gemini 3 Flash

➤ Agentic capability improvements: Gemini 3.5 Flash improves substantially over Gemini 3 Flash across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and Tau2-Bench Telecom (agentic tool use). Its GDPval-AA result is especially notable, achieving an Elo of 1656, well ahead of Gemini 3 Flash (1204) and Gemini 3.1 Pro (1314), and just behind GPT-5.4 (xhigh, 1674). This represents a meaningful step forward for Google in agentic performance, which has historically been a relative weakness for Gemini models
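To get a feel for what these Elo gaps mean, the ratings can be plugged into the standard logistic Elo expectation formula. This is a rough sketch: whether GDPval-AA uses the conventional 400-point scale is an assumption, and the helper name is hypothetical.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo formula.

    ASSUMPTION: the conventional 400-point logistic scale applies to
    GDPval-AA ratings; the post does not state the scale used.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# GDPval-AA Elo ratings as quoted in the post.
ratings = {
    "Gemini 3.5 Flash": 1656,
    "Gemini 3 Flash": 1204,
    "Gemini 3.1 Pro": 1314,
    "GPT-5.4 (xhigh)": 1674,
}

for opponent in ("Gemini 3 Flash", "Gemini 3.1 Pro", "GPT-5.4 (xhigh)"):
    expected = elo_expected_score(ratings["Gemini 3.5 Flash"], ratings[opponent])
    print(f"Gemini 3.5 Flash vs {opponent}: expected score {expected:.2f}")
```

On that assumed scale, a gap of several hundred points corresponds to winning the large majority of head-to-head comparisons, while the 18-point gap to GPT-5.4 is close to even.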
