Just off stage at #GoogleIO, some highlights from this morning 🧵 Gemini 3.5 Flash is available today for everyone in @antigravity and across our products and APIs. Compared to 3.1 Pro, 3.5 Flash is better across almost all benchmarks, with huge progress in coding. It’s also comparable to the best models but very fast (4x faster in tokens/second than other frontier models). And when looking at intelligence versus output speed, it’s in a league of its own in the top right quadrant.

Google AI @GoogleAI · May 20 · 85

Three years ago, Gemini started by understanding the world. With Gemini 2, models learned to think and reason. Late last year, Gemini 3 brought any idea to life. Today, we’re continuing that journey with our Gemini 3.5 series, starting with Gemini 3.5 Flash, delivering frontier performance for agents and coding.

🚨 AI News | TestingCatalog @testingcatalog · May 20 · 75

GOOGLE I/O 🔥: GEMINI 3.5 FLASH HAS BEEN ANNOUNCED! Gemini 3.5 performs on par with Gemini 3.1 Pro on the Artificial Analysis Intelligence benchmark but is much faster.

[Quoting @GeminiApp]: Gemini 3.5 Flash is here and it's our best model yet for getting things done quickly and efficiently. Whether you need help with everyday tasks or multi-step creative projects, Gemini 3.5 Flash navigates real-world complexity to help you take action. #GoogleIO

🚨 AI News | TestingCatalog @testingcatalog · May 20 · 79

GOOGLE I/O 🔥: Gemini 3.5 Flash is now available on AI Studio for testing! Have you tried it yet? 👀

Artificial Analysis @ArtificialAnlys · May 20 · 78

Google’s new Gemini 3.5 Flash is the clear leader on the Intelligence vs Speed Pareto frontier and makes large gains on GDPval-AA (real-world agentic tasks), but is 5x the cost of Gemini 3 Flash.
@GoogleDeepMind gave us pre-release access to Gemini 3.5 Flash, the latest model in its Flash family, which has traditionally offered faster, lower-cost alternatives to Gemini Pro models. Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash, driven primarily by agentic performance gains and hallucination reduction. It achieves speeds of over 280 output tokens/s, but higher token usage and token pricing make it over 5x more costly to run the Intelligence Index than Gemini 3 Flash, and 75% more costly than Gemini 3.1 Pro. Gemini 3.5 Flash is $1.50/1M input and $9/1M output tokens; Gemini 3 Flash was $0.50/$3 per 1M input/output tokens, a 3x increase. The rest of the increase was driven by higher token usage when running our benchmarks.
Key results for Gemini 3.5 Flash with ‘high’ thinking level:
➤ 9 point Intelligence Index improvement: Gemini 3.5 Flash scores 55 on the Artificial Analysis Intelligence Index, up 9 points from Gemini 3 Flash. This places it ahead of Grok 4.3 (high, 53) and Claude Sonnet 4.6 (max, 52). The model improves across nearly all evaluations, with the largest gains coming from agentic evaluations and AA-Omniscience (knowledge and hallucination). On AA-Omniscience, Gemini 3.5 Flash improves by 11 points, driven primarily by reduced hallucinations, with its hallucination rate falling to 61%, a 31 point decrease compared to Gemini 3 Flash.
➤ Agentic capability improvements: Gemini 3.5 Flash improves substantially over Gemini 3 Flash across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and Tau2-Bench Telecom (agentic tool use). Its GDPval-AA result is especially notable, achieving an Elo of 1656, well ahead of Gemini 3 Flash (1204) and Gemini 3.1 Pro (1314), and just behind GPT-5.4 (xhigh, 1674). This represents a meaningful step forward for Google in agentic performance, which has historically been a relative weakness for Gemini models.
➤ Speed-intelligence frontier: Gemini 3.5 Flash achieves speeds of over 280 output tokens per second, ~70% faster than Gemini 3 Flash and models such as gpt-oss-120b and GPT-5.4 mini (xhigh). With its 55 Intelligence Index score, this places Gemini 3.5 Flash on the speed-intelligence Pareto frontier alongside Gemini 3.1 Pro and Gemini 3.1 Flash-Lite, reinforcing Google’s strength in models balancing speed and intelligence.
➤ 5.5x increase in cost to run: Gemini 3.5 Flash costs $1,552 to run the Artificial Analysis Intelligence Index, 5.5x more than Gemini 3 Flash and 75% more than Gemini 3.1 Pro. This is driven by increases in both token usage and token prices. Output token usage is broadly unchanged from Gemini 3 Flash (73M vs. 72M), but input token usage increases significantly, driven primarily by an increase in the number of turns in agentic evaluations. Gemini 3.5 Flash is priced 3x higher than Gemini 3 Flash at $1.50/$9.00 per 1M input/output tokens, with a 90% discount for cached input tokens.
➤ Google continues to lead multimodal performance: Gemini 3.5 Flash is multimodal, supporting image, video, and speech input alongside text. This differs from many proprietary models, including Claude Opus 4.7, Grok 4.3, and GPT-5.5, which support image input only. In our multimodal evaluation, MMMU-Pro, Gemini 3.5 Flash scores 84%, the highest score recorded. This puts models from Google in the top two spots, with Gemini 3.1 Pro scoring 82%.
Key model details:
➤ Context window: Retains the same 1M context window as Gemini 3 Flash
➤ Multimodality: Text, image, video and speech input with text output only
➤ Pricing: $1.50/$9.00 per million input/output tokens, with a 90% discount for cached input tokens
Congratulations @GoogleDeepMind, @sundarpichai and @demishassabis on the great release!
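
The cost figures in the post above can be cross-checked with a little arithmetic. The sketch below is only a rough reading of the quoted numbers, not Artificial Analysis's methodology. It assumes that the 73M output-token figure belongs to Gemini 3.5 Flash and that the benchmark bill is just input cost plus output cost; the variable names are made up for illustration.

```python
# Figures quoted in the post (USD per 1M tokens; token counts in millions).
PRICES = {
    "gemini-3.5-flash": {"input": 1.50, "output": 9.00},
    "gemini-3-flash": {"input": 0.50, "output": 3.00},
}
TOTAL_COST_35_FLASH = 1552.0    # quoted cost to run the Intelligence Index
OUTPUT_TOKENS_M_35_FLASH = 73   # assumption: the 73M figure is 3.5 Flash
CACHED_INPUT_DISCOUNT = 0.90    # quoted discount for cached input tokens


def price_ratio(kind):
    """Price of Gemini 3.5 Flash relative to Gemini 3 Flash for one token kind."""
    return PRICES["gemini-3.5-flash"][kind] / PRICES["gemini-3-flash"][kind]


input_ratio = price_ratio("input")
output_ratio = price_ratio("output")

# Output tokens priced at list price; the remainder is attributed to input
# (an unknown mix of cached and uncached tokens), under the assumption above.
output_cost = OUTPUT_TOKENS_M_35_FLASH * PRICES["gemini-3.5-flash"]["output"]
remaining_cost = TOTAL_COST_35_FLASH - output_cost

# Totals implied by the quoted "5.5x more" and "75% more" comparisons.
implied_3_flash_total = TOTAL_COST_35_FLASH / 5.5
implied_31_pro_total = TOTAL_COST_35_FLASH / 1.75

# Share of the 5.5x increase that the 3x price change does not explain.
usage_multiplier = 5.5 / input_ratio

cached_input_price = round(
    PRICES["gemini-3.5-flash"]["input"] * (1 - CACHED_INPUT_DISCOUNT), 4
)

print(f"price ratio (input/output): {input_ratio:.1f}x / {output_ratio:.1f}x")
print(f"output cost: ${output_cost:.0f}, remaining (input side): ${remaining_cost:.0f}")
print(f"implied Gemini 3 Flash total: ${implied_3_flash_total:.0f}")
print(f"implied Gemini 3.1 Pro total: ${implied_31_pro_total:.0f}")
print(f"usage multiplier beyond price: {usage_multiplier:.2f}x")
print(f"cached input price: ${cached_input_price}/1M")
```

Under these assumptions the 3x price change accounts for only part of the 5.5x figure, which is consistent with the post attributing the rest to higher input token usage.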

Chubby♨️ @kimmonismus · May 20 · 55

Gemini 3.5 Pro next month!!!

Chubby♨️ @kimmonismus · May 20 · 68

Insane evals for a Flash model! Gemini 3.5 Flash is really good for its size!

Jeff Dean @JeffDean · May 20 · 85

1/ Today at #GoogleIO, we’re releasing Gemini 3.5, our latest family of models combining frontier intelligence with action. We’re starting by releasing 3.5 Flash, which is built to help you execute complex, long-horizon agentic workflows. Gemini 3.5 Flash is our strongest model for coding and agents yet. It outscores 3.1 Pro on agentic and coding benchmarks like Terminal-Bench and MCP Atlas, while running 4x faster than other frontier models. Used in Google Antigravity, 3.5 Flash is even further optimized to be up to 12x faster. It’s a powerful engine to deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale. Some highlights we’re excited about 🔽

Google DeepMind @GoogleDeepMind · May 20 · 78

We’re dropping Gemini Omni: our first step towards a model that can create anything from anything - starting with video. It combines Gemini’s intelligence with our generative media systems - representing a leap forward in world understanding, multimodality, and editing 🧵

Google DeepMind @GoogleDeepMind · May 20 · 81

Introducing Gemini 3.5: our newest family of models combining frontier intelligence with real-world action. The first release is 3.5 Flash, our strongest model yet for agents and coding 🧵

Google Gemini @GeminiApp · May 20 · 79

Gemini 3.5 Flash is here and it's our best model yet for getting things done quickly and efficiently. Whether you need help with everyday tasks or multi-step creative projects, Gemini 3.5 Flash navigates real-world complexity to help you take action. #GoogleIO

🚨 AI News | TestingCatalog @testingcatalog · May 20 · 74

GOOGLE I/O 🔥: GEMINI 3.5 FLASH HAS STARTED ROLLING OUT ON GEMINI AND APIs! Testing time soon 👀

🚨 AI News | TestingCatalog @testingcatalog · May 20 · 75

GOOGLE I/O 🔥: GEMINI OMNI FLASH HAS BEEN ANNOUNCED AND IS NOW AVAILABLE ON GEMINI AND GOOGLE FLOW. GEMINI OMNI PRO IS COMING SOON 🤩

Chubby♨️ @kimmonismus · May 20 · 77

"Progress towards AGI": Gemini Omni - world models - Gemini Omni official!! It can create anything from any input!!!

Chubby♨️ @kimmonismus · May 20 · 54

Gemini 3.5 Flash official! Insanely fast and capable model

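小互 @xiaohu · May 20 · 48

Google's brand-new Omni model 🫡

歸藏(guizang.ai) @op7418 · May 20 · 67

Wow! Google's new video model Gemini Omni Flash is now live on Flow

歸藏(guizang.ai) @op7418 · May 19 · 58

Google's new video model Gemini Omni has started rolling out more widely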

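AYi @AYi_AInotes · May 19 · 64

Damn it! SAM3 is absolutely going to be legendary! Not only is it open source, it is also ridiculously strong! The most impressive part is its tracking ability: it stays rock solid even in scenes as chaotic as a basketball game!!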

🚨 AI News | TestingCatalog @testingcatalog · May 19 · 76

GOOGLE I/O 🔥: We are getting Gemini 3.5 Flash today! > GEMINI > GEMINI > GEMINI > GEM 👀

[Quoting @AiBattle_]: Gemini 3.5 Flash just showed up in the Google Cloud console. It's here.

Rohan Paul @rohanpaul_ai · May 19 · 49

Gemini 3.5 in a few more hours. 🔥

[Quoting @_anshulr]: Gemini Gemini Gemini Gem

Alibaba Cloud @alibaba_cloud · May 19 · 60

🚀🚀 Qwen3.7 Preview lands on Arena! ⚡️⚡️ Here comes Qwen3.7-Plus-Preview. Alibaba now #5 in Vision. 🎨 Can't wait to release Qwen3.7 series models! Stay tuned! @arena

Alibaba Cloud @alibaba_cloud · May 19 · 55

🚀🚀 Qwen3.7 Preview lands on Arena! ⚡️⚡️ Here comes Qwen3.7-Max-Preview. Alibaba now #6 lab in Text. Can't wait to release Qwen3.7 series models! Stay tuned! @arena

[Translated summary]: Qwen3.7 Max Preview ranks #13 overall in Text Arena, making Alibaba the #6 lab on the platform, and places in the top 10 in several categories including math, expert knowledge, software and IT, and coding. Qwen3.7 Plus Preview ranks #16 in Vision Arena, where Alibaba is the #5 lab.

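小互 @xiaohu · May 19 · 70

Performance on par with Opus at 1/30 of the price? Cursor releases its in-house coding model, Composer 2.5. On scores: Composer 2.5 lands in the same range as Opus 4.7 across the board, with the largest gap under 1 point. On price: Opus 4.7 costs roughly $15 per million input tokens and $75 per million output tokens; Composer 2.5 is 10x cheaper on input and 30x cheaper on output. Cursor says Composer 2.5 improves noticeably over Composer 2 in both intelligence and behavior, especially on long-running tasks, complex instruction following, and smoothness of collaboration. On long tasks it keeps making progress across rollouts spanning hundreds of thousands of tokens without drifting off course; complex instruction following is more reliable; and its communication style and effort calibration are steadier, so it applies a more appropriate amount of effort.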
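
The "10x cheaper on input, 30x cheaper on output" claim implies per-token prices and a blended saving that depends on the input/output mix. The sketch below only works through that arithmetic from the approximate Opus 4.7 prices quoted in the post; the Composer 2.5 prices are derived, not quoted, and the workload mixes are hypothetical.

```python
# Approximate Opus 4.7 prices quoted in the post (USD per 1M tokens).
OPUS_PRICE = {"input": 15.0, "output": 75.0}
# "Cheaper by" factors quoted in the post.
CHEAPER_BY = {"input": 10, "output": 30}

# Implied Composer 2.5 prices (derived here, not stated in the post).
COMPOSER_PRICE = {kind: OPUS_PRICE[kind] / CHEAPER_BY[kind] for kind in OPUS_PRICE}


def job_cost(price, input_m, output_m):
    """Cost in USD for a job using input_m / output_m million tokens."""
    return price["input"] * input_m + price["output"] * output_m


def savings_ratio(input_m, output_m):
    """How many times cheaper the job is at the implied Composer 2.5 prices."""
    return job_cost(OPUS_PRICE, input_m, output_m) / job_cost(
        COMPOSER_PRICE, input_m, output_m
    )


# Hypothetical workload mixes: input-only, 3:1 input-heavy, output-only.
for input_m, output_m in [(1, 0), (3, 1), (0, 1)]:
    ratio = savings_ratio(input_m, output_m)
    print(f"{input_m}M in / {output_m}M out: {ratio:.2f}x cheaper")
```

The headline 30x figure applies only to output-dominated workloads; any input-heavy mix lands somewhere between 10x and 30x.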

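Berryxia.AI @berryxia · May 19 · 76

Today my feed was flooded by the "real" model from Odyssey! Odyssey just pulled "world models" straight into multiplayer mode. Agora-1 is the world's first truly real-time multi-agent world model. Humans and AI can now enter the same simulated world at the same time, interacting with and affecting each other in real time. They built a playable research preview on the classic GoldenEye deathmatch. You can jump in right now to team up with AI, shoot it out, and capture the flag, while the model generates visuals and sound in real time and the whole world keeps updating. This is no longer single-player video generation; it is a living world shared by multiple players. Odyssey says that in the long run multi-agent world models will fundamentally change gaming, simulation, education, robotics, and AI collaboration. People are no longer spectators; they truly live together in the same simulation. Try it now: https://agora.odyssey.ml Full introduction: https://odyssey.ml/introducing-agora-1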

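meng shao @shao__meng · May 19 · 71

Cursor released Composer 2.5, still based on Kimi K2.5. Because of the partnership with SpaceXAI, Musk personally posted to confirm that Composer 2.5 has already started training on Colossus 2 compute, and that the two are jointly training a brand-new model from scratch with more than 10x the compute! Composer 2.5 improves significantly over Composer 2 in both intelligence and behavior, with three capabilities in focus: sustained progress on long tasks, reliable following of complex instructions, and more natural collaboration. https://cursor.com/blog/composer-2-5
Three key training innovations
1. Reinforcement learning with targeted textual feedback. Problem addressed: in long tasks (rollouts of hundreds of thousands of tokens), the final reward can hardly tell the model which step went wrong, a classic RL credit assignment problem.
2. Synthetic training data. The number of synthetic tasks is 25x that of Composer 2. One representative method is feature deletion:
· Give the model a codebase with a complete test suite
· Delete some code to strip out a feature
· Have the agent re-implement the feature, with the original tests as the verifiable reward
3. Infrastructure-level optimization. The continued pretraining stage uses the Muon optimizer plus distributed orthogonalization:
· Run Newton-Schulz at the model's natural granularity (per head for attention, per expert for MoE)
· Sharded tensors are first reassembled into full matrices via all-to-all, then scattered back via all-to-all after orthogonalization; communication and computation overlap asynchronously
· A single optimizer step for the 1T model takes only 0.2s
"Soft" dimensions of the training objective
Cursor explicitly points out two dimensions that existing benchmarks cannot measure well and that they specifically optimized:
· Communication style
· Effort calibration (when to think more and when to stop)
These two make a big difference in how real collaboration feels, and they are the main application of the targeted textual feedback method.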
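
To make the "Newton-Schulz per head" step more concrete, here is a minimal pure-Python sketch of orthogonalizing each head's gradient block separately. It uses the classic cubic Newton-Schulz iteration rather than whatever tuned variant Cursor or Muon implementations actually run, it ignores sharding and the all-to-all communication entirely, and the function names and the tiny 2x3 "head" matrices are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def newton_schulz(g, steps=30):
    """Approximate the nearest (semi-)orthogonal matrix to g.

    Classic cubic iteration: X <- 1.5 * X - 0.5 * (X X^T) X.
    Dividing by the Frobenius norm first keeps every singular value
    at or below 1, inside the iteration's convergence region.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * gv for xv, gv in zip(x_row, g_row)]
             for x_row, g_row in zip(x, xxt_x)]
    return x


def orthogonalize_per_head(head_grads):
    """Run Newton-Schulz on each head's block independently."""
    return [newton_schulz(g) for g in head_grads]


# Two hypothetical attention heads, each with a 2x3 gradient block.
head_grads = [
    [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]],
    [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]

for head in orthogonalize_per_head(head_grads):
    gram = matmul(head, transpose(head))
    print([[round(v, 6) for v in row] for row in gram])
```

Each printed Gram matrix should be close to the 2x2 identity, meaning the rows of every per-head update are orthonormal. Working per head (or per expert) keeps each matrix small, which is presumably part of why the post describes that as the model's natural granularity.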

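Berryxia.AI @berryxia · May 19 · 62

Whoa, this model really is something! After watching it I just want to ask when I can get my hands on it! Odyssey AI lab just dropped something genuinely eye-catching: Starchild-1. It is the world's first real-time multimodal world model. It does not just generate visuals; it also generates real-world sound at the same time. In the video you can see a complete scene: the picture moves, the sound plays in sync, and sight and hearing fuse completely, like a world simulation that has truly come alive. Earlier world models could mostly only "see" the world; Starchild-1 has now learned to "hear". This is more than just another video generation tool. The bigger significance is that it is another key step toward a general world model, the next step in truly understanding and simulating the physical world. The Odyssey team says they are using this new form of multimodal intelligence to redefine how AI perceives reality.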

🚨 AI News | TestingCatalog @testingcatalog · May 19 · 68

GOOGLE I/O 🔥: These legends are AI-generated via an upcoming Gemini Omni model. > Both videos are 8s HD samples. > The video with Sundar and Demis is likely generated as an image-to-video using Omni for style editing. > Logan's video is likely a "Likeness" Avatar and Omni video. And "GEMINI" means a new model release! 🤯

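karminski-牙医 @karminski3 · May 19 · 59

The ultimate "pieced-together model" has arrived: ByteDance Lance! ByteDance just released an open-source model, Lance, with only 3B active parameters. Yet it accepts text, image, and video input and can output text, images, and video as well! So this single model can handle image understanding, video understanding, text-to-image, image-to-image, image editing, text-to-video, image-to-video, video editing, and more. The training team reveals in the technical report that training took only 128 A100 GPUs (by big-company standards, that is purely putting spare compute to use).
So why call it "pieced together"? Because the team did not build everything from scratch. The visual input module directly uses Qwen2.5-VL-ViT (for looking at images and video), and the visual output module is Wan2.2_VAE (for drawing). The model itself comes in two variants: Lance_3B (for image understanding, generation, and editing) and Lance_3B_Video (for video tasks such as text-to-video and image-to-video).
So this is very much a research project, and its highlight is precisely that it is pieced together well. Unlike many earlier self-described all-in-one models, it does not simply bolt a large language model (LLM) onto a diffusion model (the so-called pipeline approach). Instead, it processes text, image, and video context together in one shared interleaved sequence. The biggest benefit is a unified semantic space, which improves the model's understanding and performance (in evaluations, the 3B model approaches many 10B and even 20B models).
It also introduces multi-task synergy. Put simply, understanding tasks (image to vectors) and generation tasks (vectors to image) inherently conflict inside a model. Lance adds dedicated expert modules within the same framework, which mitigates this conflict and lets the model do both VQA (visual question answering) and image/video generation and editing.
Looking forward to real-world applications: this model matters a lot for on-device use and multimodal agents, since many scenarios that used to need several cooperating models can now be handled by one. #lance #全模态模型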

Chubby♨️ @kimmonismus · May 19 · 71

Huge, did NOT expect that release. Evals look very solid, significant jump compared to Composer 2! But: it’s 10x more efficient than the competition. Looks really exciting. Need to try it out

Chubby♨️ @kimmonismus · May 19 · 62

Intelligence too cheap to meter. This is the real deal. Composer 2.5 is an efficiency beast

Rohan Paul @rohanpaul_ai · May 19 · 64

Can a smaller model purpose-built for one domain beat a frontier general model that's 100× its size? A recent paper showed yes, and not by a small margin. Raven 3.5 from PolyAI shows that a smaller specialist model can beat bigger general models on customer service calls. It beats GPT-5 and Claude Sonnet 4.6 on all 4 customer service benchmarks while staying under 300ms latency. This is one of the live debates in ML. Every researcher is asking this question. The paper is the empirical answer. PolyAI's research team published “Raven 3.5: The post-training recipe that beats GPT-5 for customer service”.
Voice agents are moving from call-center software into everyday product infrastructure. PolyAI’s launch targets the gap between website traffic and real customer conversations, making every website capable of answering out loud. PolyAI helps enterprises fix slow phone support, long wait times, costly contact centers, robotic IVRs, and missed revenue from abandoned calls. Its voice agents handle customer conversations 24/7 across voice, chat, SMS, and social channels in 45+ languages. The result is faster support, lower operating cost, more consistent answers, and better customer experience at enterprise scale.
📞 PolyAI is launching 2 new voice AI products: ADK, a code-first Agent Development Kit for building production voice agents from your own IDE, and PolyPhone, which turns any website into a live voice AI agent in about 10 minutes. ADK connects directly into Agent Studio, so developers can build, manage, and deploy agents from the terminal. PolyPhone reads a website, understands things like FAQs and product details, then creates a voice agent that can be embedded on any webpage without needing telephony setup. The bigger point: enterprise voice AI is moving from “contact center project” to “something teams can build and ship much faster.” 🧵 1

Rohan Paul @rohanpaul_ai · May 19 · 74

Agora-1, a multi-agent world model from Odyssey, just exposed the next bottleneck for world models: keeping one shared reality consistent for everyone inside it. It is the first serious test of whether a world model can act like a game engine for multiple players at once. Agora-1 turns world models from single-player predictors into shared real-time environments. The big deal here is that several agents, human or AI, can now disturb the same simulated world at once, forcing the model to track not only scenery, but consequence. Traditional world models combine simulation dynamics and rendering within a single model. A single-player world model can survive by predicting what should happen next from one stream of action, but a multiplayer world has collisions, timing, intent, surprise, and blame. Agora-1 turns a world model into a learned multiplayer engine, where the AI does not just generate what one player sees, but keeps a shared world state stable while up to 4 humans or AI agents act inside it in real time. In that setting, realism is no longer just visual fidelity; it is whether the world stays coherent when two minds push on it from different directions.

Rohan Paul @rohanpaul_ai · May 19 · 57

HiDream just open-sourced an 8B image model with a big message behind it: the old diffusion pipeline (VAE-plus-text-encoder) may not be the only serious path left. At 8B parameters, HiDream-O1-Image claims parity with models over 3x its size (e.g., the 27B Qwen-Image). @HiDream_AI, @vivago_ai
Key Features
🧬 Pixel-Level Unified Transformer: One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
🎨 One Model, Many Tasks: Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.
🧠 Reasoning-Driven Prompt Agent: Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.
🖼️ Native High Resolution: Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
⚡ Exceptional Efficiency and Versatility at 8B Scale: With only 8B parameters, it matches or even surpasses larger open-source DiTs and leading closed-source models.
Most image models still split the job across a text encoder, a VAE, and a diffusion model, so details can get lost when real pixels are compressed into hidden image codes. HiDream-O1-Image removes that split by using a Pixel-level Unified Transformer, where raw image patches, text tokens, and task conditions enter the same model space. That means text-to-image, image editing, and subject personalization become variants of one in-context generation task, not separate pipelines. A prompt agent first rewrites messy user requests into clearer visual instructions, reasoning through layout, subject attributes, physics, and context before generation.
The strongest result is text rendering. On LongText-Bench, the 8B model scores 0.979 in English and 0.978 in Chinese, while the 200B+ model reaches 0.982 and 0.980. That is the part to watch, because clean text inside generated images is still one of the hardest problems for image models. 🧵 1.

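宝玉 @dotey · May 19 · 83

Cursor releases Composer 2.5
Cursor today launched its in-house coding model Composer 2.5. The pitch is better stamina on long tasks and steadier following of complex instructions; Cursor claims it is up to ten times more efficient than models of comparable capability. To promote the new model, Cursor is doubling its default quota for the coming week. One small training highlight is using text feedback for credit assignment, so the model can still learn from long trajectories on the order of 100,000 tokens. In other words, the model can carry programming tasks that run for dozens or hundreds of consecutive steps without forgetting what it was doing.
The base is still Kimi
Composer 2.5 is again post-trained on Moonshot's Kimi K2.5, the same as the previous generation. When Composer 2 launched two months ago, Cursor did not disclose the base model, and developers dug the model ID kimi-k2p5-rl out of API request headers, which caused a stir; this time it is stated directly in the blog post, which restores some transparency. Alongside the release, Cursor announced it is jointly training a larger model from scratch with SpaceXAI, with ten times the total compute of this one, running on the Colossus 2 supercluster of a million H100-equivalents. For background, SpaceX signed a strategic partnership with Cursor in April and obtained an option to acquire Cursor for $60 billion later this year; xAI had previously been merged into SpaceX. Cursor's compute lifeline is, in effect, now plugged into Musk's side.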

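凡人小北 @frxiaobei · May 19 · 61

Qwen 3.7 brings a pleasant surprise, though not a big one; its standing as top in China and first-tier internationally was confirmed long ago. Looking forward to it surpassing Anthropic someday and letting folks back home blow off some steam.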

Qwen @Alibaba_Qwen · May 19 · 57

🚀🚀 Qwen3.7 Preview lands on Arena! Here come Qwen3.7-Max-Preview & Qwen3.7-Plus-Preview. Alibaba now #6 lab in Text, #5 in Vision. ⚡️⚡️ Can't wait to release Qwen3.7 series models! Stay tuned! @arena

[Translated summary]: Qwen3.7 Max Preview ranks #13 overall in Text Arena and stands out on several sub-leaderboards including math and coding; Qwen3.7 Plus Preview ranks #16 overall in Vision Arena.

Qwen @Alibaba_Qwen · May 19 · 47

🚀🚀

[Quoting @arena]: In Vision Arena, Qwen3.7 Plus Preview makes @Alibaba_Qwen the #5 lab, ranking #16 overall.

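🚨 AI News | TestingCatalog @testingcatalog · May 17 · 60

SPACEXAI 🔥: The next version of Grok, based on the 1.5T V9 base model, has finished training. Looks like we will get a major upgrade this summer. > Next, we are adding the Cursor data in supplemental training. Soon 👀

[Translated summary]: The new version of Grok, based on the 1.5T-parameter V9 base model, has finished training, and a major upgrade is expected this summer. Next comes supplemental training that adds the Cursor data, followed by SFT and RL. The whole release process is expected to take 3 to 4 weeks. The upgrade would be a significant step up from the current public 0.5T V8 version.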