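The new model on OpenRouter is not DeepSeek-V4. OpenRouter has just added another anonymous model, elephant, but it is most likely not DeepSeek-V4, so don't be fooled. I ran a quick test and its coding ability is very weak. On my elephant-toothpaste prompt it still used three.js r128 (a 2021 release), which shows how old the training corpus must be. Subjectively it may not even beat DeepSeek-V3. So it cannot be DeepSeek-V4, and it is unlikely to come from any of the Chinese labs either, because none of their models has performed this badly on this test so far. The model has 100B parameters and a 262K context window, so it just about counts as a dual-GPU sweet-spot model (two 32 GB cards can barely hold the 4-bit quantized version). Speed is good, with output close to 300 tokens/s. In short, I don't recommend it for coding. If you use openclaw (nicknamed "the lobster") and are curious, switch to it and see whether it works as openclaw's everyday workhorse model; it is free, after all, and nobody minds a freebie. #openrouter #deepseekv4 #elephant

Summary: OpenRouter has listed an anonymous model, elephant, and hands-on testing rules out DeepSeek-V4. The model has 100B parameters, supports a 262K context window, generates close to 300 tokens/s, and its 4-bit version can be deployed on two 32 GB GPUs. Its coding ability is weak, though: it relies on three.js r128 from 2021, its training data is stale, and overall it falls short of DeepSeek-V3. It is not recommended for coding and is suitable only for free everyday experimentation.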
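
The "two 32 GB cards can barely hold the 4-bit version" estimate can be checked with a rough calculation. The sketch below counts weight memory only; the `overhead_fraction` for KV cache and runtime buffers is an assumed illustrative value, not something stated in the post.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_on_gpus(params_billions: float, bits_per_weight: float,
                 num_gpus: int, gb_per_gpu: float,
                 overhead_fraction: float = 0.2) -> bool:
    """Check whether weights plus an assumed overhead fit in total GPU memory.

    overhead_fraction is a hypothetical allowance for KV cache and buffers;
    real usage depends on context length and the inference runtime.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + overhead_fraction)
    return needed <= num_gpus * gb_per_gpu


if __name__ == "__main__":
    print(weight_memory_gb(100, 4))        # weights of a 100B model at 4 bits
    print(fits_on_gpus(100, 4, 2, 32))     # two 32 GB cards, 4-bit
    print(fits_on_gpus(100, 8, 2, 32))     # two 32 GB cards, 8-bit
```

Under these assumptions the 4-bit weights take about 50 GB of the 64 GB available, which leaves little room for a long context and matches the "barely" in the post.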

Chubby♨️@kimmonismus · Apr 14

An image circulating on the web. Looks like Kimi 2.6 code is incoming!

Summary: An image circulating online suggests that Kimi 2.6 code is on the way.

Ethan Mollick@emollick · Apr 13

Impressed that Seedance 2.0 can pull off "a mech battle between Neanderthal and Homo Sapiens" so well. (This is exactly what happened, historically)

Summary: Seedance 2.0 handles "a mech battle between Neanderthal and Homo Sapiens" impressively well. (This is exactly what happened, historically.)

AK@_akhaliq · Apr 12

MiniMax-M2.7 is out on Hugging Face. Model: https://huggingface.co/MiniMaxAI/MiniMax-M2.7

Summary: MiniMax-M2.7 is now available on Hugging Face through the official repository link.

Rohan Paul@rohanpaul_ai · Apr 12

Mark Zuckerberg: Most businesses will not own frontier AI in the way Meta or OpenAI does. But many will end up with something that feels like their own AI: a customized operational layer that reflects how that company actually works. He says, "OpenAI, Google, they're building an AI. But I think we're gonna have a lot of different AI systems, just like we're gonna have, we have a lot of different apps. I think in the future, every business, just like I have a website and a phone number and an email address, a social media account, is also going to have an AI that can interact with their customers to help them sell things, help them give support." --- What he is really describing is that a company's "own AI" will usually not be a frontier model trained from scratch, but a layer built on top of shared models, shaped by its products, policies, customer history, and way of working. Support, sales, and basic operations can be handled through a system that knows the business well enough to answer, route, recommend, and escalate without feeling generic. --- From 'Cleo Abram' YT channel (link in comment)

Summary: Mark Zuckerberg says most businesses will not own a frontier AI foundation model but will build a customized operational layer on top of shared models, reflecting their workflows and customer history, for customer interaction and support. Separately, Meta released Muse Spark, a natively multimodal reasoning model with a multi-agent orchestration architecture in which multiple copies reason in parallel and compare results. It reaches capabilities similar to Llama 4 Maverick with over 10x less training compute, marking a shift in AI performance gains from scaling a single model to allocating compute intelligently at run time.

TestingCatalog News 🗞@testingcatalog · Apr 10

Meta is planning to release Muse Spark on the APIs soon. Would be curious also to play with Meta's 9B model if it will ever come out. Soon 👀

Summary: Meta plans to release Muse Spark through its APIs soon; the author also hopes to try Meta's 9B model if it is ever released.

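karminski-牙医@karminski3 · Apr 10

Can AI take photos for me now? Qwen3.5-Omni hands-on! Here is a hands-on test of Qwen3.5-Omni-Plus, an omni-modal large model. It accepts text, audio, image, and video input and produces text and speech output, which makes it a good fit for voice assistants. This round mainly tested its vision capabilities, covering video understanding and text recognition in images. Straight to the conclusions: in the video understanding test it captured on-screen details accurately, such as key props, text, and actions. The image test surprised me most: I ran OCR on passages of 100 to 5,000 characters, and the error rate stayed within 0.1% up to 2,000 characters and only rose above 0.3% beyond roughly 3,900 characters. The tests also exposed some problems, such as hallucinations in video understanding, where the model identified music or plot points that do not exist. For production use I suggest adding cross-validation, or simply trying a temperature of 0. I also modified openclaw (the "lobster") to support Omni models and got Qwen3.5-Omni-Plus to operate my tablet's screen and camera. Combined with everyday scenarios, Omni models can power many interesting SKILLs. #通义实验室 #千问大模型 #qwen #qwen35omni

Summary: Qwen3.5-Omni-Plus is an omni-modal large model that accepts text, audio, image, and video input and produces text and speech output. In testing, its video understanding captured on-screen details accurately and its OCR error rate stayed below 0.1% for passages up to 2,000 characters, but it hallucinates at times, inventing music or plot points. By modifying the openclaw framework, the author had the model control a tablet's screen and camera directly, extending on-device AI interaction scenarios.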
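
The post does not say how the OCR error rate was measured. One common choice is character error rate (CER): the edit distance between the reference text and the OCR output, divided by the reference length. The sketch below assumes that definition; the function names are illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete from reference
                current[j - 1] + 1,                   # insert into reference
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "a" * 1000
    ocr_output = "b" + "a" * 999   # one wrong character in 1,000
    print(f"{character_error_rate(truth, ocr_output):.3%}")
```

Under this definition, an error rate of 0.1% on a 2,000-character passage corresponds to about two wrong characters.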

Haider.@haider1 · Apr 9

ok whattt "openai plans a limited rollout of its new model to a small group of companies, with no public release planned" still hoping for a new model or omni model, maybe gpt-5.5 or gpt-5o but it looks like both anthropic and openai are doing PR stunts around their internal models, "mythos" and "spud"

Summary: OpenAI plans a limited rollout of a new model with advanced cybersecurity capabilities to a small group of companies, with no public release for now, similar to Anthropic's restricted release of Mythos. The author suspects a PR stunt and had been hoping for a proper launch of GPT-5.5 or GPT-5o.

Ethan Mollick@emollick · Apr 9

So what's the deal with Amazon Nova? They released Nova 2 in December, and even then, the top flight Nova 2 model trailed Sonnet 4.5. And it still hasn't left preview.

Summary: Amazon released Nova 2 last December. Its top model trailed Sonnet 4.5 even then, and it still has not left preview.

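Jeff Dean@JeffDean · Apr 9

Great to see the reception for the very capable Gemma 4 models!

Summary: Gemma 4 passed 10 million downloads within a week of release, and the Gemma family has now been downloaded more than 500 million times in total. Sundar Pichai shared the figures and looks forward to seeing what developers build with the model.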

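Sundar Pichai@sundarpichai · Apr 9

Lots of love for Gemma 4! Team just told me it's already had 10M+ downloads since last week's launch. Gemma models have now been downloaded 500M+ times! Excited to see what you all are creating 👀

Summary: Google's open model Gemma 4 has passed 10 million downloads just one week after release, and cumulative downloads of the Gemma family now exceed 500 million. The figures reflect a strong reception from the developer community. Pichai welcomed them and looks forward to what users create with Gemma 4.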

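karminski-牙医@karminski3 · Apr 9

It's Muse, not Avocado! Meta just released a new model! Meta has finally released Muse Spark, its first large model since Llama 4. (I'm not sure whether the name should be translated.) It is a natively multimodal reasoning model that accepts text and image input. On performance, the model is not SOTA overall. The official score table is rather sly, so I marked the top score in each row, and you can see that the model is optimized mainly for image understanding, health and medical tasks, and agentic search (it is SOTA in those three). Agents, multi-task orchestration, parallel reasoning, and visual chain-of-thought are the headline features of this release, yet the related benchmark scores do not reach SOTA. The context window and parameter count have not been disclosed, but the official post mentions in passing, "The results are clear: we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick", and also says "With larger models in development". So we can infer that this is not a flagship model but the lead-in model of a series. From the technical description in the blog, "scaling Muse Spark with multi-agent thinking enables superior performance with comparable latency", we can also infer that the idea is to run N small models reasoning in parallel rather than one large model thinking for a long time. That strategy usually makes sense only when a single model is small enough and fast enough at inference; otherwise the cost explodes. #muse #musespark #meta #llama #原生多模态推理模型

Summary: Muse Spark is Meta's natively multimodal reasoning model, its first since Llama 4, with text and image input. It reaches SOTA on image understanding, health and medical tasks, and agentic search, but scores for headline features such as agents and multi-task orchestration are not top-tier. Meta stresses an order-of-magnitude gain in compute efficiency over Llama 4 Maverick and says larger models are in development. The approach runs several small models in parallel to gain performance at comparable latency, rather than relying on one large model thinking for a long time.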
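
The "run N small models in parallel" idea can be illustrated with a toy orchestrator. Meta has not disclosed how Muse Spark aggregates its parallel agents, so the majority vote below is a stand-in technique, and `agent` is a hypothetical callable representing one model call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def parallel_vote(agent: Callable[[str, int], str], question: str,
                  num_agents: int) -> Tuple[str, List[str]]:
    """Run num_agents independent calls concurrently and majority-vote the answers.

    agent(question, seed) is a hypothetical stand-in for one model call.
    Wall-clock latency is roughly that of one call, but total compute
    (and therefore cost) scales with num_agents.
    """
    with ThreadPoolExecutor(max_workers=num_agents) as pool:
        answers = list(pool.map(lambda seed: agent(question, seed), range(num_agents)))
    winner, _count = Counter(answers).most_common(1)[0]
    return winner, answers


if __name__ == "__main__":
    # Toy agent: answers "x" unless its seed is a multiple of 3.
    def toy_agent(question: str, seed: int) -> str:
        return "x" if seed % 3 else "y"

    winner, answers = parallel_vote(toy_agent, "toy question", 9)
    print(winner, answers)
```

The cost concern in the post follows directly: nine agents cost about nine times the tokens of one, so the trade only pays off when each call is cheap and fast.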

Yuchen Jin@Yuchenj_UW · Apr 9

Meta released Avocado, they call it Muse Spark. It's not open source (a bit sad). Meta TBD lab rebuilt the entire pretraining stack in 9 months and reached similar capability with >10x less compute than Llama 4 Maverick. I still think infra is the real moat in AI labs. You can train models much faster with a good infra, and it allows researchers to experiment with many more ideas much more quickly.

Summary: Meta's TBD lab released Muse Spark (referred to before launch as Avocado); it is not open source. The team rebuilt the pretraining stack in just 9 months and reached similar capability with less than one tenth of the compute of Llama 4 Maverick. The author argues that infrastructure is the real moat for AI labs because it determines training speed and how quickly researchers can iterate on experiments.

Artificial Analysis@ArtificialAnlys · Apr 8

🇰🇷 South Korean AI lab Upstage has launched Solar Pro 3! Solar Pro 3 scores 26 on the Artificial Analysis Intelligence Index, a significant improvement over Solar Pro 2, and is currently the second strongest model released by a Korean lab.
Key benchmarking takeaways:
➤ Strength in agentic tool use and instruction following: @upstageai's Solar Pro 3 scores 71% on IFBench, which signals strong instruction following capabilities. Solar Pro 3 ranks near the frontier models in this category, scoring similarly to GLM-5 (71%) and Kimi K2.5 (70%) and is the leader among Korean models. Solar Pro 3 also scores 86% on τ²-Bench Telecom, demonstrating strong performance on agentic tool-use, making it a strong candidate for incorporation into agentic workflows.
➤ Relatively high token usage: Solar Pro 3 demonstrates relatively high token usage compared to other models in the same intelligence tier, using ~100M reasoning tokens across the Artificial Analysis Intelligence suite. This is comparable to LG's K-EXAONE (100M reasoning tokens), another Korean model.
➤ Modest accuracy and reliability: Solar Pro 3 scores -54 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. However, with 18% on the accuracy component, Solar Pro 3 does outperform Korean competitors in this metric.
➤ First-party and third-party API access: Solar Pro 3 is a proprietary model and is currently available through Upstage's first-party API
Other Relevant Model Details:
➤ Model type: Mixture of Experts (MoE)
➤ Size: 102B total parameters (12B active parameters)
➤ Context length: 128k
➤ Training data cut-off: July 2025
See below for further analysis

Summary: South Korean AI lab Upstage released Solar Pro 3, which scores 26 on the Artificial Analysis Intelligence Index, making it the second strongest model from a Korean lab. It uses an MoE architecture (102B total parameters, 12B active) and supports a 128k context window. Its main strengths are agentic tool use and instruction following: 71% on IFBench, on par with GLM-5 and Kimi K2.5, and 86% on τ²-Bench Telecom. Token usage is relatively high (about 100M reasoning tokens) and reliability is weak (AA-Omniscience score of -54), although its 18% accuracy beats other Korean models. It is available through the Upstage API.
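
The post describes AA-Omniscience as ranging from -100 to 100, with negative scores meaning more incorrect than correct answers. A minimal index with those properties is sketched below; the exact formula Artificial Analysis uses is not given in the post, so `omniscience_like_index` is an assumption in which abstentions score zero.

```python
def omniscience_like_index(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per incorrect, 0 per abstention.

    Assumed formula for illustration; not confirmed as the AA-Omniscience definition.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100 * (correct - incorrect) / total


if __name__ == "__main__":
    # Hypothetical breakdown per 100 questions: 18 correct (the reported
    # accuracy), with the remainder split so the index matches the reported -54.
    print(omniscience_like_index(correct=18, incorrect=72, abstained=10))
    # The same accuracy with more abstentions scores far better.
    print(omniscience_like_index(correct=18, incorrect=20, abstained=62))
```

Under this assumed formula, 18% accuracy and an index of -54 would imply roughly 72% wrong answers and 10% abstentions, which shows why a model that guesses instead of abstaining is penalized.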

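Dario Amodei@DarioAmodei · Apr 8

I'm proud that so many of the world's leading companies have joined us for Project Glasswing to confront the cyber threat posed by increasingly capable AI systems head-on. https://x.com/AnthropicAI/status/2041578392852517128

Summary: Anthropic has launched the Project Glasswing security initiative, joining with many of the world's leading companies to confront the cyber threat posed by increasingly capable AI systems. The project builds on its latest frontier model, Claude Mythos Preview, whose ability to find software vulnerabilities is second only to the very best human experts, and aims to protect critical software worldwide.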

François Chollet@fchollet · Apr 4

First update of the call, from Sachin: Gemma 4 is out now on KerasHub! Best open-source model so far for reasoning and agentic workflows.

Summary: First update from the call, via Sachin: Gemma 4 is now available on KerasHub and is described as the best open-source model so far for reasoning and agentic workflows.

Demis Hassabis@demishassabis · Apr 3

Gemma 4 outperforms models over 10x their size! (note the x-axis is log scale!)

Summary: Gemma 4 outperforms models more than 10x its size on benchmarks. The chart's x-axis is logarithmic, which underscores its parameter efficiency.

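karminski-牙医@karminski3 · Apr 3

Qwen3.6-Plus hands-on! What has changed in the new model? Here is a full coding test of the just-released Qwen3.6-Plus. Qwen3.6-Plus supports multimodal input, so it can reproduce a design from an image. First, the front-end tests:
case1: modeling and spatial understanding test, recreate a watch with three.js
case2: modeling and spatial understanding test, recreate a split keyboard with three.js
case3: UI layout and component test, design UI elements from a UI Kit reference image
Straight to the results: #Qwen36plus #阿里千问 #多模态模型 #AIAgent #AI编程

Summary: Hands-on testing of Qwen3.6-Plus shows strong multimodal coding ability. The model accepts image input and generates matching code; in testing it recreated a watch and a split keyboard as 3D models with three.js and generated interface components from a UI Kit reference image. The tests exercise its spatial understanding, modeling, and front-end code generation, and show it going directly from design image to code.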

Artificial Analysis@ArtificialAnlys · Apr 3

India enters the open-weights AI race with its largest models pre-trained from scratch: Sarvam 105B and Sarvam 30B
@SarvamAI's Sarvam 105B and Sarvam 30B score 18 and 12 on the Artificial Analysis Intelligence Index respectively. Announced at the India AI Impact Summit 2026 and open-sourced under Apache 2.0, both are Mixture-of-Experts models trained entirely in India using compute provided under the IndiaAI Mission (@OfficialINDIAai). Both support reasoning and non-reasoning modes.
These are an improvement from Sarvam's previous model, Sarvam M (8 on Intelligence Index, 23.6B parameters), which was based on Mistral Small rather than pre-trained from scratch. Sarvam 105B has 106B total parameters with ~10B active per token and a 128K context window. Sarvam 30B has 32B total parameters with ~2.4B active per token and a 65K context window. Alongside the text models, Sarvam also announced Saaras v3 (Speech to Text) and Bulbul v3 (Text to Speech) with a focus on Indic languages.
Key takeaways in reasoning mode:
➤ Sarvam 105B scores 18 on the Intelligence Index. Among ~100B-class open-weights reasoning models, it trails GLM-4.5-Air (23), INTELLECT-3 (22), Mistral Small 4 (27), and gpt-oss-120B (High, 33). All four peers also activate more parameters per token
➤ Sarvam 30B scores 12 on the Intelligence Index. Among ~30B-class open-weights reasoning models, it trails GLM-4.7-Flash (30), Nemotron Cascade 2 30B A3B (28), Qwen3 30B A3B 2507 (22), and Qwen3 32B (17). Sarvam 30B activates fewer parameters than these peers.
➤ Sarvam 105B's relative strength is in select agentic tasks. Its agentic index of 25 places it ahead of INTELLECT-3 (20) and GLM-4.5-Air (21) despite trailing both on overall intelligence. Its GDPval index of 773 also edges ahead of GLM-4.5-Air (665). Both new models are a large step up from Sarvam M (Reasoning), which scored 8 on the Intelligence Index.
➤ Compared to peers, both models score lower on TerminalBench Hard (Agentic Coding & Terminal Use) and AA-Omniscience. Sarvam 105B scored 1.5% and Sarvam 30B scored 2.3% on TerminalBench Hard, compared to GLM-4.5-Air (20.5%) and INTELLECT-3 (9.1%). The AA-Omniscience Index is -60 for Sarvam 105B and -72 for Sarvam 30B. Both models have high hallucination rates relative to their accuracy, and both attempt to answer far more questions rather than abstaining, which drives the negative scores.
Key model details:
➤ Modality: Text input and output only.
➤ Context window: 128K tokens (Sarvam 105B) and 65K tokens (Sarvam 30B).
➤ Pricing: Currently free on Sarvam's first-party API.
➤ License: Apache 2.0.
➤ Availability: Sarvam's first-party API; weights available on @huggingface and AIKosh.

Summary: Sarvam AI released Sarvam 105B and Sarvam 30B, India's largest open-weights models pre-trained from scratch; both use an MoE architecture and were trained in India. They score 18 and 12 on the Intelligence Index respectively and support both reasoning and non-reasoning modes. The 105B model beats some peers on agentic tasks but lags on the TerminalBench Hard coding test and has a high hallucination rate. The models are released under Apache 2.0, have context windows of 128K and 65K tokens, and are currently free through the API.

Artificial Analysis@ArtificialAnlys · Apr 3

Microsoft has released MAI-Transcribe-1: a speech transcription model achieving 3.0% on AA-WER (#4), and is fast at 69x real-time
The model was developed by Microsoft AI (MAI)'s Superintelligence team and supports 25 languages including English, French, Arabic, Japanese, and Chinese. MAI-Transcribe-1 API is currently available in public preview via Azure Speech on Microsoft Foundry.
On the Artificial Analysis Speech to Text (STT) leaderboard, MAI-Transcribe-1 achieves a 3.0% word error rate on AA-WER for speech transcription accuracy, positioning it 4th overall behind Mistral's Voxtral Small (2.9% AA-WER), Google's Gemini 3.1 Pro High (2.9% AA-WER) and ElevenLabs' Scribe v2 (2.3% AA-WER). It also stands out as one of the faster high-accuracy transcription models available, processing audio at ~69x real-time.
See more details below ⬇️

Summary: Microsoft AI's Superintelligence team released the MAI-Transcribe-1 speech transcription model. It achieves a 3.0% word error rate on AA-WER on the Artificial Analysis Speech to Text leaderboard, ranking fourth behind Mistral Voxtral Small, Google Gemini 3.1 Pro High, and ElevenLabs Scribe v2. It processes audio at about 69x real time, placing it among the faster high-accuracy models. It supports 25 languages including English, French, Arabic, Japanese, and Chinese, and its API is in public preview via Azure Speech on Microsoft Foundry.

Artificial Analysis@ArtificialAnlys · Apr 3

Google has released Gemma 4, a new family of multimodal open-weight models including Gemma 4 E2B, Gemma 4 E4B, Gemma 4 31B and Gemma 4 26B A4B
@GoogleDeepMind's new Gemma 4 family introduces four multimodal models supporting text, image, and video inputs. We evaluated Gemma 4 31B (dense) and Gemma 4 26B A4B (MoE), both with a 256k context window, while the other two smaller models support up to 128k. With 31B and 26B parameters respectively, both evaluated models can run on a single H100.
On GPQA Diamond, our scientific reasoning evaluation, Gemma 4 31B (Reasoning) scores 85.7%, the second highest result we have recorded for an open-weights model with fewer than 40B parameters, just behind Qwen3.5 27B (Reasoning, 85.8%). It reaches this score using only ~1.2M output tokens, fewer than Qwen3.5 27B (~1.5M) and Qwen3.5 35B A3B (~1.6M). Gemma 4 26B A4B (Reasoning) scores 79.2%, ahead of gpt-oss-120B (high, 76.2%) but behind Qwen3.5 9B (Reasoning, 80.6%).
We are now running the Artificial Analysis Intelligence Index on all four Gemma 4 models and will share a full update once those results are complete.

Summary: Google DeepMind released the Gemma 4 family of four multimodal open-weight models supporting text, image, and video input. The 31B (dense) and 26B A4B (MoE) models have a 256k context window and can run on a single H100; the two smaller models support 128k. On GPQA Diamond, Gemma 4 31B (Reasoning) scores 85.7%, just behind Qwen3.5 27B, while using only about 1.2M output tokens; 26B A4B (Reasoning) scores 79.2%, ahead of gpt-oss-120B.

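Sundar Pichai@sundarpichai · Apr 3

Gemma 4 is here, and it's packing an incredible amount of intelligence per parameter 👇

Summary: The Gemma 4 open models have been released in four sizes: 31B dense, 26B MoE, and effective 2B and 4B, optimized respectively for performance, low latency, and edge devices. Google DeepMind calls them the best open models for their sizes and highlights their very high intelligence per parameter.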

Demis Hassabis@demishassabis · Apr 3

Excited to launch Gemma 4: the best open models in the world for their respective sizes. Available in 4 sizes that can be fine-tuned for your specific task: 31B dense for great raw performance, 26B MoE for low latency, and effective 2B & 4B for edge device use - happy building!

Summary: The Gemma 4 open models have been released in 4 sizes: 31B dense for raw performance, 26B MoE for low latency, and effective 2B and 4B for edge devices. All can be fine-tuned for specific tasks.

Google DeepMind@GoogleDeepMind · Apr 3

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we're releasing them under an Apache 2.0 license. Here's what's new 🧵

Summary: Google released the Gemma 4 family of open models under an Apache 2.0 license. They can run on local hardware and are built for advanced reasoning and agentic workflows.

Satya Nadella@satyanadella · Apr 2

We're bringing our growing MAI model family to every developer in Foundry, including … · MAI-Transcribe-1, most accurate transcription model in world across 25 languages · MAI-Voice-1, natural, expressive speech generation · MAI-Image-2, our most capable image model yet Start building: https://microsoft.ai/news/today-were-announcing-3-new-world-class-mai-models-available-in-foundry/

Summary: The MAI model family is now available in Foundry with three new models: MAI-Transcribe-1 (described as the most accurate transcription model across 25 languages), MAI-Voice-1 (natural speech generation), and MAI-Image-2 (Microsoft's most capable image model so far). Developers can call them directly through the platform.

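karminski-牙医@karminski3 · Apr 2

Can GLM-5V-Turbo make up for GLM-5.1's modality gap? GLM-5V-Turbo has just been released! Here is a quick hands-on test. When I tested GLM-5.1 earlier, the most common complaint besides API instability was that 5.1 does not support multimodal input; Zhipu has put multimodal input mostly into its V-series models. A major use case for a flagship model with multimodal input is cloning a web page directly from a reference image, so here is a web-page cloning test of GLM-5V-Turbo. The conclusion up front: GLM-5V-Turbo keeps the text recognition accuracy of earlier models in the series, but the quality of its front-end code output is mediocre. I tested four scenarios in total:
case1: absolute positioning of a background image that requires JS calculation
case2: transparent text
case3: SVG lines dividing the frame
case4: complex DIV layout
#GLM5VTrubo #GLM5V #GLM #智谱

Summary: Zhipu released the multimodal model GLM-5V-Turbo, which fills GLM-5.1's lack of visual input. Testing shows that its text recognition accuracy holds up but its front-end code generation is mediocre. In web-page cloning tests covering JavaScript background positioning, transparent text, SVG dividers, and complex DIV layouts, its conversion of design images into precise code still has room to improve.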

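karminski-牙医@karminski3 · Apr 1

Here is a quick test of WAN-2.7-Image! Alibaba's WAN-2.7-Image has just been released. It is a large model for image generation and image editing, and its main selling points are better-looking generated people and more accurate text. I first tested text-plus-image generation: #wan27image #wan27 #阿里万相

Summary: Alibaba released WAN-2.7-Image, a large model for image generation and editing that focuses on better-looking people and more accurate text rendering. It supports text-to-image and image editing, and the author ran an initial test of its text-to-image generation. As the latest version in Alibaba's Wan series, WAN-2.7-Image shows improvements in visual quality and semantic understanding.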

Artificial Analysis@ArtificialAnlys · Mar 31

KwaiKAT has released KAT-Coder-Pro V2, a non-reasoning model that scores 44 on the Artificial Analysis Intelligence Index, an 8 point improvement from KAT-Coder-Pro V1
@KwaiAICoder has updated their flagship proprietary coding model with the release of KAT-Coder-Pro V2. KAT-Coder-Pro V2 achieves 44 on the Artificial Analysis Intelligence Index, matching Claude Sonnet 4.6 (non-reasoning) and trailing only Claude Opus 4.6 (non-reasoning, 46) among non-reasoning models. At ~9M output tokens, it is also more token efficient than Claude Opus 4.6 (~11M), Claude Sonnet 4.6 (~14M), and reasoning models with similar intelligence such as DeepSeek V3.2 (reasoning, ~61M) and Qwen3.5 397B A17B (reasoning, ~86M).
KAT-Coder-Pro V2 is a non-reasoning model, unlike all of the current frontier language models which 'think' before answering. Typically, reasoning variants score higher on the Intelligence Index than their non-reasoning counterparts, but consume more output tokens and are less suited to latency-sensitive workloads.
Key Highlights:
➤ 🧠 Higher overall intelligence, but regression in long context reasoning and knowledge recall: KAT-Coder-Pro V2 scores 44 on the Artificial Analysis Intelligence Index, an 8 point improvement from KAT-Coder-Pro V1 and matching Claude Sonnet 4.6 (non-reasoning, max effort). It performs well on tool use (90% on Tau2-Telecom), but regresses compared to KAT-Coder-Pro V1 on long-context reasoning and knowledge, falling 8 p.p. on AA-LCR (66%) and 17 p.p. on HLE (16%).
➤ 🤖 Agentic capability improvements: KAT-Coder-Pro V2 shows major improvements on our agentic evaluations. On Terminal-Bench Hard, it scores 49%, up 40 p.p. from KAT-Coder-Pro V1, making it the highest-scoring non-reasoning model, matching Claude Opus 4.6 (non-reasoning, 49%) and ahead of Claude Sonnet 4.6 (non-reasoning, 46%). KAT-Coder-Pro V2 also shows improvement in GDPval-AA, scoring 1123 (+304 Elo from V1), but still sits behind models such as DeepSeek V3.2 (1198) and Qwen3.5 397B A17B (1202).
➤ ⚙️ High token efficiency: KAT-Coder-Pro V2 is a non-reasoning model and uses fewer tokens than peers with similar intelligence. It uses 8.7M output tokens to run the Artificial Analysis Intelligence Index, below Claude Opus 4.6 (non-reasoning, ~11M) and Claude Sonnet 4.6 (non-reasoning, ~14M), though this is ~2x higher than its predecessor, KAT-Coder-Pro V1 (~4.5M). It also uses significantly fewer tokens than similarly intelligent reasoning models such as DeepSeek V3.2 (reasoning, ~61M) and Qwen3.5 397B A17B (reasoning, ~86M).
➤ $ Improved cost efficiency: KAT-Coder-Pro V2 costs $73 to run the Artificial Analysis Intelligence Index, down from $76 for V1, as it uses fewer input tokens by requiring fewer turns in agentic evaluations. This makes it one of the most cost-efficient models at its intelligence level, costing less than Qwen3.5 397B A17B (reasoning, $418) and Claude Sonnet 4.6 (non-reasoning, $1397). KAT-Coder-Pro V2 is currently priced at $0.30/$1.20 per 1M input/output tokens on StreamLake and AtlasCloud API endpoints.
➤ ⚡ Low end-to-end response time: KAT-Coder-Pro V2 runs at ~109 output tokens per second, far ahead of Claude Opus 4.6 (non-reasoning, 39 OTPS) and Claude Sonnet 4.6 (non-reasoning, 43 OTPS). Because it also has a low time to first token without any reasoning delay, it delivers one of the fastest end-to-end response times, which measures the time taken from request sent to final output returned.
Model details:
➤ Availability: KAT-Coder-Pro V2 is available via StreamLake and AtlasCloud API endpoints
➤ Context Window: 256K tokens (equivalent to KAT-Coder-Pro V1)
➤ Multi-modal capabilities: Text input and output only

Summary: KwaiKAT released KAT-Coder-Pro V2, a non-reasoning coding model that scores 44 on the Artificial Analysis Intelligence Index, 8 points above V1 and level with Claude Sonnet 4.6. It is notably token-efficient, needing only about 9M output tokens for the run, below the Claude models and far below reasoning models such as DeepSeek V3.2. Agentic capability improved sharply: 49% on Terminal-Bench Hard (up 40 percentage points), matching Claude Opus 4.6. The cost to run the index fell to 73 dollars and output speed is about 109 tokens per second. It regressed from V1 on long-context reasoning and knowledge recall.

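karminski-牙医@karminski3 · Mar 30

Quick update: GLM-5.1 is really strong. It looks set to jump from SOTA among Chinese models to true global SOTA; it went straight to first place on my vector-db-bench. I'm already editing the video and will bring you a detailed GLM-5.1 review shortly. (Also, GPT-5.4-Pro (xhigh) is really expensive; running this cost me 150 dollars yesterday... Which is arguably good news: when a model costs more than my salary, it isn't much of a competitor... [facepalm]) (The benchmark is here: http://vector-db-bench.kcores.com)

Summary: GLM-5.1 took first place on the vector-db-bench vector database benchmark, a key step from SOTA among Chinese models to global SOTA. The results show it outperforming mainstream international models. By contrast, GPT-5.4-Pro (xhigh) cost as much as 150 dollars for the test run, a clear price disadvantage. The author will publish a detailed review video covering GLM-5.1's technical performance and value for money.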

Demis Hassabis@demishassabis · Mar 27

Gemini 3.1 Flash Live is our highest quality audio & voice model yet - and a big leap towards building next-gen voice-first agents. Lower latency, better precision, more natural interactions... try it now with Gemini Live in the @GeminiApp or build with it in @GoogleAIStudio!

Summary: Google released Gemini 3.1 Flash Live, which it calls its highest-quality audio model so far, with lower latency, better precision, more natural conversation, and improved function calling. It is available now in the Gemini App and Google AI Studio.

Google DeepMind@GoogleDeepMind · Mar 26

Say hello to Gemini 3.1 Flash Live. 🗣️ Our latest audio model delivers more natural conversations with improved function calling – making it more useful and informed. Here's what's new 🧵

Summary: The Gemini 3.1 Flash Live audio model has been released. It supports more natural real-time conversation and improved function calling, making the assistant more useful and better informed.

Sundar Pichai@sundarpichai · Mar 26

Gemini 3.1 Flash Live is our highest-quality audio and voice model yet. Voice capabilities have come a long way and are a big part of how we interact with AI to get things done. 3.1 Flash Live's improved precision and reasoning make those interactions more natural and intuitive. Available in @GoogleAIStudio through the Gemini Live API in preview.

Summary: Gemini 3.1 Flash Live has been released as Google's highest-quality audio and voice model so far, with improved precision and reasoning that make interactions more natural and intuitive. It is available in preview in Google AI Studio through the Gemini Live API.

Artificial Analysis@ArtificialAnlys · Mar 26

OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per token pricing
@OpenAI released GPT-5.4 mini (xhigh, 48) and nano (xhigh, 44), the first mini and nano updates since GPT-5. Both are multimodal with image input support and feature a 400K token context window. They support the same reasoning effort levels as GPT-5.4 (xhigh, high, medium, low, none) and are priced significantly lower: mini at $0.75/$4.50 per 1M input/output tokens and nano at $0.20/$1.25, compared to GPT-5.4 at $2.50/$15.
We evaluated these models across three reasoning variants: xhigh, medium, none. While both models are more intelligent than their peers in the highest reasoning efforts, they are more verbose, using 200M+ output tokens to run the Intelligence Index, higher than even select frontier models
Key benchmarking takeaways from the highest reasoning variants:
➤ GPT-5.4 nano (xhigh, 44) jumps 18 points from GPT-5 nano (high, 27), with improvements across all evaluations. Compared to Claude Haiku 4.5 (Reasoning, 37) and Gemini 3.1 Flash-Lite Preview (34), GPT-5.4 nano leads on τ²-Bench (81% vs 55% and 31%), IFBench (76% vs 54% and 77%), and TerminalBench (42% vs 27% and 24%)
➤ GPT-5.4 mini (xhigh, 48) gains 7 points over GPT-5 mini (high, 41), with gains across most evaluations. Compared to Gemini 3 Flash Preview (Reasoning, 46) and Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 52), GPT-5.4 mini leads on TerminalBench (52% vs 39% and 53%) and CritPt (10% vs 9% and 3%)
➤ Both models score lower on AA-Omniscience compared to peers, driven primarily by high hallucination rates. GPT-5.4 mini scores -18.7 with a 90% hallucination rate, well behind Claude Sonnet 4.6 (Adaptive Reasoning, max effort, +12.4, 46% hallucination rate) and Gemini 3 Flash Preview (Reasoning, +11.6, 92% hallucination rate but 54% accuracy). GPT-5.4 nano scores -29.6 with a 74% hallucination rate, behind Claude Haiku 4.5 (Reasoning, -4.2, 26% hallucination rate) and Gemini 3.1 Flash-Lite Preview (-15.5, 82%). Both GPT-5.4 models attempt to answer far more questions than Claude Haiku 4.5 and Claude Sonnet 4.6 rather than abstaining, which drives the higher hallucination rates
➤ Both models show strong agentic performance. GPT-5.4 mini scores 1405 on GDPval-AA (Agentic Real-World Work Tasks), ahead of Gemini 3 Flash Preview (Reasoning, 1191) but behind Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 1633). GPT-5.4 nano scores 1169, close to Claude Haiku 4.5 (Reasoning, 1173) and well ahead of Gemini 3.1 Flash-Lite Preview (944)
➤ Token usage with xhigh reasoning effort is higher for both models compared to peers with highest reasoning efforts. GPT-5.4 mini used 235M output tokens to run the Intelligence Index, ~3.4x GPT-5 mini (high, 69M) and more than Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 198M) despite scoring 4 points lower. GPT-5.4 nano used 210M output tokens, ~2.4x Claude Haiku 4.5 (Reasoning, 87M) and ~4x Gemini 3.1 Flash-Lite Preview (53M)
➤ Effective cost to run the Intelligence Index reflects the higher token usage. GPT-5.4 mini (xhigh) cost ~$1,406, compared to ~$278 for Gemini 3 Flash Preview (Reasoning) and ~$3,959 for Claude Sonnet 4.6 (Adaptive Reasoning, max effort). GPT-5.4 nano (xhigh) cost ~$376, compared to ~$584 for Claude Haiku 4.5 (Reasoning) and ~$94 for Gemini 3.1 Flash-Lite Preview. GPT-5.4 nano is cheaper than Claude Haiku 4.5 on an effective cost basis despite using ~2.4x more tokens, due to its significantly lower pricing.
Overall, GPT-5.4 nano is the standout, offering a better Intelligence vs. Cost to Run Intelligence Index tradeoff than peers and GPT-5.4 mini

Summary: OpenAI released the lightweight GPT-5.4 mini and nano models, which keep the multiple reasoning effort levels and a 400K context window, with nano priced at $0.20/$1.25 per 1M input/output tokens. Benchmarks show GPT-5.4 nano leading Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview on τ²-Bench and several other tests, though it has a high hallucination rate and heavy token usage. Thanks to its very low unit price, nano's effective cost to run the Intelligence Index is still below Claude Haiku 4.5's, giving it strong value for money.
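
The link between token usage, list price, and the reported cost to run the index can be checked with simple arithmetic. The sketch below uses only figures quoted in the post and assumes every output token is billed at the listed output price; the remainder of each reported total would then come from input tokens, which the post does not break out.

```python
def output_cost_usd(output_tokens_millions: float, price_per_million_usd: float) -> float:
    """Cost of output tokens alone, assuming flat list pricing per 1M tokens."""
    return output_tokens_millions * price_per_million_usd


# Figures from the post: (output tokens in millions, output price per 1M, reported total cost).
REPORTED = {
    "GPT-5.4 nano (xhigh)": (210, 1.25, 376),
    "GPT-5.4 mini (xhigh)": (235, 4.50, 1406),
}

if __name__ == "__main__":
    for model, (tokens_m, price, reported_total) in REPORTED.items():
        out_cost = output_cost_usd(tokens_m, price)
        print(f"{model}: output ${out_cost:.2f}, "
              f"implied remainder ${reported_total - out_cost:.2f}")
```

On these assumptions, output tokens account for most of each reported total, which is why nano's low output price outweighs its verbosity.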

Artificial Analysis@ArtificialAnlys · Mar 20

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index
@MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement from Small 3.2 (15) and now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42).
Key takeaways:
➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning with reasoning and non-reasoning modes, rather than the separate reasoning variants Mistral has released previously with their Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, the model scores 19, a 4-point improvement from its predecessor Mistral Small 3.2 (15)
➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index compared to reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens
➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B support image input. All models support text output only
➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset that tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130)
➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. Mistral Small 4 scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42)
Key model details:
➤ Context window: 256K tokens (up from 128K on Small 3.2)
➤ Pricing: $0.15/$0.6 per 1M input/output tokens
➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100)
➤ Modality: Image and text input with text output only
➤ Licensing: Apache 2.0 license

Summary: Mistral released the open-weights model Mistral Small 4, a 119B-parameter MoE (6.5B active per token) with switchable reasoning and non-reasoning modes and image input. In reasoning mode it scores 27 on the Artificial Analysis Intelligence Index, above Mistral Large 3 but below peers such as gpt-oss-120B. It is more token-efficient than comparable models, hallucinates less (AA-Omniscience score of -30), supports a 256K context window, and is released under the Apache 2.0 license.
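
Elo ratings are interpreted through differences rather than ratios, so "more than double the Elo" is easier to read as an expected head-to-head preference rate. The sketch below uses the standard Elo logistic formula with a 400-point scale; the post does not state whether GDPval-AA uses exactly this scaling, so treat that as an assumption.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    The 400-point scale is the conventional default and an assumption here.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # GDPval-AA ratings quoted in the post.
    small_4, small_3_2, large_3 = 871, 339, 880
    print(f"Small 4 vs Small 3.2: {elo_expected_score(small_4, small_3_2):.3f}")
    print(f"Small 4 vs Large 3:   {elo_expected_score(small_4, large_3):.3f}")
```

Under that assumption, the 532-point gap over Small 3.2 means Small 4's output would be preferred in the large majority of pairings, while the 9-point gap to Mistral Large 3 is close to a coin flip.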

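Satya Nadella@satyanadella · Mar 20

Great to see our new image model from our Superintelligence team rolling out in Copilot and coming soon to Foundry for enterprise customers.

Summary: The MAI-Image-2 image generation model is live in MAI Playground and ranks 3rd on the arena leaderboard, handling requests that range from photorealistic styles to detailed infographics. It will soon be integrated into Copilot, Bing Image Creator, and Microsoft Foundry for enterprise customers.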

Hao AI Lab@haoailab · Mar 18

(1/N) We're launching Dreamverse. Most AI video models take minutes to generate a 5 s 1080p clip. In 4.5 seconds, we can generate 30 s 1080p clips on a single GPU. Our videos generate faster than you can watch them: stop waiting on prompts and start directing scenes live. 🕹️Demo: http://dreamverse.fastvideo.org 📑 Blog: https://haoailab.com/blogs/dreamverse Welcome to the era of vibe-directing 👇

Summary: (1/N) Hao AI Lab is launching Dreamverse. Most AI video models need minutes to generate a 5-second 1080p clip, whereas Dreamverse generates a 30-second 1080p clip in 4.5 seconds on a single GPU.

Greg Brockman@gdb · Mar 18

introducing gpt-5.4 mini:

Summary: OpenAI released GPT-5.4 mini, now available in ChatGPT, Codex, and the API. It is optimized for coding, computer use, multimodal understanding, and subagents, and is 2x faster than GPT-5 mini.

OpenAI@OpenAI · Mar 18

GPT-5.4 mini is available today in ChatGPT, Codex, and the API. Optimized for coding, computer use, multimodal understanding, and subagents. And it's 2x faster than GPT-5 mini. https://openai.com/index/introducing-gpt-5-4-mini-and-nano/

Summary: GPT-5.4 mini is available today in ChatGPT, Codex, and the API. It is optimized for coding, computer use, multimodal understanding, and subagent scenarios, and runs 2x faster than GPT-5 mini.

Greg Brockman@gdb · Mar 17

gpt-5.4 has ramped faster than any other model we've launched in the API: within a week of launch, 5T tokens per day, handling more volume than our entire API one year ago, and reaching an annualized run rate of $1B in net-new revenue. it's a good model, try it out!

Summary: Within a week of launch, GPT-5.4 reached 5T tokens per day, more than the entire API handled a year ago, and an annualized run rate of 1 billion dollars in net-new revenue, the fastest ramp of any model launched in the API. The model is good and worth trying.

Sam Altman@sama · Mar 8

GPT-5.4 is great at coding, knowledge work, computer use, etc, and it's nice to see how much people are enjoying it. But it's also my favorite model to talk to! We have missed the mark on model personality for awhile, so it feels extra good to be moving in the right direction.

Summary: GPT-5.4 is strong at coding, knowledge work, computer use, and more, and it is good to see how much people enjoy it. It is also Altman's favorite model to talk to. OpenAI had missed the mark on model personality for a while, so moving in the right direction feels especially good.