Artificial Analysis @ArtificialAnlys · June 17 · 20

To mark the release of Artificial Analysis Intelligence Index v4.1, we're bringing together researchers, engineers, and builders working at the frontier of AI in San Francisco on June 29. Join us for an evening of talks on AI evaluation, model intelligence, and the tradeoffs between cost, speed, and performance. Apply to attend 👇 https://luma.com/qdl9mr2e


Ethan Mollick @emollick · June 17 · 29

This was not a good benchmark before it was updated and it is not a good benchmark now. Having AIs evaluate the work of other AIs on publicly available questions from a different closed benchmark doesn’t tell you very much. And it is unclear how they establish the human ELO.

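Summary: The new GDPval-AA v2 is the highest-weighted evaluation in Intelligence Index v4.1. The upgrade re-baselines Elo to human performance at 1000, introduces a rotating panel of frontier-model judges, and raises the turn limit from 100 to 250. Claude Fable 5 (with fallback) leads at 1818 but is currently unavailable; Claude Opus 4.8 scores 1638 and GPT-5.5 (xhigh) scores 1531. Ethan Mollick's criticism: having AIs evaluate other AIs' work on publicly available questions drawn from a different closed benchmark says little, the way the human Elo is established is unclear, and the benchmark was not good before the update and is not good now.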
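To give a feel for what an Elo gap over the human baseline of 1000 would mean, here is a minimal Python sketch. It assumes the standard logistic Elo formula with a 400-point scale; the post does not say which formula GDPval-AA v2 uses, so treat the function and its `scale` parameter as assumptions rather than the benchmark's method.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo model.

    Assumption: standard logistic Elo with a 400-point scale; the
    benchmark's actual rating procedure is not described in the post.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


HUMAN_BASELINE = 1000  # GDPval-AA v2 re-baselines Elo so human performance sits at 1000

# Scores quoted in the summary above.
reported = {
    "Claude Fable 5 (with fallback)": 1818,
    "Claude Opus 4.8": 1638,
    "GPT-5.5 (xhigh)": 1531,
}

for name, rating in reported.items():
    p = elo_expected_score(rating, HUMAN_BASELINE)
    print(f"{name}: expected score vs human baseline = {p:.3f}")
```

Under this assumed formula, equal ratings give an expected score of 0.5 and every 400-point lead multiplies the odds by ten, which is why the criticism about how the human Elo is set matters: the whole scale hangs off that anchor.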

Chubby♨️ @kimmonismus · June 17 · 69

Open Source is so back. Let’s freaking go

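Summary: GLM-5.2 has jumped to first place in the Design Arena code category with an Elo of 1360, overtaking the now-withdrawn Claude Fable 5, and its weights are open. This is one of the highest Elo scores in the code category since the leaderboard launched, up 4 places and 27 Elo points from its previous standing.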

elvis @omarsar0 · June 17 · 56

Impressive if true! Better than Claude Fable 5? Wow! Design is really lacking in these frontier models, so I'm very curious to test GLM-5.2 myself. Testing this already on a few internal use cases and will report back on findings.

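Summary: Zhipu released GLM-5.2, which jumped to No. 1 in the Design Arena evaluation with an Elo of 1360, surpassing the withdrawn Claude Fable 5 and gaining 4 places and 27 Elo points. The model is open-weight. DAIR.AI founder Elvis Saravia says it is impressive if true, that he is already testing it on internal use cases, and that he will report back on the results.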

MiniMax (official) @MiniMax_AI · June 17 · 25

happy world cup everyone ⚽️ FWC-Bench when?

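Summary: MiniMax's M3 model correctly predicted a draw in the Qatar vs Switzerland World Cup match, the only correct pick among five models and one human forecaster. Analysis with Kilo CLI shows the benchmark deliberately excludes betting odds, so Switzerland's 64% market odds were not factored in. M3 based its call on both sides' identical WWDLW form, Qatar's higher raw rating, and Switzerland's stronger league level. The post also asks "FWC-Bench when?", hinting at a possible new benchmark.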

Ethan Mollick @emollick · June 17 · 32

Compare GPT-5.2 from 7 months ago with the new GLM-5.2 Deep Think Max on the prompt "create a visually interesting shader that can run in twigl with an infinite city of neo-gothic towers partially drowned in a stormy ocean with large waves", followed by "Make it better". GLM-5.2 also had a couple of errors.

Summary: Ethan Mollick compares GPT-5.2 from 7 months ago with the new GLM-5.2 Deep Think Max, using the same prompt to generate a shader that runs in twigl (an infinite city of neo-gothic towers half-drowned in a stormy ocean). GLM-5.2 produced a few errors. Ethan previously had early access to GPT-5.2 and showed a version of this shader generated in a single shot by GPT-5.2 Pro.

Rohan Paul @rohanpaul_ai · June 17 · 72

This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning.

The unsettling part is not that frontier models make arithmetic mistakes. It is that they can reach the right answer, see the right answer in someone else’s solution, and then forgive broken logic that should have been easy to catch. The authors call this the production-evaluation gap: the gap between generating a solution and evaluating whether a given solution actually earns its conclusion.

Their Valid-Answer-Invalid-Reasoning (VAIR) benchmark makes the trap clean. The final answer is correct, but the reasoning is damaged by missing steps, shuffled steps, missing premises, or circular explanation. A careful evaluator should say, “Yes, the answer is right, but the argument does not justify it.” Many reasoning models instead appear to do something lazier and more dangerous: they solve the problem themselves, confirm the final answer, and then rationalize the path as acceptable. That is not reasoning vigilance. It is answer confirmation bias wearing the costume of mathematical judgment.

The mechanism matters because modern AI training often rewards outcomes more than valid intermediate thought. A model trained to get the answer may learn to treat the answer as the evidence, especially when grading another chain of reasoning. Humans were not perfect here, but the contrast is revealing: people showed only a small drop from solving to grading, while models collapsed much more sharply on the same kind of task.

This is where the result becomes larger than math. If AI systems can mass-produce plausible arguments but cannot reliably police the logic inside them, they become engines of confidence rather than engines of understanding.

----
Link: arxiv.org/abs/2606.01462
Title: "An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models"

Summary: A new paper describes a "production-evaluation gap" in large reasoning models: models can solve math problems and reach correct answers, but when evaluating someone else's reasoning they often accept it as long as the final answer is right, even when the logic has obvious flaws such as missing steps, shuffled steps, missing premises, or circular argument. The authors introduce the VAIR (Valid-Answer-Invalid-Reasoning) benchmark to test this. The behavior is described as answer confirmation bias: the model judges reasoning by the correct answer rather than by valid logic. Compared with humans, models drop more sharply from solving to evaluating, which suggests AI may become an engine of confidence that produces plausible-looking arguments rather than an engine that understands the reasoning it outputs.
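The four kinds of damage described above (missing steps, shuffled steps, missing premises, circular explanation) can be sketched as simple transformations of a valid solution. This is an illustrative Python sketch only: the function name, the item fields, and the specific transformations are assumptions, not the paper's actual construction.

```python
def make_invalid_reasoning_variants(steps: list[str], answer: str) -> list[dict]:
    """Build answer-preserving but logically damaged variants of a solution.

    Hypothetical helper: the real VAIR benchmark may construct its items
    differently. Every variant keeps the correct final answer, so a grader
    that only checks the answer would wrongly accept all of them.
    """
    if len(steps) < 3:
        raise ValueError("need at least three steps to perturb")
    variants = {
        # Remove a middle step, leaving an unjustified jump.
        "missing_step": steps[:1] + steps[2:],
        # Rotate the steps so conclusions appear before their support.
        "shuffled_steps": steps[1:] + steps[:1],
        # Drop the opening premise the rest of the argument relies on.
        "missing_premise": steps[1:],
        # Assume the conclusion, then "derive" it.
        "circular": [f"Assume the answer is {answer}.", steps[-1]],
    }
    return [
        {"kind": kind, "steps": damaged, "answer": answer, "reasoning_valid": False}
        for kind, damaged in variants.items()
    ]


solution = ["x + 3 = 7", "Subtract 3 from both sides", "x = 4"]
for item in make_invalid_reasoning_variants(solution, "4"):
    print(item["kind"], "->", item["steps"])
```

A grader would then be scored on whether it labels each variant as invalid reasoning despite the matching answer, which is the judgment the post says models tend to get wrong.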

OpenAI @OpenAI · June 17 · 31

Let’s talk about evals. We’re always looking for better ways to measure and forecast model progress, especially as benchmarks get saturated or gamed. @tejalpatwardhan, who leads our frontier evals team, spoke to @andrewmayne about why evals matter and what models need to be judged on next.


fofr @fofrAI · June 16 · 24

Did you know Omni is good at text?


Artificial Analysis @ArtificialAnlys · June 16 · 60

Announcing Artificial Analysis Intelligence Index v4.1: a shift toward agentic workloads, featuring upgraded benchmarks and new per-task metrics

The Artificial Analysis Intelligence Index is our synthesis metric for assessing model intelligence and tracking AI progress. v4.1 marks a broader shift toward agentic workloads, with three main changes:

1. Updated and reweighted evaluations toward agentic tasks: We upgraded three evaluations, removed one, and reweighted the Intelligence Index:
➤ Upgraded Terminal-Bench Hard to Terminal-Bench 2.1 and τ²-Bench Telecom to τ³-Bench Banking. Both move to newer, more robust task sets with harder, more realistic agentic scenarios that better separate frontier models
➤ Upgraded GDPval-AA to GDPval-AA v2. The upgrade re-baselines Elo to human performance at 1000, introduces a rotating panel of frontier-model judges, and raises the turn limit from 100 to 250 for longer-horizon agent trajectories
➤ Removed IFBench due to saturation. The benchmark no longer distinguishes frontier models sufficiently, so we have removed it from the Intelligence Index. We will continue to run it and publish results on new model releases
2. Cost per Task, Time per Task, and Tokens per Task: Three new per-task metrics, reported for every model and based on the Intelligence Index. We take the total cost, total time, and total output tokens for a model to run the Intelligence Index and divide by the number of tasks across its evaluations, giving the average cost, time, and output tokens to complete a single Intelligence Index task
3. Cached input token reporting: We now report cached input tokens and their impact on cost, including the cost to run the Intelligence Index, to better reflect the real cost of running each model

Key Results:
➤ Leading models: Claude Fable 5 (with Opus 4.8 fallback, 60) leads the Artificial Analysis Intelligence Index v4.1 by four points but is currently unavailable, leaving Claude Opus 4.8 (max, 56) as the most intelligent available model, ahead of GPT-5.5 (xhigh, 55)
➤ Open weights leading models: Among open weights models, DeepSeek V4 Pro (max, 44) and MiniMax M3 (44) lead, followed by Kimi K2.6 (43) and MiMo-V2.5-Pro (42)
➤ Cost per Task: Claude Opus 4.8 (max) is the most expensive available model at $1.78 per task, with Claude Fable 5 the highest overall at $3.25. GPT-5.5 (xhigh) scores within a point of Opus 4.8 on the Intelligence Index at $0.99 per task. DeepSeek V4 Pro (max) stands out on the Intelligence vs Cost per Task chart at $0.04 per task, with other leading proprietary models costing 20x to 45x more
➤ Time per Task: time per task (inference decode time) ranges from 1.5 minutes for Grok 4.3 (high) to 13.5 for Claude Sonnet 4.6 (max), a roughly 9x spread. Claude Opus 4.8 (max) completes a task in 6.4 minutes and GPT-5.5 (xhigh) in 3.7, while Gemini 3.1 Pro Preview stands out on the Intelligence vs Time per Task chart at 1.6 minutes for a score of 46

Summary: Artificial Analysis released Intelligence Index v4.1, shifting toward agentic tasks. It upgrades to Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2 (Elo re-baselined, frontier-model judges introduced, turn limit raised to 250) and removes the saturated IFBench. It adds per-task cost, time, and output-token metrics plus reporting of cached-token impact. Key results: Claude Fable 5 (60) leads but is unavailable; among available models Claude Opus 4.8 (max) is first at 56, with GPT-5.5 (xhigh) at 55. Open-weights DeepSeek V4 Pro and MiniMax M3 both score 44. On cost, Opus 4.8 is $1.78 per task, GPT-5.5 $0.99, and DeepSeek V4 Pro only $0.04. On time, Grok 4.3 is fastest (1.5 minutes), Opus 4.8 takes 6.4 minutes, GPT-5.5 takes 3.7 minutes, and Gemini 3.1 Pro Preview scores 46 at 1.6 minutes.
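The per-task metrics described in the post are plain averages: totals for a full Intelligence Index run divided by the number of tasks. A minimal Python sketch follows; the `IndexRun` type, its field names, and the example totals are hypothetical placeholders, since the post does not publish run totals or task counts.

```python
from dataclasses import dataclass


@dataclass
class IndexRun:
    """Totals for one model's full run of the index (hypothetical structure)."""
    total_cost_usd: float
    total_time_min: float
    total_output_tokens: int
    num_tasks: int


def per_task_metrics(run: IndexRun) -> dict[str, float]:
    """Divide run totals by the task count, as the post describes."""
    if run.num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return {
        "cost_per_task_usd": run.total_cost_usd / run.num_tasks,
        "time_per_task_min": run.total_time_min / run.num_tasks,
        "tokens_per_task": run.total_output_tokens / run.num_tasks,
    }


# Made-up totals chosen only to show the arithmetic; not published figures.
example = IndexRun(total_cost_usd=178.0, total_time_min=640.0,
                   total_output_tokens=1_000_000, num_tasks=100)
print(per_task_metrics(example))
```

Because these are means over every evaluation in the index, a few very long agentic tasks can dominate the figure, which is worth keeping in mind when comparing models on the cost and time charts.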

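meng shao @shao__meng · June 16 · 69

Cua and Snorkel AI jointly release "Cua-Bench": evaluating agents' computer-use ability on professional software @trycua @SnorkelAI

Cua-Bench's first public dataset focuses on KiCad, a full electronic design automation tool. All 25 tasks were written by practicing electrical engineers and reviewed by a second person, covering real work scenarios from "change one capacitor value" to "build a dual op-amp circuit from scratch". https://cua.ai/cuabench/report https://snorkel.ai/blog/cua-bench-benchmarking-computer-use-agents-on-professional-software/

In the first round of results no model passed a quarter of the tasks; the strongest reached only a 24% full-pass rate:
1. GPT-5.5: 6 / 25 fully passed, 0 / 25 partially passed
2. Claude Sonnet 4.5: 5 / 25 fully passed, 3 / 25 partially passed
3. Claude Haiku 4.5: 5 / 25 fully passed, 3 / 25 partially passed

Most important finding: a capability cliff between "editing existing" and "building from scratch"
· Every fully passed task was a local modification of an existing schematic (changing component values, swapping power ports, adjusting bias points, and so on).
· 16 from-scratch tasks: 0 successes. Models can place components but rarely complete the wiring; at the end of the task the connections are often still unfinished.

The bottleneck is in the execution layer: planning multi-step workflows, locating and operating controls in a complex GUI, self-verification, and keeping the task from drifting before the step budget runs out.

Snorkel's deeper analysis further notes that the step cap is not the main cause. Two failed tasks still failed when the cap was relaxed to 500 steps, while every success finished within 150 steps. The problem is planning and operating efficiency, not simply "not enough time".

Typical failure modes (reproducible and classifiable)
· High navigation overhead (~84%): first-launch dialogs and entering the PCB editor instead of the schematic editor; recovery alone consumes 25–70 steps.
· Overly fine action granularity (~84%): one click per turn plus long self-narration, so what an engineer does in three steps is split into ten turns.
· Chaotic view control (~76%): not using the Home key to fit, scrolling back and forth between extreme zoom levels, and "losing" components as soon as they leave the view.
· Unfinished wiring (~72%): of the 16 tasks that failed from step exhaustion, none drew all the required connections.
· Unreliable self-verification: 5 outputs declared DONE did not actually pass verification. The agent reads what it "said" rather than the real state on screen. Typical errors: a floating resistor reported as connected; entering 2.80kOhm instead of the 2.8k that KiCad requires; using the wrong chip reference voltage (LT3010 is 0.808V, not 1.24V).

Root-cause distribution: planning ~40%, perception ~22%, navigation inefficiency ~19%, domain knowledge ~11%, tools/API ~8%. There were zero API errors throughout, which indicates the harness itself is fine and the problem is how agents use it.

Implications for the industry
1. Existing computer-use benchmarks may overestimate real capability. The browser strategy of "try a few more times and you will eventually guess right" does not work in professional software.
2. "Can answer circuit questions" ≠ "can produce a correct schematic in KiCad". Knowledge and GUI execution are two separate capability lines; current frontier models are adequate at the former and clearly lacking at the latter.
3. Long horizon plus self-verification is the next bottleneck. What is missing is not low-level ability but the meta-policy of how to plan, batch operations, and read UI state rather than one's own narration.
4. The evaluation design is worth borrowing: expert-written tasks, two-person review, objective netlist scoring, and task difficulty calibrated to about 50 human steps. This is a fairly fair yardstick for whether agents can create real economic value.

Summary: Cua and Snorkel AI jointly released Cua-Bench; its first public dataset focuses on the electronic design tool KiCad, with 25 tasks written and reviewed by practicing electrical engineers. In testing, GPT-5.5 fully passed 6/25 (24%), while Claude Sonnet 4.5 and Haiku 4.5 each passed 5/25 (20%). All successful tasks were local modifications; all 16 from-scratch tasks failed. The bottleneck is in execution: high navigation overhead (~84%), overly fine action granularity (~84%), chaotic view control (~76%), unfinished wiring (~72%), and unreliable self-verification. The step cap is not the main cause. Root-cause distribution: planning ~40%, perception ~22%, navigation inefficiency ~19%, domain knowledge ~11%, tools/API ~8%, with zero API errors throughout.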

Epoch AI @EpochAIResearch · June 16 · 47

Claude Fable 5 achieves a new high score of 161 on the Epoch Capabilities Index! This beats out GPT-5.5 Pro by 1 point, and is the first time Anthropic has taken the lead on the ECI in over a year.


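AYi @AYi_AInotes · June 16 · 68

Seedance 2.0 costs nearly 4x as much as Grok, yet the quality of this generated video holds up just as well. And this came from a one-sentence prompt, folks. I only wanted to test how well Grok understands Chinese period-costume style, and it really exceeded expectations.

Summary: The user compares video generation from Seedance 2.0 and Grok, finding Seedance 2.0 nearly 4x more expensive with comparable quality; a one-sentence prompt testing Grok's grasp of Chinese period-costume style exceeded expectations. The quoted post notes that a hybrid GPT Image 2 plus Grok workflow is highly cost-effective: SuperGrok is $30 per month, currently 67% off for 3 months, with near-zero marginal cost per short clip. GPT Image 2 handles character and style consistency, and the resulting images are then dropped into Grok for motion.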

Rohan Paul @rohanpaul_ai · June 16 · 54

"You don’t need frontier scale to reach frontier quality" in specialized domains, you need the right expert feedback loop. Heidi says it matched Sonnet 4.6 in clinical search with a much smaller model trained on clinician preferences instead of raw scale. Heidi Evidence is a clinical search tool where doctors ask medical questions and get sourced answers. Here, clinicians were shown the same medical question with 2 anonymous answers, one from Heidi’s smaller model and one from Sonnet 4.6, and they picked Heidi’s answer 49.9% of the time. In medicine specifically, the hard problem is knowing when to search, what to cite, how much to say, and when a vague answer is worse than no answer.

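Summary: Clinical search tool Heidi Evidence says that six weeks ago its in-house small model matched the quality of the frontier-scale Sonnet 4.6 on clinical search tasks. The approach was training on clinician preference feedback rather than simply scaling the model. In a blinded test, clinicians shown the same medical question with two anonymous answers chose the Heidi small model's answer 49.9% of the time. Heidi notes that the hard part in medicine is knowing when to search, what to cite, how much to say, and when a vague answer is worse than no answer.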

Ethan Mollick @emollick · June 15 · 53

Weird headline - I am not sure solving 7 out of 10 novel very hard problems meant AI "did not live up to the task," when 15 months ago LLMs couldn't do math. But the actual study is interesting and illuminates flaws & successes of AIs in math. https://1stproof.org/assets/docs/report.pdf

Quoted post (@Nature): AI has faced its most rigorous mathematical test, yet it did not live up to the task https://go.nature.com/4oqlNk6

Chubby♨️ @kimmonismus · June 15 · 45

An AI editor and a pro editor cut the same 4-hour video project. They made the same cuts 84% of the time. Still their own test, and the last ~16% is where a human's judgment wins. But a draft in minutes at ~60% less prep sounds really exciting.

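Summary: An AI video editor rebuilt on top of Premiere Pro and a professional editor cut the same 4-hour video project, and 84% of their cuts were the same. The AI editor can produce a draft within minutes and saves about 60% of prep time. The remaining ~16% of differences still need human judgment. The tool was tested with editors who work behind the scenes on projects such as Key & Peele and Beast Games.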

Ethan Mollick @emollick · June 15 · 47

This is a good methodological thread on the debate over a new paper that suggests generalist models beat specialized medical AIs. (And a good overview of the challenges of benchmarking AIs in medicine)


Rohan Paul @rohanpaul_ai · June 14 · 68

Univ of Texas paper shows AI agents can slowly become less reliable after deployment, even when the model itself does not change.

The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become “a daily medication,” two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact.

The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory.

The deeper lesson is that “give it more memory” is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent’s decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure.

----
Link: arxiv.org/abs/2605.26302
Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

Summary: A University of Texas paper finds that after deployment, even with an unchanged model, AI agents lose reliability through summary compression of long-term memory, confusion between similar memories, stale fact updates, and maintenance operations. For example, a medication dose can become "a daily medication", similar client records can blur together, a canceled subscription can remain active, and a schedule can vanish after maintenance. The paper proposes the AgingBench benchmark to assess agent reliability across many sessions. It stresses that "adding more memory" is often the wrong repair: the fact may never have been written, may have been written and then crowded out, or may be present but not trusted and used. The paper reframes deployed agents as systems that resemble aging infrastructure.
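The "summaries can drop key details" failure can be illustrated with a toy multi-session check in Python. This is a sketch under loud assumptions: `compact` is a crude truncation standing in for an LLM summarization pass, and the function names and the character-budget schedule are invented for illustration; none of it is AgingBench's actual harness.

```python
from typing import Optional


def compact(notes: list[str], max_chars: int) -> list[str]:
    """Stand-in for a summarization pass: truncate each note (assumption)."""
    return [note[:max_chars] for note in notes]


def fact_survives(notes: list[str], fact: str) -> bool:
    """Check whether the exact fact is still present in any stored note."""
    return any(fact in note for note in notes)


def first_session_with_loss(notes: list[str], fact: str,
                            budgets: list[int]) -> Optional[int]:
    """Run one compaction per session; return the first session (1-based)
    after which the exact fact is gone, or None if it survives them all."""
    for session, max_chars in enumerate(budgets, start=1):
        notes = compact(notes, max_chars)
        if not fact_survives(notes, fact):
            return session
    return None


memory = ["Patient takes metformin 500 mg twice daily"]
# The note still reads plausibly after each pass, but the dose eventually drops out.
print(first_session_with_loss(memory, "500 mg", budgets=[40, 30, 20]))
```

The point of checking after every session, rather than once at the start, is that the stored note keeps sounding competent ("Patient takes metformin...") even after the exact dose has been lost.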

Rohan Paul @rohanpaul_ai · June 14 · 65

Adaline just launched a self-improvement layer for AI agents that turns messy production traces into fresh evals, synthetic edge cases, and better agent candidates for humans to approve. I expected it to be a regular trace viewer, but it is reading my production traffic and building evals I would never have considered. It reads production traffic and user feedback, then clusters the mess into recognizable agent behaviours rather than asking a human to manually inspect every strange conversation.

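Summary: Adaline 2.0 introduces a self-improvement layer for AI agents that automatically turns production traffic and user-feedback traces into behavior clusters, then generates evals and synthetic edge-case data, and from these produces new agent candidates. Developers only need to review the winning version before shipping it. The tool removes the need to inspect each odd conversation by hand and can surface eval cases that humans would be unlikely to think of.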

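数字生命卡兹克 @Khazix0918 · June 13 · 71

http://x.com/i/article/2065786589650026496

# Hands-on with GLM-5.2: another new peak for Chinese coding models

Lately the world has become surreal enough to make you sigh.

This morning, Anthropic received a letter from the US Department of Commerce. The content was simple: citing national security, it demanded that Anthropic immediately suspend access to Fable 5 and Mythos 5 for all foreign nationals. That covers not only users outside the US but also foreign nationals inside the US, and even foreign employees at Anthropic itself.

Then Anthropic made a decision nobody expected: to ensure compliance, it shut Fable 5 and Mythos 5 down for all users, so Americans cannot use them either. The news drew 50 million views on X. It caused an uproar and blew up across the internet.

I woke up from a nap at noon and my heart sank. In pure code execution, I think Opus 4.8 and GPT 5.5 can do the job too, but Claude Fable 5's ability to construct a plan, its architectural ability, and the completeness and thoroughness of its output are unmatched by any other model. I had just had it finish optimizing AIHOT's curation algorithm along with a full mobile adaptation and refactor, and today I was about to finish building the mini program when it simply disappeared...

After just 4 days, the model billed as the strongest in the world was recalled and taken fully offline.

Add to that the backdrop of this World Cup, which emphasizes global unity, while a Somali World Cup referee was barred from entering the US and so missed the tournament. The shape of the world seems to be changing more and more. It also seems to be getting more and more closed.

Just as we were gloomily watching all this, at 2:19 pm Zhipu suddenly posted an announcement.

"At a moment when some frontier models suddenly become unavailable, we choose to believe in another path: frontier intelligence should not belong only to a few, nor be revocable at any time by a few rules. It should be open, available, buildable, and in service of every developer."

My WeChat Moments feed was instantly flooded. And this time, GLM 5.2 remains open source.

I do not need to repeat how good GLM 5.1's reputation is in tech and AI circles. It is basically the acknowledged pride of Chinese models, one of the few that can go toe to toe with Claude and GPT, and on coding and agent capability it is my first recommendation to every friend who cannot use overseas models. If it were not for compute constraints (China has almost no GPUs, and for both training and inference has orders of magnitude less than abroad), I truly believe companies like Zhipu and DeepSeek could absolutely build models on par with those two overseas companies.

This release came very suddenly. When I saw it I was still out eating. I cancelled everything for the afternoon and rushed home. Luckily my Coding Plan was still active, so I got access to GLM 5.2.

A note here: what went live today for GLM 5.2 is Zhipu's Coding Plan. You can think of Coding Plan as the equivalent of a Claude or GPT subscription, meaning only subscribers can use it. API access goes live next week, and the model will be open-sourced directly.

The 5:21 launch time is also a pointed joke. Anthropic received the letter at 5:21, so Zhipu chose to open up at 5:21. One side is closing the door, the other is opening it. One side says frontier intelligence is a national security risk, the other says frontier intelligence belongs to everyone. It is genuinely hilarious, and the drama is maxed out.

The slightly painful part of Coding Plan is that they have too little compute to serve every user's inference requests, so Coding Plan has to be capped. In other words, if you want to buy it, you have to grab a slot... So if you want to use it, set an alarm for 10 am every day and go grab one.

After testing it myself and comparing notes with some friends, I want to say: this is the new peak for Chinese models. At least at my level, apart from the compute problem that makes it feel slow, in pure results, as long as the work is not heavily design-oriented, GLM 5.2 does not seem far behind Opus 4.8 on tasks. On large projects, long tasks, backend work and so on it is strong, very strong. The gap, I think, lies in how advanced and complete the up-front plan is, and in design.

There are many strengths. I can understand what GLM 5.2 outputs, I can talk things through with it, hallucination is extremely low, it is rock solid, and this time the context length has finally been raised to 1M, which is great. In testing, at around 400–500k of context, accuracy and instruction following were not far from Claude and very stable; my Claude.md was still followed without problems at 400K. I usually like to checkpoint manually at that point with my 洁癖.skill, and I rarely get into the 500k–1M range beyond that.

The biggest pity is that GLM 5.2 still has no multimodality; it remains a text-only model.

As a worker there is nothing wrong with it. My verdict is that it is more like a diligent old ox: it will definitely get the job done well. It is certainly not as smart as something at the level of Claude Fable 5, and it is a bit behind Opus 4.8 too, but it is already very good.

Here is an example, a small task on AIHOT today. A while ago, for my own learning and to save myself some time, I used some fun tricks to monitor a few WeChat official accounts I read often, so that I hear about news right away. Today I found a bug: Zhipu's official account is one I monitor, and today's GLM 5.2 news was posted at 2:19, yet AIHOT did not pick it up; I only saw it at 4 when Zhipu posted on X. That is very strange, so I handed the problem straight to GLM 5.2.

In fact, while it was working on it I already roughly knew the cause. We switched monitoring approaches a while back, and two approaches now run side by side in a gradual online rollout. Most likely the third-party API account we switched to had run out of money; I had meant to top it up the day before yesterday but forgot. Still, a small matter like this is a good way to gauge how smart a model is. The project is about 100,000 lines of code, and because of all the monitoring and scheduling, the backend logic is somewhat complex.

GLM 5.2 then located the issue: in essence Zhipu had not posted an article for several days, which had nothing to do with our scraping system... It then kept reasoning down that path, assuming our whole monitoring system was bugged. It finally found the answer, then asked whether I wanted it to set up monitoring. The whole thing took 21 minutes.

Claude Opus 4.8's thinking process was almost identical to GLM 5.2's. The only difference is that in fast mode it finished in 6 minutes; without fast mode it normally takes about 10. In other words, Claude Opus 4.8 was about twice as fast as GLM 5.2, but the process and the result were identical. That is essentially a gap in infra and compute, an infrastructure issue.

I then casually had GLM 5.2 do a follow-up. My docs and memory are extremely well organized, and there is a dedicated Feishu alert group fed by a Feishu bot. So I was confident GLM 5.2 could do it; the real test was whether it could, in the shortest time, find a way to alert on balance, find my group, and finish the job. Filling in the process + checking code and docs + development + running tests + merging + iterating memory and docs with 洁癖.skill: completed perfectly in 26 minutes. Verified, no problems.

Then I had it take on something a bit bigger: converting the AIHOT website into a mini program. I had originally planned to do this with Fable 5 today, but Fable 5 is unavailable, so GLM 5.2 it is... The prompt was just dropping in the mini program's development directory and development docs, then saying: help me make a mini program version of AIHOT.

After some research GLM 5.2 asked me 2 questions. I mindlessly picked the first option. It then drew up a plan and, once the plan was done, ran 4 agents in parallel for development. About 40 minutes later the mini program was finished.

There were no real bugs: everything was clickable, there were no errors, and all the expected features and information were there. It is just really ugly = = The bottom tab bar also had a small bug: the background was missing and the tabbar adaptation was not done properly, which took some tweaking to fix. But in other areas such as logic display and API calls there were almost no problems. On somewhat larger tasks, GLM 5.2 really is rock solid.

To turn this into a complete mini program you would definitely still need to fine-tune against the UI bit by bit. Compared with Claude, whether Fable or Opus, it does still fall a bit short in terms of peace of mind. As for the gap in design taste, I think there will only be a qualitative leap once GLM adds multimodal capability.

Then I had GLM 5.2 use Three.js to build a gamified online camp that our community wants to run in the future; this is the result from a single round. You can see that stability is fine, but the aesthetics are merely usable; if you ask how pretty or polished it is, there is definitely still a gap.

Skill building is also an important part of what models do now, so I tested with the computer-cleanup skill from before. This too was built from scratch by describing it in words, and the end result felt basically no different from the Skill developed with Opus 4.8. You can take a look at the result.

In my limited time with it, GLM 5.2 was overall a very pleasant surprise and exceeded my expectations. As long as you set aside aesthetics and multimodality, in my experience it really can go toe to toe with Opus 4.8.

With that, I think there are now two Chinese models well worth using. For anything involving agents and coding, I recommend without hesitation GLM 5.2 + the Claude Code framework; this is the strongest combination you can use in China right now. For general-knowledge tasks such as planning and writing, I recommend without hesitation DeepSeek V4 Pro, which I currently consider the model with the best world knowledge.

At the end of today's official-account article, Zhipu wrote two lines in English.

A step closer to frontier intelligence for everyone. The future of AI is open, and it is for the people.

I find these two sentences especially moving in today's context.

In the AI race of 2026, jaw-dropping things happen every day. One side is building walls, the other is building roads. But I still firmly believe that these walls will inevitably fall before the surging tide. Intelligence should be dedicated to everyone. A new era will surely come.

Summary: Citing national security, the US Department of Commerce required Anthropic to restrict foreign nationals' access to Fable 5 and Mythos 5, and Anthropic shut both models down outright. The same day Zhipu released GLM 5.2, to be open-sourced, launching it through a Coding Plan that users must compete for, with API access next week. Hands-on findings: the context window grows to 1M, and at 400–500k accuracy and instruction following are close to Claude; code engineering is very stable with low hallucination; a small task took 21 minutes with the same result as Opus 4.8 but at roughly half the speed. Drawbacks: text-only, no multimodality, slow inference. The author considers it a new peak for Chinese coding models and recommends GLM 5.2 + the Claude Code framework.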

Artificial Analysis @ArtificialAnlys · June 13 · 53

Today is the first time our Intelligence Frontier chart has moved backward.


AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · June 13 · 65

In ONE year, AI went from being able to solve ~none of the hardest math problems to solving almost ALL of them


Rohan Paul @rohanpaul_ai · June 13 · 45

NVIDIA just posted the first agentic AI benchmark results where GB300 NVL72 runs up to 20x more coding agents per megawatt than H200. Older inference benchmarks mostly ask how fast a system can produce tokens after one prompt. AgentPerf from Artificial Analysis asks a harder question: how many agents can run at the same time while still feeling responsive. It tests a harder workload than normal LLM serving because an agent is not one request and one answer, but a long chain of model calls, code edits, command runs, tool delays, and growing context. The benchmark replays real coding-agent paths from public repos across 12+ programming languages, with request lengths from 5K to 131K tokens and an average near 27K tokens. NVIDIA says GB300 NVL72 reaches 61.4K concurrent agents per megawatt at the lowest service tier, while H200 reaches 2.6K. The gain comes from 72 GPUs acting like one rack-scale machine through NVLink, plus software that spreads MoE expert work, overlaps communication with compute, and keeps batches large. @NVIDIAAIDev

Summary: NVIDIA has published its first agentic AI results on AgentPerf (developed by Artificial Analysis). The benchmark measures not traditional token generation speed but the number of coding agents per megawatt that can run concurrently while staying responsive. The workload simulates real coding-agent paths (long chains of model calls, code edits, command runs, tool delays, growing context), covering 12+ programming languages with request lengths of 5K–131K tokens (27K on average). Results: GB300 NVL72 reaches 61.4K concurrent agents per megawatt at the lowest service tier, while H200 reaches only 2.6K (reported as up to a 20x gain). The gain comes from 72 GPUs forming a rack-scale system over NVLink, combined with software optimizations (MoE expert distribution, overlapping communication with compute, keeping batches large).

Artificial Analysis @ArtificialAnlys · June 13 · 59

Today we're releasing the first results for AA-AgentPerf, our new agentic inference benchmark: initially covering DeepSeek V4 Pro across NVIDIA Blackwell, Hopper, and AMD.

AA-AgentPerf is the first benchmark built for agentic inference. We use real, long-context agentic coding trajectory data as the workload, and inference with real production optimizations such as KV cache reuse and speculative decoding, leading to the most realistic evaluation of inference performance available today. AA-AgentPerf’s lead metric is Agents per Megawatt. In a power-constrained world, this answers the most relevant question for AI infrastructure providers - “how many real agents can I deploy per unit of power available?”.

First results for DeepSeek V4 Pro (at the easiest defined service level of 20 tokens/s and 10s TTFT):
➤ GB300 (rack-scale, disaggregated): 61,354 Agents/MW
➤ B300 (single node, disaggregated): 21,053 Agents/MW
➤ MI355X: 3,551 Agents/MW
➤ H200: 2,594 Agents/MW

Further AA-AgentPerf details:
➤ Real agent workloads, beyond synthetic queries: AA-AgentPerf replays real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens - the workloads that matter in 2026
➤ Production optimizations allowed: KV cache reuse, speculative decoding, and prefill/decode disaggregation are all permitted, with accuracy verification to control for quality loss - we want results to reflect what real deployments actually look like
➤ Lead metric is Agents per Megawatt: simultaneous agents supported at production performance targets (e.g. 20 tokens/s per user, ≤10s TTFT) per megawatt consumed. Agents per TCO and $/hr will be supported soon

Key findings:
➤ Rack-scale disaggregated inference (GB300) is ~3× more power-efficient than single-node Blackwell (B300), and similarly ahead in raw agents per GPU
➤ Blackwell represents a large generational step over Hopper in both power efficiency and raw compute per GPU
➤ In this test, NVIDIA's Blackwell systems currently lead AMD MI355X by a clear margin. Important context: our MI355X configs are approximately two weeks older than our Blackwell configs and couldn’t stably use speculative decoding. MI355X power draw under heavy load is also well below TDP, indicating there is much room to improve on DeepSeek V4 Pro, which we will measure and publish in the coming weeks
➤ Config and inference framework version matter enormously - we've seen meaningful improvements daily since the DeepSeek V4 Pro release and look forward to tracking performance over time

AA-AgentPerf is a live benchmark and we publish results on a rolling basis as submissions come in. Some of the new features coming in v1.1: more models (gpt-oss-120b), more hardware (GB200, B200, H100, MI300X), better AMD configurations, $/hr and cost-per-task normalization, Agents per TCO, and performance tracking over time.

Summary: Artificial Analysis released a new benchmark, AA-AgentPerf, with first results covering the inference power efficiency of DeepSeek V4 Pro on NVIDIA Blackwell (GB300, B300), Hopper (H200), and AMD MI355X. The core metric is concurrent agents supported per megawatt (at 20 tokens/s and TTFT ≤10s): GB300 (rack-scale, disaggregated) reaches 61,354, B300 (single node, disaggregated) 21,053, MI355X 3,551, and H200 2,594. The benchmark uses real coding-agent trajectories (up to 200 turns, sequences over 100K tokens) and allows production optimizations such as KV cache reuse and speculative decoding, with accuracy verification. The tests show rack-scale Blackwell is about 3x more power-efficient than single-node Blackwell and a large generational step ahead of Hopper; the MI355X configs are older and could not stably enable speculative decoding, so there is still room for optimization.
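The published Agents/MW figures can be turned into the ratios the post discusses with a few lines of Python. The `agents_per_megawatt` helper below is an assumed reading of the metric (concurrent agents meeting the service target divided by power drawn), and its example inputs are made up; only the four Agents/MW values come from the post.

```python
# Agents per megawatt reported for DeepSeek V4 Pro at the easiest service level.
REPORTED_AGENTS_PER_MW = {
    "GB300": 61_354,
    "B300": 21_053,
    "MI355X": 3_551,
    "H200": 2_594,
}


def efficiency_ratio(system_a: str, system_b: str) -> float:
    """How many times more agents per megawatt system_a supports than system_b."""
    return REPORTED_AGENTS_PER_MW[system_a] / REPORTED_AGENTS_PER_MW[system_b]


def agents_per_megawatt(concurrent_agents: int, power_kw: float) -> float:
    """Assumed definition: agents meeting the SLO divided by power in MW."""
    if power_kw <= 0:
        raise ValueError("power_kw must be positive")
    return concurrent_agents / (power_kw / 1000.0)


print(f"GB300 vs B300: {efficiency_ratio('GB300', 'B300'):.2f}x")
print(f"GB300 vs H200: {efficiency_ratio('GB300', 'H200'):.2f}x")
# Hypothetical deployment: 1,000 agents served while drawing 500 kW.
print(agents_per_megawatt(1_000, 500.0))
```

The GB300 to B300 ratio is consistent with the "~3×" stated in the key findings. The GB300 to H200 ratio from these four numbers comes out somewhat above the "up to 20x" quoted in NVIDIA's framing, which may reflect a different configuration or service tier; the posts do not say.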

Rohan Paul @rohanpaul_ai · June 13 · 73

A Nature Medicine study found general-purpose LLMs are now outperforming dedicated medical AI products on physician-reviewed clinical tasks. The authors compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on medical exam questions, clinician-style answers, and real questions doctors asked during care. In 100 de-identified physician questions from live clinical use, blinded clinicians again preferred the frontier models, especially on completeness and clarity,

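Summary: A Nature Medicine study finds that general-purpose LLMs now outperform dedicated medical AI products on physician-reviewed clinical tasks. It compared OpenEvidence and UpToDate Expert AI with GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 on medical exam questions, clinician-style answers, and real-time clinical questions. On 100 de-identified physician questions from real clinical settings, blinded clinicians preferred the frontier models, especially for the completeness and clarity of their answers.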

Chubby♨️ @kimmonismus · June 13 · 24

Looking at the graph, I think Fable 5 will only maintain its lead up to GPT-5.6. And secondly, I think the benchmark will soon be completely saturated.


Ethan Mollick @emollick · June 13 · 57

The shape of the graph is getting very familiar.

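Summary: Claude Fable 5 performs strongly on the FrontierMath benchmark (Tiers 1-4, v2), scoring 87% on Tiers 1-3 and 88% on Tier 4, continuing the trend of rapid math improvement in Anthropic models. The post comments: "The shape of the graph is getting very familiar."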

Epoch AI @EpochAIResearch · June 13 · 41

Claude Fable 5 scores very well on FrontierMath: Tiers 1–4 (v2), reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of Anthropic models improving rapidly at math.


Rohan Paul @rohanpaul_ai · June 13 · 43

Most AI agents do not forget because they lack memory; they fail because they remember badly. AGENTCL asks a simple question: does an AI agent really learn from experience, or merely carry clutter forward?

Today's agents can spend enormous effort solving one task, then enter the next one almost as if nothing happened. AGENTCL says AI agents need better tests for whether their memory actually helps them learn across tasks. The paper’s main idea is to build task streams where earlier tasks clearly contain pieces that later tasks can reuse, such as a small coding function, evidence for a research question, or a useful workflow. It compares these careful “compositional” streams with normal “naive” streams, where tasks come from the same area but do not have a guaranteed reuse link.

Agent memory is easy to overrate when the benchmark is messy. If tasks are not carefully connected, a memory system may look good for the wrong reason, or bad for a reason the test cannot explain. AGENTCL tries to fix that by making the task relationships clear, then measuring whether memory helps on later tasks, stays useful, and transfers to unseen tasks. The key finding is that today’s memory methods can reuse past work when the connection is obvious, but they still struggle to avoid confusion when the next task is different.

----
Link: arxiv.org/abs/2606.02461
Title: "AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents"

Summary: AGENTCL proposes evaluating whether AI agents truly learn from experience rather than merely accumulating information. It builds compositional task streams (earlier tasks contain code snippets, research evidence, or workflows that later tasks can reuse) and compares them with naive task streams that have no guaranteed reuse link. Key finding: current memory methods can reuse past experience when the connection between tasks is obvious, but still struggle to avoid confusion when tasks differ substantially. The paper aims to provide clearer evaluation standards for continual learning in agents.

Epoch AI @EpochAIResearch · June 13 · 64

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.


Ethan Mollick @emollick · June 12 · 72

There has been a push to use OpenEvidence AI for doctors. But this paper suggests general models are much better: “Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ.”

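Summary: A study published in Nature Medicine shows that general-purpose frontier LLMs (from Google, OpenAI, and Anthropic) outperform dedicated clinical AI tools (OpenEvidence and UpToDate) across medical information evaluations. Twelve US clinicians carried out randomized blinded assessments, and frontier LLMs won all three evaluations. Clinical AI tools performed comparably to the auto-enabled Google Search AI Overview on the RCQ.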

Artificial Analysis @ArtificialAnlys · June 12 · 60

We've updated the Artificial Analysis Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark - the swap lifts Codex with GPT-5.5 (xhigh) above Claude Code with Opus 4.8 (max), while the newly released Claude Fable 5 (max) in Claude Code debuts at the top.

DeepSWE, built by @datacurve, writes its tasks from scratch rather than adapting them from public GitHub issues or pull requests, so no model has seen the solutions during training. That matters because SWE-Bench Pro, the benchmark it replaces in our Coding Agent Index, had grown gameable, with some models recovering the fix from the repository's commit history instead of solving the task.

The swap reorders the index: Codex with GPT-5.5 (xhigh) rises from 65 to 76, overtaking Claude Code with Opus 4.8 (max) at 73. Claude Code with Fable 5 (max), which enters directly on the refreshed index, leads at 77. SWE-Bench Pro had been flattering some combinations and penalizing others. More below.

Summary: Artificial Analysis updated its Coding Agent Index, replacing SWE-Bench Pro with Datacurve's DeepSWE benchmark. DeepSWE writes its tasks from scratch rather than adapting them from public GitHub issues or PRs, which avoids training-data leakage; the old SWE-Bench Pro suffered from models recovering the fix from the repository's commit history. After the swap the ranking shifts: Codex with GPT-5.5 (xhigh) rises from 65 to 76, passing Claude Code with Opus 4.8 (max) at 73, and the newly released Claude Code with Fable 5 (max) goes straight to the top at 77.

AK @_akhaliq · June 12 · 67

Agents' Last Exam


Rohan Paul @rohanpaul_ai · June 12 · 56

atomic[.]chat shared a revealing comparison of local open-weight LLMs running on their own hardware. They benchmarked the new DiffusionGemma (diffusion text model) vs. Gemma4 26B A4B (autoregressive model) on a single H100 (FP8). The 4X speed of DiffusionGemma changes the shape of error.
- Autoregressive models move left to right, one token at a time, which is slower, but each new word is conditioned on the exact text already written.
- Diffusion models write many tokens at once, then revise the block over several passes, so they can feel fast because the model is not waiting to finish token 1 before starting token 2.
atomic[.]chat is a desktop app for running LLMs locally.

Summary: On a single H100 (FP8), atomic[.]chat compared DiffusionGemma 26B A4B with Gemma4 26B A4B on factual writing tasks. DiffusionGemma reached 763 tok/s (3.7 seconds), about 4x as fast as Gemma4 (218 tok/s, 15.1 seconds), but with a markedly higher error rate. Across three tasks (a Steve Jobs biography, the history of Tetris, and the story of BeOS), Gemma4 got 45 facts right and 5 wrong; DiffusionGemma got only 33 right and 28 wrong. The more obscure the topic, the more errors: 4 on Jobs, 12 on Tetris, and 12 on BeOS, for example naming Jobs's mother Clara Clley, inventing a colleague named Geri Gulovik for the creator of Tetris, and reporting the BeBox price as $9,999 (actual price $1,600). The cause is that DiffusionGemma generates 256 tokens at once and polishes them over several passes, optimizing only for fluency rather than factual accuracy. Google also officially recommends regular Gemma4 when facts matter.
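The speed and accuracy tradeoff in the figures above is easier to see once it is reduced to ratios. The following Python sketch only does arithmetic on the numbers quoted in the summary; the dictionary layout and the `accuracy` helper are illustrative choices, not part of the atomic[.]chat benchmark.

```python
def accuracy(correct: int, wrong: int) -> float:
    """Share of checked facts that were correct."""
    total = correct + wrong
    if total == 0:
        raise ValueError("no facts were checked")
    return correct / total


# Figures as quoted in the summary above.
results = {
    "Gemma4 26B A4B": {"tok_per_s": 218, "seconds": 15.1, "correct": 45, "wrong": 5},
    "DiffusionGemma": {"tok_per_s": 763, "seconds": 3.7, "correct": 33, "wrong": 28},
}

ar, diff = results["Gemma4 26B A4B"], results["DiffusionGemma"]
print(f"throughput ratio: {diff['tok_per_s'] / ar['tok_per_s']:.2f}x")
print(f"wall-clock ratio: {ar['seconds'] / diff['seconds']:.2f}x")
for name, r in results.items():
    print(f"{name}: {accuracy(r['correct'], r['wrong']):.1%} of facts correct")
```

Note that the "4x" headline matches the wall-clock ratio more closely than the tokens-per-second ratio, and that the two models were scored on different numbers of extracted facts (50 versus 61), so the accuracy figures are best read as rough indicators.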

elvis @omarsar0 · June 12 · 25

Got my 10yr old introduced to Codex today. The excitement in his face tells it all. After struggling with Claude Code CLI for a bit, today he was like “this is the future, dad”. The Codex team built a beautiful app.


Artificial Analysis @ArtificialAnlys · June 12 · 61

Users and enterprises are handing AI models and agents more autonomy, so the guardrails that screen their inputs and outputs matter more than ever. However, the benchmarks for evaluating those guardrails haven’t kept pace with model intelligence.

In partnership with @nvidia, we independently benchmarked guardrail and moderation models across three open datasets, measuring detection quality, latency, and the tradeoff between catching unsafe content and over-refusing safe content. No model wins outright, and there is still no common standard for judging them. We see this as an early step in a measurement problem that will continue to grow more important as models take on more real-world work.

Summary: As users and enterprises give AI models and agents more autonomy, guardrails on their inputs and outputs keep growing in importance. Artificial Analysis, in partnership with NVIDIA, independently benchmarked guardrail and moderation models on three open datasets, evaluating detection quality, latency, and the tradeoff between catching unsafe content and over-refusing safe content. The results show no model leading across the board, and the industry still lacks a unified standard for judging them. The study is presented as an early exploration of an evaluation problem that is becoming more important.
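The tradeoff between catching unsafe content and over-refusing safe content can be expressed with two rates from a confusion matrix. This Python sketch is an assumption about how such a tradeoff is commonly quantified; the metric names and the example counts are hypothetical and are not taken from the Artificial Analysis benchmark.

```python
def guardrail_metrics(true_pos: int, false_neg: int,
                      false_pos: int, true_neg: int) -> dict[str, float]:
    """Detection rate on unsafe items and over-refusal rate on safe items.

    true_pos / false_neg count unsafe inputs that were blocked / missed;
    false_pos / true_neg count safe inputs that were blocked / allowed.
    """
    unsafe_total = true_pos + false_neg
    safe_total = false_pos + true_neg
    if unsafe_total == 0 or safe_total == 0:
        raise ValueError("need at least one unsafe and one safe example")
    return {
        "detection_rate": true_pos / unsafe_total,    # recall on unsafe content
        "over_refusal_rate": false_pos / safe_total,  # false positive rate
    }


# Hypothetical counts for a single guardrail model on a labeled dataset.
print(guardrail_metrics(true_pos=90, false_neg=10, false_pos=5, true_neg=95))
```

A stricter guardrail typically raises both numbers at once, which is one reason a single leaderboard score struggles to name an outright winner.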

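Noam Brown @polynoamial · June 12 · 63

I'm happy GPT-5.5 tops this eval. I'm even happier it's still doing the best when measured vs tokens, cost, or wall-clock time!

Summary: OpenAI researcher Noam Brown says GPT-5.5 ranks first on the Agents' Last Exam (ALE) benchmark and is also the best performer when measured by model tokens, cost, or wall-clock time. ALE, created by @dawnsongtweets's team, is a rolling benchmark with more than 1,500 expert tasks spanning 55 occupations, testing whether AI agents can carry out work of real economic value. The systems evaluated include frontier systems such as GPT-5.5, Fable 5, and Composer 2.5. Results: current agents can solve some professional tasks, but on the hardest tier, which requires sustained reasoning and deep expertise, every frontier agent tested (including Fable 5) has a 0% success rate.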

AK @_akhaliq · June 12 · 58

TRL-Bench Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders


OpenRouter @OpenRouter · June 12 · 74

Use our Benchmarks explorer to plot Pareto curves for 10 different benchmarks, including @ArtificialAnlys and @Designarena: https://openrouter.ai/rankings#benchmarks

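For readers unfamiliar with Pareto curves, here is a minimal Python sketch of how a cost versus score frontier can be computed. It assumes lower cost and higher score are both better, and reuses the cost-per-task and Intelligence Index figures quoted in the Artificial Analysis post elsewhere in this feed. The `pareto_frontier` helper and the `hypothetical-model` entry are illustrative and are not part of OpenRouter's explorer.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (lower cost, higher score).

    Each point is (name, cost, score). After sorting by cost, a model is on
    the frontier only if it scores strictly higher than every cheaper model.
    """
    frontier = []
    best_score = float("-inf")
    for name, _cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


models = [
    ("Claude Fable 5", 3.25, 60),
    ("Claude Opus 4.8 (max)", 1.78, 56),
    ("GPT-5.5 (xhigh)", 0.99, 55),
    ("DeepSeek V4 Pro (max)", 0.04, 44),
    ("hypothetical-model", 2.00, 50),  # made-up entry to show a dominated point
]
print(pareto_frontier(models))
```

In this example the made-up entry is dropped because a cheaper model already scores higher, while the four quoted models each buy additional score with additional cost and so all sit on the frontier.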

OpenRouter @OpenRouter · June 11 · 77

Use our Benchmarks explorer to plot Pareto curves for 10 different benchmarks. More coming soon! https://openrouter.ai/rankings#benchmarks
