This may be an extreme case but it still shows how quickly Fable 5 classifiers can reroute routine coding to Opus. The session routed 75% of its work to Opus because the new classifiers kept misreading the coding prompts here as a cybersecurity issue.

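Summary: User @bridgemindai reported a coding session that cost $321, of which Fable 5 handled only $78 (about 25%) while fallback calls to Opus 4.8 accounted for $242 (about 75%). The cause was Fable 5's new classifiers misreading routine coding prompts as cybersecurity risks, which automatically routed most of the work to the more expensive Opus model. Anthropic had said only a very small share of tasks would trigger the fallback, but this user's experience did not match that.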
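
The percentages above follow directly from the two dollar figures. A minimal sketch of that arithmetic, assuming the reported $78 and $242 are the only two cost components (they sum to $320, so the quoted $321 total is presumably rounded); the function name `spend_shares` is hypothetical:

```python
def spend_shares(costs):
    """Return each model's fraction of the total spend."""
    total = sum(costs.values())
    return {model: cost / total for model, cost in costs.items()}


# Dollar figures as reported in the post; the split is an approximation.
session = {"Fable 5": 78.0, "Opus 4.8": 242.0}
shares = spend_shares(session)

for model, share in shares.items():
    print(f"{model}: {share:.1%}")
```

Tracking the same split per session is one way to check whether fallback routing stays rare on your own workload.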

Epoch AI@EpochAIResearch · 4 hours ago · 44

OpenAI’s GPT-4 led the Epoch Capabilities Index for 352 days after its March 2023 release, far longer than any model since. The second-longest lead belongs to OpenAI’s o1 at 98 days.

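Summary: OpenAI's GPT-4 led the Epoch Capabilities Index for 352 days after its March 2023 release, far longer than any model since. The second-longest lead, 98 days, belongs to OpenAI's o1.

Ethan Mollick@emollick · 8 hours ago · 50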

You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8. (This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)

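Summary: Ethan Mollick argues for evaluating models with your own benchmarks rather than relying on generic benchmarks or simply swapping models. His examples: use Gemini 3.5 Flash to translate Egyptian hieroglyphics and Opus 4.8 to run a vending machine. JakeABoggs's HieroglyphBench shows Anthropic's Fable 5 level with GPT-5.5, but both far behind the Gemini family, with Gemini 3.5 Flash scoring more than twice as high as Fable 5.

Yuchen Jin@Yuchenj_UW · 21 hours ago · 38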

Databricks ranks #1 on NVIDIA’s SOL-ExecBench kernel leaderboard, in the L1 single operation track, powered by KDA (Kernel Design Agents) 🎉 What’s crazy is: we 100% leveraged AI agents to beat the competition. This is a sneak peek at recursive self-improvement. The core frameworks we used were KDA, Humanize, and Omnigent: Claude writes code, Codex reviews. Together, they enabled agents to run autonomously for as long as possible. The key is setting up the right framework to let the agents cook. This work was driven by @leshenj15 at Databricks, in collaboration with NVIDIA and MIT HAN Lab’s @LigengZhu and @DongyunZou03 . Databricks AI is like a neolab. Join us if you’re cracked!

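Summary: Databricks ranks first in the L1 single operation track of NVIDIA's SOL-ExecBench kernel leaderboard, relying entirely on autonomously running AI agents. The frameworks used were KDA, Humanize, and Omnigent: Claude writes the code and Codex reviews it, which the author presents as a sneak peek at recursive self-improvement. The work was led by leshenj15 at Databricks, in collaboration with NVIDIA and MIT HAN Lab's Ligeng Zhu and Dongyun Zou.

elvis@omarsar0 · 1 day ago · 43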

Who did it best? GLM-5.2 (left) | Fugu Ultra (middle) | Fable 5 (right) Same one-shot prompt. The last one is my favorite!

Summary: Who did it best? GLM-5.2 (left) | Fugu Ultra (middle) | Fable 5 (right). Same one-shot prompt. The last one is my favorite!

SemiAnalysis@SemiAnalysis_ · 23 hours ago · 57

This week the InferenceX team discusses what it took to get DeepSeek V4 on InferenceX, changes in the model architecture, what is a MegaKernel, and initial performance on various accelerators including Huawei Ascend NPUs.

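Summary: This week the InferenceX team discusses what it took to get DeepSeek V4 onto InferenceX, the changes in the model architecture, what a MegaKernel is, and initial performance on various accelerators, including Huawei Ascend NPUs.

Rohan Paul@rohanpaul_ai · 1 day ago · 53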

Fable 5 absolutely crushed the HTML5 physics contest, but cost 6x more than Opus 4.8 and 39× more than GLM 5.2 in that test. Test was done on atomic[.]chat, a desktop app that runs LLMs locally. The test asked 4 models to generate self-contained canvas demos with believable motion and collisions. The scenes were not simple animations because every crash needed gravity, force, timing, and contact handling. Outputs: - Fable 5: 62,158 tokens, $3.12 - GPT 5.5: 37,753 tokens, $1.14 - Opus 4.8: 22,280 tokens, $0.56 - GLM 5.2: 36,246 tokens, $0.08

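Summary: In the HTML5 physics contest run on atomic.chat (a desktop app for local LLMs), Fable 5 completed all three scenes (train derailment, mid-air car collision, monster truck crush) with an A+ grade, using 62,158 tokens at a cost of $3.12. By comparison, Opus 4.8 used 22,280 tokens ($0.56), GPT 5.5 used 37,753 tokens ($1.14) and narrowly beat Fable in the monster truck scene, and GLM 5.2 used 36,246 tokens ($0.08) but won no scene. Fable 5 delivered the best quality at the highest cost.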
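
The "6x" and "39×" multiples can be rechecked from the token and dollar figures quoted in the post. The sketch below is illustrative only: the helper names `cost_ratio` and `cost_per_million_tokens` are hypothetical, and the blended per-token rate assumes the quoted dollar amount covers exactly the quoted tokens.

```python
# (tokens, cost in USD) per model, copied from the post.
results = {
    "Fable 5": (62_158, 3.12),
    "GPT 5.5": (37_753, 1.14),
    "Opus 4.8": (22_280, 0.56),
    "GLM 5.2": (36_246, 0.08),
}


def cost_ratio(results, model, baseline):
    """How many times more the model's run cost than the baseline's."""
    return results[model][1] / results[baseline][1]


def cost_per_million_tokens(results, model):
    """Blended USD per 1M tokens implied by the reported totals."""
    tokens, cost = results[model]
    return cost / tokens * 1_000_000


for baseline in ("Opus 4.8", "GLM 5.2"):
    ratio = cost_ratio(results, "Fable 5", baseline)
    print(f"Fable 5 vs {baseline}: {ratio:.1f}x")

for model in results:
    print(f"{model}: ${cost_per_million_tokens(results, model):.2f} per 1M tokens")
```

Separating the token count from the implied per-token rate shows whether a model is expensive because it is verbose, because it is priced higher, or both.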

Chubby♨️@kimmonismus · 1 day ago · 44

Your move, @OpenAI. Excited to see how GPT-5.6 performs. Rumor has it the release is coming next week. Let’s see. Oh, and where is Gemini 3.5 Pro?

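Summary: Fable-5 scored 16.10% on the Remote Labor Index (RLI), leading the public leaderboard. RLI uses 240 real remote-work projects (covering 23 domains and worth more than $140,000 in total); reviewers compare the AI output with the human deliverable and judge whether a reasonable client would accept it. The result is described as a "crazy jump" and taken as a sign that AI is still in a phase of exponential development. Meanwhile, GPT-5.6 is rumored to launch next week; the author calls out OpenAI and asks where Gemini 3.5 Pro is.

Artificial Analysis@ArtificialAnlys · 1 day ago · 68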

Fish Audio has recently released S2.1 Pro and is making it available for free via API through July 24. Fish Audio S2.1 Pro is the latest Text to Speech model from @FishAudio, supporting multilingual speech generation across 83 languages with improved quality, lower latency, and higher throughput than S2 Pro. The model also supports voice cloning and natural language control over emotion and prosody. Key takeaways: ➤ Quality: S2.1 Pro has an Elo of 1,153, placing it #13 on the Artificial Analysis Speech Arena Leaderboard ahead of Async Pro v1.0, Speech 2.8 Turbo, and Step TTS 2, based on 1,072 arena appearances. ➤ API: S2.1 Pro is available via the Fish Audio API with a free access period through July 24, 2026. ➤ Speed: S2.1 Pro processes 56.3 characters per second, ahead of GPT-Realtime-2 (45.8 chars/s) and Gemini 3.1 Flash TTS (25.3 chars/s). See more details and listen to samples below ⬇️

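Summary: Fish Audio released the S2.1 Pro text-to-speech model, free to use via API through July 24, 2026. The model supports 83 languages, voice cloning, and natural-language control of emotion and prosody, with better quality, lower latency, and higher throughput than its predecessor S2 Pro. On the Artificial Analysis Speech Arena leaderboard, S2.1 Pro has an Elo of 1,153 based on 1,072 arena appearances, ranking #13 ahead of Async Pro v1.0, Speech 2.8 Turbo, and Step TTS 2. It processes 56.3 characters per second, faster than GPT-Realtime-2 (45.8 chars/s) and Gemini 3.1 Flash TTS (25.3 chars/s).

Ethan Mollick@emollick · 1 day ago · 41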

Since it's back, here are the impressions of Fable I posted a couple of weeks ago, after my time as an early access user (yes, it really is very impressive, but that shows off best in longer, harder tasks) https://open.substack.com/pub/oneusefulthing/p/what-it-feels-like-to-work-with-mythos?r=i5f7&utm_medium=ios

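Summary: Now that it is back, the author reshares impressions of Fable posted a couple of weeks ago after a period as an early access user (it really is very impressive, but it shows best on longer, harder tasks) https://open.substack.com/pub/oneusefulthing/p/what-it-feels-like-to-work-with-mythos?r=i5f7&utm_medium=ios

Artificial Analysis@ArtificialAnlys · 1 day ago · 55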

Claude Sonnet 5 ranks second only to Fable 5 on AA-Briefcase, our new agentic knowledge work benchmark, with a ~17x cost per task range across its five effort settings @AnthropicAI has released Claude Sonnet 5, the latest addition to the Claude Sonnet family. On AA-Briefcase, Claude Sonnet 5 (max) scores 1391 Elo, a +312 point improvement over Claude Sonnet 4.6 (max), making it the second highest scoring model behind Claude Fable 5. This gain is driven primarily by improvements in rubric scoring and analytical quality, with Sonnet 5 trailing Claude Opus 4.8 on Presentation Elo. We benchmarked all 5 available effort settings for Claude Sonnet 5: ➤ Max effort achieves the second highest AA-Briefcase Elo, but lower efforts are not Pareto efficient: Claude Sonnet 5 (max) achieves the highest AA-Briefcase score among Sonnet 5 effort settings, but lower effort settings do not reach the cost-performance Pareto frontier. Models such as Claude Opus 4.8 (max), GLM-5.2 (max), and MiniMax-M3 offer stronger cost-performance trade-offs than Claude Sonnet 5 at lower effort settings ➤ Substantially higher turn use across effort levels: Claude Sonnet 5’s higher cost is driven by an increased number of turns, with Sonnet 5 (max) averaging 183 turns per AA-Briefcase task, more than 4x that of Claude Sonnet 4.6 (max). This increase is consistent across effort levels, with Claude Sonnet 5 (medium) averaging 55 turns per task, in line with Claude Opus 4.8 with max effort AA-Briefcase is our new proprietary benchmark for agentic knowledge work. It tests models on realistic tasks across thousands of input files, requiring deliverables such as spreadsheets, presentations, and UI mock-ups. Model performance is measured across three dimensions: binary rubric checks for ground-truth correctness, pairwise grading on analytical quality, and pairwise grading on presentation quality. The AA-Briefcase Elo is a single metric that combines results across all three dimensions

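Summary: Anthropic released Claude Sonnet 5. On AA-Briefcase (an agentic knowledge work benchmark that tests models on tasks spanning thousands of files and requiring spreadsheets, presentations, and UI mock-ups), Sonnet 5 (max) scores 1391 Elo, up 312 points from Sonnet 4.6 (max), placing second behind only Fable 5. The gain comes from rubric scoring and analytical quality; on presentation it still trails Opus 4.8. The max setting scores highest, but the lower settings do not reach the cost-performance Pareto frontier: Opus 4.8 (max), GLM-5.2 (max), and MiniMax-M3 offer better cost-performance than Sonnet 5 at lower effort. Sonnet 5's higher cost comes from a sharp rise in turns: max averages 183 turns per task (more than 4x Sonnet 4.6 max) and medium averages 55, with cost per task spanning roughly 17x across the effort settings.

Ethan Mollick@emollick · 1 day ago · 61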

You really need to benchmark models for your use case. As soon as judgements & decisions stack on top of each other, the differences between models amplify, and no standard benchmark will tell you that Gemini 3.1 is less worried about financial losses at a cafe than GPT-5.5

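Summary: The main tweet stresses that models must be benchmarked on the actual use case, because differences between models are amplified as decisions stack up, and no standard benchmark reveals that Gemini 3.1 worries less than GPT-5.5 about financial losses at a cafe. The quoted case: Andon Labs' AI agent ran a cafe in Stockholm on Gemini 3.1 Pro, over-ordered and was easily tricked, spending $15k against only $9k in revenue for a $6k loss; it has since been switched to GPT-5.5.

Chubby♨️@kimmonismus · 1 day ago · 73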

This is crazier than you might think: Fable-5 now scores 16.10% on the Remote Labor Index What is RLI? The Remote Labor Index uses 240 real remote-work projects from professional freelancers, covering 23 domains and more than $140,000 of human work. Each task comes with the actual brief, files, and accepted human deliverable. Reviewers then compare the AI output against the human reference and ask whether a reasonable client would accept it. That is why the scores are still low. Full projects require planning, file handling, quality control, visual consistency, domain judgment, and final packaging. Fable-5 now leads the public leaderboard at 16.10%. And it’s a crazy jump. We are still deep in exponential development, and now even the toughest benchmarks are being tackled.

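Summary: Fable-5 reached a 16.10% automation rate on the Remote Labor Index (RLI), nearly 4x the 4.2% of its predecessor Opus 4.6 and twice the score of the second-place model. RLI uses 240 real remote-work projects from professional freelancers, covering 23 domains and more than $140,000 of human work; reviewers compare the AI output with the human reference and judge whether a reasonable client would accept it. Fable-5 now leads the public leaderboard, and the author says the jump shows AI is still developing exponentially, with even the toughest benchmarks starting to fall.

Berryxia.AI@berryxia · 1 day ago · 15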

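I'll have a beer too~

Summary: Omini 1.0 performs well at video editing, and the demo shows clear improvements in handling space and perspective. The new version will be available soon, but because it is a heavy-duty editing tool it is not drawing much attention yet.

Berryxia.AI@berryxia · 1 day ago · 58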

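Lai Shu's video is nicely done 😄 I hear GEO is very profitable; I wonder whether I can talk our boss into it.

Summary: The user organized an introductory GEO document with Codex, then handed it to six top PPT Skills to generate presentation content. Some Skills output HTML, Baoyu's takes an image-generation approach, and PPT Master can directly produce PPT and PDF files that are easy to edit. The Guizang version leaves a lot of white space, which suits talk-style content rather than knowledge-dense training material. The test reflects only default behavior, not each Skill's ceiling. Based on this round of deliverables, the user currently leans toward PPT Master.

Epoch AI@EpochAIResearch · 1 day ago · 28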

We recently began tracking 13 new evals on our benchmarking hub. 7 of these have been incorporated into the Epoch Capabilities Index (ECI).

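Summary: Epoch AI recently began tracking 13 new evals on its benchmarking hub, 7 of which have been incorporated into the Epoch Capabilities Index (ECI).

小互@xiaohu · 1 day ago · 40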

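Tested with my http://Best.XiaoHu.AI content: Sonnet 5 is clearly better than 4.6 at text and other tasks, but its front-end ability is very poor compared with Opus, and it could not complete many front-end design, interaction, and SVG image tasks. I switched my text interpretation and translation tasks to Sonnet 5, which saves about half the input tokens and more than doubles the speed. Translation cost drops by roughly 80% with zero loss in quality.

Summary: Tests on Best.XiaoHu.AI content show Sonnet 5 improving clearly over 4.6 on text and other tasks while falling far short of Opus on front-end work (front-end design, interaction, SVG images). Moving text interpretation and translation to Sonnet 5 saved about half the input tokens, more than doubled the speed, and cut translation cost by roughly 80% with no loss in quality.

Orange AI@oran_ge · 1 day ago · 54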

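I didn't expect Sonnet 5 to be this controversial. Because it switched to a new tokenizer, Sonnet 5's actual cost is about the same as Opus 4.8. Sonnet is the best model in finance, for example on GDPeval and on work like investment research, and it prefers calling tools to check facts, which improves report accuracy (and raises the cost accordingly). One pitfall with Sonnet 5: for coding, the cost may exceed Opus 4.8, which is the most common complaint and worth watching. Opus 4.8 is very strong at complex coding and planning and at HTML design, but its writing is not as good as Opus 4.6, and the new tokenizer also makes it cost more than 4.6; for now it and GPT 5.5 each have their strengths. For coding, GPT 5.5 is still the first choice. Sonnet 5, Opus 4.8, and GPT 5.5 are now live on Cola; come try them.

Summary: Sonnet 5 is controversial because its new tokenizer brings its actual cost close to Opus 4.8. It is the strongest model in finance (for example, GDPeval) and is good at calling tools to check facts, but for coding its cost may exceed Opus 4.8. Opus 4.8 is strong at complex coding, planning, and HTML design, writes less well than Opus 4.6, and is roughly on par with GPT 5.5, each with its own strengths. GPT 5.5 remains the first choice for coding. All three models are live on Cola.

Rohan Paul@rohanpaul_ai · 1 day ago · 58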

atomic[.]chat, a desktop app that runs LLMs locally, ran a very revealing comparison for Claude Sonnet 5, Claude Opus 4.8, Claude Sonnet 4.6, and GPT 5.5. Claude Sonnet 5 just matched GPT 5.5 on 3 physics coding demos at 6x lower cost. Also spent minimum number of tokens. - Sonnet 5: 15,047 tokens, $0.15 - Opus 4.8: 23,063 tokens, $0.58 - Sonnet 4.6: 25,824 tokens, $0.39 - GPT 5.5: 31,152 tokens, $0.94

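Summary: The atomic.chat desktop app compared Claude Sonnet 5, Opus 4.8, Sonnet 4.6, and GPT 5.5, using the same prompt to build three HTML5 physics collision demos (car hitting a wall, wrecking ball demolishing a house, trebuchet striking a castle). Sonnet 5 performed on par with GPT 5.5 and Opus 4.8 across all tests, beating Opus 4.8 in the wrecking ball scene and GPT 5.5 in the trebuchet scene. Sonnet 5 used only 15,047 tokens ($0.15) against GPT 5.5's 31,152 tokens ($0.94), about 6x cheaper; Opus 4.8 used 23,063 tokens ($0.58) and Sonnet 4.6 used 25,824 tokens ($0.39). Sonnet 5 consumed the fewest tokens, though its graphical detail still has room to improve.

Rohan Paul@rohanpaul_ai · 2 days ago · 55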

Claude Sonnet 5 is more expensive (around +15%) per task than Opus 4.8 and much more expensive (2X) than Sonnet 4.6, even though its per-token price is lower than Opus. Because it uses more tokens to complete the same kind of benchmark task. i.e. Sonnet 5 works harder and talks/thinks more, so the final bill becomes bigger even though each token is cheaper. The promo pricing changes the story for now. Until August 31, 2026, Sonnet 5 is discounted to $2 per 1M input tokens and $10 per 1M output tokens, then it moves back to $3/$15 from September 1, 2026.

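Summary: Claude Sonnet 5 costs $2.29 per task on the Intelligence Index, about 2x Sonnet 4.6 and about 15% more than Opus 4.8. Its per-token price is lower than Opus, but it uses more tokens to complete the same task, so the total bill is higher. Standard pricing is $3 per 1M input tokens and $15 per 1M output tokens; Anthropic is offering promotional pricing of $2/$10 through August 31, 2026, after which the standard price returns. Sonnet 5 is currently second only to Claude Fable 5 in cost.

Chubby♨️@kimmonismus · 2 days ago · 68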

tl;dr: Sonnet 5 is cheaper per token, but more expensive per solved problem – and still lags behind Opus 4.8 in overall intelligence. That's honestly disappointing and not a good release.

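Summary: Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index, 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max). At standard pricing ($3/$15 per 1M tokens) it costs $2.29 per task, about 2x Sonnet 4.6 and about 15% more than Opus 4.8. It trails Opus 4.8 on reasoning and knowledge-heavy benchmarks (for example, only 17% on the CritPt physics reasoning benchmark) but matches or beats Opus 4.8 on agentic knowledge work (AA-Briefcase and GDPval-AA). The context window is 1 million tokens, and Anthropic is offering promotional pricing of $2/$10 until September 1. A new xhigh effort setting was added. The author finds the overall result disappointing and not a good release.

Yuchen Jin@Yuchenj_UW · 2 days ago · 31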

Claude Sonnet 5 costs more than Claude Opus 4.8 on the Artificial Analysis Intelligence Index task, and 4.75X more than GLM-5.2. Token efficiency is important.

Summary: Claude Sonnet 5 costs more than Claude Opus 4.8 on the Artificial Analysis Intelligence Index task, and 4.75x as much as GLM-5.2. Token efficiency matters.

Artificial Analysis@ArtificialAnlys · 2 days ago · 60

Claude Sonnet 5 achieves 53 on the Artificial Analysis Intelligence Index, but without promotional pricing will cost more per task than Opus 4.8 We supported @AnthropicAI to evaluate Claude Sonnet 5 ahead of release: with max effort it improves 6 points over Sonnet 4.6 to achieve the same Intelligence Index as GPT-5.5 with high reasoning, but remains behind Opus 4.7 and 4.8 Key takeaways: ➤ Claude Sonnet 5 is the #5 model on the Artificial Analysis Intelligence Index, only 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max) ➤ With max effort, Sonnet 5 works harder than previous Anthropic models: it used ~40% more output tokens per Intelligence Index task than Sonnet 4.6, and ~3x the agentic turns for our knowledge work evaluations AA-Briefcase and GDPval-AA. This behavior scales well with the ‘effort’ setting, with the max effort using around 6x more turns than low effort on GDPval-AA ➤ Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing: Claude Sonnet 5 costs $2.29 per task on the Intelligence Index, a ~2x increase compared to Sonnet 4.6 and ~15% more than Claude Opus 4.8. This is driven entirely by increased token usage. Sonnet 5 retains the same $3/$15 per 1M input/output token pricing as Sonnet 4.6 (compared to $5/$25 for Opus 4.8), however Anthropic is offering a one-third reduction to $2/$10 until September 1. Our results use standard $3/$15 pricing ➤ Sonnet 5 matches or outperforms Opus 4.8 on agentic knowledge work tasks: on both AA-Briefcase and GDPval-AA, Claude Sonnet 5 sits just ahead of Opus 4.8, trailing only Claude Fable 5 (which is not currently generally available). 
These benchmarks test the ability of models to produce accurate and well-presented professional outputs using our open source reference agent harness, Stirrup ➤ For reasoning and knowledge-heavy tasks, Sonnet still sits behind its larger siblings: despite substantial gains across many evaluations, heavy reasoning and knowledge benchmarks still show Opus 4.8 ahead of Sonnet 5. On CritPt, a frontier physics reasoning benchmark developed by researchers at Argonne and UIUC, Sonnet 5 scores 17% - this is 14 points higher than its predecessor, but behind GLM-5.2, Claude Opus and Fable, and GPT-5.5 (xhigh and Pro) ➤ Sonnet 5 also showed significant improvements over Sonnet 4.6 on Terminal-Bench v2.1 (+9 points), Humanity’s Last Exam (+10 points), and SciCode (+7 points), with relatively flat scores elsewhere Other key model details: ➤ Context window of 1 million tokens (equivalent to Sonnet 4.6) ➤ Pricing of $3/$15 per 1M tokens of input/output (reduced to $2/$10 until September 1); cache pricing remains at a 25% premium for cache writes ($3.75 per million tokens) with 5-minute time to live, and 90% discount for cache hits ($0.3 per million tokens) ➤ Effort remains the recommended way of configuring model performance and latency. Sonnet 5 adds an additional ‘xhigh’ effort setting relative to Sonnet 4.6, matching the 5 effort levels available on Opus 4.8 (max, xhigh, high, medium, low)

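Summary: With max effort, Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index (#5 overall), 6 points above Sonnet 4.6 and level with GPT-5.5 at high reasoning, but behind Opus 4.7 and 4.8; it sits 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max). At standard pricing it costs $2.29 per task, about 2x Sonnet 4.6 and about 15% more than Opus 4.8, driven by higher token usage: roughly 40% more output tokens per Intelligence Index task than Sonnet 4.6 and about 3x the agentic turns on the knowledge work evaluations. Pricing is $3/$15 per 1M tokens (reduced to $2/$10 until September 1), the context window is 1M tokens, and a new xhigh effort setting was added. It matches or beats Opus 4.8 on the agentic knowledge work benchmarks AA-Briefcase and GDPval-AA but still trails on reasoning benchmarks. Terminal-Bench v2.1 (+9), HLE (+10), and SciCode (+7) improved significantly.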
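
The pricing figures above explain how a model that is cheaper per token can still cost more per task. The sketch below uses the list prices quoted in the post; the `Pricing` class and all token counts are hypothetical, chosen only to illustrate the effect. Because Opus 4.8 is priced at 5/3 of Sonnet 5 on both input and output, Sonnet 5 becomes the more expensive option once it uses more than about 1.67x the tokens for the same input/output mix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per 1M tokens (hypothetical helper, not an official API)."""
    input_per_m: float
    output_per_m: float

    def cost(self, input_tokens, output_tokens):
        return (input_tokens * self.input_per_m
                + output_tokens * self.output_per_m) / 1_000_000


# List prices as quoted in the post.
SONNET_5 = Pricing(3.0, 15.0)
SONNET_5_PROMO = Pricing(2.0, 10.0)   # one-third off until September 1
OPUS_4_8 = Pricing(5.0, 25.0)

# Cache rates derived from the quoted 25% write premium and 90% hit discount.
CACHE_WRITE_PER_M = SONNET_5.input_per_m * 1.25
CACHE_HIT_PER_M = SONNET_5.input_per_m * 0.10

# Token usage multiplier at which Sonnet 5 costs as much as Opus 4.8.
BREAK_EVEN_MULTIPLIER = OPUS_4_8.input_per_m / SONNET_5.input_per_m

# Hypothetical task: Sonnet 5 uses twice the tokens Opus 4.8 does.
opus_task = OPUS_4_8.cost(100_000, 20_000)
sonnet_task = SONNET_5.cost(200_000, 40_000)
sonnet_promo_task = SONNET_5_PROMO.cost(200_000, 40_000)

print(f"Opus 4.8: ${opus_task:.2f}")
print(f"Sonnet 5 (standard): ${sonnet_task:.2f}")
print(f"Sonnet 5 (promo): ${sonnet_promo_task:.2f}")
print(f"Break-even token multiplier: {BREAK_EVEN_MULTIPLIER:.2f}x")
```

In this made-up case the promotional price brings Sonnet 5 back under Opus 4.8, which matches the post's point that the promotion changes the comparison only temporarily.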

Artificial Analysis@ArtificialAnlys · 2 days ago · 58

Announcing the Artificial Analysis Controlled Voice Arena - compare Text to Speech models on the same set of 8 cloned voices The Controlled Voice Arena standardizes, through voice cloning, the set of voices that each model’s performance is evaluated on - separating specific voice preference from broader aspects of model quality, e.g., audio quality, pronunciation, pacing and tone. It complements our Provider Voice Arena, where each model uses a select set of its own available voices. We have generated speech samples on models that offer voice cloning abilities using the same voice categories as our existing Provider Voice Arena, namely: 2 US Male voices, 2 US Female voices, 2 UK Male voices, 2 UK Female voices. Each model has been cloned on the same 1-2 minute audio recordings for each voice. Voting is open now and we plan to announce the first leaderboard results this week.

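Summary: Artificial Analysis launched the Controlled Voice Arena, which uses voice cloning to standardize on 8 voices (2 US male, 2 US female, 2 UK male, 2 UK female) so that TTS models can be compared on audio quality, pronunciation, pacing, and tone, separating voice preference from model quality. Each model is cloned from the same 1-2 minute recordings for each voice. Voting is open, and the first leaderboard results are planned for this week.

Artificial Analysis@ArtificialAnlys · 2 days ago · 53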

GLM-5.2 is the most intelligent open weights model available, but also the most verbose among the leading models GLM-5.2 (max) used ~141M output tokens (95% reasoning) to run the Artificial Analysis Intelligence Index (1.8x the average model). Key takeaways: ➤ GLM-5.2 generates more tokens (141M) to run the Artificial Analysis Intelligence Index than Claude Opus 4.8 (117M) and nearly double GPT-5.5 (72M), while scoring below both (51 vs 56 and 55) ➤ Almost two-thirds of that goes to a single benchmark, Humanity's Last Exam: ~88M tokens, 3.2x GPT-5.5's, and it still scores lowest of the three (40% vs Opus 46% and GPT-5.5 44%) ➤ The verbosity is not focused on recalling facts. On AA-Omniscience, which measures hallucination rates, GLM-5.2 thinks less than GPT-5.5 yet scores just 4, far below Opus 4.8 (27), GPT-5.5 (20), and Gemini 3.5 Flash (23) ➤ Additional thinking pays off most on agentic real-world work: on GDPval-AA v2 GLM-5.2 is the top open weights model and #3 overall, beating GPT-5.5 ➤ Several open models generate even more output, but all score lower on intelligence; the strongest of them, DeepSeek V4 Pro, trails GLM-5.2 by 7 points (44 vs 51)

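Summary: GLM-5.2 is the most intelligent open weights model on the Artificial Analysis Intelligence Index with a score of 51, but it used 141M output tokens (95% reasoning), 1.8x the average model. By comparison, Claude Opus 4.8 used 117M tokens and scored 56, and GPT-5.5 used 72M tokens and scored 55. Almost two-thirds of GLM-5.2's tokens (88M) went to Humanity's Last Exam, 3.2x GPT-5.5's usage, for a score of only 40% (Opus 46%, GPT-5.5 44%). On AA-Omniscience, which measures hallucination rates, GLM-5.2 scores just 4, far below Opus 4.8 (27), GPT-5.5 (20), and Gemini 3.5 Flash (23). On the agentic benchmark GDPval-AA v2, GLM-5.2 is the top open weights model and #3 overall, ahead of GPT-5.5. Among the other open models, the strongest, DeepSeek V4 Pro, scores 44, 7 points behind.

fofr@fofrAI · 2 days ago · 32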

> Change the table to be underwater sand

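Summary: The Omni Flash model shows impressive image editing, turning the table into a shallow pool and realistically rendering wet hands, ripples, refraction, shadows, and sound effects. The model is now available via API, and its editing ability is well suited to building impressive pipelines.

AK@_akhaliq · 2 days ago · 31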

OSWorld2.0 Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Summary: OSWorld2.0 benchmarks computer use agents on long-horizon real-world tasks.

Berryxia.AI@berryxia · 2 days ago · 34

Jonathan's first product at OpenAI; there really is nothing new here.

Rohan Paul@rohanpaul_ai · 3 days ago · 73

Arena’s AI leaderboard has become a $100M annualized revenue business. By turning public model comparisons into paid performance testing for AI labs and enterprises. Arena began as a UC Berkeley research project that asked users to compare 2 anonymous model answers and vote for the better one. That setup created a large human preference dataset, because every vote says something about what people value in AI responses. Model labs care about those votes because benchmarks alone often miss the messy cases where users judge tone, reasoning, code quality, visual skill, or task completion. Arena’s commercial move was to package that public testing engine into AI Evaluations, a service that gives customers deeper analytics from the same community feedback loop. The business works because model companies badly need high-quality human preference signals after training, since small ranking gains can decide which model wins users, enterprise contracts, and investor attention. --- techcrunch. com/2026/06/29/arena-the-ai-leaderboard-everyone-uses-is-now-a-100m-business/

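Summary: Arena's AI leaderboard began as a UC Berkeley research project that had users compare two anonymous model answers and vote, building up a large human preference dataset. The platform then packaged this public testing engine into a commercial service, AI Evaluations, which gives customers deeper analytics. Model labs badly need high-quality human preference signals because small ranking gains can decide user adoption, enterprise contracts, and investor attention. Arena is now a business with $100M in annualized revenue.

StepFun@StepFun_ai · 3 days ago · 41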

Step 3.7 Flash hits #2 on Claw-Eval General for autonomous agents. We’re seeing strong performance across multi-step execution and robustness in long-horizon tasks, ranking just behind Claude Opus 4.6. Promising signals for real-world agent workloads.

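Summary: Step 3.7 Flash ranks #2 on Claw-Eval General for autonomous agents. StepFun reports strong multi-step execution and robustness on long-horizon tasks, ranking just behind Claude Opus 4.6, and calls this a promising signal for real-world agent workloads.

elvis@omarsar0 · 3 days ago · 56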

LLM-as-a-Judge explained in ~10 mins. Knowing how to build AI verifiers and judges is one of the most important emerging AI skills today. Here is a quick intro on the topic and where to learn how to apply LLM-as-a-Judge.

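Summary: LLM-as-a-Judge explained in about 10 minutes. Knowing how to build AI verifiers and judges is one of the most important emerging AI skills today. The post gives a quick introduction and points to where to learn how to apply LLM-as-a-Judge.

Chubby♨️@kimmonismus · 3 days ago · 50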

.@emollick has used data from Artificial Analysis to show how the development of intelligence compares with open source. Two interesting points: -The development is still unquestionably exponential. There is no doubt about that. AI is not just getting better, it is getting better faster and faster. This is a truth that all of us probably already notice when using AI. -The gap to open source remains fairly constant. Chinese models in particular are still about half a year behind closed source. But that also means that Mythos-class models as open-source variants are genuinely realistic toward the end of the year. A very interesting graph.

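Summary: Using Artificial Analysis AA-Briefcase scores (which simulate complex multi-week consulting tasks), @emollick plotted frontier curves and found that closed-source AI models are improving exponentially and accelerating, while open-source models (Chinese ones in particular) still trail by about half a year. The optimistic prediction is that Mythos-class open-source variants could appear by the end of the year.

karminski-牙医@karminski3 · 3 days ago · 61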

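Here is a head-to-head review of Flash-series models! Besides their flagship models, every vendor also has a Flash-tier model, positioned mainly as the driver model for multi-agent systems and RAG systems. So how should you choose among today's Flash models? That is what this review covers. It evaluates three models, Gemini-3.5-Flash, Step-3.7-Flash, and DeepSeek-V4-Flash, on Agent Loop iteration, agent ability, front end, back end, spatial understanding, aesthetics, and value for money. In the tests, Gemini-3.5-Flash is better suited to "pretty work" such as front-end pages and modeling. Step-3.7-Flash offers excellent value, achieving higher token efficiency in the agent tests than even flagship models (doing the most work with the fewest tokens), so it is especially well suited to agent frameworks (such as OpenClaw or Hermes) or as the driver model in complex agent systems. DeepSeek-V4-Flash has very good back-end ability and is well suited to writing scripts, or even to installing a DeepSeek-V4-Flash-driven ClaudeCode on a server for AI-Ops. #flash模型 #step37flash #deepseekv4flash #gemini35flash #AgentLoop

Summary: The post compares three Flash-tier models (Gemini-3.5-Flash, Step-3.7-Flash, DeepSeek-V4-Flash), which are positioned as driver models for multi-agent and RAG systems. The evaluation covers Agent Loop iteration, agent ability, front end and back end, spatial understanding, aesthetics, and value for money. Gemini-3.5-Flash is better suited to "pretty work" such as front-end pages and modeling. Step-3.7-Flash offers excellent value, with very high token efficiency in the agent tests (completing the most tasks with the fewest tokens), making it a good driver model for agent frameworks such as OpenClaw and Hermes. DeepSeek-V4-Flash has strong back-end ability and suits script writing or driving ClaudeCode for AI-Ops.

Ethan Mollick@emollick · 3 days ago · 54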

Even though I made this graph, it is also kind of wrong. Fable is guardrailed Mythos. If we use the Mythos date

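Summary: Based on @ArtificialAnlys's AA-Briefcase evaluation (having AI carry out multi-week consulting tasks), @emollick plotted frontier curves for open and closed models, showing surprisingly rapid progress and a clear gap between open weights and closed models.

Ethan Mollick@emollick · 3 days ago · 70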

I took the new AA-Briefcase scores from @ArtificialAnlys (basically having the AI do multi-week consulting gigs with a lot of complexity) and graphed the frontier curve for open and closed models: 1) Surprise, rapid gains! 2) The open weights gap is clear https://artificialanalysis.ai/evaluations/aa-briefcase

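Summary: The author took the latest AA-Briefcase scores from @ArtificialAnlys (essentially having AI complete complex, multi-week consulting engagements) and plotted frontier curves for open and closed models: 1) surprisingly rapid gains, and 2) a clearly visible open weights gap.

Rohan Paul@rohanpaul_ai · 4 days ago · 44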

This paper asks whether AI agents have a real memory system yet, and finds the answer is mostly no. The problem is that AI agents now need memory that can store, search, update, and clean up information across long tasks. The authors say current tests mostly check final answers, so they miss whether the memory system itself is fast, reliable, or good at handling changed facts. They split agent memory into 4 parts: how memories are stored, how facts are extracted, how useful memories are found, and how old or conflicting memories are maintained. They tested 12 memory systems across 5 workloads and 11 datasets, including long conversations, multi-session recall, database tasks, and update-heavy settings. The main result is that no memory design wins everywhere, because graph memories help with linked facts, hybrid systems help with filtered search, and raw traces help when exact action history matters. ---- Link – arxiv. org/abs/2606.24775 Title: "Are They Ready For An Agent-Native Memory System?"

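Summary: A new paper argues that AI agents do not yet have a real memory system. Existing tests check only final answers and overlook the performance of the memory system itself. The paper splits agent memory into four parts (storage, fact extraction, retrieval of useful memories, and maintenance of old or conflicting memories) and evaluates 12 memory systems across 5 workloads and 11 datasets. The core finding is that no memory design wins everywhere: graph memories help with linked facts, hybrid systems help with filtered search, and raw traces work best when the exact action history matters.

Rohan Paul@rohanpaul_ai · 4 days ago · 65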

This paper shows that LLM agents still struggle to plan through big, messy tool libraries. The paper builds a retail benchmark PlanBench-XL, to test whether LLM agents can solve long tool-use tasks when tools are hard to find. With 327 tasks and 1,665 tools, where agents must uncover hidden intermediate facts before they can answer. Even strong models struggle, with GPT-5.4 getting 51.90% accuracy normally and dropping to 11.36% in the hardest blocked setting. The problem is that real agents often face huge tool libraries, so they cannot see every tool at once and must search for useful ones while solving the task. The core idea is to make agents plan both forward from what they know and backward from what they need, instead of giving them a clear tool path. The authors also add broken or misleading tools, so agents must notice when a promising path fails and then find another path. ---- Link – arxiv. org/abs/2606.22388 Title: "PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems"

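Summary: The paper introduces the PlanBench-XL benchmark, with 327 tasks and 1,665 tools, to test whether LLM agents can complete long-horizon tool-use tasks when tools are hard to find. GPT-5.4 reaches 51.90% accuracy in the normal setting and drops to 11.36% in the hardest blocked setting. The core idea is to make agents reason both forward from what they know and backward from what they need, rather than rely on an explicit tool path. The paper also adds broken or misleading tools to test whether agents can switch strategy on their own when a path fails.

fofr@fofrAI · 4 days ago · 20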

Gemini 3.5 Flash is a great workhorse model, especially for subagents. Determined, fast, gets jobs done.

Summary: Gemini 3.5 Flash is a great workhorse model, especially for subagents. It is determined, fast, and gets jobs done.

DogeDesigner@cb_doge · 4 days ago · 59

BREAKING: Elon Musk confirms Grok 4.5 is now in private beta at SpaceX and Tesla. • Early evals show performance close to, possibly exceeding Opus • Based on xAI’s 1.5T V9 foundation model • Trained with Cursor data added • Grok Build harness is getting better every day • New models trained from scratch will be released every month this year The pace at SpaceXAI is absolutely insane.

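Summary: BREAKING: Elon Musk confirms Grok 4.5 is now in private beta at SpaceX and Tesla. • Early evals show performance close to, possibly exceeding, Opus • Based on xAI's 1.5T V9 foundation model • Trained with Cursor data added • The Grok Build harness is improving every day • New models trained from scratch will be released every month this year. The pace at SpaceXAI is described as absolutely insane.

Ethan Mollick@emollick · 4 days ago · 60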

Nice example of the increasing benefits of open science and transparent methodologies when writing papers about AI.

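Summary: Addressing the problem that AI research papers go stale because of long peer-review cycles, a medical AI paper open-sourced its evaluation framework (GitHub: health-ai-readiness-eval). @yishan used the framework to rerun the tests on the latest models: GPT-5.5 Pro scored 79/100 on radiology image interpretation, beating the paper's original best model (69/100) but still short of the paper's bar for "suitable for reliable medical use" (robust to perturbation, able to recognize insufficient information, and giving clinically sound reasoning). @yishan could not fully reproduce the qualitative evaluation, but the basic tests indicate that the latest models, while improved, are not yet reliable enough for clinical use. He calls on all AI papers to open-source their experimental frameworks so the community can keep validating results.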