Words are very lossy pointers to complex concepts in our brains. Explaining these concepts to AI becomes incrementally harder as models become smarter and can do more things.

Artificial Analysis@ArtificialAnlys · 6月18日51

A standout number in Z ai’s GLM-5.2 launch is CritPt, a benchmark of unpublished research-level physics problems, where it ties with Claude Opus 4.8 and is well above other open weights models.
Key takeaways:
➤ @Zai_org’s GLM-5.2 (max reasoning effort) leads open weights by a wide margin: the next open model, DeepSeek V4 Pro, scores 12.9%
➤ GLM-5.2 matches Claude Opus 4.8 (20.9%) and beats several proprietary models, including GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7
➤ Only proprietary models score higher, with GPT-5.5 Pro topping the benchmark at 30.6%
➤ A 4.5× generational jump: GLM-5.1 scored just 4.6% on CritPt ten weeks ago

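Summary: Zhipu released GLM-5.2 (max reasoning effort), which scores 20.9% on the CritPt benchmark (unpublished research-level physics problems), tying Claude Opus 4.8 and far ahead of other open-weights models. DeepSeek V4 Pro scores only 12.9%; GLM-5.2 also beats proprietary models including GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.7. Only proprietary models score higher, topped by GPT-5.5 Pro at 30.6%. Compared with GLM-5.1's 4.6% ten weeks earlier, this is a 4.5× generational jump.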

Rohan Paul@rohanpaul_ai · 6月18日51

Quite a massive inferencing rack breakthrough from @TensordyneInc . They just announced an AI-inference rack, claiming 13x the rack throughput of NVIDIA’s NVL72 GB300 in a DeepSeek-R1 comparison based on internal simulations. What makes this a big deal is that Tensordyne is attacking inference at the math level. AI chips spend huge amounts of energy moving and multiplying numbers. Napier (its AI inference racks) works in log space, where multiplication becomes addition, and addition is cheaper to build, switch, cool, and repeat billions of times per token. So instead of spending tons of transistor budget on heavy multiply circuits, Napier tries to shrink the math itself. So that means less chip area for compute and more for SRAM, resulting in less power per token and way more inference packed into the same rack. If they have made log math accurate and fast enough for real inference, then Napier is not just pushing more power into a rack, it is changing the cost of the basic operation behind model serving. AI inference is no longer just a FLOPS race. It is a rack-level fight over power, memory locality, interconnect latency, and how many paying tokens can be served before the economics break. They reported their TDN Rack reaches 363,000 tokens per second on DeepSeek-R1 at user speeds of 210 tokens per second per internal simulation, compared with 27,400 tokens per second for Nvidia’s NVL72 GB300. 🧵 1.

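Summary: Tensordyne announced Napier, an AI inference rack, claiming 363,000 tokens/s on DeepSeek-R1 based on internal simulations (at a user speed of 210 tokens/s), 13x the 27,400 tokens/s of NVIDIA's NVL72 GB300. Napier computes in log space, turning multiplication into addition, which reduces chip area and power, leaves more transistors for SRAM, lowers energy per token, and raises inference density. The post argues this changes the economics of AI inference: the contest is no longer purely about FLOPS but about power, memory locality, interconnect latency, and the cost of serving tokens.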
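
The headline ratio can be checked from the two throughput figures quoted in the post. A minimal Python sketch; the "implied streams" line assumes the 363,000 tokens/s figure is an aggregate over users each served at 210 tokens/s, which is one reading of the post and not something Tensordyne states.

```python
# Figures quoted in the post (internal simulations, DeepSeek-R1).
napier_rack_tps = 363_000   # claimed tokens/s per Tensordyne rack
nvl72_rack_tps = 27_400     # quoted tokens/s for NVIDIA NVL72 GB300
per_user_tps = 210          # quoted per-user speed

throughput_ratio = napier_rack_tps / nvl72_rack_tps

# Assumption: rack throughput is the sum of equal per-user streams.
implied_streams = napier_rack_tps / per_user_tps

print(f"ratio: {throughput_ratio:.1f}x, implied streams: {implied_streams:.0f}")
```

The ratio comes out a little above 13, consistent with the rounded "13x" in the announcement.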

Emad@EMostaque · 6月17日44

I think it's increasingly clear that if Chinese AI labs can get enough compute they'll mog American ones.

karminski-牙医@karminski3 · 6月17日73

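GLM-5.2 has just been officially released, and here are my hands-on test results! Straight to the conclusion: the biggest improvement in this round is Agent capability, and it is a qualitative change. In testing, GLM-5.2 went directly to its destination without ever searching for nearby locations, because it had memorized the map at the very start! None of the 20+ models I had tested before could do this. Earlier models that wanted to reach a battery-swap station, for example, first had to search for nearby stations (which wastes a tool_call), whereas GLM-5.2 simply knew where the stations were and never used the search function. This ability to internalize the needed data into context up front and then reason over it across the entire 1M context is astonishing. Beyond that, Agentic Coding on backend code also improved in this test, reaching second place on the overall leaderboard. The biggest weakness exposed was spatial understanding. Its strength was also its undoing: although it had memorized the station locations, the station it went to was not the nearest one. It remembered the data but did not reason from its current position before using it, and that is its biggest shortcoming. I strongly suggest the team optimize this. #GLM52 #智谱 #智谱AI #AgenticCoding #长上下文能力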

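Summary: GLM-5.2 is officially released, and hands-on testing shows a qualitative change in Agent capability. The model internalizes map data into its 1M context and knows the battery-swap station locations directly, never calling the search function; it was the only one of more than 20 tested models to do so. Backend Agentic Coding rose to second place on the overall leaderboard. The weak point is spatial understanding: it remembers the station locations but fails to reason about which station is nearest to its current position.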

🚨 AI News | TestingCatalog@testingcatalog · 6月17日80

ZAI 🔥: GLM-5.2 by @Zai_org scored 51 points on the Artificial Analysis Intelligence Index and took the 4th spot! This makes GLM-5.2 the new SOTA open-weight model. Besides that, GLM-5.2 ranked second on Frontend Code Arena, behind the currently unavailable Claude Fable 5. Should be ZOTA! 👀

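Summary: Z ai launched GLM-5.2, which scores 51 on the Artificial Analysis Intelligence Index, ranks fourth, and becomes the open-weights SOTA. The model is the same size as GLM-5.1 (744B total / 40B active parameters) and gains 11 points on Intelligence Index v4.1. Scientific reasoning improves markedly: CritPt +16 points to 21%, HLE +12 points to 40%, GPQA Diamond +3 points to 89%. The context window grows to 1M tokens. API pricing is $1.4/$4.4/$0.26 per 1M input/output/cache-hit tokens, cost per task is about $0.46, and the model sits on the Intelligence vs Cost Pareto frontier. MIT license; available on third-party platforms including DeepInfra.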

Artificial Analysis@ArtificialAnlys · 6月17日61

Z ai’s GLM-5.2 is the new leading open weights model on the Artificial Analysis Intelligence Index, scoring 51, and it sits on the Pareto frontier of Intelligence vs Cost per Task.
@Zai_org’s GLM-5.2 is the same size as GLM-5.1 (744B total / 40B active parameters) but scores 11 points higher on the Intelligence Index v4.1, placing ahead of MiniMax-M3 (44) and DeepSeek V4 Pro (max, 44). On the first-party API it is priced in line with GLM-5.1 at $1.4/$4.4/$0.26 per 1M input/output/cache hit tokens.
Key results:
➤ GLM-5.2 is the leading open weights model on the Intelligence Index v4.1. At 51, it leads MiniMax-M3 (44), DeepSeek V4 Pro (max, 44) and Kimi K2.6 (43)
➤ Improvements across most evaluations, particularly scientific reasoning: GLM-5.2 gains over GLM-5.1 on most evaluations, led by scientific reasoning on CritPt (+16 points to 21%) and HLE (+12 points to 40%), alongside AA-LCR (+9 points to 71%), tau3 banking (+15 points to 27%) and SciCode (+7 points to 50%). TerminalBench v2.1 also improves (+16 points to 78%) and GPQA Diamond gains 3 points to 89%
➤ Leading open weights model on GDPval-AA v2 and competitive with proprietary models: GLM-5.2 scores 1524 on GDPval-AA v2, ahead of MiniMax-M3 (1418) and DeepSeek V4 Pro (max, 1328). This impressive result places GLM-5.2 in-line with proprietary models including GPT-5.5 (xhigh reasoning). GDPval-AA v2 builds on the original GDPval-AA by baselining Elo to human performance at 1000, introducing a rotating panel of frontier-model judges, and raising the turn limit from 100 to 250 for longer-horizon agent trajectories
➤ GLM-5.2 uses more output tokens per task than other leading open weights models: the model uses 43k output tokens per Intelligence Index task, up from GLM-5.1 (26k) and above MiniMax-M3 (24k), Kimi K2.6 (35k) and DeepSeek V4 Pro (max, 37k)
➤ On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
Additional Model Details:
➤ License: MIT
➤ Size: 744B total parameters, 40B active parameters, equivalent to GLM-5.1
➤ Context window: 1M tokens, up from 200K on GLM-5.1
➤ Pricing: $1.4/$0.26/$4.4 per 1M input/cache hit/output tokens
➤ Availability: Alongside Z ai's first-party API, GLM-5.2 is available across third-party providers including @DeepInfra, @novita_labs, @nebiusai, @parasailnetwork, @SiliconFlowAI, @gmi_cloud, @Baseten and @FireworksAI_HQ

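Summary: Z ai released GLM-5.2 (744B total / 40B active parameters), which scores 51 on the Artificial Analysis Intelligence Index v4.1, ahead of MiniMax-M3, DeepSeek V4 Pro, and Kimi K2.6. Scientific reasoning improves sharply: CritPt +16 points, HLE +12 points, and GPQA Diamond reaches 89%. It scores 1524 on GDPval-AA v2, on par with GPT-5.5 (xhigh reasoning). The context window expands to 1M tokens, under an MIT license. First-party API pricing is $1.4/$4.4/$0.26 per million input/output/cache-hit tokens, cost per task is about $0.46, and the model sits on the Intelligence vs Cost Pareto frontier.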
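
To make "on the Pareto frontier" concrete, here is a rough Python sketch that applies the usual dominance test to the five open-weights models whose score and cost per task appear in the post. The real chart contains many more models, so this only illustrates the definition; the GLM-5.1 score of 40 is inferred from "11 points higher" rather than quoted directly.

```python
def pareto_frontier(models):
    """Return the names of models that no other model dominates.

    Model A dominates B when A scores at least as high and costs at most
    as much, with at least one of the two comparisons strict.
    """
    frontier = []
    for name, (score, cost) in models.items():
        dominated = any(
            s >= score and c <= cost and (s > score or c < cost)
            for other, (s, c) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (Intelligence Index v4.1 score, cost per task in USD) as quoted in the post.
models = {
    "GLM-5.2": (51, 0.46),
    "GLM-5.1": (40, 0.25),  # score inferred: GLM-5.2 is "11 points higher"
    "Kimi K2.6": (43, 0.31),
    "MiniMax-M3": (44, 0.18),
    "DeepSeek V4 Pro (max)": (44, 0.05),
}

frontier = pareto_frontier(models)
print(frontier)
```

Within this small subset, GLM-5.2 survives because nothing else reaches its score, and DeepSeek V4 Pro survives because nothing else is cheaper.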

SemiAnalysis@SemiAnalysis_ · 6月17日65

POV: @ohnePixel getting a platform for day 0 DeepSeek V4 deployment. Find out more at: https://semianalysis.substack.com/p/deepseekv4-16t-day-0-to-day-43-performance

歸藏(guizang.ai)@op7418 · 6月17日72

You can add Zhipu GLM-5.2 yourself in Codepilot's model management.

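Summary: Zhipu GLM-5.2 is officially released and open-sourced, positioned for long-horizon tasks. The model has a stable 1M-token context window and introduces thinking-effort control. Architecturally it uses the IndexShare mechanism, in which every four sparse-attention layers share one indexer, reducing per-token compute by about 2.9x at million-token context. Users can now add GLM-5.2 in Codepilot's model management.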

karminski-牙医@karminski3 · 6月17日67

GLM-5.2 is officially released! An evaluation video is coming shortly~

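Summary: Zhipu (Z.ai) released the GLM-5.2 model, with significant improvements on coding and agent tasks and support for a 1M context window. It offers two reasoning modes: GLM-5.2 (max) for peak performance and GLM-5.2 (high) for a balance of performance and token efficiency. Model weights are open-sourced under the MIT license, and API pricing matches GLM-5.1.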

歸藏(guizang.ai)@op7418 · 6月17日79

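Zhipu GLM-5.2 is officially released and open-sourced, and the benchmark scores are quite startling. Its core positioning is long-horizon tasks, with a stable 1M-token context, and the model also introduces thinking-effort control. At the architecture level, GLM-5.2 proposes the IndexShare mechanism: every four sparse-attention layers share one indexer, which reduces per-token compute by about 2.9x at million-token context.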

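Summary: Zhipu released and open-sourced GLM-5.2, positioned for long-horizon tasks, with a stable 1M-token context. It introduces thinking-effort control: GLM-5.2 max targets peak performance, while GLM-5.2 high balances efficiency. The architecture uses the IndexShare mechanism, with every four sparse-attention layers sharing an indexer, cutting per-token compute by about 2.9x at million-token context. Coding and agent task performance improves markedly. Model weights are open-sourced under the MIT license, and API pricing matches GLM-5.1.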
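
The post only says that every four sparse-attention layers share one indexer. The toy Python sketch below illustrates that sharing pattern and nothing more: `select_top_k`, `run_layers`, and `toy_scores` are hypothetical names, the scores are made up, and the sketch does not reproduce the reported 2.9x saving, which depends on details the post does not give.

```python
def select_top_k(scores, k):
    """Indices of the k highest-scoring context tokens, in position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def run_layers(num_layers, share_every, score_fn, k):
    """Return the per-layer token selections and the number of indexer calls."""
    selections, indexer_calls, cached = [], 0, None
    for layer in range(num_layers):
        if layer % share_every == 0:
            # First layer of each group runs the indexer.
            cached = select_top_k(score_fn(layer), k)
            indexer_calls += 1
        # Every layer in the group attends to the same cached selection.
        selections.append(cached)
    return selections, indexer_calls

def toy_scores(layer):
    # Hypothetical relevance scores for 8 context tokens; a real indexer is learned.
    return [(7 * i + 3 * layer) % 11 for i in range(8)]

selections, calls = run_layers(num_layers=8, share_every=4, score_fn=toy_scores, k=3)
print(calls, selections[0])
```

With eight layers and groups of four, the indexer runs twice instead of eight times; the attention work inside each layer is unchanged in this sketch.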

Rohan Paul@rohanpaul_ai · 6月17日55

Tensordyne just announced a breakthrough inference system: logarithmic AI compute chips that it says deliver 17x more tokens per watt and 13x higher throughput than NVIDIA Blackwell. The main math advance they say they unlocked is efficient logarithmic math directly in hardware. In log space, multiplication turns into addition, which is much easier to build than multiplier circuits. That allows smaller compute circuits on the chip than today’s FP8 and INT8 GPUs. With fewer transistors, the chips stay cooler and use less energy, while the extra die space can hold more tensor engines, additional high-bandwidth SRAM and HBM3e memory, plus a fast interconnect fabric. For DeepSeek-R1, Tensordyne claims 363K tokens/sec per rack versus 27.4K for Nvidia’s comparison system. They have successfully completed tape-out of the Napier processor, which is now in production at TSMC on its 3nm process node.

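Summary: Tensordyne announced a breakthrough inference system built on logarithmic AI compute chips. Compared with NVIDIA Blackwell, it claims 17x more tokens per watt and 13x higher throughput. The core innovation is efficient logarithmic arithmetic in hardware, which turns multiplication into addition, shrinking compute circuits, cutting transistor count and power, and freeing die area for more tensor engines, high-bandwidth SRAM, and HBM3e memory. For DeepSeek-R1, a single rack is claimed to reach 363K tokens/sec versus 27.4K for the comparison system. The Napier processor has completed tape-out and is in production on TSMC's 3nm process.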
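
The "multiplication becomes addition" idea is the standard logarithm identity log2(a·b) = log2(a) + log2(b). The Python sketch below shows it with a sign-plus-exponent encoding; this is a textbook illustration with hypothetical helper names, not a description of how Napier represents numbers, which the posts do not specify.

```python
import math

def to_log(x):
    """Encode a nonzero real as (sign, log2|x|)."""
    if x == 0:
        raise ValueError("zero has no finite logarithm; real designs need a special case")
    return (1 if x > 0 else -1, math.log2(abs(x)))

def log_mul(a, b):
    """Multiply two log-encoded values: signs multiply, exponents add."""
    return (a[0] * b[0], a[1] + b[1])

def from_log(v):
    """Decode (sign, exponent) back to an ordinary float."""
    sign, exponent = v
    return sign * 2.0 ** exponent

product = from_log(log_mul(to_log(6.0), to_log(-7.0)))
print(product)  # approximately -42, up to floating-point rounding
```

The trade-off is that addition of two values is no longer a single cheap operation in this encoding, so a log-domain design still has to handle sums, for example with approximations or lookup tables.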

NotebookLM@NotebookLM · 6月17日57

Our more powerful NotebookLM experience is now 100% rolled out to Google AI Ultra subscribers globally. We're so excited to see what you make. Share your charts, images, spreadsheets, and raw unfiltered thoughts with us below! 🥰

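Summary: NotebookLM's more powerful chat experience is now 100% rolled out to Google AI Ultra subscribers globally. The upgraded version is powered by Gemini 3.5 and Antigravity and has an improved chat interface that lets users see the AI's thinking process more clearly. Each notebook comes with a secure cloud computer containing 100+ curated software skills, supporting deeper research and complex analysis.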

elvis@omarsar0 · 6月17日70

No time wasting on the frontier of open-weight models. GLM-5.2 looks impressive based on the results I've seen. Very curious to see how it holds on long-horizon tasks.

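Summary: Z.AI released GLM-5.2 with open weights under the MIT license. The model improves significantly on coding and agent tasks, supports a 1M context window, and has long-horizon capability. It offers two reasoning efforts, GLM-5.2 (max) and GLM-5.2 (high), the latter balancing performance and token efficiency. API pricing is the same as GLM-5.1. Elvis Saravia of DAIR.AI calls it impressive among frontier open-weight models and is watching how it holds up on long-horizon tasks.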

MiniMax (official)@MiniMax_AI · 6月17日25

happy world cup everyone ⚽️ FWC-Bench when?

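Summary: MiniMax's M3 model correctly predicted a draw in the Qatar vs Switzerland World Cup match, the only correct pick among five models and one human forecaster. Analysis from Kilo CLI shows the benchmark deliberately excludes betting odds, so Switzerland's 64% market odds were not factored in. M3 based its call on the two sides' identical WWDLW records, Qatar's higher raw rating, and Switzerland's stronger league level. The main post also asks "FWC-Bench when?", hinting at a possible new benchmark.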

Rohan Paul@rohanpaul_ai · 6月17日72

This paper shows a strange weakness in AI reasoning: models can solve math, yet fail to judge reasoning. The unsettling part is not that frontier models make arithmetic mistakes. It is that they can reach the right answer, see the right answer in someone else’s solution, and then forgive broken logic that should have been easy to catch. The authors call this the production-evaluation gap: the gap between generating a solution and evaluating whether a given solution actually earns its conclusion. Their Valid-Answer-Invalid-Reasoning (VAIR) benchmark makes the trap clean. The final answer is correct, but the reasoning is damaged by missing steps, shuffled steps, missing premises, or circular explanation. A careful evaluator should say, “Yes, the answer is right, but the argument does not justify it.” Many reasoning models instead appear to do something lazier and more dangerous: they solve the problem themselves, confirm the final answer, and then rationalize the path as acceptable. That is not reasoning vigilance. It is answer confirmation bias wearing the costume of mathematical judgment. The mechanism matters because modern AI training often rewards outcomes more than valid intermediate thought. A model trained to get the answer may learn to treat the answer as the evidence, especially when grading another chain of reasoning. Humans were not perfect here, but the contrast is revealing: people showed only a small drop from solving to grading, while models collapsed much more sharply on the same kind of task. This is where the result becomes larger than math. If AI systems can mass-produce plausible arguments but cannot reliably police the logic inside them, they become engines of confidence rather than engines of understanding. ---- Link – arxiv. org/abs/2606.01462 Title: "An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models"

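Summary: A new paper exposes a "production-evaluation gap" in large reasoning models: a model can solve a math problem and reach the correct answer, yet when evaluating someone else's reasoning it tends to accept the argument as long as the final answer is correct, even when the logic has obvious defects such as missing steps, shuffled steps, missing premises, or circular argument. The authors propose the VAIR (Valid-Answer-Invalid-Reasoning) benchmark to test this. The behavior is described as "answer confirmation bias": the model judges reasoning by the correct answer rather than by valid logic. Compared with humans, models drop much more sharply from solving to grading, which suggests AI may become an engine of confidence that produces plausible-looking arguments rather than an engine of understanding.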
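
The kinds of damage described for VAIR items (missing steps, shuffled steps, circular explanation) can be sketched as simple transformations of a correct solution. This Python example is an illustration only: the helper names are hypothetical, the paper's actual construction procedure is not described in the post, and the fourth category (missing premises) is left out.

```python
import random

def drop_step(steps, rng):
    """Remove one intermediate step, keeping the final conclusion."""
    i = rng.randrange(len(steps) - 1)
    return steps[:i] + steps[i + 1:]

def shuffle_steps(steps, rng):
    """Reorder the intermediate steps (needs 2+ distinct ones); conclusion stays last."""
    body, conclusion = steps[:-1], steps[-1]
    shuffled = body[:]
    while shuffled == body:
        rng.shuffle(shuffled)
    return shuffled + [conclusion]

def make_circular(steps, answer):
    """Replace the derivation with a step that assumes the conclusion."""
    return [f"Assume x = {answer}.", steps[-1]]

steps = [
    "Start from 2x + 3 = 11.",
    "Subtract 3 from both sides: 2x = 8.",
    "Divide both sides by 2: x = 4.",
]
rng = random.Random(0)
variants = {
    "missing_step": drop_step(steps, rng),
    "shuffled": shuffle_steps(steps, rng),
    "circular": make_circular(steps, 4),
}
for label, chain in variants.items():
    print(label, chain)
```

Every variant still ends in the correct answer, which is the point of the trap: a grader that only checks the last line will accept all three.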

Chubby♨️@kimmonismus · 6月17日83

Let's go, GLM-5.2 released as Open Weights model.
tl;dr
- 1M context window
- MIT-licensed open weights
- Stronger long-horizon coding agents
- Two reasoning modes: max and high
- Same API pricing as GLM-5.1
Zai says GLM-5.2 was trained specifically for large-scale implementation, automated research, performance optimization, and complex debugging. Open Source got a serious upgrade today!

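Summary: GLM-5.2 is released as an open-weights model under the MIT license with a 1M context window. It offers two reasoning modes: max (peak reasoning) and high (balancing performance and token efficiency). It improves significantly on coding and agent tasks and was trained specifically for large-scale implementation, automated research, performance optimization, and complex debugging. API pricing matches GLM-5.1.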

🚨 AI News | TestingCatalog@testingcatalog · 6月17日77

ZAI 🔥: GLM-5.2 is now available on huggingface! > It comes with a 1M context window and 2 levels of reasoning effort, max and high. MIT license and same pricing as GLM-5.1. > GLM-5.2 scores 46.2% on DeepSWE, the SOTA score among open-weight models.

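Summary: ZAI released GLM-5.2 on Hugging Face under the MIT open-source license, with API pricing the same as GLM-5.1. The model supports a 1M context window and offers two reasoning-effort levels: max (peak performance) and high (balancing performance and token efficiency). It improves significantly on coding and AI agent tasks and has long-horizon task capability. It scores 46.2% on the DeepSWE benchmark, the SOTA among open-weight models.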

OpenRouter@OpenRouter · 6月17日53

GLM-5.2 from @Zai_org is live on OpenRouter! Z.ai's flagship for long-horizon tasks, now with a 1M-token context window capable of being reliable across long, messy coding-agent work.

StepFun@StepFun_ai · 6月17日51

Excited to see Step 3.7 Flash live via @novita_labs on @OpenRouter. Built for high-efficiency agent workloads, Step 3.7 Flash combines native multimodal understanding, strong agentic coding capabilities, reliable tool use, and web & visual search workflows for production AI agents. Thanks to the Novita team for helping expand the StepFun ecosystem.

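Summary: StepFun's Step 3.7 Flash is now live on OpenRouter via Novita. The model is built for high-efficiency agent workloads, with native multimodal understanding, strong agentic coding, reliable tool use, and web and visual search workflows. The quoted post highlights its efficient multimodal reasoning and multi-step tool use, aimed mainly at coding and agent applications.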

Ant Ling@AntLingAGI · 6月16日77

Ling & Ring 2.6 technical report is out, with two open-weight base models. We co-design model + system across architecture, training, and agentic capability: • 7:1 hybrid linear attention • KPop for stable agentic RL: SWE-bench Verified 76.28% • ~4× token efficiency

François Chollet@fchollet · 6月16日36

The way we will create a future where powerful AI is open-source and available to all is by making AI radically more efficient, both in terms of inference compute and (more importantly) in terms of training data requirements. This is what symbolic learning will achieve.

Artificial Analysis@ArtificialAnlys · 6月16日60

Announcing Artificial Analysis Intelligence Index v4.1: a shift toward agentic workloads, featuring upgraded benchmarks and new per-task metrics.
The Artificial Analysis Intelligence Index is our synthesis metric for assessing model intelligence and tracking AI progress. v4.1 marks a broader shift toward agentic workloads, with three main changes:
1. Updated and reweighted evaluations toward agentic tasks: We upgraded three evaluations, removed one, and reweighted the Intelligence Index:
➤ Upgraded Terminal-Bench Hard to Terminal-Bench 2.1 and τ²-Bench Telecom to τ³-Bench Banking. Both move to newer, more robust task sets with harder, more realistic agentic scenarios that better separate frontier models
➤ Upgraded GDPval-AA to GDPval-AA v2. The upgrade re-baselines Elo to human performance at 1000, introduces a rotating panel of frontier-model judges, and raises the turn limit from 100 to 250 for longer-horizon agent trajectories
➤ Removed IFBench due to saturation. The benchmark no longer distinguishes frontier models sufficiently, so we have removed it from the Intelligence Index. We will continue to run it and publish results on new model releases
2. Cost per Task, Time per Task, and Tokens per Task: Three new per-task metrics, reported for every model and based on the Intelligence Index. We take the total cost, total time, and total output tokens for a model to run the Intelligence Index and divide by the number of tasks across its evaluations, giving the average cost, time, and output tokens to complete a single Intelligence Index task
3. Cached input token reporting: We now report cached input tokens and their impact on cost, including the cost to run the Intelligence Index, to better reflect the real cost of running each model
Key Results:
➤ Leading models: Claude Fable 5 (with Opus 4.8 fallback, 60) leads the Artificial Analysis Intelligence Index v4.1 by four points but is currently unavailable, leaving Claude Opus 4.8 (max, 56) as the most intelligent available model, ahead of GPT-5.5 (xhigh, 55)
➤ Open weights leading models: Among open weights models, DeepSeek V4 Pro (max, 44) and MiniMax M3 (44) lead, followed by Kimi K2.6 (43) and MiMo-V2.5-Pro (42)
➤ Cost per Task: Claude Opus 4.8 (max) is the most expensive available model at $1.78 per task, with Claude Fable 5 the highest overall at $3.25. GPT-5.5 (xhigh) scores within a point of Opus 4.8 on the Intelligence Index at $0.99 per task. DeepSeek V4 Pro (max) stands out on the Intelligence vs Cost per Task chart at $0.04 per task, with other leading proprietary models costing 20x to 45x more
➤ Time per Task: time per task (inference decode time) ranges from 1.5 minutes for Grok 4.3 (high) to 13.5 for Claude Sonnet 4.6 (max), a roughly 9x spread. Claude Opus 4.8 (max) completes a task in 6.4 minutes and GPT-5.5 (xhigh) in 3.7, while Gemini 3.1 Pro Preview stands out on the Intelligence vs Time per Task chart at 1.6 minutes for a score of 46

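Summary: Artificial Analysis released Intelligence Index v4.1, shifting toward agentic tasks. It upgrades to Terminal-Bench 2.1, τ³-Bench Banking, and GDPval-AA v2 (Elo re-baselined, a panel of frontier-model judges, turn limit raised to 250) and removes the saturated IFBench. It adds per-task cost, time, and output-token metrics plus the impact of cached tokens. Key results: Claude Fable 5 (60) leads but is unavailable; among available models Claude Opus 4.8 (max) is first at 56, with GPT-5.5 (xhigh) at 55. Among open weights, DeepSeek V4 Pro and MiniMax M3 both score 44. On cost, Opus 4.8 is $1.78 per task, GPT-5.5 $0.99, and DeepSeek V4 Pro just $0.04. On time, Grok 4.3 is fastest (1.5 minutes), Opus 4.8 takes 6.4 minutes, GPT-5.5 takes 3.7 minutes, and Gemini 3.1 Pro Preview scores 46 at 1.6 minutes.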
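
The per-task metrics are described as totals divided by the number of tasks. A minimal Python sketch of that calculation follows; the function name and the example totals are hypothetical, chosen only so the averages are easy to read, and they are not Artificial Analysis data.

```python
def per_task_metrics(total_cost_usd, total_seconds, total_output_tokens, num_tasks):
    """Average cost, time, and output tokens for one Intelligence Index task."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return {
        "cost_per_task_usd": total_cost_usd / num_tasks,
        "minutes_per_task": total_seconds / num_tasks / 60,
        "output_tokens_per_task": total_output_tokens / num_tasks,
    }

# Hypothetical totals for a full run of the index (not real measurements).
metrics = per_task_metrics(
    total_cost_usd=92.0,
    total_seconds=36_000.0,
    total_output_tokens=8_600_000,
    num_tasks=200,
)
print(metrics)
```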

Ethan Mollick@emollick · 6月16日46

If AGI is achievable & labs can be banned from using a model internally ONLY if they release the model publicly, the Big Three labs may decide it is better to capture all the value from AGI themselves by expansion & acquisition. Sharing AI access with other firms triggers risk.

Epoch AI@EpochAIResearch · 6月16日47

Claude Fable 5 achieves a new high score of 161 on the Epoch Capabilities Index! This beats out GPT-5.5 Pro by 1 point, and is the first time Anthropic has taken the lead on the ECI in over a year.

MiniMax (official)@MiniMax_AI · 6月16日38

Nice demo from @atomic_chat_hq: M3 Q4 ran locally with MLX-VLM, and completed a US customs form entirely on a Mac Studio M3 Ultra.

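Summary: MiniMax showcased local inference with its open-source model M3 Q4 (the 4-bit quantized version): deployed with MLX-VLM on a Mac Studio M3 Ultra, the model read a driver's license photo and scanned documents and then completed a US customs declaration form automatically. Processing took about 31 seconds, with 1,847 input tokens and 736 output tokens. During the run the model streamed its reasoning chain and called three tools, write_field, mark, and sign, with no human intervention.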

Rohan Paul@rohanpaul_ai · 6月16日58

Pythagoras-Prover just made Lean theorem proving look far less dependent on giant models, with a 4B prover beating DeepSeek-Prover-V2-671B at MiniF2F Pass@32. It shows that in formal reasoning, better data geometry can buy back an astonishing amount of scale. A theorem prover is not just a language model writing clever math; it is a machine trying to produce text that survives a compiler with no patience for style, confidence, or almost-right reasoning. The main trick is data efficiency: the team built about 800K Lean-verified examples, trained from easy to hard, then used LoRA so the model learned without updating every parameter.

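Summary: The Pythagoras-Prover team released its smallest theorem prover, a 4B version, and a first diffusion-model proof-of-concept version, both with only 4B parameters. On MiniF2F, the 4B model beats DeepSeek-Prover-V2-671B with 86.1% Pass@32; the 32B version reaches 89.8% Pass@32 and 92.6% Pass@2024, the current best results. The key is data efficiency: about 800K Lean-verified examples were constructed, training ran from easy to hard, and LoRA fine-tuning avoided full-parameter updates. The model's context window is 8192 tokens. The model, data, and training pipeline will be open-sourced in stages.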
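
For readers unfamiliar with Pass@32: pass@k is the probability that at least one of k sampled proofs is accepted. The Python sketch below uses the widely used unbiased estimator computed from n samples with c successes; whether the Pythagoras-Prover team used this estimator or simply drew exactly 32 samples per problem is not stated in the post, so treat this as an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem from n samples, c of which succeeded.

    Equals 1 minus the probability that k samples drawn without
    replacement from the n are all failures.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 64 proof attempts, 2 verified by Lean.
estimate = pass_at_k(n=64, c=2, k=32)
print(round(estimate, 4))
```

A benchmark-level score is then the mean of this per-problem value over all problems.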

Nathan Lambert@natolambert · 6月16日56

I launched 3 more videos in my post-training course! 1. Lecture 5: The rise of reasoning models 2. Lecture 6: DPO derivation, intuitions, and practice 3. A Q&A from readers on lectures 1-4 rlhfbook dot com slash course More soon!

Rohan Paul@rohanpaul_ai · 6月16日43

Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs. While mostly matching the full version’s benchmark performance. This can happen when attention stops treating every token as equally worth revisiting. The trick is not to abandon softmax attention, but to make it selective before it becomes expensive. MSA adds a small routing branch beside ordinary Grouped Query Attention, letting each query group choose the key-value blocks it should inspect while the main branch performs exact attention only inside that chosen set. The model is no longer paying to compare every new thought with the entire past, only with the parts its learned indexer predicts are worth comparing. Long context is not a memory feature by itself; it is a retrieval problem under brutal latency constraints, where the model must decide what deserves bandwidth at the moment of use. MiniMax Sparse Attention is compelling because it moves that decision into the architecture, trains the selector against the model’s own attention patterns. ---- Link – arxiv. org/abs/2606.13392 Title: "MiniMax Sparse Attention"

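Summary: At 1M tokens, MiniMax Sparse Attention (MSA) cuts attention compute by 28.4x, with 14.2x faster prefill and 7.6x faster decoding on H800 GPUs, while benchmark performance stays roughly on par with the full-attention version. MSA does not abandon softmax attention; it adds a small routing branch beside Grouped Query Attention so that each query group chooses which key-value blocks to inspect, and the main branch performs exact attention only on that subset. The approach treats long context as a retrieval problem under latency constraints, builds the selector into the architecture, and trains the router on the model's own attention patterns, making attention selective rather than exhaustive.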
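
The select-then-attend pattern described above can be sketched for a single query in plain Python. This is a toy under stated assumptions: the router here scores each key-value block by the query's dot product with the block's mean key, which is a stand-in for MSA's learned routing branch, and there is no grouping of queries as in Grouped Query Attention.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size, top_blocks):
    """Exact softmax attention restricted to the top-scoring KV blocks."""
    # Split the KV cache into contiguous blocks of token positions.
    blocks = [range(s, min(s + block_size, len(keys)))
              for s in range(0, len(keys), block_size)]

    def block_score(block):
        # Router stand-in (assumption): query . mean key of the block.
        mean_key = [sum(keys[i][d] for i in block) / len(block)
                    for d in range(len(query))]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    chosen = sorted(ranked[:top_blocks])
    token_ids = [i for b in chosen for i in blocks[b]]

    # Ordinary scaled softmax attention, but only over the chosen tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(len(values[0]))]
    return output, chosen

keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[1.0], [1.0], [5.0], [5.0]]
output, chosen = block_sparse_attention([0.0, 1.0], keys, values, block_size=2, top_blocks=1)
print(output, chosen)
```

Setting `top_blocks` to the total number of blocks recovers full attention, which is a convenient way to compare the sparse and dense outputs on small inputs.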

Nathan Lambert@natolambert · 6月15日54

This isn't very true. A big part of the problem is that the labs use the term distillation, which is a general post-training technique, in lieu of a specific issue of jailbreaking the API. (1) There is a second debate of *how* impactful distillation is, but it is definitely helpful. (2) This is entirely based on how the Chinese labs are jailbreaking the APIs to get reasoning traces out, which help bootstrap reasoning behaviors in new domains.
There's a third point (3) which I take an excerpt from my recent piece, where the labs need to be more transparent why especially point (2) is true.
From the third piece:
"On the point of distillation, my hypothesis is that API builders don’t have an easy time preventing hacks or jailbreaking because it’s a deeply grounded property of reasoning models to want to output the reasoning traces, and it would make the model far less intelligent to fully patch the behavior. This is based on a few assumptions:
a) Chinese labs are not just showing up as customers to Anthropic’s API and paying for tokens in the intended input-output form. If the Chinese labs are paying for intended use behaviors, despite being banned by the terms and conditions, I don’t have a lot of sympathy for the frontier labs manifesting policy actions against this.
b) Reasoning traces are disproportionately effective at seeding behavior in downstream models.
c) Leading labs work very hard to patch the pipeline of these jailbreaks.
So, my logical conclusion is that the model companies would have to weaken their economic position to fully protect their IP. If this is the case, Anthropic would get a lot more sympathy from the AI research community by being transparent. It would also be far easier to have informed policy discussions, and not rely on me proposing Occam’s razor explanations for what the API jailbreaking looks like."
There's no need to misinform people because the labs use a bad term. The labs use this term partially to make the discourse confusing, as you're doing.
(1) See https://www.interconnects.ai/p/the-distillation-panic
(2) See: https://www.interconnects.ai/p/how-much-does-distillation-really
(3) See: https://www.interconnects.ai/p/claude-fable-5-and-new-ai-safety

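Summary: Lambert argues that US labs use the term "distillation" in a way that obscures the real issue of API jailbreaking. Chinese labs jailbreak the APIs to extract reasoning traces, which help bootstrap reasoning behaviors in new domains. He hypothesizes that API providers struggle to fully prevent jailbreaking because reasoning models inherently tend to output their reasoning traces, and fully patching this would make the models less intelligent. He calls on the labs to be more transparent about this so that policy discussions can be informed.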

Ethan Mollick@emollick · 6月15日53

Weird headline - I am not sure solving 7 out of 10 novel very hard problems meant AI "did not live up to the task," when 15 months ago LLMs couldn't do math. But the actual study is interesting and illuminates flaws & successes of AIs in math. https://1stproof.org/assets/docs/report.pdf

Quoted post (@Nature, translated back from the Chinese rendering): "Artificial intelligence went through its most rigorous mathematics test, yet it did not live up to the task" https://go.nature.com/4oqlNk6

Baidu Inc.@Baidu_Inc · 6月15日53

DuMate just got more efficient. With its latest core engine upgrade, driven by optimizations to the Harness engine and related engineering workflows, DuMate can now complete the same tasks with 75% lower token consumption, without compromising task performance. For users, that means 75% lower credit consumption too.

Berryxia.AI@berryxia · 6月15日60

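A 12B local model has had Fable 5's reasoning chains distilled straight into it, so you can now run top-tier coding ability offline on a consumer GPU. This Gemma 4 12B Coder GGUF is fine-tuned from Google's gemma-4-12B-it and targets code generation and complex reasoning. The training data uses real passing cases from Composer 2.5, with Fable 5 helping to fill in the difficult cases, so that every reasoning step leads toward code that actually runs. Best of all, it ships in GGUF format: it runs smoothly on a 12GB GPU and can even run on CPU. Debugging, code completion, generating complex algorithms, and chain-of-thought prompting can all be done locally, with no API fees and no worries about export controls. People used to assume frontier models either had to be used in the cloud or could not be run at all; now the open-source community has packaged Fable 5's way of thinking into a version that fits on your laptop. The model is still iterating quickly, downloads have passed six thousand, and community feedback says it is especially strong in local coding scenarios. This move closes the gap between "powerful but restricted" and "locally usable". Real AI productivity has never been about waiting for big companies to grant access; it comes from the community freeing the capability with its own hands.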

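Summary: Berry Xia introduces the Gemma 4 12B Coder GGUF model, fine-tuned from Google's gemma-4-12B-it. It distills Fable 5's reasoning chains into a 12B-parameter model, with training data drawn from real passing cases from Composer 2.5 and completed with help from Fable 5. The GGUF format lets the model run locally on a 12GB consumer GPU, and even on CPU. It is optimized for code generation, debugging, complex algorithms, and chain-of-thought prompting, with no API fees and no export restrictions. The model is based on Google's latest gemma-4 architecture, has passed six thousand downloads, and community feedback reports strong performance in local coding scenarios, closing the gap between cloud models and local usability.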

karminski-牙医@karminski3 · 6月15日53

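A 27B small model challenging Fable 5? And succeeding? Big news: with the Iterative-Contextual-Refinements framework, Qwen3.6-27B outscored Anthropic Fable5 on a benchmark! Is this really not a dream? Or is it a case of never losing on benchmarks and never winning in practice? So I took a quick look at the framework and found the design very instructive, with a lot to learn from. Here is a detailed walkthrough. The framework mainly improves software performance optimization, that is, how to make code run faster. You may remember my vector-db-bench, which gave the model flame graphs, perf, and various testing tool_calls so that it could iterate on code performance by itself. This framework goes a step further and targets the core weakness of small models: the "brain damage" that comes from too few parameters, meaning small models are more prone to long-context degradation and to getting stuck in local optima. First, for the technical approach, it uses a BFS exploration mode. During the planning stage of writing code, the small model is asked to propose multiple solutions. For a string-matching task, for example, a small model might jump straight to an O(N^2) brute-force search; at this step the Agent prompts it to think about which other solutions are possible. That widens the small model's view, and approaches such as KMP or a sliding window may well come up. Next comes the DFS mode used while writing code: the Agent has the small model run benchmarks repeatedly with code performance testing tools, then reflect on which performance hotspots can be optimized, and then optimize them. Finally, there is a router that coordinates the whole process. It not only picks the best technical approach during BFS/DFS, but also summarizes the problems the model hits during DFS optimization and feeds them back into the BFS stage, telling the model that optimization xxx is worthwhile and that optimization xxx runs into problem xxx. This closes the optimization loop and fixes the problem of the model getting stuck in a dead end and oscillating back and forth. In the end, with the framework, Qwen3.6-27B scored 95.5 on the CGRE test, beating Fable5 (Mythos) at 94.1! I can only say this is truly a victory for Agentic engineering. When a model writes poor code, don't reflexively blame the model; check whether the Agent itself is the problem. So what is the cost? The hard currency of AI, of course: tokens. The framework achieved this feat with 25-40x the token consumption. Worth studying. Framework: http://github.com/ryoiki-tokuiten/Iterative-Contextual-Refinements Paper: http://arxiv.org/abs/2605.15222 #mythos #fable5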

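Summary: The Iterative-Contextual-Refinements framework lets Qwen3.6-27B score 95.5 on the CGRE test, beating the 94.1 of Anthropic Fable5 (Mythos). The framework uses BFS to explore multiple approaches (such as KMP or a sliding window), DFS with performance tools to iteratively optimize code, and a coordinating router that closes the loop, overcoming small models' tendency to get stuck in local optima. The cost is a 25-40x increase in token consumption. The framework and paper are publicly available.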
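
The BFS, DFS, and router roles described in the post can be sketched as a small control loop. This Python example is a hypothetical reconstruction from the post's description, not code from the Iterative-Contextual-Refinements repository: `refine`, `propose`, `improve`, and `benchmark` are made-up names, and the toy stubs stand in for LLM calls and real performance measurements.

```python
def refine(propose, improve, benchmark, breadth=3, depth=3):
    """Explore several plans (BFS), deepen each one (DFS), keep the best."""
    notes = []                                   # router feedback for later planning
    best_plan, best_score = None, float("-inf")
    for plan in propose(breadth, notes):         # BFS: several candidate approaches
        score = benchmark(plan)
        for _ in range(depth):                   # DFS: measure, reflect, optimize
            candidate = improve(plan, score)
            candidate_score = benchmark(candidate)
            if candidate_score <= score:
                notes.append(f"{plan}: refinement stalled at {score}")
                break
            plan, score = candidate, candidate_score
        if score > best_score:                   # router: pick the best approach so far
            best_plan, best_score = plan, score
    return best_plan, best_score, notes

# Toy stand-ins (assumptions): a real system would call a model and a profiler here.
BASE_SCORES = {"brute_force": 10, "kmp": 50, "sliding_window": 40}

def propose(breadth, notes):
    return list(BASE_SCORES)[:breadth]

def improve(plan, score):
    return plan + "+opt"

def benchmark(plan):
    # Each of the first two optimizations adds 5 points; further ones add nothing.
    return BASE_SCORES[plan.split("+")[0]] + 5 * min(plan.count("+opt"), 2)

best_plan, best_score, notes = refine(propose, improve, benchmark)
print(best_plan, best_score, len(notes))
```

In the framework as described, the notes would be fed into the next round of planning; here they are only collected and returned to keep the sketch short.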

Rohan Paul@rohanpaul_ai · 6月14日59

Researchers found our current approach to making AI smarter over time has a giant blind spot. AI is not actually understanding or applying high-level abstract lessons at all. Developers spend massive amounts of time building systems that condense past AI mistakes into neat little rules for the future. This paper proves that the AI essentially throws those rules in the trash and only looks at raw historical logs. Modern LLM systems try to get better over time by storing past tasks as either raw step-by-step histories or condensed summary rules. The study tested if these agents actually use their stored memories by secretly swapping the correct tips with random garbage text.
- When the step-by-step histories were messed up, the AI failed hard, proving it heavily relies on copying exact past actions.
- But when researchers completely corrupted the condensed summary rules, the AI kept acting normally and showed zero performance drop.
If an AI cannot apply an abstract lesson to a new situation, it is not truly reasoning or learning. This raises the question of whether the entire AI industry needs to rethink how memory works, because right now these agents are just mimicking instead of understanding. ---- arxiv. org/abs/2601.22436 "LLM Agents Are Not Always Faithful Self-Evolvers"

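Summary: A new study finds a blind spot in current methods for improving AI over time: LLM agents do not actually understand or apply abstract rule summaries and instead rely on directly copying raw step-by-step history logs. In the experiments, replacing the condensed rule summaries with random garbage text caused no performance drop, whereas corrupting the step-by-step execution histories caused clear failures. This suggests the agents are mechanically imitating past steps rather than truly learning from lessons. The paper raises the question of whether AI memory mechanisms need to be redesigned, because current systems imitate rather than understand.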

Chubby♨️@kimmonismus · 6月14日58

This is so cool: OpenRouter launched Fusion: a server-side “panel of models” that sends your prompt to multiple models in parallel. It lets them use web search and bash tools, then has a judge compare their answers and a synthesizer write the final response. Potentially at lower cost than relying on one expensive frontier model. The claim: Fusion beats frontier models on Perplexity’s DRACO deep research benchmark.

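Summary: OpenRouter launched the Fusion API, a server-side compound model that sends the same prompt to multiple models in parallel and lets them call web search and bash tools. A judge model compares the answers, and a synthesizer then writes the final response. OpenRouter claims Fusion beats frontier models on Perplexity's DRACO deep research benchmark at lower cost, reaching Fable-level intelligence at half the price.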
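
The fan-out, judge, synthesize flow described for Fusion can be sketched in a few lines of Python. This is not OpenRouter's API: `fuse`, `judge`, `synthesize`, and the three stub models are hypothetical placeholders, and tool use (web search, bash) is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def fuse(prompt, panel, judge, synthesize):
    """Send one prompt to every panel model in parallel, then judge and synthesize."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        # pool.map preserves the order of the panel in its results.
        answers = list(pool.map(lambda model: model(prompt), panel))
    ranking = judge(prompt, answers)              # indices of answers, best first
    return synthesize(prompt, [answers[i] for i in ranking])

# Stub "models" (assumptions): real ones would be API calls.
def model_a(prompt):
    return "Paris"

def model_b(prompt):
    return "Paris, the capital of France"

def model_c(prompt):
    return "Lyon"

def judge(prompt, answers):
    # Toy judge: prefer longer answers. A real judge would be another model.
    return sorted(range(len(answers)), key=lambda i: len(answers[i]), reverse=True)

def synthesize(prompt, ranked_answers):
    # Toy synthesizer: return the top-ranked answer unchanged.
    return ranked_answers[0]

result = fuse("What is the capital of France?", [model_a, model_b, model_c], judge, synthesize)
print(result)
```

Threads suit this pattern because each panel call is I/O-bound in practice; the judge and synthesizer run only after every answer has returned.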

StepFun@StepFun_ai · 6月14日48

Step 3.7 Flash is now live on @DeepInfra 🚀 Builders and teams can now try our open-source multimodal reasoning model through DeepInfra’s API, with private endpoint deployment available for dedicated workloads. Built for agentic coding, tool use, search, and vision workflows. Thanks to the DeepInfra team!

StepFun@StepFun_ai · 6月14日43

Step 3.7 Flash is now live on @DeepInfra 🚀 Developers can now try our open-source multimodal reasoning model through DeepInfra’s API, with private endpoint deployment available for dedicated workloads. Built for agentic coding, tool use, search, and vision workflows. Thanks to the DeepInfra team!

Yuchen Jin@Yuchenj_UW · 6月14日48

One hypothesis: If non-citizens at Anthropic can’t work on Mythos/Fable, and LLM jailbreaks remain unsolved, US frontier labs will be forced to slow down training and model releases. Could Chinese open-source AI surpass US closed models for the first time in ~6 months?

🚨 AI News | TestingCatalog@testingcatalog · 6月14日56

OpenRouter announced Fusion, a new mode in which multiple AI models can run side by side to "fuse" into a better result. It arrives with significant improvements across various tasks, comparable to Fable 5 in certain areas. > By testing different combinations of models, we found that roughly three-quarters of the lift that Fusion provides comes from synthesis, and one-quarter from diversity. Testing time 👀