Add more wins for GLM. The model has some brittle characteristics, and is getting crushed by closed models here, but we should expect open models to be more jagged, and you use multiple of them depending on the task. Congrats again to @Zai_org and am excited for the next one

Ethan Mollick@emollick · June 25 · 57

Gemini 3 Pro was the first model to achieve at least 23% on ARC-AGI-2, which it did in November, 2025 (it actually scored 31%). So the 8-12 month gap between closed and open weights models still seems to hold. But they are also more jagged, better at some tasks, worse at others.

ChatGPT@ChatGPTapp · June 25 · 65

The new GPT-5.5 Instant is very smart, very intuitive, and very fun to chat with. Rolling out now to everyone, starting with Pro and then Plus users. Free users should have the new GPT-5.5 Instant model by tomorrow.

OpenAI@OpenAI · June 25 · 67

We have a new version of GPT-5.5 Instant for you, and it's much more fun to talk to. Our most-used model is now better at understanding the intent behind a question and adapting its response accordingly. It also handles complex constraints more reliably and makes shopping and local recommendations more useful and cohesive. Rolling out today to paid users, tomorrow to free users.

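宝玉@dotey · June 25 · 61

It looks like Fable 5 is about to come back, and it will be permanently included in subscriptions. It is unclear, though, whether stricter identity verification will be required to use it.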

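Berryxia.AI@berryxia · June 25 · 63

Stop just hyping OpenAI's chip... OpenAI today officially announced its first in-house AI chip, "Jalapeño", and everyone online is cheering that "the era of vertical integration has arrived". But nobody is telling the real story: this is not a victory declaration, it is a reluctant self-rescue after being backed into a corner by inference costs. The cost of inference (running models to answer users) is eating into OpenAI's profits at an explosive rate and even threatens its survival. Background: ChatGPT handles a massive volume of user queries every day, and NVIDIA GPUs are both expensive and scarce. In October 2025, OpenAI and Broadcom announced a partnership to develop custom AI accelerators, targeting 10 gigawatts. Now Jalapeño is here: designed from scratch by OpenAI, with Broadcom handling production. Consequences: if gigawatt-scale deployment happens by the end of 2026, inference costs could fall by about 50% (in the Broadcom CEO's words), with performance per watt far better than today's top accelerators. That would make ChatGPT, the API, and future agent products faster and cheaper. OpenAI would go from a "model company" to a "full-stack AI infrastructure company", serving more people, but it also means big companies gain deeper control over the underlying compute. Details most people overlook (these are the truly striking points): ✅ The development speed is absurd: only 9 months from initial design to manufacturing tape-out, and the design was assisted by OpenAI's own AI models (AI helping design the hardware that accelerates itself, which is about as meta as it gets). ✅ The chip targets inference only, not training. Training will most likely keep relying on NVIDIA. ✅ First samples are in hand and under test. Early data: performance per watt "significantly better than the current state of the art". ✅ Broadcom's CEO said outright that performance can rival NVIDIA Blackwell and Google TPU at half the cost. ✅ It is not a lone chip but the first step in OpenAI's multi-generation compute platform, and it comes with Broadcom's networking technology. ✅ The name "Jalapeño" is spicy enough to suit this increasingly "spicy" AI era. The chip quietly announces that AI has started using itself to speed up the construction of its own infrastructure, and humanity's appetite for compute will only grow. What do you think: a smart self-rescue by OpenAI, or another wild escalation of the AI arms race?

Summary: OpenAI released its first in-house AI chip, "Jalapeño", designed specifically for LLM inference and produced in partnership with Broadcom. It went from design to tape-out in just 9 months, with design assistance from OpenAI's own AI models. First samples are in hand; performance per watt is significantly better than today's top accelerators, and Broadcom's CEO says performance rivals NVIDIA Blackwell and Google TPU at roughly half the cost. The goal is gigawatt-scale deployment by the end of 2026, with inference costs expected to drop by about 50%. The chip will power ChatGPT, Codex, the API, and future agent products, marking OpenAI's shift from a model company to a full-stack AI infrastructure company.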

Greg Brockman@gdb · June 25 · 64

Introducing Jalapeño — designed from scratch for LLM inference over nine months, accelerated by our models. Perf per watt looking incredible.

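Summary: OpenAI, via Greg Brockman, introduced its first AI chip, Jalapeño, designed from scratch for LLM inference over nine months. The chip has been brought to production with Broadcom and will power ChatGPT, Codex, the API, and future agentic products. Its design was accelerated by OpenAI's own models, and performance per watt is described as "incredible". The move extends OpenAI's full-stack platform from products to models to infrastructure, with the aim of scaling intelligence and expanding access to AI.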

OpenRouter@OpenRouter · June 25 · 54

Fugu Ultra by @SakanaAILabs is live on OpenRouter! Excited to see more multi-model systems pushing the frontier.

François Chollet@fchollet · June 24 · 28

The best way to understand a complex system is via edge cases and failure modes, because they define the contour of the system.

Rohan Paul@rohanpaul_ai · June 24 · 65

OpenAI rolls out its first chip through a Broadcom tie-up as part of its “build the full stack” push. Jalapeño is an ASIC, so it is less flexible than an Nvidia GPU, but it can be cheaper and faster when the workload is known very well. They say "the architecture reduces data movement and balances compute, memory, and networking resources to achieve realized utilization much closer to theoretical peak performance." The overall result is better performance per watt. Jalapeño also signals OpenAI’s shift from buying compute to shaping the whole stack: models, software, servers, networks, and now silicon. The 9-month tape-out means OpenAI and Broadcom finalized the chip design and moved it to manufacturing unusually fast for advanced AI silicon. OpenAI says its own models helped speed up parts of the design work.

Summary: OpenAI and Broadcom launched Jalapeño, OpenAI's first in-house AI chip (an ASIC), designed for the LLM workloads behind ChatGPT, Codex, the API, and future AI agent products. When the workload is well known, Jalapeño can be cheaper and faster than an NVIDIA GPU; by reducing data movement and balancing compute, memory, and networking resources it achieves realized utilization closer to theoretical peak, with better energy efficiency. The chip went from design to tape-out in just 9 months, and OpenAI's own models sped up parts of the design work. This marks OpenAI's strategic shift from buying compute to building the whole stack (models, software, servers, networks, silicon).

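AYi@AYi_AInotes · June 24 · 61

Everyone assumed OpenAI's moat was its large AI models; today they showed that the real deciding factor is in the silicon. Working with Broadcom, they went from design to tape-out in nine months on Jalapeño, their first in-house AI chip, aimed specifically at large-model inference. It does no training; it only handles the response computation when users chat, with performance per watt clearly better than the current state of the art. A few counterintuitive conclusions to share. First: why start with inference instead of the cooler training chip? Training is a one-off burn that ends once it is done, whereas inference is consumed continuously by hundreds of millions of users every day, which makes it the real bulk of the cost. Cut inference cost by 30% to 50% and, at scale, that is an astronomical amount of profit, so this is actually the most pragmatic business choice. Second: what does a nine-month tape-out mean? For traditional high-performance chips, two to three years is the normal design cycle. They used large models to help design the chip that runs large models. Once this loop of AI building AI hardware works, the iteration speed of the entire semiconductor industry will be rewritten. The most fundamental strategic intent is that OpenAI no longer wants to be NVIDIA's biggest customer. It wants to go full stack and control everything from silicon to models to products; put plainly, whoever controls the underlying compute controls pricing power and profit margins. Model weights matter, and control over compute is just as decisive. It used to be that humans built hardware and hardware ran AI; from now on, AI helps humans build better hardware, and that hardware runs stronger AI. This self-reinforcing loop is the real prelude to the singularity.

Summary: OpenAI and Broadcom took Jalapeño, OpenAI's first in-house AI chip, from design to tape-out in nine months. The chip is built for LLM inference and will serve ChatGPT, Codex, the API, and future agent products, with performance per watt better than the current state of the art. Inference, the largest ongoing daily cost, could become 30%–50% cheaper. Traditional chip design cycles run 2–3 years; Jalapeño used AI-assisted design to close the "AI builds AI hardware" loop. OpenAI intends to go full stack, reduce its dependence on NVIDIA, and gain pricing power over the underlying compute.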

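Chubby♨️@kimmonismus · June 24 · 55

Absolutely insane: "Jalapeño was co-developed from initial design to manufacturing tape-out in just nine months, and the custom AI accelerator program represents what we believe to be the fastest ASIC development cycle ever achieved in high-performance advanced semiconductors." ChatGPT helped design the chip so they could reach a 9-month development cycle. "If AI can help engineers design better chips faster, it can lower the cost of compute across the industry and help democratize access to advanced AI."

Summary: OpenAI launched Jalapeño, its first in-house AI chip, designed from scratch for LLM inference. It took only 9 months from initial design to tape-out, with ChatGPT taking part in the chip design, in what is described as the fastest ASIC development cycle in high-performance advanced semiconductors. The chip is built with Broadcom and Celestica and optimized for the real workloads of ChatGPT, Codex, the API, and future agent products. Early samples have reached target frequency and power in the lab and have run ML workloads including GPT-5.3-Codex-Spark; performance per watt is significantly better than the current SOTA, with detailed benchmarks to follow. Deployment is planned to start by the end of 2026, with the strategic aim of reducing dependence on external GPUs and tightening control over compute economics.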

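meng shao@shao__meng · June 24 · 66

OpenAI releases Jalapeño, its first in-house inference chip. Together with Broadcom (and Celestica), OpenAI designed from scratch an accelerator optimized for LLM inference, Jalapeño. It taped out in 9 months, claims energy efficiency significantly better than the current SOTA, and is planned for gigawatt-scale deployment starting at the end of 2026. This is a landmark step in OpenAI extending "full stack" down to the chip layer. Why is OpenAI building chips? The official argument uses "full-stack advantage" and a flywheel model: better infrastructure → higher compute efficiency → better training and inference → stronger models → better products → more usage and revenue → reinvestment in next-generation infrastructure. The logic treats the chip as the lowest lever of the flywheel: only by owning the chip architecture can kernels, memory, networking, scheduling, and product experience be co-optimized toward the same goal. This is the same vertical-integration path taken by Google (TPU), Amazon (Trainium/Inferentia), and Meta (MTIA); frontier AI companies building their own inference chips has become an industry consensus. For OpenAI there is also a direct commercial payoff: inference is where AI reaches users. Every improvement in cost, speed, or reliability translates directly into faster ChatGPT answers, Codex tasks that can go a few steps further, a cheaper API, and steadier access at peak times.

Summary: OpenAI, together with Broadcom and Celestica, designed from scratch its first in-house inference chip, Jalapeño. It taped out in 9 months, is optimized for LLM inference, and has better energy efficiency than the current SOTA. Gigawatt-scale deployment is planned from the end of 2026 for ChatGPT, Codex, the API, and future agent products. OpenAI calls this a key part of its "full-stack advantage", using in-house chips to drive a flywheel: better infrastructure → higher compute efficiency → better training and inference → stronger models → better products → more usage and revenue → reinvestment. Inference chips directly improve cost, speed, and reliability at the point where AI reaches users.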

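🚨 AI News | TestingCatalog@testingcatalog · June 24 · 58

OpenAI 🤝 Broadcom OpenAI announced its first AI chip, designed and produced in a partnership with Broadcom. > New SOTA in performance per watt. > OpenAI models were used to accelerate its development. > Will be deployed at gigawatt scale over multiple generations. OpenAI is full stack now 👀

Summary: OpenAI and Broadcom launched Jalapeño, OpenAI's first AI chip, designed for the LLM workloads behind ChatGPT, Codex, the API, and future agent products. The chip sets a new SOTA in energy efficiency, its development was accelerated with OpenAI models, and it is planned for gigawatt-scale deployment over multiple generations. The move extends OpenAI's stack from products to models to infrastructure.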

Chubby♨️@kimmonismus · June 24 · 60

OpenAI just unveiled Jalapeño, its first custom AI chip designed from scratch for LLM inference. It is OpenAI moving deeper into the full stack: chips, kernels, memory, networking, racks, scheduling, deployment and product experience. OpenAI has learned from the Cerebras deal what is valuable in specialized inference hardware and is now attempting to translate that lesson into its own controllable platform. Built with Broadcom and Celestica, Jalapeño is optimized around the workloads OpenAI actually runs across ChatGPT, Codex, the API and future agentic products. Early samples are already running ML workloads in the lab at target frequency and power, including GPT-5.3-Codex-Spark. OpenAI says performance per watt should be substantially better than current state of the art, with detailed benchmarks coming later! The strategic angle is obvious: less dependence on external GPUs, more control over compute economics, and a stronger flywheel between models, products, revenue and infrastructure. Deployment is planned to start by the end of 2026.

Summary: OpenAI launched Jalapeño, its first in-house AI chip, built with Broadcom and Celestica and optimized for the workloads of ChatGPT, Codex, the API, and future agentic products. Early samples are already running ML workloads in the lab at target frequency and power, including GPT-5.3-Codex-Spark. OpenAI says performance per watt is substantially better than the current state of the art, with detailed benchmarks to come. Deployment is planned to start by the end of 2026. The move aims to reduce dependence on external GPUs, increase control over compute economics, and strengthen the flywheel between models, products, revenue, and infrastructure.

OpenAI@OpenAI · June 24 · 63

We’ve designed and built our first AI chip: Jalapeño. Designed from the ground up by OpenAI and brought to production with @Broadcom, Jalapeño is purpose-built for the LLM workloads powering ChatGPT, Codex, the API, and future agentic products. Chips are foundational to the AI economy. Building our own expands our full-stack platform from products to models to infrastructure, and will help us scale intelligence, serve more people, and expand access to AI.

OpenBMB@OpenBMB · June 24 · 36

LLMs don't just hallucinate because they lack knowledge—they hallucinate because they don't know what they don't know. Existing knowledge augmentation blindly injects more data, treating every error as a knowledge gap. But overconfident wrong answers and uncertain correct ones reveal a deeper problem: cognitive misalignment. 🤔 Today, we dive into Know More, Know Clearer—a meta-cognitive framework by @TsinghuaNLP (OpenBMB member) alongside researchers from Harbin Institute of Technology and Northeastern University. The team proposes a unified system that diagnoses a model's cognitive state and applies targeted intervention—not indiscriminate knowledge stuffing. 📄 arXiv: https://arxiv.org/abs/2602.12996 🤗 Paper: https://huggingface.co/papers/2602.12996 Why it matters: 1⃣️ The Structural Decay Law: A Universal Foundation: The team discovers that accuracy exhibits a stable exponential decay relative to uncertainty: E[Acc|U] ≈ a·exp(−U) + b. Validated across 6 architectures (Qwen, Llama, Mistral), this proves internal confidence signals structurally encode performance—not random noise—providing a rigorous basis for meta-cognitive optimization. 2⃣️ Know More (CGKE): Differentiated, Not Indiscriminate: Rather than uniform knowledge injection, the framework partitions the knowledge space into Mastered, Confused, and Missing regions via self-sampled behavioral profiling. Each region receives a tailored augmentation strategy—boundary expansion, structural disambiguation, or epistemic foundation—targeting exactly where the model needs it most. Ablation shows removing the "Confused" category causes the largest performance drop. 3⃣️ Know Clearer (CDKC): Aligning Confidence with Correctness: A cognitive consistency alignment mechanism built on GRPO actively recalibrates the model's confidence landscape—sharpening distributions on correct paths, dispersing them on incorrect ones. 
Result: average ECE drops from 60.41 to 24.34, and the model learns to genuinely know its own limits rather than learning to refuse everything. 4⃣️ Results: 24.59-Point Gain and True Self-Knowledge: On 11 QA benchmarks, CDKC (2-round) lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 pts) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 pts). On self-knowledge benchmarks, the framework achieves a CBS of 73.43% and CAE of 68.18%—delivering 63.37% correct answering decisions while maintaining 79.07% boundary recognition, the best balance of any method tested. Knowledge augmentation is not merely about knowing more—it's about knowing more clearly. This framework sets a new standard for reliable, calibrated knowledge in LLMs. #AI #THUNLP #OpenBMB #LLM #KnowledgeAugmentation #Hallucination #MetaCognition #NLP

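Summary: OpenBMB presents Know More, Know Clearer, a meta-cognitive framework from TsinghuaNLP with researchers from Harbin Institute of Technology and Northeastern University, aimed at LLM hallucinations caused by cognitive misalignment. The framework has three parts: the Structural Decay Law (accuracy decays exponentially with uncertainty); Know More (CGKE), which partitions the knowledge space into Mastered, Confused, and Missing regions and augments each in a targeted way; and Know Clearer (CDKC), which aligns confidence using GRPO and lowers average ECE from 60.41 to 24.34. On 11 QA benchmarks, CDKC lifts Llama-3.1-8B from 30.91% to 55.50% (+24.59 points) and Qwen2.5-7B from 25.76% to 48.29% (+22.53 points). On self-knowledge benchmarks it reaches a CBS of 73.43% and a CAE of 68.18%, with 63.37% correct answering decisions and 79.07% boundary recognition, the best balance among the methods tested.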
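
For readers unfamiliar with the ECE figure quoted above, here is a minimal sketch of the standard binned Expected Calibration Error. It is an illustration only, not the paper's evaluation code: the function name, the 10-bin default, and the equal-width binning are assumptions, and the post's values (60.41, 24.34) appear to be on a 0–100 scale whereas this sketch returns a value in [0, 1].

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: equal-width bins over [0, 1]
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0
    # Each bin accumulates [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)
    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue
        avg_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / total) * abs(accuracy - avg_conf)
    return ece
```

A model that answers with 0.9 confidence but is right only a quarter of the time gets a large ECE; a model whose stated confidence matches its hit rate gets an ECE near zero, which is the direction the reported drop points in.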

Rohan Paul@rohanpaul_ai · June 24 · 46

New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next tokens. The problem is that normal transformers can look back at every earlier token, so they do not have to squeeze the past into a clean summary. Token prediction alone can reward shortcuts that do not become coherent world models. That can work beautifully on familiar data and still fail when the model has to plan, detour, reason, or carry a hidden structure forward. NextLat fixes this by adding a training task where the model must predict its next hidden state, not just the next word. A hidden state is the model’s private summary of what it has seen, so predicting the next one pushes the model to learn how situations change over time. The authors tested this on map-like world modeling, math reasoning, graph planning, story prediction, and regular language modeling. The main result is that NextLat often learned more compact and useful internal states, solved planning tasks better, and sped up generation by up to 3.3x. Overall, it gives transformers some of the useful memory behavior of recurrent models without changing the transformer architecture or slowing normal inference. ---- Link – arxiv.org/abs/2511.05963 Title: "Next-Latent Prediction Transformers Learn Compact World Models"

Summary: Microsoft's new paper on Next-Latent Prediction (NextLat) proposes a self-supervised method that adds a next-hidden-state prediction task on top of ordinary token prediction, pushing the Transformer to learn a compact internal world model. The method does better on map-like world modeling, math reasoning, graph planning, and story prediction, speeds up generation by up to 3.3x through self-speculative decoding, and requires no change to the Transformer architecture and no slowdown of normal inference.
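
The idea of training on "next token plus next hidden state" can be sketched as a combined loss. This is a rough illustration under stated assumptions, not the paper's implementation: the function names, the mean-squared-error latent term, and the `latent_weight` parameter are all hypothetical, since the post only says that the model is trained to predict its next hidden state in addition to the next token.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Next-token loss: negative log-probability of the target token."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(predicted) != len(actual):
        raise ValueError("vectors must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def next_latent_style_loss(
    logits: Sequence[float],
    target_token: int,
    predicted_next_hidden: Sequence[float],
    actual_next_hidden: Sequence[float],
    latent_weight: float = 1.0,  # hypothetical trade-off knob
) -> float:
    """Token loss plus an auxiliary penalty on the next-hidden-state guess."""
    token_loss = cross_entropy(logits, target_token)
    # Assumption: the actual next hidden state is treated as a fixed target.
    latent_loss = mse(predicted_next_hidden, actual_next_hidden)
    return token_loss + latent_weight * latent_loss
```

The auxiliary term only affects training; at inference the model can be used as an ordinary next-token predictor, which is consistent with the claim that normal inference is not slowed down.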

Alibaba Cloud@alibaba_cloud · June 24 · 64

Off-Peak Rates are live across all Qoder products. Qwen 3.7 Max: 80% off. Qwen 3.7 Plus: 60% off. 10 hours every day. Automatic. No opt-in needed. If you're in the Americas, here's the twist: off-peak covers most of your workday. 🧵

Rohan Paul@rohanpaul_ai · June 24 · 52

VibeThinker is a 3B-parameter model whose reasoning benchmark results are almost head to head with Opus 4.5, trained with a novel SFT+GRPO recipe. Unusually strong for its size: with only 3B parameters, 94.3 on AIME26, 80.2 Pass@1 on LiveCodeBench v6, and 96.1% acceptance on recent unseen LeetCode contests. "places it in the performance band of first-tier reasoning systems, matching or exceeding flagship models that are orders of magnitude larger, such as DeepSeek V3.2" They start from a 3B Qwen2.5-Coder base model, then train it with carefully filtered hard examples, multi-solution supervised training, reinforcement learning on math/code/STEM tasks with verifiable rewards, self-distillation, instruction-focused RL, and a test-time answer-checking method called CLR.

Summary: VibeThinker is a reasoning model with only 3B parameters, trained with SFT+GRPO, that is nearly level with Opus 4.5 on reasoning benchmarks. It scores 94.3 on AIME26 and 80.2 Pass@1 on LiveCodeBench v6, and reaches a 96.1% acceptance rate on recent unseen LeetCode contests, matching or exceeding flagship systems that are orders of magnitude larger, such as DeepSeek V3.2. The model is based on Qwen2.5-Coder 3B and is trained with hard-example filtering, multi-solution supervised training, reinforcement learning with verifiable rewards on math/code/STEM, self-distillation, instruction-focused RL, and the test-time answer-checking method CLR.

Rohan Paul@rohanpaul_ai · June 24 · 49

This paper argues that intelligence is the ability to make rare but valid futures more likely. So an intelligent system is said to be “thermodynamically intelligent” when it uses information and control to make a rare but valid outcome much more likely. Most existing intelligence measures judge task success, but they do not explain what brains, LLMs, controllers, and physical information engines have in common. The paper’s answer is that an intelligent system models the world with itself inside it, then uses that model to choose actions that change what futures become likely. A future counts only if it is rare under normal passive behavior and still valid, so random strange outcomes do not get counted as intelligence. The authors turn this into a measure called rare-valid lift, which asks how much more often a system produces those unlikely but acceptable futures than a passive baseline would. They show that high lift is impossible unless the system can accurately spot the rare valid futures, and high spotting accuracy can nearly produce high lift when the system can act well. The main point is that intelligence becomes a physical probability-shifting process, not just a score on tests or a label for human-like behavior. ---- Link – arxiv.org/abs/2606.20231 Title: "Thermodynamic Measure of Intelligence"

Summary: The paper proposes the concept of "thermodynamic intelligence", defining intelligence as the ability to use information and control to substantially raise the probability of rare but valid outcomes. Existing evaluations only look at task success, whereas the paper identifies what brains, LLMs, controllers, and other intelligent systems share: the system includes itself in its world model and chooses actions based on that model to change the probabilities of futures. A future counts only if it is rare under passive behavior and still valid. The authors propose the "rare-valid lift" measure, which captures how many times more often a system produces such futures than a passive baseline. High lift depends on the system accurately identifying rare valid futures. Core thesis: intelligence is a physical process of shifting probabilities, not a test score or a label for human-like behavior.
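
One simple reading of "how much more often than a passive baseline" is a probability ratio, sketched below. This is an assumption made for illustration: the paper's formal definition of rare-valid lift may differ (for example, it could be expressed in log units), and the function name and arguments are hypothetical.

```python
def rare_valid_lift(p_system: float, p_baseline: float) -> float:
    """Ratio of how often the system vs. a passive baseline reaches rare valid futures.

    p_system:   probability that the acting system produces a rare valid future
    p_baseline: probability of the same futures under passive behavior
    """
    if not 0.0 < p_baseline <= 1.0:
        raise ValueError("p_baseline must be in (0, 1]")
    if not 0.0 <= p_system <= 1.0:
        raise ValueError("p_system must be in [0, 1]")
    return p_system / p_baseline
```

Under this reading, a lift of 1 means the system is no better than doing nothing, and the rarer the target futures are under the baseline, the more room there is for a large lift.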

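meng shao@shao__meng · June 24 · 51

I ran a deep-research test with Apodex. Apodex positions itself as a Self-Evolving Heavy-Duty Solver. It is not aimed at simple Q&A; it focuses on important, complex problems with no ready-made answer, the kind that require decomposition, search, comparison of evidence, and verification of key claims before a conclusion is drawn. The question I chose was: how should an AI agent company pick its product direction, and which is most worth building: developer tools, enterprise workflows, or research assistants? This is harder than simply asking "what's new with some technology", because there is no standard answer. You have to weigh market demand, willingness to pay, the competitive landscape, technical barriers, sales cycles, the fundraising narrative, short-term execution difficulty, and long-term upside all at once. I ran it once with the mid-tier Deep Reasoning mode and also tried Deep Discovery. The latter better shows Apodex's core capability: it splits the question into several research threads, investigates developer tools, enterprise workflows, and research assistants separately, and then adds the VC perspective, enterprise adoption rates, market size, customer churn risk, and concrete startup opportunities. Interestingly, it did not jump to a conclusion after the first round of searching. It first built an overview, found the evidence insufficient, and went on to look up TAM, rankings of startup directions, and sources such as Menlo Ventures, SaaStr, BCG, and enterprise AI reports. Throughout, you can see it repeatedly checking which judgments are backed by data and which merely sound plausible. Its final ranking was: 1. vertical enterprise workflow agents; 2. vertical research assistants; 3. developer tools. It argues that for most AI agent startups in 2026, the most worthwhile thing to build is a "vertical enterprise workflow agent". The reasoning is that such products find a clear buyer more easily and prove their value more easily: insurance claims, medical billing, logistics exception handling, compliance monitoring, procurement, and inventory management, for example. These scenarios already carry labor and outsourcing costs, so if an agent saves time, lowers error rates, or increases revenue, customers are more willing to pay. Developer tools are of course one of AI's most mature applications, but also the most competitive. Codex, Cursor, Claude Code, and Devin already own user mindshare. A new company that is still just building a general coding assistant will struggle to differentiate, unless the team has a strong developer-tools background and can target a narrower niche such as compliant code, security review, CI/CD automation, or enterprise code governance. There is an opportunity in research assistants too, but only if they go vertical. A general research assistant is easily covered by large models and browser extensions. The more valuable targets are high-value settings such as law, finance, drug research, regulation, and investment research, because they need cited sources, audit trails, and human sign-off. In other words, a good research assistant often ends up becoming a "research-oriented enterprise workflow agent". This test gave me a clearer sense of how Apodex differs from an ordinary chatbot: its emphasis is on verifying first and concluding afterwards. For questions with many variables, scattered information, and real trade-offs, a transparent process and evidence checking matter more than the answer itself. So I think Apodex is better suited to questions like these: · Is a given startup direction worth pursuing? · Is now the right time to enter a given industry? · Is there a real business opportunity behind a technology trend? · What is the counter-evidence to an investment thesis? · On a complex issue, which conclusions can be trusted? Questions like these are hard to settle with a single search or a single conversation; they need a system that gathers the material, breaks it down for comparison, and verifies it repeatedly. That is what Apodex is trying to do. Try it at: http://www.apodex.ai Developers can download the models on Hugging Face: http://huggingface.co/apodex If you are interested, you can also join the Discord.

Summary: The author used Apodex, a self-evolving heavy-duty solver, to test the question "how should an AI agent company pick its product direction". In Deep Discovery mode, Apodex split the question into three threads (developer tools, enterprise workflows, research assistants), added sources on the VC perspective, market size, and more, and after continued verification produced a ranking: 1. vertical enterprise workflow agents (clear buyer and a cost-replacement logic); 2. vertical research assistants (must target high-value settings such as law and finance); 3. developer tools (the field is already held by Codex, Cursor, Claude Code, and others). Apodex emphasizes verifying before concluding and suits complex questions with many variables and trade-offs. Try it at apodex.ai; the models can be downloaded on Hugging Face.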

Rohan Paul@rohanpaul_ai · June 24 · 66

AI video is moving into its real-time reaction era, with MaineCoon now leading in low-latency AI video. @catnips_ai just introduced MaineCoon, a 22B real-time text-to-audio-video model built for live AI characters, not offline video generation i.e. to make AI video feel live by generating synced speech and visuals in real time. A record-breaking frame rate of up to 47.5 FPS on a single H100 GPU. Audio-visual generation cost drops significantly below $0.001 per second and continues to fall. It positions the paradigm of social world models for social-interactive purposes. MaineCoon serves as the first generative core toward this paradigm and provides a technical foundation for next-generation AI-native social platforms. It proposes a multi-stage forcing-free streaming training paradigm that includes self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). These components enable 22B-scale native and efficient streaming audio-visual training. It designs an agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift through agentic cache management, chunk commitment, long-context rollout, and prompt planning. The big deal is long-duration streaming at low cost. Text goes in, the first frame appears in under 1s, and the model keeps producing synced video and audio while playback is already happening. So it is not making a full video first, then dubbing it later. It generates forward in small chunks, and each chunk continues from the last one. That is hard because tiny chunks usually break consistency. Faces drift. Voices change. Motion gets weird. Audio and mouth movement separate. MaineCoon tries to solve this with a dual-stream Diffusion Transformer: one stream for video, one stream for audio, and cross-stream attention between them so expression, lip motion, voice, timing, and body movement stay tied together. 
It also uses a history key-value cache and an attention sink. In plain words, the model keeps useful memory from previous chunks, so the next chunk does not feel like a new disconnected clip. The speed claim is also big: up to 47.5 fps on a single H100, and real-time 30 fps on a single RTX Pro 6000 GPU. That is the low-cost part. You do not need a huge multi-GPU serving setup just to get real-time audio-video generation. They also describe an agentic streaming system that can keep generation going for more than 10 minutes while holding identity, voice, scene state, visual quality, and synced audio. If the stream starts drifting, the system repairs future chunks instead of editing already-shown frames. So MaineCoon is best understood as a streaming-native visual reaction layer: fast first frame, continuous audio-video output, long-horizon memory, and low inference cost. 🧵 1/n.

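Summary: MaineCoon is a 22B-parameter real-time text-to-audio-video model built for live AI characters. It reaches up to 47.5 FPS on a single H100 GPU at a cost below $0.001 per second, and real-time 30 FPS on a single RTX Pro 6000. It uses multi-stage forcing-free streaming training (self-resampling, cross-modal representation alignment, domain-aware preference optimization, reinforced online-policy distillation) and an agentic streaming inference framework that supports continuous generation on the scale of a thousand seconds. A dual-stream Diffusion Transformer (video and audio streams with cross-stream attention) keeps expression, lip motion, and voice in sync, while a history KV cache and an attention sink keep chunks coherent. The first frame appears in under 1 second and generation runs alongside playback, rather than producing a full video first and dubbing it afterwards.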

François Chollet@fchollet · June 24 · 43

AI in 2040 will not be built on the stack we are using today. It will be much closer to optimal. The current stack has 3-4 orders of magnitude of data inefficiency and 4-5 orders of magnitude of compute inefficiency. Near-optimal AI is what symbolic learning will deliver.

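AYi@AYi_AInotes · June 23 · 64

How can you extend your lifespan as much as possible?

Summary: Japan's Fugu has only 0.6B parameters and is essentially an AI project manager: it automatically splits tasks, picks players from a pool of top models, assigns three roles (thinking, execution, verification), and synthesizes an answer through multiple rounds of collaboration. Calling its API is no different from calling an ordinary model, and the orchestration strategy is learned through training. It beats Claude and GPT on benchmarks and sidesteps the scaling-law arms race. Drawbacks include being a black box, high latency on complex tasks, and higher cost on simple questions. The signal is that multi-agent orchestration has moved from lab toy to usable productivity tool, opening a new race at the orchestration layer.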

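AYi@AYi_AInotes · June 23 · 38

I recently fed the much-mythologized "White-Haired Stock God" narrative to an AI and had it picked apart claim by claim, and the result was rather surprising. Millions of views in three months, hundreds of posts, and the whole logic flows beautifully: NVIDIA is driving explosive CPO demand, silicon photonics sells the shovels, and SIVE is the purest shovel of all. The comments were full of people copying the trade, and before long they were talking about adding leverage. I was in no hurry to believe it or to bash it. I laid out the whole narrative chain and handed it to an AI that automatically traces sources and checks evidence, verifying each claim against public information. Of five core claims, four did not hold up. For the most critical hard facts, I went back to primary sources myself and checked again, and the conclusion basically stands. This post is not about exposing anyone; what I care about more is something else. Everyone now knows that AI hallucinates and fabricates with a straight face, but the wildly wrong fabrications are not the scary ones, because you see through them at a glance. The real trouble is the other kind: the terminology is right, each individual point has a source, and the tone is utterly assured, like an analyst with twenty years of experience. Follow its conclusion and actually place the order, and your money is gone. Sometimes what is more dangerous than nonsense is something that sounds entirely right. This kind of pseudo-correct narrative is the deadliest trap of the AI era: you lose money and cannot even find anyone to blame, so you just assume you were unlucky.

Summary: The author took the circulating "White-Haired Stock God" investment narrative (NVIDIA's CPO demand drives silicon photonics, and SIVE is the purest play) and gave it to an AI capable of automatic source tracing and evidence checking, cross-verifying each claim against public information. Four of the five core claims lacked support, and the only one that held up was also exaggerated. The author then manually re-checked the hard facts and confirmed the conclusion. The post warns that "pseudo-correct" narratives, with precise terminology, a source for every point, and an assured tone, are more dangerous than obvious nonsense and can lead investors to copy trades blindly.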

Artificial Analysis@ArtificialAnlys · June 23 · 60

Open weights models make up the majority of the cost-performance Pareto frontier on AA-Briefcase, our new agentic knowledge work benchmark Last week we released AA-Briefcase, our proprietary agentic knowledge work benchmark testing models on long horizon tasks built by industry experts. AA-Briefcase requires models to build deliverables such as financial models, board presentations, and design mock-ups in the context of realistic multi week projects. The cost to run a single AA-Briefcase task varies by over 700x in the initial set of models we tested. With the highest performing model, Claude Fable 5, costing over $20 per task, cost efficiency is a key element in model selection for knowledge work. While the two highest performing models on the cost-performance Pareto frontier are proprietary models from @AnthropicAI, most of the remaining frontier is made up of open weights models. Notable cost efficiency trade offs: ➤ At $2.40 per task, GLM 5.2 (max) from @Zai_org scores within 90 Elo points of Claude Opus 4.8 (max) while costing 65% less ➤ At $0.08 per task, DeepSeek V4 Pro (max) from @deepseek_ai scores ~60 Elo points above Gemini 3.5 Flash while costing over 98% less

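Summary: Artificial Analysis released AA-Briefcase, an agentic knowledge work benchmark that evaluates models on long-horizon tasks. Per-task cost varies by more than 700x, and the highest-performing model, Claude Fable 5, costs over $20 per task. On the cost-performance Pareto frontier, apart from Anthropic's two top-scoring models, most of the rest is made up of open weights models. Notable trade-offs: GLM 5.2 (max) costs $2.40 per task and scores within 90 Elo of Claude Opus 4.8 (max) while costing 65% less; DeepSeek V4 Pro (max) costs $0.08 per task and scores about 60 Elo above Gemini 3.5 Flash while costing over 98% less.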
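
To make "cost-performance Pareto frontier" concrete, here is a minimal sketch of how such a frontier can be computed from (cost, score) pairs. It is an illustration, not Artificial Analysis's methodology: the `ModelResult` type and both function names are hypothetical, and no benchmark data is embedded in the code.

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost_per_task: float  # USD per task, lower is better
    score: float          # e.g. Elo, higher is better


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Return the models that are not dominated on cost and score.

    A model is dropped if some model that costs no more scores at least as
    well (on exact ties, the first one encountered is kept).
    """
    # Cheapest first; among equal cost, highest score first.
    ordered = sorted(results, key=lambda r: (r.cost_per_task, -r.score))
    frontier: List[ModelResult] = []
    best_score = float("-inf")
    for result in ordered:
        # Keep a model only if it outscores everything at or below its cost.
        if result.score > best_score:
            frontier.append(result)
            best_score = result.score
    return frontier


def implied_reference_cost(cost: float, fraction_cheaper: float) -> float:
    """Cost of the reference model, given that `cost` is `fraction_cheaper` below it."""
    return cost / (1.0 - fraction_cheaper)
```

As a usage note, the helper lets a reader back out a figure the post leaves implicit: if $2.40 per task is 65% less than Claude Opus 4.8 (max), then `implied_reference_cost(2.40, 0.65)` puts the latter at roughly $6.86 per task.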

Artificial Analysis@ArtificialAnlys · June 23 · 59

GLM-5.2 leads open weights models and sits at #3 overall on GDPval-AA, a real-world agentic work benchmark GLM-5.2 from @Zai_org scores 1524 Elo on GDPval-AA, which measures performance on real-world, economically valuable knowledge work through long-horizon, multi-turn tasks. Key takeaways: ➤ #3 overall, behind only Claude Fable 5 (1783) and Claude Opus 4.8 (1615), and level with GPT-5.5 (xhigh, 1509) ➤ The leading open weights model by a wide margin: the next open model, MiniMax-M3, scores 1408 ➤ Ahead of many proprietary models, including Google's Gemini 3.5 Flash (1357), Qwen 3.7 Max (1289), Muse Spark (1158) ➤ The tasks are agentic. GLM-5.2 averaged ~31 turns per task across 1,999 matches ➤ Consistent with the rest of its launch, GLM-5.2 also leads open weights on the Artificial Analysis Intelligence Index, ranks #3 on the Agentic Index, and #3 on AA-Briefcase

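Summary: Zhipu AI's GLM-5.2 scores 1524 Elo on GDPval-AA, a real-world agentic work benchmark, ranking third behind only Claude Fable 5 and Claude Opus 4.8 and level with GPT-5.5. It is the leading open weights model and is ahead of proprietary models such as Gemini 3.5 Flash and Qwen 3.7 Max. The tasks are agentic, averaging about 31 turns per task. GLM-5.2 also leads open weights models on the Artificial Analysis Intelligence Index and ranks third on both the Agentic Index and AA-Briefcase.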
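
To get a feel for what the Elo gaps above mean, the sketch below converts a rating difference into an expected head-to-head score. It assumes the conventional logistic Elo formula with a 400-point scale; the post does not say which scale GDPval-AA uses, so treat the resulting numbers as indicative only. The function name is hypothetical.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Ratings quoted in the post above.
glm_vs_fable = elo_expected_score(1524, 1783)  # GLM-5.2 vs Claude Fable 5
glm_vs_gpt = elo_expected_score(1524, 1509)    # GLM-5.2 vs GPT-5.5 (xhigh)
```

Under that assumption, GLM-5.2 would be expected to win roughly 18% of matchups against Claude Fable 5 and about 52% against GPT-5.5 (xhigh), which is why a 15-point gap reads as "level" while a 259-point gap is a clear separation.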

Ethan Mollick@emollick · June 23 · 64

I have been trying Sakana Fugu Ultra-high and, first, it is incredibly slow: my typical coding tests (shaders, interactive scenes) take 30 minutes to run. And the results are... fine. It does not match Fable in real use. Its harbor is a good example: https://ai-harbor-town-gallery.netlify.app/#sakura-ultra-high

Summary: Penn professor Ethan Mollick tested the Sakana Fugu Ultra-high model and reports that it is extremely slow, with typical coding tests taking 30 minutes, and that the results are merely "fine", falling short of Sakana's earlier official claim of performance on par with Fable and Mythos. Mollick says Fugu Ultra does not match Fable in real coding use and links an AI harbor town sample as an example.

SemiAnalysis@SemiAnalysis_ · June 23 · 69

CUDA MOAT ALERT 🔥: In less than 70 days, GB200 NVL72 serving costs decreased by 2.5x through software improvements alone for the Kimi architecture, which is the same model architecture as xAI’s popular Cursor Composer 2.5. One of the key software optimizations was rewriting the NVFP4 MoE kernel using CuTe-DSL, which is additive to the existing wide-expert parallelism optimization. This takes advantage of NVL72’s copper backplane, which has 18x higher bandwidth than standard RoCEv2/InfiniBand. Great work by Xin Li, Jun Yang, & the NVIDIA team on decreasing serving costs by 2.5x in less than 70 days! 🔥

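Summary: SemiAnalysis issues a "CUDA moat alert": in less than 70 days, software optimization alone cut serving costs on GB200 NVL72 by 2.5x for the Kimi architecture (the same model architecture as xAI's Cursor Composer 2.5). The key optimization was rewriting the NVFP4 MoE kernel with CuTe-DSL, on top of the existing wide-expert parallelism optimization. It takes advantage of NVL72's copper backplane, which has 18x the bandwidth of standard RoCEv2/InfiniBand. The work was done by Xin Li, Jun Yang, and the NVIDIA team.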

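Berryxia.AI@berryxia · June 23 · 63

Is this one from Japan? Fugu? It flooded my feed many times today! Sakana has released a multi-agent orchestration system that matches the performance of Fable and Mythos, and it does so through a single API call. Link: https://sakana.ai/fugu Their new Sakana Fugu packages an entire multi-agent system so that it looks like an ordinary model. You call a single endpoint, and internally it decides how to decompose the task, picks the most suitable models, recursively calls itself or other agents, verifies the results, and finally synthesizes an answer. The user does not have to think about the orchestration underneath at all. Fugu Ultra stands alongside Fable/Mythos on hard benchmarks in engineering, science, and reasoning, and Sakana puts particular emphasis on one point: because it can dynamically orchestrate models from around the world, it naturally sidesteps the export-control risk of depending on a single vendor. This is no longer just a technical optimization; it treats "collective intelligence" as a practical answer to geopolitical and supply-chain risk. It is, in effect, redefining what a frontier model looks like. People used to think the strongest capability came from the single strongest monolithic model. Sakana is now saying that a truly powerful system should be an "orchestration layer" that intelligently schedules a global pool of models. What users want is not one model but an agent ecosystem that keeps evolving and cannot be cut off overnight. This move turns multi-agent systems from "complex engineering" into an out-of-the-box product.

Summary: Sakana AI released Sakana Fugu, a multi-agent orchestration system that users access through a single model API. Its Fugu Ultra version matches Fable and Mythos on hard benchmarks in engineering, science, and reasoning. Internally, the system decomposes tasks, picks the best models, recursively calls itself or other agents, verifies results, and synthesizes an answer, so users need not care about the underlying orchestration. Its key advantage is dynamically orchestrating models from around the world, which naturally avoids single-vendor export-control risk and turns multi-agent systems from complex engineering into an out-of-the-box product.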

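Berryxia.AI@berryxia · June 23 · 75

Sakana AI is an AI R&D company founded in Tokyo in 2023, whose core positioning is developing "nature-inspired" AI models. It emphasizes collective intelligence and evolutionary methods, aims to build systems that are not constrained by any single large model, and serves Japan's AI sovereignty needs. The three co-founders: • David Ha (CEO): former head of the Google Brain team in Japan, previously head of derivatives trading at Goldman Sachs in Japan, with deep experience working and living in Japan (undergraduate degree from the University of Toronto, PhD from the University of Tokyo). • Llion Jones (CTO): one of the co-authors of the famous Transformer paper ("Attention Is All You Need"), formerly at Google Research. • Ren Ito (Chairman): former Japanese diplomat (Ministry of Foreign Affairs, where he wrote speeches for Shinzo Abe), early employee of the Japanese unicorn Mercari and its CEO for Europe. The company is based entirely in Japan, with its team and operations in Tokyo.

Summary: Sakana AI is an AI company founded in Tokyo in 2023 by David Ha (CEO, formerly of Google Brain), Llion Jones (CTO, co-author of the Transformer paper), and Ren Ito (Chairman, a former Japanese diplomat). Its product Sakana Fugu wraps a multi-agent system into a single API call that automatically decomposes tasks, schedules models from around the world, and verifies results. Fugu Ultra is positioned against Fable/Mythos on engineering, science, and reasoning benchmarks; by dynamically orchestrating multiple models it naturally sidesteps single-vendor export-control risk, and it is seen as turning multi-agent systems from complex engineering into an out-of-the-box product.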

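karminski-牙医@karminski3 · June 22 · 54

Thinking of buying a Mac to run LLMs? This post is here to talk you out of it. The estimate is simple. If you buy a Mac Studio today, even running the 4-bit quantized Qwen3.6-27B with DFlash enabled to use Qwen's built-in speculative decoding, you top out at about 65 token/s, while mainstream LLMs now commonly reach 40 token/s. Suppose you buy a Mac Studio M3 Ultra 96G specifically to run LLMs and convert the device price (32999) into API usage. Taking GLM-5.2 as the example at 28 yuan per million tokens, the price of one Mac Studio buys about 32999/28 = 1178M tokens. To output that many tokens, the Mac Studio running Qwen3.6-27B would have to run continuously for 209 days. In other words, the payback period is at least 200 days of nonstop operation, and only after that is running the model pure gain. And that is without counting electricity, or the fact that you could buy a subscription plan instead of paying API prices directly. Most importantly, this is still for a small model of only 27B. If you actually bought the 512G Mac Studio (108749, and apparently already out of stock) and ran a quantized GLM-5.2, speed would fall to only 17 token/s and the payback period would be about 7 years... With new model versions shipping every 1.5 months, this is definitely not worth it for ordinary personal use. So for most users a coding plan is the better deal, and if, like me, you need to test new models, renting GPUs is also far cheaper than buying. Of course, if you already own a Mac or a GPU, letting it run LLM tasks while idle (for example, while you sleep) is actually worthwhile. #本地大模型 #mac #qwen36 #glm52

Summary: Buying a Mac Studio to run LLMs is poor value. Taking the M3 Ultra 96G (32999 yuan) as the example, running the 4-bit quantized Qwen3.6-27B with speculative decoding gives about 65 token/s. Converted into API calls (GLM-5.2 at 28 yuan per million tokens), the device cost buys about 1178M tokens, which takes 209 days of continuous running to pay back. The 512G version (108749 yuan) runs a quantized GLM-5.2 at only 17 token/s, with a payback of about 7 years. With models updating every 1.5 months, ordinary users are advised to buy a coding plan or rent GPUs. Running models is only worthwhile for those who already own a Mac or GPU and use it while idle.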
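
The payback arithmetic in the post can be reproduced with a few lines. This is a sketch of the author's own back-of-the-envelope method, using only the figures quoted above; the function name is hypothetical, and electricity and subscription-plan pricing are ignored, as the post itself notes.

```python
SECONDS_PER_DAY = 86_400


def breakeven_days(
    device_price: float,                  # same currency as the API price
    api_price_per_million_tokens: float,
    tokens_per_second: float,             # local generation speed
) -> float:
    """Days of nonstop local generation needed to match the device price in API tokens."""
    tokens_bought = device_price / api_price_per_million_tokens * 1_000_000
    seconds_needed = tokens_bought / tokens_per_second
    return seconds_needed / SECONDS_PER_DAY


# Figures from the post: 96G Mac Studio at 65 token/s, 512G at 17 token/s.
small_days = breakeven_days(32999, 28, 65)
large_years = breakeven_days(108749, 28, 17) / 365
```

With these inputs the first case comes out a little under 210 days and the second at a little over 7 years, in line with the "209 days" and "about 7 years" quoted in the post.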

郭明錤｜Ming-Chi Kuo@mingchikuo · June 22 · 39

Google and MediaTek Deepen TPU v9 Collaboration with Upgraded Triggerfish, Targeting AI Agents, Reinforcement Learning, and Effective Compute Maximization 1. My latest industry checks indicate that Google is developing an upgraded v9 chip, likely codenamed Triggerfish, based on TPU v9 / Humufish, with MediaTek exclusively securing this new, higher-priced order. 2. This upgraded chip is a Humufish-based follow-on program, positioned as a v9 variant with stronger inference capabilities that can help mitigate both the CPU wall and the memory wall. The project also further confirms MediaTek as Google’s preferred development partner for the TPU v9 generation. 3. The key differences between this v9 variant and Humufish are: SRAM capacity is significantly increased to 2–3 times that of Humufish, a new simulation die is added, and memory is upgraded to HBM4E, versus HBM4 on Humufish. 4. Beyond local TPU management and training / inference mode switching, the newly added simulation die's likely role centers on reinforcement learning (RL) and AI-agent coordination. 5. The larger SRAM keeps more of the active working set required by RL and AI agents local to the TPU, reducing data-movement costs and improving efficiency in the ultra-low-latency decode stage. 6. With Humufish lifetime shipments still estimated at 4–5 million units, Google is adding an incremental Triggerfish order of 1–2 million units, with production expected to begin in late 2027 and ramp in volume in 2028. As Triggerfish carries a unit price roughly 30% higher than Humufish, it could become an incremental driver of MediaTek’s 2028 business momentum.

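Summary: Ming-Chi Kuo reports that Google is developing Triggerfish, an upgraded chip based on TPU v9 / Humufish, with MediaTek exclusively securing the order. Compared with Humufish, Triggerfish raises SRAM capacity to 2–3 times, adds a simulation die (for reinforcement learning and AI-agent coordination), and upgrades memory to HBM4E (HBM4 on Humufish), strengthening inference capability to ease the CPU wall and the memory wall. Humufish lifetime shipments are estimated at about 4–5 million units, with an additional Triggerfish order of 1–2 million units; production is expected to begin in late 2027 and ramp in 2028. The unit price is about 30% higher, which could lift MediaTek's 2028 results.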
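
The order figures above imply a rough size for the add-on program relative to Humufish. The sketch below works this out under loudly stated assumptions that are not in Kuo's note: revenue is taken as units times unit price, the roughly 30% uplift is applied to every Triggerfish unit, and the function name is hypothetical.

```python
def incremental_revenue_ratio(
    new_units: float,     # additional Triggerfish units
    base_units: float,    # Humufish lifetime units
    price_uplift: float,  # 0.30 means a unit price about 30% higher
) -> float:
    """Add-on order revenue as a fraction of base-program revenue (at base unit price)."""
    return new_units * (1.0 + price_uplift) / base_units


# Smallest add-on against the largest base, and the reverse.
low = incremental_revenue_ratio(1_000_000, 5_000_000, 0.30)
high = incremental_revenue_ratio(2_000_000, 4_000_000, 0.30)
```

On these assumptions the Triggerfish order would be worth somewhere between about a quarter and about two thirds of Humufish lifetime revenue, which gives a sense of why it is described as an incremental driver.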

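Rohan Paul@rohanpaul_ai · June 22 · 50

Can LLM agents actually discover hidden rules by interacting? The answer is uncomfortable. The more complicated the hidden world gets, the faster AI agents fall behind. LLMs often cannot turn growing evidence into a stable internal model. Current LLM agents can sometimes discover hidden structure through interaction, but they are still weak at planning questions, using memory, and turning feedback into a reliable world model. ---- Link – arxiv.org/abs/2606.16576 Title: "Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning"

Summary: Citing a new paper, Rohan Paul notes that although LLM agents can sometimes discover hidden structure through interaction, their ability to infer world models has fundamental limits: as the hidden world grows more complex, agent performance falls behind quickly and accumulated feedback is not turned into a stable internal model, with particular weakness in planning questions, using memory, and integrating feedback. The conclusion is that in complex environments, the speed at which LLM agents build a reliable mental model does not keep up with the growth in difficulty.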

Rohan Paul@rohanpaul_ai · June 22 · 36

"Can AI ever be Newton? Can AI ever be Einstein? Can AI ever be Picasso?" Dr. Fei-Fei Li ( @drfeifei ) gives a very simple explanation of how today's AI still has a long way to go. --- From 'FII Institute' YT channel (full link in comment).

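Orange AI@oran_ge · June 22 · 22

I'm testing a model for a brand-new system. It is so imaginative that I'm a little excited. Humanity has taken another step forward.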

Chubby♨️@kimmonismus · June 22 · 50

friends, i got the feeling that coming week will be super duper exciting.

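Summary: A more powerful version of Anthropic's Mythos model has finished training. Mythos launched on April 7 through Project Glasswing, and a new iteration is arriving just two months later. Three questions remain open: whether the new version will again be released through Project Glasswing; how much it improves on Mythos‑1; and whether access will come via Fable 5.1 (or whatever it ends up being named). The information comes from Andrew Curran, described as a reliable source.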