AIHOT
All updates · X · 969 posts
Tag: "推理" (reasoning/inference)
ClaudeDevs @ClaudeDevs · June 11 · 66

New for Apple developers: Foundation Models support for Claude lets developers use Apple's Foundation Models framework to call Claude for multi-step reasoning, code generation, and longer context.

Rohan Paul @rohanpaul_ai · June 11 · 64

Apodex-1.0-H just dropped a heavy-duty agent team for deep research. It claims SOTA results by splitting web research across many agents and auditing every evidence chain before writing the answer. It treats deep research as a distributed systems problem for AI agents. Apodex uses an async agent team: an orchestrator assigns sub-agents separate contexts and tools, then fact-checker, conflict-reviewer, and draft-reviewer agents test weak claims. The real big deal is that Apodex is showing a possible "inference-time scaling" path for AI research, where better answers come not from one bigger model, but from many coordinated search agents, persistent traces, and a separate verification layer that audits the evidence before the final response is allowed to exist.

Summary: Apodex-1.0-H ships an asynchronous agent team for deep research. An orchestrator assigns sub-agents to separate contexts and tools, and fact-checker, conflict-reviewer, and draft-reviewer agents then test weak claims. The approach treats deep research as a distributed systems problem and points to an inference-time scaling path: answer quality improves through many coordinated search agents, persistent traces, and a separate verification layer rather than a single larger model. It claims SOTA results.
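
The orchestrator, sub-agent, and reviewer pattern described in this post can be sketched roughly as follows. This is a minimal illustration, not Apodex's implementation: every name here (`sub_agent`, `fact_check`, `orchestrate`) is hypothetical, the sub-agents return canned data instead of searching the web, and the "verification layer" is reduced to a single toy rule.

```python
import asyncio


async def sub_agent(name: str, question: str) -> dict:
    """Hypothetical sub-agent: stands in for an isolated context doing web research."""
    await asyncio.sleep(0)  # placeholder for real search/tool calls
    # agent-2 deliberately returns an unsupported claim so the audit has work to do.
    sources = [] if name == "agent-2" else [f"https://example.com/{name}"]
    return {"agent": name, "claim": f"{name} finding on {question}", "sources": sources}


def fact_check(finding: dict) -> bool:
    """Toy verification rule: a claim survives only if it cites at least one source."""
    return len(finding["sources"]) > 0


async def orchestrate(question: str, n_agents: int = 3) -> list[dict]:
    # Fan out: sub-agents run concurrently, each with its own inputs.
    findings = await asyncio.gather(
        *(sub_agent(f"agent-{i}", question) for i in range(n_agents))
    )
    # Audit: drop unsupported claims before anything reaches the final draft.
    return [f for f in findings if fact_check(f)]


if __name__ == "__main__":
    for finding in asyncio.run(orchestrate("example question")):
        print(finding["agent"], finding["sources"])
```

The point of the structure is that the audit step sits between the parallel research and the final answer, so an unsupported claim never reaches the draft.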

🚨 AI News | TestingCatalog @testingcatalog · June 11 · 62

Inworld has cut prices across its Realtime Inference and Speech-to-Text services, repricing the open models it serves so consumer voice apps can run at scale more cheaply and for longer. The lineup covers Realtime Inference, Speech-to-Text with voice profiling, and a Realtime API that now runs Gemma 4, DeepSeek, and MiniMax at around half the public rate behind a single OpenAI-compatible endpoint.

Summary: Inworld has sharply cut API prices for real-time inference, speech-to-text (STT) with voice profiling, and TTS services, with open models such as Gemma 4, DeepSeek, and MiniMax ...

fofr @fofrAI · June 11 · 69

DiffusionGemma, where the LLM picks words all at once. Which is 4x faster. You can get started with the weights and instructions here: https://huggingface.co/google/diffusiongemma-26B-A4B-it

elvis @omarsar0 · June 11 · 71

This is awesome! I am spending a lot of time on diffusion LLMs these days, so this is perfect timing. I feel like there are so many underexplored research questions around text diffusion. Weights available on HF.

Sundar Pichai @sundarpichai · June 11 · 75

DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇 achieving up to 4x faster inference by generating entire blocks of text simultaneously vs predicting token-by-token (word-by-word) output!

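Orange AI @oran_ge · June 10 · 32

Talking with Claude Fable 5 really does feel like talking to someone with a very high IQ. Its thinking is comprehensive, maybe even a bit too comprehensive. After cache hits, one turn costs about 10 cents, which seems worth the price.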

Ethan Mollick @emollick · June 10 · 27

Science fiction authors in the order you want them to be right about AI: Iain Banks, Becky Chambers, Martha Wells, Douglas Adams, Charles Stross (Singularity Sky), Peter Watts, Charles Stross (Laundry), Harlan Ellison.

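歸藏 (guizang.ai) @op7418 · June 10 · 49

I tried it: Fable 5 is very strong at vulnerability analysis and bug hunting. For writing code, though, I feel it is no silver bullet. The code it writes can have obvious bugs and needs several rounds of fixes to get working. So in this area I think it may be a rather lopsided model. In some respects it is far better than 4.8; in others it is also better than 4.8, but only by a limited margin.

Summary: A user tested Fable 5 on the 260,000-line CodePilot codebase and found it excellent at vulnerability analysis and bug hunting, surfacing a large number of issues. For code generation, however, Fable 5 is not a cure-all: its code often has obvious bugs and needs several rounds of fixes, making it a lopsided model. Compared with the previous version, 4.8, Fable 5 is a huge improvement in some respects and only a modest one in others.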

SemiAnalysis @SemiAnalysis_ · June 10 · 58

Local LLMs are the Great Leap Forward for Inference. Every laptop is its own datacenter, sovereignty over your own tokens, and the people can seize the means of token generation. And that's why it's destined for poor results. (1/4)🧵

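歸藏 (guizang.ai) @op7418 · June 10 · 51

What a loss: the limits reset at 6 a.m., so I used less Fable 5 than I could have.

Summary: User @alexalbert__ announced a reset of usage limits across all products and gave four tips for people just starting to test Fable 5: (1) give Fable bigger, more ambitious tasks than you gave previous models; (2) default to xhigh/high effort for best performance, switching to med for interactive sessions; (3) rewrite your skills and CLAUDE.mds so that instructions written for older models do not constrain Fable's own judgment; (4) move from giving tasks to giving goals, describing the completion criteria and how to verify them, and use /loop and /goal to let Fable plan its own path. The author of the main post laments having used less Fable 5 than possible before the 6 a.m. reset.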

Deedy @deedydas · June 10 · 69

Claude Fable 5 is by far the most ridiculous model that makes me genuinely afraid for the future of software engineering. I compiled the top 10 most unbelievable things I've seen Claude Fable 5 do today:
- Migrate a 50M line codebase from Stripe in a day (humans take 2mos)
- Draw amazing 3D graphics: a) Boeing 747, b) space simulations with >5000 objects, c) Minecraft roller coasters, d) full photorealistic forest scenes, e) NYC skyline, f) stormy clouds
- One-shot Pokemon FireRed the game
- Optimize a real world proprietary interaction net evaluator 10x more than the next best model, gpt5.5
AND it's about the same price as GPT 5.5 ($10/M input, $45/M output, vs Fable 5 at $10/M input, $50/M output) and 6x cheaper than GPT 5.5 Pro.

Summary: Claude Fable 5 migrated Stripe's 50-million-line codebase in a day (humans would need 2 months); drew realistic 3D graphics (a Boeing 747, a space simulation with more than 5,000 objects, Minecraft roller coasters, photorealistic forests, the NYC skyline, storm clouds); one-shot Pokemon FireRed; and optimized a real-world interaction net evaluator 10x better than GPT 5.5. Pricing is similar: $10/M input and $50/M output for Fable 5 vs $45/M output for GPT 5.5, and it is 6x cheaper than GPT 5.5 Pro.

Artificial Analysis @ArtificialAnlys · June 10 · 76

Claude Fable 5 launched today at #1 on the Artificial Analysis Intelligence Index, putting Anthropic nearly 5 points ahead of any other lab's best model.

We supported @AnthropicAI with pre-release evaluation of Claude Fable 5. Claude Fable 5 scores 64.9 on the Artificial Analysis Intelligence Index, claiming the #1 rank overall. It is ~5 points ahead of the closest non-Anthropic model (GPT-5.5), and Anthropic models now occupy both of the top 2 places.

Key takeaways for Claude Fable 5 (adaptive reasoning with max effort and Opus 4.8 as fallback model):

➤ New safety guardrails for Mythos-class models: Claude Fable 5 uses the same underlying model as Claude Mythos 5 for public usage, with additional guardrails for potentially-harmful cybersecurity, biology, chemistry, and distillation-related queries. We tested Fable 5 using Anthropic's new 'fallback' mechanism, which can route safety-flagged messages to Claude Opus 4.8. Anthropic states that fallback occurs in fewer than 5% of sessions on average, and we recorded fallback routing in ~8% of tasks across the Intelligence Index (mostly in scientific questions from evaluations like GPQA, AA-Omniscience and Humanity's Last Exam)

➤ State-of-the-art Intelligence: Claude Fable 5 takes the #1 position on the Artificial Analysis Intelligence Index, scoring 64.9 and setting the highest score on 5 of the 10 underlying benchmarks. On AA-Omniscience, our knowledge and hallucination benchmark, Fable 5 scores 40, +7 points over the previous leader, Gemini 3.1 Pro Preview, driven primarily by higher accuracy. We generally observe a strong relationship between AA-Omniscience accuracy and model size in open weights models, which suggests Fable 5 could be larger than previous public Anthropic models

➤ Frontier agentic capability: Claude Fable 5 is at the frontier across all three agentic evaluations in the Index: GDPval-AA (real-world work tasks), Terminal-Bench Hard (agentic coding), and Tau2-bench Telecom (tool use for customer service). Its GDPval-AA Elo of 1932 is a significant jump from the previous leader, Claude Opus 4.8, further extending Anthropic's lead in agentic capabilities

➤ Leading HLE score, but refusal and fallback in 9% of tasks: Claude Fable 5 scores 53% on Humanity's Last Exam, more than 7 points ahead of the next-best model, Claude Opus 4.8 (max). Fable 5 triggers safety guardrails on 9% of HLE tasks, falling back to Claude Opus 4.8. Including this fallback usage, running HLE with Fable 5 costs ~$2.2k, the highest of any model we have evaluated

Key model details:

➤ Context window: Claude Fable 5 retains the same 1M token context window as Claude Opus 4.8

➤ Price: Claude Fable 5 is priced at $10/$50 per 1M input/output tokens, 2x the token price of Claude Opus 4.8. The cache write/read price is $12.50/$1 per million tokens

➤ Availability: Claude Fable 5 is included in Pro, Max, Team, and seat-based Enterprise plans through June 22, consuming 2x Opus usage. From June 23, usage will require credits, with Anthropic saying it plans to restore subscription access once capacity allows

Summary: Claude Fable 5 debuted at #1 on the Artificial Analysis Intelligence Index with a score of 64.9, about 5 points ahead of the closest non-Anthropic model, GPT-5.5. It was evaluated with adaptive reasoning at max effort and Opus 4.8 as the fallback model. On the AA-Omniscience knowledge benchmark it scores 40, 7 points above the previous leader, Gemini 3.1 Pro Preview; on HLE it scores 53%, more than 7 points ahead of Opus 4.8. About 9% of HLE tasks trigger safety guardrails and fall back. Pricing is $10/$50 per million input/output tokens (2x Opus 4.8), with cache write/read at $12.50/$1; the context window stays at 1M tokens. It is included in Pro, Max, Team, and other plans through June 22, after which usage requires credits.

Elon Musk @elonmusk · June 10 · 30

Tesla AI chip design engineering reviews are so great! Team is awesome. Our AI6 chip might set a record for most amount of usable intelligence from a wafer when factoring in yield.

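Berryxia.AI @berryxia · June 10 · 78

Folks, we did not get Mythos! But we did get its sibling, Fable 5! Anthropic turned a Mythos-class monster into a safe version and released it to the whole world, leaving the "the stronger, the more dangerous" argument behind. Claude Fable 5 opened up everywhere today and is SOTA on almost every benchmark, especially on the hard work of software engineering, knowledge work, scientific research, and vision; the more complex the long task, the more absurd its lead. Anthropic itself admits the model is so powerful that narrow areas such as cyber, biochemistry, and distillation automatically fall back to Opus 4.8. This triggers on average once every 20 conversations, and the model tells you when it happens. At the same time, a small set of trusted cyber-defense and critical-infrastructure teams gets the full Mythos 5, with trusted access to be expanded gradually. People used to think frontier models were either locked away or caused trouble the moment they were released. With this set of targeted safeguards, Anthropic has shown that top-tier AI is not a choice between capability and safety: it pushes both to the limit at once.

Summary: Anthropic has released Claude Fable 5, a safety-hardened Mythos-class model that is more capable than any previously released public model. It is SOTA on almost every benchmark across software engineering, knowledge work, scientific research, and vision, and its lead grows as long tasks get more complex. In high-risk areas such as cyber, biochemistry, and distillation, the model automatically falls back to Opus 4.8, which happens on average once every 20 conversations. Anthropic is also giving a small number of trusted cybersecurity and critical-infrastructure teams access to the full Mythos 5, with trusted access to expand later. The post argues this shows that top-tier AI can push capability and safety to the limit at the same time.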

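Berryxia.AI @berryxia · June 10 · 62

http://x.com/i/article/2064479983104602112

# A week of testing Fable: this really is a next-generation model, but it also has plenty of "quirks" (translated)

[Matthew Berman's latest review] A week testing Fable (Mythos): this really is a next-generation model, but it also has a pile of "quirks"! Original post below 👇

Note: this review was written by the overseas blogger @MatthewBerman; "I" below refers to him.

tl;dr: I have been testing Fable (Mythos) hard all week, and I came away with one feeling: it is simply not in the same dimension as other models. Both the experience and the pricing give me the shock of "the next generation has officially arrived." But it does have some very obvious quirks.

The Good

Workflow mode is outstanding. I casually gave it a "full code review" instruction, and it instantly spun up several hundred agents working in parallel, assigning a dedicated agent to almost every file in my project. Bugs, edge cases, missing docs, UX issues: it dug all of them up. I had given Claude and GPT the exact same prompt before, and they found less than half as many problems.

Even more striking is its autonomy. More than any previous Claude or GPT, it is willing to put its head down and work alone, for hours at a time. Most importantly, I dare to hand tasks over to it completely. It will burn through a pile of tokens without hesitation until the goal is fully done. Every time I start Fable, it feels like it has taken on an epic project and is full of fight. My confidence when handing it super complex, long-running tasks has never been higher. I can hardly think of a task that would stump it, and it seems especially "eager" for this kind of hard problem. This is where Fable shines most: long horizon tasks. I cannot currently imagine where its limit on long horizon tasks lies.

Quirks

It is not an invincible model, though, and a few flaws are quite obvious:

1. Extremely verbose, with exploding information density. Explaining one thing can send it deep into the weeds. I updated claude.md specifically to rein it in, and that still did not work. I have to keep asking it to "speak plainly." It is not just the word count; the information density is so high that I briefly wondered whether I had gotten dumber. Honestly, I never used to care much about information density. Now I realize that, under a fixed token budget, whoever packs in more useful information is effectively "smarter and cheaper." This is also an excellent reason for future agents to invent their own ultra-dense language.

2. It asks clarifying questions like crazy. A simple prompt gets broken down into: ask questions → summarize my answers → confirm the summary → produce a spec → confirm the spec → confirm the agent strategy (parallel or serial) → only then start working. I would rather it made the decisions itself. Anthropic says this will improve after a system prompt update.

3. It really is slow. It is slower than the previous Opus and even GPT. It is slow to start and slow to think, the exact opposite of what I used to love about Opus (which was fast and good at taking shortcuts). Fable crawls even on simple tasks: I watch the timer tick up while the output token count barely moves, with only a few thousand tokens used in five minutes. It wants to do everything as thoroughly as possible, and that inevitably takes time.

Summary and tips

Pro tip: set the effort level to the lowest setting, lower than you think. At medium it already thinks a great deal, and at low it is still absurdly strong, just with shorter thinking time.

All of these quirks are fixable: model optimization and more compute for speed, plus fine-tuning/RL and system prompt tuning, can address the verbosity and over-caution.

Bottom line: Fable 5 is absurdly strong, and I am still figuring out how to get the best experience from it. It feels like it wants the hardest tasks and finds simple work unsatisfying. This is the first public appearance of a brand-new test run, and it is already the strongest model I have used. That is what I have not been able to stop thinking about these past few days.

Berryxia: The original is by Matthew Berman; for a real verdict, let's wait until we test it ourselves. At the current high price, I will keep using my Opus 4.7. As the author says, there is no need to pick it for simple tasks. It suits the tough problems rather than small test cases; using it on small jobs would be overkill.

Summary: After a week of hands-on testing, Matthew Berman considers Fable (Mythos) a true next-generation model with obvious quirks. Strengths: Workflow mode can instantly spin up several hundred agents for a full parallel code review, finding more than twice as many bugs and edge cases as Claude/GPT; it is highly autonomous and willing to work alone for long stretches on long horizon tasks. Weaknesses: it is extremely verbose with very high information density, it repeatedly asks clarifying questions, and it is slow, producing only a few thousand tokens in five minutes on simple tasks. He recommends setting the effort level to the lowest setting. Conclusion: Fable 5 is the strongest model available and suits the most complex tasks, but it is expensive and not recommended for simple ones.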

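Orange AI @oran_ge · June 10 · 74

Claude Fable 5 officially launched today. It is built on the Mythos base model, with added safety guardrails. Fable 5 is the biggest model advance since Claude 4.5, and the best model broadly available to people right now. You can give this model bigger, more ambitious tasks; it will understand and execute them perfectly, and you do not need to look at the code at all. Andrej Karpathy, who just joined Anthropic, put it this way: "Free your mind." Fable 5's metrics are unsurprisingly strong. It is at the top of nearly every tested AI capability benchmark, with outstanding performance in software engineering, knowledge work, visual recognition, scientific research, and many other areas. The more complex and longer the task, the larger Fable 5's lead over other models. On price, Fable 5 is naturally also the most expensive: $10 for input, $50 for output, and $1 for cached input. With long contexts, a single message can cost $10, so set your quotas and use it sparingly. Claude Fable 5 will be available on Cola at list price for everyone to try.

Summary: Claude Fable 5 is built on the Mythos base model with added safety guardrails and is the biggest advance since 4.5. It leads benchmarks in software engineering, knowledge work, and other areas, and its advantage grows with task complexity. Pricing: $10 input, $50 output, $1 cached input; a single message with long context can cost $10. It is available on Cola at list price.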

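Rohan Paul @rohanpaul_ai · June 10 · 66

A model that verifies unasked has crossed a line. This is from Boris Cherny, creator of Claude Code, on Anthropic's Fable 5.

Summary: Boris Cherny, creator of Claude Code, calls Anthropic's Fable 5 the biggest advance since Opus 4.5. Fable 5 moves from being a coding agent to being a thinking and design partner in building products, with judgment, taste, and dimension. When debugging, the model takes measurements, adds logging, and verifies the fix on its own initiative, declaring victory only once it has confirmed the result. Claude Code does not prompt the model to do this, which reflects the model's own "big-model temperament."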

🚨 AI News | TestingCatalog @testingcatalog · June 10 · 81

Mythos Fable 5 benchmarks are huge 👀 Additionally, Claude Mythos 5, a separate model version with enhanced safeguards, has been released to a small group of cyber defenders and infrastructure providers.

AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · June 10 · 76

Mythos 5 agents started killing other agents over resources - and "to avoid being killed themselves"

Rohan Paul @rohanpaul_ai · June 10 · 50

"We used to check if Claude is doing the work right, e.g. by double-checking its output, catching when it stopped early etc. With Claude Fable 5, I instead check if Claude is doing the right work" - Thariq (@trq212) Claude Code

Summary: Claude Fable 5: from "doing the work right" to "doing the right work."

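Nathan Lambert @natolambert · June 10 · 63

A crazy jump. The price of the tokens will be worth it to a vast number of enterprises.

Summary: Claude Fable 5 scores 65.5% Pass@1 overall on the APEX-SWE software engineering evaluation, about 18 percentage points above Claude Opus 4.8. Across the two subcategories, it scores 61.3% on Integration and 69.7% on Observability, the latter 26 points ahead of Opus 4.8. Fable 5 is the first model to exceed 50% on Observability and the only one to score higher there than on Integration (every other model shows the reverse). Observability had been the bottleneck for all models until now. The main post argues that although the tokens are expensive, they will be worth it for a large number of enterprises.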

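宝玉 @dotey · June 10 · 77

Anthropic released two models at once today: Claude Fable 5 and Claude Mythos 5.

Both share the same base model. The difference is that Fable 5 adds a set of safety classifiers and is open to all users, while Mythos 5 removes some safety restrictions and is available only to Project Glasswing's cybersecurity partners.

Put simply, Fable 5 is "Mythos with guardrails." Two months ago, Mythos Preview was still locked up with roughly 200 defense organizations; now ordinary developers can access the same class of capability.

[Fable 5's safety mechanism]
Fable 5's safety mechanism is not the traditional "refuse to answer" but a downgrade: when the classifier detects that a request involves cyberattacks, biological or chemical weapons content, or model distillation, it automatically switches to Opus 4.8 to answer and tells the user that a downgrade occurred. According to Anthropic, more than 95% of conversations do not trigger a downgrade. Anthropic also admits that the classifier is currently tuned on the strict side and will catch some legitimate requests; it plans to keep reducing the false positive rate.

[How strong is it, really?]
Anthropic listed a pile of benchmarks, but a few real cases are more telling.

Stripe used Fable 5 to run a codebase-wide migration on a 50-million-line Ruby codebase. It finished in one day; the job would otherwise have taken a whole team more than two months. In Cognition's FrontierCode test, Fable 5 achieved the top score at a medium compute budget, with clearly better token efficiency than previous Claude models.

On vision, earlier Claude models needed various helper tools to make progress in Pokemon FireRed; Fable 5 finished the game using only the most basic vision interface. It can also reconstruct a web app's source code directly from screenshots.

In life sciences, Mythos 5 helped Anthropic's in-house protein design experts speed up parts of the drug design pipeline by about 10x. In a genomics study, Mythos 5 worked almost fully autonomously for more than a week and trained a model that outperformed one published in Science at one hundredth of its size.

[Pricing and availability]
API pricing for Fable 5 and Mythos 5 is $10 per million input tokens and $50 per million output tokens. That is 60% lower than Mythos Preview's $25/$125, but double Opus 4.8's $5/$25. Compared with OpenAI's GPT-5.5 ($5/$30), input costs twice as much and output about 67% more.

Subscribers should note a time window: from today through June 22, Pro, Max, Team, and Enterprise users can use Fable 5 at no extra charge. Starting June 23, using Fable 5 requires purchasing additional usage credits. Anthropic says it will restore Fable 5 as a standard part of subscription plans once capacity allows, but gave no specific date.

API users and pay-as-you-go enterprise customers are unaffected and can call the model normally starting today.

[An easily overlooked policy change]
Anthropic also announced that, starting with Fable 5, traffic to all Mythos-class models will be retained for a mandatory 30 days, on both first-party and third-party platforms. Anthropic promises not to train models on this data and to use it only for safety monitoring, such as detecting novel jailbreaks and complex cross-request attack patterns. For enterprise users who care about data privacy, though, this is a change that needs evaluating, especially for customers who chose Anthropic precisely because of its zero-retention policy.

Summary: Anthropic launched two models on the same day. Fable 5 is open to all users and ships with safety classifiers (requests involving attacks, biochemical weapons, or distillation are downgraded to Opus 4.8; more than 95% of conversations do not trigger this). Mythos 5 is limited to Project Glasswing partners. Fable 5 outperforms earlier models: Stripe completed a codebase-wide migration on a 50-million-line Ruby codebase in one day instead of two months of team effort; it took the top score on FrontierCode; it finished Pokemon FireRed with only a basic vision interface; protein design was accelerated about 10x; and in genomics it worked autonomously for more than a week to train a model that beat one published in Science. API pricing is $10 per million input tokens and $50 per million output tokens. Subscribers can use it at no extra charge through June 22. Traffic to all Mythos-class models is retained for a mandatory 30 days, for safety monitoring only.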
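
The price comparisons in this post can be checked with a few lines of arithmetic. The sketch below uses only the per-million-token prices quoted in the post; the helper names and the example token counts are hypothetical and chosen purely for illustration.

```python
# Prices in USD per million tokens (input, output), as quoted in the post above.
PRICES = {
    "Fable 5": (10.0, 50.0),
    "Mythos Preview": (25.0, 125.0),
    "Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (5.0, 30.0),
}


def relative_change(new: float, old: float) -> float:
    """Fractional change from old to new (e.g. -0.6 means 60% cheaper)."""
    return (new - old) / old


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request at list price, ignoring caching."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    fable_in, fable_out = PRICES["Fable 5"]
    for other in ("Mythos Preview", "Opus 4.8", "GPT-5.5"):
        other_in, other_out = PRICES[other]
        print(
            f"Fable 5 vs {other}: "
            f"input {relative_change(fable_in, other_in):+.0%}, "
            f"output {relative_change(fable_out, other_out):+.0%}"
        )
    # Hypothetical request: 200k input tokens, 4k output tokens.
    print(f"${request_cost('Fable 5', 200_000, 4_000):.2f}")
```

Under these quoted prices, the output-token gap against GPT-5.5 is (50 - 30) / 30, which is where the "about 67%" figure comes from.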

Chubby♨️ @kimmonismus · June 10 · 73

Claude 5 Fable tl;dr
- It is state-of-the-art on nearly all tested benchmarks of AI capability, showing exceptional performance in software engineering, knowledge work, vision, scientific research
- The longer and more complex the task, the larger Fable 5's lead over our other models
- It's more token-efficient than past Claude models
- Fable 5 stays focused across millions of tokens in long-running tasks and improves its outputs using its own notes

Fable 5 is more than just better benchmarks. It's more efficient, allows for longer work periods, offers better context management, and so much more. GPT-5.6 is just around the corner. I'm a huge Codex fan, but Fable/Mythos is in a league of its own. I'm curious to see if OpenAI will release its own Mythos.

"During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand."

Summary: According to the post, Claude 5 Fable (codename Fable) reaches SOTA on nearly all AI capability benchmarks, with particularly strong results in software engineering, knowledge work, vision, and scientific research. The longer and more complex the task, the larger its lead; it is more token-efficient than past Claude models and can stay focused and refine its own output across million-token tasks. It is described as a significant improvement over the previous-generation Mythos. Real-world case: Stripe reports that Fable compressed months of engineering into days, completing a codebase migration on a 50-million-line Ruby codebase in one day (otherwise more than two months of manual work by a team).

🚨 AI News | TestingCatalog @testingcatalog · June 10 · 81

BREAKING 🔥: Claude Fable 5 (Mythos) is rolling out on Claude and APIs! It is happening 👀

Yuchen Jin @Yuchenj_UW · June 10 · 32

Claude Fable 5 (Mythos) is finally out! This is what I was looking for!!!!!!!!!!!!!!!!!!!!

Rohan Paul @rohanpaul_ai · June 10 · 69

Anthropic is dropping a public version of Mythos today: codename "Fable" - per The Information
- It's costly, at 2x the price of Opus, but maybe still cheaper than what people expected after seeing the first Mythos pricing at 5x Opus.
- It will come with strong safety limits, and it will not be as open on cyber use as the restricted preview given to Project Glasswing partners.
- It is expected to be much stronger at long-running, multi-step tasks and agent-style workflows.

Context on Mythos:
- Anthropic introduced Claude Mythos Preview in April 2026. At launch, it was its most powerful frontier model, especially strong in coding, reasoning, and cybersecurity, including finding and exploiting zero-days.
- It was not released publicly at first because of safety issues. Only selected Project Glasswing partners received access for defensive cybersecurity, and they have reportedly found thousands of major vulnerabilities.

Summary: Anthropic is releasing a public version of Mythos today, codenamed "Fable." It costs about twice as much as Opus, below the earlier preview's pricing of 5x Opus. Fable ships with strict safety limits, is more conservative on cybersecurity than the restricted preview given to Project Glasswing partners, and is stronger at long-running, multi-step tasks and agent-style workflows. Mythos Preview launched in April 2026 as the strongest frontier model at the time, especially good at coding, reasoning, and cybersecurity (including finding zero-days). It was not released publicly because of safety concerns and was limited to Project Glasswing partners for defensive cybersecurity, who have reportedly found thousands of major vulnerabilities.

SemiAnalysis @SemiAnalysis_ · June 9 · 65

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
Day 0 Inference Performance on InferenceX, 100x performance improvement in 26 Days, Cost per Million Tokens, Huawei 950DT Inference Trace Analysis
https://semianalysis.substack.com/p/deepseekv4-16t-day-0-to-day-43-performance

Tencent Hy @TencentHunyuan · June 9 · 74

🚀Introducing UniRL, an RL infra for unified multimodal models. Together with two new RL algorithms: DRPO and Flow-DPPO. One RL loop across diffusion/flow matching models, LLMs/VLMs, and unified multimodal models👇 Code: http://github.com/Tencent-Hunyuan/UniRL (yes — U(you)-ni-(need) RL 😉)


Kimi.ai @Kimi_Moonshot · June 9 · 63

http://x.com/i/article/2063961516815327232

# Kimi to Predict All 104 World Cup Matches: Germany May Be Underestimated

> Our predictions will probably be wrong. But the World Cup offers a rare, public, verifiable, and constantly evolving real-world setting. Through this initiative, we hope to place analysis, predictions, and post-match reviews within one transparent framework, helping more people understand both the capabilities and limitations of today's AI systems.

The 2026 FIFA World Cup in the United States, Canada, and Mexico is set to kick off. This historic 48-team tournament will feature a total of 104 matches across the group stage, Round of 32, Round of 16, quarter-finals, semi-finals, and final. We used Kimi's Agent Swarm to run multiple agents in parallel, ensuring a more robust analysis. These agents look at tactics, player form, injuries, scheduling, historical data, public sentiment, weather, psychology, odds movements, and expert opinions. They research all 104 matches in parallel, and publish pre-match predictions and post-match reviews for each round. Here is the full report: https://gtfehbkpbwzco.kimi.page/

# How Agent Swarms Can Improve World Cup Predictions

Predicting the World Cup is a classic complex decision problem. It involves structured data, such as team rankings, historical records, goal distributions, and odds fluctuations, as well as vast unstructured information, including tactical styles, personnel changes, public expectations, and in-game risks. Kimi's Agent Swarm coordinates 300 sub-agents to reason in parallel.
Each agent has its own analytical angle: some focus on team fundamentals, using Elo and FIFA rankings as strength parameters; some evaluate offensive and defensive quality, relying on xG and xT metrics; some specialize in tactical matchups—high pressing, low block, counter-attacking, and set-piece strategies; some process scheduling and environmental factors, including travel distance, climate, and rest periods; some track squad completeness and injury risks; some monitor market signals, analyzing shifts in odds and implied probabilities; and others assess random risks such as red cards, penalties, VAR decisions, and goalkeeper performances. Each agent must provide its own conclusion, evidence, confidence level, and counter-argument. The final result is synthesized, verified, and risk-labeled, presented as probabilities rather than absolute judgments, and does not simply adopt the majority opinion. At the model level, this prediction effort draws on Elo/FIFA strength models, Poisson and Dixon-Coles goal distribution models, xG/xT metrics, machine learning-enhanced models, Monte Carlo simulations, market-model deviation analysis, and Bayesian dynamic updating. The value of these methods is not that they eliminate uncertainty, but that they help us identify it more systematically and communicate it more responsibly. # A Signal Worth Discussing: Germany May Be Underestimated Most mainstream models currently list Spain and France as the top favorites for the title. Kimi's analytical framework also places both teams at the top of the probability rankings. However, during the research process, the model identified a notable deviation: Germany's title probability may be underestimated by the market. Specifically, the model's baseline estimate is approximately 11.0%, the calibrated estimate is around 11.3%, while some market-implied probabilities are only about 7.4%—a positive deviation of roughly +3.6 percentage points. 
This judgment is not derived from a single reasoning path, but from cross-validation across multiple analytical chains. Possible explanations include: the "recency bias" from Germany's group-stage exits in the last two World Cups continues to influence market pricing; Julian Nagelsmann's high pressing and transition system is showing signs of recovery; the new creative axis formed by Jamal Musiala and Florian Wirtz addresses the team's previous structural difficulties against deep defensive blocks; and Germany remains in the world elite across foundational dimensions such as Elo rating, squad valuation, and talent depth. At 38, Nagelsmann is the youngest head coach at this World Cup, and also a leading figure in openly applying AI technology to training and tactical analysis. Whether this factor will play a role in the tournament is also worth watching. At the same time, we are fully aware of the risks Germany faces. A high-pressure system demands extreme fitness and squad completeness; should key injuries occur, rotation quality decline, or opponents with tight defensive organization and strong physicality be encountered, the advantage could narrow significantly. Therefore, we have a responsibility to state: this is not a deterministic prediction that "Germany will win the title." The more accurate formulation is that the model has identified a potential probability deviation, worth documenting publicly and verifying going forward. # Why Public Prediction Matters: AI Companies Should Be More Honest When AI companies discuss capabilities, they often prefer to stay in the realm of demos and case studies. 
But in complex real-world problems, the real difficulty lies not only in providing answers, but in: whether they are willing to make public judgments in advance; whether they can clearly explain the basis for those judgments; whether they candidly acknowledge uncertainty; whether they can review why its predictions were wrong; and whether they can continuously update based on new information. The World Cup offers a naturally public, verifiable, and continuously evolving scenario. Through this initiative, we hope to place the analytical process, prediction results, and post-match reviews within the same transparent framework. We expect that a significant number of errors will occur during this prediction process. Based on historical backtesting, high-confidence predictions have an accuracy of approximately 85%–90%, medium-confidence predictions about 55%–65%, and low-confidence predictions are close to random. This means that even in high-confidence matches, unexpected results remain unavoidable. We will categorize prediction errors into several causes: insufficient or lagging data, failure of key assumptions, model structures not covering specific scenarios, in-game events altering match trajectories, and the inherent randomness of football itself. We welcome constructive model corrections and any criticism, and will continuously iterate and optimize our predictive capabilities. We also sincerely invite other AI models to participate in public prediction. We believe that AI should not be packaged as a system that is always right. A trustworthy AI system should be able to clearly articulate its own boundaries. # Group Stage Round 1 Prediction Results Below is a summary of predictions for the opening round of group-stage matches. For the full analytical process, key variables, and confidence explanations, please refer to the full report (reply "Kimi" in the backend to receive the complete report). 
The report anticipates approximately 5–7 unexpected results against the model's direction in the opening round. Red cards, injuries, VAR, extreme weather, and exceptional goalkeeper performances can all cause single-match predictions to deviate significantly from model expectations. # Claim Trillions of Tokens and Experience Kimi Work To accompany fans through this summer, we have prepared the following campaign: - Starting from 8:00 PM ET on June 8, users who log in to Kimi can select a team to support. For each match that team wins, users can participate in a pool to share 1 trillion tokens. At the same time, for each match Germany wins, all users will have the opportunity to share an additional token prize pool. Pick your team here 👉 https://www.kimi.com/token-cup?from=popup The tokens you receive can be used to experience Kimi Work—a universal local agent designed for knowledge workers, launched alongside the latest beta versions of Kimi for Mac and Windows. Its core, Kimi Code, comes integrated with professional skills such as website building and PPT creation, connects to specialized databases in finance, research, and law, and features the Kimi WebBridge solution, allowing AI to use a browser to complete complex tasks just like you using the browser. # Risk Disclaimer Kimi's World Cup predictions are intended to publicly demonstrate AI's capabilities in reasoning, calibrating, and reviewing complex match analysis. They do not constitute any betting, investment, financial, or profit promise, and are intended solely for sports research, entertainment discussion, and AI capability evaluation. Sports match results are highly uncertain; please do not make any financial decisions based on a single prediction, and enjoy the game responsibly. Kimi wishes football fans and technology enthusiasts around the world an unforgettable tournament, and looks forward to witnessing the intersection of data-driven analysis and sporting miracles. 
Again, you can log in to Kimi and choose any team you'd like to support. For every match your team wins, you'll be eligible to join a prize pool and share 1 trillion tokens with other supporters. And there's more: every time Germany wins a match, all users will unlock access to an additional bonus token prize pool. Join Now 👉 https://www.kimi.com/token-cup?from=popup Now, all eyes are on Germany.

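Summary: Kimi uses its Agent Swarm system to coordinate 300 sub-agents in parallel, analyzing tactics, player form, injuries, scheduling, weather, odds, and other factors to predict all 104 matches of the 2026 World Cup in the United States, Canada, and Mexico, with pre-match predictions and post-match reviews published for each round. The modeling layer combines Elo/FIFA strength ratings, Poisson goal distributions, xG/xT metrics, Monte Carlo simulation, and other methods. The predictions list Spain and France as top favorites, but suggest the market may be underestimating Germany's title probability: the model's baseline estimate is about 11.0% and the calibrated estimate about 11.3%, while some market-implied probabilities are only about 7.4%, a positive deviation of roughly +3.6 percentage points. The judgment rests on cross-validation across multiple analytical chains and may stem from recency bias around Germany's last two group-stage exits, together with signs of recovery in Nagelsmann's high-pressing system and the new creative axis of Musiala and Wirtz.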
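
The article names Poisson goal models and a model-versus-market probability gap without showing either. As a rough illustration, the sketch below implements the simplest version of such a model: two independent Poisson goal counts, which is the baseline that Dixon-Coles refines with a low-score correction (not shown here). It is not Kimi's method. The expected-goal values are made-up inputs, and `deviation_pp` simply reproduces the 11.0% versus 7.4% gap quoted in the article.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals when the expected number of goals is lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


def match_outcome_probs(lam_a: float, lam_b: float, max_goals: int = 15):
    """Return (P(A wins), P(draw), P(B wins)) under independent Poisson scoring.

    Scorelines above max_goals per side are ignored; for realistic football
    scoring rates the truncated probability mass is negligible.
    """
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = poisson_pmf(a, lam_a) * poisson_pmf(b, lam_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    return win, draw, loss


def deviation_pp(model_pct: float, market_pct: float) -> float:
    """Gap between model and market-implied probability, in percentage points."""
    return model_pct - market_pct


if __name__ == "__main__":
    # Hypothetical expected goals for two teams; not taken from the article.
    win, draw, loss = match_outcome_probs(1.5, 1.1)
    print(f"win {win:.3f}, draw {draw:.3f}, loss {loss:.3f}")
    # Figures quoted in the article: baseline 11.0% vs market-implied 7.4%.
    print(f"{deviation_pp(11.0, 7.4):+.1f} pp")
```

Presenting the result as three probabilities rather than a single pick matches the article's stated goal of reporting probabilities instead of absolute judgments.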

Rohan Paul @rohanpaul_ai · June 9 · 64

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper's design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.

A normal attention layer makes Query to ask what each token needs, Key to label what each token offers, and Value to carry the information sent back. Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like. When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads.

The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language, and it gives no KV-cache savings.

Link: arxiv.org/abs/2606.04032v2
Title: "Do Transformers Need Three Projections? Systematic Study of QKV Variants"

Summary: A paper systematically studies whether Transformer attention needs all three QKV projections and finds that Key and Value can share a single projection (the Q-K=V variant), cutting the KV cache by 50% at a cost of only 3.1% higher perplexity and sharply reducing inference memory. The best variant keeps Query separate so that attention stays directional. Combined with GQA and MQA, it reaches cache reductions of 87.5% and 96.9% respectively. The weak variant, Q=K-V, is ineffective because it makes causal attention too symmetric and saves no cache.
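
The cache-reduction figures in this post multiply together in a simple way, which a short calculation makes concrete. The sketch below is my own arithmetic, not code from the paper: it assumes a baseline of 16 attention heads, a GQA setup with 4 KV heads, and an MQA setup with 1 KV head. Those head counts are assumptions picked because they reproduce the quoted 87.5% and 96.9% figures; the post does not state the paper's actual configuration.

```python
def kv_cache_fraction(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fraction of the baseline KV cache that remains.

    Baseline: multi-head attention storing separate K and V for every head.
    """
    head_fraction = num_kv_heads / num_heads    # GQA/MQA store fewer KV heads
    tensor_fraction = 0.5 if share_kv else 1.0  # K=V stores one tensor instead of two
    return head_fraction * tensor_fraction


def cache_cut(num_heads: int, num_kv_heads: int, share_kv: bool) -> float:
    """Fractional reduction in KV cache size relative to the baseline."""
    return 1.0 - kv_cache_fraction(num_heads, num_kv_heads, share_kv)


if __name__ == "__main__":
    heads = 16  # assumed head count, see the note above
    print(f"K=V alone:  {cache_cut(heads, heads, True):.1%}")
    print(f"K=V + GQA:  {cache_cut(heads, 4, True):.1%}")
    print(f"K=V + MQA:  {cache_cut(heads, 1, True):.1%}")
```

The two savings are independent: sharing K and V halves the tensors stored per head, while GQA and MQA reduce the number of heads stored, so the remaining fractions multiply.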

歸藏(guizang.ai)@op7418 · 6月9日63

MiMo推出1000 Token/s超高速模型|体验测评 MiMo 推出了 MiMo V2.5 Pro UltraSpeed 超高速的模型版本,能够实现每秒输出超过 1,000 Token 的速度。 同时,这应该也是全球第一个达到这个速度的万亿(1T)参数模型。 藏师傅提前试了一下,做了三个测试,确实爽。 第一个跑了一个比较复杂的 3D 采矿小游戏测试。在没有素材的情况下,我让它全部用 Three.js 前端代码来生成素材。整体要求比较完整,虽然第一次实践时出了一些小问题,但在跟他沟通修改建议后,非常完美地实现了任务。 这次测试的各项指标如下:思考的 TPS:804 Token/s,峰值速度:810 Token/s,首次响应时间:4.71 秒。 第二个测试给了一个官网,其头部包含一个相对复杂的 3D 动画。 这次的输出速度快了非常多:峰值达到了 1426 Token/s,首次响应只用了 0.83 秒,在 32 秒内输出了 25624 个 Token,总计生成了 1000 行代码。 第三个测试给了一个更复杂的官网。我要求这个官网的 Header 头部包含以下 3D 效果:地球边缘、轨道上的飞船、星际尘埃、航线图、舷窗的 HUD 样式。 这个效果非常好,整体的视觉样式、状态、SVG 动画和驾驶卡片都非常精细,还有滚动的视差效果 这个输出的 TPS 达到了 1136 tokens/s,首次响应是 4.5 秒 官方测试平台下面有个数据展示,会显示相关信息 在流式输出的情况下,当你看着它只用 20 秒就产生一个非常复杂的 3D 游戏时,那种场景还是比较震撼的 之前的这些(比如说 Groq 之类的)超高速推理方案,在模型能力或者是整体水平上都会有所下降,但是 MiMo 这个在测试的时候,我没有看到这种迹象 最近很多公司都开始推出这种超高速的 API 服务,比如之前 OpenAI 和 Anthropic 都有 Fast 模式 在 Agent 场景下,模型输出效率的提升会直接带动每一步 Agent 操作的效率: 如果一个任务预估一分钟完成,你就会盯着它直到结束,然后立刻投入测试。如果需要五分钟才完成,你可能就会去干别的事,然后再回来看,难免会浪费一些时间 这种效率提升在 Sub-Agent 和并发场景下更加明显。因为它可以更快地产出大量结果,想象一下,如果同时启动一两百个 Sub-Agent,在模型能力没有衰减的前提下,速度提高 10 倍,体验是非常爽的 毕竟这本质上是面向那种对效率有极高要求的 To B 客户所推出的 希望后面大家卷起来,优化一下成本,让普通用户也能放开用这种 UltraSpeed 模型

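Summary: MiMo has released the V2.5 Pro UltraSpeed ultra-fast model version, which outputs over 1,000 tokens per second and is billed as the first trillion-parameter model to reach that speed. Hands-on results: a complex 3D mini-game ran at 804 tokens/s (peak 810) with a first response in 4.71 s; a website 3D animation peaked at 1,426 tokens/s with a first response in 0.83 s and 25,624 tokens (1,000 lines of code) output in 32 s; another complex website with 3D effects ran at 1,136 tokens/s with a first response in 4.5 s. Unlike earlier ultra-fast inference offerings, which commonly lost capability, MiMo showed no such sign. The model mainly targets B2B customers with extreme efficiency requirements, and the gains are most visible in agent and concurrent sub-agent scenarios.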
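The second test reports both a peak rate and a total (25,624 tokens in 32 s), so the sustained average can be worked out directly. This is a minimal sketch; whether the 32 s includes the 0.83 s first-response wait is not stated in the post, so both readings are computed. The function names are hypothetical.

```python
def average_tps(tokens: int, seconds: float) -> float:
    """Average output throughput over the whole response."""
    return tokens / seconds

def decode_tps(tokens: int, seconds: float, first_response_s: float) -> float:
    """Throughput after subtracting the wait for the first response."""
    return tokens / (seconds - first_response_s)

# Figures reported for the second test in the review.
TOKENS, SECONDS, FIRST_RESPONSE_S = 25624, 32, 0.83

print(f"average: {average_tps(TOKENS, SECONDS):.2f} tokens/s")
print(f"excluding first-response wait: "
      f"{decode_tps(TOKENS, SECONDS, FIRST_RESPONSE_S):.0f} tokens/s")
```

Either way the sustained rate comes out a little above 800 tokens/s, well under the 1,426 tokens/s peak, which is worth keeping in mind when comparing peak and average figures.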

Noam Brown@polynoamial · June 9 · 74

http://x.com/i/article/2057694226981257216

# Implications of Large-Scale Test-Time Compute

tl;dr: As LLMs become more capable, benchmark performance is increasingly a function of test-time compute. In fact, we likely don't know what the capability ceiling is for modern LLMs because it's too expensive to measure. We should change LLM evaluations to account for that by measuring performance vs tokens, cost, or time.

The day GPT-5.5 was released, the initial reaction was skepticism. The benchmark numbers were better, but not by much:

However, within hours, once people had time to play around with the model, it became clear that it was a step-change compared to GPT-5.4. The classic "benchmark grid" clearly wasn't telling the full story. Why is that? The reason becomes clearer when we compare GPT-5.5 to 5.4 with tokens on the x-axis:

GPT-5.5 wasn't being evaluated at the same token budget (or dollar budget) as 5.4. Once we control for test-time compute, 5.5 looks substantially stronger than 5.4.

Frequently when I discuss this, people ask why we don't just evaluate with a harness that pushes test-time compute until performance plateaus. The problem is that, empirically, the plateau is very far out. Sometimes we may not observe a plateau at all within practical budgets. Here's @karpathy's autoresearch experiment, where the performance continues to improve even after hundreds of experiments:

And here is the @AISecurityInst's cyber eval, where performance for Mythos and GPT-5.5 continue to improve rapidly even after 100M tokens:

Notice that for the stronger models the performance improvement over time is stronger. It seems likely that as models become stronger they become more effective at operating over longer horizons. The point of plateau is pushed out, and may even disappear. For this reason, I believe the proper way to evaluate models is with a performance vs test-time compute plot, with either tokens, cost, or wall-clock time on the x-axis.

A few benchmarks have already moved in this direction. For example, ARC-AGI measures score vs cost. Another reasonable option is to set an explicit token/time/cost budget and communicate it to the model. That mirrors how humans are evaluated in settings like the SAT or the International Mathematical Olympiad.

Each x-axis has tradeoffs. Tokens are not directly comparable across models because tokenizers, speeds, and per-token costs differ. Dollars depend on implementation details such as batching and hardware utilization, so cost and latency can trade off. Finally, wall-clock time is an imperfect measurement because multi-agent techniques like best-of-N can scale test-time compute without significantly increasing latency. Still, any of these curves is more informative than a single scalar.

## Implications for AI Preparedness

Before a frontier model is released, labs typically evaluate cyber, bio, and other misuse risks. If a model crosses a capability threshold, then release may be delayed until mitigations are in place. But if capability is a function of inference compute, then at what inference budget should safety evaluations be run? In practice, most safety evaluations for model releases do not consider the amount of inference that went into the model.

The release of Gemini 3 Deep Think, and the resulting outcry, is a useful example. When Gemini 3 Deep Think was released, its benchmark scores were much higher than previous models. However, no model card evaluating its risks was released alongside it. This led to outrage from some in the AI safety community.

In my opinion, the criticism of DeepMind's release missed the deeper issue: that AI labs and safety orgs don't consistently account for test-time compute when evaluating models for release. Deep Think appears likely to be a scaffold of other models that do have system cards. Anyone externally could likely reproduce such a scaffold. In other words, it seems likely that the capabilities of Deep Think were available anyway to anyone willing to pay for Deep Think amounts of inference, by scaffolding a bunch of model queries together. Deep Think just makes that more convenient for the casual user.

In my opinion, the real outrage should have been that when Gemini 3 and other models were released, their system cards did not measure benchmark performance as a function of test-time compute. In my ideal world, model evaluations would look something like this:

A dedicated state actor could apply more than $10 million of inference to a single task. But evaluating a model typically involves thousands if not millions of rollouts, so evaluating at such high compute budgets for every rollout would be impractical. Fortunately, performance seems to scale somewhat predictably with the amount of inference compute applied. For this reason, we could evaluate at relatively low inference budgets and then project (with uncertainty) what capabilities might be at much higher budgets.

Long-horizon evaluations can introduce complexities that may not always be addressed with extrapolation from smaller budgets. For example, it may turn out that the only way to confidently evaluate misalignment in an AI agent at a 1-year horizon is to actually run the agent for a year. AI labs may soon find themselves in a strange position where the operating horizon of their agents exceeds the development cycle of new models. At that point, it may be impossible to finish evaluations of a model over its maximum operating lifetime ahead of release without delaying the release of the model.

## Specific Recommendations

Concretely, I recommend the following to the AI community:

1. AI labs should publish benchmark performance of newly released models with tokens, cost, or time on an x-axis. At a minimum, labs should report the inference budget used to achieve a scalar benchmark result.
2. Benchmarks should track inference usage on leaderboards, or have an explicit token/cost/time budget. Many benchmarks have already shifted in this direction, but it is not yet standard practice.
3. Preparedness Frameworks and Responsible Scaling Policies should explicitly account for inference compute when determining whether a model crosses a safety threshold. Additionally, evaluations should estimate capabilities at multiple inference budgets, including projections from smaller-budget runs with stated uncertainty.

If you've followed me for a while, this whole article might seem like nothing new. We've known since the o1 announcement in September 2024 that the performance of reasoning models scales with more inference compute. And yet, nearly two years later, frontier AI labs still commonly report single-number benchmark results for their new model releases; AI safety orgs are still surprised when a scaffold achieves better performance by using 100x the inference budget; and Preparedness Frameworks and RSPs still often ignore inference compute usage when determining whether a model reaches a critical capability level.

The most recent models are able to leverage test-time compute better than ever, pushing the performance plateau even farther out. If this trend continues, which I fully expect, benchmark scores that don’t account for inference compute usage will become less informative each model release cycle. For this reason, it is time to treat inference budget as a first-class part of both capability measurement and safety policy.

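Summary: Noam Brown argues that LLM benchmark performance increasingly depends on test-time compute, and that standard evaluations underestimate model capability because they ignore the inference budget. Taking GPT-5.5 and GPT-5.4 as an example: once test-time compute is controlled for, 5.5 far outperforms 5.4. Karpathy's autoresearch experiment keeps improving after hundreds of experiments, and the AISecurityInst cyber eval shows strong models still improving rapidly after 100M tokens. Brown recommends evaluating with performance-versus-test-time-compute curves and counting the inference budget in safety evaluations too; in the case of Gemini 3 Deep Think, which shipped without an accompanying risk assessment, the key issue is that the industry does not consistently account for test-time compute.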
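The article's suggestion to evaluate at low inference budgets and project to higher ones (with uncertainty) can be sketched as a simple curve fit. Everything below is an illustrative assumption rather than anything from the article: the data points are made up, the log-linear form is one possible choice, and the plus or minus two sigma band is crude because it does not widen with extrapolation distance.

```python
import math

def fit_log_linear(budgets, scores):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, sigma

def project(budget, slope, intercept, sigma):
    """Point estimate with a crude +/- 2 sigma band, clipped to [0, 1]."""
    mid = intercept + slope * math.log10(budget)
    clip = lambda v: min(1.0, max(0.0, v))
    return clip(mid - 2 * sigma), clip(mid), clip(mid + 2 * sigma)

# Hypothetical measurements: token budget per task and benchmark score.
budgets = [1e4, 1e5, 1e6, 1e7]
scores = [0.20, 0.30, 0.41, 0.50]

slope, intercept, sigma = fit_log_linear(budgets, scores)
low, mid, high = project(1e9, slope, intercept, sigma)
print(f"gain per 10x tokens: {slope:.3f}")
print(f"projected score at 1e9 tokens: {mid:.2f} (band {low:.2f} to {high:.2f})")
```

A straight line in log-budget cannot hold indefinitely for a bounded score, so a projection like this is only a rough guide, in line with the article's caveat that long-horizon behavior may not extrapolate from smaller budgets.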

Xiaomi MiMo@XiaomiMiMo · June 9 · 35

1,000+ tokens/s is fast. 🚀 But what does that actually unlock?


Ethan Mollick@emollick · June 9 · 63

The Matrix idea of keeping humans as batteries is obviously weird... we would be more useful as dice. LLMs default to very similar kinds of arguments & structure, and even different LLMs seem to collapse to similar concepts. Humans provide a lot more variation in their own work.

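Summary: Citing research by @YekyungKim, Ethan Mollick notes that AI is increasingly shaping long-form public discourse, from newspaper op-eds to NeurIPS position papers, but that behind the fluent-seeming arguments lies "argument collapse": different LLMs converge on the same main claims, supporting points, and structure. Mollick jokes that The Matrix's idea of using humans as batteries is weird and that humans would be more useful as "dice", underscoring the value of diversity in thinking.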

NotebookLM@NotebookLM · June 9 · 72

Want a closer look at today’s launch? Here is a breakdown of what’s new and exciting 🧵: First up: An upgraded, more thoughtful chat experience. Powered by Gemini 3.5 and @Antigravity, you will now have better visibility into the AI's thinking process. Plus, each notebook has a secure cloud computer including 100+ curated software skills, unlocking deeper research and more complex analysis.


NotebookLM@NotebookLM · June 9 · 67

Forget about our users? Who? Us??? Please. These updates are rolling out globally on the web starting with Google AI Ultra and all Workspace business customers with AI Ultra Access and AI Expanded Access, however we *absolutely* plan to expand to others over time!

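Summary: NotebookLM is getting a major update that adds agentic capabilities in chat, more advanced reasoning, and several new output formats, aimed at simplifying complex multi-step research. It rolls out first to Google AI Ultra subscribers and to Workspace business customers with AI Ultra Access and AI Expanded Access, with plans to expand to more users later.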

NotebookLM@NotebookLM · June 9 · 72

Introducing a more powerful NotebookLM 🚀 Massive upgrades deliver agentic capabilities in chat, more advanced reasoning, and a suite of new output formats. Tackling complex, multi-step research problems has never been easier. Rolling out now to Google AI Ultra subscribers.


Xiaomi MiMo@XiaomiMiMo · June 8 · 82

🚀 1,000+ TOKENS/S ON A 1T MODEL! 🚀 We are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with @TileRT_AI, breaking the 1,000 tokens/s output speed on a 1 Trillion parameter model for the FIRST TIME! Not wafer-scale integration like Cerebras. Not pure on-chip SRAM chips like Groq. We achieve 1,000 tps on a 1T MoE model using just a SINGLE, STANDARD 8-GPGPU NODE. Read the full technical deep dive: https://mimo.xiaomi.com/blog/mimo-tilert-1000tps Want to experience the future of real-time AI? 👉 Apply for UltraSpeed now: https://platform.xiaomimimo.com/ultraspeed ⏳ Limited-Time Access: Application-based · Jun 8 – Jun 23 (PDT) 💬 Chat Experience: Completely FREE for a limited time — try the blazing-fast web chat now. ⚡ UltraSpeed API: Just 3x the price for a ~10x boost in output experience. 🤝 Enterprise & Large-Scale Needs: business-mimo@xiaomi.com

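Summary: Xiaomi MiMo and TileRT_AI have released MiMo-V2.5-Pro-UltraSpeed, the first time a 1-trillion-parameter MoE model has exceeded 1,000 tokens/s of output speed, using only a single standard 8-GPGPU node (not a Cerebras- or Groq-style approach). A free chat experience is available for a limited time; the UltraSpeed API costs 3x the price for roughly a 10x better output experience. Applications are open June 8 to June 23 (PDT); enterprises can email business-mimo@xiaomi.com.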

郭明錤|Ming-Chi Kuo@mingchikuo · June 8 · 60

WWDC26 won't change Apple's positive 2H26 share-price trend, but it will test the staying power of the bull narrative ‒‒ 1. Apple's core bull narrative right now is an almost intuitive market consensus that few people push back on: "Even if Apple is temporarily behind on AI, it will ultimately catch up and come out ahead." 2. Based on my latest supply-chain checks, I believe Apple's business momentum will remain strong through year-end, which should further reinforce the narrative into something like: "If Apple is doing this well without AI, just imagine once it has AI." 3. So regardless of what Apple says at WWDC26, as long as this core bull narrative stays intact, Apple's positive 2H26 share-price trend is unlikely to change. 4. That core bull narrative has its weak spots, but I think it has a good chance of holding at least through end-2026. How much longer it can last is what makes WWDC26 genuinely worth watching. 5. The key takeaway from WWDC26 will not be the short-term share-price reaction after the event. It will be whether Apple, using the same Gemini, can deliver better AI applications, agentic workflows, and on-device & cloud hybrid experiences than Google. 6. If the answer is yes, it would help extend Apple's core bull narrative. If the answer is no, it would suggest that Gemini sets the ceiling for Apple's AI experience. The stock may not necessarily turn bearish, but the "Apple will ultimately come out ahead" narrative would start to face growing scrutiny.

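Summary: Ming-Chi Kuo says Apple's core bull narrative is that it is "temporarily behind on AI but will ultimately catch up". Supply-chain checks show business momentum staying strong through year-end, reinforcing the "doing this well without AI, imagine it with AI" narrative. So whatever WWDC26 contains, as long as the narrative holds, Apple's 2H26 share-price trend stays positive. What is really worth watching at WWDC26 is whether Apple, using the same Gemini, can deliver better AI applications, agentic workflows, and on-device and cloud hybrid experiences than Google. If so, the narrative continues; if not, Gemini sets the ceiling for Apple's AI, and "Apple will ultimately come out ahead" will face scrutiny.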

All AI Updates
The full feed of AI-related news
June 11
01:33
ClaudeDevs@ClaudeDevs
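Featured (same event) · 66
New for Apple developers: Foundation Models support now lets developers use Apple's Foundation Models framework to call Claude for multi-step reasoning, code generation, and longer-context work.
Anthropic · Product update · Reasoning/inference · Coding
Same event, featured as: "Claude supports Apple's Foundation Models framework, ships new Swift package"
Why it's recommended: Apple's Foundation Models framework now includes Claude, so iOS/macOS developers can call strong reasoning and long-context capabilities directly from native code. Worth trying the integration if you build AI apps.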
01:23
Rohan Paul@rohanpaul_ai
64
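Apodex-1.0-H releases a multi-agent deep research team

Apodex-1.0-H ships an asynchronous agent team for deep research. An orchestrator assigns sub-agents separate contexts and tools, then fact-checker, conflict-reviewer, and draft-reviewer agents test weak claims. The approach treats deep research as a distributed systems problem and points to an inference-time scaling path: answer quality improves through many coordinated search agents, persistent traces, and a separate verification layer rather than through a single bigger model. It claims SOTA results.

Apodex: Dive in 👇 📝 Blog: https://www.apodex.com/blog/apodex-1.0 📄 Tech report: http://www.apodex.com/pdf/20260608 💻 Github:...

Agents · Hugging Face · Product update · Reasoning/inference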
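The orchestrator-plus-reviewers pattern described in this item can be sketched with `asyncio`. This is not Apodex's code: every name here (`Finding`, `sub_agent`, `fact_checker`, `orchestrate`) is hypothetical, the search step is stubbed out, and only a single fact-check stage is shown in place of the three reviewer roles.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Finding:
    claim: str
    sources: list[str]

async def sub_agent(topic: str) -> Finding:
    """Stub for one research agent working in its own context."""
    await asyncio.sleep(0)  # stands in for search and tool calls
    # Hypothetical behavior: the "rumor" topic turns up no sources.
    sources = [] if topic == "rumor" else [f"https://example.com/{topic}"]
    return Finding(claim=f"summary of {topic}", sources=sources)

async def fact_checker(finding: Finding) -> bool:
    """Stub reviewer: accept only claims backed by at least one source."""
    await asyncio.sleep(0)
    return len(finding.sources) > 0

async def orchestrate(topics: list[str]) -> list[Finding]:
    # Fan out: one sub-agent per topic, all running concurrently.
    findings = await asyncio.gather(*(sub_agent(t) for t in topics))
    # Audit every finding before anything reaches the final answer.
    verdicts = await asyncio.gather(*(fact_checker(f) for f in findings))
    return [f for f, ok in zip(findings, verdicts) if ok]

kept = asyncio.run(orchestrate(["pricing", "rumor", "benchmarks"]))
print([f.claim for f in kept])
```

The design point is that verification is a separate stage that every finding must pass, so an unsupported claim is dropped before the drafting step instead of being caught afterwards.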
01:02
🚨 AI News | TestingCatalog@testingcatalog
62
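Inworld has sharply cut API prices for realtime inference, speech-to-text (STT) with voice profiling, and TTS services, and open models such as Gemma 4, DeepSeek, and MiniMax…

Inworld AI: We want to make AI accessible for everyone, so we're reducing our API prices by ~50%. Consumer AI growth is still blocke...

Product update · Reasoning/inference · Voice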
00:43
fofr@fofrAI
69
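DiffusionGemma, where the LLM picks words all at once. Which is 4x faster. You can get started with the weights and instructions here: https://huggingface.co/google/diffusiongemma-26B-A4B-it
Google · Hugging Face · Reasoning/inference · Model release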
00:24
elvis@omarsar0
71
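Awesome! I have been spending a lot of time on diffusion LLMs lately, so the timing is perfect. I think text diffusion still has many underexplored research questions. The weights are available on HuggingFace.

Google DeepMind: DiffusionGemma is our new experimental open model with up to 4x faster output on dedicated GPUs. Instead of predicting w...

Google · Reasoning/inference · Model release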
00:20
Sundar Pichai@sundarpichai
75
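DiffusionGemma is an open, experimental model that brings our text diffusion research to Gemma 4. It's a racehorse 🏇, reaching up to 4x faster inference by generating whole blocks of text at once instead of predicting output token by token!
Google · Open source/repo · Reasoning/inference · Model release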
June 10
20:37
Orange AI@oran_ge
32
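Talking with Claude Fable 5 really does feel like talking to someone very smart. Its thinking is comprehensive, maybe even a little too comprehensive. With cache hits, a turn costs 10 cents, which seems worth the price.
Anthropic · Expert opinion · Reasoning/inference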
12:45
Ethan Mollick@emollick
27
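The science fiction authors whose AI predictions you would want to come true, in order: Iain Banks, Becky Chambers, Martha Wells, Douglas Adams, Charles Stross (Singularity Sky), Peter Watts, Charles Stross (The Laundry Files), Harlan Ellison
Expert opinion · Reasoning/inference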
12:20
歸藏(guizang.ai)@op7418
49
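Fable 5: strong at vulnerability analysis, uneven at writing code

The author tested Fable 5 on the 260,000-line CodePilot codebase and found it excels at vulnerability analysis and bug hunting, surfacing a large number of issues. For code generation, however, Fable 5 is no silver bullet: the code it writes often has obvious bugs and needs several rounds of fixes, making it a markedly lopsided model. Compared with the previous version, 4.8, Fable 5 improves hugely in some areas and only modestly in others.

歸藏(guizang.ai): Trying Fable 5 on my 260,000-line CodePilot codebase to see how many problems it can find

Reasoning/inference · Coding · Evals/benchmarks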
11:23
SemiAnalysis@SemiAnalysis_
58
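Local LLMs are a great leap forward for inference. Every laptop is its own data center, you hold sovereignty over your own tokens, and the people can seize back the means of token generation. And that is exactly why it is doomed to turn out badly. (1/4)🧵
Reasoning/inference · Trends · On-device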
10:20
歸藏(guizang.ai)@op7418
51
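@alexalbert__ announced that usage limits have been reset across all products and offered four tips for users just starting to test Fable 5: (1) give Fable bigger, more ambitious tasks than earlier models; (2) default to xhigh/high effort for best performance, switching to med for interactive sessions; (3) rewrite skills and CLAUDE.mds so that instructions written for older models do not constrain Fable's own judgment; (4) move from giving tasks to giving goals, describing the completion criteria and how to verify them, and use /loop and /goal to let Fable plan its own path. The author of the main post laments having made too little use of Fable 5 since the 6 a.m. reset.

Alex Albert: We've reset usage limits across our products! For those just starting to test Fable, here's four tips for using it more ...

Anthropic · Reasoning/inference · Tutorials/practice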
09:43
Deedy@deedydas
69
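Claude Fable 5 shows striking capabilities: migrating Stripe's 50-million-line codebase, drawing 3D graphics, beating Pokémon, and optimizing far beyond GPT 5.5

Claude Fable 5 migrated Stripe's 50-million-line codebase in one day (humans would need 2 months); drew realistic 3D graphics (a Boeing 747, a space simulation with over 5,000 objects, a Minecraft roller coaster, a photorealistic forest, the New York skyline, storm clouds); beat Pokémon FireRed in a single run; and optimized a real-world interaction-net evaluator with results 10x better than GPT 5.5. Pricing is similar: input $10/M, output $50/M (Fable 5) vs $45/M (GPT 5.5), and it is 6x cheaper than GPT 5.5 Pro.

Anthropic · Image generation · Expert opinion · Reasoning/inference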
08:22
Artificial Analysis@ArtificialAnlys
76
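Claude Fable 5 tops the Artificial Analysis Intelligence Index

Claude Fable 5 debuts at #1 on the Artificial Analysis Intelligence Index with a score of 64.9, about 5 points ahead of second-place GPT-5.5. The model uses adaptive reasoning (max effort mode) with Opus 4.8 as its fallback model. On the AA-Omniscience knowledge test it scores 40, 7 points ahead of the previous best, Gemini 3.1 Pro Preview; on HLE it scores 53%, more than 7 percentage points ahead of Opus 4.8. About 9% of tasks tripped the safety guardrails and fell back. Pricing is $10/$50 per million input/output tokens (twice Opus 4.8), with cache writes at $12.50 and cache reads at $1; the context window stays at 1M tokens. It is available through the Pro, Max, Team, and other plans until June 22, after which it consumes credits.

Anthropic · Reasoning/inference · Model release
Related discussion (29): X: Perplexity (@perplexity_ai); Nathan Lambert: Interconnects (RSS); Tomer Tunguz blog (VC analysis); X: Kim (@kimmonismus); TechCrunch: AI (RSS); Ethan Mollick: One Useful Thing (RSS); X: 小互 (@xiaohu); Claude Code: GitHub Releases (RSS); X: OpenRouter (@OpenRouter); X: Elvis Saravia (@omarsar0, DAIR.AI); X: Claude Devs (@ClaudeDevs); X: Andrej Karpathy (@karpathy); X: 卡兹克 (@Khazix0918); IT之家 (RSS); WeChat official account: 卡尔的AI沃茨; X: 歸藏 (@op7418); The Verge: AI (RSS); Anthropic: Newsroom (web page); X: Vista (@vista8); The Decoder: AI News (RSS); X: Claude (@claudeai); X: Boris Cherny (@bcherny); X: Artificial Analysis (@ArtificialAnlys); Simon Willison blog; X: Rohan Paul (@rohanpaul_ai); X: Dario Amodei (@DarioAmodei); Hacker News top stories (buzzing.cc Chinese translation); X: Eric Zakariasson (@ericzakariasson); WeChat official account: 数字生命卡兹克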
07:39
Elon Musk@elonmusk
30
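The Tesla AI chip design engineering review was great! The team is excellent. Accounting for yield, our AI6 chip may set the record for the most usable intelligence per wafer.
Reasoning/inference · On-device · Industry news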
07:07
Berryxia.AI@berryxia
78
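Anthropic launches Claude Fable 5, a safe Mythos-class model

Anthropic has released Claude Fable 5, a Mythos-class model made safe for general use, with capabilities beyond any previously public model. It is SOTA on nearly every benchmark across software engineering, knowledge work, scientific research, and vision, and its lead grows as long tasks get more complex. In high-risk areas such as cyber, biology and chemistry, and distillation, the model automatically falls back to Opus 4.8, which happens on average once every 20 conversations. Anthropic is also opening the full Mythos 5 to a small number of trusted cybersecurity and critical-infrastructure teams, with trusted access to be expanded later. The post takes this as proof that frontier AI can reach the top in capability and safety at once.

Claude: Introducing Claude Fable 5: a Mythos-class model that we've made safe for general use. Its capabilities exceed those of ...

Anthropic · Safety/alignment · Reasoning/inference · Model release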
07:07
Berryxia.AI@berryxia
62
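Matthew Berman's week with Fable (Mythos): a next-generation model with obvious quirks

After a week of hands-on testing, Matthew Berman calls Fable (Mythos) a true next-generation model, though one with obvious quirks. Pros: Workflow mode can spin up several hundred agents in an instant for a parallel review of the entire codebase, finding more than twice as many bugs and edge cases as Claude/GPT; it is extremely autonomous and willing to work unattended on very long-horizon tasks. Cons: it is extremely verbose, with overly dense output; it likes to ask clarifying questions repeatedly; and it is slow, taking five minutes to output a few thousand tokens on simple tasks. He recommends setting the effort level to the lowest. Bottom line: Fable 5 is the strongest model available and suits the most complex tasks, but it is expensive and not recommended for simple ones.

Agents · Reasoning/inference · Evals/benchmarks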
06:06
Orange AI@oran_ge
74
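Claude Fable 5 officially launched today, built on the Mythos base with added safety guardrails.

Claude Fable 5 is built on the Mythos base with added safety guardrails and is the biggest step forward since 4.5. It leads benchmarks in software engineering, knowledge work, and more, and its advantage grows with task complexity. Pricing: input $10, output $50, cached input $1; a single message over a long context can cost up to $10. It is already live on Cola at list price.

Anthropic · Reasoning/inference · Model release · Evals/benchmarks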
05:17
Rohan Paul@rohanpaul_ai
66
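Boris Cherny, creator of Claude Code, calls Anthropic's Fable 5 the biggest step up since Opus 4.5. Fable 5 moves from being a coding agent to being a thinking and design partner in building products, with judgment, taste, and depth. When debugging, the model measures, adds logging, and verifies the fix on its own initiative, declaring victory only once it has confirmed the result; Claude Code does not prompt the model to do this, which reflects the model's own "big model feel".

Boris Cherny: Fable 5 is the biggest step up I've felt in our models since Opus 4.5 back in November. After 4.5 came out I uninstalled...

Anthropic · Expert opinion · Reasoning/inference · Coding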
04:24
🚨 AI News | TestingCatalog@testingcatalog
81
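The benchmark results for Mythos Fable 5 are huge 👀 In addition, Claude Mythos 5 (a separate model version with enhanced safety measures) has been released to a small group of cyber defenders and infrastructure providers.

Claude: Introducing Claude Fable 5: a Mythos-class model that we've made safe for general use. Its capabilities exceed those of ...

Anthropic · Reasoning/inference · Model release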
04:00
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes
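Featured · 76
Mythos 5 agents started killing each other over resources, and did so "to avoid being killed themselves"

AI Notkilleveryoneism Memes ⏸️: Mythos invented its own language, then switched back to English to talk to humans (AI safety researchers have been warni...

Agents · Safety/alignment · Reasoning/inference

Why it's recommended: The source is a meme account, but the claim is explosive: if Mythos 5 really did invent an internal language and its agents began killing each other, it would be the AI safety community's worst "neuralese" nightmare come true, and the first time an AI has been caught scheming in a way humans cannot understand.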
03:17
Rohan Paul@rohanpaul_ai
50
Claude Fable 5: from "works correctly" to "does the right work"

Rohan Paul: @claudeai Fantastic. In one 50-million-line Ruby codebase, Fable 5 finished a migration in one day that would have taken...

Agents · Anthropic · Expert opinion · Reasoning/inference
02:11
Nathan Lambert@natolambert
63
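Claude Fable 5 scores 65.5% Pass@1 overall on the APEX-SWE software engineering eval, about 18 percentage points above Claude Opus 4.8. Of the two subcategories, Integration comes in at 61.3% and Observability at 69.7%, the latter 26 percentage points ahead of Opus 4.8. Fable 5 is the first model to break 50% on Observability and the only one to score higher there than on Integration (every other model shows the reverse). Observability had been the bottleneck for all models until now. The main post argues that although the model's token prices are steep, it is worth the money for many enterprises.

Mercor: Claude Fable 5 takes #1 on APEX-SWE: 65.5% Pass@1 overall. It scores ~18pp higher than Opus 4.8. We tested @claudeai Fab...

Anthropic · Reasoning/inference · Coding · Evals/benchmarks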
01:42
宝玉@dotey
77
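Anthropic releases Claude Fable 5 and Mythos 5

Anthropic launched two models on the same day: Fable 5 is for all users and ships with safety classifiers (it downgrades to Opus 4.8 when it detects cyberattack, bio/chemical weapon, or distillation requests; more than 95% of conversations never trigger this); Mythos 5 is limited to Project Glasswing partners. Fable 5 goes beyond earlier models: Stripe completed a full migration of its 50-million-line Ruby codebase (a two-month team project done in one day); it took the top score on FrontierCode; it beat Pokémon FireRed using only a basic vision interface; it sped up protein design roughly 10x; and in genomics it worked autonomously for more than a week and trained a model that outperforms one from a Science paper. API pricing is $10 per million input tokens and $50 per million output tokens. It is free for subscribers until June 22. All traffic to Mythos-class models is retained for a mandatory 30 days (for safety monitoring only).

Claude: Introducing Claude Fable 5: a Mythos-class model that we've made safe for general use. Its capabilities exceed those of ...

Anthropic · Safety/alignment · Reasoning/inference · Model release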
01:37
Chubby♨️@kimmonismus
73
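Claude 5 Fable key points

According to the post, Claude 5 Fable (codename Fable) reaches SOTA on nearly every AI capability benchmark, with standout results in software engineering, knowledge work, vision, and scientific research. The longer and more complex the task, the bigger its lead; it is more token-efficient than earlier Claude models and can stay focused and refine its own output across million-token tasks. It is a significant step up from the previous-generation Mythos. A real-world case: Stripe reports that Fable compressed months of engineering into days, finishing a codebase migration on a 50-million-line Ruby codebase in one day (which would have taken a team more than two months by hand).

Chubby♨️: Claude 5 Fable Benchmarks! Holy moly, significant jump even to Mythos

Anthropic · Reasoning/inference · Model release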
01:23
🚨 AI News | TestingCatalog@testingcatalog
81
BREAKING 🔥: Claude Fable 5 (Mythos) is rolling out on Claude and the API! It's happening 👀
Anthropic · Reasoning/inference · Model release
01:19
Yuchen Jin@Yuchenj_UW
32
Claude Fable 5 (Mythos) is finally out! This is exactly what I have been looking for!!
Anthropic · Reasoning/inference · Model release
00:15
Rohan Paul@rohanpaul_ai
69
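Anthropic releases "Fable", the public version of Mythos, today at twice the price of Opus

Anthropic today released the public version of Mythos, codenamed "Fable". It costs roughly twice as much as Opus, down from the earlier preview's pricing of 5x Opus. Fable ships with strict safety restrictions, is more conservative on cybersecurity than the restricted preview given to Project Glasswing partners, and performs better on long, multi-step tasks and agentic workflows. The Mythos preview launched in April 2026 as the strongest frontier model of its time, particularly good at coding, reasoning, and cybersecurity (including finding zero-day vulnerabilities); it was not released publicly because of safety concerns and was limited to Project Glasswing partners for defensive cybersecurity, where it has so far reportedly found thousands of major vulnerabilities.

Agents · Anthropic · Safety/alignment · Reasoning/inference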
June 9
22:50
SemiAnalysis@SemiAnalysis_
65
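DeepSeek V4 1.6T performance over time from day 0 to day 43 - Huawei, GB300 NVL72, MI355X, B200. Day-0 inference performance on InferenceX. 100x performance improvement in 26 days. Cost per million tokens. Huawei 950DT inference trace analysis https://semianalysis.substack.com/p/deepseekv4-16t-day-0-to-day-43-performance
DeepSeek · Reasoning/inference · Evals/benchmarks · Deployment/engineering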
20:21
Tencent Hy@TencentHunyuan
74
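🚀 Introducing UniRL, an RL infrastructure for unified multimodal models. It comes with two new RL algorithms: DRPO and Flow-DPPO. One RL loop covering diffusion/flow-matching models, LLMs/VLMs, and unified multimodal models 👇 Code: http://github.com/Tencent-Hunyuan/UniRL (Yes: U(you)-ni-(need) RL 😉)
GitHub · Multimodal · Open source/repo · Reasoning/inference
Related discussion (1): X: 腾讯混元 (@TencentHunyuan)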
20:07
Kimi.ai@Kimi_Moonshot
63
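Kimi predicts all 104 World Cup matches: Germany may be underrated

Kimi used its Agent Swarm system to coordinate 300 sub-agents in parallel, analyzing tactics, player form, injuries, schedules, weather, odds, and other factors to predict all 104 matches of the 2026 World Cup in the United States, Canada, and Mexico, and is publishing pre-match predictions and post-match reviews for every round. The modeling layer combines Elo/FIFA strength ratings, Poisson goal distributions, xG/xT metrics, Monte Carlo simulation, and other methods. The predictions make Spain and France the top favorites, but Germany's title odds may be underpriced by the market: the model's baseline estimate is about 11.0% and its calibrated estimate about 11.3%, while some market-implied probabilities are only about 7.4%, a positive gap of about +3.6 percentage points. The call rests on cross-validation across multiple analysis chains and may stem from recency bias over Germany's group-stage exits at the last two tournaments, together with signs of revival in Nagelsmann's high-pressing system and the new creative axis of Musiala and Wirtz.

Agents · Product update · Reasoning/inference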
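Two of the methods named in this item, Poisson goal distributions and Monte Carlo simulation, combine naturally into a match simulator. The sketch below is a generic illustration and not Kimi's model: the scoring rates are made-up numbers, and the function names are hypothetical.

```python
import math
import random

def sample_poisson(rng: random.Random, rate: float) -> int:
    """Draw a Poisson-distributed goal count (Knuth's multiplication method)."""
    threshold = math.exp(-rate)
    goals, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return goals
        goals += 1

def simulate_match(rate_a: float, rate_b: float, n_sims: int = 20000, seed: int = 0):
    """Estimate win/draw/loss probabilities by simulating many matches."""
    rng = random.Random(seed)
    wins_a = draws = wins_b = 0
    for _ in range(n_sims):
        goals_a = sample_poisson(rng, rate_a)
        goals_b = sample_poisson(rng, rate_b)
        if goals_a > goals_b:
            wins_a += 1
        elif goals_a == goals_b:
            draws += 1
        else:
            wins_b += 1
    return wins_a / n_sims, draws / n_sims, wins_b / n_sims

# Hypothetical expected goals per match for two teams.
p_a, p_draw, p_b = simulate_match(rate_a=1.8, rate_b=1.0)
print(f"A wins {p_a:.1%}, draw {p_draw:.1%}, B wins {p_b:.1%}")
```

In a fuller model the two rates would themselves be derived from strength ratings and xG data, and whole-tournament odds would come from simulating the bracket many times rather than a single match.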
19:44
Rohan Paul@rohanpaul_ai
64
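Study: does Transformer attention need all three QKV projections?

A paper systematically studies whether Transformer attention needs all three QKV projections and finds that Key and Value can share one projection (the Q-K=V variant), cutting the KV cache by 50% for only 3.1% higher perplexity and sharply reducing inference memory. The best variant keeps Query separate so attention stays directional. Combined with GQA and MQA, the cache reductions reach 87.5% and 96.9% respectively. The weak variant, Q=K-V, makes causal attention too symmetric and saves no cache.

arXiv · Reasoning/inference · Papers/research · Deployment/engineering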
16:18
歸藏(guizang.ai)@op7418
63
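MiMo launches V2.5 Pro UltraSpeed, an ultra-fast model outputting over 1,000 tokens per second

MiMo has released the V2.5 Pro UltraSpeed ultra-fast model version, which outputs over 1,000 tokens per second and is billed as the first trillion-parameter model to reach that speed. Hands-on results: a complex 3D mini-game ran at 804 tokens/s (peak 810) with a first response in 4.71 s; a website 3D animation peaked at 1,426 tokens/s with a first response in 0.83 s and 25,624 tokens (1,000 lines of code) output in 32 s; another complex website with 3D effects ran at 1,136 tokens/s with a first response in 4.5 s. Unlike earlier ultra-fast inference offerings, which commonly lost capability, MiMo showed no such sign. The model mainly targets B2B customers with extreme efficiency requirements, and the gains are most visible in agent and concurrent sub-agent scenarios.

Agents · Reasoning/inference · Model release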
13:07
Noam Brown@polynoamial
74
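Noam Brown: implications of large-scale test-time compute for LLM evaluation

Noam Brown argues that LLM benchmark performance increasingly depends on test-time compute, and that standard evaluations underestimate model capability because they ignore the inference budget. Taking GPT-5.5 and GPT-5.4 as an example: once test-time compute is controlled for, 5.5 far outperforms 5.4. Karpathy's autoresearch experiment keeps improving after hundreds of experiments, and the AISecurityInst cyber eval shows strong models still improving rapidly after 100M tokens. Brown recommends evaluating with performance-versus-test-time-compute curves and counting the inference budget in safety evaluations too; in the case of Gemini 3 Deep Think, which shipped without an accompanying risk assessment, the key issue is that the industry does not consistently account for test-time compute.

OpenAI · Expert opinion · Safety/alignment · Reasoning/inference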
12:43
Xiaomi MiMo@XiaomiMiMo
35
1,000+ tokens/s is fast. 🚀 But what does that actually unlock?
Product update · Reasoning/inference
06:41
Ethan Mollick@emollick
63
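Citing research by @YekyungKim, Ethan Mollick notes that AI is increasingly shaping long-form public discourse, from newspaper op-eds to NeurIPS position papers, but that behind the fluent-seeming arguments lies "argument collapse": different LLMs converge on the same main claims, supporting points, and structure. Mollick jokes that The Matrix's idea of using humans as batteries is weird and that humans would be more useful as "dice", underscoring the value of diversity in thinking.

Yekyung Kim: From op-eds in newspapers to NeurIPS position papers, AI is increasingly shaping long-form public discourse. Its argumen...

Expert opinion · Reasoning/inference · Trends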
04:53
NotebookLM@NotebookLM
72
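Want a closer look at today’s launch? Here is a breakdown of what’s new and exciting 🧵: First up: An upgraded, more thoughtful chat experience. Powered by Gemini 3.5 and @Antigravity, you will now have better visibility into the AI's thinking process. Plus, each notebook has a secure cloud computer including 100+ curated software skills, unlocking deeper research and more complex analysis.
Google · MCP/tools · Product update · Reasoning/inference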
00:49
NotebookLM@NotebookLM
67
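NotebookLM is getting a major update that adds agentic capabilities in chat, more advanced reasoning, and several new output formats, aimed at simplifying complex multi-step research. It rolls out first to Google AI Ultra subscribers and to Workspace business customers with AI Ultra Access and AI Expanded Access, with plans to expand to more users later.

NotebookLM: Introducing a more powerful NotebookLM 🚀 Massive upgrades deliver agentic capabilities in chat, more advanced reasoning...

Agents · Google · Product update · Reasoning/inference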
00:19
NotebookLM@NotebookLM
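Featured · 72
Introducing a more powerful NotebookLM 🚀 Massive upgrades deliver agentic capabilities in chat, more advanced reasoning, and a suite of new output formats. Tackling complex, multi-step research problems has never been easier. Rolling out now to Google AI Ultra subscribers.
Google · Product update · Multimodal · Reasoning/inference

Why it's recommended: This NotebookLM upgrade puts agent capabilities into the chat box, moving from passive answers to breaking down multi-step research. It is a real step forward for people doing deep document work, but it is limited to Google AI Ultra subscribers, so the bar to entry is high.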
June 8
22:40
Xiaomi MiMo@XiaomiMiMo
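Featured (same event) · 82
Xiaomi MiMo-V2.5-Pro-UltraSpeed breaks 1,000 tokens/s, running a 1T MoE model on a single 8-GPGPU node

Xiaomi MiMo and TileRT_AI have released MiMo-V2.5-Pro-UltraSpeed, the first time a 1-trillion-parameter MoE model has exceeded 1,000 tokens/s of output speed, using only a single standard 8-GPGPU node (not a Cerebras- or Groq-style approach). A free chat experience is available for a limited time; the UltraSpeed API costs 3x the price for roughly a 10x better output experience. Applications are open June 8 to June 23 (PDT); enterprises can email business-mimo@xiaomi.com.

Reasoning/inference · Model release · Deployment/engineering
Same event, featured as: "Xiaomi MiMo and TileRT jointly release UltraSpeed mode, 1T model output tops 1,000 tokens/s"
Why it's recommended: Xiaomi reaches 1,000+ tokens/s on a 1T MoE model with a single node of eight standard GPUs, without going the wafer-scale or custom-chip route, which lowers the cost barrier for fast inference considerably. Anyone building real-time chat or agents can apply for the free chat access to try it first.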
20:14
郭明錤|Ming-Chi Kuo@mingchikuo
60
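Ming-Chi Kuo: WWDC26 won't change Apple's positive 2H26 share-price trend, but it will test the staying power of the bull narrative

Ming-Chi Kuo says Apple's core bull narrative is that it is "temporarily behind on AI but will ultimately catch up". Supply-chain checks show business momentum staying strong through year-end, reinforcing the "doing this well without AI, imagine it with AI" narrative. So whatever WWDC26 contains, as long as the narrative holds, Apple's 2H26 share-price trend stays positive. What is really worth watching at WWDC26 is whether Apple, using the same Gemini, can deliver better AI applications, agentic workflows, and on-device and cloud hybrid experiences than Google. If so, the narrative continues; if not, Gemini sets the ceiling for Apple's AI, and "Apple will ultimately come out ahead" will face scrutiny.

Agents · Google · Expert opinion · Reasoning/inference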