Nvidia just significantly eased the memory bottleneck for Vera Rubin by qualifying HBM4 from Samsung, SK Hynix, and Micron and moving those parts into full production. Per news reports, Vera Rubin HBM4 share: SK Hynix 60–70%, Samsung 25–30%, Micron the rest.

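Translated summary: NVIDIA has officially qualified HBM4 memory from Samsung, SK Hynix, and Micron and moved it into volume production to relieve the memory bottleneck for the Vera Rubin supercomputer. Reported HBM4 share for Vera Rubin: SK Hynix 60–70%, Samsung 25–30%, Micron the remainder. SK Hynix and NVIDIA have also entered a multi-year partnership to co-develop memory for platforms such as the Vera Rubin AI supercomputer and to use tools such as CUDA-X and PhysicsNeMo to accelerate chip design and semiconductor simulation. Both companies stress that advanced DRAM and HBM must be co-designed years in advance.

Rohan Paul@rohanpaul_ai · Jun 8 · 49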

A primer paper on how reasoning models improve after training. It shows that better reasoning models depend less on raw data size and more on checkable training evidence. Reasoning data is NOT simple question-and-answer pairs. The useful part is often the feedback that says why an answer, step, tool action, or full attempt was good or bad. A prompt and a response tell you what a model said, but not why that answer became learnable, which judge blessed it, which failures were hidden, or whether the skill was already inside the base model. The core idea is to describe each training example as a record that includes the task, the model’s behavior, the checking signal, and metadata about where it came from. The authors sort reasoning data by how it can be checked, such as exact rule-based checks for math and code, environment checks for agents using tools, and human or model judgments when no exact checker exists. They also explain why common assumptions fail: long reasoning traces may be fake, harder examples may be useless for some models, and larger datasets may still miss important coverage. The key point is that agent data should preserve mess: failed actions, retries, recoveries, state differences, and terminal checks, because that is where learning signal often lives. ---- Link – arxiv.org/abs/2606.02113 Title: "A Primer in Post-Training Reasoning Data: What They Know About How It Works"

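Translated summary: The paper argues that better reasoning models depend more on verifiable training evidence than on raw data scale. What matters in reasoning data is not simple question-answer pairs but the feedback signal that judges whether an answer, step, tool action, or full attempt was good or bad. Each training sample should be described as a record containing the task, the model's behavior, the checking signal, and metadata. The authors classify data by how it is checked: exact rules for math and code, environment checks for tool-using agents, and human or model judgment when no exact checker exists. Common misconceptions: long reasoning chains may be fake, harder examples may be useless for some models, and larger datasets may still miss key coverage. Agent data should keep the "messy" information (failed actions, retries, recoveries, state differences, and terminal checks), because the learning signal often lives there.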
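
The record described above (task, behavior, checking signal, provenance metadata) can be sketched as a small data structure. This is only an illustration under assumptions: the class names, field names, and the three check categories below are hypothetical labels for the ideas in the post, not the paper's own schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class CheckKind(Enum):
    RULE = "rule"                # exact checkers, e.g. math answers or unit tests
    ENVIRONMENT = "environment"  # end state of a tool-using agent episode
    JUDGMENT = "judgment"        # human or model judge when no exact checker exists


@dataclass
class ReasoningRecord:
    # Hypothetical schema: one training example plus the evidence that checked it.
    task: str
    behavior: list[str]          # the full attempt, including failed steps and retries
    check_kind: CheckKind
    passed: bool
    provenance: dict[str, str] = field(default_factory=dict)


def exactly_checkable(records):
    """Keep records whose signal comes from a rule or environment check."""
    return [r for r in records if r.check_kind is not CheckKind.JUDGMENT]


records = [
    ReasoningRecord("2 + 2 = ?", ["4"], CheckKind.RULE, True, {"source": "synthetic"}),
    ReasoningRecord("rename a file", ["mv a b (failed)", "mv ./a ./b"], CheckKind.ENVIRONMENT, True),
    ReasoningRecord("write a haiku", ["draft"], CheckKind.JUDGMENT, True, {"judge": "model"}),
]
print(len(exactly_checkable(records)))
```

Keeping the failed first attempt inside `behavior` mirrors the point that agent data should preserve retries and recoveries rather than only the final answer.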

elvis@omarsar0 · Jun 8 · 59

This was one of the standout AI papers of the week. (bookmark it) It tackles a question most self-improving AI agents ignore: is the agent actually discovering anything, or just remixing what it already knows? How can you tell whether the agent is doing real discovery or just confident retrieval? The authors give three clean buckets: - Retrieval is looking something up in a notebook you already have. - Search is combining tools you already own in new ways. - Discovery is inventing a new concept that wasn't in your toolkit before. The issue is that most agents stop at the first two. The math behind their definition (category theory plus a left Kan extension, if you care) is basically a bookkeeping trick to ask: could the old version of me have produced this result? If yes, it's not discovery. If no, something genuinely new showed up. They build a Builder/Breaker agent that studies protein mechanics. Over four rounds, the model's fit accuracy actually drops (R² goes from 0.48 to 0.68 to 0.54 to 0.41). At first glance, that looks like a failing agent. It isn't. The agent kept taking on harder proteins and rewriting its theory to cover them. Data grew almost 10x while the model code grew only 1.3x. A smaller theory covering a bigger world is exactly what good science looks like. Why does it matter? If you optimize for accuracy alone, your self-improving agent will just settle into easy benchmarks and stop. This paper offers a cleaner success signal and asks whether the agent is compressing more of the world into less code over time. Paper: https://arxiv.org/abs/2606.01444 Learn to build effective AI agents in our academy: https://academy.dair.ai/

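Translated summary: One of this week's AI papers asks whether self-improving agents truly discover new knowledge or merely recombine what they already know. The authors sort behavior into three classes: retrieval (looking something up in an existing notebook), search (combining existing tools), and discovery (inventing a new concept), defined with category theory and a left Kan extension: if the old version could have produced the same result, it is not discovery. They build a Builder/Breaker agent to study protein mechanics. Over four rounds R² rises from 0.48 to 0.68, then falls to 0.54 and 0.41, which looks like regression but reflects the agent taking on harder proteins and rewriting its theory: data grew nearly 10x while model code grew only 1.3x. The paper proposes code compression as the signal of real discovery. Link: arxiv.org/abs/2606.01444.

Rohan Paul@rohanpaul_ai · Jun 7 · 45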

"Pretty soon, competition math, competition coding, is not going to be interesting anymore. I'll be disappointed if we don't have a model out by next year that anybody can use to get a perfect score on the IMO (International Math Olympiad)."

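Translated summary: "Pretty soon, competition math and competition coding will no longer be interesting. I'll be disappointed if, by next year, we don't have a model that anyone can use to get a perfect score on the International Math Olympiad (IMO)."

jason@jxnlco · Jun 7 · 8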

Having waited in line at the coffee shop at work I agree.

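Translated summary: Having waited in line at the coffee shop at work, I agree. Quoting @ghosttyped: What will people do after AGI? Wait in line, of course.

Rohan Paul@rohanpaul_ai · Jun 7 · 62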

Great idea for self-evolving AI scientists from this new MIT paper. Tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty for AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- arxiv.org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

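Translated summary: The MIT paper (F.Y. Wang & M.J. Buehler, arXiv:2606.01444, 2026) proposes the Self-Revising Discovery Systems framework, which lets an AI scientist recognize when its current way of thinking is inadequate and add new scientific concepts rather than just searching harder. The system treats data, models, tool outputs, failures, and claims as typed artifacts (typed provenance), which separates three modes: retrieval (adding known objects), search (exploring a fixed schema), and discovery (a verifiable schema change). The paper defines genuine novelty mathematically via a Kan obstruction and a left Kan extension: novelty is quantified by the pointwise residual left after old evidence is transported, which makes it objectively measurable. Case studies include a Builder/Breaker model that discovers mode-conditioned compliance in proteins and CategoryScienceClaw, which discovers stiffness rules for anisotropic fiber networks.

Rohan Paul@rohanpaul_ai · Jun 7 · 66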

New MIT paper with a great idea for self-evolving AI scientists. It tries to make an AI scientist notice when its current way of thinking is too small, then add new scientific concepts instead of merely searching harder. The problem is that most AI science systems still search inside a fixed setup, even when real science sometimes needs new kinds of variables, tools, tests, or claims. The paper’s core idea is to make every data point, model, tool output, failure, and claim a typed artifact, where typed means the system records what kind of thing it is and how it was produced. Then the system can tell the difference between retrieval, which adds known things, search, which explores a fixed setup, and discovery, which changes the setup itself. So novelty for AI scientists is not defined by surprise, fluency, or benchmark gain, but by what could not be expressed inside the previous schema. A serious attempt to formalize something most AI systems still fake: the difference between finding an answer inside a language and earning the right to change the language. ---- arxiv.org/abs/2606.01444 Title: "Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic AI"

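Translated summary: An MIT team proposes a framework for self-evolving AI scientists. The core innovation is letting the AI recognize that its current reasoning space is too small and actively add new scientific concepts rather than only searching within a fixed schema. The paper treats data points, models, tool outputs, failures, and claims as typed artifacts and clearly separates retrieval (adding known objects), search (exploring a fixed schema), and discovery (a verifiable schema extension). Using typed copresheaves and Kan obstruction theory, it argues that genuine discovery is a verifiable schema extension: old evidence is transported by a left Kan extension, and novelty is quantified by the pointwise residual. Case studies include a Builder/Breaker model that discovers mode-conditioned compliance in proteins and CategoryScienceClaw, which discovers stiffness rules for anisotropic fiber networks. Paper: arXiv:2606.01444 (2026).

SemiAnalysis@SemiAnalysis_ · Jun 6 · 61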

Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding failed matches.

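Translated summary: Sequential Monte Carlo speculative decoding from @makora_ai keeps multiple draft tokens alive in parallel instead of rewinding on failed matches.

SemiAnalysis@SemiAnalysis_ · Jun 6 · 49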

@makora_ai 's sequential Monte Carlo speculative decoding keeps multiple draft tokens alive in parallel instead of rewinding failed matches

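Translated summary: @makora_ai's sequential Monte Carlo speculative decoding keeps multiple draft tokens alive in parallel instead of rewinding on failed matches.

Rohan Paul@rohanpaul_ai · Jun 6 · 48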

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/anthropic-just-disclosed-that-claude 🗞️ Anthropic says 80% of its new production code is now authored by Claude 🗞️ New Google paper shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70% 🗞️ Google’s new open source Gemma 4 12B can analyze audio and video while running fully locally on a consumer 16GB GPU 🗞️ Alibaba’s Qwen3.7-Plus supports text, video, and image inputs at a low price of $0.4/$1.6 per 1M tokens, though it remains proprietary. 🗞️ Anthropic’s new chemistry report has a genuinely wild result.

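Translated summary: Anthropic says 80% of its new production code is written by Claude. A new Google paper shows that general LLMs, by planning proofs and verifying step by step, raise formal math performance from under 10% to 70%. Google's open-source Gemma 4 12B runs locally on a consumer 16GB GPU and supports audio and video analysis. Alibaba's Qwen team released Qwen3.7-Plus, which accepts text, video, and image inputs at $0.4/$1.6 per 1M tokens and is closed-source. Anthropic's new chemistry report contains a striking result.

Chubby♨️@kimmonismus · Jun 6 · 65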

AI scientists may be moving from search to real discovery. A new MIT paper proposes a framework for self-revising AI systems that don’t just explore a fixed scientific vocabulary, but can expand the vocabulary itself, introducing new variables, tools, verifiers, and model structures when existing ones are no longer enough. True scientific progress is often not just about finding better answers, but about changing the space in which answers can exist. If this scales, AI could become far more than a research assistant: it could become an auditable partner in building new scientific world models. Still early, but conceptually very exciting.

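Translated summary: The Buehler group at MIT proposes the Self-Revising Discovery Systems framework, which lets AI expand its scientific vocabulary (variables, tools, verifiers, model structures) on its own instead of only searching a fixed space. The paper formalizes agent workflows with typed copresheaves and Kan obstructions and argues that genuine discovery is a verifiable schema extension: old evidence is migrated through a left Kan extension, and novelty is quantified objectively by the pointwise residual, which separates discovery from search. Three modes: retrieval (adding known objects), search (fixed schema), and discovery (a verified schema change). Case studies include Builder/Breaker discovering mode-conditioned compliance in proteins and CategoryScienceClaw discovering stiffness rules for anisotropic fiber networks. Paper: arXiv:2606.01444 (2026).

Rohan Paul@rohanpaul_ai · Jun 6 · 79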

Anthropic’s new chemistry report has a genuinely wild result. Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum. NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra. So Opus 4.7 is no longer just “helping chemists read data”: it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists. Note that Opus 4.7 is a general-purpose model with no chemistry-specific fine-tuning. Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools. So AI can now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.

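Translated summary: Anthropic's latest chemistry report shows that the general-purpose model Claude Opus 4.7 (no chemistry fine-tuning) matches or even beats the dedicated software MestReNova on NMR spectrum analysis, with the smallest hydrogen prediction errors and nearly identical carbon predictions. More importantly, it can work backward from an NMR spectrum to the molecular structure, a task previously left to human chemists. This means AI can now handle a key bottleneck in chemistry: translating automatically between molecular structure, spectrum, and final confirmation.

Rohan Paul@rohanpaul_ai · Jun 5 · 93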

Anthropic just called for a global way to slow frontier AI because its own models may be approaching recursive self-improvement, where a system helps build a stronger version of itself without direct human control. Future models will become so good at research, experiments, debugging, and training design that humans will stop being the main bottleneck. Once that loop starts, progress could shift from human-paced engineering to machine-assisted improvement, which makes every safety test, law, and lab policy feel late by default. Anthropic says this has not happened yet, but warns that the jump may arrive before governments, companies, and researchers have a trusted way to measure or restrain it. The hard part is verification, because a huge AI training run is easier to hide than a weapons site, and any lab that secretly keeps training while others pause could gain the lead. Anthropic is now valued at ~$1T, may reach $50B annualized revenue, and competes fiercely with OpenAI, so every safety claim also lands inside a giant business fight. --- anthropic.com/institute/recursive-self-improvement

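Translated summary: Anthropic publicly calls for global action to slow frontier AI development because its Claude models may be approaching recursive self-improvement (a system helping build a stronger version of itself without human control). It has not happened yet, but the jump could arrive suddenly, and an AI training run is harder to detect than a weapons depot. Claude now writes more than 80% of merged production code, and engineer output is 8x the 2024 baseline; reliable task length doubles every 4 months, and Mythos Preview can work continuously for more than 16 hours; the training-code speedup jumped from 3x to 52x (humans reach only 4x). The only remaining human advantage is research judgment. Anthropic is valued at about $1 trillion, may reach $50 billion in annualized revenue, and competes fiercely with OpenAI.

Chubby♨️@kimmonismus · Jun 5 · 47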

I've read the comment several times now that this is IPO talk. And it's a fair comment. Yes, both OpenAI and Anthropic are currently talking about RSI. And yes, both are planning an IPO in 2026. A model like Mythos and an article about RSI appear at just the right time, which naturally makes it seem odd. But if you read through the noise and look at the evidence, you can see it. And at least the data that Anthropic provides suggests the validity of their thesis, at least based on what has been presented. At the same time, Dario Amodei started talking about RSI as early as 2024, saying he didn't consider it far-fetched, long before the IPO, and discussed it in his article "Machines of Loving Grace." Something similar happened with OpenAI. In short: it's not just empty talk, but has a valid basis, although real-world use cases will probably soon be demonstrated using this myth-like model, thus providing a more solid foundation for the debate. But I consider their statements to be more than just IPO rhetoric.

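Translated summary: Kim responds to skepticism that the recent RSI statements from Anthropic and OpenAI are merely hype ahead of 2026 IPOs. Citing Anthropic's data: even if model capability froze, the spread of agents would let a 100-person company do the work of 1,000 people, and actual progress has already outpaced internal exponential assumptions. The length of tasks models can complete autonomously is doubling faster: in March 2024 Claude Opus 3 completed 4-minute tasks, a year later Sonnet 3.7 reached 1.5 hours, and another year later Opus 4.6 reached 12 hours, with the doubling period shrinking from 7 months to 4 months. If the trend holds, multi-day tasks become feasible this year. OpenAI endorses the same direction.

SiliconFlow@SiliconFlowAI · Jun 5 · 64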

DeepSeek at #1 on @OpenRouter token share — 4 weeks running And we're proud to be powering a big slice of it You can find the complete @deepseek_ai lineup on @SiliconFlow: → V4 Pro & Flash ( best price/performance 🔥) → V3.2 · V3.2 Exp · V3.1 · V3.1 Terminus · V3 0324 · R1 0528

Translated summary: DeepSeek ranks #1 in token share on @OpenRouter, four weeks running. We're proud to power a large part of it. You can find the complete @deepseek_ai lineup on @SiliconFlow: → V4 Pro & Flash (best price/performance 🔥) → V3.2 · V3.2 Exp · V3.1 · V3.1 Terminus · V3 0324 · R1 0528

Tencent Hy@TencentHunyuan · Jun 5 · 74

Planning is where LLMs move from “saying” to “doing.” Tencent Hy, in collaboration with the Gaoling School of Artificial Intelligence at Renmin University of China, is excited to open-source PlanningBench - a scalable, verifiable framework for evaluating and training LLM planning capabilities. With PlanningBench, you get: ✅ 30+ real-world planning tasks ✅ Automated verification ✅ Evaluation and training support See how top-tier LLMs perform on PlanningBench 👇 Resources: arXiv: https://arxiv.org/abs/2605.20873 GitHub: https://github.com/Tencent-Hunyuan/PlanningBench HuggingFace: https://huggingface.co/datasets/tencent/PlanningBench #PlanningBench #TencentHunyuan #OpenSource 📷

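Translated summary: Tencent Hunyuan, in collaboration with the Gaoling School of Artificial Intelligence at Renmin University of China, has open-sourced PlanningBench, a scalable, verifiable framework for evaluating and training LLM planning capabilities. It includes 30+ real-world planning tasks and supports automated verification and training. PlanningBench aims to advance the planning abilities that move LLMs from "saying" to "doing." Resources are available on arXiv, GitHub, and HuggingFace.

Yuchen Jin@Yuchenj_UW · Jun 5 · 51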

Think of yourself as an LLM. Every social interaction, every meeting, burns your tokens. Unless someone is a paid subscriber to your attention, you are under no obligation to answer low-quality prompts.

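Translated summary: Think of yourself as a large language model. Every social interaction and every meeting consumes your tokens. Unless someone is a paid subscriber to your attention, you have no obligation to answer low-quality prompts.

Rohan Paul@rohanpaul_ai · Jun 5 · 60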

Harness-1 makes search agents better by moving memory work out of the model and into a helper system. It shows that intelligence performs better when the environment stops forcing it to spend cognition on bookkeeping. The argument is that search agents should stop using the LLM as the notebook and let a separate harness track the search state. The paper shows that a 20B model improved search by doing less inside its own head. The problem is that normal search agents must both think about the next search and remember every document, clue, failed path, and remaining check inside the same limited context. This formulation puts too much routine state management inside the policy. Harness-1 separates those jobs. The model keeps the hard semantic choices: what to search, what to inspect, what to verify, and when the evidence is good enough. The harness keeps the recoverable state: candidate pools, curated documents, importance tags, evidence links, verification records, deduplicated observations, and budget-aware memory rendering. That sounds minor until you look at reinforcement learning. RL works poorly when every failure looks the same, because an empty or wrong final set does not reveal whether the agent searched badly, forgot evidence, skipped verification, or curated carelessly. By externalizing state, Harness-1 gives the policy a cleaner learning problem: improve decisions over a visible search workspace. For Harness-1, gains were larger on held-out benchmarks than on source-family tasks, suggesting the model learned reusable search moves rather than memorized domain habits. ---- Link – arxiv.org/abs/2606.02373 Title: "Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses"

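Translated summary: Harness-1 moves the LLM's memory work into an external helper system (the harness), addressing the inefficiency of conventional search agents that must handle both semantic decisions and state tracking inside one context window. The model is responsible only for key semantic choices such as what to search and what to verify, while recoverable state (candidate pools, evidence links, deduplicated records, budget-aware memory, and so on) is tracked by the harness. This separation let a 20B-parameter model search better. In reinforcement learning, externalized state keeps failure causes from blurring together, which helps policy learning. Harness-1 improves more on unseen benchmarks, suggesting the model learned reusable search strategies rather than memorized domain habits. Paper: arXiv:2606.02373.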
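
The split between policy and harness can be illustrated with a toy workspace object. This is a minimal sketch under assumptions, not the paper's implementation: the `SearchHarness` class, its method names, and the rendering format are all hypothetical, chosen only to show deduplicated observations, verification records, and budget-aware rendering living outside the model.

```python
from dataclasses import dataclass, field


@dataclass
class SearchHarness:
    """Hypothetical external workspace; the policy model does not store this state itself."""
    budget: int                                               # max entries shown to the model
    candidates: dict[str, str] = field(default_factory=dict)  # doc_id -> snippet
    verified: set[str] = field(default_factory=set)

    def observe(self, doc_id, snippet):
        # Deduplicate observations so a repeated hit adds nothing to memory.
        if doc_id in self.candidates:
            return False
        self.candidates[doc_id] = snippet
        return True

    def verify(self, doc_id):
        # Record that the policy chose to verify this document.
        if doc_id not in self.candidates:
            raise KeyError(doc_id)
        self.verified.add(doc_id)

    def render(self):
        # Budget-aware rendering: verified documents first, then the rest by id.
        ordered = sorted(self.candidates, key=lambda d: (d not in self.verified, d))
        return [
            f"{'[v]' if d in self.verified else '[ ]'} {d}: {self.candidates[d]}"
            for d in ordered[: self.budget]
        ]


harness = SearchHarness(budget=2)
harness.observe("a", "alpha")
harness.observe("b", "beta")
duplicate_added = harness.observe("a", "alpha again")
harness.observe("c", "gamma")
harness.verify("c")
print(harness.render())
```

In this sketch the model would only decide which calls to make (`observe`, `verify`), and what it sees each turn is the short rendered view rather than the full history.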

Rohan Paul@rohanpaul_ai · 6月5日70

Another great paper from Google. It shows general LLMs can solve formal math by planning proofs and checking each step, raising general LLM performance from under 10% to 70%. A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback. The paper shows the weakness was not just the model’s math ability, but the way it was being used: the absence of structured interaction with a verifier. The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems. Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time. The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly. LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%. ---- Link – arxiv.org/abs/2606.03303 Title: "LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks"

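Translated summary: Google's new paper LEAP proposes an agentic framework that plans proofs, decomposes them into subgoals, reuses existing lemmas, and uses feedback from the Lean verifier, raising general LLM performance on formal mathematical proofs from under 10% to 70%. Conventional single-shot full proofs perform very poorly on long, hard problems, whereas LEAP stores the proof as a directed graph, planning first and then verifying step by step. On the Putnam 2025 competition LEAP solved all 12 problems; on a Lean benchmark of 60 IMO-style problems it achieved the performance jump described above.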
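
The idea of storing a proof as a graph of goals and subgoals, so that a lemma proved once is reused, can be sketched as follows. This is a toy illustration under assumptions: `ProofGraph`, `Goal`, and the `check` callback are hypothetical names, the callback merely stands in for a verifier such as Lean, and the graph is assumed to be acyclic.

```python
from dataclasses import dataclass, field


@dataclass
class Goal:
    statement: str
    subgoals: list[str] = field(default_factory=list)
    proved: bool = False


class ProofGraph:
    """Hypothetical goal graph; statements are keys, so a shared lemma is a single node."""

    def __init__(self):
        self.goals = {}
        self.checker_calls = 0

    def add(self, statement, subgoals=()):
        # Reuse the existing node when the same statement was recorded before.
        node = self.goals.setdefault(statement, Goal(statement))
        for sub in subgoals:
            self.add(sub)
            if sub not in node.subgoals:
                node.subgoals.append(sub)
        return node

    def prove(self, statement, check):
        node = self.goals[statement]
        if node.proved:
            return True  # lemma already verified: reuse instead of re-proving
        if all(self.prove(sub, check) for sub in node.subgoals):
            self.checker_calls += 1
            node.proved = check(statement)  # stand-in for a verifier call
        return node.proved


graph = ProofGraph()
graph.add("main", ["lemma_a", "lemma_b"])
graph.add("lemma_b", ["lemma_a"])  # lemma_a is shared by two parents
solved = graph.prove("main", check=lambda statement: True)
print(solved, graph.checker_calls)
```

Here `lemma_a` is needed by both `main` and `lemma_b` but is checked only once, which is the reuse the post describes.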

Rohan Paul@rohanpaul_ai · Jun 5 · 70

Sam Altman admits AI budgets are turning into a “huge issue,” with customers burning more tokens than even OpenAI’s top in-house users. Altman said OpenAI’s top internal user spends about 100B tokens/month, while one outside customer hit 603B tokens/month. The cost problem gets worse with AI agents because they do not just answer once, they plan, call tools, read files, retry failed steps, check their own work, and create long chains of hidden token spending. Every plan, retry, code review, context window, tool call, and verification step becomes metered cognition. A human asks once; an agent may ask hundreds of times in a second. Companies are no longer asking whether AI is impressive, but whether the marginal token is producing marginal value. Jevons paradox explains part of the trap: when AI gets cheaper per token, people use far more tokens, so the total bill can still rise.

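Translated summary: Sam Altman says AI budgets are becoming a "huge issue." OpenAI's top internal user consumes about 100B model tokens per month, while an outside customer reached 603B. AI agents make costs worse: an agent does not just answer once but plans, calls tools, reads files, retries failed steps, and checks its own work, producing large amounts of hidden token spending. A human asks once; an agent may ask hundreds of times in a second. Companies no longer ask whether AI is impressive but whether the marginal token produces marginal value. The Jevons paradox explains part of the trap: as cost per token falls, people use more tokens, and the total bill can still rise.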
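
The Jevons effect mentioned above is simple arithmetic: the bill is volume times price, so a falling price does not lower the bill if volume grows faster. A small sketch, using the two token volumes quoted in the post; both per-token prices are hypothetical and chosen only for illustration.

```python
def monthly_bill(tokens_per_month, usd_per_million_tokens):
    """Bill in USD for one month of usage at a flat per-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


# Token volumes are the two figures quoted in the post; both prices are hypothetical.
internal = monthly_bill(100_000_000_000, usd_per_million_tokens=2.0)
external = monthly_bill(603_000_000_000, usd_per_million_tokens=1.0)

print(internal, external, external / internal)
```

Under these assumed prices, halving the price per token while usage grows about sixfold still roughly triples the monthly bill.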

🚨 AI News | TestingCatalog@testingcatalog · Jun 5 · 72

NVIDIA 🔥: Nemotron 3 Ultra has been released on Huggingface with 5x faster inference and 30% lower costs in comparison to other open models. > Nemotron-3-Ultra-550B-A55B-NVFP4 is a frontier-scale large language model (LLM) trained by NVIDIA, designed to deliver strong agentic, reasoning, and conversational capabilities.

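Translated summary: NVIDIA released Nemotron 3 Ultra (Nemotron-3-Ultra-550B-A55B-NVFP4) on Huggingface, a 550B-parameter MoE frontier open LLM designed for long-running AI agents. Compared with other open frontier models, inference is 5x faster and complex agentic tasks cost 30% less. The model has strong agentic, reasoning, and conversational capabilities.

Artificial Analysis@ArtificialAnlys · Jun 5 · 65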

Nemotron 3 Ultra was launched today, including a focus on low latency agentic performance. We tested it against peers under restricted turn-usage limits on Terminal-Bench v2.1 - @NVIDIA Nemotron 3 Ultra completes tasks at a much faster pace than peers due to its high inference speed while scoring competitively on the benchmark. In this analysis each model is given a ‘turn limit’ within which it can complete tasks, inside a customized version of the Terminus 2 harness which advises it of this limit. We apply 4 increasing turn limits and trace each result’s tradeoff of task latency and performance. Time per task, on the X axis, is calculated as decode time based on token usage and measured endpoint output speeds (for Nemotron 3 Ultra, speeds were measured on a pre-release deployment on @blackboxai), plus the actual time spent executing tools to complete the benchmark. Nemotron 3 Ultra is the fastest across all turn limits and sits on the Pareto frontier for performance versus time per task for this evaluation.

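Translated summary: NVIDIA released Nemotron 3 Ultra today, with a focus on low-latency agentic performance. On Terminal-Bench v2.1 the model was compared with peers under 4 increasing turn limits. Thanks to its high inference speed (time is based on token usage and endpoint output speeds measured on a pre-release blackboxai deployment, plus actual tool execution time), Nemotron 3 Ultra completed tasks faster than peers at every turn limit while keeping a competitive benchmark score, and it sits on the performance-versus-time Pareto frontier for this evaluation.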
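
For readers unfamiliar with the term, a point is on the Pareto frontier of "score versus time per task" when no other point is both at least as fast and strictly better on score. A minimal sketch of that computation follows; the model names and numbers are made up for illustration and are not results from the evaluation.

```python
def pareto_frontier(points):
    """points: (name, seconds_per_task, score); lower time and higher score are better."""
    frontier = []
    best_score = float("-inf")
    # Walk from fastest to slowest; a point survives only if it beats every faster point on score.
    for name, seconds, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append(name)
            best_score = score
    return frontier


# Hypothetical data, not taken from the benchmark.
runs = [
    ("A", 100, 40.0),
    ("B", 200, 38.0),  # slower and weaker than A: dominated
    ("C", 300, 55.0),  # slower than A but scores higher: on the frontier
    ("D", 150, 40.0),  # same score as A but slower: dominated
]
print(pareto_frontier(runs))
```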

NotebookLM@NotebookLM · Jun 5 · 60

PRO TIP: Gamify your notebooks Don't just read your notes— investigate them. Our new Sherlock Holmes notebook turns studying into an interactive mystery game. Deduce facts, uncover clues, & prove that even the most complex matters can be elementary. ➡️ https://goo.gle/Sherlock

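Translated summary: Pro tip: gamify your notebooks. Don't just read your notes: investigate them. Our new Sherlock Holmes notebook turns studying into an interactive mystery game. Deduce facts, uncover clues, and prove that even the most complex matters can be solved with ease. ➡️ https://goo.gle/Sherlock

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · Jun 5 · 73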

HOLY SHIT LET'S FUCKING GOO

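Translated summary: HOLY SHIT LET'S FUCKING GOO. Our internal data shows that Claude is accelerating AI development, which could lead to recursive self-improvement, where AI autonomously builds more powerful successors. This is happening faster than we expected, and its implications deserve more attention.

Yuchen Jin@Yuchenj_UW · Jun 5 · 60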

Recursive self-improvement post by Anthropic: “Each time we release a model, we give it code that trains a small AI model, ask the new model to speed it up. In May 2024, Claude Opus 4 averaged a ~3x speedup. This April, Mythos Preview achieved ~52x.” RSI is happening, and I can't wait to see Mythos.

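Translated summary: Recursive self-improvement post by Anthropic: "Each time we release a model, we give it code that trains a small AI model and ask the new model to speed up the training. In May 2024, Claude Opus 4 averaged a ~3x speedup. This April, Mythos Preview reached ~52x." RSI is happening, and I can't wait to see Mythos.

Chubby♨️@kimmonismus · Jun 4 · 81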

1/ NVIDIA shipped Nemotron 3 Ultra today, a fully open 550B model with 55B active params, with the weights, training data, and complete recipe all released openly. That alone is rare at this scale. The headline however actually is speed. Ultra is a hybrid Mamba-Attention MoE, an architecture built for fast decoding and a light memory footprint over long contexts, and NVIDIA clocks it at roughly 6x (!) the throughput of comparable open models on long-output agent workloads while holding the same accuracy. That's a serious engineering result, and it's aimed exactly where the industry is heading: autonomous agents that run long, multi-turn tasks where throughput per GPU is what actually costs money. It was pre-trained in 4-bit (NVFP4) across 20T tokens, the largest stable run of its kind shown to date. And the post-training introduces MOPD, where ten-plus specialist teacher models distill their skills into the student on its own rollouts, sometimes pushing it past the teachers themselves. The interesting aspect:This is a frontier-class model you can fully reproduce.

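Translated summary: NVIDIA officially released Nemotron 3 Ultra, a fully open MoE model with 550B total parameters (55B active), with weights, training data, and the complete recipe all public. It uses a hybrid Mamba-Attention architecture designed for fast decoding and a light memory footprint over long contexts. On long-output agentic workloads, throughput is about 6x that of comparable open models (inference 5x faster), and complex agentic tasks cost up to 30% less. The model was pre-trained in 4-bit (NVFP4) precision on 20T tokens, and post-training uses MOPD, in which more than ten specialist teacher models distill their skills into the student model. This is the first frontier-level open model that can be fully reproduced.

SiliconFlow@SiliconFlowAI · Jun 4 · 72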

Post-training is having a moment — Nex-N2-Pro from neolab @NexEcosystem proves it. Built on Qwen3.5-397B-A17B, delivers GPT-5.5 and Claude Opus 4.7–level performance. 🎉 T+0 Support on SiliconFlow · Free for First 2 Weeks N2-Pro: 397B MoE / Reasoning Model / 262K context / VLM → Auto-adjusts reasoning depth, 30–50% fewer thinking tokens, no performance trade-off → SOTA performance on Terminal Bench 2.1, GDPVal, SWE-Verified → Excels at agentic coding, deep search, tool use → Plug-and-play with Claude Code, Cursor, OpenClaw, etc. Try it on SiliconFlow ⬇️

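Translated summary: neolab released Nex-N2-Pro, built on Qwen3.5-397B-A17B: a 397B-total-parameter MoE reasoning model with 262K context and multimodal (VLM) support, performing at the level of GPT-5.5 and Claude Opus 4.7. The model auto-adjusts reasoning depth, using 30–50% fewer thinking tokens with no performance loss, and reaches SOTA on Terminal Bench 2.1, GDPVal, and SWE-Verified. It excels at agentic coding, deep search, and tool use, and is compatible with tools such as Claude Code and Cursor. SiliconFlow offers T+0 support, free for the first two weeks.

Artificial Analysis@ArtificialAnlys · Jun 4 · 74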

NVIDIA has just released Nemotron 3 Ultra, the new most intelligent US open weights model, with leading speed for its intelligence Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index, well ahead of the next strongest US open weights models, Gemma 4 31B (39.2), Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind the Chinese-led open weights frontier (Kimi K2.6 at 53.9). We partnered with @NVIDIA to evaluate this model for intelligence and speed ahead of its public release. These figures use the final NVFP4 weights that NVIDIA recommends for inference, but our tests show minimal intelligence impact compared to BF16 testing, with higher precision resulting in an Artificial Analysis Intelligence Index score of 48.2 vs. the NVFP4 score of 47.7. Key Takeaways: ➤ Nemotron 3 Ultra leads in speed for its intelligence: through BlackBox AI ahead of release, Nemotron 3 Ultra is served at over 400 output tokens per second - this is slightly faster than the typical serving speed of gpt-oss-120b despite being >4X larger, and comes with significantly greater intelligence ➤ Largest Nemotron 3 model so far: with approximately 550 billion total parameters and 55 billion active, Nemotron 3 Ultra is significantly larger than its siblings and is the largest and most intelligent US open weights model release ever ➤ Nemotron 3 Ultra is the leading US open weights model on the Artificial Analysis Intelligence and Agentic Indexes by far, but Gemma 4 31B scores ~1 point higher on the Coding Index (comprised of Terminal-Bench Hard and SciCode)

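Translated summary: NVIDIA released Nemotron 3 Ultra, currently the most intelligent US open-weights model. It scores 47.7 on the Artificial Analysis Intelligence Index, ahead of Gemma 4 31B (39.2), Nemotron 3 Super (36.0), and gpt-oss-120b (33.3), but behind the Chinese open model Kimi K2.6 (53.9). The model has about 550B total parameters with 55B active and is served at over 400 tokens/s, slightly faster than gpt-oss-120b with significantly higher intelligence. The NVFP4 score is 47.7 versus 48.2 for BF16, a minimal difference.

StepFun@StepFun_ai · Jun 4 · 77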

Great to see Step 3.7 Flash live on @FireworksAI_HQ. Designed for inference from day one, Step 3.7 Flash combines a hardware-friendly architecture with MTP-assisted decoding to reach up to 400 tokens/s. Fast, multimodal, and ready to power capable agents in real-world workflows.

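Translated summary: StepFun's Step 3.7 Flash is now live on Fireworks AI. The model is a 198B sparse MoE multimodal model (VLM) with a 196B language backbone and a 1.8B vision encoder. It was optimized for inference efficiency from the start, combining a hardware-friendly architecture with MTP-assisted decoding to reach 400 tokens/s. It offers native multimodal understanding and action, reliable tool use, and enhanced search, targets real agentic workloads, and is released under the Apache 2.0 license.

X.PIN@thexpin · Jun 4 · 59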

Anthropic isn't the only one making money. ByteDance is too. Volcengine's 2026 MaaS revenue was raised to ~$2.2 billion in April, up from ~$1.5 billion at end-2025. Insiders say Seedance 2.0 alone brings in ~$150 million per month, and its API isn't even fully live overseas yet.

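Translated summary: Anthropic isn't the only one making money. ByteDance is too. Volcengine's 2026 MaaS revenue forecast was raised to about $2.2 billion in April, up from about $1.5 billion at the end of 2025. Insiders say Seedance 2.0 alone brings in about $150 million per month, and its API is not even fully live overseas yet.

StepFun@StepFun_ai · Jun 4 · 73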

Thanks @ArtificialAnlys for the detailed independent evaluation. Step 3.7 Flash is built with a clear focus on the intelligence-speed frontier: MTP-assisted decoding, 400+ output tokens/s, stronger agentic performance, native multimodal capabilities, and Apache 2.0 open weights. This is the direction we believe matters for production agent workloads: capable, efficient, and deployable at scale.

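Translated summary: StepFun released the open-weights Step 3.7 Flash (Apache 2.0), an MoE model (198B total / 11B active parameters) with MTP-assisted decoding (3 prediction heads) and output speed above 400 tokens/s, more than double that of similar-size models. It scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash. Agentic capability improves markedly: GDPval-AA Elo rises to 1298 and TerminalBench Hard to 35.6%. A new 1.8B vision encoder brings an MMMU-Pro score of 75.3%. The context window is 256K tokens, and BF16, FP8, and NVFP4 versions are available. Weaknesses: AA-Omniscience accuracy is only 25.4%, with an 84.4% hallucination rate.

Artificial Analysis@ArtificialAnlys · Jun 4 · 67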

StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index and is served at over 400 output tokens/s Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and the model ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding. Key results for Step 3.7 Flash with the high reasoning level: ➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5) ➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model ➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and its TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat ➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4% ➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only Key model details: ➤ Context window: 256K tokens ➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment ➤ License: Apache 2.0 ➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai 's first-party API

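Summary: StepFun has open-sourced Step 3.7 Flash (Apache 2.0), with 198B total parameters, 11B active (MoE), and a 256K context window. It scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash, and outputs more than 400 tokens/s, accelerated by Multi-Token Prediction heads (3). A new 1.8B vision encoder adds native multimodal support, with an MMMU-Pro score of 75.3%. Agentic capability improves: GDPval-AA Elo rises from 1070 to 1298, TerminalBench Hard reaches 35.6%, and AA-LCR reaches 63.7%. Knowledge and hallucination remain weak: AA-Omniscience accuracy is 25.4% and the hallucination rate is 84.4%. Weights are provided in BF16, FP8, and NVFP4 precision to lower deployment cost.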
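
The weight-storage figures quoted for Step 3.7 Flash follow from simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces them under stated assumptions: 2, 1, and 0.5 bytes per parameter for BF16, FP8, and NVFP4 respectively, no allowance for quantization scale factors, KV cache, or activations, and the 198B total taken at face value (whether it includes the 1.8B vision encoder is not stated in the post). The names `TOTAL_PARAMS` and `weight_gb` are illustrative only.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: ignores quantization scale factors, KV cache, and activations.
TOTAL_PARAMS = 198e9  # total parameter count quoted in the post

# Nominal storage size per parameter for each released precision (assumed).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return approximate weight storage in decimal gigabytes."""
    return params * bytes_per_param / 1e9


estimates = {fmt: weight_gb(TOTAL_PARAMS, size) for fmt, size in BYTES_PER_PARAM.items()}

for fmt, gb in estimates.items():
    print(f"{fmt}: ~{gb:.0f} GB")
```

The results (about 396, 198, and 99 GB) line up with the rounded ~400GB, ~200GB, and ~100GB figures in the post. Real deployments need additional memory on top of the weights.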

MiniMax (official)@MiniMax_AI · Jun 4 · 77

15.6× faster decoding at 1M tokens 🔥 Thanks @FireworksAI_HQ for powering the inference behind M3. Try it now 👇

Summary: 15.6× faster decoding at 1M tokens 🔥 Thanks to @FireworksAI_HQ for powering inference for M3. Try it now 👇

OpenAI@OpenAI · Jun 4 · 67

We’re bringing new capabilities to GPT-Rosalind, a model series purpose-built for life sciences research at enterprise scale. It brings GPT-5.5’s agentic coding and tool use together with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index/introducing-new-capabilities-to-gpt-rosalind

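Summary: OpenAI is adding new capabilities to GPT-Rosalind, a model series purpose-built for enterprise-scale life sciences research. It combines GPT-5.5's agentic coding and tool use with stronger intelligence for drug discovery, analysis, design, and experimental workflows. https://openai.com/index/introducing-new-capabilities-to-gpt-rosalind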

Artificial Analysis@ArtificialAnlys · Jun 4 · 71

Jensen Huang’s keynote at Computex used Artificial Analysis benchmarks to communicate the performance of Nemotron 3 Ultra.
Jensen used our Artificial Analysis Intelligence Index vs. Output Speed chart to communicate the performance of NVIDIA’s new Nemotron 3 Ultra model. The presentation also highlighted GDPval-AA, Artificial Analysis' benchmark that uses OpenAI's GDPval dataset to evaluate models on economically valuable tasks.
NVIDIA additionally highlighted Artificial Analysis Text to Image and Image to Video Arena Elos to promote the NVIDIA Cosmos 3 model family. Congratulations @NVIDIAAI on the launches!

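Summary: In his Computex keynote, Jensen Huang used Artificial Analysis's Intelligence Index vs. Output Speed chart to present the performance of NVIDIA's new Nemotron 3 Ultra model. The keynote also mentioned GDPval-AA, the Artificial Analysis benchmark that uses OpenAI's GDPval dataset to evaluate models on economically valuable tasks. NVIDIA also used Artificial Analysis Text to Image and Image to Video Arena Elo scores to promote the Cosmos 3 model family.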

Microsoft Research@MSFTResearch · Jun 4 · 62

A three‑month pilot in a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN

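Summary: A three-month pilot at a Midwestern bottling plant shows what happens when AI moves beyond chat and into decision-making, where constraints shift, stakes are real, and answers must hold. https://msft.it/6015vjYUN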

elvis@omarsar0 · Jun 3 · 72

New research from Google. Just shows the impressive results you can get from custom agent harnesses. LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback. The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%. Paper: https://arxiv.org/abs/2606.03303 Learn to build effective AI agents in our academy: https://academy.dair.ai/

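Summary: Google's new research, LEAP, wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates on verifier feedback. The same general model solved all 12 Putnam 2025 problems and raised the Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%. Paper: https://arxiv.org/abs/2606.03303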
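
The post describes a scaffold that checks each candidate against a verifier and iterates on the feedback. A minimal sketch of that general propose-and-verify pattern is shown below. It is not LEAP's implementation: `solve_with_verifier`, `toy_propose`, and `toy_verify` are hypothetical names, the proposer stands in for an LLM call, and the verifier stands in for a Lean compiler invocation.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; neither comes from the LEAP paper.
Proposer = Callable[[str, List[str]], str]    # (problem, feedback so far) -> candidate
Verifier = Callable[[str], Tuple[bool, str]]  # candidate -> (accepted, error message)


def solve_with_verifier(
    problem: str, propose: Proposer, verify: Verifier, max_rounds: int = 8
) -> Optional[str]:
    """Propose candidates until the verifier accepts one or rounds run out."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = propose(problem, feedback)
        accepted, message = verify(candidate)
        if accepted:
            return candidate
        # Feed the verifier's error back into the next proposal.
        feedback.append(message)
    return None


def toy_verify(candidate: str) -> Tuple[bool, str]:
    """Toy stand-in for a compiler check: accepts candidates ending in 'rfl'."""
    if candidate.endswith("rfl"):
        return True, ""
    return False, "error: unsolved goals"


def toy_propose(problem: str, feedback: List[str]) -> str:
    """Toy stand-in for an LLM: gives a placeholder first, then a real tactic."""
    tactic = "rfl" if feedback else "sorry"
    return f"theorem t : {problem} := by {tactic}"


result = solve_with_verifier("1 + 1 = 2", toy_propose, toy_verify)
print(result)
```

The key design point is that the loop only returns a candidate the verifier has accepted, so the proposer can be unreliable without the final answer being unchecked.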

Alibaba Cloud@alibaba_cloud · Jun 3 · 63

Agent performance is no longer about cost per token, but the cost to finish the whole task. We must treat inference as a whole operating system to turn tokens into real business value.

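Summary: Agent performance no longer depends on cost per token but on the cost to finish the whole task. Inference must be treated as a complete operating system that turns tokens into real business value.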
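
One way to make the cost-per-task framing concrete is to divide the cost of an attempt by the probability that the attempt succeeds. The sketch below assumes independent retries until success (so the expected number of attempts is 1 / success rate); the class `AgentSetup` and every number in it are made up for illustration and do not come from the post.

```python
from dataclasses import dataclass


@dataclass
class AgentSetup:
    """Hypothetical agent configuration; all values are illustrative."""
    name: str
    usd_per_million_tokens: float
    tokens_per_attempt: int
    success_rate: float  # probability that one attempt finishes the task

    def cost_per_attempt(self) -> float:
        return self.usd_per_million_tokens * self.tokens_per_attempt / 1_000_000

    def expected_cost_per_task(self) -> float:
        # Assumes independent retries until success: E[attempts] = 1 / p.
        return self.cost_per_attempt() / self.success_rate


cheap = AgentSetup("cheap-per-token", 0.50, 400_000, 0.25)
pricey = AgentSetup("pricey-per-token", 2.00, 150_000, 0.75)

for setup in (cheap, pricey):
    print(
        f"{setup.name}: ${setup.cost_per_attempt():.2f} per attempt, "
        f"${setup.expected_cost_per_task():.2f} per finished task"
    )
```

In this made-up comparison, the setup with the higher token price costs about half as much per finished task, because it uses fewer tokens per attempt and succeeds more often.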

swyx@swyx · Jun 3 · 46

probably the best reward function for reasoning efficiency i've seen

Summary: Probably the best reward function for reasoning efficiency I've seen.

Alibaba Cloud@alibaba_cloud · Jun 3 · 71

Qwen: Foundation Models for the Agent Era, with Steven Hoi, Head of Multimodal Interaction, Tongyi Large Model BU.
Qwen3.7 delivers major breakthroughs in reasoning, fully upgrading native agentic capabilities across tool use, coding, and long-horizon tasks.

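Summary: Qwen: Foundation Models for the Agent Era, presented by Steven Hoi, Head of Multimodal Interaction at the Tongyi Large Model BU. Qwen3.7 achieves major breakthroughs in reasoning and fully upgrades native agentic capabilities across tool use, coding, and long-horizon tasks.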