All AI Updates · AI HOT

All updates from X · 608 posts

Tag: Papers/Research

Ant Ling @AntLingAGI · May 26 · 68

From IcePop to KPop — our team keeps pushing on RL training stability for large MoE models. 👇 KPop replaces the fixed-ratio mask with an adaptive binary-KL region that matches each token's inherent noise. More robust updates, stable long-horizon agentic RL. Ring-2.6-1T → 76+ on SWE-bench Verified, pure RL. Congrats to @Jia__Guo & team! Blog: https://ringtech.notion.site/kpop

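Summary: The team introduces KPop to stabilize agentic reinforcement learning for large-scale MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive mask based on binary KL divergence, which adjusts dynamically to the degree of training-inference mismatch during training. With this change, Ring-2.6-1T scored above 76 on SWE-bench Verified through pure RL training, with no infrastructure changes and no routing replay.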
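The contrast between a fixed-ratio mask and a binary-KL region can be sketched as below. This is only an illustration under stated assumptions, not the team's implementation: it assumes "binary KL" means the KL divergence between two Bernoulli distributions built from the probability that the training engine and the inference engine assign to the sampled token, and the band `[low, high]`, the budget `delta`, and all function names are hypothetical.

```python
import math

def binary_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL(Bernoulli(p) || Bernoulli(q)), with inputs clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def ratio_mask(p_train, p_infer, low=0.5, high=2.0):
    """Fixed-ratio mask: keep a token if p_train / p_infer lies in [low, high].
    The bounds are hypothetical."""
    return [low <= pt / pi <= high for pt, pi in zip(p_train, p_infer)]

def kl_mask(p_train, p_infer, delta=0.05):
    """Binary-KL mask: keep a token if its binary KL stays within a budget.
    The budget delta is hypothetical."""
    return [binary_kl(pt, pi) <= delta for pt, pi in zip(p_train, p_infer)]

# Probability of the sampled token under the inference and training engines.
p_infer = [0.90, 0.010, 0.50]
p_train = [0.50, 0.025, 0.55]

print("ratio mask:", ratio_mask(p_train, p_infer))
print("KL mask:   ", kl_mask(p_train, p_infer))
```

On these toy numbers the two rules disagree on the first two tokens: the ratio rule keeps a confident token whose probability nearly halved and drops a rare token whose ratio is large but whose absolute shift is tiny, while the KL rule does the opposite. Whether that matches KPop's actual region is an assumption.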

Berryxia.AI @berryxia · May 26 · 44

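Don't be fooled! Do large models really need "sleep" too? A research team from CMU and UMD found that the attention mechanism of Transformer LLMs falls apart on very long tasks. Rather than keep extending the context length, they gave the model "sleep": while asleep, the model converts all of its recent context into persistent fast weights and then clears the KV cache. The mechanism is called "sleep-like consolidation". The story is in arXiv 2605.26099, posted on May 25, 2026, with a title that could hardly be blunter: "Language Models Need Sleep", by Sangyun Lee, Sean McLeish, Tom Goldstein, and Giulia Fanti. Traditional Transformers wear down on long-horizon tasks because attention cost blows up quadratically with context length. The KV cache takes more and more GPU memory and inference gets slower and slower. Their proposal is heavily bio-inspired: at intervals the model enters "sleep mode". It first makes N offline looped passes over the recently accumulated context, then uses a learned local rule to consolidate that information into the fast weights inside its state-space model blocks, and once consolidation is done it clears the KV cache. On waking, the model keeps working, but its memory has gone from "short-term and volatile" to "long-term and persistent". The experiments show directly that increasing sleep depth or sleep duration significantly improves post-sleep reasoning. This is not another parameter trick; it changes the paradigm for how models handle long context. Big Tech is still racing to push context windows to the million-token level by brute-force stacking of GPU memory. This small team used "sleep", the simplest of human mechanisms, to attack the problem at its root. The whole framework is 100% open source; the paper, code, and ideas are all on arXiv. Big Tech's closed-source long-context subscription model depends on you not knowing that models can "sleep" to save resources.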

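Summary: In the paper "Language Models Need Sleep" (arXiv 2605.26099), a research team from CMU and UMD argues that traditional Transformer models become inefficient on long tasks because of the high computational complexity of attention and the steadily growing GPU memory footprint of the KV cache. They propose a bio-inspired "sleep-like consolidation" mechanism: the model periodically enters a "sleep" state, processes its recent context offline over multiple passes, consolidates the information into the fast weights of its state-space blocks, and then clears the KV cache. Experiments show that increasing sleep depth or duration significantly improves the model's subsequent reasoning. The framework is fully open source and offers a long-context paradigm that differs from brute-force memory scaling.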

Rohan Paul @rohanpaul_ai · May 26 · 61

Brilliant new paper from Meta, CMU and other labs. Shows that coding agents improve faster by manufacturing their own software experience. Coding agents can train themselves by making and fixing bugs inside real projects. Most coding agents still learn from human leftovers: issues, pull requests, tests, comments, and benchmarks that describe what went wrong. That is useful, but it makes the agent dependent on the rate at which humans produce clean, verifiable lessons. Self-play SWE-RL changes the unit of learning from a labeled task to an executable situation. One version of the model explores a real codebase, weakens tests, injects a meaningful bug, and leaves behind test artifacts that define the failure without needing an English issue description. Another version of the same model has to repair the system, not by matching words to patches, but by restoring behavior under tests. Here’s the key point: the test is not just a grader here, it is the language of the problem. That matters because software understanding lives in constraints, dependencies, edge cases, and invariants that prose often compresses or misses. The reported gains, +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro, are early but hard to ignore because evaluation still used natural-language issues the self-play system did not train on. That suggests SSR (Self-play SWE-RL) is learning something deeper than issue phrasing, though not yet anything like open-ended mastery. The restraint matters: generated bugs can be artificial, rewards can be noisy, and sandboxed repositories are still a narrow slice of software reality. Still, the direction is sharp. The next bottleneck for coding agents may not be more human-written tasks, but more ways for agents to encounter, create, survive, and learn from failure. ---- Paper Link – arxiv. org/abs/2512.18552 Paper Title: "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"

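Summary: In this paper, Meta, CMU, and other institutions propose Self-play SWE-RL. The method lets coding agents generate their own training data through self-play instead of relying only on human-labeled issues. One model explores a codebase, injects a bug, and leaves behind tests that describe the problem; another model learns to repair the system against those tests. The tests become the core language for describing the problem. The method gains 10.4 points on SWE-bench Verified and 7.8 points on SWE-Bench Pro. Notably, evaluation used natural-language issues the system was not trained on, which suggests it may have learned a deeper understanding of software.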

Rohan Paul @rohanpaul_ai · May 26 · 57

New Meta, Stanford, Google and many other top labs paper proposes AutoResearchClaw. Shows that automated research improves when AI can fail, recover, and ask humans at the right moments. The paper is less about an “AI scientist” than about turning research into a governed loop. Most systems still treat science like a production line: generate an idea, run code, write a paper, then stop when the chain breaks. AutoResearchClaw treats failure as evidence, using debate, repair, verification, memory, and selective human input as parts of the same machine. That is the main point: autonomy gets better when it is constrained by process, not when it is simply given more freedom. On ARC-Bench, the system beat AI Scientist v2 by 54.7%, with its sharpest gains in result analysis, where claims had to match measurements rather than merely sound plausible. The human result is more interesting: CoPilot reached an 87.5% accept rate, while full autonomy reached 25% and step-by-step oversight reached 50%, suggesting that too little judgment and too much supervision can both degrade science. The most revealing failure was a case where every cross-validation method returned identical zero-bias outputs, which passed numeric verification but failed scientific meaning. That is the boundary this paper exposes: machines can verify that numbers are real, but humans still notice when the experiment has stopped asking the right question. ---- Paper Link – arxiv. org/abs/2605.20025 Paper Title: "AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration"

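Summary: Meta, Stanford, and other institutions propose AutoResearchClaw, a framework for autonomous research with AI agents. Its core idea is to turn research into a process-constrained loop rather than a simple production line. The system combines debate, repair, verification, memory, and selective human feedback, and treats failure as valid evidence. On ARC-Bench it beats AI Scientist v2 by 54.7%, with its largest gains in result analysis. The human-collaboration experiments show that CoPilot mode (intervening at the right moments) reached an 87.5% accept rate, against 25% for full autonomy and 50% for step-by-step oversight. In one key failure case, every cross-validation method returned identical zero-bias outputs; the result passed numeric verification but had no scientific meaning, which underscores the role of human judgment.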

Ant Ling @AntLingAGI · May 26 · 62

SwiGLU is everywhere in modern LLMs — but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep network or low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵

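Summary: SwiGLU is everywhere in modern large language models, but for large inputs it behaves like x². This quadratic growth inflates activations, amplifies outliers, and makes deep-network or low-precision (FP8/FP4) training prone to loss spikes. The team proposes PowLU, a drop-in activation function designed for stable large-scale pre-training. 🧵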
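The quadratic behaviour mentioned in the post can be checked numerically in the scalar case. The sketch below assumes the common form SwiGLU(x) = SiLU(x·W_gate) · (x·W_up) and, purely for illustration, sets both weights to 1; PowLU itself is not shown because the post does not define it.

```python
import math

def silu(z: float) -> float:
    """SiLU / Swish: z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return z * ez / (1.0 + ez)

def swiglu_scalar(x: float, w_gate: float = 1.0, w_up: float = 1.0) -> float:
    """Scalar SwiGLU with hypothetical unit weights: silu(w_gate*x) * (w_up*x)."""
    return silu(w_gate * x) * (w_up * x)

# As x grows, sigmoid(x) approaches 1, so the output approaches x squared.
for x in (1.0, 4.0, 16.0, 64.0):
    y = swiglu_scalar(x)
    print(f"x={x:5.1f}  swiglu={y:10.3f}  swiglu/x^2={y / x**2:.4f}")
```

With unit weights the ratio `swiglu/x^2` is exactly sigmoid(x), so it climbs toward 1 as the input grows, which is the blow-up the post describes.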

X.PIN @thexpin · May 26 · 67

Huawei plans to scale AI chips without smaller nodes. A new paper by Huawei's He Tingbo, "A Time Scaling Theory for Multi-Layer Electronic Systems," outlines how they'll advance Ascend AI chips as transistor shrinking slows down. Instead of next-gen lithography, Huawei will scale its Ascend SuperPoD line through ~2030 by packing mature tech across the 2025 910C, 2026 950, and 990: 🔹 Chiplets 🔹 2.5D fan-out packaging 🔹 3D stacking (via micro-bumps & hybrid bonding) Around 2030, Ascend 990 will debut LogicFolding in AI accelerators, aiming for a 100x integration leap by 2035.

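Summary: Huawei plans to scale its Ascend AI chips through packaging and architecture rather than smaller process nodes. According to He Tingbo's paper, Huawei intends to advance its Ascend SuperPoD line between 2025 and 2030 using chiplets, 2.5D fan-out packaging, and 3D stacking, across the 2025 910C, the 2026 950, and the later 990. Around 2030, the Ascend 990 will introduce LogicFolding, with the goal of a 100x leap in integration by 2035.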

Rohan Paul @rohanpaul_ai · May 26 · 59

One engineering challenge in dexterous robot hands is balancing strength and speed. Here, a SharpaWave performs rapid hand cycles at over 4 per second. The Dynamic Tactile Array uses visuo-tactile sensing: each fingertip integrates a camera and 1,000+ tactile pixels.

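Summary: One engineering challenge for dexterous robot hands is balancing strength and speed. Here, a SharpaWave performs rapid hand cycles at more than 4 per second. The Dynamic Tactile Array uses visuo-tactile sensing: each fingertip integrates a camera and more than 1,000 tactile pixels.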

Rohan Paul @rohanpaul_ai · May 26 · 69

New Google paper says LLMs should stop pretending certainty and instead clearly show when they are unsure. Hallucination is less about machines being wrong than about machines sounding certain when they should hesitate. That distinction changes the target-problem. The paper changes the target from making models perfectly factual to making them honest about their own uncertainty. For years, the obvious goal has been to make language models know more, so they make fewer factual mistakes. Perfect factuality may be very hard, but a model that clearly separates “I know this” from “I am guessing” can stay useful without quietly damaging trust. This paper argues that the harder missing skill is not knowledge, but self-knowledge. A model can be well calibrated in the broad sense, knowing that answers like this are correct about 60% of the time, yet still fail to identify which particular answer is the dangerous one. That is the trap: to eliminate errors, the system must refuse many answers that would have been right. The authors call this the utility tax, and it explains why products keep drifting toward confident usefulness rather than cautious truth. Here's the key point. A wrong answer wrapped in honest uncertainty is not the same social object as a wrong answer delivered as fact. It gives the user a different instruction: verify this, treat it as provisional, do not build too much on it. The proposed fix is “faithful uncertainty,” where the model’s language mirrors its internal confidence instead of smoothing doubt into authority. For agents, this becomes even more important, because uncertainty is what should decide when to search, when to trust a source, and when to stop. Tools expand what a model can access, but metacognition governs whether access is used wisely. ---- Paper Link – arxiv. org/abs/2605.01428v1 Paper Title: "Hallucinations Undermine Trust; Metacognition is a Way Forward"

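Summary: A new Google paper argues that the core of LLM hallucination is that models sound certain when they should hesitate, not simply that they get facts wrong. The paper shifts the target from perfect factual accuracy to models that honestly separate "I know this" from "I am guessing". The authors propose "faithful uncertainty", which requires a model's wording to match its internal confidence. They also introduce the "utility tax", which explains why products drift toward confident but possibly wrong answers. For agents, metacognition is critical because it decides when to call tools and when to trust a source.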
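The gap between aggregate calibration and knowing which answer is wrong, and the resulting "utility tax", can be made concrete with a toy example. The confidence values and the `abstain` helper below are hypothetical and are not taken from the paper.

```python
def abstain(conf, correct, threshold):
    """Answer only when confidence >= threshold.
    Returns (errors among answered, correct answers that were refused)."""
    answered = [ok for c, ok in zip(conf, correct) if c >= threshold]
    errors = sum(1 for ok in answered if not ok)
    lost_correct = sum(correct) - sum(answered)
    return errors, lost_correct

# Ten answers, six of them right.
correct = [True] * 6 + [False] * 4

# Model A: 60% confident on everything. Calibrated on average, but it cannot
# tell which individual answer is the risky one.
flat = [0.6] * 10

# Model B: same average confidence, but it separates right from wrong.
sharp = [0.9] * 6 + [0.15] * 4

print("A, answer everything:", abstain(flat, correct, 0.5))
print("A, refuse to be safe:", abstain(flat, correct, 0.7))
print("B, threshold at 0.5: ", abstain(sharp, correct, 0.5))
```

Model A can only remove its four errors by refusing all ten answers, so it gives up six correct ones; that lost usefulness is the tax. Model B removes the same errors for free because its confidence tracks correctness per answer.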

Rohan Paul @rohanpaul_ai · May 26 · 65

This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working layer. The problem is that an LLM by itself is mostly a text predictor, so long tasks can lose state, hide mistakes, and turn plans into actions in fragile ways. The real advance is not “AI writes code,” but “AI uses code as the environment it thinks inside.” The authors call the surrounding system an agent harness, meaning the tools, memory, sandboxes, checks, and feedback loops that turn a model into an agent. Their core idea is that code should sit at the center of that harness, because code can be run, inspected, checked, saved, edited, and shared. Tests become sensors. Repositories become memory. Logs become history. Sandboxes become boundaries. A generated script is no longer merely an answer; it is a handle the system can run, check, revise, share, and roll back. The main finding is a pattern across many fields: code helps agents reason through executable steps, act through tool calls or control programs, and model environments through tests, traces, logs, repositories, and simulators. ---- Paper Link – arxiv. org/abs/2605.18747 Paper Title: "Code as Agent Harness"

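Summary: A survey paper from Meta, Stanford, and Illinois argues that AI agents perform better when code is their main working layer. An LLM on its own is a text predictor, so long tasks suffer from lost state and hidden mistakes. The real advance is not "AI writes code" but "AI thinks inside a code environment". The paper centers on a code-centric "agent harness": the tools, memory, sandboxes, and related systems around the model. In this harness, tests become sensors, repositories become memory, logs become history, and sandboxes become boundaries. A generated script becomes a handle that can be run, checked, revised, and shared. The overall finding is that code helps agents reason through executable steps, act through tool calls, and model their environment through tests, logs, and similar artifacts.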

elvis @omarsar0 · May 25 · 66

New research from Microsoft Research. I see a lot of AI engineers handwriting agent skill docs and hoping they generalize. Probably not optimal. This work shows why. It treats the skill doc as a trainable external state of a frozen agent instead. It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes. SkillOpt is best or tied on all 52 (model, benchmark, harness) cells. On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses. Paper: https://arxiv.org/abs/2605.23904 Learn to build effective AI agents in our academy: https://academy.dair.ai/

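Summary: Microsoft Research proposes SkillOpt, which treats an AI agent's skill document as a trainable external state rather than something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions, and a textual learning rate controls how aggressively each round rewrites the document; the agent itself stays unchanged. SkillOpt is best or tied for best on all 52 cells (combinations of model, benchmark, and harness). On GPT-5.5, compared with no skill document, it gains 23.5, 24.8, and 19.1 points in direct chat, with Codex, and with Claude Code respectively. It beats human-written skills and other automated methods, adds no inference-time cost, and its learned skills transfer across models and harnesses.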
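The validation-gated edit loop described above can be sketched as a simple hill-climb over the skill file. Everything here is a stand-in: `validate` replaces running the frozen agent on a validation set, `propose` replaces the optimizer model, and `max_edits` is only one possible reading of the "textual learning rate". None of these names come from the paper.

```python
import random

# Hypothetical instruction pool; in practice the optimizer model writes these.
USEFUL = {"run tests before answering", "cite file paths", "ask when spec is ambiguous"}
CANDIDATES = sorted(USEFUL | {"always answer in haiku", "never read files"})

def validate(skill: set) -> int:
    """Stand-in for scoring the frozen agent with this skill doc on held-out tasks."""
    return len(skill & USEFUL) - len(skill - USEFUL)

def propose(skill: set, max_edits: int, rng: random.Random) -> set:
    """Stand-in for the optimizer model: toggle (add or delete) up to max_edits lines."""
    edited = set(skill)
    for line in rng.sample(CANDIDATES, max_edits):
        edited.symmetric_difference_update({line})
    return edited

def optimize(skill, rounds=50, max_edits=2, seed=0):
    """Keep an edit only if the validation score improves; the agent never changes."""
    rng = random.Random(seed)
    best, best_score = set(skill), validate(set(skill))
    for _ in range(rounds):
        candidate = propose(best, max_edits, rng)
        score = validate(candidate)
        if score > best_score:  # the validation gate
            best, best_score = candidate, score
    return best, best_score

start = {"always answer in haiku"}
final_skill, final_score = optimize(start)
print(validate(start), "->", final_score, sorted(final_skill))
```

Because an edit is accepted only when the validation score rises, the skill file can never get worse on the validation set, whatever the proposer suggests.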

Rohan Paul @rohanpaul_ai · May 25 · 75

🇨🇳 Huawei just released a breakthrough chip design approach, "LogicFolding", that will close its gap with TSMC. The technical paper behind it. The core idea is that chips should stop measuring progress mainly by how small transistors are and start measuring progress by how much time delay can be removed from the whole machine. A chip wastes time when signals move through long wires, memory paths, chip-to-chip links, and software communication layers, so Huawei calls this delay τ, or tau. Huawei’s paper introducing "LogicFolding" says the next chip breakthrough may come from cutting wasted time inside the machine. That is what "τ scaling" means. τ is the delay that accumulates before useful computing happens: a transistor switches, a signal crosses a wire, data reaches memory, a chip talks to another chip, or a server waits for a response. Moore’s Law reduced this delay indirectly because shrinking transistors also shortened many of the paths around them. But modern chips are no longer slowed only by transistor size. They are slowed by wire resistance, parasitic capacitance, clock skew, memory distance, protocol conversion, chip-to-chip communication, and the cost of moving data. So τ scaling changes the question from “how small is the transistor?” to “where is time being lost?” LogicFolding is Huawei’s physical answer to that question inside a chip. In a normal chip, related logic gates are spread across a flat surface, so signals often travel sideways through long metal routes before reaching the next important gate. Those wires behave like sticky pipes: resistance slows current, capacitance must be charged and discharged, and every extra distance creates delay and wastes energy. LogicFolding tries to stack active circuit layers vertically and connect them with very fine hybrid bonds, so circuits that need to talk are placed above and below each other instead of far apart on one plane.
The signal now takes a shorter route, the critical path becomes faster, clock timing becomes cleaner, and the same manufacturing node can deliver more performance. Huawei is trying to win not by making every switch smaller, but by making every important signal travel less, wait less, and arrive sooner.

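Summary: Huawei proposes two new approaches, "τ scaling" and "LogicFolding", aimed at narrowing the performance gap with TSMC without relying on the most advanced lithography tools. The core idea is to shift the measure of chip progress from transistor size to signal delay (τ). LogicFolding is the concrete implementation: it stacks logic circuit layers vertically and uses hybrid bonding so that circuits that need to communicate sit next to each other, which shortens critical paths, reduces resistance and parasitic capacitance, and speeds up signals. Huawei says its next-generation Kirin phone chip will be the first full test of τ scaling.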

Rohan Paul @rohanpaul_ai · May 25 · 65

New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36x (compared against FlashAttention-2) with only lightweight adaptation. It shows standard LLMs can handle very long context faster by making attention selectively sparse. The problem is that full attention gets very expensive when the input grows to hundreds of thousands or 1M tokens, because the model keeps comparing too many tokens with too many other tokens. The paper’s claim is that a trained full-attention model already has a hidden sparse structure, so the model does not need to be rebuilt or trained from scratch. RTPurbo uses that structure by finding the few attention heads that really need faraway tokens, while letting the other heads focus mostly on nearby text. For those retrieval heads, it uses a small 16-dimensional token finder to guess which old tokens matter, then runs the real attention only on that selected set. The authors tested this on long-context benchmarks and reasoning tasks, and RTPurbo kept accuracy close to full attention while reaching up to 9.36x faster prefill at 1M tokens and about 2x faster decoding. RTPurbo's engineering rule: keep expensive long-context access only where it matters, and route the rest through a smaller search space. The clever part is the 16-dimensional indexer. It does not replace the model’s real attention computation; it acts like a cheap scout, finding likely useful tokens before the full representation is used on the selected set. RTPurbo is not proof that every model can be safely sparsified this way. But it is strong evidence that the waste in long-context inference is more structured than it looks. ---- Paper Link – arxiv.org/abs/2605.16928v1 Paper Title: "Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps"

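Summary: Alibaba and Nanjing University propose RTPurbo, a lightweight adaptation method. It builds on the finding that a trained full-attention model already contains a hidden sparse structure. A lightweight 16-dimensional token finder acts as a "scout" that locates important tokens for the few attention heads that need long-range information, while the other heads attend mostly to local text. On 1M-token prefill, RTPurbo is up to 9.36x faster than FlashAttention-2, and decoding is about 2x faster, while accuracy on long-context and reasoning benchmarks stays close to full attention. The work suggests that the wasted computation in long-context inference has exploitable structure.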
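The "cheap scout, then real attention on the selected set" idea can be sketched for a single query. This is a toy under stated assumptions, not RTPurbo's code: the indexer here simply truncates vectors to their first 16 dimensions, whereas the post implies a small trained component, and the sizes `n`, `d`, `k` and the planted token are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values, idx):
    """Scaled dot-product attention for one query, restricted to token indices idx."""
    weights = softmax([dot(q, keys[i]) / math.sqrt(len(q)) for i in idx])
    return [sum(w * values[i][d] for w, i in zip(weights, idx))
            for d in range(len(values[0]))]

def select_tokens(q_small, keys_small, k):
    """Cheap scout: score every token in the low-dimensional space, keep the top k."""
    scores = [dot(q_small, ks) for ks in keys_small]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

rng = random.Random(0)
n, d, r, k = 256, 64, 16, 8          # tokens, head dim, indexer dim, kept tokens
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
keys[42] = [3.0 * x for x in q]      # plant one clearly relevant far-away token

# Assumption: the "indexer" is plain truncation to the first r dimensions.
q_small = q[:r]
keys_small = [key[:r] for key in keys]

selected = select_tokens(q_small, keys_small, k)
sparse_out = attend(q, keys, values, selected)          # full-dim attention, 8 tokens
full_out = attend(q, keys, values, list(range(n)))      # full-dim attention, 256 tokens
print("selected tokens:", selected)
```

The scoring pass costs `n * r` multiplications and the real attention only touches `k` tokens, which is the shape of the saving the post describes for retrieval heads.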

Chubby♨️ @kimmonismus · May 25 · 60

Nine more Erdős problems have been solved. This time, however, by Google DeepMind. This shouldn't be underestimated, because on the one hand it increases competitive pressure, and on the other hand it proves that the other Frontier Labs can easily keep up.

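Summary: Nine more Erdős problems have been solved, this time by Google DeepMind. This should not be underestimated: on the one hand it raises competitive pressure, and on the other it shows that the other frontier labs can easily keep up.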

Rohan Paul @rohanpaul_ai · May 25 · 73

A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper, 50% of expert computation is removed with almost no loss in accuracy. This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them. The paper proposes Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. It shows that many MoE tokens do not need real experts, only permission to skip them. That sounds like a small routing trick, but it changes the economics of deployed language models. Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token. ZEDA adds a strange new option to the router: experts that output exactly nothing. When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation. The clever part is not the dummy expert, but the adaptation method. Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work. Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup. The deeper finding is: compute use did not simply track task difficulty. The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less. That makes ZEDA feel less like pruning and more like attention to computational doubt. ---- Paper Link – arxiv.org/abs/2605.18643 Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"

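Summary: The paper proposes ZEDA, a framework that turns post-trained static MoE models (such as Qwen3 and GLM) into dynamic ones by letting the router skip expert calls when a token is too easy to need them. On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA removes about 50% of expert computation with only a slight accuracy loss and delivers about a 20% real inference speedup. The study finds that compute allocation mainly follows the model's uncertainty rather than simply tracking task difficulty.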
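The "zero expert" routing option can be illustrated with a toy top-k router. This sketch covers only the forward pass, not the self-distillation training; it assumes router weights are normalized over all chosen slots (zero experts included) and that the layer sits on the usual residual connection, so a zero output leaves the token unchanged. The expert functions and logits are hypothetical.

```python
import math

def top_k(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def moe_layer(x, logits, experts, k=2):
    """Router slots 0..len(experts)-1 are real experts; any further slots are
    zero experts that output nothing and cost nothing.
    Returns (layer output, number of real expert calls)."""
    chosen = top_k(logits, k)
    exps = [math.exp(logits[i]) for i in chosen]
    total = sum(exps)
    out, calls = 0.0, 0
    for i, e in zip(chosen, exps):
        if i < len(experts):          # real expert: run it and weight its output
            out += (e / total) * experts[i](x)
            calls += 1
        # zero expert: contributes 0 and is never executed
    return out, calls

# Two hypothetical real experts plus two zero-expert slots (router has 4 logits).
experts = [lambda x: 2.0 * x, lambda x: x + 1.0]

hard_token = moe_layer(1.0, [3.0, 2.0, 0.5, 0.1], experts)   # routes to real experts
easy_token = moe_layer(1.0, [0.1, 0.2, 3.0, 2.5], experts)   # routes to zero experts
print("hard token:", hard_token)
print("easy token:", easy_token)
```

The easy token triggers no expert calls at all, which is where the compute saving in the post would come from; the router still has to learn when that is safe.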

Chubby♨️ @kimmonismus · May 24 · 68

Don't like this at all. Researchers at KIT (Germany) just demonstrated that ordinary WiFi routers can identify individuals with near-perfect accuracy. No phone required, no special hardware, no line of sight. The system reads unencrypted beamforming feedback that every connected device already broadcasts. 197 test subjects, nearly 100% identification rate. The surveillance infrastructure isn't being built. It's already installed in every café, airport, and office you walk through. The only question is who starts reading the signals first. Source: ScienceDaily

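Summary: Researchers at KIT in Germany showed that ordinary WiFi routers can identify individuals almost perfectly, with no phone, special hardware, or line of sight required. The system uses the unencrypted beamforming feedback that every connected device already broadcasts. In a test with 197 subjects, identification accuracy was close to 100%. The point is that this kind of surveillance infrastructure (routers in cafés, airports, and offices) is already everywhere, and the real question is who will start reading and using these signals.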

Rohan Paul @rohanpaul_ai · May 24 · 39

Interesting paper 😀 "What the F*ck Is Artificial General Intelligence?" Defines intelligence as adaptability under limits of compute, memory, and energy. So AGI is a system that adapts at least as generally as a human scientist. That means it should be able to plan experiments, learn cause and effect, balance exploration and action, and operate with autonomy. The paper calls this type of AGI an artificial scientist, because it is judged by its ability to discover and adapt across many tasks, not just by passing human-like tests. So AGI is not just “human-level AI” but a whole system that can adapt broadly, efficiently, and scientifically, at least as well as a human scientist. ---- arxiv.org/abs/2503.23923

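Summary: A new paper offers an explicit definition of artificial general intelligence (AGI), describing it as an "artificial scientist". Such a system must, like a human scientist, be able to plan experiments autonomously, learn cause and effect, and balance exploration and action. The core is adaptability: under limits of compute, memory, and energy, it should adapt to new environments and tasks as broadly, efficiently, and scientifically as a human scientist. It is judged by its ability to discover and adapt, not by passing human-like tests.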

elvis @omarsar0 · May 23 · 64

// Adapt the Interface, Not the Model // I am fascinated by the results across my cheap-model-plus-good-harness builds. This new paper also shows good signs of the code-as-agent-harness thesis. The idea is really simple. Do not touch the model. Instead, modify the runtime interface that wraps the frozen LLM. Then convert recurring interaction failures into reusable interventions on the harness side. The paper reports an average relative improvement of 88.5% across 7 deterministic environments, 126 model-environment settings, and 18 backbones. A harness learned from one model trajectory generalizes to 17 other backbones. That tells you the harness is capturing environment structure, not model-specific patterns. If you ship agents in production, your harness work is more portable than you might assume. Paper: https://arxiv.org/abs/2605.22166 Learn to build effective AI agents in our academy: https://academy.dair.ai/

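Summary: A new study proposes improving agent performance by changing the runtime interface that wraps a frozen LLM rather than modifying the model itself. The method converts recurring interaction failures into reusable interventions on the runtime side and reports an average relative improvement of 88.5% across 7 deterministic environments and 126 settings. The key finding is that a harness learned from a single model's trajectory transfers to 17 other backbones (18 in total), which shows that it captures environment structure rather than model-specific patterns. This makes harness work more portable for agents deployed in production.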

Rohan Paul @rohanpaul_ai · May 23 · 60

Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw logs. i.e. stronger coding agents do not just need more attempts, but better ways to remember attempts. That sounds obvious until you look at what an agent actually produces: not an answer, but a messy trail of file reads, shell commands, errors, partial fixes, and abandoned ideas. The paper’s idea is to turn each full attempt into a compact summary of the main guess, partial progress, and failure points, then use those summaries both to pick the best attempts and to guide new ones. Test-time scaling breaks when the model cannot compare its own past work. For short answers, ranking is easy. For long-horizon coding, the bottleneck shifts from generation to representation. Once rollouts become summaries, two useful things happen. The system can run tournament-style selection over small groups of candidates, which works better than forcing one giant comparison, and it can feed the best summaries back into a fresh round of attempts instead of starting blind. --- The authors test this on 2 hard coding benchmarks by running many attempts in parallel, selecting promising summaries with a tournament style voting method, and then launching fresh attempts that can read the selected summaries first. The results are strong, with Claude 4.5 Opus rising from 70.9% to 77.6% on SWE-Bench Verified and from 46.9% to 59.1% on Terminal-Bench v2.0. What matters is that the paper says better test-time scaling for long coding agents is not mostly about making more attempts, but about storing experience in a form the agent can actually reuse. ---- Paper Link – arxiv. org/abs/2604.16529 Paper Title: "Scaling Test-Time Compute for Agentic Coding"

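Summary: Meta research finds that coding agents perform much better when they reuse short summaries of past attempts than when they reuse raw logs. The paper argues that for long-horizon coding tasks the main bottleneck has moved from code generation to how the agent's work is remembered and represented. The method turns each error-filled "messy trail" into a compact summary of the main hypothesis, progress, and failure points, and the system uses tournament-style selection of the best summaries to guide a new round of attempts. In tests with Claude 4.5 Opus, the method raised the SWE-Bench Verified score from 70.9% to 77.6%, which shows that the key to better performance is storing experience in a reusable form.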
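Tournament-style selection over small groups, as opposed to one giant comparison, can be sketched as follows. The `judge` here is a stand-in that reads a hidden `quality` field; in the paper's setting it would be a model voting over a handful of summaries. The summary records, the group size, and all names are hypothetical.

```python
def tournament(candidates, judge, group_size=4):
    """Repeatedly split the pool into small groups, keep each group's winner,
    and stop when one candidate remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    pool = list(candidates)
    while len(pool) > 1:
        pool = [judge(pool[i:i + group_size])
                for i in range(0, len(pool), group_size)]
    return pool[0]

# Hypothetical rollout summaries; 'quality' is hidden ground truth for the demo.
summaries = [{"id": i, "quality": (i * 37) % 101} for i in range(16)]

judge_calls = 0

def judge(group):
    """Stand-in judge: picks the best summary in a small group and counts calls."""
    global judge_calls
    judge_calls += 1
    return max(group, key=lambda s: s["quality"])

winner = tournament(summaries, judge, group_size=4)
print("winner:", winner, "judge calls:", judge_calls)
```

With 16 summaries and groups of 4, the judge sees at most 4 summaries at a time and is called 5 times in total. The selected summaries would then be handed to a fresh round of attempts, which this sketch does not model.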

Rohan Paul @rohanpaul_ai · May 23 · 61

This paper shows that agent performance depends less on prompts alone and more on the harness around them. “Agent intelligence” is becoming partly a systems problem. The problem is that many AI agents look like 1 model, but their real behavior comes from surrounding code that controls planning, tools, memory, retries, checking, and stopping. A model may reason well in one step, but long tasks fail in messier places: state disappears, verification drifts, tools return partial evidence, and the agent forgets which intermediate artifact actually matters. Natural-Language Agent Harnesses try to make that control layer visible. Instead of burying the logic in controller code, they express the stages, roles, contracts, state rules, failure modes, and stopping conditions in structured natural language that a shared runtime can execute. The claim is not that natural language should replace code, but that the important design choices around an agent should become inspectable, portable, and testable instead of hiding inside one framework’s habits. On SWE-bench, heavier harnessing changed behavior dramatically, with more calls, tools, delegation, and runtime, but it did not produce a simple win curve; sometimes added structure helped, and sometimes it pushed the agent away from the shortest benchmark-aligned repair. A harness is not magic scaffolding around a model; it is a set of bets about where reliability comes from. ---- Paper Link – arxiv. org/abs/2603.25723 Paper Title: "Natural-Language Agent Harnesses"

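Summary: This study argues that an AI agent's real performance depends more on the external control system around the model (the agent harness) than on prompts alone. Many agents look like a single model, but their behavior is driven by surrounding code for planning, tool calls, memory management, and so on, so long tasks tend to fail through lost state, verification drift, and similar issues. The paper proposes "Natural-Language Agent Harnesses", which express the control flow explicitly in structured natural language so that it is inspectable, portable, and testable. The study finds that heavier harnesses change agent behavior markedly but do not bring consistent gains, which indicates that harness design is a set of choices about where reliability comes from rather than an instant cure-all.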

Rohan Paul @rohanpaul_ai · May 23 · 55

AI detectors fail because student writing is too varied to judge from 1 document. The problem is not only that AI writing is getting better, but that many real students write in ways that can look statistically close to AI output. The paper frames this as a testing problem where the detector does not know each student’s normal writing style, so “human writing” is not 1 fixed target. Because of that, any detector that catches many AI-written submissions must also wrongly accuse some real students, especially students whose writing is more structured, formulaic, or shaped by learning English. The authors use basic statistics to show that this false-accusation problem is not just a bug in current tools, because it appears whenever student writing overlaps with AI writing. A university is not comparing “AI text” with “human text”; it is comparing one submission with the unknown writing habits of one particular student. Better detectors may reduce some errors, but they cannot erase the structural problem created by one-shot judgment. ---- Paper Link – arxiv. org/abs/2603.20254 Paper Title: "AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits"

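Summary: The study argues that AI detectors fail so often because student writing styles are diverse, which makes it very hard to judge from a single document whether text was AI-generated. The problem is not only that AI writing keeps improving, but that many real students write in ways that are statistically very close to AI output. A detector cannot know each student's individual writing habits in advance, so there is no fixed standard for "human writing". Any detector that catches a large share of AI text will therefore inevitably misjudge some real students, especially those whose writing is more structured, formulaic, or shaped by learning English. Existing techniques may lower the error rate, but they cannot remove the structural misjudgment created by one-shot judgment.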
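The structural trade-off can be illustrated with a simple two-Gaussian model of detector scores. This is an illustration of the argument, not the paper's own framing: it assumes each student's honest writing produces scores that are normally distributed around a personal mean, that AI text scores around `mu_ai`, and that the detector uses one threshold for everyone. All numbers are hypothetical.

```python
from statistics import NormalDist

def false_accusation_rate(mu_student, mu_ai=2.0, sigma=1.0, target_tpr=0.9):
    """Probability that a student's honest submission is flagged, when the
    threshold is set so that target_tpr of AI-written text is caught."""
    threshold = NormalDist(mu_ai, sigma).inv_cdf(1.0 - target_tpr)
    return 1.0 - NormalDist(mu_student, sigma).cdf(threshold)

# Students whose normal style sits closer to the AI score distribution
# (for example, more formulaic writing) are flagged far more often.
for label, mu in [("loose, idiosyncratic style", -1.0),
                  ("typical style", 0.0),
                  ("structured, formulaic style", 1.5)]:
    print(f"{label:30s} false accusation rate = {false_accusation_rate(mu):.3f}")
```

Under these assumptions the rate rises steadily as a student's own distribution moves toward the AI distribution, and a student whose style coincides with it is flagged as often as AI text is caught. A better detector widens the gap between the means but cannot make the overlap vanish for every student.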

Rohan Paul @rohanpaul_ai · May 23 · 64

New Google paper shows that wearable data becomes far more useful when AI learns the person behind the signals. It is not another heart-rate algorithm, but a general model trained on more than one trillion minutes of sensor data from five million people. The authors propose SensorFM, a foundation model trained on more than 1 trillion minutes of unlabeled wearable data from 5 million people, so it can learn general patterns of human physiology before seeing specific health tasks. That scale changes the problem from measuring isolated events to learning patterns of lived physiology: sleep, movement, temperature, oxygen, heart rhythms, and their ordinary daily messiness. Wearables are not weak because they lack data; they are weak because most systems compress that data into crude summaries before the meaningful structure has a chance to appear. SensorFM tries to learn that structure first, then reuse it across tasks, which is why the same representation can help with cardiovascular, metabolic, mental health, sleep, lifestyle, and demographic predictions. The evidence is strongest as a scaling story: larger models trained on more data performed better, and the learned embeddings beat engineered-feature baselines on 34 of 35 prediction tasks. ---- Paper Link – arxiv.org/abs/2511.15352v3 Paper Title: "People readily follow personal advice from AI but it does not improve their well-being"

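Summary: Google Research proposes SensorFM, a foundation model that learns general patterns of human physiology from more than 1 trillion minutes of wearable sensor data produced by over 5 million people. The model goes beyond traditional approaches that compress data into simple metrics: it extracts meaningful structure from the data and reuses it across many health prediction tasks. Experiments show that performance improves with model size and data volume, and that the learned representations beat engineered-feature baselines on 34 of 35 prediction tasks.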

Rohan Paul @rohanpaul_ai · May 23 · 79

Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrained worlds. The striking result is not that the system “thinks like a mathematician,” but that it keeps forcing its thoughts through Lean, where every step must compile. The problem is that LLMs can sound convincing in math while still making tiny mistakes, so the authors use Lean, a proof system that checks every logical step. Their system, AlphaProof Nexus, lets an LLM keep editing a formal proof, read compiler errors, try again, and sometimes ask a stronger proof tool for help on smaller subproblems. The stronger version also keeps a shared pool of partial proof attempts, rates which ones look promising, and uses those attempts to guide later searches. That changes the role of the model from a persuasive storyteller into a generator of candidates that can be killed quickly when they are wrong. The verifier is not a cosmetic add-on, it is the mechanism that makes exploration tolerable. Without it, a beautiful proof sketch can hide a false lemma; with it, the model has to turn insight into executable logic, or fail visibly. The authors tested the system on real unsolved math problems, including 353 formalized Erdős problems and 492 open conjectures from the Online Encyclopedia of Integer Sequences. The main result is that the best agent solved 9 Erdős problems and proved 44 sequence conjectures, while also helping with problems in optimization, graph theory, algebraic geometry, and quantum optics. The failures are as revealing as the wins, because the agents sometimes buried the hard part inside a helper lemma or hallucinated a known result, exactly the kind of error formal checking is built to expose. The real shift is not full mathematical autonomy, but a new division of labor: humans choose the formal question, libraries define the terrain, models propose routes, and the proof assistant refuses to be impressed. 
---- "Advancing Mathematics Research with AI-Driven Formal Proof Search" Paper Link – arxiv.org/abs/2605.22763

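Summary: Google DeepMind presents AlphaProof Nexus, a system that combines large language models with the Lean formal verification tool. The LLM keeps reading Lean's compiler errors and revising its proof as it goes, and it can call a stronger tool for help with subproblems. This forces the model to turn every logical step into code that compiles and can be checked, changing its role from "persuasive storyteller" to "candidate generator". In tests on 353 Erdős problems and 492 open conjectures, the system solved 9 Erdős problems and proved 44 sequence conjectures. The work shows the key role of formal verification in exposing AI logic errors and in a new division of labor: humans pose the question, models explore, and the verifier gates the result.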

Rohan Paul @rohanpaul_ai · May 22 · 46

This RAI Institute robot manages 3-ball juggling through dynamic hand adjustments. It processes visual and contact information to maintain the pattern without external aids.

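Summary: This RAI Institute robot juggles three balls through dynamic hand adjustments. It processes visual and contact information to maintain the pattern, with no external aids.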

Chubby♨️ @kimmonismus · May 22 · 54

University of Tokyo built a chip component that processes data 1000x faster than conventional methods - without generating extra heat. The real number worth paying attention to: power consumption drops to 1/100th of current levels. A Google-scale data center that today powers 80,000 homes could theoretically run on the energy of 800. But the prototype chip isn't scheduled until 2030, and commercial availability is years beyond that. We're watching the AI industry sprint toward an energy wall at full speed while the most promising efficiency breakthroughs are still a decade from production. via techradar

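Summary: The University of Tokyo has developed a new chip component that processes data 1000x faster than conventional methods without generating extra heat. The key figure is power consumption at one hundredth of current levels, which in theory would cut the energy use of a Google-scale data center to 1% of today's and greatly ease the AI industry's energy pressure. However, the prototype chip is not expected until 2030 and commercialization will take longer still, which underscores the gap between AI's rapid growth and the time needed to bring breakthrough efficiency technology into production.

Berryxia.AI@berryxia · May 22 · 66

Folks, Apple's Persona team has pushed digital-human realism to a new high. Just ahead of WWDC26 they released a new paper on the latest advances in facial capture and animation. Judging by the demo video, capture accuracy and animation naturalness have taken another clear step forward, especially in eye micro-expressions, subtle head movements, and skin texture. This is no longer a simple “digital avatar”; it is getting closer and closer to a believable digital double. For AR/VR, gaming, and remote collaboration, breakthroughs like this decide whether “immersion” holds up at all. After all, once you put on a headset, the first thing to break the illusion is usually the sense that “this person looks fake.” Apple is clearly still betting heavily on this track. The paper and demo are here (the video is strongly recommended): https://apple.github.io/ml-headsup/ Anyone have time to try it and see how it actually performs?

Summary: Ahead of WWDC26, Apple's Persona team released a new paper showing the latest advances in facial capture and animation. The demo shows marked improvement in details such as eye micro-expressions, subtle head movements, and skin texture, making the digital likeness more realistic still: beyond a simple “digital avatar” and approaching a believable “digital double.” Such breakthroughs are critical to immersive experiences in AR/VR, gaming, and remote collaboration, where they can break down the “unreal” feeling of virtual interaction. Apple continues to invest heavily in this track, and the paper and demo video are public.

Saining Xie@sainingxie · May 22 · 60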

check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders. spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!

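Summary: RAEv2 substantially simplifies the architecture and improves generality, achieving more than 10x faster convergence on tasks such as text-to-image (T2I) and world models while also improving reconstruction and generation quality. Through extensive experiments, the team found that strong representation encoders are key for pixel decoders. Traditional metrics such as FID are no longer enough to measure model performance fully; new metrics such as ep@fid-k/fdr^k show that generative modeling still has a lot of room left to explore.

Ethan Mollick@emollick · May 22 · 61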

Seems GPT-5.2 reaches expert level in peer review: 45 scientists took 469 hours evaluating human & AI reviews on 82 papers. "Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..." though not without weaknesses.

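Summary: GPT-5.2 appears to reach expert level in peer review: 45 scientists spent 469 hours evaluating human and AI reviews of 82 papers. “Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review...” though not without weaknesses.

AK@_akhaliq · May 22 · 68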

Mix-Quant Quantized Prefilling, Precise Decoding for Agentic LLMs

Summary: Mix-Quant: quantized prefilling and precise decoding for agentic LLMs.

AK@_akhaliq · May 22 · 56

LongMINT Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

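Summary: LongMINT: evaluating memory under multi-target interference in long-horizon agent systems.

Ethan Mollick@emollick · May 21 · 55

In science, AI still does a poor job at finding interesting questions to solve in fields that don't have lists of known issues. This has always been the hardest thing to teach PhDs: otherwise you find small problems, or problems that don't advance the field or don't generalize, etc.

Summary: In science, AI still does poorly at finding interesting questions worth solving, especially in fields without lists of known issues. This has always been the hardest skill to teach in PhD training: without it, you find only small problems, or problems that do not advance the field or do not generalize.

Orange AI@oran_ge · May 21 · 81

A milestone moment for AI. An unreleased internal reasoning model from OpenAI autonomously solved the planar unit distance problem that Erdős posed in 1946. The chain of thought runs 125 pages, and the core move was to bring a toolkit over from algebraic number theory to attack a discrete geometry problem, a cross-field connection humans had not thought of in 80 years. Most interesting of all, the model was not trained specifically for math; it is a general-purpose reasoning model. This suggests that once reasoning ability passes a certain threshold, creativity emerges naturally. Congratulations, humanity.

Summary: An unreleased, internal general-purpose reasoning model from OpenAI autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, overturning what the field had broadly expected about the structure of the solution for nearly 80 years. Through a 125-page chain of thought, the model applied tools from algebraic number theory to a discrete geometry problem, a methodological breakthrough across fields. More notably, the model was not trained specifically for math; the result suggests that general reasoning ability past a certain threshold may give rise to creativity naturally, and it marks a key step for AI in basic science.

Greg Brockman@gdb · May 21 · 78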

our math result is a milestone in new knowledge generation by AI. very exciting to imagine similar results in other scientific fields. "It's very hard to sleep, man" is a pretty good reaction.

Summary: AI has reached a milestone in generating new knowledge in mathematics. An OpenAI model resolved a famous long-standing question in combinatorial geometry, the planar unit distance problem (Erdős, 1946), by showing that the number of unit-distance pairs can reach n^{1+δ} for some fixed δ > 0, a polynomial improvement over the grid-based constructions known before. This marks important progress from AI solving known problems toward AI discovering new mathematics. The result drew a “hard to sleep” reaction from researchers and is seen as a sign that the AGI era is approaching.

Rohan Paul@rohanpaul_ai · May 21 · 78

AI in math is creating history again, as OpenAI's general-purpose reasoning model has disproved a major Erdős conjecture from 1946. The important part is not that AI solved a hard math problem, but how little special machinery it needed. For decades, the planar unit distance problem looked almost embarrassingly simple: place points on a plane, then ask how many pairs can be exactly one unit apart. For decades, the best examples looked like stretched versions of a square grid, so mathematicians believed grids were almost the best possible design. OpenAI’s internal model broke that picture by finding an infinite family of constructions that gives a polynomial improvement, with the proof checked by external mathematicians. The point to note is that the model was not a bespoke theorem-proving engine trained only for this problem, and the official post says its success improved with more test-time compute, meaning more reasoning at inference rather than only more training. That matters because research progress often comes from holding a fragile chain of ideas together long enough to cross from one field into another. In this case, the bridge ran from a plain geometric question into deep algebraic number theory, including machinery like infinite class field towers and Golod–Shafarevich theory. And now a general-purpose reasoning system appears able to search a conceptual space where human taste, field boundaries, and inherited guesses may have quietly narrowed the path. So the future is not machines replacing judgment, but machines widening the map before judgment begins.

Summary: OpenAI's general-purpose reasoning model autonomously resolved the planar unit distance problem, a famous mathematical question open since 1946. The model was not a theorem-proving engine built specifically for math; by spending more compute at inference time, it found new constructions that beat the traditional grid structure. This marks the first time AI has autonomously solved a central open problem in a field of mathematics. More importantly, the model connected a geometric question to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-field research and for widening the boundaries of human understanding.

Rohan Paul@rohanpaul_ai · May 21 · 67

A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive models don't do: exploring multiple reasoning paths at the same time. Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously. The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer. GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories. At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps. On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps. On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout. The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%. --- Paper Link – arxiv.org/abs/2605.19376v1

Summary: GRAM, a model with only 10 million parameters, injects learnable randomness so that it explores several different paths in parallel at inference time, breaking the limit of conventional recursive models that lock onto a single line of thought. At test time the model runs these parallel trajectories together and uses a reward predictor to select the best result, adding a “width” dimension on top of depth. In experiments GRAM reaches 97% accuracy on hard Sudoku, well above the previous best deterministic model; it also holds up on the multi-solution N-Queens problem and generates valid Sudoku puzzles efficiently. The framework offers a new way to improve the reasoning ability of small models.
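
The "width" axis described above, many stochastic refinement paths followed by a selection step, can be illustrated on a toy problem. This sketch is not GRAM: the one-dimensional objective, the annealed Gaussian noise (standing in for the learned randomness), and the use of the true cost as the scorer (standing in for the learned reward predictor) are all assumptions made for illustration, and the names `refine` and `best_of` are hypothetical. It shows why a deterministic refiner started in the wrong basin stays there, while sampling several noisy paths and keeping the best-scored one can only match or improve on a single sample.

```python
import random


def cost(z: float) -> float:
    """Toy objective: shallow well near z = +1, deeper well near z = -1."""
    return (z * z - 1.0) ** 2 + 0.3 * z


def grad(z: float) -> float:
    return 4.0 * z * (z * z - 1.0) + 0.3


def refine(z0: float, steps: int, noise: float, rng: random.Random) -> float:
    """Run `steps` refinement updates; noise > 0 makes each update stochastic."""
    z = z0
    for t in range(steps):
        scale = noise * (1.0 - t / steps)      # anneal noise so paths settle
        z = z - 0.05 * grad(z) + scale * rng.gauss(0.0, 1.0)
        z = max(-2.0, min(2.0, z))             # keep the toy state bounded
    return z


def best_of(width: int, steps: int = 200, noise: float = 0.3, seed: int = 0) -> float:
    """Sample `width` trajectories and keep the one the scorer rates best."""
    finals = [refine(0.5, steps, noise, random.Random(seed + i))
              for i in range(width)]
    return min(finals, key=cost)


# Depth only: the deterministic path slides into the nearer, shallower well.
deterministic = refine(0.5, 200, 0.0, random.Random(0))
print("deterministic:", round(deterministic, 3), "cost", round(cost(deterministic), 3))

# Depth plus width: more sampled paths, then selection by score.
for width in (1, 20):
    z = best_of(width)
    print("width", width, "->", round(z, 3), "cost", round(cost(z), 3))
```

Because the width-20 run reuses the width-1 trajectory as its first sample, its selected cost can never be worse; how often it actually escapes to the deeper well depends on the noise scale and seed.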

Chubby♨️@kimmonismus · May 21 · 84

OpenAI made history today. An internal reasoning model autonomously disproved a famous conjecture in mathematics that stood for nearly 80 years. The problem: In 1946, Paul Erdős asked how many pairs of points can be exactly 1 unit apart if you place n points on a flat surface. The best known answer came from square grid constructions, and Erdős himself conjectured you can't do meaningfully better. Mathematicians believed this for decades. The AI proved him wrong. It found entirely new point configurations that beat the square grid by a fixed polynomial factor, not a marginal improvement, a real mathematical gap. The proof uses methods from algebraic number theory, a completely different branch of math, Class field towers, Golod-Shafarevich theory, tools nobody expected to be relevant to a geometry problem about distances in the plane (reminds me of move 37, AlphaGo tbh). Fields Medalist Tim Gowers calls it "a milestone in AI mathematics." The proof was verified by leading external mathematicians. According to OpenAI, this is the first time AI has independently solved a prominent open research problem in mathematics! Caveat: Obviously OpenAI chose which problems to test the model on. So "autonomous" means the model generated the idea and wrote the proof, not that it wandered into the problem on its own. But if reasoning models can reliably make cross-domain connections like this, finding paths that experts didn't prioritize, this changes research far beyond math. Biology, physics, materials science, medicine. This isn't AI reproducing human knowledge anymore. This is AI producing new knowledge. That's a qualitative shift.

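Summary: An internal OpenAI reasoning model autonomously resolved the planar unit distance problem, a famous open question in mathematics that had stood for nearly 80 years. The model disproved Paul Erdős's conjecture by finding entirely new point configurations that beat the traditional square-grid construction by a fixed polynomial factor. The proof draws on cross-disciplinary methods such as algebraic number theory, was verified by external mathematicians, and was called “a milestone in AI mathematics” by Fields Medalist Tim Gowers. It is the first time AI has independently solved a central open problem in mathematics, marking an important shift from reproducing knowledge to creating it, and its cross-domain reasoning may have far-reaching effects on research in many disciplines.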
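
To make the problem statement concrete, the count below compares two ways of using the same 10 × 10 integer grid. It is a small brute-force illustration of the classical grid idea the posts refer to, not the new construction: `square_grid` and `count_pairs` are hypothetical helper names. Counting pairs at squared distance 25 is equivalent to shrinking the grid by a factor of 5 and counting unit distances, and 25 can be written as a sum of two squares in several ways (5² + 0², 3² + 4², 4² + 3²), so more directions contribute pairs.

```python
from itertools import combinations
from typing import List, Tuple

Point = Tuple[int, int]


def square_grid(m: int) -> List[Point]:
    """All integer points (x, y) with 0 <= x, y < m."""
    return [(x, y) for x in range(m) for y in range(m)]


def count_pairs(points: List[Point], d2: int) -> int:
    """Count unordered pairs whose squared distance equals d2 (exact integers)."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


grid = square_grid(10)                    # n = 100 points
unit_pairs = count_pairs(grid, 1)         # spacing 1: axis-aligned neighbours only
rescaled_pairs = count_pairs(grid, 25)    # spacing 1/5 after rescaling
print(unit_pairs, rescaled_pairs)         # 180 268
```

The same 100 points give 180 unit-distance pairs at spacing 1 but 268 once the grid is rescaled so that length 5 becomes the unit, which is the sense in which "stretched" or rescaled grids were long the best known examples.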

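AYi@AYi_AInotes · May 21 · 76

Honestly, I read this OpenAI post three times. On the first pass I understood “AI cracks an 80-year-old math mystery”; on the second, “a geometry problem broken with number theory”; only on the third did it hit me that the most striking thing is not the result but that the AI came up with this route on its own, a route we humans had considered too obscure to be worth taking for 80 years. The problem is called the planar unit distance problem, posed by Erdős in 1946. Put simply: scatter a set of points on the plane so that as many pairs as possible are exactly distance 1 apart. For 80 years all mathematicians believed one conclusion: the optimal solution looks like a square grid and cannot be improved further. OpenAI's AI said: you are wrong. It found a whole family of new constructions that are not square grids and are clearly more efficient than them. What tools did it use? The most obscure corner of algebraic number theory: infinite class field towers and Golod–Shafarevich theory. Geometers and number theorists basically never used to talk to each other, and the AI is telling them they should 🤣 Fields Medalist Tim Gowers wrote in his review: if a human had written this, I would recommend acceptance at the Annals of Mathematics outright. Number theorist Arul Shankar said: AI is not just an assistant; it had an original, ingenious idea and carried it through completely. Its 125-page chain of thought has been made public and human mathematicians have verified it, which shows this is not hype. AI's role in math used to be clear: assist with verification, help people compute, search for known patterns. This time is different. The AI thought up a route by itself, one humans had considered too obscure, too counterintuitive, and not worth taking for 80 years, and the AI took it anyway and made it work. How many of the routes humans never tried because they seemed unreliable actually go through? It sends a bit of a chill down the spine, but mostly it makes me look forward to what comes next, haha.

Summary: An OpenAI model autonomously cracked the “planar unit distance problem,” a famous open problem posed by the mathematician Erdős in 1946. For nearly 80 years the field broadly believed the optimal construction resembled a square grid; using the little-known Golod–Shafarevich theory from algebraic number theory, the model found a whole family of new, more efficient constructions and overturned that view. The achievement marks the first time AI has independently solved a central open problem in a field of mathematics, and the key was proposing and fully carrying out an innovative route that humans had never tried because intuition said it would not work.

Z.ai@Zai_org · May 21 · 75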

http://x.com/i/article/2057206923208884224

# Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks?

LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn't true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost.

To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way.

In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone:

- Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%.
- Throughput improvement: Average GPU inference throughput increased by 15%.
- Latency improvement: TTFT P99 was reduced by 40.6%.

The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently.

ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design.
At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters.

# Network Becoming a Bottleneck for Effective Inference

When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly.

To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT).

For example, when network bandwidth was increased from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19%, while Time to First Token, or TTFT, decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance.

# 1. Network Congestion in Inference

Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns.
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance.

To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index ("rail"), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches.

ROFT works well for certain training traffic patterns. However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance. This manifests in several ways:

- Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity.
- Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5.
- Link congestion further amplifies tail latency, affecting both TTFT and overall throughput.

It is important to distinguish between the two types of network congestion, as illustrated in Figure 6:

- Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable.
- Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization.
Fundamentally, it is an architecture-level design problem.

For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer. Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.

# 2. ZCube Network Architecture

To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect. The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth. Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.

## 2.1 ZCube Core Architecture

As shown in Figure 7, the core ideas of ZCube are:

1. Remove the Spine switch layer.
2. Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
3. Establish a complete bipartite interconnect between the two switch groups.
4. Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.

Suppose each GPU has a corresponding NIC with two ports, i.e., p = 2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:

- The first port connects to the odd-numbered switch ((i − 1) mod (n/k)) × 2 + 1.
- The second port connects to the even-numbered switch ⌈i/k⌉ × 2.

The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch. A ZCube topology under dual-port NIC configuration, with p = 2, n = 32, and k = 8, is shown in Figure 7.

## 2.2 Key Properties of ZCube

Network Diameter

ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.

Load Balancing

First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection. Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced. As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion.

Scalability

ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs.

Cost

At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment. These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost.

## 2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs

We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture.
Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube.

To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe.

Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster's inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%.

In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable.
Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services.

# 3. Conclusion

LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system. The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency.

Looking ahead to next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic.

Looking ahead, we will continue pioneering novel AI network architectures for larger-scale inference and training clusters, upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness.

# Acknowledgements

ZCube was published at ACM SIGCOMM 2025, and was recognized as likely to “significantly change the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort.

## Reference

[1] NVIDIA. 2023. SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/https:/docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf
[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/
[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.
[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.
[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.
[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).

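Summary: As long-context inference and Prefill-Decode disaggregated deployment become mainstream, the GPU cluster network has gone from a secondary component to a key bottleneck for inference throughput, tail latency, and cost. Traditional static topologies clash with dynamic, asymmetric KV Cache traffic and cause localized congestion. In response, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture, which uses a fully flattened topology with a hybrid access design to decouple and spread traffic at the source and reduce congestion. In GLM-5.1 production testing, with GPUs and the software stack unchanged, ZCube cut switch and optical module cost by 33%, raised average inference throughput by 15%, and reduced P99 time to first token by 40.6%, showing that network architecture innovation can unlock hardware potential.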
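
The port-mapping rule in section 2.1 of the ZCube article is easy to check by direct computation. The sketch below encodes the two formulas as written there and verifies the article's example (p = 2, n = 32, k = 8). The function names are hypothetical, and two pieces are assumptions rather than statements from the article: `switch_hops` takes the two-hop path to be "odd switch of the source, then even switch of the destination" over the complete bipartite links, and `max_gpus` assumes each switch splits its ports evenly between GPU-facing links and links to the other group, with a 128 × 400Gbps switch used as 256 × 200Gbps ports. That reading happens to reproduce the 16,384-NIC figure, but the article does not spell it out.

```python
from typing import Dict, Tuple


def zcube_ports(i: int, n: int, k: int) -> Tuple[int, int]:
    """(odd_switch, even_switch) for GPU i (1-based), per the section 2.1 rule."""
    if n % k != 0 or not 1 <= i <= n:
        raise ValueError("need n divisible by k and 1 <= i <= n")
    per_group = n // k                       # switches in each group
    odd = ((i - 1) % per_group) * 2 + 1      # first port: odd-numbered switch
    even = ((i + k - 1) // k) * 2            # second port: ceil(i / k) * 2
    return odd, even


def fanout(n: int, k: int) -> Dict[int, int]:
    """Number of GPU ports attached to each switch."""
    counts: Dict[int, int] = {}
    for i in range(1, n + 1):
        for switch in zcube_ports(i, n, k):
            counts[switch] = counts.get(switch, 0) + 1
    return counts


def switch_hops(a: int, b: int, n: int, k: int) -> int:
    """1 if GPUs a and b share a switch, else 2 via an odd-to-even link."""
    odd_a, even_a = zcube_ports(a, n, k)
    odd_b, even_b = zcube_ports(b, n, k)
    if odd_a == odd_b or even_a == even_b:
        return 1
    return 2  # assumed path: a -> odd_a -> even_b -> b


def max_gpus(ports_per_switch: int) -> int:
    """Assumed sizing: half the ports face GPUs (k), half face the other group."""
    half = ports_per_switch // 2
    return half * half                       # n = k * (n / k)


print(zcube_ports(1, 32, 8), zcube_ports(10, 32, 8), zcube_ports(32, 32, 8))
print(sorted(fanout(32, 8).items()))
print(max_gpus(256))
```

For the 32-GPU example every one of the 8 switches ends up with exactly 8 GPU ports, and no GPU pair needs more than two switch hops, which matches the diameter claim in section 2.2.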

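Emad@EMostaque · May 21 · 91

Once AI starts solving open problems in novel ways it won’t stop. We are entering the final stage of human solutions to open problems like this. Feels weird, doesn’t it?

Summary: An OpenAI model has, for the first time, autonomously resolved the planar unit distance problem posed by Paul Erdős in 1946, a breakthrough that overturns the conjecture mathematicians had mostly accepted for nearly 80 years. The AI not only gave a better answer but found a whole new family of constructions. The event is seen as a milestone in AI capability: it suggests AI is beginning to make sustained progress on open scientific problems in novel ways, and may mark the arrival of the “final stage” of human-led solutions to such problems.

Greg Brockman@gdb · May 21 · 92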

An OpenAI model has achieved a major breakthrough in mathematics, by disproving a central conjecture in discrete geometry that was first posed by Paul Erdős in 1946. This is the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

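Summary: An OpenAI model has made a major breakthrough in discrete geometry, autonomously resolving the planar unit distance conjecture first posed by the mathematician Paul Erdős in 1946. It is the first time AI has independently solved a prominent open problem central to a field. For nearly 80 years, mathematicians had generally believed the best constructions looked roughly like a square grid; the OpenAI model found new, better-performing constructions and overturned that long-held belief.

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · May 21 · 87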

"This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics."

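Summary: An OpenAI model autonomously cracked the planar unit distance problem, a famous open problem in mathematics for nearly 80 years. Posed by Paul Erdős in 1946, the problem was long thought to have optimal structures close to a square grid. The model's discovery overturns that long-standing assumption and constructs new, better-performing configurations, marking the first time AI has independently solved a major open problem central to a field of mathematics.

All AI Updates

Full feed of AI-related news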


May 26

23:29

Ant Ling@AntLingAGI

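Featured for the same event · 68

The team introduces KPop to stabilize agentic reinforcement learning for large MoE models. It replaces the fixed-ratio mask used in the earlier IcePop method with an adaptive masking mechanism based on binary KL divergence, which adjusts dynamically to the degree of training-inference mismatch during training. With this change, the Ring-2.6-1T model scored above 76 on SWE-bench Verified through pure RL training, with no infrastructure changes or routing replay.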

Jia Guo: Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes!🥳 Last year, we releas...

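Agents · Data/Training · Coding · Papers/Research

Same event, featured as “Ant inclusionAI releases trillion-parameter reasoning model Ring-2.6-1T”

Why it's recommended: The Ant team upgraded IcePop to KPop, moving from a fixed mask to an adaptive KL region, which is a clever idea. Ring-2.6-1T reaches 76+ on SWE-bench with pure RL; anyone working on agentic RL training should look through the blog.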

23:27

Berryxia.AI@berryxia

44

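Summary of the paper “Language Models Need Sleep”

In the paper “Language Models Need Sleep” (arXiv 2605.26099), a team from CMU and UMD argues that conventional Transformer models become inefficient on long tasks because of the high computational complexity of attention and the steadily growing memory footprint of the KV cache. They propose a biologically inspired “sleep-like consolidation” mechanism: the model periodically enters a “sleep” state, processes the recent context offline over multiple passes, consolidates the information into fast weights in the model's state-space blocks, and then clears the KV cache. Experiments show that increasing sleep depth or duration significantly improves the model's subsequent reasoning ability. The framework is fully open source and offers a new paradigm for long-context processing that does not rely on brute-force memory scaling.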

himanshu: very cool research (and nomenclature)

arXiv · Open-Source Ecosystem · Reasoning · Papers/Research

23:03

Rohan Paul@rohanpaul_ai

61

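Paper proposes Self-play SWE-RL, which improves software agents through self-play

Meta, CMU, and other institutions propose Self-play SWE-RL. The method lets coding agents generate their own training data through self-play rather than relying only on human-labeled issues. Specifically, one model explores a codebase, injects a bug, and leaves behind test cases that describe the problem; another model learns to repair the system according to the tests. Tests thus become the core language for describing the problem. The method gains +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro. Notably, evaluation used natural-language issues the system had not trained on, suggesting it may have learned a deeper understanding of software.

Agents · arXiv · Meta · Coding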

22:33

Rohan Paul@rohanpaul_ai

57

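AutoResearchClaw: an autonomous research framework that supports human-AI collaboration

Meta, Stanford, and other institutions propose AutoResearchClaw, a framework for autonomous research by AI agents. Its core idea is to turn the research process into a process-constrained loop rather than a simple production line. The system integrates debate, repair, verification, memory, and selective human feedback, and treats failure as valid evidence. On the ARC-Bench benchmark, it outperforms AI Scientist v2 by 54.7% on tasks such as result analysis. Human-collaboration experiments show an acceptance rate of 87.5% in CoPilot mode (intervening at the right moments), against only 25% for full autonomy and 50% for step-by-step supervision. One key failure case showed that when all cross-validation methods returned the same zero-deviation output, the system passed numerical verification yet lost scientific meaning, underscoring the critical role of human judgment.

Agents · Google · Meta · Papers/Research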

22:28

Ant Ling@AntLingAGI

62

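SwiGLU is everywhere in modern large language models, but for large inputs it behaves like x². This quadratic growth inflates activations, amplifies outliers, and makes deep networks and low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a plug-and-play activation function designed for stable large-scale pretraining. 🧵

Reasoning · Data/Training · Papers/Research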
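
The x² claim can be checked numerically in the scalar case. SwiGLU is commonly written as Swish(xW) ⊙ (xV); the sketch below assumes both projections are the scalar 1, so the function reduces to swish(x) · x = x² · sigmoid(x). That simplification and the name `swiglu_scalar` are assumptions made for illustration. PowLU itself is not sketched, because the post gives no formula for it.

```python
import math


def swiglu_scalar(x: float) -> float:
    """SwiGLU with both projections fixed to 1: swish(x) * x = x^2 * sigmoid(x)."""
    sigmoid = 1.0 / (1.0 + math.exp(-x))
    return (x * sigmoid) * x


# The ratio to x^2 is sigmoid(x), which approaches 1 as x grows,
# so large activations are effectively squared.
for x in (1.0, 5.0, 20.0):
    print(x, swiglu_scalar(x), swiglu_scalar(x) / (x * x))
```

At x = 20 the output is already within a few parts per billion of 400, which is the quadratic blow-up the post says becomes a problem for outliers and low-precision formats.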

18:28

X.PIN@thexpin

67

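Huawei AI chips: a scaling path that sidesteps process-node limits

Huawei will scale its Ascend AI chips through packaging and architecture innovation rather than smaller process nodes. According to a paper by He Tingbo, Huawei plans to advance its Ascend SuperPoD line between 2025 and 2030 using chiplets, 2.5D fan-out packaging, and 3D stacking; specific products include the 910C in 2025, the 950 in 2026, and the later 990. Around 2030, the Ascend 990 will introduce LogicFolding technology, with the goal of a 100x leap in integration density by 2035.

On-Device · Papers/Research · Deployment/Engineering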

15:00

Rohan Paul@rohanpaul_ai

59

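One engineering challenge for dexterous robotic hands is balancing strength and speed. Here SharpaWave performs rapid hand cycles at more than 4 per second. Its Dynamic Tactile Array uses visuo-tactile sensing: the fingertips integrate cameras and more than 1,000 tactile pixels.

Embodied AI · Multimodal · Papers/Research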

06:58

Rohan Paul@rohanpaul_ai

69

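New paper: LLMs should express uncertainty honestly rather than feign certainty

A new Google paper argues that the core of the LLM hallucination problem is that models sound certain when they should hesitate, not simply that they get facts wrong. The paper shifts the optimization goal from perfect factual accuracy to having the model honestly distinguish “I know this” from “I am guessing.” The authors propose the concept of “faithful uncertainty,” which requires a model's wording to match its internal confidence. The paper also introduces a “utility tax” to explain why products favor confident but possibly wrong answers. For agents, metacognition is critical: it determines when to call a tool and when to trust a source.

Google · Safety/Alignment · Papers/Research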

04:58

Rohan Paul@rohanpaul_ai

65

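AI agents perform better when code is their primary working layer

A research paper from Meta, Stanford, and Illinois argues that AI agents perform better when code is their primary working layer. The paper holds that large language models (LLMs), as text predictors, suffer from lost state and hidden errors on long tasks. The real advance is not “AI writes code” but “AI thinks inside a code environment.” At its core the paper proposes a code-centric “agent harness,” meaning the system of tools, memory, sandboxes, and so on. In this harness, tests become sensors, the repository becomes memory, logs become history, and the sandbox becomes the boundary. Generated scripts become objects that can be run, inspected, modified, and shared. In summary, code helps agents reason through executable steps, act through tool calls, and model the environment through tests, logs, and the like.

Agents · arXiv · Meta · Coding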

May 25

23:54

elvis@omarsar0

66

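Microsoft Research proposes SkillOpt, which uses an optimizer to learn AI agent skill documents automatically

Microsoft Research proposes SkillOpt, which treats an AI agent's skill document as trainable external state rather than something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions to improve the document, and a textual learning rate controls how aggressively each round rewrites it, while the agent itself stays unchanged. In experiments, SkillOpt was best or tied for best in all 52 test cells (spanning different models, benchmarks, and toolchains). On GPT-5.5 specifically, compared with no skill document, SkillOpt gained 23.5, 24.8, and 19.1 points under direct chat, Codex, and Claude Code respectively, beating human-written skills and other automated methods without adding inference-time overhead; the learned skills also transfer across models and toolchains.

Agents · Microsoft · Papers/Research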

19:28

Rohan Paul@rohanpaul_ai

75

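Huawei unveils breakthrough chip design method “LogicFolding”

Huawei proposes two new methods, “τ scaling” and “LogicFolding,” aimed at narrowing the performance gap with TSMC without relying on the most advanced lithography tools. The core idea is to shift the metric of chip progress from transistor size to signal propagation delay (τ). LogicFolding, the concrete implementation, stacks logic circuit layers vertically and uses hybrid bonding to place circuits that need to communicate right next to each other, shortening critical wires, lowering resistance and parasitic capacitance, and raising signal speed. Huawei says its next-generation Kirin phone chip will be the first full test of the τ scaling law.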

Rohan Paul: 🇨🇳 Huawei reveals a new chip design breakthrough under US sanctions pressure. A design approach meant to close the gap...

On-Device · Papers/Research

1 related discussion · IT之家 (RSS)

03:57

Rohan Paul@rohanpaul_ai

65

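Full attention revisited: converting full attention to sparse attention in under a hundred training steps

Alibaba and Nanjing University propose RTPurbo, a lightweight adaptation method. It finds that trained full-attention models contain hidden sparse structure. A lightweight 16-dimensional token finder acts as a “scout,” locating important tokens for the few key attention heads that need long-range information while the other heads attend mainly to local text. On this basis, RTPurbo achieves up to 9.36x speedup over FlashAttention-2 on 1M-token prefill and about 2x speedup in decoding, while keeping accuracy close to the full-attention model on long-context and reasoning benchmarks. The work suggests that the compute wasted in long-context inference has structure that can be exploited.

arXiv · Reasoning · Papers/Research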

02:57

Chubby♨️@kimmonismus

60

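Nine more Erdős problems have been solved, this time by Google DeepMind. This should not be underestimated: it raises competitive pressure, and it shows that other frontier labs can keep up easily.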

Przemek Chojecki | PC: Another 9 open Erdos problems solved, this time by DeepMind team. Interesting loop of LLM - Lean agents working autonomo...

DeepMind · Reasoning/Inference · Papers/Research

02:57

Rohan Paul@rohanpaul_ai

73

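Large MoE models may waste half their compute on easy tokens that need no expert help

The paper proposes ZEDA, a framework that turns static MoE models whose routing is fixed after training (such as Qwen3 and GLM) into dynamic ones, letting the router skip expert calls when a token is easy enough. On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA removes about 50% of expert compute with only a slight accuracy loss and delivers roughly a 20% real-world inference speedup. The study finds that compute is allocated mainly according to model uncertainty rather than simply tracking task difficulty.

Reasoning/Inference · Papers/Research · Deployment/Engineering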
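One way a router could "skip" experts, as the ZEDA summary describes, is to let it route to slots that do no work. The sketch below is an assumption for illustration only and is not taken from the paper: it prepends `NUM_NULL` hypothetical null slots to the router's logits, and any top-k position won by a null slot costs no expert compute. The logits are hand-picked toy values.

```python
import math

TOP_K = 2
NUM_NULL = TOP_K  # hypothetical: one "do nothing" slot per top-k position

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, top_k=TOP_K):
    """Return the real expert slots to run; null slots are dropped."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [slot for slot in ranked[:top_k] if slot >= NUM_NULL]

# Layout of each logit vector: [null, null, expert, expert, expert].
token_logits = {
    "easy":   [5.0, 4.0, 0.1, 0.2, 0.0],   # router is confident no expert is needed
    "medium": [2.5, -1.0, 3.0, 0.0, 0.0],  # one expert, one skipped position
    "hard":   [0.0, 0.0, 3.0, 1.0, 2.0],   # both positions go to real experts
}

calls = {name: route(logits) for name, logits in token_logits.items()}
total_calls = sum(len(slots) for slots in calls.values())
saved = 1 - total_calls / (len(token_logits) * TOP_K)
```

With these toy logits half of the expert calls are skipped, but that figure comes from the hand-picked inputs and says nothing about the reported results.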

May 24

20:27

Chubby♨️@kimmonismus

68

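German study: ordinary WiFi routers can identify individuals almost perfectly

Researchers at KIT in Germany show that an ordinary WiFi router is enough to identify individuals almost perfectly, with no phone, special hardware, or line of sight required. The system uses the unencrypted beamforming feedback that every connected device broadcasts. In tests with 197 subjects, identification accuracy approached 100%. The study notes that this surveillance infrastructure (routers in cafes, airports, and offices) is already everywhere, and that the core question is who will start reading and exploiting these signals.

Safety/Alignment · Papers/Research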

19:27

Rohan Paul@rohanpaul_ai

39

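New paper defines AGI as an "artificial scientist"

A new paper offers an explicit definition of artificial general intelligence (AGI) as an "artificial scientist." Such a system must, like a human scientist, plan experiments autonomously, learn causal relationships, and balance exploration against action. Its core is adaptability: adapting to new environments and tasks broadly, efficiently, and scientifically under limits on compute, memory, and energy. It is judged by its capacity for discovery and adaptation, not by passing anthropomorphic tests.

arXiv · Papers/Research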

May 23

23:51

elvis@omarsar0

64

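Tune the runtime interface, not the model, to make AI agents more general

A new study proposes improving AI agent performance by refining the runtime interface wrapped around a frozen LLM rather than modifying the model. The method turns recurring interaction failures into reusable interventions on the runtime layer, achieving an average relative improvement of 88.5% across 126 settings in 7 deterministic environments. The key finding is that runtime methods learned from a single model's trajectories transfer to 18 different model backbones, which indicates that they capture environment structure rather than model-specific patterns. This offers a more portable way to deploy AI agents in production.

Agents · Papers/Research · Deployment/Engineering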

22:57

Rohan Paul@rohanpaul_ai

60

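Scaling test-time compute for agentic coding

Meta researchers find that, on coding-agent tasks, reusing short summaries of past attempts works markedly better than reusing raw logs. The paper argues that for long-horizon coding tasks the main bottleneck has moved from code generation to how the agent's work is remembered and represented. The method converts each error-filled "messy trajectory" into a compact summary of its core hypotheses, progress, and failure points, and the system uses tournament-style selection to pick the best summary to guide the next attempt. On Claude 4.5 Opus, the method raised the SWE-Bench Verified score from 70.9% to 77.6%, suggesting that the key to better performance is storing experience in a reusable form.

Agents · Meta · Coding · Papers/Research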
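The "tournament-style selection" over compact summaries mentioned above can be sketched as a single-elimination bracket. This is an illustrative sketch, not the paper's code: the summary fields (`hypothesis`, `progress`, `failure`) follow the description in the text, while `toy_better` is a hypothetical stand-in for whatever judge the real system uses.

```python
from typing import Callable, Dict, List

Summary = Dict[str, str]

def tournament(summaries: List[Summary],
               better: Callable[[Summary, Summary], bool]) -> Summary:
    """Single-elimination bracket; better(a, b) is True when a beats b."""
    if not summaries:
        raise ValueError("need at least one summary")
    pool = list(summaries)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if better(a, b) else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out gets a bye
        pool = next_round
    return pool[0]

def filled(summary: Summary) -> int:
    """Count how many of the three fields carry any content."""
    return sum(bool(summary.get(key)) for key in ("hypothesis", "progress", "failure"))

def toy_better(a: Summary, b: Summary) -> bool:
    # Hypothetical judge: prefer the summary that records more of the attempt.
    return filled(a) >= filled(b)

attempts = [
    {"hypothesis": "off-by-one in pager", "progress": "", "failure": ""},
    {"hypothesis": "off-by-one in pager", "progress": "test_pager passes",
     "failure": "test_export still fails"},
    {"hypothesis": "", "progress": "", "failure": "timeout"},
]
winner = tournament(attempts, toy_better)
```

The winning summary would then be placed in the prompt for the next attempt in place of the raw log.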

21:27

Rohan Paul@rohanpaul_ai

61

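Study: AI agent performance depends more on the external control system than on the prompt

The study argues that an AI agent's real-world performance depends more on the external control system around the model (the agent harness) than on the prompt alone. Many agents look like a single model, but their behavior is driven by surrounding code for planning, tool calls, and memory management, so long tasks tend to fail through lost state or drifting verification. In response, the paper proposes a "natural-language agent harness," which expresses control flow explicitly in structured natural language so that it can be inspected, ported, and tested. The study finds that more complex harnesses change agent behavior substantially but do not deliver consistent performance gains, which suggests harness design is a key choice for reliability rather than an instant cure-all.

Agents · Papers/Research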

20:27

Rohan Paul@rohanpaul_ai

55

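Why AI detectors fail so easily: the diversity of student writing styles

The study argues that the root cause of frequent AI-detector failures is the diversity of student writing styles, which makes it extremely hard to judge from a single document whether it was AI-generated. The problem is not only that AI writes better, but that many real students write in styles whose statistical features already closely resemble AI output. A detector cannot know each student's writing habits in advance, so there is no fixed standard for "human writing." Any detector that catches a large share of AI text will therefore inevitably misjudge some real students, especially those whose writing is more formal or formulaic, or shaped by learning English. Existing techniques may lower the error rate, but they cannot eliminate the structural misclassification that comes with single-document judgment.

arXiv · Safety/Alignment · Papers/Research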
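The trade-off in the detector summary is easy to see with a little base-rate arithmetic. The numbers below are hypothetical and chosen only for illustration (a class of 1,000 essays, 10% AI-written, a detector with a 90% true-positive rate and a 5% false-positive rate); none of them come from the study.

```python
def flag_breakdown(n_human: int, n_ai: int, tpr: float, fpr: float):
    """Expected counts of correctly and wrongly flagged essays, plus precision."""
    true_flags = n_ai * tpr       # AI essays the detector catches
    false_flags = n_human * fpr   # real students flagged by mistake
    flagged = true_flags + false_flags
    precision = true_flags / flagged if flagged else 0.0
    return true_flags, false_flags, precision

# Hypothetical inputs, not figures from the paper.
true_flags, false_flags, precision = flag_breakdown(
    n_human=900, n_ai=100, tpr=0.90, fpr=0.05)
```

Under these assumed rates, about 45 real students are flagged alongside about 90 AI essays, so roughly one flag in three lands on a human. Raising the detection threshold lowers `fpr` but also lowers `tpr`, which is the structural trade-off the summary describes.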

08:27

Rohan Paul@rohanpaul_ai

64

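New Google research: AI learns physiological patterns to make wearables more valuable

Google Research proposes SensorFM, a foundation model that learns general patterns of human physiological activity from more than 1 trillion minutes of wearable sensor data produced by over 5 million people. The model goes beyond the traditional approach of compressing data into simple metrics: it extracts meaningful structure from the data and reuses it across many health prediction tasks. Experiments show that performance improves with model size and data volume, and that the learned representations beat baselines built on engineered features on 34 of 35 prediction tasks.

Google · Data/Training · On-device · Papers/Research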

06:57

Rohan Paul@rohanpaul_ai

Featured 79

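AlphaProof Nexus: using formal verification to drive AI proof search in mathematics

Google DeepMind proposes AlphaProof Nexus, a system that combines large language models with the Lean formal verification tool. The LLM reads Lean's compile errors and revises its proof as it goes, and it can call stronger tools to help with subproblems. This forces the model to turn every logical step into code that compiles and can be verified, changing its role from "convincing narrator" to "candidate generator." In tests on 353 Erdős problems and 492 open conjectures, the system solved 9 Erdős problems and proved 44 sequence conjectures. The work shows the key role formal verification plays in exposing AI logic errors and in establishing a new division of labor: humans pose questions, models explore, and a verifier acts as gatekeeper.

arXiv · DeepMind · Reasoning/Inference · Papers/Research

Related discussion (2): The Decoder: AI News (RSS), IT之家 (RSS)

Why it is recommended: DeepMind pushes AI's "mathematical intuition" through the Lean compiler, where every step must compile. The result is 9 solved Erdős problems, and even the failures exposed hidden errors. The paper redefines the paradigm for AI doing mathematics.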
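The generate, compile, repair loop described in the AlphaProof Nexus summary can be sketched as below. Everything here is a stand-in chosen for illustration: `toy_generate` plays the LLM, `toy_check` plays the verifier by rejecting any candidate containing `sorry`, and neither calls Lean or reflects the actual system's interfaces.

```python
from typing import Callable, Optional, Tuple

def prove(statement: str,
          generate: Callable[[str, Optional[str]], str],
          check: Callable[[str], Tuple[bool, Optional[str]]],
          max_rounds: int = 4) -> Tuple[Optional[str], int]:
    """Ask for a candidate, feed verifier errors back, stop on first success."""
    feedback: Optional[str] = None
    for attempt in range(1, max_rounds + 1):
        candidate = generate(statement, feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, attempt
    return None, max_rounds

def toy_check(candidate: str) -> Tuple[bool, Optional[str]]:
    # Stub verifier (assumption): a real setup would invoke the Lean compiler.
    if "sorry" in candidate:
        return False, "declaration uses 'sorry'"
    return True, None

def toy_generate(statement: str, feedback: Optional[str]) -> str:
    # Stub model (assumption): the first draft leaves a hole, the retry fills it.
    if feedback is None:
        return f"theorem t : {statement} := by sorry"
    return f"theorem t : {statement} := rfl"

proof, attempts_used = prove("1 + 1 = 2", toy_generate, toy_check)
```

The point of the loop is that a candidate counts only once the checker accepts it, so a persuasive but wrong draft never leaves the loop as a result.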

May 22

21:26

Rohan Paul@rohanpaul_ai

46

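This robot from the RAI Institute juggles three balls by dynamically adjusting its hands. It processes visual and contact information to maintain the pattern, with no external assistance.

Embodied AI · Papers/Research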

09:56

Chubby♨️@kimmonismus

54

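University of Tokyo develops ultra-low-power chip: a thousandfold efficiency gain, but commercial use is a decade away

The University of Tokyo has developed a new chip component that processes data 1,000 times faster than conventional methods without generating extra heat. The key advance is that it uses only one percent of the power of existing technology, which in theory could cut the energy use of a Google-scale data center to one percent of its current level and greatly ease the AI industry's energy pressure. However, a chip prototype is not expected until 2030 and commercialization will take longer still, which highlights the gap between AI's rapid growth and the time needed to mass-produce breakthrough energy-saving technology.

Papers/Research · Deployment/Engineering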

08:13

Berryxia.AI@berryxia

66

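Apple's digital-human face capture advances again, reaching new levels of realism

Apple's Persona team released a new paper ahead of WWDC26 showing its latest progress in face capture and animation. Judging from the demo, it improves markedly on details such as eye micro-expressions, subtle head motion, and skin texture. The digital likeness feels more real as a result, moving past a simple "digital avatar" toward a believable "digital double." Advances like this matter for immersive experiences in AR/VR, games, and remote collaboration, because they reduce the sense of unreality in virtual interaction. Apple continues to invest heavily in this area, and the paper and demo video are public.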

Jonathan Cooper: Apple's Persona team continuing to do amazing work with face capture and animation. New paper released ahead of WWDC26 h...

Multimodal · Video · Papers/Research

07:10

Saining Xie@sainingxie

60

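By greatly simplifying the architecture and improving generality, RAEv2 converges more than 10x faster on tasks such as text-to-image (T2I) and world models, while also improving reconstruction and generation quality. Across extensive experiments, the team found that a strong representation encoder is critical for the pixel decoder. Traditional metrics such as FID are no longer enough to measure model performance fully, and new metrics (such as ep@fid-k/fdr^k) show that generative modeling still has a great deal of room for research.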

Jaskirat Singh: In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and...

Image Generation · Papers/Research

02:43

Ethan Mollick@emollick

61

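GPT-5.2 appears to have reached expert level at peer review: 45 scientists spent 469 hours evaluating human and AI reviews of 82 papers. "Surprisingly, current AI reviews can even match the top reviewers in Nature's official peer review..." though not without weaknesses.

OpenAI · Reasoning/Inference · Papers/Research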

01:26

AK@_akhaliq

68

Mix-Quant: quantized prefill, precise decoding, for agentic LLMs

Agents · Papers/Research · Deployment/Engineering

00:26

AK@_akhaliq

56

LongMINT: evaluating memory under multi-target interference in long-horizon agent systems

Agents · arXiv · Reasoning/Inference · Papers/Research

May 21

22:42

Ethan Mollick@emollick

55

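In science, AI is still poor at finding interesting problems worth solving, especially in fields that have no list of known problems. This has always been the hardest skill to teach in PhD training: without it you find only small problems, or problems that do not move the field forward or generalize.

Expert Opinions · Papers/Research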

17:03

Orange AI@oran_ge

81

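AI autonomously cracks an 80-year-old math problem in a milestone breakthrough

An unreleased internal general-purpose reasoning model at OpenAI autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, overturning what the field had broadly expected of the solution's structure for nearly 80 years. Through a 125-page chain of thought, the model applied tools from algebraic number theory to a discrete geometry problem, a methodological breakthrough across fields. More notably, the model was not trained specifically for mathematics. The result suggests that creativity may emerge naturally once general reasoning ability passes a certain threshold, and it marks a key step for AI in basic science.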

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

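OpenAI · Reasoning/Inference · Papers/Research

Related discussion (8): TechCrunch: AI (RSS), The Decoder: AI News (RSS), X: OpenAI (@OpenAI), OpenAI: official site updates (RSS, excluding enterprise/customer stories), IT之家 (RSS), Hacker News top stories (buzzing.cc Chinese translation), X: Sam Altman (@sama), X: Noam Brown (@polynoamial)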

15:57

Greg Brockman@gdb

78

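AI has reached a milestone in generating new mathematical knowledge. An OpenAI model resolved a famous open problem in combinatorial geometry, the planar unit distance problem (Erdős 1946), showing for the first time that the number of unit-distance pairs can reach a polynomially superlinear scale (n^{1+δ}), beyond every near-linear construction previously known to humans. This marks real progress from solving known problems toward discovering new mathematics. The result left researchers saying they "could hardly sleep" and is being read as a sign that the AGI era is near.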

Alex Dimakis: A breakthrough by OpenAI in a very famous Combinatorics problem, the Planar Unit Distance problem by Erdos 1946. The pro...

OpenAI · Reasoning/Inference · Papers/Research
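For readers unfamiliar with the problem, the quantities in the post above can be written out as follows. The definition, Erdős's grid lower bound, and the classical upper bound are standard background. The last line only restates the claim as reported in the post, with δ left unspecified because the post gives no value.

```latex
% u(n): the maximum number of point pairs at distance exactly 1
% among n points in the plane.
u(n) \;=\; \max_{P \subset \mathbb{R}^2,\; |P| = n}
  \bigl|\{\, \{p, q\} \subset P : \|p - q\| = 1 \,\}\bigr|

% Erdős (1946), from a suitably scaled square grid, for some constant c > 0:
u(n) \;\ge\; n^{1 + c/\log\log n}

% Best known upper bound (Spencer, Szemerédi, Trotter, 1984):
u(n) \;=\; O\!\left(n^{4/3}\right)

% Claim as reported in the post: a fixed polynomial improvement over the grid.
u(n) \;\ge\; n^{1 + \delta} \quad \text{for some fixed } \delta > 0
```

Because c / log log n tends to 0 as n grows, the grid bound is only barely superlinear, which is why a lower bound with a fixed δ > 0 is a qualitatively stronger statement.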


15:26

Rohan Paul@rohanpaul_ai

78

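General-purpose AI reasoning breaks an 80-year-old mathematical conjecture

OpenAI's general-purpose reasoning model autonomously solved the planar unit distance problem, a famous mathematical problem open since 1946. The model did not use a theorem-proving engine built specifically for mathematics. Instead, by spending more compute at inference time, it found a new construction that beats the traditional grid structure. This is the first time an AI has autonomously solved a central open problem in a field of mathematics. More importantly, the model connected a geometry problem to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-disciplinary research and for widening the boundaries of human knowledge.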

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research


15:26

Rohan Paul@rohanpaul_ai

67

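Small model, big smarts: stochastic reasoning delivers better performance

GRAM, a model with only 10 million parameters, introduces learnable randomness so that it can explore several different paths in parallel at inference time, breaking the limitation of conventional recursive models that lock onto a single line of thought. At test time the model runs these parallel trajectories together and uses a reward predictor to select the best result, adding a "width" dimension on top of depth. Experiments show GRAM reaching 97% accuracy on hard Sudoku, far above the previous best deterministic model. It also holds up on the multi-solution queens problem and can efficiently generate valid Sudoku puzzles. The framework offers a new way to improve the reasoning ability of small models.

Reasoning/Inference · Papers/Research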
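The "width" idea in the GRAM summary, sampling several stochastic trajectories and keeping the one a scorer prefers, resembles generic best-of-N selection. The sketch below shows only that generic pattern with toy stand-ins and is not GRAM's architecture: `sample_trajectory` is a random walk, and `toy_reward` is a hypothetical scorer in place of a learned reward predictor.

```python
import random
from typing import Callable, List, Tuple

def sample_trajectory(rng: random.Random, steps: int = 8) -> float:
    """Toy stochastic 'reasoning path': a random walk that ends at some value."""
    state = 0.0
    for _ in range(steps):
        state += rng.gauss(0.0, 1.0)
    return state

def best_of_n(n: int,
              reward: Callable[[float], float],
              seed: int = 0) -> Tuple[float, List[float]]:
    """Run n trajectories and keep the one the scorer rates highest."""
    rng = random.Random(seed)
    candidates = [sample_trajectory(rng) for _ in range(n)]
    return max(candidates, key=reward), candidates

TARGET = 3.0  # hypothetical "correct answer" for the toy task

def toy_reward(final_state: float) -> float:
    # Stand-in for a learned reward predictor: closer to TARGET is better.
    return -abs(final_state - TARGET)

narrow_best, _ = best_of_n(1, toy_reward)
wide_best, wide_candidates = best_of_n(16, toy_reward)
```

With a shared seed, the single narrow trajectory is also the first of the 16 wide ones, so the wide run can never score worse under the same scorer. How much it helps in practice depends on how reliable the scorer is.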

12:44

Chubby♨️@kimmonismus

84

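OpenAI solves the planar unit distance problem in a breakthrough

An internal OpenAI reasoning model autonomously solved the planar unit distance problem, a famous open problem in mathematics that had stood for nearly 80 years. The model disproved Paul Erdős's conjecture by finding an entirely new construction of point configurations that beats the traditional square grid by a fixed polynomial factor. The proof uses cross-disciplinary methods including algebraic number theory. It was verified by outside mathematicians and was called "a milestone for AI mathematics" by Fields Medalist Tim Gowers. This is the first time an AI has independently solved a central open problem in a field of mathematics, marking a shift from reproducing knowledge to creating it, and its cross-domain reasoning could have far-reaching effects on research in many disciplines.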

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research


11:03

AYi@AYi_AInotes

76

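OpenAI model autonomously solves an 80-year-old open problem in mathematics

An OpenAI model autonomously cracked the planar unit distance problem, a famous open problem posed by the mathematician Erdős in 1946. For nearly 80 years the field broadly believed that the optimal construction resembles a square lattice. Using the relatively obscure Golod-Shafarevich theory from algebraic number theory, the model found a whole family of new, more efficient constructions and overturned that view. The achievement marks the first time an AI has independently solved a central open problem in a field of mathematics. The key was proposing and fully carrying out an original route that humans had never tried because intuition said it would not work.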

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research


05:50

Z.ai@Zai_org

75

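ZCube network architecture: breaking the network bottleneck in large-model inference

As long context and disaggregated Prefill-Decode deployment become mainstream, the GPU cluster network has gone from a secondary component to a key bottleneck for inference throughput, tail latency, and cost. Traditional static network topologies clash with dynamic, asymmetric KV Cache traffic patterns, causing local congestion. In response, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture. It uses a fully flattened topology with a hybrid access design to decouple and spread traffic at the source and so reduce congestion. In production tests on GLM-5.1, with GPUs and the software stack unchanged, ZCube cut switch and optical module costs by 33%, raised average inference throughput by 15%, and reduced P99 time to first token by 40.6%, showing that network architecture innovation can unlock hardware potential.

Reasoning/Inference · Papers/Research · Deployment/Engineering

Related discussion (1): 智谱 (Zhipu): Research (data embedded in page)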

04:01

Emad@EMostaque

91

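For the first time, an OpenAI model has autonomously solved the planar unit distance problem posed by Paul Erdős in 1946, overturning the mathematical community's prevailing conjecture of nearly 80 years. The AI not only produced a better solution but also discovered a whole new family of constructions. The event is seen as a milestone in AI capability. It suggests that AI is starting to make sustained, novel breakthroughs on open scientific problems, and it may mark the arrival of the "final stage" of human-led work on such problems.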

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research


03:36

Greg Brockman@gdb

92

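An OpenAI model has made a major breakthrough in discrete geometry, autonomously solving the planar unit distance conjecture first posed by the mathematician Paul Erdős in 1946. It is the first time an AI has independently solved a famous open problem central to a field. For nearly 80 years, mathematicians broadly believed that the optimal solution looks roughly like a square grid. The OpenAI model found an entirely new and better-performing construction, overturning that long-held belief.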

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research


03:36

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes

87

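An OpenAI model has autonomously cracked the planar unit distance problem, a famous problem that stood open in mathematics for nearly 80 years. Paul Erdős posed it in 1946, and the conventional view held that the optimal structure resembles a square grid. The model's discovery not only overturns that long-standing assumption but also constructs a new, better-performing solution, marking the first time AI has independently solved a major open problem in a core area of mathematics.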

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Reasoning/Inference · Papers/Research

