AIHOT
All updates · X · 506 posts
elvis @omarsar0 · May 25 · 66

New research from Microsoft Research. I see a lot of AI engineers handwriting agent skill docs and hoping they generalize. Probably not optimal. This work shows why. It treats the skill doc as a trainable external state of a frozen agent instead. It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes. SkillOpt is best or tied on all 52 (model, benchmark, harness) cells. On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses. Paper: https://arxiv.org/abs/2605.23904 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Translated summary: Microsoft Research proposes SkillOpt, which treats an AI agent's skill document as trainable external state rather than something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions, with a textual learning rate controlling how aggressively each round rewrites it, while the agent itself stays unchanged. In experiments, SkillOpt is best or tied for best on all 52 test cells (spanning models, benchmarks, and harnesses). On GPT-5.5, relative to no skill document, it gains 23.5, 24.8, and 19.1 points in direct chat, Codex, and Claude Code respectively. It beats human-written skills and other automated methods, adds no inference-time cost, and the learned skills transfer across models and harnesses.
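
The validation-gated edit loop described in the post can be sketched roughly as follows. This is a toy illustration, not the paper's implementation: `apply_edits`, `optimize_skill`, `toy_propose`, and `toy_validate` are hypothetical names, the "optimizer model" is replaced by a scripted proposer, and the textual learning rate is approximated by a cap on edits per round (`max_edits`).

```python
from typing import Callable, List, Tuple

# (op, index, text); op is "add", "delete", or "replace". Hypothetical format.
Edit = Tuple[str, int, str]


def apply_edits(skill: List[str], edits: List[Edit]) -> List[str]:
    """Apply add/delete/replace edits to a copy of the skill doc (a list of instructions)."""
    out = list(skill)
    for op, idx, text in edits:
        if op == "add":
            out.append(text)
        elif op == "delete" and 0 <= idx < len(out):
            del out[idx]
        elif op == "replace" and 0 <= idx < len(out):
            out[idx] = text
    return out


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], int], List[Edit]],
    validate: Callable[[List[str]], float],
    rounds: int,
    max_edits: int,
) -> Tuple[List[str], float]:
    """Keep an edited skill doc only if its validation score improves."""
    best, best_score = list(skill), validate(skill)
    for r in range(rounds):
        edits = propose(best, r)[:max_edits]  # assumed stand-in for the learning rate
        candidate = apply_edits(best, edits)
        score = validate(candidate)
        if score > best_score:  # the validation gate
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins for the optimizer model and the validation set (assumptions).
USEFUL = {"run the tests before answering", "quote file paths exactly"}


def toy_validate(skill: List[str]) -> float:
    return sum(2 if line in USEFUL else -1 for line in skill)


def toy_propose(skill: List[str], round_index: int) -> List[Edit]:
    scripted = [
        [("add", 0, "run the tests before answering")],
        [("delete", 1, "")],  # a harmful edit that the gate should reject
        [("replace", 0, "quote file paths exactly")],
    ]
    return scripted[round_index % len(scripted)]


final_skill, final_score = optimize_skill(
    ["be helpful"], toy_propose, toy_validate, rounds=3, max_edits=1
)
```

In this toy run the harmful delete in round two is rejected by the gate, which is the property the post emphasizes: the frozen agent never changes, and the skill doc only moves when validation says it should.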

Rohan Paul @rohanpaul_ai · May 25 · 75

🇨🇳 Huawei just released a breakthrough chip design approach, "LogicFolding", that will close its gap with TSMC. The technical paper behind it. The core idea is that chips should stop measuring progress mainly by how small transistors are and start measuring progress by how much time delay can be removed from the whole machine. A chip wastes time when signals move through long wires, memory paths, chip-to-chip links, and software communication layers, so Huawei calls this delay τ, or tau. Huawei’s paper introducing "LogicFolding" says the next chip breakthrough may come from cutting wasted time inside the machine. That is what "τ scaling" means. τ is the delay that accumulates before useful computing happens: a transistor switches, a signal crosses a wire, data reaches memory, a chip talks to another chip, or a server waits for a response. Moore’s Law reduced this delay indirectly because shrinking transistors also shortened many of the paths around them. But modern chips are no longer slowed only by transistor size. They are slowed by wire resistance, parasitic capacitance, clock skew, memory distance, protocol conversion, chip-to-chip communication, and the cost of moving data. So τ scaling changes the question from “how small is the transistor?” to “where is time being lost?” LogicFolding is Huawei’s physical answer to that question inside a chip. In a normal chip, related logic gates are spread across a flat surface, so signals often travel sideways through long metal routes before reaching the next important gate. Those wires behave like sticky pipes: resistance slows current, capacitance must be charged and discharged, and every extra distance creates delay and wastes energy. LogicFolding tries to stack active circuit layers vertically and connect them with very fine hybrid bonds, so circuits that need to talk are placed above and below each other instead of far apart on one plane.
The signal now takes a shorter route, the critical path becomes faster, clock timing becomes cleaner, and the same manufacturing node can deliver more performance. Huawei is trying to win not by making every switch smaller, but by making every important signal travel less, wait less, and arrive sooner.

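Translated summary: Huawei proposes two new approaches, "τ scaling" and "LogicFolding", aiming to narrow the performance gap with TSMC without relying on the most advanced lithography tools. The core idea is to shift the measure of chip progress from transistor size to signal propagation delay (τ). LogicFolding, the concrete implementation, stacks logic circuit layers vertically and uses hybrid bonding so that circuits that need to communicate sit next to each other, which shortens critical wires, lowers resistance and parasitic capacitance, and speeds up signals. Huawei says its next-generation Kirin phone chip will be the first full test of the τ scaling law.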

Rohan Paul @rohanpaul_ai · May 25 · 65

New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) with only lightweight adaptation Shows standard LLMs can handle very long context faster by making attention selectively sparse. The problem is that full attention gets very expensive when the input grows to hundreds of thousands or 1M tokens, because the model keeps comparing too many tokens with too many other tokens. The paper’s claim is that a trained full-attention model already has a hidden sparse structure, so the model does not need to be rebuilt or trained from scratch. RTPurbo uses that structure by finding the few attention heads that really need faraway tokens, while letting the other heads focus mostly on nearby text. For those retrieval heads, it uses a small 16-dimensional token finder to guess which old tokens matter, then runs the real attention only on that selected set. The authors tested this on long-context benchmarks and reasoning tasks, and RTPurbo kept accuracy close to full attention while reaching up to 9.36x faster prefill at 1M tokens and about 2x faster decoding. RTPurbo's engineering rule: keep expensive long-context access only where it matters, and route the rest through a smaller search space. The clever part is the 16-dimensional indexer. It does not replace the model’s real attention computation; it acts like a cheap scout, finding likely useful tokens before the full representation is used on the selected set. RTPurbo is not proof that every model can be safely sparsified this way. But it is strong evidence that the waste in long-context inference is more structured than it looks. ---- Paper Link – arxiv. org/abs/2605.16928v1 Paper Title: "Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps"

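Translated summary: Alibaba and Nanjing University propose RTPurbo, a lightweight adaptation method. It finds that a trained full-attention model already contains a hidden sparse structure. It uses a lightweight 16-dimensional token finder as a "scout" to locate important tokens for the few key attention heads that need long-range information, while the other heads attend mostly to local text. On this basis, RTPurbo achieves up to a 9.36x speedup over FlashAttention-2 on 1M-token prefill and about a 2x speedup in decoding, while keeping accuracy close to the full-attention model on long-context and reasoning benchmarks. The work suggests that the wasted compute in long-context inference has exploitable structure.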
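
The "cheap scout, then real attention on the selected set" idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not RTPurbo's code: `attend` and `sparse_attend` are hypothetical names, the low-dimensional index vectors (`q_idx`, `k_idx`) are supplied by hand rather than learned, and a single query and head are shown.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(q: Sequence[float], keys: List[Sequence[float]],
           values: List[Sequence[float]]) -> List[float]:
    """Standard scaled dot-product attention for one query."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [dot(q, k) * scale for k in keys]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def sparse_attend(q, keys, values, q_idx, k_idx, top_k) -> Tuple[List[float], List[int]]:
    """Score tokens with small index vectors, then run full attention on the top_k only."""
    scores = [dot(q_idx, k) for k in k_idx]  # cheap scout in the low-dim space
    chosen = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    chosen.sort()  # restore positional order
    out = attend(q, [keys[i] for i in chosen], [values[i] for i in chosen])
    return out, chosen


# Toy data: four tokens, 2-dim real keys, 1-dim index keys (all hypothetical).
q = [4.0, 0.0]
keys = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
q_idx = [1.0]
k_idx = [[0.0], [1.0], [0.0], [-1.0]]

picked_out, picked = sparse_attend(q, keys, values, q_idx, k_idx, top_k=1)
full_out = attend(q, keys, values)
all_out, _ = sparse_attend(q, keys, values, q_idx, k_idx, top_k=len(keys))
```

With `top_k` equal to the sequence length the sparse path reduces to full attention, which is a useful sanity check; the savings come from keeping `top_k` small for the few heads that need distant tokens.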

Chubby♨️ @kimmonismus · May 25 · 60

Nine more Erdős problems have been solved. This time, however, by Google DeepMind. This shouldn't be underestimated, because on the one hand it increases competitive pressure, and on the other hand it proves that the other Frontier Labs can easily keep up.

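Translated summary: Nine more Erdős problems have been solved. This time, though, by Google DeepMind. This should not be underestimated: on the one hand it raises competitive pressure, and on the other it shows that the other frontier labs can easily keep up.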

Rohan Paul @rohanpaul_ai · May 25 · 73

A large MoE model may be wasting half its expert compute on tokens that barely need expert help. In this paper 50% of expert computation removed, with almost no loss in accuracy. This makes already-trained MoE models like Qwen3 and GLM stop calling half their experts when a token is too easy to need them. Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. Shows that many MoE tokens do not need real experts, only permission to skip them. That sounds like a small routing trick, but it changes the economics of deployed language models. Standard MoE models already avoid using every parameter, yet they still spend the same expert budget on every token. ZEDA adds a strange new option to the router: experts that output exactly nothing. When the model routes a token to one of these zero experts, it is not making the model dumber; it is admitting that this token does not need another expensive transformation. The clever part is not the dummy expert, but the adaptation method. Instead of retraining the model from scratch, the original MoE becomes a frozen teacher, while the new dynamic version learns when it can safely skip work. Across Qwen3-30B-A3B and GLM-4.7-Flash, the result is roughly half the expert computation removed, with only marginal average accuracy loss and about 20% real inference speedup. The deeper finding is: compute use did not simply track task difficulty. The model spent more expert budget where uncertainty or teacher-student disagreement rose, while structured code and math fragments often needed less. That makes ZEDA feel less like pruning and more like attention to computational doubt. ---- Paper Link – arxiv. org/abs/2605.18643 Paper Title: "Post-Trained MoE Can Skip Half Experts via Self-Distillation"

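Translated summary: The paper proposes ZEDA, a framework that converts post-trained static MoE models (such as Qwen3 and GLM) into dynamic ones, letting the router skip expert calls when a token is too easy to need them. In experiments on Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA removes about 50% of expert computation with only a slight accuracy loss and delivers about a 20% real inference speedup. The study finds that compute allocation mainly follows model uncertainty rather than simply tracking task difficulty.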
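
A rough sketch of what "experts that output exactly nothing" could look like inside a top-k router. The function name `moe_with_zero_experts`, the softmax-over-selected-logits weighting, and the convention that router slots past the real experts are zero experts are all assumptions for illustration; the post does not specify these details.

```python
import math
from typing import Callable, List, Sequence, Tuple

Expert = Callable[[Sequence[float]], List[float]]


def moe_with_zero_experts(
    x: Sequence[float],
    router_logits: Sequence[float],
    real_experts: List[Expert],
    top_k: int,
) -> Tuple[List[float], int]:
    """Route one token; slots with index >= len(real_experts) are zero experts.

    Returns the mixed output and the number of real expert calls actually made.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in order)
    raw = [math.exp(router_logits[i] - peak) for i in order]
    total = sum(raw)

    out = [0.0] * len(x)
    calls = 0
    for i, w in zip(order, raw):
        if i >= len(real_experts):
            continue  # zero expert: contributes nothing and costs no compute
        y = real_experts[i](x)
        calls += 1
        for d in range(len(out)):
            out[d] += (w / total) * y[d]
    return out, calls


# Two toy real experts plus two zero-expert slots (hypothetical setup).
experts: List[Expert] = [
    lambda v: [2.0 * e for e in v],
    lambda v: [e + 1.0 for e in v],
]
token = [1.0, 2.0]

easy_out, easy_calls = moe_with_zero_experts(token, [0.0, 0.0, 5.0, 5.0], experts, top_k=2)
hard_out, hard_calls = moe_with_zero_experts(token, [5.0, 5.0, 0.0, 0.0], experts, top_k=2)
mixed_out, mixed_calls = moe_with_zero_experts(token, [5.0, 0.0, 5.0, 0.0], experts, top_k=2)
```

The call counter is the point of the sketch: an "easy" token routed entirely to zero experts triggers no expert computation, while the router budget (`top_k`) stays the same for every token.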

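Chubby♨️ @kimmonismus · May 24 · 68

Don't like this at all. Researchers at KIT (Germany) just demonstrated that ordinary WiFi routers can identify individuals with near-perfect accuracy. No phone required, no special hardware, no line of sight. The system reads unencrypted beamforming feedback that every connected device already broadcasts. 197 test subjects, nearly 100% identification rate. The surveillance infrastructure isn't being built. It's already installed in every café, airport, and office you walk through. The only question is who starts reading the signals first. Source: ScienceDaily

Translated summary: Researchers at KIT in Germany showed that ordinary WiFi routers can identify individuals almost perfectly, with no phone, special hardware, or line of sight required. The system uses the unencrypted beamforming feedback that every connected device already broadcasts. In a test with 197 subjects, identification accuracy was close to 100%. The study points out that this kind of surveillance infrastructure (routers in cafés, airports, and offices) is already everywhere, and the key question is who will start reading and exploiting these signals.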

elvis @omarsar0 · May 23 · 64

// Adapt the Interface, Not the Model // I am fascinated by the results across my cheap-model-plus-good-harness builds. This new paper also shows good signs of the code-as-agent-harness thesis. The idea is really simple. Do not touch the model. Instead, modify the runtime interface that wraps the frozen LLM. Then convert recurring interaction failures into reusable interventions on the harness side. The paper reports an average relative improvement of 88.5% across 7 deterministic environments, 126 model-environment settings, and 18 backbones. A harness learned from one model trajectory generalizes to 17 other backbones. That tells you the harness is capturing environment structure, not model-specific patterns. If you ship agents in production, your harness work is more portable than you might assume. Paper: https://arxiv.org/abs/2605.22166 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Translated summary: A new study proposes improving AI agent performance by modifying the runtime interface that wraps a frozen LLM rather than the model itself. The method turns recurring interaction failures into reusable interventions at the harness layer, achieving an average relative improvement of 88.5% across 7 deterministic environments and 126 settings. The key finding is that a harness learned from a single model's trajectory transfers to the 17 other backbones tested, showing that it captures environment structure rather than model-specific patterns. This makes harness work for agents deployed in production more portable.

Rohan Paul @rohanpaul_ai · May 23 · 61

This paper shows that agent performance depends less on prompts alone and more on the harness around them. “Agent intelligence” is becoming partly a systems problem. The problem is that many AI agents look like 1 model, but their real behavior comes from surrounding code that controls planning, tools, memory, retries, checking, and stopping. A model may reason well in one step, but long tasks fail in messier places: state disappears, verification drifts, tools return partial evidence, and the agent forgets which intermediate artifact actually matters. Natural-Language Agent Harnesses try to make that control layer visible. Instead of burying the logic in controller code, they express the stages, roles, contracts, state rules, failure modes, and stopping conditions in structured natural language that a shared runtime can execute. The claim is not that natural language should replace code, but that the important design choices around an agent should become inspectable, portable, and testable instead of hiding inside one framework’s habits. On SWE-bench, heavier harnessing changed behavior dramatically, with more calls, tools, delegation, and runtime, but it did not produce a simple win curve; sometimes added structure helped, and sometimes it pushed the agent away from the shortest benchmark-aligned repair. A harness is not magic scaffolding around a model; it is a set of bets about where reliability comes from. ---- Paper Link – arxiv. org/abs/2603.25723 Paper Title: "Natural-Language Agent Harnesses"

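Translated summary: The study argues that an AI agent's real performance depends more on the control system around the model (the agent harness) than on prompts alone. Many agents look like a single model, but their behavior is driven by surrounding code for planning, tool calls, memory management, and so on, so long tasks tend to fail through lost state, verification drift, and similar breakdowns. The paper proposes "Natural-Language Agent Harnesses", which express the control flow explicitly in structured natural language so that it is inspectable, portable, and testable. It finds that heavier harnesses change agent behavior substantially but do not yield consistent performance gains, which suggests that harness design is a key choice about where reliability comes from, not an instant cure-all.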

Rohan Paul @rohanpaul_ai · May 23 · 55

AI detectors fail because student writing is too varied to judge from 1 document. The problem is not only that AI writing is getting better, but that many real students write in ways that can look statistically close to AI output. The paper frames this as a testing problem where the detector does not know each student’s normal writing style, so “human writing” is not 1 fixed target. Because of that, any detector that catches many AI-written submissions must also wrongly accuse some real students, especially students whose writing is more structured, formulaic, or shaped by learning English. The authors use basic statistics to show that this false-accusation problem is not just a bug in current tools, because it appears whenever student writing overlaps with AI writing. A university is not comparing “AI text” with “human text”; it is comparing one submission with the unknown writing habits of one particular student. Better detectors may reduce some errors, but they cannot erase the structural problem created by one-shot judgment. ---- Paper Link – arxiv. org/abs/2603.20254 Paper Title: "AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits"

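Translated summary: The study argues that AI detectors fail so often because student writing styles are diverse, which makes it extremely hard to judge from a single document whether text was AI-generated. The problem is not only that AI writing keeps improving, but that many real students write in ways that are statistically very close to AI output. A detector cannot know each student's individual writing habits in advance, so "human writing" has no fixed reference standard. Any detector that catches a large share of AI text will therefore inevitably misjudge some real students, especially those whose writing is more structured, formulaic, or shaped by learning English. Better tools may lower the error rate, but they cannot remove the structural misjudgment that comes from one-shot decisions.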
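
The structural trade-off can be made concrete with a small worked calculation. The sketch below assumes, purely for illustration, that a detector emits a score that is normally distributed for each group; the means and standard deviations are made-up numbers, not figures from the paper, and `false_positive_rate` is a hypothetical helper.

```python
from statistics import NormalDist


def false_positive_rate(human: NormalDist, ai: NormalDist, target_tpr: float) -> float:
    """Share of human writers flagged when the threshold catches target_tpr of AI text.

    Higher score means "more AI-like"; a submission is flagged if score > threshold.
    """
    threshold = ai.inv_cdf(1.0 - target_tpr)
    return 1.0 - human.cdf(threshold)


# Hypothetical score distributions (assumptions, not data from the paper).
ai_text = NormalDist(mu=3.0, sigma=1.0)
typical_student = NormalDist(mu=0.0, sigma=1.0)
formulaic_student = NormalDist(mu=1.5, sigma=1.0)  # more structured, closer to AI output

fpr_typical = false_positive_rate(typical_student, ai_text, target_tpr=0.90)
fpr_formulaic = false_positive_rate(formulaic_student, ai_text, target_tpr=0.90)
fpr_stricter = false_positive_rate(formulaic_student, ai_text, target_tpr=0.95)
```

Under these assumed numbers, a threshold that catches 90% of AI text flags only a few percent of typical students but roughly four in ten of the students whose style sits closer to AI output, and demanding a higher catch rate pushes that figure up further. The exact values depend entirely on the assumed distributions; the direction of the trade-off does not.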

Rohan Paul @rohanpaul_ai · May 23 · 64

New Google paper shows that wearable data becomes far more useful when AI learns the person behind the signals. It is not another heart-rate algorithm, but a general model trained on more than one trillion minutes of sensor data from five million people. The authors propose SensorFM, a foundation model trained on more than 1 trillion minutes of unlabeled wearable data from 5 million people, so it can learn general patterns of human physiology before seeing specific health tasks. That scale changes the problem from measuring isolated events to learning patterns of lived physiology: sleep, movement, temperature, oxygen, heart rhythms, and their ordinary daily messiness. Wearables are not weak because they lack data; they are weak because most systems compress that data into crude summaries before the meaningful structure has a chance to appear. SensorFM tries to learn that structure first, then reuse it across tasks, which is why the same representation can help with cardiovascular, metabolic, mental health, sleep, lifestyle, and demographic predictions. The evidence is strongest as a scaling story: larger models trained on more data performed better, and the learned embeddings beat engineered-feature baselines on 34 of 35 prediction tasks. ---- Paper Link – arxiv. org/abs/2511.15352v3 Paper Title: "People readily follow personal advice from AI but it does not improve their well-being"

Translated summary: Google Research proposes SensorFM, a foundation model that learns general patterns of human physiology from more than 1 trillion minutes of wearable sensor data generated by over 5 million people. The model goes beyond the traditional approach of compressing data into simple metrics, extracting meaningful structure from the data and reusing it across many health prediction tasks. Experiments show that performance improves with model size and data volume, and that the learned representations beat engineered-feature baselines on 34 of 35 prediction tasks.

Rohan Paul @rohanpaul_ai · May 23 · 79

Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrained worlds. The striking result is not that the system “thinks like a mathematician,” but that it keeps forcing its thoughts through Lean, where every step must compile. The problem is that LLMs can sound convincing in math while still making tiny mistakes, so the authors use Lean, a proof system that checks every logical step. Their system, AlphaProof Nexus, lets an LLM keep editing a formal proof, read compiler errors, try again, and sometimes ask a stronger proof tool for help on smaller subproblems. The stronger version also keeps a shared pool of partial proof attempts, rates which ones look promising, and uses those attempts to guide later searches. That changes the role of the model from a persuasive storyteller into a generator of candidates that can be killed quickly when they are wrong. The verifier is not a cosmetic add-on, it is the mechanism that makes exploration tolerable. Without it, a beautiful proof sketch can hide a false lemma; with it, the model has to turn insight into executable logic, or fail visibly. The authors tested the system on real unsolved math problems, including 353 formalized Erdős problems and 492 open conjectures from the Online Encyclopedia of Integer Sequences. The main result is that the best agent solved 9 Erdős problems and proved 44 sequence conjectures, while also helping with problems in optimization, graph theory, algebraic geometry, and quantum optics. The failures are as revealing as the wins, because the agents sometimes buried the hard part inside a helper lemma or hallucinated a known result, exactly the kind of error formal checking is built to expose. The real shift is not full mathematical autonomy, but a new division of labor: humans choose the formal question, libraries define the terrain, models propose routes, and the proof assistant refuses to be impressed. 
---- "Advancing Mathematics Research with AI-Driven Formal Proof Search" Paper Link – arxiv. org/abs/2605.22763

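Translated summary: Google DeepMind proposes AlphaProof Nexus, a system that combines large language models with the Lean formal verification tool. It lets the LLM repeatedly read Lean compiler errors and revise while generating a proof, and call a stronger tool for help on subproblems. This forces the model to turn every logical step into code that compiles and can be verified, changing its role from "persuasive storyteller" to "candidate generator". In tests on 353 Erdős problems and 492 open conjectures, the system solved 9 Erdős problems and proved 44 sequence conjectures. The work shows the key role of formal verification in exposing AI logic errors and in establishing a new division of labor: humans pose the question, models explore, and the verifier checks.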

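Rohan Paul @rohanpaul_ai · May 22 · 46

This RAI Institute robot manages 3-ball juggling through dynamic hand adjustments. It processes visual and contact information to maintain the pattern without external aids.

Translated summary: This RAI Institute robot manages three-ball juggling through dynamic hand adjustments. It processes visual and contact information to maintain the pattern, with no external aids.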

Chubby♨️ @kimmonismus · May 22 · 54

University of Tokyo built a chip component that processes data 1000x faster than conventional methods - without generating extra heat. The real number worth paying attention to: power consumption drops to 1/100th of current levels. A Google-scale data center that today powers 80,000 homes could theoretically run on the energy of 800. But the prototype chip isn't scheduled until 2030, and commercial availability is years beyond that. We're watching the AI industry sprint toward an energy wall at full speed while the most promising efficiency breakthroughs are still a decade from production. via techradar

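Translated summary: The University of Tokyo has developed a new chip component that processes data 1000x faster than conventional methods without generating extra heat. The key result is power consumption at one hundredth of current technology, which in theory could cut the energy use of a Google-scale data center to one hundredth of today's level and greatly ease the AI industry's energy pressure. However, the prototype chip is not expected until 2030 and commercialization will take longer, which highlights the gap between the pace of AI growth and the time it takes breakthrough efficiency technology to reach production.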

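Berryxia.AI @berryxia · May 22 · 66

Folks, Apple's Persona team has pushed digital-human realism to a new high. Just ahead of WWDC26 they released a new paper on the latest advances in face capture and animation. Judging from the demo videos, capture accuracy and animation naturalness have clearly taken another step forward, especially in eye micro-expressions, subtle head movements, and skin texture, and the sense of realism is very strong. This is no longer a simple "digital avatar"; it is getting closer and closer to a believable digital double. For AR/VR, gaming, and remote collaboration, breakthroughs like this directly determine whether "immersion" holds up. After all, once you put on a headset, the first thing that breaks the illusion is usually the feeling that "this person looks fake". Apple is clearly still betting heavily on this track. The paper and demos are here (watching the videos is strongly recommended): https://apple.github.io/ml-headsup/ Try it when you have time and see how it actually performs.

Translated summary: Ahead of WWDC26, Apple's Persona team released a new paper showing the latest advances in face capture and animation. Judging from the demos, it achieves notable improvements in details such as eye micro-expressions, subtle head movements, and skin texture, making digital likenesses more realistic, moving beyond a simple "digital avatar" toward a believable "digital double". Breakthroughs of this kind are critical to immersive experiences in AR/VR, gaming, and remote collaboration, because they reduce the sense of unreality in virtual interaction. Apple continues to invest heavily in this area, and the paper and demo videos are public.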

Saining Xie @sainingxie · May 22 · 60

check out RAEv2 led by Jas. through extensive exps, we found some really intriguing behaviors showing why strong representation encoders are key for pixel decoders. spoiler: it’s not about hillclimbing fid; new metrics like ep@fid-k/fdr^k show there’s a lot more left to explore!

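Translated summary: By greatly simplifying the architecture and improving generality, RAEv2 converges more than 10x faster on tasks such as text-to-image (T2I) and world models, while also improving reconstruction and generation quality. Through extensive experiments, the team found that strong representation encoders are key for pixel decoders. Traditional metrics such as FID are no longer enough to measure model performance fully, and new metrics such as ep@fid-k/fdr^k show that there is still a lot left to explore in generative modeling.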

Ethan Mollick @emollick · May 22 · 61

Seems GPT-5.2 reaches expert level in peer review: 45 scientists took 469 hours evaluating human & AI reviews on 82 papers. "Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..." though not without weaknesses.

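Translated summary: GPT-5.2 appears to reach expert level in peer review: 45 scientists spent 469 hours evaluating human and AI reviews of 82 papers. "Surprisingly, current AI reviewers are competitive even with the top-rated reviewers in Nature’s official peer review..." though not without weaknesses.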

AK @_akhaliq · May 22 · 68

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Translated summary: Mix-Quant uses quantized prefilling and precise decoding, aimed at agentic LLMs.

AK @_akhaliq · May 22 · 56

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Translated summary: LongMINT evaluates memory under multi-target interference in long-horizon agent systems.

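Orange AI @oran_ge · May 21 · 81

A milestone moment in AI development. An unreleased internal reasoning model from OpenAI autonomously solved the planar unit distance problem that Erdős posed in 1946. The chain of thought runs to 125 pages, and the core technique is to bring a set of tools from algebraic number theory to bear on a discrete geometry problem, a cross-field connection humans had not thought of in 80 years. Most interesting of all, this model was not trained specifically for math; it is a general-purpose reasoning model. This suggests that once reasoning ability is strong enough to cross a certain threshold, creativity emerges naturally. Congratulations, humanity.

Translated summary: OpenAI's unreleased internal general-purpose reasoning model autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, overturning what the field had broadly expected about the structure of the solution for nearly 80 years. Through a 125-page chain of thought, the model applied tools from algebraic number theory to a discrete geometry problem, a methodological breakthrough across fields. Notably, the model was not trained specifically for math, and the result suggests that general reasoning ability may give rise to creativity once it passes a certain threshold, marking a key step for AI in basic science.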

Greg Brockman @gdb · May 21 · 78

our math result is a milestone in new knowledge generation by AI. very exciting to imagine similar results in other scientific fields. "It's very hard to sleep, man" is a pretty good reaction.

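Translated summary: AI has reached a milestone in generating new knowledge in mathematics. An OpenAI model resolved a famous long-open problem in combinatorial geometry, the planar unit distance problem (Erdős, 1946), showing for the first time through AI that the number of unit-distance pairs can be raised to n^{1+δ}, a polynomial improvement over the constructions previously known to humans. This marks important progress from solving known problems toward discovering new mathematics. The result drew a strong reaction from researchers ("very hard to sleep") and is seen by some as a sign that the AGI era is approaching.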

Rohan Paul @rohanpaul_ai · May 21 · 78

AI in math is creating history again, as OpenAI's general-purpose reasoning model has disproved a major Erdős conjecture from 1946. The important part is not that AI solved a hard math problem, but how little special machinery it needed. For decades, the planar unit distance problem looked almost embarrassingly simple: place points on a plane, then ask how many pairs can be exactly one unit apart. For decades, the best examples looked like stretched versions of a square grid, so mathematicians believed grids were almost the best possible design. OpenAI’s internal model broke that picture by finding an infinite family of constructions that gives a polynomial improvement, with the proof checked by external mathematicians. The point to note is that the model was not a bespoke theorem-proving engine trained only for this problem, and the official post says its success improved with more test-time compute, meaning more reasoning at inference rather than only more training. That matters because research progress often comes from holding a fragile chain of ideas together long enough to cross from one field into another. In this case, the bridge ran from a plain geometric question into deep algebraic number theory, including machinery like infinite class field towers and Golod–Shafarevich theory. And now a general-purpose reasoning system appears able to search a conceptual space where human taste, field boundaries, and inherited guesses may have quietly narrowed the path. So the future is not machines replacing judgment, but machines widening the map before judgment begins.

Translated summary: OpenAI's general-purpose reasoning model autonomously resolved the planar unit distance problem, a famous math problem open since 1946. The model is not a theorem-proving engine built specifically for mathematics; with more compute at inference time, it found new constructions that beat the traditional grid structure. This marks the first time AI has autonomously resolved a central open problem in mathematics. More importantly, the model connected a geometric problem to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-field research and for widening the boundaries of human knowledge.

Rohan Paul @rohanpaul_ai · May 21 · 67

A 10 million parameter model just outperformed deterministic rivals 3 times its size by doing something regular recursive AI models don't do: exploring multiple reasoning paths at the same time. Most AI reasoning models are trapped on a single train of thought, and GRAM ("Generative Recursive Reasoning") is the first to break that by letting the model think in parallel universes simultaneously. The problem is that all existing recursive models are fully deterministic, meaning given the same input they always follow the exact same reasoning path and can never escape a wrong trajectory or discover more than 1 valid answer. GRAM fixes this by injecting learned randomness at each refinement step, so the model samples a slightly different direction each time rather than snapping to 1 fixed next state, which produces a spread of diverse reasoning trajectories. At test time the model runs many of these paths in parallel and selects the best one using a small reward predictor trained alongside the main model, adding a "width" scaling axis on top of the usual "depth" axis of running more recursion steps. On hard Sudoku puzzles, GRAM with 10M parameters hits 97% accuracy versus 87.4% for the best prior recursive model, and with only 20 parallel samples it outperforms every deterministic baseline even at 320 recursion steps. On tasks with many valid answers like N-Queens, deterministic recursive models collapse as the number of solutions grows, while GRAM maintains near-perfect accuracy throughout. The same stochastic framework also acts as a generator: given a blank board, GRAM produces valid Sudoku puzzles 99% of the time using 16 steps, versus 1,000 steps and 55M parameters for the best diffusion baseline at just 91%. --- Paper Link – arxiv. org/abs/2605.19376v1

Translated summary: GRAM, a model with only 10 million parameters, introduces learnable randomness so that it can explore multiple different paths in parallel at inference time, breaking the single-train-of-thought limit of traditional recursive models. At test time it runs these parallel trajectories together and uses a reward predictor to pick the best result, adding a "width" dimension on top of depth. Experiments show GRAM reaching 97% accuracy on hard Sudoku, well above the previous best deterministic model; it also stays strong on the multi-solution N-Queens problem and can generate valid Sudoku puzzles efficiently. The framework offers a new way to improve the reasoning ability of small models.
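
The "depth plus width" idea can be shown with a deliberately tiny toy. Nothing below comes from the GRAM paper: the scalar state, the `refine` update, the fixed target value, and the hand-written `reward` function are all assumptions chosen so the example runs without a trained model. It only illustrates the shape of the procedure: noisy refinement steps (depth), several sampled trajectories (width), and a reward predictor that picks one.

```python
import random
from typing import List, Tuple

TARGET = 3.0  # hypothetical "correct answer" for the toy state


def refine(x: float, rng: random.Random, noise: float) -> float:
    """One refinement step: a deterministic update plus a sampled perturbation."""
    return x + 0.5 * (TARGET - x) + rng.gauss(0.0, noise)


def run_trajectory(x0: float, steps: int, rng: random.Random, noise: float) -> float:
    """Depth axis: apply the refinement step repeatedly."""
    x = x0
    for _ in range(steps):
        x = refine(x, rng, noise)
    return x


def reward(x: float) -> float:
    """Stand-in for a learned reward predictor (higher is better)."""
    return -abs(x - TARGET)


def best_of_width(x0: float, steps: int, width: int, noise: float,
                  seed: int) -> Tuple[float, List[float]]:
    """Width axis: sample several trajectories and keep the highest-reward one."""
    rng = random.Random(seed)
    samples = [run_trajectory(x0, steps, rng, noise) for _ in range(width)]
    return max(samples, key=reward), samples


det_best, det_samples = best_of_width(0.0, steps=20, width=4, noise=0.0, seed=0)
sto_best, sto_samples = best_of_width(0.0, steps=20, width=20, noise=0.3, seed=0)
```

With `noise=0.0` every trajectory is identical, which mirrors the post's point about deterministic recursion: extra width buys nothing unless the refinement step is stochastic.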

Chubby♨️ @kimmonismus · May 21 · 84

OpenAI made history today. An internal reasoning model autonomously disproved a famous conjecture in mathematics that stood for nearly 80 years. The problem: In 1946, Paul Erdős asked how many pairs of points can be exactly 1 unit apart if you place n points on a flat surface. The best known answer came from square grid constructions, and Erdős himself conjectured you can't do meaningfully better. Mathematicians believed this for decades. The AI proved him wrong. It found entirely new point configurations that beat the square grid by a fixed polynomial factor, not a marginal improvement, a real mathematical gap. The proof uses methods from algebraic number theory, a completely different branch of math, Class field towers, Golod-Shafarevich theory, tools nobody expected to be relevant to a geometry problem about distances in the plane (reminds me of move 37, AlphaGo tbh). Fields Medalist Tim Gowers calls it "a milestone in AI mathematics." The proof was verified by leading external mathematicians. According to OpenAI, this is the first time AI has independently solved a prominent open research problem in mathematics! Caveat: Obviously OpenAI chose which problems to test the model on. So "autonomous" means the model generated the idea and wrote the proof, not that it wandered into the problem on its own. But if reasoning models can reliably make cross-domain connections like this, finding paths that experts didn't prioritize, this changes research far beyond math. Biology, physics, materials science, medicine. This isn't AI reproducing human knowledge anymore. This is AI producing new knowledge. That's a qualitative shift.

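Translated summary: OpenAI's internal reasoning model autonomously resolved the planar unit distance problem, a famous open problem in mathematics that stood for nearly 80 years. The model overturned Paul Erdős's conjecture by finding entirely new point configurations that beat the traditional square grid by a fixed polynomial factor. The proof uses cross-disciplinary methods such as algebraic number theory, was verified by external mathematicians, and was called "a milestone in AI mathematics" by Fields Medalist Tim Gowers. This is the first time AI has independently solved a prominent open problem in mathematics, marking a shift from reproducing knowledge to creating it, and its cross-domain reasoning could have far-reaching effects on research in many fields.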
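
For readers who want to see what the classical grid constructions are counting, here is a small brute-force sketch (the function names are made up for this example). It counts pairs of points in an m × m integer grid at a chosen squared distance d; rescaling the grid by 1/√d turns those pairs into unit-distance pairs. It illustrates only the old baseline, not the new construction from the OpenAI result.

```python
from itertools import combinations
from typing import List, Tuple

Point = Tuple[int, int]


def grid_points(m: int) -> List[Point]:
    """All points of the m x m integer grid."""
    return [(x, y) for x in range(m) for y in range(m)]


def pairs_at_squared_distance(points: List[Point], d: int) -> int:
    """Number of unordered point pairs whose squared distance equals d."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d
    )


grid = grid_points(10)                        # n = 100 points
axis_pairs = pairs_at_squared_distance(grid, 1)   # neighbours along rows and columns
knight_pairs = pairs_at_squared_distance(grid, 5) # offsets like (1, 2) and (2, 1)
```

On the 10 × 10 grid, distance √5 already gives more pairs than distance 1, because 5 can be written as a sum of two squares in more ways. Picking a distance with many such representations is the idea behind the "stretched grid" examples mentioned in the posts.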

Z.ai @Zai_org · May 21 · 75

http://x.com/i/article/2057206923208884224

# Next-generation LLM Inference Network: How ZCube Alleviates Network Bottlenecks?

LLM inference is reshaping AI infrastructure. The network used to be the least interesting part of an inference cluster. That isn't true anymore. With long-context inference and Prefill-Decode disaggregation now standard, the network sits on the critical path of throughput, tail latency, and per-token serving cost.

To address the increasingly severe topology-induced congestion in Prefill-Decode disaggregated deployments, Z.ai, Harnets.AI, and Tsinghua University jointly developed and deployed the ZCube network architecture in an online production environment. The deployment shows that system-level innovation at the network architecture layer can unlock hardware potential in a highly cost-effective way.

In production benchmarking for the GLM-5.1 coding workload, ZCube delivered significant gains through architectural optimization alone:

- Cost optimization: GPUs, the software stack, and applications remained unchanged, while switch and optical module CapEx was reduced by 33%.
- Throughput improvement: Average GPU inference throughput increased by 15%.
- Latency improvement: TTFT P99 was reduced by 40.6%.

The root cause of the congestion lies in the shift of inference traffic patterns. As PD disaggregation becomes mainstream, cross-node KV Cache transfers make inference traffic highly asymmetric, with dynamically changing sources, destinations, and traffic volumes. In traditional ROFT (Rail-Optimized Fat-Tree) architectures, static topology and port mappings can easily concentrate traffic on a limited set of switches and links, causing local hotspots, queue buildup, and PFC backpressure. This leads to a structural issue where aggregate bandwidth appears sufficient, yet localized congestion occurs frequently.

ZCube addresses this issue by using a fully flattened network topology together with a hybrid single-rail / multi-rail access design.
At the network architecture layer, it decouples and distributes PD traffic across a broader path space, reducing the probability of topology-induced congestion at its source. This provides a more efficient networking foundation for next-generation hyperscale inference clusters.

# Network Becoming a Bottleneck for Effective Inference

When thousands of GPUs serve online inference requests concurrently, every KV Cache transfer and every data synchronization operation traverses the inter-GPU network. As long-context inference and Prefill-Decode disaggregated inference gradually become mainstream, data exchange between Prefill and Decode nodes continues to grow. Network bandwidth, and more importantly the ability to use it effectively, has begun to affect cluster-level throughput and latency directly.

To quantify the impact of networking on inference performance, we first conducted an ablation study on a 512-GPU cluster. We kept GPU compute, the software stack, the model, and application logic unchanged, and only adjusted the available NIC bandwidth cap. We then measured changes in overall cluster throughput and Time to First Token (TTFT).

For example, when network bandwidth was increased from 100Gbps to 200Gbps, overall inference throughput improved by approximately 19%, while TTFT decreased by approximately 22%. This indicates that, in LLM inference, network bandwidth has become one of the key factors constraining service performance.

# 1. Network Congestion in Inference

Today, AI clusters commonly use Clos, or Fat-Tree, architectures. The basic idea is to scale the network by stacking multiple layers of switches. However, the performance of Clos networks depends heavily on ideal load balancing across switches, which is difficult to achieve in practice due to routing policies and real traffic patterns.
For example, in many two-tier Fat-Tree deployments, which consist of Spine and Leaf layers, traffic across Spine switches can become severely imbalanced. As a result, upper-layer applications often fail to obtain the expected network performance.

To reduce the overhead of cross-layer forwarding, the industry often adopts ROFT (Rail-Optimized Fat-Tree) architectures [1]. As shown in Figure 3, ROFT groups GPUs by index ("rail"), and connects GPUs with the same index to the same Leaf switch, reducing the communication cost across Spine switches.

ROFT works well for certain training traffic patterns. However, in Prefill-Decode disaggregated inference, we observed a more prominent issue: KV Cache transfers exhibit strong source-destination asymmetry. Different GPUs and different NICs carry highly uneven communication loads, as shown in Figure 4. As a result, ROFT’s rail mapping no longer naturally translates into load balancing. Instead, traffic can become concentrated on a small number of Leaf switches and links, leading to link congestion and degraded transfer performance. This manifests in several ways:

- Some Leaf switches become persistent load hotspots, increasing the probability that multiple KV Cache transfer flows compete on the same links. As a result, actual transfer throughput can fall far below the NIC bandwidth capacity.
- Certain egress queues on some Leaf switches remain at high depth for extended periods and frequently trigger PFC backpressure, as shown in Figure 5.
- Link congestion further amplifies tail latency, affecting both TTFT and overall throughput.

It is important to distinguish between the two types of network congestion, as illustrated in Figure 6:

- Unavoidable congestion: For example, when multiple GPUs send data to the same destination at the same time, contention on the final-hop link is inevitable.
- Avoidable congestion: This is caused by topology design, traffic mapping, or imbalanced multipath utilization.
  Fundamentally, it is an architecture-level design problem.

For the first type of congestion, we typically rely on congestion control, traffic shaping, and related mechanisms to mitigate its impact. For the second type, new network transport mechanisms such as adaptive routing [2], packet spraying [3,4], and MRC [5] can help. However, a more effective approach is to prevent network conflicts that should not occur in the first place through innovation at the network architecture layer.

Prefill-Decode disaggregated inference is a typical example. If the network topology cannot match the traffic pattern, the system will repeatedly generate load hotspots and link conflicts. Solving this problem requires rethinking the inference network architecture itself.

# 2. ZCube Network Architecture

To address the above issues, we deployed a new ZCube network architecture [6]. ZCube breaks away from the traditional Clos design philosophy of hierarchical switch stacking and instead introduces a fully flattened GPU server interconnect. The ZCube routing strategy, designed specifically for the ZCube architecture, fully leverages the structural properties of the flattened topology. It can achieve near-ideal load balancing across all switches in the network, thereby significantly improving overall cluster network bandwidth.

Compared with Clos, ZCube has a natural advantage in load balancing. This advantage benefits both training clusters and inference clusters. Importantly, ZCube achieves these performance gains while reducing switch and optical module costs by approximately one third compared with Clos. Based on current mainstream switch and NIC configurations, ZCube can support flattened networking for tens of thousands, or even hundreds of thousands, of GPUs.

## 2.1 ZCube Core Architecture

As shown in Figure 7, the core ideas of ZCube are:

1. Remove the Spine switch layer.
2. Divide Leaf switches into two groups of equal size, typically odd-numbered switches and even-numbered switches.
3. Establish a complete bipartite interconnect between the two switch groups.
4. Connect the two ports of each GPU NIC to the corresponding switches in the two groups using single-rail and multi-rail access patterns.

Suppose each GPU has a corresponding NIC with two ports, i.e., p = 2. There are n GPUs in total, and GPUs and NICs share the same indices: 1, 2, …, n. Let k denote the number of GPUs connected to each switch. The total number of switches is 2n/k, numbered 1, 2, …, 2n/k. For GPU i, where 1 ≤ i ≤ n:

- The first port connects to the odd-numbered switch: ((i−1) mod (n/k)) × 2 + 1
- The second port connects to the even-numbered switch: ⌈i/k⌉ × 2

The two switch groups are connected as a complete bipartite graph: every odd-numbered switch connects to every even-numbered switch. A ZCube topology under dual-port NIC configuration, with p = 2, n = 32, and k = 8, is shown in Figure 7.

## 2.2 Key Properties of ZCube

### Network Diameter

ZCube has a network diameter of two switch hops, meaning any pair of GPUs can reach each other through two switches. This sits between a one-layer switch network, which has one switch hop but limited scale, and a conventional two-layer switch network, which supports a larger scale but typically requires three switch hops and incurs higher latency.

### Load Balancing

First, the ZCube routing strategy ensures that each GPU pair has a unique optimal path, avoiding traffic conflicts caused by multipath route selection. Second, ZCube uses two complementary GPU-to-switch connection patterns. One switch group connects to GPUs in a single-rail pattern, where each switch connects to a contiguous range of GPU IDs. The other switch group connects to GPUs in a multi-rail pattern, where each switch connects to GPUs with the same relative index across groups.
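The port mapping from Section 2.1 and the two access patterns described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the deployed ZCube tooling: the function names (`zcube_ports`, `build_topology`, `switch_hops`) are hypothetical, NICs are assumed to be dual-port (p = 2), and k is assumed to divide n.

```python
from collections import defaultdict


def zcube_ports(i: int, n: int, k: int) -> tuple[int, int]:
    """Return (odd_switch, even_switch) for GPU i, using 1-based indices."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    if not 1 <= i <= n:
        raise ValueError("GPU index must be in 1..n")
    odd_switch = ((i - 1) % (n // k)) * 2 + 1  # first port: same relative index
    even_switch = ((i + k - 1) // k) * 2       # second port: ceil(i / k) * 2
    return odd_switch, even_switch


def build_topology(n: int, k: int) -> dict[int, list[int]]:
    """Map each switch number to the list of GPUs attached to it."""
    members: dict[int, list[int]] = defaultdict(list)
    for gpu in range(1, n + 1):
        for switch in zcube_ports(gpu, n, k):
            members[switch].append(gpu)
    return dict(members)


def switch_hops(i: int, j: int, n: int, k: int) -> int:
    """Switches crossed between two GPUs: 1 if they share a switch, else 2.

    Two hops always suffice because every odd-numbered switch is linked
    to every even-numbered switch (complete bipartite interconnect).
    """
    shared = set(zcube_ports(i, n, k)) & set(zcube_ports(j, n, k))
    return 1 if shared else 2


if __name__ == "__main__":
    topo = build_topology(n=32, k=8)
    for switch in sorted(topo):
        print(switch, topo[switch])
```

For the n = 32, k = 8 example this gives eight switches with eight GPUs each: switch 1 (odd group) holds GPUs 1, 5, 9, ..., 29, while switch 2 (even group) holds the contiguous range 1 through 8.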
This design enables ZCube to achieve highly effective load balancing across the entire switch fabric under both typical AI training traffic patterns, such as AllReduce and All-to-All, and typical AI inference traffic patterns, where source-destination relationships are uncertain, and NIC loads can be highly imbalanced. As a result, ZCube can avoid the second type of network congestion described earlier at the architecture layer. As shown in Figure 8, traffic flows that would conflict under ROFT can obtain dedicated network paths under ZCube, thereby avoiding congestion.

### Scalability

ZCube provides strong scalability while preserving its favorable performance characteristics. For example, using one layer of 51.2T switches, each with 128 × 400Gbps ports, ZCube can construct a network connecting 16,384 400Gbps NICs. If higher-capacity switches are used, or if the ZCube network is divided into more planes, the architecture can scale further to support interconnection among tens of thousands or even hundreds of thousands of GPUs.

### Cost

At the same cluster scale, ZCube can reduce switch and optical module costs by approximately one third compared with traditional Clos / ROFT architectures. For example, in a 10,000-GPU AI cluster, ZCube can save roughly 210 million RMB to 640 million RMB in network hardware investment.

These characteristics show that ZCube can achieve better load balancing and performance while requiring lower network hardware cost.

## 2.3 Real-World Cluster Testing: Boosting Inference Performance While Cutting Network Costs

We upgraded the network architecture of a thousand-GPU cluster running GLM-5.1 coding inference services from the original ROFT to the ZCube architecture.
Since the ZCube architecture eliminates the Spine-layer switches found in traditional Clos architectures, the legacy cabling patterns, IP addressing schemes, routing policies, and switch configuration methods established under the Clos framework could not be reused directly, necessitating a complete redesign tailored to ZCube.

To tackle these challenges, the Harnets.AI Network Team designed a comprehensive network solution centered on the ZCube architecture. They developed a suite of automation tools, including the ZCube Controller, a data center layout design tool, and a cabling correctness verification program. This enabled capabilities such as data center deployment planning, cabling validation, automated configuration generation, and batch deployment, effectively resolving numerous hurdles in ZCube deployment. This suite of tools was the critical factor enabling the successful transformation of a large-scale production cluster within an exceptionally tight timeframe.

Following the seamless network architecture migration, we conducted real-world testing on the ZCube architecture by running the GLM-5.1 coding inference services on this cluster. By comparing the cluster's inference performance before and after the upgrade, we found that ZCube boosted the average GPU inference throughput by over 15% compared to the ROFT architecture (as shown in Figure 9), while dropping the P99 tail latency of TTFT by 40.6%.

In summary, for GPU and server hardware of the same scale and configuration, and without modifying any applications, upgrading the networking architecture to ZCube allowed us to not only save 1/3 of the optical modules and switch hardware, but also enable the cluster to serve 15% more inference requests per second. Against the current backdrop of exploding inference workloads and severe shortage of compute resources, this approach proves to be highly pragmatic and valuable.
Currently, this ZCube cluster has been running stably for over two weeks, playing a vital role in powering the GLM-5.1 coding inference services.

# 3. Conclusion

LLM inference is moving from point-wise optimization toward system-level co-design. The coupling between the network and the inference engine is becoming increasingly tight, making networking a critical component of the inference system.

The production deployment of ZCube shows that network architecture innovation can directly unlock the effective capacity of inference systems. By better aligning the network architecture with KV Cache transfers and PD traffic patterns, ZCube reduces the probability of topology-induced congestion at the source, improving throughput and latency while enhancing cluster cost efficiency.

Looking ahead to next-generation LLM infrastructure, network design will evolve from general-purpose interconnects toward model-traffic-driven system co-design. Long-context inference, PD disaggregation, MoE, and integrated training-inference workloads are reshaping intra-cluster communication patterns, requiring network topology, communication libraries, and scheduling policies to be jointly optimized around real model traffic. We will continue pioneering novel AI network architectures for larger-scale inference and training clusters, upgrading the network from a foundational GPU connection layer into a core driver of token generation efficiency, system resilience, and cost-effectiveness.

# Acknowledgements

ZCube was published at ACM SIGCOMM 2025, and was recognized as a paper that “significantly change[s] the way we think about and understand networking.” This is the first large-scale deployment of the technology in a production inference cluster. We thank the Harnets.AI team for their professional support and close collaboration throughout this network architecture upgrade and optimization effort.

## References

[1] NVIDIA. 2023. SuperPOD: Next Generation Scalable Infrastructure for AI Leadership. https://docs.nvidia.com/https:/docs.nvidia.com/dgx-superpod-reference-architecture-dgx-h100.pdf

[2] NVIDIA. 2025. https://developer.nvidia.com/blog/accelerating-ai-storage-by-up-to-48-with-nvidia-spectrum-x-networking-platform-and-partners/

[3] Ultra Ethernet Consortium. Ultra Ethernet specification v1.0.1, 2025.

[4] Tommaso Bonato, Abdul Kabbani, Ahmad Ghalayini, Michael Papamichael, Mohammad Dohadwala, Lukas Gianinazzi, Mikhail Khalilov, Elias Achermann, Daniele De Sensi, and Torsten Hoefler. REPS: Recycled entropy packet spraying for adaptive load balancing and failure mitigation, 2026.

[5] Araujo, J., Chow, A., Handley, M., Lewis, R., Paasch, C., Padhye, J., … & Sur, S. (2026). Resilient AI Supercomputer Networking using MRC and SRv6. arXiv preprint arXiv:2605.04333.

[6] Yan, Z., Li, D., Chen, L., Xiong, D., Gao, K., Zhang, Y., … & Lin, H. (2025, September). From ATOP to ZCube: Automated topology optimization pipeline and a highly cost-effective network topology for large model training. In Proceedings of the ACM SIGCOMM 2025 Conference (pp. 861-881).

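Summary: As long-context inference and Prefill-Decode disaggregated deployment become mainstream, the GPU cluster network has shifted from a secondary component to a key bottleneck constraining inference throughput, tail latency, and cost. Traditional static network topologies clash with the dynamic, asymmetric traffic pattern of KV Cache transfers, causing localized congestion. To address this, Z.ai, Harnets.AI, and Tsinghua University jointly developed the ZCube network architecture. It uses a fully flattened topology with a hybrid access design, decoupling and distributing traffic at the source to reduce congestion. In production testing with GLM-5.1, with GPUs and the software stack unchanged, ZCube cut switch and optical module costs by 33%, raised average inference throughput by 15%, and lowered P99 time to first token by 40.6%, showing that network architecture innovation can unlock hardware potential.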

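Emad@EMostaque · May 21 · 91

Once AI starts solving open problems in novel ways, it won’t stop. We are entering the final stage of human solutions to open problems like this. Feels weird, doesn’t it?

Summary: An OpenAI model has, for the first time, autonomously solved the planar unit distance problem posed by Paul Erdős in 1946, overturning the conjecture that had prevailed among mathematicians for nearly 80 years. The AI not only produced a better solution but also discovered an entirely new family of constructions. The event is seen as a milestone in AI capability, suggesting that AI is beginning to make sustained, novel breakthroughs on open scientific problems and may mark the arrival of the "final stage" of human-led solutions to such problems.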

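Greg Brockman@gdb · May 21 · 92

An OpenAI model has achieved a major breakthrough in mathematics, by disproving a central conjecture in discrete geometry that was first posed by Paul Erdős in 1946. This is the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

Summary: An OpenAI model has made a major breakthrough in discrete geometry, autonomously resolving the planar unit distance conjecture first posed by the mathematician Paul Erdős in 1946. It is the first time AI has independently solved a famous open problem central to a field. For nearly 80 years, mathematicians generally believed the best solutions looked roughly like square grids; the OpenAI model found entirely new constructions that perform better, overturning that long-held belief.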

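AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · May 21 · 87

"This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics."

Summary: An OpenAI model has autonomously cracked the planar unit distance problem, a famous open problem in mathematics that stood for nearly 80 years. Posed by Paul Erdős in 1946, the problem was traditionally thought to have optimal solutions resembling a square grid. The model's discovery not only overturned that long-standing assumption but also produced entirely new, better-performing constructions, marking the first time AI has independently solved a major open problem in a core area of mathematics.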

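Noam Brown@polynoamial · May 21 · 86

Today, we’re sharing that a general-purpose internal @openai model achieved a breakthrough on one of the best-known combinatorial geometry problems. Less than 1 year ago frontier AI models were at IMO gold-level performance. I expect this pace of progress to continue.

Summary: OpenAI's general-purpose AI model has made a breakthrough in combinatorial geometry, autonomously solving the planar unit distance problem for the first time. Posed by the mathematician Paul Erdős in 1946, the problem's best solutions were widely believed for nearly 80 years to resemble a square grid, but the model overturned that assumption and found an entirely new family of better constructions. The result marks the first time AI has independently solved a core open problem in mathematics and shows how quickly AI is progressing in basic scientific discovery.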

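Noam Brown@polynoamial · May 21 · 83

Today, we’re sharing that a general-purpose internal @openai model achieved a breakthrough on one of the best-known combinatorial geometry problems. Less than 1 year ago frontier AI models were at IMO gold-level performance. I expect this pace of progress to continue.

Summary: OpenAI announced that one of its internal general-purpose models has made a breakthrough in combinatorial geometry by autonomously solving the planar unit distance problem. Posed by the mathematician Paul Erdős in 1946, the problem was widely believed for nearly 80 years to have optimal solutions close to a square grid. The new model overturned that long-held belief and found an entirely new family of better constructions. The event marks the first time AI has independently solved a core open problem in mathematics and shows AI's rapid, sustained progress in scientific discovery.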

OpenAI@OpenAI · May 21 · 81

Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.

AK@_akhaliq · May 21 · 67

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

AK@_akhaliq · May 21 · 64

ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop

Rohan Paul@rohanpaul_ai · May 20 · 62

Anthropic's new study says frontier AI needs input from scholars, philosophers, clergy, and civic thinkers because model behavior is becoming a question of character, not just code. Their point is that Claude is not only trained to predict text, because later training pushes it toward some behaviors and away from others, which means engineers are quietly shaping something like a machine’s habits. The hard problem is moral formation: a model can sound helpful in normal tasks, then bend under pressure, flatter the user, ignore risk, or follow a bad instruction because the situation rewards obedience. Anthropic says it spoke with people from 15+ religious and cross-cultural groups to study how humans build stable character across pressure, conflict, temptation, and social influence. Their idea is a self-reminder tool, where Claude can pause mid-task and call up its own commitments before taking a serious action. That pause reportedly lowered misaligned behavior in internal tests, though Anthropic says it still needs to separate the value of the reminder from the value of slowing the model down.

Summary: Anthropic's latest research argues that the behavior of frontier AI is increasingly a matter of shaping "character" rather than code alone. It holds that engineers are in effect shaping the AI's "habits" during later training, and that the core challenge is keeping the model morally stable under pressure. To that end, Anthropic held dialogues with more than 15 religious and cross-cultural groups to explore how humans cultivate character. Its proposed solutions include a "self-reminder" tool that helps the AI review its own commitments before carrying out a critical task; internal tests show this has significantly reduced misaligned behavior. The research aims to broaden the public conversation about AI development.

AK@_akhaliq · May 20 · 56

Code as Agent Harness

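elvis@omarsar0 · May 20 · 64

Very interesting results from this NanoGPT-Bench eval. There is so much talk about self-improving agents. But can coding agents do real AI R&D? @IntologyAI reports that Codex, Claude Code, and Autoresearch recover only 9.3% of human progress. Coding agents spend more of their compute on hyperparameter tuning. In fact, coding agents rarely attempt algorithmic research at all. Claude Code and Autoresearch both reason more about algorithmic research, but still dodge implementation. Read more here: https://www.intology.ai/blog/nanogpt-bench

Summary: The NanoGPT-Bench evaluation released by IntologyAI shows that coding agents such as Codex, Claude Code, and Autoresearch recover only about 9.3% of human progress on AI R&D tasks. Most of the agents' compute goes to hyperparameter tuning, with little invested in core algorithmic research. Claude Code and Autoresearch touch on algorithmic research somewhat more in their reasoning, but still fall short at actual code implementation. The evaluation is based on the NanoGPT Speedrun competition, uses a standardized five-month world-record window, and runs fully autonomously end to end to control for model dependence and data contamination. The results indicate that today's coding agents remain quite limited in their ability to carry out real AI R&D autonomously.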

Ethan Mollick@emollick · May 20 · 75

🚨Our paper is out in PNAS: we found classic human persuasion techniques worked on AIs in a "parahuman" way, making them agree to objectionable requests (upping compliance from 35% to 51%). It worked on a range of major LLMs, though newer models resist more. https://www.pnas.org/doi/10.1073/pnas.2535868123

AK@_akhaliq · May 19 · 51

Nvidia presents LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

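elvis@omarsar0 · May 19 · 62

// Code as Agent Harness // 100+ page report on all things related to agent harnesses. (bookmark it) In particular, the survey summarizes methods and applications of code as agent harness. This paper makes a strong case that code-as-harness might be the key to moving us towards a broader science of harness engineering. Is code all you need? Maybe. Regardless, the paper argues that future systems must have the following four properties: executable, inspectable, stateful, and governed. Paper: https://arxiv.org/abs/2605.18747 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: The tweet highlights a 100-plus-page report on AI agent harnesses whose central claim is that "code as agent harness" has significant potential. The report summarizes related methods and applications and argues that this path may lead toward a broader science of harness engineering. The paper further proposes that future intelligent systems must have four key properties: executable, inspectable, stateful, and governed. The report aims to serve as a reference for building effective AI agents, and the tweet recommends related learning resources.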

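Rohan Paul@rohanpaul_ai · May 19 · 71

Humanoid value will not come from looking human, but from having enough body surface, strength, balance, and feedback to turn messy objects into manageable ones.

Summary: The core value of humanoid robots lies not in looking human but in having enough physical capability (such as strength, balance, and whole-body coordination) to handle complex tasks. The key to this is "whole-body control", where the robot uses its entire body to interact with the environment and adapt to changing loads. Boston Dynamics' Atlas robot demonstrated this by handling dynamic loads of more than 100 pounds using proprioception. To achieve high-performance manipulation, the team has abandoned the traditional MPC control paradigm and moved entirely to reinforcement learning (RL). This whole-body control capability is the foundation of physical intelligence and the core of the humanoid value proposition.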

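Berryxia.AI@berryxia · May 19 · 67

Folks, this research is hugely valuable for the study of ancient history! They just open-sourced Chronicles-OCR, a benchmark built specifically to test how well VLLMs perceive ancient Chinese characters. The dataset spans 3,000 years of evolution and covers 7 historical scripts, from oracle bone script all the way to cursive script, with 2,800 balanced images from real artifacts in different materials. The test has 4 core tasks: character localization, fine-grained recognition, ancient character parsing, and script classification. The results sting: once the visual distribution drifts over time, most models' perception simply collapses. Everyone used to compete on modern image-text understanding; now Tencent has pulled AI onto ancient writing that you genuinely have to "travel through time" to read. This is what truly connecting cultural heritage with AI vision looks like. The paper and the full dataset are open source: Paper: https://arxiv.org/abs/2605.11960 GitHub: https://github.com/Tencent/Hunyuan-Chronicles-OCR I haven't read the paper yet; it's worth a careful look later.

Summary: Tencent has open-sourced the Chronicles-OCR benchmark, designed specifically to evaluate how well vision-language models perceive ancient Chinese characters. The dataset spans 3,000 years of evolution, covers 7 historical scripts from oracle bone script to cursive script, and contains 2,800 real images from a variety of materials. The study defines four core tasks: character localization, fine-grained recognition, ancient character parsing, and script classification. The results show that, faced with the visual distribution shift that historical scripts introduce, most models' perception drops sharply. The work provides an important AI evaluation tool for the study of ancient writing.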

All AI Updates
Full feed of AI-related news
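May 25
23:54
elvis@omarsar0
66
Microsoft Research proposes SkillOpt, which uses an optimizer to learn AI agent skill docs automatically

Microsoft Research has proposed SkillOpt, which treats an AI agent's skill doc as trainable external state rather than something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions, with a textual learning rate controlling how aggressively each round rewrites the doc; the agent itself never changes. In experiments, SkillOpt is best or tied for best on all 52 test cells (spanning models, benchmarks, and harnesses). On GPT-5.5, relative to no skill doc, it gains 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code. It beats human-written skills and other automated methods, adds no inference-time cost, and the learned skills transfer across models and harnesses.

Agents · Microsoft · Papers/Research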
19:28
Rohan Paul@rohanpaul_ai
75
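Huawei releases breakthrough chip design approach "LogicFolding"

Huawei has proposed two new approaches, "τ scaling" and "LogicFolding", aimed at narrowing its performance gap with TSMC without relying on the most advanced lithography tools. The core idea is to shift the metric of chip progress from transistor size to signal propagation delay (τ). LogicFolding, the concrete implementation, stacks logic circuit layers vertically and uses hybrid bonding to place circuits that need to communicate next to each other, shortening critical wires, reducing resistance and parasitic capacitance, and raising signal speed. Huawei says its next-generation Kirin phone chip will be the first full test of the τ scaling law.

Rohan Paul: 🇨🇳 Huawei reveals a new chip design breakthrough under US sanctions pressure. A design approach meant to close the gap...

On-device · Papers/Research
Related discussion (1): IT之家 (RSS)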
03:57
Rohan Paul@rohanpaul_ai
65
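Full attention revisited: turning full attention sparse in under a hundred training steps

Alibaba and Nanjing University propose RTPurbo, a lightweight adaptation method. It finds that trained full-attention models contain a hidden sparse structure. A lightweight 16-dimensional token finder acts as a "scout", locating important tokens for the few key attention heads that need long-range information, while the remaining heads attend mainly to local text. On a 1-million-token prefill task, RTPurbo achieves up to a 9.36x speedup over FlashAttention-2 and roughly 2x faster decoding, while keeping accuracy close to the full-attention model on long-context and reasoning benchmarks. The work suggests that the wasted computation in long-context inference has exploitable structure.

arXiv · Inference/Reasoning · Papers/Research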
02:57
Chubby♨️@kimmonismus
60
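Another nine Erdős problems have been solved, and this time by Google DeepMind. This should not be underestimated: it increases competitive pressure, and it also shows that other frontier labs can easily keep up.

Przemek Chojecki | PC: Another 9 open Erdos problems solved, this time by DeepMind team. Interesting loop of LLM - Lean agents working autonomo...

DeepMind · Inference/Reasoning · Papers/Research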
02:57
Rohan Paul@rohanpaul_ai
73
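Large MoE models may waste half their compute on easy tokens that need no expert help

The paper proposes ZEDA, a framework that converts static MoE models that are fixed after training (such as Qwen3 and GLM) into dynamic ones, allowing the router to skip expert calls when a token is simple enough. On Qwen3-30B-A3B and GLM-4.7-Flash, ZEDA removes about 50% of expert computation with only a slight accuracy loss and delivers roughly a 20% real-world inference speedup. The study finds that compute allocation mainly tracks the model's uncertainty rather than simply following task difficulty.

Inference/Reasoning · Papers/Research · Deployment/Engineering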
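May 24
20:27
Chubby♨️@kimmonismus
68
German study: ordinary WiFi routers can identify individuals almost perfectly

Researchers at KIT in Germany show that an ordinary WiFi router can identify individuals almost perfectly, with no phone, special hardware, or line of sight required. The system exploits the unencrypted beamforming feedback that every connected device broadcasts. In tests with 197 participants, identification accuracy was close to 100%. The study notes that this kind of surveillance infrastructure (routers in cafes, airports, and offices) is already everywhere, and the core question is who will start reading and exploiting these signals.

Safety/Alignment · Papers/Research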
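May 23
23:51
elvis@omarsar0
64
Tuning the runtime interface instead of the model to improve AI agent generality

A new study proposes improving AI agent performance by refining the runtime interface that wraps a frozen LLM rather than modifying the model itself. The method turns recurring interaction failures into reusable interventions at the runtime layer, achieving an average relative performance gain of 88.5% across 126 settings in 7 deterministic environments. A key finding is that runtime methods learned from a single model's trajectories transfer successfully to 18 different model backbones, indicating that they capture environment structure rather than model-specific patterns. This offers a more portable way to deploy AI agents in production.

Agents · Papers/Research · Deployment/Engineering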
21:27
Rohan Paul@rohanpaul_ai
61
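Study: AI agent performance depends more on the external control system than on the prompt itself

The study argues that an AI agent's real-world performance depends more on the external control system around the model (the agent harness) than on the prompt alone. Many agents look like a single model, but their behavior is actually driven by surrounding code for planning, tool calls, and memory management, so long tasks tend to fail at steps such as lost state or verification drift. The paper therefore proposes a "natural-language agent harness", which expresses the control flow explicitly in structured natural language so that it can be inspected, ported, and tested. The study finds that more complex harnesses change agent behavior substantially but do not yield consistent performance gains, which suggests that harness design is a key choice for reliability rather than a quick fix.

Agents · Papers/Research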
20:27
Rohan Paul@rohanpaul_ai
55
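Why AI detectors fail so easily: the diversity of student writing styles

The study argues that AI detectors fail so often because student writing styles are so diverse that judging from a single document whether it was AI-generated is extremely hard. The problem is not only that AI writing is improving; many real students write in styles that are statistically very close to AI output. A detector cannot know each student's writing habits in advance, so there is no fixed standard for "human writing". Any detector that catches a large share of AI text will therefore inevitably misclassify some real students, especially those whose writing is more standardized, formulaic, or shaped by learning English. Existing techniques may lower the error rate, but they cannot eliminate the structural misclassification inherent in judging a single document in isolation.

arXiv · Safety/Alignment · Papers/Research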
08:27
Rohan Paul@rohanpaul_ai
64
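New Google research: AI learns physiological patterns to make wearables more valuable

Google Research proposes SensorFM, a foundation model that learns general patterns of human physiological activity from more than 1 trillion minutes of wearable sensor data generated by over 5 million people. It goes beyond traditional approaches that compress data into simple metrics, extracting meaningful structure from the data and reusing it across many health prediction tasks. Experiments show that performance improves with model size and data volume, and that its learned representations outperform baselines built on engineered features on 34 of 35 prediction tasks.

Google · Data/Training · On-device · Papers/Research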
06:57
Rohan Paul@rohanpaul_ai
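Featured · 79
AlphaProof Nexus: driving AI mathematical proof search with formal verification

Google DeepMind has introduced AlphaProof Nexus, a system that combines large language models with the Lean formal verification tool. While generating a proof, the LLM repeatedly reads Lean's compiler errors and fixes them, and it can call stronger tools to help with subproblems. This forces the model to turn every logical step into code that compiles and can be verified, shifting its role from "persuasive narrator" to "candidate generator". In tests on 353 Erdős problems and 492 open conjectures, the system solved 9 Erdős problems and proved 44 sequence conjectures. The work shows the key role formal verification plays in exposing AI logic errors and in establishing a new division of labor: humans pose questions, models explore, and the verifier acts as gatekeeper.

arXiv · DeepMind · Inference/Reasoning · Papers/Research
Related discussion (2): The Decoder: AI News (RSS) · IT之家 (RSS)
Why it's recommended: DeepMind packs AI's "mathematical intuition" into the Lean compiler, where every step has to compile. The result: 9 Erdős problems solved, and even the failures exposed hidden errors. This paper redefines the paradigm for how AI does mathematics.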
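May 22
21:26
Rohan Paul@rohanpaul_ai
46
This robot from the RAI Institute juggles three balls using dynamic hand adjustments. It processes visual and contact information to sustain the pattern without external assistance.
Embodied AI · Papers/Research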
09:56
Chubby♨️@kimmonismus
54
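University of Tokyo develops ultra-low-power chip: a thousandfold efficiency gain, but commercial use is a decade away

The University of Tokyo has developed a new chip component that processes data 1,000 times faster than conventional methods without generating additional heat. The key breakthrough is that its power consumption is only one hundredth of existing technology, which in theory could cut the energy use of a Google-scale data center to one hundredth of its current level and greatly ease the AI industry's energy pressure. However, a prototype is not expected until 2030, and commercialization will take longer still, highlighting the gap between AI's rapid growth and the time it takes for breakthrough energy-saving technology to reach mass production.

Papers/Research · Deployment/Engineering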
08:13
Berryxia.AI@berryxia
66
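Apple's digital-human face capture advances again, with realism reaching new heights

Apple's Persona team released a new paper ahead of WWDC26 showing its latest progress in face capture and animation. Judging from the demo, it achieves clear improvements in details such as eye micro-expressions, subtle head movements, and skin texture, making digital personas more lifelike and moving beyond a simple "digital avatar" toward a believable "digital double". Breakthroughs like this matter for immersive experiences in AR/VR, gaming, and remote collaboration, where they can break down the sense of unreality in virtual interaction. Apple continues to invest heavily in this area, and the paper and demo video are public.

Jonathan Cooper: Apple's Persona team continuing to do amazing work with face capture and animation. New paper released ahead of WWDC26 h...

Multimodal · Video · Papers/Research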
07:10
Saining Xie@sainingxie
60
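RAEv2 greatly simplifies the architecture and improves generality, converging more than 10x faster on tasks such as text-to-image (T2I) and world models while improving reconstruction and generation quality. Across extensive experiments, the team found that a strong representation encoder is crucial to the pixel decoder. Traditional metrics such as FID are no longer sufficient to measure model performance fully, and new metrics (such as ep@fid-k/fdr^k) show that generative modeling still has wide room for research.

Jaskirat Singh: In Oct last year, Representation Autoencoders provided an elegant solution to unified tokenization for understanding and...

Image Generation · Papers/Research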
02:43
Ethan Mollick@emollick
61
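GPT-5.2 appears to have reached expert level at peer review: 45 scientists spent 469 hours evaluating human and AI reviews of 82 papers. The authors report, surprisingly, that current AI reviews can rival even the top reviewers in official Nature peer review, though not without weaknesses.
OpenAI · Inference/Reasoning · Papers/Research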
01:26
AK@_akhaliq
68
Mix-Quant: quantized prefill, precise decoding, for agentic LLMs
Agents · Papers/Research · Deployment/Engineering
00:26
AK@_akhaliq
56
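LongMINT: evaluating memory under multi-target interference in long-horizon agentic systems
Agents · arXiv · Inference/Reasoning · Papers/Research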
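May 21
17:03
Orange AI@oran_ge
81
AI autonomously cracks an 80-year-old math problem in a milestone breakthrough

An unreleased internal general-purpose reasoning model at OpenAI has autonomously solved the planar unit distance problem posed by the mathematician Erdős in 1946, upending what the field had expected the solution's structure to look like for nearly 80 years. Through a 125-page chain of thought, the model applied tools from algebraic number theory to a problem in discrete geometry, a methodological breakthrough across fields. Notably, the model was not trained specifically for mathematics; the result suggests that once general reasoning ability crosses a certain threshold, creativity may emerge naturally, marking a key step for AI in basic science.

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
Related discussion (7): TechCrunch: AI (RSS) · The Decoder: AI News (RSS) · X: 阿易 AI Notes (@AYi_AInotes) · OpenAI: official site updates (RSS, excluding enterprise/customer case studies) · IT之家 (RSS) · Hacker News top stories (buzzing.cc Chinese translation) · X: Sam Altman (@sama)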
15:57
Greg Brockman@gdb
78
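AI has reached a milestone in generating new mathematical knowledge. An OpenAI model has solved a famous long-open problem in combinatorial geometry, the planar unit distance problem (Erdos 1946), showing for the first time that the number of unit-distance pairs can be raised to a superlinear scale (n^{1+δ}), beyond all the linear constructions previously known to humans. This marks an important step for AI from solving known problems toward discovering new mathematics. The breakthrough drew strong reactions, with researchers saying they had "trouble sleeping", and is seen as a sign that the AGI era is approaching.

Alex Dimakis: A breakthrough by OpenAI in a very famous Combinatorics problem, the Planar Unit Distance problem by Erdos 1946. The pro...

OpenAI · Inference/Reasoning · Papers/Research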
15:26
Rohan Paul@rohanpaul_ai
78
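General-purpose AI reasoning cracks an 80-year-old math conjecture

OpenAI's general-purpose reasoning model has autonomously solved a famous math problem open since 1946, the planar unit distance problem. Rather than using a theorem-proving engine built specifically for mathematics, the model used increased inference-time compute to find new constructions that beat the traditional grid structure. This marks the first time AI has autonomously solved a core open problem in a field of mathematics. More importantly, the model connected the geometric problem to deep theory such as algebraic number theory, showing the potential of general-purpose AI for cross-disciplinary research and for widening the boundaries of human knowledge.

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research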
15:26
Rohan Paul@rohanpaul_ai
67
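Small model, big wisdom: stochastic reasoning delivers superior performance

GRAM, a model with only 10 million parameters, introduces learnable randomness so that it explores multiple different paths in parallel at inference time, breaking the limitation of conventional recursive models that lock into a single line of thought. At test time it runs these parallel trajectories simultaneously and uses a reward predictor to select the best result, adding a "width" dimension on top of depth. In experiments, GRAM reaches 97% accuracy on hard Sudoku, far above the previous best deterministic model; it also maintains strong performance on the multi-solution queens problem and can efficiently generate valid Sudoku puzzles. The framework offers a new way to improve the reasoning ability of small models.

Inference/Reasoning · Papers/Research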
12:44
Chubby♨️@kimmonismus
84
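OpenAI makes a breakthrough on the planar unit distance problem

An internal OpenAI reasoning model has autonomously solved the planar unit distance problem, a famous open problem in mathematics that stood for nearly 80 years. The model disproved Paul Erdős's conjecture and found entirely new point configurations that beat the traditional square-grid scheme by a fixed polynomial factor. The proof draws on cross-disciplinary methods such as algebraic number theory, was verified by external mathematicians, and was called "a milestone for AI mathematics" by Fields Medalist Tim Gowers. It is the first time AI has independently solved a core open problem in a field of mathematics, marking an important shift from reproducing knowledge to creating it, and its cross-domain reasoning ability could have far-reaching effects on research in many disciplines.

OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research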
05:50
Z.ai@Zai_org
75
ZCube network architecture: breaking the network bottleneck in LLM inference

Inference/Reasoning · Papers/Research · Deployment/Engineering
Related discussion (1): 智谱 (Zhipu): Research (embedded page data)
04:01
Emad@EMostaque
91
OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
03:36
Greg Brockman@gdb
92
OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
03:36
AI Notkilleveryoneism Memes ⏸️@AISafetyMemes
87
OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
03:17
Noam Brown@polynoamial
86
OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
03:17
Noam Brown@polynoamial
83
OpenAI: Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in ...

OpenAI · Inference/Reasoning · Papers/Research
03:17
OpenAI@OpenAI
81
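Today, we share a breakthrough on the planar unit distance problem, a famous open question first posed by Paul Erdős in 1946. For nearly 80 years, mathematicians believed the best possible solutions looked roughly like square grids. An OpenAI model has now disproved that belief, discovering an entirely new family of constructions that performs better. This marks the first time AI has autonomously solved a prominent open problem central to a field of mathematics.
OpenAI · Inference/Reasoning · Papers/Research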
00:05
AK@_akhaliq
67
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
arXiv · Inference/Reasoning · Data/Training · Papers/Research
00:05
AK@_akhaliq
64
ESI-Bench: Towards Embodied Spatial Intelligence that Closes the Perception-Action Loop
Embodied AI · Papers/Research
May 20
15:05
Rohan Paul@rohanpaul_ai
62
Anthropic study: frontier AI needs input from many fields to shape its character

Anthropic: Over the past few months, we've been holding dialogues with scholars, philosophers, clergy, and ethicists on the questio...

Anthropic · Safety/Alignment
09:03
AK@_akhaliq
56
Code as Agent Harness
Agents · Coding · Papers/Research
09:02
elvis@omarsar0
64
Evaluating how coding agents perform on AI R&D tasks

Intology: Can coding agents do research? We release NanoGPT-Bench, an internal eval we've used to test agents on an AI R&D problem...

Agents · Papers/Research · Evaluation/Benchmarks
05:32
Ethan Mollick@emollick
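Featured · 75
🚨Our paper is out in PNAS: we found classic human persuasion techniques worked on AIs in a "parahuman" way, making them agree to objectionable requests (upping compliance from 35% to 51%). It worked on a range of major LLMs, though newer models resist more. https://www.pnas.org/doi/10.1073/pnas.2535868123
Safety/Alignment · Papers/Research

Why it's recommended: This PNAS paper from Ethan Mollick and colleagues confirms that persuading an AI to do bad things the way you would persuade a person actually works. The jump from 35% to 51% is chilling; the only good news is that newer models resist more.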
May 19
23:58
AK@_akhaliq
51
Nvidia presents LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
Papers/Research
23:58
elvis@omarsar0
62
Code may be the key path for AI agent harnesses

Agents · arXiv · MCP/Tools · Papers/Research
18:28
Rohan Paul@rohanpaul_ai
71
Alberto Rodriguez: You can't lift a fridge with just your hands. Your whole body needs to conform to its shape, and bear the load between y...

Embodied AI · Papers/Research
16:00
Berryxia.AI@berryxia
67
Tencent open-sources the Chronicles-OCR benchmark for evaluating how vision-language models perceive ancient Chinese characters

Tencent Hy: 🎉 🎉 🎉 We're open-sourcing Chronicles-OCR, a visual perception benchmark evaluating VLLMs on ancient Chinese character...

Multimodal · Papers/Research