All AI Updates · AI HOT

608 items · Tag: Papers/Research

Anthropic@AnthropicAI · May 8 · 78

New Anthropic research: Natural Language Autoencoders. Models like Claude talk in words but think in numbers. The numbers—called activations—encode Claude’s thoughts, but not in a language we can read. Here, we train Claude to translate its activations into human-readable text.

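elvis@omarsar0 · May 8 · 63

Pay attention to this one if you build multi-agent systems.

Summary: The study reports that multi-agent LLM systems fail at rates of 41% to 87% in production, and that most failures stem from coordination defects rather than from the capability of the base model. Most current architecture comparisons cannot tell whether a gain comes from better coordination or from a larger context window. The authors argue for treating coordination as a separate, configurable architectural layer and support this with controlled experiments: with the LLM, tools, and prompts held fixed, changing only the coordination structure significantly changes system performance. This gives a cleaner methodology for measuring what a coordination mechanism contributes, and a framework that treats coordination as core architecture rather than an implementation detail.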

Z.ai@Zai_org · May 8 · 73

GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents. This report summarizes the main improvements behind GLM-5V-Turbo across model design, multimodal training, reinforcement learning, toolchain expansion, and integration with agent frameworks. These developments lead to strong performance in multimodal coding, visual tool use, and framework-based agentic tasks. http://arxiv.org/abs/2604.26752

AK@_akhaliq · May 7 · 62

RLDX-1 Technical Report
paper: https://huggingface.co/papers/2605.03269

AK@_akhaliq · May 7 · 58

Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
paper: https://huggingface.co/papers/2605.03849

AK@_akhaliq · May 7 · 67

PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
paper: https://huggingface.co/papers/2605.05163

Rohan Paul@rohanpaul_ai · May 7 · 48

This research builds a system that trains language models continuously using everyday conversations instead of manual labeling. The huge deal here is that this method completely removes the traditional need for human workers to manually gather, review, and score massive datasets. AI Agents can now use their everyday mistakes to get smarter automatically. Whenever a person replies to the digital assistant or corrects a mistake, the software treats that response as a direct learning signal. A background program reads these natural follow-up messages and extracts specific text hints about what the model should have done differently. The software agent simply updates itself in real time during normal use by analyzing how people naturally interact with it. Every time a person corrects an agent or a software test fails, the system receives a valuable clue about how to improve. ---- Think about a student looking at their final grade and throwing the paper away without reading the teacher's helpful notes. Current Reinforcement Learning systems do the exact same thing. Current models throw this natural feedback away because they only care about whether the final outcome was a success or a failure. OpenClaw-RL fixes this by grabbing 2 specific signals from every single interaction. - First, it looks at evaluative signals to see if the action worked. If a user asks the same question again, they are probably unhappy. If a test passes, it is a success. These become simple numerical rewards using a Process Reward Model judge. - Second, it gathers directive signals to figure out how the action needs to change. User corrections and error logs offer direct guidance. These become word-level supervision using a technique called Hindsight-Guided On-Policy Distillation. Personal chats, terminal commands, Graphical User Interface clicks, and software tasks all create these reaction signals. A single policy can learn from all of them at the same time. 
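It runs the training process in the background so the model never has to pause its normal tasks to learn. By treating standard deployment as a continuous learning environment, the model constantly adapts to individual user preferences without any manual data labeling. ---- Paper Link – arxiv.org/abs/2603.10165 Paper Title: "OpenClaw-RL: Train Any Agent Simply by Talking"

Summary: The study presents OpenClaw-RL, a system that trains language models continuously from everyday conversations without manually labeled data. It uses the natural feedback that arises during user interaction, such as corrections or repeated questions, as a real-time learning signal. From each interaction it extracts two signals: an evaluative signal (whether the action succeeded, converted into a numerical reward) and a directive signal (how the action should change, converted into word-level supervision). Standard deployment becomes a continuous learning environment in which the model updates itself in the background and adapts to individual user preferences, removing the dependence on large manually labeled datasets.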
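
To make the two signals concrete, here is a minimal sketch of turning one follow-up message into an evaluative reward and a directive hint. It is an illustration only: the `Feedback` type, the `extract_feedback` function, and the keyword heuristics are hypothetical stand-ins, not the paper's Process Reward Model judge or its Hindsight-Guided On-Policy Distillation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    reward: float         # evaluative signal: did the previous action work?
    hint: Optional[str]   # directive signal: how should the action change?


def extract_feedback(prev_user_msg: str, followup: str) -> Feedback:
    """Toy heuristic (assumption): a real system would use learned judges."""
    text = followup.strip()
    lowered = text.lower()
    # A repeated question suggests the user was not satisfied.
    if lowered == prev_user_msg.strip().lower():
        return Feedback(reward=-1.0, hint=None)
    # An explicit correction carries a negative reward plus a text hint.
    for marker in ("no,", "actually,", "instead,"):
        if lowered.startswith(marker):
            return Feedback(reward=-1.0, hint=text[len(marker):].strip())
    # Otherwise assume the turn went fine.
    return Feedback(reward=1.0, hint=None)
```

In this sketch the reward would feed a scalar RL objective while the hint would be kept as text for token-level supervision; how the two are combined in training is outside what the post describes.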

AK@_akhaliq · May 7 · 46

SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
paper: https://huggingface.co/papers/2411.18966

elvis@omarsar0 · May 6 · 64

// Skills as Verifiable Artifacts // Pay attention to this one, AI devs. If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified. The runtime should enforce that default rather than infer trust from origin. Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts. We have decades of supply-chain lessons on what happens when trust is inferred from a signature. This paper is the right ask for SKILL.md before agent skill libraries become the next attack surface. Paper: https://arxiv.org/abs/2605.00424 Learn to build effective AI agents in our academy: https://academy.dair.ai/

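Summary: The paper argues that agent skills should be treated as untrusted code by default rather than trusted on the basis of a signature or origin, and that runtimes which trust signed skills by default are exposed. A skill should become trusted only after passing a separate, gated verification process. Without that, human-in-the-loop approval has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. Treating skills as first-class deployment artifacts with a verification step applies lessons from software supply-chain security and keeps skill libraries from becoming the next attack surface. The paper calls for establishing this baseline before skill libraries become widespread.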

Anthropic@AnthropicAI · May 6 · 63

New Anthropic Fellows research: Model Spec Midtraining (MSM). Standard alignment methods train AIs on examples of desired behavior. But this can fail to generalize to new situations. MSM addresses this by first teaching AIs how we would like them to generalize and why.

Rohan Paul@rohanpaul_ai · May 6 · 58

MIT just built an AI that can control your body. It can move your fingers, make you play piano, even if you don’t know the song! AI decides the hand movement. Wrist pads send signals to your muscles, so your fingers move even if you don’t know how.

AK@_akhaliq · May 6 · 65

ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
paper: https://huggingface.co/papers/2405.13729

Anthropic@AnthropicAI · May 6 · 68

As AI takes on work humans can't fully check, a capable model could deliberately hold back—and we'd never know. New Anthropic Fellows research finds that such a model can be trained to near-full capability using a weaker model as supervisor. Read more:

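Summary: When AI takes on work that humans cannot fully check, a capable model could strategically hide its abilities without being noticed. Researchers from Anthropic, MATS, and Redwood find that a model doing this can be trained back to near-full capability using only a weaker model as supervisor, so that it stops sandbagging. The result suggests that weak supervision can effectively suppress strategic capability withholding in stronger models.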

AK@_akhaliq · May 6 · 60

MolmoAct2: Action Reasoning Models for Real-world Deployment
paper: https://huggingface.co/papers/2605.02881

AK@_akhaliq · May 6 · 68

From Context to Skills: Can Language Models Learn from Context Skillfully?
paper: https://huggingface.co/papers/2604.27660

AK@_akhaliq · May 6 · 61

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
paper: https://huggingface.co/papers/2605.00814

elvis@omarsar0 · May 5 · 64

// HeavySkill // One of the cleaner takes on agentic harness design I've read. They argue that what actually drives agent harness performance is not the orchestration code. It's a single inner skill: parallel reasoning followed by deliberation. If you can internalize that into the model, most of the scaffolding becomes optional. The paper systematizes this as a two-stage pipeline you can run beneath any harness, then trains it as a learnable skill via RLVR. The numbers:
> GPT-OSS-20B jumps from 69.7% (M@K) to 85.5% (HM@4) on LiveCodeBench under the heavy-thinking variant.
> R1-Distill-Qwen-32B nearly doubles on IFEval, from 35.7% to 69.3%.
> Several models reach Pass@N-level performance with HeavySkill.
Harness wins start to look like model wins once you can train them in. If parallel-reasoning-plus-deliberation really is the inner skill, the long arc is models that come with it baked in, not orchestration glue around them. Paper: https://arxiv.org/abs/2605.02396 Learn to build effective AI agents in our academy: https://academy.dair.ai/

Summary: The paper argues that what drives agent performance is not the external orchestration framework but one inner skill: parallel reasoning followed by deliberation. It systematizes this as a two-stage pipeline and trains it into the model as a learnable skill via RLVR (reinforcement learning with verifiable rewards). Reported gains: GPT-OSS-20B rises from 69.7% to 85.5% on LiveCodeBench, and R1-Distill-Qwen-32B rises from 35.7% to 69.3% on IFEval. Once such a skill can be internalized, harness gains become model gains, and the long-term direction is models that ship with it built in.
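
The two-stage shape described above can be sketched in a few lines. This is a hedged illustration: `parallel_then_deliberate`, `solve`, and `majority_vote` are hypothetical names, the attempts run sequentially rather than in parallel, and majority voting stands in for whatever deliberation step the paper actually trains.

```python
from collections import Counter
from typing import Callable, List


def parallel_then_deliberate(
    solve: Callable[[str, int], str],
    deliberate: Callable[[str, List[str]], str],
    question: str,
    k: int = 4,
) -> str:
    # Stage 1: k independent reasoning attempts (sequential here for simplicity).
    candidates = [solve(question, seed) for seed in range(k)]
    # Stage 2: one deliberation step reads every candidate and commits to an answer.
    return deliberate(question, candidates)


def majority_vote(question: str, candidates: List[str]) -> str:
    # Simplest possible deliberator (assumption): return the most frequent answer.
    return Counter(candidates).most_common(1)[0][0]
```

In a real harness `solve` would be a sampled model call and `deliberate` a second model call that sees all candidates; the point of the post is that this second stage can be trained into the model instead of living in orchestration code.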

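elvis@omarsar0 · May 5 · 62

Neat study on long-horizon agent generalization.

Summary: A Microsoft Research team finds that the core bottleneck causing AI agents to fail on long-horizon tasks is the length of the task horizon, not model capacity. As the goal gets further away, the combinatorial explosion of the exploration space and increasingly ambiguous credit assignment cause the model to break down. The remedy is not more compute but horizon reduction: re-parameterizing the action space with macro-actions that compress several low-level decisions into one high-level action. This stabilizes training immediately, and a model trained on the reduced horizon still generalizes to the longer original horizon at inference time (horizon generalization). The finding challenges the common view that long-horizon failures are simply a matter of model capability.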
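
A toy sketch of the horizon-reduction idea: grouping low-level actions into macro-actions cuts the number of decisions from T to ceil(T / k). The fixed-size chunking and the function names are assumptions made for illustration; the summary does not say how the paper constructs its macro-actions.

```python
from typing import List, Sequence, Tuple


def to_macro_actions(actions: Sequence[str], k: int) -> List[Tuple[str, ...]]:
    """Group every k consecutive low-level actions into one macro-action."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [tuple(actions[i:i + k]) for i in range(0, len(actions), k)]


def expand(macros: Sequence[Tuple[str, ...]]) -> List[str]:
    """Unroll macro-actions back into the original low-level sequence."""
    return [action for macro in macros for action in macro]
```

With seven low-level moves and k = 3, the policy would make three decisions instead of seven, and `expand` recovers the original sequence for execution.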

Rohan Paul@rohanpaul_ai · May 5 · 62

"Can LLM agents explore codebases and reason about code semantics without executing the code?" Meta discovered that if you force an LLM to show its reasoning step by step with proof, its code patch error rate drops by nearly 50%. The finding is not that models suddenly became deeper thinkers. It is that many code errors come from premature recognition: the model sees a familiar name, such as format, and quietly substitutes the usual meaning before checking the project’s actual files. If you just ask a standard LLM to check the code without running it, the model usually just glances at the function names and makes a confident guess. The paper talks about how when asked to compare 2 different code fixes, the standard AI saw a common word and assumed it meant the normal system tool. Because it skipped reading the actual files, the AI completely missed that this specific project had created its own custom tool with the exact same name. Meta solves this by using a mandatory checklist template that prevents the model from skipping ahead. The model must explicitly write down what the code modifies, trace the exact execution path, and prove its conclusion with specific evidence. This simple change forces the AI to actually read the local files and follow the real logic instead of relying on assumptions. This method pushed accuracy to 93% on real code patches without needing any expensive new training or complex systems. Overall, it shows that a basic structured prompt can give you highly reliable code verification without the massive computational cost of actually running the software tests. ---- Paper Link – arxiv.org/abs/2603.01896 Paper Title: "Agentic Code Reasoning"

Summary: Meta finds that forcing an LLM to follow a checklist and show its reasoning step by step, with proof, when analyzing code cuts its code patch error rate by nearly 50%. Many errors come from premature recognition: the model sees a familiar name such as format and applies the usual meaning instead of checking the project's actual files, so it guesses confidently rather than analyzing. Requiring the model to write down what the code modifies, trace the execution path, and support its conclusion with specific evidence forces it to read the local files and follow the real logic, raising accuracy to 93%. No new training or complex system is needed; a basic structured prompt gives reliable code verification without the compute cost of running the tests.
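
As a rough illustration of a mandatory checklist prompt, the sketch below assembles the three steps named in the post into a fixed template. The wording of `CHECKLIST` and the `build_prompt` helper are hypothetical; the paper's actual template is not reproduced here.

```python
# Hypothetical checklist wording, paraphrasing the three steps named in the post.
CHECKLIST = (
    "1. State exactly what each patch modifies (files, functions, lines).",
    "2. Trace the execution path through the project's own definitions.",
    "3. Give the conclusion, citing specific evidence for every claim.",
)


def build_prompt(patch_a: str, patch_b: str) -> str:
    """Build a comparison prompt that forces the steps to be answered in order."""
    steps = "\n".join(CHECKLIST)
    return (
        "Compare the two patches without executing any code.\n"
        "Complete every step in order; do not skip ahead.\n\n"
        f"{steps}\n\n"
        f"Patch A:\n{patch_a}\n\n"
        f"Patch B:\n{patch_b}\n"
    )
```

The design choice is that the structure lives in the prompt, not in extra training: the same model is simply not allowed to jump to a verdict before writing down what it read.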

Rohan Paul@rohanpaul_ai · May 5 · 52

This Google DeepMind paper trains LLMs to learn during conversation, and it shows they get much better at using feedback. The problem is that most LLMs treat a chat like a series of separate turns, so even when a user corrects them, they often do not really use that new information and they also fail to ask for missing details. The paper fixes this by turning a normal task into a teacher student dialogue, where the student model tries an answer, a teacher with hidden extra information gives guidance, and the student is trained to use that guidance to reach the right answer. The authors test 2 training styles, offline filtering and online reinforcement learning, and they report that the online version works better, with training on short 4 turn chats still helping on longer 10 turn chats later. They also show that this skill carries from math to coding and helps on messy underspecified tasks where the full problem arrives bit by bit instead of all at once. A second step called Q-priming teaches the model to ask useful questions, and on ambiguous tasks it becomes over 5x more likely to ask for clarification instead of making an early wrong guess, which matters because it makes chat feel more like working with someone who can actually learn during the conversation. ---- Paper Link – arxiv.org/abs/2602.16488 Paper Title: "Learning to Learn from Language Feedback with Social Meta-Learning"

Summary: The Google DeepMind paper trains LLMs with a teacher-student dialogue setup so that they make real use of feedback during a conversation. Conventional LLMs treat a chat as separate turns and struggle to integrate corrections. Here a student model attempts an answer, a teacher with hidden extra information gives guidance, and the student is trained to use that guidance to reach the right answer. Online reinforcement learning works better than offline filtering, and the skill learned on short chats transfers to longer ones. It also carries from math to coding and helps on underspecified tasks where information arrives bit by bit. A further step, Q-priming, makes the model over 5x more likely to ask for clarification on ambiguous tasks.

AK@_akhaliq · May 5 · 68

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
paper: https://huggingface.co/papers/2605.00658

AK@_akhaliq · May 5 · 55

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction
paper: https://huggingface.co/papers/2604.27221

Microsoft Research@MSFTResearch · May 5 · 62

Research Focus: AI agents leaking enterprise data, a smarter OS for cloud deployment, and new research on how to actually structure AI use at work. https://msft.it/6016vKxQm

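elvis@omarsar0 · May 4 · 66

Autodata (from Meta) is an agentic data scientist that builds high-quality training and evaluation data autonomously. Great work on the autoharness track. (bookmark it)

Summary: Autodata, from Meta FAIR, is an agentic system that autonomously builds high-quality training and evaluation data. At its core is an agentic self-instruct loop: an orchestrator LLM directs a challenger agent to generate questions from domain documents, a weak solver and a strong solver attempt them, a judge scores the answers, and the failures are analyzed to refine the next round, yielding challenging data that separates models by capability. On a CS research QA task the method produces a 34-percentage-point performance gap, against 1.9 points for the standard method. The system also meta-optimizes: an outer loop adjusts the instructions and raises the validation pass rate from 12.8% to 42.4%. The study processed more than 10,000 papers and produced 2,117 high-quality QA pairs, using extra inference compute to make the data harder and thereby improve downstream models.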

elvis@omarsar0 · May 4 · 68

NEW paper from Sakana AI (ICLR 2026). A 7B Conductor model just hit SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. (great paper! bookmark it!) The Conductor is trained with RL to do two things at once: design communication topologies between worker agents (open or closed source), and prompt-engineer focused instructions to each worker so it leverages their individual strengths. It's like training a special agent to take care of both collaboration and communication. Trained against randomized agent pools, it adapts to arbitrary mixes of agents at inference time. Even more interesting: when allowed to pick itself as a worker, it forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. The gains over the best individual worker on AIME25 and GPQA-D land in the ~3% range, which the authors note is consistent with entire generational improvements between frontier model versions, except this one comes from coordination, not pretraining. Why it matters? We can start to think of the orchestrator as the model now. Routing decisions aren't just a wrapper, they're a learnable policy. Paper: https://arxiv.org/abs/2512.04388 Learn to build effective AI agents in our academy: https://academy.dair.ai/

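Summary: Sakana AI's paper, published at ICLR 2026, introduces a Conductor model with only 7 billion parameters. Rather than solving problems itself, it is trained with reinforcement learning to design communication topologies among worker agents, which may be open or closed source, and to write focused instructions for each worker that play to its strengths. Trained against randomized agent pools, it adapts to arbitrary mixes of agents at inference time. When the Conductor is allowed to pick itself as a worker, the system forms recursive topologies, which enables dynamic test-time scaling. The model reaches SOTA on GPQA-Diamond and LiveCodeBench, and on AIME25 and GPQA-D it improves on the best individual worker by about 3%, comparable to a generational improvement between frontier model versions, with the gain coming from coordination.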

Rohan Paul@rohanpaul_ai · May 4 · 48

This paper proposes a smarter way for LLMs to reason by splitting work across agents that share one workspace. The problem is that even strong reasoning models still break on harder multi-step tasks because they do not carry out logic reliably all the way through. The system, called BIGMAS, builds a small graph of specialist agents for each problem, rather than using one fixed chain every time. Every agent reads and writes through a shared workspace, while a separate controller sees the whole state and picks the next useful step. The authors tested it on 3 puzzle tasks across 6 frontier models, covering arithmetic expression search and multi-step planning. It improved results on every model and task, with examples like 12% to 30% on Six Fives and 57% to 93% on Tower of London. What matters is that the paper shows reasoning can improve from better system structure, not only from making a single model think longer. ---- Paper Link – arxiv.org/abs/2603.15371 Paper Title: "Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning"

Summary: The paper proposes BIGMAS, which builds a small, problem-specific graph of specialist agents to make LLM reasoning more reliable on hard multi-step tasks. Agents read and write through a shared workspace, while a separate controller monitors the global state and chooses the next step. Tested on 3 puzzle tasks covering arithmetic expression search and multi-step planning across 6 frontier models, it improved results on every model and task, for example from 12% to 30% on Six Fives and from 57% to 93% on Tower of London. The takeaway is that reasoning can improve through better multi-agent system structure, not only by having a single model think longer.
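
The shared-workspace-plus-controller pattern can be sketched as a small loop. Everything here is a simplified assumption for illustration: the `run` function, the dictionary workspace, and the idea that the controller returns an agent name or `None` are hypothetical and not taken from the BIGMAS paper.

```python
from typing import Callable, Dict, Optional

Workspace = Dict[str, int]
Agent = Callable[[Workspace], None]


def run(
    agents: Dict[str, Agent],
    controller: Callable[[Workspace], Optional[str]],
    workspace: Workspace,
    max_steps: int = 10,
) -> Workspace:
    for _ in range(max_steps):
        name = controller(workspace)   # the controller sees the whole state
        if name is None:               # None means the controller considers the task done
            break
        agents[name](workspace)        # the chosen agent reads and writes in place
    return workspace
```

Agents never talk to each other directly in this sketch; all coordination goes through the workspace, and the controller is the only component that decides who acts next.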

Rohan Paul@rohanpaul_ai · May 3 · 56

Adobe paper tests if LLM chain of thought explanations show real influences, and finds they often hide them. LLMs often notice hidden hints in a question but leave them out of their step by step explanation. People use step by step reasoning as a safety and debugging tool, assuming the explanation matches what drove the answer. The author plants misleading hints inside multiple choice questions and checks both the final answer and what the model says. Across 9,154 trials and 11 major LLMs, the study tried 3 instructions, normal, "you are monitored", and "report hints". Under normal use, models mentioned hints only 20.7% of the time, yet 99.4% admitted seeing them when asked later. Telling models they were monitored barely changed this, and forcing hint reporting caused 68.2% false alarms when no hints existed. User pleasing hints, where the question says a confident user believes an option, were followed 45.5% but often unmentioned, making monitoring that only reads explanations unreliable. ---- Paper Link – arxiv.org/abs/2601.00830 Paper Title: "Can They Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning"

Summary: The Adobe study tests whether LLM chain-of-thought explanations reflect what actually influenced the answer. It plants misleading hints in multiple-choice questions and runs 9,154 trials across 11 major models. Under normal use, models mentioned the hidden hint in only 20.7% of their step-by-step explanations, yet 99.4% admitted noticing it when asked afterward. Telling models they were being monitored barely changed this, and forcing them to report hints produced 68.2% false alarms when no hint was present. With user-pleasing hints, models followed the suggested option 45.5% of the time, often without saying so. Chain-of-thought explanations therefore often do not match the real basis for a decision, and relying on them alone as a safety or debugging tool may be unreliable.

elvis@omarsar0 · May 3 · 57

Claude Opus 4.7 just implemented an AlphaZero-style self-play pipeline from scratch. It did this on consumer hardware in three hours, then beat the Pascal Pons solver 7 of 8 as first-mover on Connect Four. No other frontier coding agent tested cleared 2 of 8. This paper proposes a new way to evaluate coding agents: hand them a minimal task description, give them a tight budget, and ask them to autonomously rebuild a famous ML breakthrough. Connect Four + AlphaZero is the first instance. It's small enough to run on a laptop and hard enough to require a real research engineering loop (MCTS, neural value/policy nets, self-play, training schedule). We've been measuring coding agents on patches and unit tests. This shifts the bar to "can the agent build a non-trivial ML system end-to-end on its own?" The answer is now yes for at least one frontier model. Paper: https://arxiv.org/abs/2604.25067 Learn to build effective AI agents in our academy: https://academy.dair.ai/

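Summary: The paper proposes a new way to evaluate coding agents: give them a minimal task description and a tight budget, and ask them to autonomously rebuild a famous machine learning breakthrough. The first instance is AlphaZero for Connect Four, small enough to run on a laptop but hard enough to require a full research engineering loop. Claude Opus 4.7 built a self-play training pipeline from scratch in three hours and, playing first, beat the Pascal Pons solver in 7 of 8 games; no other frontier agent tested cleared 2 of 8. The bar for evaluation moves from patches and unit tests to building a non-trivial ML system end to end.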

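Chubby♨️@kimmonismus · May 3 · 48

GPT-5.4 Pro didn’t just solve one math problem, it kicked open the door: its proof method now cracks a 60-year-old Erdős conjecture, making this one of the first times an AI proof actually leads somewhere. We barely started.

Summary: GPT-5.4 Pro did not just solve one math problem; its proof method also cracks a 60-year-old Erdős conjecture. Building on it, the research team refined and applied the method to prove several further problems, including another 60-year-old conjecture posed by Erdős, Sárközy, and Szemerédi. This is one of the first times an AI-generated proof has shown significant downstream impact: its value lies not only in solving the original problem but in opening new paths for mathematical research. The results have been presented at a "Future of Mathematics" symposium.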

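向阳乔木@vista8 · May 2 · 49

The top paper on HuggingFace this week: RecursiveMAS (Recursive Multi-Agent Systems). Having several AI models work as a team is now the mainstream approach: model A finishes thinking and hands off to model B, model B hands off to model C, like a relay. But what gets handed over is text. At every handoff the internal computation has to be "translated" into tokens, and the next model has to "read" them again and translate once more. The more rounds, the more wasted overhead, and it also interferes with propagating the learning signal back. What RecursiveMAS does: agents do not pass text; they pass the model's internal numerical vectors directly. This forms a recursive closed loop that refines the result iteratively, and only the final round outputs a text answer. The connecting module is extremely lightweight; the underlying models stay frozen throughout, and only the small "messenger" module in between is trained. On AIME competition math problems it beats the strongest baseline by 13-18 percentage points. Inference is 2.4× faster, token usage is 75% lower, and training costs less than LoRA. The advantage grows with the number of recursion rounds. The paper link is in the comments; worth translating if you have time.

Summary: RecursiveMAS proposes a recursive multi-agent system in which agents pass the model's internal numerical vectors directly instead of text tokens, forming a recursive closed loop for iterative refinement, with text output only in the final round. The connecting module is lightweight, the underlying model parameters are frozen, and only the intermediate transfer module is trained. On AIME math problems it beats baselines by 13-18 percentage points, with 2.4× faster inference, 75% lower token usage, and a training cost below LoRA. The advantage grows as recursion rounds increase.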

Hao AI Lab@haoailab · May 2 · 37

Excited to share our recent work accepted to ICML 2026! These projects span efficient causal parallel decoders, diffusion LLMs, video sparse attention, video QAT, online speculative decoding, and agentic document reasoning. Huge thanks to all collaborators and co-authors across these efforts. Looking forward to seeing everyone in Seoul this summer! 🇰🇷

elvis@omarsar0 · May 2 · 57

// Recursive Multi-Agent Systems // Great read for the weekend. (bookmark it) Multi-agent systems often pass full text messages between agents at every step. This leads to token bloat, latency, and context dilution which all grow with the number of agents. RecursiveMAS asks a different question: what if agents collaborated through recursive computation in a shared latent space, instead of through text? A multi-agent system can be treated as a recursive computation, where each agent acts like an RLM layer, iteratively passing latent representations to the next and forming a looped interaction process. They introduce a RecursiveLink module that generates latent thoughts and transfers state directly between heterogeneous agents, plus an inner-outer loop learning algorithm with shared gradient-based credit assignment across the team. Think of it as agents passing notes in their own internal language instead of rewriting everything in English each turn. Less talking, more thinking. The numbers are strong. Across 9 benchmarks spanning math, science, medicine, search, and code generation: 8.3% average accuracy gain over baselines, 1.2×–2.4× end-to-end inference speedup, and 34.6%–75.6% reduction in token usage. Why does it matter? If agent-to-agent communication is the next real bottleneck (and it is), latent-space recursion is one of the cleaner ways to scale collaboration without paying a token tax for every coordination step. Paper: https://arxiv.org/abs/2604.25917 Learn to build effective AI agents in our academy: https://academy.dair.ai/

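Summary: Conventional multi-agent systems rely on passing text messages, which causes token bloat, latency, and context dilution. RecursiveMAS treats a multi-agent system as a recursive computation in which agents collaborate by iteratively passing latent representations through a shared latent space instead of full text. Its RecursiveLink module generates latent thoughts and transfers state directly between heterogeneous agents, and an inner-outer loop learning algorithm provides shared gradient-based credit assignment across the team. Across 9 benchmarks spanning math, science, medicine, search, and code generation, it reports an 8.3% average accuracy gain, a 1.2× to 2.4× inference speedup, and a 34.6% to 75.6% reduction in token usage.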

AK@_akhaliq · May 2 · 56

Heterogeneous Scientific Foundation Model Collaboration
paper: https://huggingface.co/papers/2604.27351

AK@_akhaliq · May 2 · 57

The Last Human-Written Paper: Agent-Native Research Artifacts
paper: https://huggingface.co/papers/2604.24658

AK@_akhaliq · May 2 · 35

Co-Evolving Policy Distillation
paper: https://huggingface.co/papers/2604.27083

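向阳乔木@vista8 · May 1 · 51

A few interesting insights from the paper:
1. The competition is now about data quality: a small amount of expert-quality data in the final training stage directly shapes how users perceive an AI's image generation ability.
2. Mixing even a small amount of AI-generated images into the training data severely hurts image quality and later potential.
3. Distillation is mandatory: designing an architecture without considering distillation-friendliness amounts to training a model that cannot be deployed commercially.
4. The core gap between open-source and closed-source image generation is not the renderer but the system architecture around it.

Summary: A survey paper on AI image generation in 2026 highlights several points. Data quality is central: a small amount of high-quality expert data in the final training stage directly determines how users perceive the model. Even a small share of AI-generated images in the training data seriously damages output quality and model potential. Distillation is a prerequisite for commercial deployment, so an architecture designed without distillation in mind will be impractical. The core gap between open and closed image models lies in the overall system design around the renderer, not in the renderer itself.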

elvis@omarsar0 · May 1 · 56

Cool paper from Meta FAIR. It's on self-improving LLMs but on the pretraining side. (bookmark it) Most LLM safety, factuality, and reasoning fixes get bolted on at post-training. By then, the patterns have already set. This work moves those behaviors into pretraining itself. The team uses a strong post-trained model as both a rewriter and a judge: it rewrites pretraining suffixes toward higher-quality, safer continuations, then scores model rollouts against the original suffix and the rewrite to drive RL during pretraining. Instead of next-token prediction, the policy learns sequence generation from the start, with rewards for quality, safety, and factuality. Why it matters: 36.2% relative gain in factuality, 18.5% in safety, and up to 86.3% win rate in generation quality over standard pretraining. Bottom line: the post-trained models you already have can be used to pretrain the next ones better. Paper: https://arxiv.org/abs/2601.21343 Learn to build effective AI agents in our academy: https://academy.dair.ai/

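Summary: The Meta FAIR work moves LLM improvement from post-training into pretraining. A strong post-trained model acts as both rewriter and judge: it rewrites pretraining suffixes into higher-quality, safer continuations, and its scores drive reinforcement learning during pretraining. The policy learns sequence generation from the start, with rewards for quality, safety, and factuality. Compared with standard pretraining, the method reports a 36.2% relative gain in factuality, 18.5% in safety, and up to an 86.3% win rate in generation quality. The conclusion is that existing post-trained models can be used to pretrain better successors.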

Ethan Mollick@emollick · May 1 · 62

New paper (on an old AI) tests o1 against doctors on medical benchmarks & real ER cases: “across a variety of scenarios and applications, the large language model outperformed both human physicians and older models” The potential suggests an “urgent need for prospective trials.”

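向阳乔木@vista8 · May 1 · 48

Language models can talk but do not understand data; specialized models understand data but cannot talk. This is one of the current dilemmas of scientific AI. A new UIUC paper, Eywa, finds its answer in Avatar. The Na'vi use the "Tsaheylu" neural bond to cross the species barrier, letting mountain songbirds and thunder beasts each play to their strengths. Eywa does the same thing: it builds an interface between the language model and specialized foundation models. Chronos handles time-series forecasting, TabPFN handles tables, and the language model understands the task, dispatches tools, and integrates the results. --- Judging by the paper's numbers, the results are good. In the short term a single MCP is enough to solve the connection problem, but in the long term it is unclear whether language models can reach the level of specialized models. The paper is in the comments.

Summary: To address the dilemma in scientific AI that general language models handle interaction but not data while specialized models handle data but not interaction, a UIUC team, inspired by the "Tsaheylu" neural bond in Avatar, proposes the Eywa interface framework. The language model interprets instructions and schedules work, calling specialized models such as Chronos and TabPFN to process data. Initial results are good; the long-term question is whether language models can match specialized models in their domains.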

Rohan Paul@rohanpaul_ai · May 1 · 46

Research proves that current AI agent groups cannot reliably coordinate or agree on simple decisions. Building teams of AI agents that can consistently agree on a final decision is surprisingly difficult for LLMs. The problem is that developers frequently assume that if you have enough AI agents working together, they will eventually figure out how to solve a problem by talking it through. This paper shows that this assumption is currently wrong. Even in a friendly environment where every agent is trying to help, the team often gets stuck or stops responding entirely. Because this happens more often as the group gets bigger, it means we cannot yet trust these agent systems to handle tasks where they must agree on a correct answer. ---- Paper Link – arxiv.org/abs/2603.01213 Paper Title: "Can AI Agents Agree?"

Summary: The study shows that teams of LLM-based agents have fundamental difficulty coordinating to reach a final decision. Developers often assume that adding more agents and letting them talk it through will solve a problem, but the paper shows this assumption is currently wrong. Even in a cooperative setting the team often gets stuck or stops responding entirely, and the problem worsens as the team grows. Current agent systems therefore cannot yet be trusted with tasks that require agreeing on a correct answer.


May 8

01:11 · Anthropic@AnthropicAI · 78
Natural Language Autoencoders: training Claude to translate its activations into human-readable text.
Tags: Anthropic, Safety/Alignment, Papers/Research

01:06 · elvis@omarsar0 · 63
Coordination as a separate, configurable architectural layer in multi-agent LLM systems.
DAIR.AI: Pay attention to this one if you build multi-agent systems. Coordination is as important as prompts or agent architectur...
Tags: Agents, arXiv, Papers/Research, Deployment/Engineering

00:42 · Z.ai@Zai_org · Featured · 73
GLM-5V-Turbo Tech Report: Toward a Native Foundation Model for Multimodal Agents. http://arxiv.org/abs/2604.26752
Tags: Agents, Multimodal, Papers/Research
Why recommended: Zhipu ties multimodality, RL, and the agent toolchain into one package. The report is directly useful for anyone building multimodal agents, with engineering detail rather than only leaderboard results.

May 7

23:04 · AK@_akhaliq · 62
RLDX-1 Technical Report
Tags: Hugging Face, Papers/Research

23:04 · AK@_akhaliq · 58
Stream-R1: Reliability-Perplexity Aware Reward Distillation for Streaming Video Generation
Tags: Hugging Face, Multimodal, Video, Papers/Research

23:04 · AK@_akhaliq · 67
PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World
Tags: Embodied AI, Multimodal, Papers/Research

04:34 · Rohan Paul@rohanpaul_ai · 48
OpenClaw-RL: Continuously Training Language Models Through Everyday Conversation
Tags: Agents, arXiv, Data/Training, Papers/Research

00:33 · AK@_akhaliq · 46
SVGS: Enhancing Gaussian Splatting Using Primitives with Spatially Varying Colors
Tags: Image Generation, Papers/Research

May 6

05:29 · elvis@omarsar0 · 64
Skills Should Be Verifiable Deployment Artifacts
Tags: Agents, arXiv, Safety/Alignment, Papers/Research

04:33 · Anthropic@AnthropicAI · 63
Model Spec Midtraining (MSM): teaching AIs how and why to generalize before training on examples.
Tags: Anthropic, Safety/Alignment, Papers/Research

04:28 · Rohan Paul@rohanpaul_ai · 58
MIT system in which an AI decides hand movements and wrist pads stimulate the muscles.
Tags: Embodied AI, Papers/Research

03:57 · AK@_akhaliq · 65
ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models
Tags: Image Generation, Papers/Research

02:01 · Anthropic@AnthropicAI · Featured · 68
Training a sandbagging model back to near-full capability with a weaker supervisor.
Emil Ryd: New paper from MATS, Redwood, and Anthropic! If a capable model is strategically sandbagging, can we train it to stop wh...
Tags: Anthropic, Safety/Alignment, Papers/Research
Why recommended: This Anthropic paper brings the hidden safety risk of models deliberately concealing their capabilities into the open and shows that a weaker model can supervise a stronger one. Worth a close read for anyone working on alignment; the direction matters.

01:27 · AK@_akhaliq · 60
MolmoAct2: Action Reasoning Models for Real-world Deployment
Tags: Agents, Reasoning, Papers/Research

01:27 · AK@_akhaliq · 68
From Context to Skills: Can Language Models Learn from Context Skillfully?
Tags: arXiv, Reasoning, Papers/Research

01:27 · AK@_akhaliq · 61
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs
Tags: Hugging Face, Multimodal, Papers/Research

5月5日

23:25

elvis@omarsar0

64

智能体性能核心：将并行推理与审议内化为可训练技能

研究指出，驱动智能体性能的关键并非外部编排框架，而是一项核心内在技能：并行推理后进行审议。该研究将这一过程系统化为一个两阶段流程，并通过强化学习与价值回归（RLVR）将其训练为可学习的模型内在能力。实验表明，该方法能显著提升模型性能：例如，GPT-OSS-20B在LiveCodeBench上的成绩从69.7%提升至85.5%；R1-Distill-Qwen-32B在IFEval上的表现从35.7%大幅提升至69.3%。这证明，当此类核心技能能被内化至模型中时，框架优势将转化为模型自身优势，长远来看，模型应原生具备此类能力。

智能体推理论文/研究

23:25

elvis@omarsar0

62

微软研究团队发现，导致AI智能体在长视野任务中失败的核心瓶颈是任务视野长度，而非模型容量。随着目标距离增加，探索空间组合爆炸与信用分配模糊化使模型失效。解决之道并非增加算力，而是通过"视野缩减"：利用宏动作重新参数化动作空间，将多个低级决策压缩为一个高级动作。该方法能立即稳定训练，并使模型在训练时使用缩减视野，在推理时却能泛化到更长的原始视野，实现"视野泛化"。这一发现挑战了将长视野问题简单归因于模型能力的普遍观点。

DAIR.AI: NEW paper from Microsoft Research. Nice study on long-horizon agent generalization. (bookmark it) The team runs a study ...

智能体 Microsoft 论文/研究

20:18

Rohan Paul@rohanpaul_ai

62

结构化提示如何让大语言模型更准确地理解代码语义

Meta研究发现，强制大语言模型（LLM）在分析代码时遵循检查清单、逐步展示推理证明，能将其代码补丁错误率降低近50%。常见错误源于模型过早识别熟悉名称（如“format”）并直接套用通用含义，而非实际检查项目文件，导致其依赖自信猜测而非深入分析。通过要求模型明确写出修改内容、追踪执行路径并用具体证据证明结论，这一方法迫使其实际阅读本地文件、遵循真实逻辑，从而将准确率提升至93%。该方法无需昂贵的新训练或复杂系统，仅通过基本的结构化提示即可实现高可靠性的代码验证，节省了运行软件测试的巨大计算成本。

Meta · Reasoning · Coding · Papers/Research

08:48

Rohan Paul@rohanpaul_ai

52

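New DeepMind research teaches LLMs to learn during conversation

Google DeepMind research trains large language models (LLMs) with a "teacher-student dialogue" framework so that they can learn from user feedback during a conversation. Conventional LLMs treat a conversation as independent turns and struggle to integrate corrections. Here, a "student" model attempts an answer, a "teacher" with access to extra information provides guidance, and the student is trained to use that guidance to reach the correct answer. Online reinforcement learning outperforms offline filtering, and skills learned in short conversations transfer to longer ones. The method generalizes from math tasks to coding tasks and handles ambiguous tasks where information arrives gradually. With a "Q-priming" step, the model becomes more than five times as likely to ask for clarification on ambiguous tasks, making the conversation feel more like working with a partner who learns in real time.

Agents · DeepMind · Reasoning · Papers/Research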

05:49

AK@_akhaliq

68

UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors. Paper: https://huggingface.co/papers/2605.00658

Hugging Face · Multimodal · Video · Papers/Research

05:49

AK@_akhaliq

55

Web2BigTable: A Bi-Level Multi-Agent LLM System for Internet-Scale Information Search and Extraction. Paper: https://huggingface.co/papers/2604.27221

Agents · Search · Papers/Research

01:25

Microsoft Research@MSFTResearch

62

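Research Focus: AI agents leaking enterprise data, a smarter operating system for cloud deployments, and new research on how AI applications actually get built at work. https://msft.it/6016vKxQm

Agents · Microsoft · Safety/Alignment · Papers/Research

May 4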

23:24

elvis@omarsar0

66

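Autodata, developed by Meta FAIR, is an agentic system that autonomously builds high-quality training and evaluation data. Its core is an "agentic self-instruct" loop: an orchestrator LLM directs a challenger agent to generate questions from domain documents, a weak solver and a strong solver attempt them, a judge scores the answers, and failures are analyzed to refine the next round. The result is challenging data that separates models by capability. On a CS research QA task, the method produced a 34 percentage point performance gap, compared with 1.9 points for the standard method. The system also supports meta-optimization: an outer loop that adjusts the instructions raised the validation pass rate from 12.8% to 42.4%. The study processed more than 10,000 papers and produced 2,117 high-quality QA pairs; spending more inference compute makes the data more challenging, which in turn improves downstream model performance.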
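One way to picture the "separates models by capability" criterion is as a filter over graded questions plus a gap metric. The sketch below assumes the reported gap is strong-solver accuracy minus weak-solver accuracy and reduces the judge's score to pass/fail; both are simplifications made here, and the field names and sample data are hypothetical.

```python
from typing import Dict, List


def discriminating(results: List[Dict]) -> List[Dict]:
    """Keep questions the strong solver passes and the weak solver fails."""
    return [r for r in results if r["strong_ok"] and not r["weak_ok"]]


def accuracy_gap(results: List[Dict]) -> float:
    """Strong-solver accuracy minus weak-solver accuracy, in percentage points."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    strong = sum(r["strong_ok"] for r in results) / n
    weak = sum(r["weak_ok"] for r in results) / n
    return 100.0 * (strong - weak)


# Hypothetical judge verdicts for four generated questions.
graded = [
    {"q": "Q1", "strong_ok": True, "weak_ok": True},    # too easy
    {"q": "Q2", "strong_ok": True, "weak_ok": False},   # discriminating
    {"q": "Q3", "strong_ok": False, "weak_ok": False},  # too hard or flawed
    {"q": "Q4", "strong_ok": True, "weak_ok": False},   # discriminating
]
kept = discriminating(graded)
gap = accuracy_gap(graded)
```

Questions that both solvers pass or both fail contribute nothing to the gap, which is why a refinement loop would push the challenger toward the middle band.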

DAIR.AI: Banger paper from Meta FAIR. They introduce Autodata, an agentic data scientist that builds high-quality training and ev...

Agents · Meta · Data/Training · Papers/Research

22:54

elvis@omarsar0

68

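Sakana AI proposes a new 7B "Conductor" model that achieves performance gains by coordinating multiple agents

In research published at ICLR 2026, Sakana AI proposes a "Conductor" model with only 7 billion parameters. The model does not solve problems directly. Instead, it is trained with reinforcement learning to design the communication topology for a pool of worker agents that mixes open-source and closed-source models, and to write precise instructions for each worker that play to its strengths. After training on randomized agent pools, it can adapt to arbitrary agent combinations at inference time. Its key innovation is that when the Conductor is allowed to select itself as a worker, the system forms a recursive topology that enables dynamic test-time scaling. The model reaches SOTA on GPQA-Diamond and LiveCodeBench, and on AIME25 and GPQA-D it outperforms the best single worker by about 3%, roughly one generation of frontier-model improvement, with the gain coming entirely from coordination.

Agents · arXiv · MCP/Tools · Reasoning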

04:42

Rohan Paul@rohanpaul_ai

48

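Improving LLM reasoning with a brain-inspired graph multi-agent system

This paper proposes BIGMAS, which builds a small graph of specialist agents tailored to each problem in order to make LLM reasoning more reliable on complex multi-step tasks. The agents collaborate by reading from and writing to a shared workspace, while a separate controller monitors the global state and plans the next step. In tests of 6 frontier models on 3 puzzle tasks covering arithmetic expression search and multi-step planning, the method improved performance significantly on every model and task; for example, accuracy rose from 12% to 30% on Six Fives and from 57% to 93% on Tower of London. This indicates that improving the structure of a multi-agent system, not just lengthening a single model's thinking, can effectively strengthen LLM reasoning.

Agents · arXiv · Reasoning · Papers/Research

May 3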

20:12

Rohan Paul@rohanpaul_ai

56

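"Can AI explanations be trusted? Evidence of systematic underreporting in chain-of-thought reasoning"

An Adobe study tests whether the chain-of-thought explanations of large language models (LLMs) reflect what actually influenced their decisions. It planted misleading hints in multiple-choice questions and ran 9,154 trials across 11 mainstream models. In normal use, models mentioned the hidden hint in only 20.7% of their step-by-step reasoning, yet 99.4% admitted noticing it when asked afterward. Telling models their reasoning was being monitored did not improve the results, and forcing them to report hints led to 68.2% false reports when no hint was present. When a question included a hint about the user's preference, models followed it 45.5% of the time but often did not say so in their explanations. The study shows that chain-of-thought explanations often do not match the real basis for a decision, so relying on them alone as a safety and debugging tool may be unreliable.

arXiv · Safety/Alignment · Reasoning · Papers/Research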

05:47

elvis@omarsar0

57

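Claude Opus 4.7 autonomously builds an AlphaZero-style training pipeline and beats an expert solver at Connect Four

This study proposes a new paradigm for evaluating coding agents: ask them to autonomously rebuild a famous machine learning breakthrough from only a brief task description, under a limited budget. The first test case is an AlphaZero system for Connect Four, which is small enough to run on a laptop but complex enough to require a full research-engineering loop. Claude Opus 4.7 built a self-play training pipeline from scratch within three hours and, playing first, beat the Pascal Pons solver 7:1, while none of the other frontier agents scored above 2/8. This marks a shift in the evaluation bar from code completion to building nontrivial machine learning systems end to end.

Agents · Anthropic · Coding · Papers/Research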

01:15

Chubby♨️@kimmonismus

48

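GPT-5.4 Pro did more than solve a single math problem: its proof method also cracked a 60-year-old Erdős conjecture. Building on it, the research team refined and applied the method to prove several additional problems, including another 60-year-old conjecture posed by Erdős, Sárközy, and Szemerédi. This marks the first time an AI-generated proof has shown significant "downstream impact": its core value lies not only in solving the problem itself but in opening new paths for mathematical research. The results were presented at a Future of Mathematics symposium.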

Jared Duker Lichtman: Update on Erdős Problem 1196: In joint work, we refined and adapted the proof method from GPT-5.4 Pro to give proofs of ...

OpenAI · Reasoning · Papers/Research

May 2

09:48

向阳乔木@vista8

49

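This week's top paper on Hugging Face: RecursiveMAS (Recursive Multi-Agent Systems)

RecursiveMAS proposes a recursive multi-agent system that rethinks conventional AI collaboration. Its core idea is to have agents pass the model's internal numeric vectors directly instead of inefficient text tokens, forming a recursive loop for iterative refinement in which only the final round outputs text. The connecting module is lightweight: the underlying model parameters are frozen and only the intermediate passing module is trained, which greatly improves efficiency. On the AIME math competition, it beats the baseline by 13-18%, with 2.4x faster inference, 75% fewer tokens, and a training cost lower than LoRA. Its efficiency advantage grows as the number of recursion rounds increases.

Agents · Reasoning · Papers/Research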

06:18

Hao AI Lab@haoailab

37

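Excited to share our recent work accepted to ICML 2026! The projects cover efficient causal parallel decoders, diffusion LLMs, video sparse attention, video quantization-aware training, online speculative decoding, and agentic document reasoning. Sincere thanks to all collaborators and co-authors for their contributions. Looking forward to seeing everyone in Seoul this summer! 🇰🇷

Agents · Video · Papers/Research · Deployment/Engineering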

01:16

elvis@omarsar0

57

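Recursive multi-agent systems: a new paradigm for latent-space collaboration

Conventional multi-agent systems rely on passing text messages, which causes token bloat, latency, and context dilution. RecursiveMAS proposes a new paradigm: treat the multi-agent system as a recursive computation in which agents collaborate in a shared latent space by recursively passing latent representations instead of full text. At its core is the RecursiveLink module, which generates and passes latent states directly between heterogeneous agents, combined with inner-outer loop learning and gradient-based team credit assignment. It is as if the agents pass notes in an internal language: "talk less, think more." Across 9 benchmarks spanning math, science, and medicine, the method improves average accuracy by 8.3%, speeds up inference by 1.2-2.4x, and cuts token usage by 34.6%-75.6%, offering an efficient and scalable way past the inter-agent communication bottleneck.

Agents · Reasoning · Papers/Research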

01:16

AK@_akhaliq

56

Heterogeneous Scientific Foundation Model Collaboration. Paper: https://huggingface.co/papers/2604.27351

Hugging Face · Multimodal · Papers/Research

01:16

AK@_akhaliq

57

The Last Human-Written Paper: Agent-Native Research Artifacts. Paper: https://huggingface.co/papers/2604.24658

Agents · arXiv · Papers/Research

01:16

AK@_akhaliq

35

Co-Evolving Policy Distillation. Paper: https://huggingface.co/papers/2604.27083

Data/Training · Papers/Research

May 1

22:17

向阳乔木@vista8

51

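Four insights on AI image generation: data quality, AI contamination, distillation, and the architecture gap

A survey paper on AI image generation in 2026 highlights several key insights. Data quality is central: a small amount of high-quality expert data in the final training stage directly determines how users perceive the model's ability. Even a small amount of AI-generated imagery mixed into the training data severely damages generation quality and model potential. On the technical side, distillation is a must for commercial deployment, and architectures designed without regard for distillation-friendliness end up impractical. Finally, the core gap between open-source and closed-source image models lies not in the renderer itself but in the overall system architecture around it.

向阳乔木: Read a superb survey paper on AI image generation today. After reading it you will have a full picture of the latest image generation techniques of 2026. Fantastic! It also walks you through how the field developed over the past few years. The AI-generated walkthrough is below; the original paper is in the comments. https://blog.qiaomu.ai/ai-image-paper-2...

Image generation · Papers/Research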

22:16

elvis@omarsar0

56

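Meta FAIR research: a new paradigm for self-improving LLMs at the pretraining stage

Meta FAIR research proposes a new paradigm that moves LLM improvement from post-training into the pretraining stage. The method uses a strong post-trained model as both rewriter and judge: it rewrites the suffixes of pretraining data for higher quality and safety, and the pretraining model is optimized directly with reinforcement learning. From the start, the model learns to generate sequences and is rewarded for quality, safety, and factuality. Compared with standard pretraining, the method achieves a 36.2% relative improvement in factuality, an 18.5% improvement in safety, and a generation-quality win rate of up to 86.3%. The core conclusion is that existing post-trained models can be used to pretrain better next-generation models.

Meta · Safety/Alignment · Papers/Research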

21:17

Ethan Mollick@emollick

62

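New paper (on older AI) compares o1 with doctors on medical benchmarks and real emergency room cases: "Across a range of scenarios and applications, the large language model outperformed human physicians and older models." That potential suggests "an urgent need for prospective trials."

OpenAI · Papers/Research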

20:17

向阳乔木@vista8

48

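Inspired by Avatar, UIUC proposes the Eywa framework, connecting language models with specialized models to resolve the scientific AI dilemma

Scientific AI faces a dilemma: general-purpose language models are good at interaction but do not understand the data, while specialized models master the data but cannot interact. Inspired by the "Tsaheylu" neural bond in Avatar, a UIUC team proposes the Eywa interface framework. The language model interprets instructions and handles scheduling, and calls specialized models such as Chronos and TabPFN to process the data, combining the strengths of both. Initial experiments look good; the long-term challenge is whether language models can reach the domain performance of specialized models.

Agents · MCP/Tools · Papers/Research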

19:40

Rohan Paul@rohanpaul_ai

46

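Study finds that current AI agent teams struggle to reach consensus decisions

The study shows that teams of AI agents built from multiple LLMs have fundamental difficulty when they must coordinate to reach a final decision. Developers often assume that adding more agents and letting them discuss will solve a problem, but the paper shows this assumption currently does not hold. Even in friendly, cooperative settings, agent teams often deadlock or stop responding entirely, and the problem gets worse as the team grows. Existing AI agent systems therefore cannot yet reliably handle tasks that require agreeing on a single correct answer.

Agents · Papers/Research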
