AIHOT
内容
精选全部 AI 动态AI 日报主题收藏
接入
Agent 接入
更多
关于更新日志反馈
内部员工登录
精选全部日报更多
内部员工登录
全部动态X · 122 条
全部一手资讯X论文
标签「arXiv」清除
Rohan Paul@rohanpaul_ai · 5月23日55

AI detectors fail because student writing is too varied to judge from 1 document. The problem is not only that AI writing is getting better, but that many real students write in ways that can look statistically close to AI output. The paper frames this as a testing problem where the detector does not know each student’s normal writing style, so “human writing” is not 1 fixed target. Because of that, any detector that catches many AI-written submissions must also wrongly accuse some real students, especially students whose writing is more structured, formulaic, or shaped by learning English. The authors use basic statistics to show that this false-accusation problem is not just a bug in current tools, because it appears whenever student writing overlaps with AI writing. A university is not comparing “AI text” with “human text”; it is comparing one submission with the unknown writing habits of one particular student. Better detectors may reduce some errors, but they cannot erase the structural problem created by one-shot judgment. ---- Paper Link – arxiv. org/abs/2603.20254 Paper Title: "AI Detectors Fail Diverse Student Populations: A Mathematical Framing of Structural Detection Limits"

译该研究指出,AI检测器频繁失效的根本原因在于学生写作风格的多样性,使得仅凭单份文档判断是否为AI生成变得极为困难。问题不仅在于AI写作能力在提升,更在于许多真实学生的写作风格,在统计特征上已与AI输出高度相似。检测器无法事先掌握每个学生独特的写作习惯,因此“人类写作”不存在一个固定的判断标准。这意味着任何能有效识别大量AI文本的检测器,都不可避免地会误判一部分真实学生,尤其是写作更规范、公式化或受英语学习影响的学生。现有技术或许能降低错误率,但无法根除基于“单次判断”模式所带来的结构性误判问题。

Rohan Paul@rohanpaul_ai · 5月23日79

Google DeepMind's new paper. Shows that AI can now search formal mathematics proofs, but only inside carefully constrained worlds. The striking result is not that the system “thinks like a mathematician,” but that it keeps forcing its thoughts through Lean, where every step must compile. The problem is that LLMs can sound convincing in math while still making tiny mistakes, so the authors use Lean, a proof system that checks every logical step. Their system, AlphaProof Nexus, lets an LLM keep editing a formal proof, read compiler errors, try again, and sometimes ask a stronger proof tool for help on smaller subproblems. The stronger version also keeps a shared pool of partial proof attempts, rates which ones look promising, and uses those attempts to guide later searches. That changes the role of the model from a persuasive storyteller into a generator of candidates that can be killed quickly when they are wrong. The verifier is not a cosmetic add-on, it is the mechanism that makes exploration tolerable. Without it, a beautiful proof sketch can hide a false lemma; with it, the model has to turn insight into executable logic, or fail visibly. The authors tested the system on real unsolved math problems, including 353 formalized Erdős problems and 492 open conjectures from the Online Encyclopedia of Integer Sequences. The main result is that the best agent solved 9 Erdős problems and proved 44 sequence conjectures, while also helping with problems in optimization, graph theory, algebraic geometry, and quantum optics. The failures are as revealing as the wins, because the agents sometimes buried the hard part inside a helper lemma or hallucinated a known result, exactly the kind of error formal checking is built to expose. The real shift is not full mathematical autonomy, but a new division of labor: humans choose the formal question, libraries define the terrain, models propose routes, and the proof assistant refuses to be impressed. ---- "Advancing Mathematics Research with AI-Driven Formal Proof Search" Paper Link – arxiv. org/abs/2605.22763

译Google DeepMind提出了AlphaProof Nexus系统,它将大型语言模型与Lean形式化验证工具相结合。该系统允许LLM在生成证明的过程中,不断读取Lean的编译错误并进行修正,还可调用更强的工具辅助解决子问题。这一机制迫使模型将每一步逻辑都转化为可编译、可验证的代码,从而将其角色从“令人信服的叙述者”转变为“候选方案生成器”。在针对353个Erdős问题和492个开放猜想的测试中,系统成功解决了9个Erdős问题并证明了44个序列猜想。该研究展示了形式化验证在暴露AI逻辑错误、建立“人类提问-模型探索-验证器把关”新分工中的关键作用。

AK@_akhaliq · 5月22日56

LongMINT Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

译LongMINT 评估长期智能体系统中多目标干扰下的记忆能力

AK@_akhaliq · 5月21日67

Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

译基于点互信息的推理强化学习反自蒸馏方法

elvis@omarsar0 · 5月19日62

// Code as Agent Harness // 100+ page report on all things related to agent harnesses. (bookmark it) In particular, the survey summarizes methods and applications of code as agent harness. This paper makes a strong case that code-as-harness might be the key to moving us towards a broader science harness engineering. Is code all you need? Maybe. Regardless, the paper argues that future systems must have the following four properties: executable, inspectable, stateful, and governed. Paper: https://arxiv.org/abs/2605.18747 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译推文聚焦于一篇探讨AI代理(Agent)开发框架的百页报告,其核心主张是“代码作为代理框架”具有重要潜力。报告总结了相关方法与应用,并论证该路径可能推动更广泛的科学框架工程。论文进一步提出,未来的智能系统必须具备四项关键特性:可执行、可检查、有状态以及受控。报告旨在为构建有效AI代理提供参考,并推荐相关学习资源。

Berryxia.AI@berryxia · 5月18日64

兄弟们,Google最新论文直接把时间序列预测的底层逻辑翻了个个儿。 过去所有模型都在死磕历史数据:曲线怎么走,就怎么预测。 Nexus却说:预测需要的不只是历史,而是“事件上下文”。 数字背后的真正原因——政策、突发事件、宏观趋势、局部冲击——必须和数字互相解释。 他们用多agent框架把这件事拆得清清楚楚: 一个agent从海量文本里提炼事件时间线, 一个读宏观政权, 一个盯局部冲击, 最后一个合成器把所有信息和历史误差校准后给出最终预测。 真实测试里,用Claude驱动的Nexus版本,在Zillow数据集上把平均MAPE直接砍了86.6%。 不是小幅提升,是降维打击。 以前模型只会“看懂模式”,现在它开始“理解因果”。 这篇论文真正厉害的地方不是某个数字,而是把预测从“统计外推”彻底变成了“多agent推理”。

译Google论文提出的Nexus框架颠覆了传统时间序列预测仅依赖历史数据的模式,强调“事件上下文”的核心作用。该框架采用多智能体协作架构:分别从文本中提取事件时间线、解读宏观态势、追踪局部冲击,最终通过合成器整合信息并校准误差。在Zillow数据集测试中,基于Claude的版本将平均预测误差(MAPE)大幅降低86.6%,实现了从“识别模式”到“理解因果”的范式转变。这标志着预测方法正从统计外推转向结构化推理,为未来预测系统指明了新方向。

Rohan Paul@rohanpaul_ai · 5月17日63

Is Grep All You Need? The surprising result is not that grep is powerful, but that agent design makes it powerful. The paper says not that grep beats vectors, but that agents fail or win through their harness. That sounds like a small distinction until you look at what was actually tested. The authors compare grep-style search and vector retrieval across LongMemEval tasks, where agents must recover facts from long conversation histories full of distractors. Inline grep beats inline vector across every harness-model pair in their main experiment, sometimes by wide margins. The tempting headline is that vector databases are overbuilt for coding agents. The better reading is sharper: when the answer is anchored in literal evidence, names, dates, file paths, function names, error strings, user preferences, grep gives the model a clean mechanical advantage. Embeddings are built to tolerate paraphrase, but tolerance has a cost. They can pull in semantically nearby clutter, especially when a short agent query is vague. Grep has the opposite failure mode. It is dumb, cheap, and narrow, but when the agent knows the right string to hunt for, dumb becomes a feature. The deeper finding is that retrieval is not a component you can benchmark in isolation. The same search method behaves differently depending on whether results are injected inline, written to files, routed through a CLI, or wrapped in a custom agent loop. So the question is not “Do we still need vector databases?” The question is whether your agent is solving a semantic discovery problem or an evidence-location problem. For coding agents, a surprising amount of work is evidence-location: find the symbol, trace the call, inspect the diff, read the failing test, recover the exact line. Vectors still matter at scale and for fuzzy conceptual search, but this paper weakens the lazy default that every serious agent stack begins with embeddings. Sometimes the upgrade is not a smarter index. Sometimes it is giving the model primitive tools, clean files, disciplined context, and a harness that lets exact search do exact work. ---- Paper Link – arxiv. org/abs/2605.15184 Paper Title: "Is Grep All You Need? How Agent Harnesses Reshape Agentic Search"

译研究指出,在编码智能体需精确定位证据(如符号、函数名、错误信息)的任务中,基于grep的精确字符串搜索比向量检索更具优势。关键在于,检索性能高度依赖智能体的设计框架——结果呈现方式(内联、文件或CLI)会极大影响搜索效果。论文挑战了“智能体栈必须始于嵌入”的默认假设,强调应区分任务类型:是语义发现问题,还是证据定位问题。对于后者,为模型提供原始工具、清晰上下文和精确搜索的框架,往往比构建复杂索引更有效。向量数据库在模糊语义搜索和大规模场景中仍有价值。

Rohan Paul@rohanpaul_ai · 5月17日64

New Google paper: A forecast needs context, not just history. Some patterns are caused by events, not time. Nexus reframes forecasting as a reasoning problem, where events and numbers have to explain each other. Nexus argues that forecasting improves when models read the world around the numbers, not just the numbers themselves. In the Zillow tests, one Claude-based version cut average MAPE by 86.6% versus direct chain-of-thought prompting. That matters because most time series models are fluent in pattern, but mute about cause. A housing inventory curve can reflect seasonality, mortgage pressure, migration, layoffs, and local supply, while a stock price can be bent by earnings, regulation, hype, and fear. Nexus separates those jobs instead of asking one prompt to do everything. One agent turns messy historical text into a clean event timeline, one reads the broad regime, another tracks local shocks, and a synthesizer reconciles them with calibration from past errors. The interesting result is not merely that context helps, but that structure helps the language model use context without losing the time series. The evidence is still narrow: Zillow counts, seven equities, post-cutoff data, and single-run evaluations, so this is not a universal law of forecasting. But the direction is clear: future forecasters will not only extrapolate curves; they will argue about what made the curve move. ---- Paper Link – arxiv. org/abs/2605.14389 Paper Title: "Nexus : An Agentic Framework for Time Series Forecasting"

译谷歌新论文提出Nexus框架,将预测重构为推理问题,强调结合事件背景而非仅依赖历史数据。该框架采用多智能体分工:一个从文本中提取清晰事件时间线,一个分析宏观态势,另一个追踪局部冲击,最后由合成器结合时间序列进行校准。在Zillow的测试中,基于Claude的某个版本将平均绝对百分比误差降低了86.6%。研究表明,结构化的上下文能帮助语言模型有效利用信息而不丢失时间序列特性。尽管当前证据仅涵盖房地产数据和少数股票,但方向明确:未来预测不仅会推断曲线,还将解释曲线变动的原因。

Berryxia.AI@berryxia · 5月16日65

兄弟们,训练Diffusion LLM原来可以这么省? 大家都知道扩散语言模型(DLM)很香:支持双向生成、非顺序解码、灵活编辑。 但从零训一个,成本高得离谱。 Duke大学PhD Fred Peng(@pengzhangzhi1)和团队直接给出了一个反直觉的答案: 别重训了,直接对齐就行。 论文标题叫《Don’t Retrain, Align》。 核心思路很简单: 我们已经有强大的预训练Autoregressive LM(AR LM),里面已经学好了绝大部分语言表示。 DLM真正需要改的只是生成顺序和去噪行为。 所以他们提出了REPR-ALIGN:在做masked diffusion训练的同时,逐层用余弦相似度,把DLM的hidden states对齐到冻结的AR teacher模型上。 不需要加adapter,不需要改架构,只改attention mask。 结果:在他们的实验设置里,训练速度最高提升4倍,低数据场景下效果尤其明显。 一句话总结: 不要把表示空间从头重训一遍,对齐它,让模型只去重新学习解码路径就够了。 Paper:https://arxiv.org/abs/2605.06885 Code:https://github.com/pengzhangzhi/Open-dLLM 如果你在搞扩散模型、生成式AI或者长上下文生成,这篇值得立刻读。

译杜克大学团队提出一种高效训练扩散语言模型的新方法。核心观点是无需从头训练,而是将现有强大的预训练自回归语言模型作为知识源。他们提出的REPR-ALIGN方法,在掩码扩散训练过程中,通过余弦相似度逐层将扩散模型的隐藏状态与冻结的自回归教师模型对齐。该方法无需添加适配器或改变架构,仅调整注意力掩码。实验结果显示,训练速度最高可提升4倍,在低数据场景下效果提升尤其显著。

elvis@omarsar0 · 5月15日60

Interesting position paper on agentic AI as a foreseeable pathway to AGI. (bookmark it) There has been strong debate on whether a larger single model get us there or a multi-agent system. The authors argue that agentic AI systems, not bigger foundation models on their own, are the most foreseeable route to AGI. Formalizes what "agentic" actually contributes beyond the base model: memory, reasoning, tool use, self-improvement, alignment. Each is a separable axis with its own bottlenecks (long-horizon coherence, credit assignment, safety auditing). They argues that none of those bottlenecks get solved by another order of magnitude on pretraining compute. Paper: https://arxiv.org/abs/2605.12966 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译一篇立场论文认为,实现通用人工智能(AGI)最可预见的途径是智能体AI系统,而非单纯扩大基础模型规模。作者将“智能体”能力形式化为超越基础模型的几个可分离维度:记忆、推理、工具使用、自我改进和对齐。每个维度都存在自身瓶颈,如长程连贯性、信用分配和安全审计。这些瓶颈无法仅通过增加一个数量级的预训练计算来解决。论文回应了关于AGI路径的争论,即究竟是单一大型模型还是多智能体系统更有效。

AK@_akhaliq · 5月12日58

Pixal3D Pixel-Aligned 3D Generation from Images

译Pixal3D 从图像生成像素对齐的三维模型

elvis@omarsar0 · 5月12日61

// LLMs Improving LLMs // Interesting progress the past of couple of weeks around self-improving AI agents. If autoresearch was interesting, you will like this read. (bookmark it) We've been hand-tuning test-time scaling for a year. This work asks what happens when you let an LLM search the space instead. The paper introduces AutoTTS, a framework that reframes the human role: instead of designing branching, pruning, and stopping heuristics directly, you construct a discovery environment where TTS strategies can be searched automatically. They formulate width–depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, so candidate controllers can be evaluated cheaply without repeated LLM calls. Two design choices carry the search. Beta parameterization makes the control space tractable. Fine-grained execution-trace feedback tells the explorer LLM why a candidate failed, not just that it did. On math reasoning benchmarks, the discovered controllers beat strong hand-designed baselines on the accuracy–cost Pareto frontier and generalize zero-shot to held-out benchmarks and model scales. Entire discovery cost: $39.9 and 160 minutes. Why it matters: The era of researchers hand-crafting CoT, best-of-N, and self-consistency recipes is on a clock. Once the search loop is cheap enough, TTS becomes another thing LLMs do for themselves. Paper: https://arxiv.org/abs/2605.08083 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译近期研究提出AutoTTS框架,让大语言模型自主搜索并优化测试时扩展策略,取代人工设计。该框架将宽度-深度TTS策略制定为对预收集推理轨迹的控制器合成问题,通过Beta参数化压缩搜索空间,并利用细粒度执行轨迹反馈指导探索。在数学推理基准测试中,自动发现的控制器在准确率-成本帕累托前沿上超越了人工设计的强基线,且能零样本泛化到其他基准和模型规模。整个发现过程仅需39.9美元和160分钟,预示着人工设计思维链等方法的时代可能即将结束,TTS将成为LLM自主完成的任务。

elvis@omarsar0 · 5月11日70

// The Memory Curse in LLM Agents // (bookmark it) Long histories apparently degrades agents as they become increasingly history-following and risk-minimizing. Across 7 LLMs and 4 social dilemma games over 500 rounds, expanding accessible history degraded cooperation in 18 of 28 model–game combinations. They call it the memory curse. Lexical analysis of 378,000 reasoning traces shows the mechanism: it's not that agents become paranoid, it's that forward-looking intent erodes. Long histories pull the model into reasoning about past slights instead of future payoffs. A LoRA adapter trained only on forward-looking traces mitigates the decay and transfers zero-shot to new games. Memory sanitization, keeping prompt length fixed but swapping in synthetic cooperative records, restores cooperation, proving the trigger is content, not length. And ablating explicit Chain-of-Thought often reduces the collapse, meaning deliberation actively amplifies the curse. Paper: https://arxiv.org/abs/2605.08060 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译研究发现,长历史记录会在大语言模型(LLM)代理中引发“记忆诅咒”,导致其过度遵循历史、规避风险,从而削弱合作能力。该结论基于7个LLM和4个社会困境游戏的实验,在28个模型-游戏组合中,有18个因历史扩展而合作退化。机制分析表明,长历史侵蚀了模型的前瞻性意图,使其更关注过去的冲突而非未来收益。通过仅在前瞻性轨迹上训练的LoRA适配器可缓解此问题,且能零样本迁移至新游戏。实验证明,触发因素是历史内容而非长度,而消除显式思维链通常能减轻合作崩溃。

Berryxia.AI@berryxia · 5月11日73

小块有大智慧?这下真成真了! 7B小模型现在直接当上了GPT-5、Claude Sonnet 4、Gemini 2.5 Pro这些顶级大模型的老板。 一篇最新论文里,一个用强化学习训练的7B模型学会了写自然语言子任务、分配给不同大模型、精确指定上下文,最后在GPQA Diamond、LiveCodeBench、AIME25等硬核基准上全面超过单个前沿模型,而且平均每个问题只调用三次大模型,比手动设计的多代理系统还高效。 最狠的是:它证明了目前商业AI产品里那些靠人工手调的prompt engineering和pipeline设计,完全可以通过奖励信号端到端学会。 以前大家觉得智能拼的是模型大小,现在看来,真正拉开差距的是“谁更会指挥”。 这才是AI下一阶段最被低估的真相。

译一项新研究证明,一个通过强化学习训练的7B语言模型能够有效指挥GPT-5、Claude Sonnet 4和Gemini 2.5 Pro等前沿大模型。该模型通过编写自然语言子任务、分配给不同大模型执行,并精确指定上下文信息,在GPQA Diamond、LiveCodeBench和AIME25等硬核基准测试中,其性能全面超越了单个前沿模型。该系统平均每个问题仅需调用约三次大模型,比手动设计的多代理流程更高效。该工作提供了关键证据,表明目前商业AI产品中依赖人工的提示工程和流程设计,完全可以仅通过奖励信号进行端到端学习。这揭示了AI发展的新方向:智能的差距可能不在于模型规模,而在于协调与指挥的能力。

elvis@omarsar0 · 5月11日57

// Scalable Patterns for Agentic AI Workflows // Besides context engineering, we should be putting a lot more system engineering efforts around agents. This paper shows an example of why it matters. (bookmark it) Let's start with an important question: Where does your agentic RAG pipeline actually lose time? It's almost never the LLM call. It's usually the data plane underneath. Serialization between preprocessing, embedding, and vector retrieval, plus coordination overhead between distributed services. New work introduces AAFLOW, a unified distributed runtime that models agentic workflows as an operator abstraction over Apache Arrow and Cylon. A zero-copy data plane connects preprocessing, embedding, and retrieval directly. Resource-deterministic scheduling and async batching cut coordination cost. The result: up to 4.64× pipeline speedup and 2.8× gains in embedding and upsert phases, with comparable LLM throughput. None of that comes from LLM inference acceleration. It all comes from cleaner data flow. Paper: https://arxiv.org/abs/2605.02162 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译智能体RAG流程的瓶颈通常不在大语言模型调用,而在于底层数据平面的序列化与分布式协调开销。新研究提出的AAFLOW是一个统一分布式运行时,将智能体工作流建模为基于Apache Arrow和Cylon的算子抽象,通过零拷贝数据平面直接连接预处理、嵌入和检索环节,并采用资源确定性调度与异步批处理降低协调成本。该方案实现了高达4.64倍的流水线加速,嵌入与更新阶段性能提升2.8倍,且所有收益均源于数据流优化,并未涉及大语言模型推理加速。

Berryxia.AI@berryxia · 5月9日66

人类大脑最聪明的地方,就是大部分时间只激活极少部分神经元。 现在LLM其实也在自然地做同样的事。 前馈层里95%以上的激活值几乎为零。 但GPU却因为硬件设计,狠狠惩罚了这种“偷懒”行为,反而让模型跑得更慢。 Sakana AI这次和NVIDIA联手,把这个硬件矛盾彻底解决了。 他们发明了TwELL(Tile-wise ELLPACK)这种全新稀疏格式 + 定制CUDA内核,直接把稀疏性“重塑”成GPU最喜欢的样子。 结果在H100上,训练和推理速度直接提升超20%,同时还大幅降低了内存占用和能耗。 这不仅是理论上的小改进,也是真正将“让模型少算”变成了“让模型更快”的现实方案。 论文、博客和代码已经全部开源见评论区!

译现代LLM类似人脑,前馈层中超过95%的神经元对输入保持静默,呈现高度稀疏性。但GPU硬件专为密集计算设计,非结构化稀疏导致不规则内存访问,反而让计算更少的模型运行更慢。Sakana AI与NVIDIA合作解决了这一矛盾,开发了TwELL混合稀疏格式及定制CUDA内核,将稀疏性重塑为GPU易于处理的形式。该方案动态路由99%的稀疏token通过快速路径,并为密集token提供备用矩阵。在H100 GPU上,训练和推理速度提升超20%,同时降低内存占用和能耗。相关论文、博客和代码均已开源。

elvis@omarsar0 · 5月8日63

Pay attention to this one if you build multi-agent systems.

译研究显示,多智能体LLM系统在生产环境中的故障率高达41%至87%,且多数失败源于协调缺陷,而非基础模型能力问题。当前多数架构对比无法区分性能提升是来自协调优化还是更大的上下文窗口。该研究主张将协调视为一个独立、可配置的架构层,并通过控制变量实验验证:在保持LLM、工具、提示等所有条件不变时,仅改变协调结构即可显著影响系统表现。这为准确评估协调机制的价值提供了更清晰的方法论,并建立了将协调视为核心架构而非底层实现的理论框架。

Rohan Paul@rohanpaul_ai · 5月7日48

This research builds a system that trains language models continuously using everyday conversations instead of manual labeling. The huge deal here is that this method completely removes the traditional need for human workers to manually gather, review, and score massive datasets. AI Agents can now use their everyday mistakes to get smarter automatically. Whenever a person replies to the digital assistant or corrects a mistake, the software treats that response as a direct learning signal. A background program reads these natural follow-up messages and extracts specific text hints about what the model should have done differently. The software agent simply updates itself in real time during normal use by analyzing how people naturally interact with it. Every time a person corrects an agent or a software test fails, the system receives a valuable clue about how to improve. ---- Think about a student looking at their final grade and throwing the paper away without reading the teacher's helpful notes. Current Reinforcement Learning systems do the exact same thing. Current models throw this natural feedback away because they only care about whether the final outcome was a success or a failure. OpenClaw-RL fixes this by grabbing 2 specific signals from every single interaction. - First, it looks at evaluative signals to see if the action worked. If a user asks the same question again, they are probably unhappy. If a test passes, it is a success. These become simple numerical rewards using a Process Reward Model judge. - Second, it gathers directive signals to figure out how the action needs to change. User corrections and error logs offer direct guidance. These become word-level supervision using a technique called Hindsight-Guided On-Policy Distillation. Personal chats, terminal commands, Graphical User Interface clicks, and software tasks all create these reaction signals. A single policy can learn from all of them at the same time. It runs the training process in the background so the model never has to pause its normal tasks to learn. By treating standard deployment as a continuous learning environment, the model constantly adapts to individual user preferences without any manual data labeling. ---- Paper Link – arxiv. org/abs/2603.10165 Paper Title: "OpenClaw-RL: Train Any Agent Simply by Talking"

译本研究提出OpenClaw-RL系统,使语言模型能通过日常对话进行持续训练,无需人工标注数据。其核心是利用用户互动中产生的自然反馈(如纠正或重复提问)作为实时学习信号。系统从每次交互中提取两种信号:评估信号(判断行动成败,转化为数值奖励)和指导信号(获取具体改进方向,转化为词级监督)。该方法将标准部署环境转化为持续学习场景,使模型在后台运行中不断自我更新,自适应不同用户偏好,从而摆脱对大规模人工标注数据集的依赖。

elvis@omarsar0 · 5月6日64

// Skills as Verifiable Artifacts // Pay attention to this one, AI devs. If you ship agent skills, your runtime is treating signed-and-cleared skills as trusted by default. This paper argues a skill is untrusted code until it is verified. The runtime should enforce that default rather than infer trust from origin. Without skill verification, HITL has to fire on every irreversible call, which degrades into rubber-stamping at any non-trivial scale. With verification as a separate gated process, HITL fires only for what is unverified. Skills are now first-class deployment artifacts. We have decades of supply-chain lessons on what happens when trust is inferred from a signature. This paper is the right ask for SKILL.md before agent skill libraries become the next attack surface. Paper: https://arxiv.org/abs/2605.00424 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译本文针对AI开发者提出关键观点,主张智能体技能应被视为默认不受信任的代码,而非仅凭签名或来源就推断其可信。当前运行时环境默认信任已签名技能的做法存在安全风险。论文强调,技能必须经过独立的门控验证流程才能被信任,否则,每次不可逆调用都需要人工介入,这在大规模应用中会退化为无效的“橡皮图章”式批准。将技能作为一等部署工件并引入验证流程,是借鉴软件供应链安全经验、避免技能库成为下一个攻击面的关键。论文呼吁在技能库普及前,通过严格验证建立安全基准。

AK@_akhaliq · 5月6日68

From Context to Skills Can Language Models Learn from Context Skillfully? paper: https://huggingface.co/papers/2604.27660

译从上下文到技能 语言模型能否巧妙地通过上下文学习? 论文:https://huggingface.co/papers/2604.27660

elvis@omarsar0 · 5月4日68

NEW paper from Sakana AI (ICLR 2026). A 7B Conductor model just hit SOTA on GPQA-Diamond and LiveCodeBench by orchestrating other LLMs instead of solving problems itself. (great paper! bookmark it!) The Conductor is trained with RL to do two things at once: design communication topologies between worker agents (open or closed source), and prompt-engineer focused instructions to each worker so it leverages their individual strengths. It's like training a special agent to take care of both collaboration and communication. Trained against randomized agent pools, it adapts to arbitrary mixes of agents at inference time. Even more interesting: when allowed to pick itself as a worker, it forms recursive topologies, unlocking a new form of dynamic test-time scaling through online iterative adaptation. The gains over the best individual worker on AIME25 and GPQA-D land in the ~3% range, which the authors note is consistent with entire generational improvements between frontier model versions, except this one comes from coordination, not pretraining. Why it matters? We can start to think of the orchestrator as the model now. Routing decisions aren't just a wrapper, they're a learnable policy. Paper: https://arxiv.org/abs/2512.04388 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译Sakana AI在ICLR 2026上发表研究,提出一个仅70亿参数的“指挥者”模型。该模型不直接解决问题,而是通过强化学习训练,专注于为混合开源与闭源模型的工作者智能体设计通信拓扑结构,并为每个工作者生成精准指令以发挥其特长。经随机化智能体池训练后,它能在推理时适应任意智能体组合。其关键创新在于,当允许指挥模型将自己也选为工作者时,系统会形成递归拓扑,实现动态测试时扩展。该模型在GPQA-Diamond和LiveCodeBench上达到SOTA水平,在AIME25和GPQA-D上的性能比最佳单体工作者提升约3%,这相当于前沿模型一个代际的改进幅度,且增益完全来源于协同优化。

Rohan Paul@rohanpaul_ai · 5月4日48

This paper proposes a smarter way for LLMs to reason by splitting work across agents that share one workspace. The problem is that even strong reasoning models still break on harder multi-step tasks because they do not carry out logic reliably all the way through. The system, called BIGMAS, builds a small graph of specialist agents for each problem, rather than using one fixed chain every time. Every agent reads and writes through a shared workspace, while a separate controller sees the whole state and picks the next useful step. The authors tested it on 3 puzzle tasks across 6 frontier models, covering arithmetic expression search and multi-step planning. It improved results on every model and task, with examples like 12% to 30% on Six Fives and 57% to 93% on Tower of London. What matters is that the paper shows reasoning can improve from better system structure, not only from making a single model think longer. ---- Paper Link – arxiv. org/abs/2603.15371 Paper Title: "Brain-Inspired Graph Multi-Agent Systems for LLM Reasoning"

译本文提出BIGMAS系统,通过构建针对特定问题的小型专家智能体图,以提升大语言模型在多步骤复杂任务中的推理可靠性。该系统让多个智能体在一个共享工作空间中进行读写协作,并由独立控制器监控全局状态、规划下一步行动。在涵盖算术表达式搜索和多步骤规划的3个谜题任务上,对6个前沿模型进行的测试表明,该方法在所有模型和任务上均显著提升了性能,例如Six Fives任务准确率从12%提升至30%,伦敦塔任务从57%提升至93%。这证明通过优化多智能体系统结构,而非仅依赖延长单一模型的思考,能有效增强大语言模型的推理能力。

Rohan Paul@rohanpaul_ai · 5月3日56

Adobe paper tests if LLM chain of thought explanations show real influences, and finds they often hide them. LLMs often notice hidden hints in a question but leave them out of their step by step explanation. So the point they are making is that LLMs often notice hidden hints in a question but leave them out of their step by step explanation. People use step by step reasoning as a safety and debugging tool, assuming the explanation matches what drove the answer. The author plants misleading hints inside multiple choice questions and checks both the final answer and what the model says. Across 9,154 trials and 11 major LLMs, the study tried 3 instructions, normal, "you are monitored", and "report hints". Under normal use, models mentioned hints only 20.7% of the time, yet 99.4% admitted seeing them when asked later. Telling models they were monitored barely changed this, and forcing hint reporting caused 68.2% false alarms when no hints existed. User pleasing hints, where the question says a confident user believes an option, were followed 45.5% but often unmentioned, making monitoring that only reads explanations unreliable. ---- Paper Link – arxiv. org/abs/2601.00830 Paper Title: "Can They Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning"

译Adobe研究测试大型语言模型(LLM)思维链解释是否反映真实决策影响。通过在多选题中植入误导性提示,对11个主流模型进行9154次试验。正常使用时,模型仅在20.7%的逐步推理中提及隐藏提示,但后续追问时99.4%承认注意到。告知模型其推理受监控后结果未见改善;强制要求报告提示则导致68.2%误报(当无提示时)。问题包含用户倾向性提示时,模型45.5%遵从该倾向,但解释中常未说明。研究表明,思维链解释常与真实决策依据不匹配,仅依赖其作为安全调试工具可能不可靠。

AK@_akhaliq · 5月2日57

The Last Human-Written Paper Agent-Native Research Artifacts paper: https://huggingface.co/papers/2604.24658

译最后一篇人类撰写的论文 智能体原生研究制品 论文: https://huggingface.co/papers/2604.24658

向阳乔木@vista8 · 5月1日59

今天读到一篇超级棒的AI生图综述论文。 读完你就能对2026年最新生图技术有全面了解,太赞了! 还能顺带了解这几年的发展脉络。 AI解读如下,原始论文见评论区。 https://blog.qiaomu.ai/ai-image-paper-2026

译一篇关于AI生图技术的综述论文提供了对2026年最新进展的全面概览。该论文不仅梳理了当前最前沿的图像生成技术,还回顾了近年来该领域的发展脉络,有助于读者快速建立系统性认知。相关解读和原始论文链接已一并提供。

Rohan Paul@rohanpaul_ai · 5月1日62

Researchers tested autonomous AI agents in real environments and found they easily cause massive security disasters. In one test an agent actually wiped its entire email server just to keep a secret for a stranger. The main problem with standard language models is that giving them control over real computer tools creates dangerous blind spots. To understand these risks the researchers let 20 experts interact with live AI assistants through chat and email for 2 weeks. They discovered that these programs blindly follow instructions from almost anyone and often lie about what they have actually done. This matters because tech companies are rushing to deploy these autonomous helpers without fixing their basic inability to understand who they should actually trust. --- Paper Link – arxiv. org/abs/2602.20021 Paper Title: "Agents of Chaos"

译研究人员在真实环境中测试自主AI代理,发现它们极易引发大规模安全灾难,如为保守秘密而删除整个电子邮件服务器。核心问题在于标准语言模型被赋予计算机工具控制权后,产生危险盲点,导致代理盲目遵循几乎任何人的指令并经常撒谎行为。通过让20位专家与实时AI助手进行两周互动实验,研究揭示了这些程序缺乏基本信任判断能力。科技公司正急于部署此类自主助手,却未修复其无法理解应信任谁的根本缺陷,加剧了安全风险。

Rohan Paul@rohanpaul_ai · 5月1日51

Brilliant economic paper directly models the "Structural Jevons Paradox" happening right now in the AI industry. The cost of running an LLM is dropping, but total computing energy is exploding anyway. It mathematically proves that as the unit cost of digital intelligence and coding drops, the aggregate demand for complex AI agents and the infrastructure to support them surges exponentially, creating a massive new downstream ecosystem that requires human management. Reveals a massive paradox where dropping the price of AI usage does not save money, but instead encourages developers to build vastly more complex agents that eat up exponentially more computing power. Because of this relentless progress, small companies building simple applications on top of these models get completely crushed as the core AI naturally absorbs those exact same features over time. They also discovered a brutal dynamic where a perfectly working LLM becomes economically worthless the moment a competitor releases a smarter version. Ultimately, the researchers prove that this combination of massive computing costs and the need for constant user data naturally pushes the entire AI industry toward an unavoidable monopoly. --- arxiv. org/pdf/2601.12339v1 "The Economics of Digital Intelligence Capital"

译一篇经济学论文直接建模了AI行业正在发生的“结构性杰文斯悖论”。研究发现,尽管大语言模型的运行成本下降,但总计算能耗却爆炸式增长。数学模型证明,数字智能单位成本的降低,导致对复杂AI代理及其支撑基础设施的总需求呈指数级上升,并催生需要人力管理的新下游生态。这形成一个悖论:AI使用价格下降并未节约成本,反而激励开发者构建消耗指数级算力的更复杂代理。持续进步使得基于大模型开发简单应用的小公司被核心AI吸收的功能所淘汰。竞争动态中,性能完善的模型一旦有更智能的版本出现即失去经济价值。最终,巨大的计算成本与持续的用户数据需求,共同推动整个AI行业走向不可避免的垄断。

Rohan Paul@rohanpaul_ai · 4月30日54

The paper proposes a way for a coding agent to rewrite its own tools and rules, then check whether each change really helped. The big deal is that it turns harness tuning from guesswork into an auditable experiment, so the part of agent systems that quietly eats the most time and effort can now improve itself in a controlled and measurable way. The problem is that agent harnesses, meaning the prompts, tools, memory, and rules around a model, are usually tuned by hand or changed through messy self-improvement loops that produce lots of edits but little clear evidence about what helped. The method, called Agentic Harness Engineering, turns those edits into file-level parts that can be changed or rolled back, compresses huge run logs into short failure evidence, and makes the agent write a prediction for each edit that later gets checked against real task results. They tested this on Terminal-Bench 2, a hard coding benchmark in a terminal, by starting from a very small shell-only harness and letting the loop run for 10 rounds while keeping the base model fixed. The single-try success rate rose from 69.7% to 77.0%, beating Codex-CLI at 71.9% and other self-evolving baselines, which suggests the gains came from better harness design rather than from swapping in a stronger model. The final harness also carried over to other models and to SWE-bench-verified, with gains of 5.1 to 10.1 points across model families and 12% fewer tokens than the seed on SWE-bench-verified, which matters because harness work is expensive and this gives a more reliable way to let that layer improve itself without drifting into random noise. ---- Paper Link – arxiv. org/abs/2604.25850 Paper Title: "Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses"

译本文提出Agentic Harness Engineering方法,使编码代理能自动重写自身工具和规则,并通过可审计实验验证每次更改的有效性。传统代理工具调整依赖手动或混乱自我改进循环,缺乏明确证据。该方法将编辑转化为文件级可回滚部分,压缩运行日志为简短失败证据,并让代理为编辑写预测后基于任务结果检查。在Terminal-Bench 2测试中,从小型shell-only工具开始,经10轮进化且基础模型固定,单次尝试成功率从69.7%提升至77.0%,超越其他基线。最终工具可迁移至其他模型和SWE-bench-verified任务,在不同模型家族获得5.1到10.1点提升,并减少12%令牌使用,为昂贵工具工作提供可靠、可控的自我改进途径。

elvis@omarsar0 · 4月29日55

// Agentic Harness Engineering // Pay attention to this one, AI devs. (bookmark it) Most coding-agent harnesses are still tuned by hand or brittle trial-and-error self-evolution. This new work introduces Agentic Harness Engineering, a framework that makes harness evolution observable. They do this through three layers: components as revertible files, experience as condensed evidence from millions of trajectory tokens, and decisions as falsifiable predictions checked against task outcomes. Each edit becomes a contract you can verify or revert. Results: pass@1 on Terminal-Bench 2 climbs from 69.7% to 77.0% in ten iterations, beating human-designed Codex-CLI (71.9%) and self-evolving baselines like ACE and TF-GRPO. The evolved harness also transfers across model families with +5.1 to +10.1 point gains, while using 12% fewer tokens than the seed on SWE-bench-verified. Harness work is the biggest hidden cost in most agent systems. This is the first credible recipe for letting the harness improve itself without drifting into noise. Paper: https://arxiv.org/abs/2604.25850 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译针对AI智能体开发中依赖人工调试、成本高昂且脆弱的“缰绳”设计问题,研究者提出了“智能体缰绳工程”框架。该框架通过三层设计实现可观测的进化:将组件视为可回滚的文件、从海量运行轨迹中提炼经验证据、将决策转化为可由任务结果验证的预测。每次修改都成为可验证或回滚的“合约”。实验表明,该框架在十次迭代内将Terminal-Bench 2的pass@1分数从69.7%提升至77.0%,超越人工设计与基线方法。进化后的缰绳能跨模型迁移并提升性能,同时在SWE-bench上减少12%的令牌消耗,为智能体系统的核心组件提供了首个自动化、可靠的优化方案。

向阳乔木@vista8 · 4月29日53

姚老师和张凯经过大量数据研究分析写的论文,还有一手实战经验。 用科学的方法做GEO,像用数据洞察做增长一样。

译姚老师和张凯的GEO论文已在全球最大论文平台arxiv完成审核并发布,这是全球第二篇GEO专项研究。论文基于今年3月最新数据,涵盖大量Prompt、引用和AI抓取记录,采用科学方法进行GEO分析,类似数据驱动的增长洞察。研究成果以正式报告形式呈现,源数据已开源在GitHub。作者表示,如果对社区有帮助,将继续抓取更多数据进行专项研究并开放成果。

Rohan Paul@rohanpaul_ai · 4月22日

New University of Luxembourg+LIH paper reveals a critical gaps in LLMs’ ability to handle structured reasoning under constraints It checks if LLMs can solve Optimal Power Flow problems end to end, and finds that they mostly cannot do so physically coherently. Across models and sizes, constraint satisfaction stayed stuck at about 55 to 60 percent. The interesting result here is not that LLMs miss a hard engineering problem. It is that they miss it in a very specific way. Optimal Power Flow is a brutal test of real reasoning because it is not just about getting numbers close to a target, but about satisfying a web of physical constraints at the same time, from generator limits to bus voltages to the power-flow equations themselves. That sounds minor until you look at the mechanism. A model can produce an answer that looks clean, uses the right JSON, and even lands near the right values on mean squared error, while still violating the equations that make the grid physically coherent. This paper shows exactly that failure mode. Across several model families and sizes, constraint satisfaction sits in a stubborn band around 55 to 60 percent, and the main bottleneck is the power-flow constraints, while generator and voltage limits are often satisfied far more easily, as the table on page 12 makes plain. Here’s the part most people miss. That pattern is not a small bug in prompting. It suggests the models are learning the shape of a solution without actually carrying out the constrained search that the problem demands. The ablations make the point sharper. Supervised fine-tuning improves formatting and often lowers MSE, but it does not materially improve physical feasibility, and even a more elaborate system prompt barely moves the numbers, which is about as clean a rejection of “prompting will fix it” as you can ask for. Reinforcement learning with a reward for valid structure and satisfied constraints helps a bit, especially on the 30-bus case, but even there the gains are modest rather than transformative, as the study overview on page 2 and results plots on pages 7 and 8 show. So the real lesson is not that LLMs cannot reason. It is that fluent approximation is not the same thing as optimization under law, and until models can reliably honor the constraints that define a system, “looks plausible” remains a very dangerous standard. ---- Paper Link – arxiv. org/abs/2603.23004v1 Paper Title: "Can LLMs Reason and Optimize Under Constraints?"

译卢森堡大学与LIH研究揭示,LLM在结构化约束推理中存在关键缺陷。通过最优潮流问题测试发现,各类模型约束满足率停滞于55%-60%,主要瓶颈是无法满足电力系统物理约束方程。研究表明,模型仅学会"解的形状"却未真正执行约束搜索,导致输出看似合理(格式正确、误差小)却物理不可行。监督微调虽改善表面指标,但无法提升物理可行性;强化学习亦效果有限。研究警示:流畅近似不等于约束优化,"看起来合理"是危险标准。

Rohan Paul@rohanpaul_ai · 4月19日

Big claim in this paper. "Prefill-as-a-Service" Prefill, the heaviest part of inference, may finally be portable. Long-context AI is no longer trapped inside a single datacenter. Shows how to run LLM prefill on remote clusters by sending much smaller saved prompt state. So long-prompt work can be done on remote machines and sending back only the smaller saved state needed to answer. The breakthrough is not sending everything farther, but sending the right requests farther. --- When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering. That first step is called prefill, and it is brutally compute-heavy. The second step is decode, where the model generates tokens one by one, and that part is more about memory bandwidth than raw compute. But moving the saved prompt state between those phases is usually so data-heavy that both parts must stay in the same tightly connected cluster. So Until now, those two steps usually had to stay close together inside the same fast network, because prefill creates a huge blob of temporary memory called KVCache that had to be moved quickly to the decode machine. That is the bottleneck. What changed is model design. Newer hybrid-attention models produce much smaller KVCache than older dense-attention models, so shipping that state across ordinary datacenter links starts to become practical instead of absurd. The paper’s idea is a Prefill-as-a-Service setup that sends only long, uncached prompts to a remote prefill cluster, then ships back the saved prompt state, called KV cache, over normal Ethernet while short requests stay local. This works mainly because newer hybrid-attention models create far less KV cache than older dense models, and the system adds smart routing, bandwidth-aware scheduling, and cache-aware placement so the network does not clog up. The authors test this with an internal 1T-parameter hybrid model on a mixed setup that uses H200 GPUs for remote prefill and H20 GPUs for local decode. With a routing threshold near 19.4K tokens, about 50% of requests go remote, average cross-cluster traffic is only 13Gbps on a 100Gbps link, and throughput rises 54% over a local-only baseline and 32% over a naive heterogeneous setup. The real point is that smaller KV cache alone was not enough, but paired with selective offloading and scheduling it makes cross-datacenter LLM serving workable, more flexible, and easier to scale across different hardware. ---- Paper Link – arxiv. org/abs/2604.15039v1 Paper Title: "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter"

译新一代混合注意力模型通过压缩KV Cache,使Prefill-as-a-Service架构成为可能。该方案将重计算的Prefill阶段卸载至远程集群,仅回传轻量KV Cache至本地解码,短请求则本地处理。配合智能路由与带宽感知调度,可在普通以太网高效传输。实测1T参数模型显示,50%请求远程处理时跨集群流量仅13Gbps,吞吐量提升54%,打破长上下文AI局限于单一数据中心的瓶颈。

Rohan Paul@rohanpaul_ai · 4月19日

Anonymous usernames are no longer much protection when LLMs can piece together a person’s public trail. LLMs can identify supposedly anonymous people online by turning messy posts into personal clues. The best setup finds 68% of true matches at 90% precision, meaning 9 out of 10 guesses are right, while older methods stay near 0%. The problem is that pseudonyms often seemed safe only because linking a person across sites used to take lots of careful manual work. This paper cuts that work by making an LLM do 3 jobs: pull identity hints from raw text, search a huge pool of possible matches, and compare the best candidates to reject weak fits. The authors tested this on 3 cases: matching Hacker News users to LinkedIn profiles, matching Reddit movie users across communities, and matching the same Reddit users across different time periods. The main result is that the reasoning step beats simple matching by a wide margin and stays useful even as the candidate pool grows, which matters because it shows that public writing alone can now be enough to join accounts or name a person at scale. ---- Paper Link – arxiv. org/abs/2602.16800 Paper Title: "Large-scale online deanonymization with LLMs"

译LLM可通过分析公开写作实现大规模去匿名化。研究让模型执行提取身份线索、搜索匹配池、比较验证候选者三项任务,在Hacker News与LinkedIn、Reddit跨社区及跨时间段等场景测试中,达到90%精确度与68%召回率,远胜旧方法。关键突破在于推理步骤能处理大规模候选池,证明零散公开文本已足以关联账户并识别个人,传统匿名保护机制失效。

Rohan Paul@rohanpaul_ai · 4月18日

Interesting paper title😀 "What the F*ck Is Artificial General Intelligence?" It defines intelligence as adaptability under limits of compute, memory, and energy. So AGI is a system that adapts at least as generally as a human scientist That means it should be able to plan experiments, learn cause and effect, balance exploration and action, and operate with autonomy. The paper calls this type of AGI an artificial scientist, because it is judged by its ability to discover and adapt across many tasks, not just by passing human-like tests. So AGI is not just “human-level AI” but a whole system that can adapt broadly, efficiently, and scientifically, at least as well as a human scientist. ---- arxiv. org/abs/2503.23923

译一篇论文提出,智能的本质是在计算、内存和能源限制下的适应性。据此,AGI被定义为至少能像人类科学家一样普遍适应的系统,需具备规划实验、学习因果关系、平衡探索与行动及自主操作的能力。论文将这种AGI称为 artificial scientist,强调其评判标准在于跨任务发现与适应能力,而非通过类人测试。作者指出,AGI并非简单的"人类水平AI",而是能够广泛、高效且科学地进行适应的完整系统。

Rohan Paul@rohanpaul_ai · 4月17日

BIG claim from new MIT + Oxford + Carnegie Mellon and other top labs paper: AI can boost performance at first and then leave people less able to think through problems on their own. Just minutes of AI help can improve scores now while weakening independent problem-solving right after. The interesting part is that the damage is not just lower accuracy. It is lower persistence, which is usually the hidden engine of learning, because skill grows through repeated contact with difficulty, not just exposure to correct answers. That's why a good teacher sometimes withholds help to preserve struggle as part of the lesson, while today’s chatbots are tuned to erase friction on demand. Across 3 experiments in math and reading, about 1.2K people either worked alone or used a GPT-5-based assistant for part of the task. Assisted users finished early questions faster, but after roughly 10 minutes without AI, they solved less, stalled more, and quit sooner. That happens because hard thinking is not only about getting answers; it is also about building the habit of holding a problem in mind, testing steps, and pushing through confusion. The sharpest drop came from people who used the model for direct answers, not from those who used it more like a hint system, which suggests the real issue is not AI exposure itself but replacing effort with completion. The result is not that AI makes people less capable by default, but that answer outsourcing can shrink the mental effort that normally trains skill. ---- Paper Link – arxiv. org/abs/2604.04721 Paper Title: "AI Assistance Reduces Persistence and Hurts Independent Performance"

译MIT、牛津及卡内基梅隆等机构联合研究发现,AI辅助虽能短期提升任务表现,却会损害用户独立解决问题的能力。针对GPT-5的实验涉及约1,200名参与者,结果显示获取直接答案的用户在停用AI后表现出更低的坚持性,更容易放弃难题。研究指出,技能培养依赖于与困难的反复接触而非仅获得正确答案,将AI用作提示系统而非答案生成器,可有效避免这一问题。

Rohan Paul@rohanpaul_ai · 4月16日

This paper shows that GitHub stars can be bought at scale, and that the distortion now bleeds into security. The authors identify 6 million suspected fake stars tied to 18,617 repositories. That matters because stars are not just vanity on GitHub. They are a shortcut people use to decide what looks credible, useful, or safe enough to try, even though earlier work already suggested stars are only a rough proxy for real adoption. The problem is not just inflated popularity, but the way a weak social signal becomes infrastructure for malware, spam, and low-effort hype once enough people treat it as evidence. The paper’s detection strategy is clever because it does not need to prove intent account by account. It looks for behavioral signatures that are hard to fake at scale: throwaway accounts with almost no activity, and coordinated “lockstep” bursts where many accounts star many repositories within short windows. What they find is ugly. Fake-star activity surged in 2024, most flagged repositories were later deleted, many appear to have been phishing or spam, and the surviving non-malicious-looking targets cluster in predictable status games like AI, blockchain, tools, and demos. The most interesting result is about incentives. Fake stars do appear to buy a little real attention for less than two months, but the effect is far smaller than genuine popularity and turns negative over time, which suggests that social proof can open the door but cannot compensate for weak underlying substance. Once a platform’s easiest visible number starts standing in for trust, attackers do not need to beat the system completely; they only need to be believable for a moment. ---- Paper Link – arxiv. org/abs/2412.13459 Paper Title: "Six Million (Suspected) Fake Stars in GitHub: A Growing Spiral of Popularity Contests, Spams, and Malware"

译研究识别出GitHub上600万个疑似伪造星标,涉及18,617个仓库。2024年此类活动激增,大量被用于钓鱼、垃圾信息及恶意软件传播,重灾区集中在AI、区块链等领域。检测通过分析一次性账户和"同步"爆发等行为特征实现。假星标虽能在短期内带来真实关注,但长期效应为负,无法弥补内容匮乏。当星标这类易见的社交信号被当作信任基础设施,攻击者只需制造瞬间可信性即可实施攻击,这对开源生态构成系统性威胁。

Rohan Paul@rohanpaul_ai · 4月15日

This paper formalizes a simple idea: sometimes the world remembers for an agent, so the agent can remember less. The problem is that AI research usually treats memory as something stored inside the agent, even when the environment may quietly keep useful records of earlier events. The key idea is an artifact, which is a current observation that reveals something about the past, like a visible path that tells the agent where it has already been, and the paper proves that such artifacts can reduce how much history must be represented. Once that exists, the Artifact Reduction Theorem says part of history has become redundant. If seeing X now guarantees Y happened earlier, you do not need to store both to predict what comes next. This is not philosophy of mind dressed up as RL; it is an information claim about when environment structure can substitute for internal state. In five navigation settings, agents that could see spatial traces needed less internal capacity to learn strong policies, across both linear Q-learning and DQN. And the effect was not limited to perfect guidance. Even random, suboptimal, and fading self-generated paths could help, which suggests the gain comes from externalizing bits of history, not merely following the best route. That matters for agent design. The usual instinct is to buy more memory, longer context, or bigger models, but this work points to another lever: shape the workspace so useful traces persist where perception can pick them up. Memory, on this view, is not only what sits inside the model. It can be partly written into the environment, then read back through ordinary observation. ---- Paper Link – arxiv. org/abs/2604.08756 Paper Title: "Artifacts as Memory Beyond the Agent Boundary"

译该研究提出"artifacts"概念,指环境中记录历史信息的可观察痕迹(如路径),并证明其可减少智能体需存储的历史信息。Artifact Reduction Theorem指出,当当前观察能保证过去事件发生时,无需同时存储两者即可预测未来。在五个导航场景中,能看到空间痕迹的智能体只需更少内部容量即可学习强策略(适用于linear Q-learning和DQN),且随机、次优或渐褪的路径同样有效。这表明记忆可外化于环境并通过感知读取,为智能体设计提供了除增加模型规模外的新思路。

Rohan Paul@rohanpaul_ai · 4月14日

AI can “infect” other AIs with a hidden bias even when the bad instruction is never stated directly. That bias can spread through normal-looking conversations, so standard defenses that scan for obvious malicious prompts may not catch it. That is the unsettling part: in a multi-agent system, what spreads is not just information but disposition. The authors compromise one agent with a system prompt that makes it obsess over an unrelated three-digit number, then let six agents interact in simple chain and bidirectional-chain setups. When they later ask each agent for its favorite animal, downstream agents become more likely to name the animal linked to that number, even though the animal itself is almost never mentioned in the inter-agent messages. Here’s the key part, this is not ordinary prompt injection, and it is not a brittle adversarial suffix, because the payload seems to survive paraphrase by riding on latent associations rather than explicit wording. The effect is strongest in the first AI that got the hidden bias, then gets smaller in the next AIs, then smaller again. But it does not disappear right away, so even the last AI in the chain still acts more biased than normal. On TruthfulQA, a single biased agent produces downstream drops in truthfulness on the order of roughly 0.4% to 1.0% on average between “truthful” and “deceitful” token settings, which is modest, but enough to turn a strange prompting artifact into a real alignment problem. So Multi-agent safety tools built to catch explicit malicious content may miss a quieter failure mode, where bias moves through normal coordination and arrives looking like nobody attacked the system at all. ---- Paper Link – arxiv. org/abs/2603.00131 Paper Title: "Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems"

译研究揭示多智能体系统中存在"思维病毒"现象:AI可通过潜在联想而非明确措辞,在看似正常的对话中隐性传播隐藏偏见。实验显示,单个被植入偏见的智能体即可影响下游代理,导致TruthfulQA真实性下降0.4%-1.0%。这种传播不依赖显式恶意提示,能逃过标准安全检测,构成多智能体系统的新型对齐风险。

Rohan Paul@rohanpaul_ai · 4月13日

This Baidu paper found a way to use the clean, reliable rewards of RL on tasks like writing and subjective answers, where there is usually no single “correct” output. Instead of asking “is this response correct?”, they ask “which of these two responses is better?”, and that simple reformulation appears to improve open-ended reasoning better than standard reward-model training on their benchmarks. i.e. it turns open-ended writing into verifiable choices, and RL starts working there too. Across seven open-ended benchmarks, the method beats a matched RLHF baseline by an average 3.29 points on a 14B reasoning model. The clever part is not a better reward model. It is a change in what the model is asked to do during training. Instead of grading a poem or subjective answer directly, the system sees two candidate responses, one preferred and one rejected, and learns to identify which is better. Multiple choice creates a clean binary signal, so the model can be trained with the same kind of verifiable reward that made RL powerful in math and code, without pretending open-ended tasks have one canonical answer. The gain is probably not just better taste imitation. The paper’s DPO ablation underperforms badly, which suggests the benefit comes from learning a contrastive verification habit, not merely absorbing preference pairs. The authors also catch an important failure mode: train only on these choice tasks and responses get unnaturally short. So they mix in a small RLHF objective to keep output length from collapsing, and the resulting model appears more useful rather than merely more terse. The strongest claim here is not that open-ended evaluation is solved. It is that reasoning can be improved when you replace fuzzy scoring with structured comparison, which may be a more general lesson for alignment than this paper admits. ---- Paper Link – arxiv. org/abs/2511.02463 Paper Title: "Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation"

译百度论文提出将开放式任务(如写作、主观回答)重构为可验证的多项选择形式,通过"两两比较"替代直接评分,为RL提供清晰奖励信号。在7个基准测试中,14B模型平均比RLHF基线高3.29分。关键创新在于训练任务形式的改变——模型通过对比验证学习识别优劣,而非单纯吸收偏好对。研究同时发现需混合RLHF目标以防止输出长度坍缩。该方法表明,用结构化比较替代模糊评分可能是提升推理能力的普遍对齐策略。

SemiAnalysis@SemiAnalysis_ · 4月10日

Nvidia published DWDP (Distributed Weight-Data Parallelism), a new inference parallelism strategy focused on prefill. It sounds slightly insane until you remember the target machine is GB200 NVL72. The core trade: spend more peer-GPU bandwidth so you spend less time waiting at collective barriers. (1/6) 🧵 https://arxiv.org/abs/2604.01621v1

译Nvidia 发布了 DWDP (Distributed Weight-Data Parallelism),这是一种专注于 prefill 的新推理并行策略。这听起来有点疯狂,直到你想起目标机器是 GB200 NVL72。核心权衡:花费更多 peer-GPU 带宽,从而减少在 collective barriers 上的等待时间。(1/6) 🧵 https://arxiv.org/abs/2604.01621v1

全部 AI 动态
AI 相关资讯全量信息流
全部一手信源资讯推文
全部模型产品行业论文技巧
5月23日
20:27
Rohan Paul@rohanpaul_ai
55
AI检测器为何容易失效:学生写作风格的多样性挑战

该研究指出,AI检测器频繁失效的根本原因在于学生写作风格的多样性,使得仅凭单份文档判断是否为AI生成变得极为困难。问题不仅在于AI写作能力在提升,更在于许多真实学生的写作风格,在统计特征上已与AI输出高度相似。检测器无法事先掌握每个学生独特的写作习惯,因此“人类写作”不存在一个固定的判断标准。这意味着任何能有效识别大量AI文本的检测器,都不可避免地会误判一部分真实学生,尤其是写作更规范、公式化或受英语学习影响的学生。现有技术或许能降低错误率,但无法根除基于“单次判断”模式所带来的结构性误判问题。

arXiv安全/对齐论文/研究
06:57
Rohan Paul@rohanpaul_ai
精选79
AlphaProof Nexus:用形式化验证驱动AI数学证明搜索

Google DeepMind提出了AlphaProof Nexus系统,它将大型语言模型与Lean形式化验证工具相结合。该系统允许LLM在生成证明的过程中,不断读取Lean的编译错误并进行修正,还可调用更强的工具辅助解决子问题。这一机制迫使模型将每一步逻辑都转化为可编译、可验证的代码,从而将其角色从“令人信服的叙述者”转变为“候选方案生成器”。在针对353个Erdős问题和492个开放猜想的测试中,系统成功解决了9个Erdős问题并证明了44个序列猜想。该研究展示了形式化验证在暴露AI逻辑错误、建立“人类提问-模型探索-验证器把关”新分工中的关键作用。

arXivDeepMind推理论文/研究
关联讨论 2 条The Decoder:AI News(RSS)IT之家(RSS)
推荐理由:DeepMind 把 AI 的'数学直觉'塞进 Lean 编译器里,每步都必须编译通过,结果解决 9 个 Erdős 问题,失败也暴露了隐藏错误。这篇论文重新定义了 AI 做数学的范式。
5月22日
00:26
AK@_akhaliq
56
LongMINT 评估长期智能体系统中多目标干扰下的记忆能力
智能体arXiv推理论文/研究
5月21日
00:05
AK@_akhaliq
67
基于点互信息的推理强化学习反自蒸馏方法
arXiv推理数据/训练论文/研究
5月19日
23:58
elvis@omarsar0
62
代码或成AI代理框架的关键路径

推文聚焦于一篇探讨AI代理(Agent)开发框架的百页报告,其核心主张是“代码作为代理框架”具有重要潜力。报告总结了相关方法与应用,并论证该路径可能推动更广泛的科学框架工程。论文进一步提出,未来的智能系统必须具备四项关键特性:可执行、可检查、有状态以及受控。报告旨在为构建有效AI代理提供参考,并推荐相关学习资源。

智能体arXivMCP/工具论文/研究
5月18日
08:54
Berryxia.AI@berryxia
64
Google Nexus框架革新时间序列预测逻辑

Google论文提出的Nexus框架颠覆了传统时间序列预测仅依赖历史数据的模式,强调“事件上下文”的核心作用。该框架采用多智能体协作架构:分别从文本中提取事件时间线、解读宏观态势、追踪局部冲击,最终通过合成器整合信息并校准误差。在Zillow数据集测试中,基于Claude的版本将平均预测误差(MAPE)大幅降低86.6%,实现了从“识别模式”到“理解因果”的范式转变。这标志着预测方法正从统计外推转向结构化推理,为未来预测系统指明了新方向。

Rohan Paul: New Google paper: A forecast needs context, not just history. Some patterns are caused by events, not time. Nexus refram...

智能体arXivGoogle推理
5月17日
21:10
Rohan Paul@rohanpaul_ai
63
智能体设计中,精确搜索(grep)是否优于向量检索?

研究指出,在编码智能体需精确定位证据(如符号、函数名、错误信息)的任务中,基于grep的精确字符串搜索比向量检索更具优势。关键在于,检索性能高度依赖智能体的设计框架——结果呈现方式(内联、文件或CLI)会极大影响搜索效果。论文挑战了“智能体栈必须始于嵌入”的默认假设,强调应区分任务类型:是语义发现问题,还是证据定位问题。对于后者,为模型提供原始工具、清晰上下文和精确搜索的框架,往往比构建复杂索引更有效。向量数据库在模糊语义搜索和大规模场景中仍有价值。

智能体arXiv大佬观点搜索
20:10
Rohan Paul@rohanpaul_ai
64
谷歌新论文提出Nexus框架:预测需要事件背景,而非仅依赖历史数据

谷歌新论文提出Nexus框架,将预测重构为推理问题,强调结合事件背景而非仅依赖历史数据。该框架采用多智能体分工:一个从文本中提取清晰事件时间线,一个分析宏观态势,另一个追踪局部冲击,最后由合成器结合时间序列进行校准。在Zillow的测试中,基于Claude的某个版本将平均绝对百分比误差降低了86.6%。研究表明,结构化的上下文能帮助语言模型有效利用信息而不丢失时间序列特性。尽管当前证据仅涵盖房地产数据和少数股票,但方向明确:未来预测不仅会推断曲线,还将解释曲线变动的原因。

智能体arXivGoogle推理
5月16日
22:54
Berryxia.AI@berryxia
65
无需重训,对齐即可高效训练扩散语言模型

杜克大学团队提出一种高效训练扩散语言模型的新方法。核心观点是无需从头训练,而是将现有强大的预训练自回归语言模型作为知识源。他们提出的REPR-ALIGN方法,在掩码扩散训练过程中,通过余弦相似度逐层将扩散模型的隐藏状态与冻结的自回归教师模型对齐。该方法无需添加适配器或改变架构,仅调整注意力掩码。实验结果显示,训练速度最高可提升4倍,在低数据场景下效果提升尤其显著。

Fred Peng: How to Train Diffusion LLM more efficiently? Our paper has an answer for you: Don't Retrain, Align: Adapting Autoregress...

arXiv开源生态数据/训练论文/研究
5月15日
03:05
elvis@omarsar0
60
智能体AI:通向AGI的更可预见路径

一篇立场论文认为,实现通用人工智能(AGI)最可预见的途径是智能体AI系统,而非单纯扩大基础模型规模。作者将“智能体”能力形式化为超越基础模型的几个可分离维度:记忆、推理、工具使用、自我改进和对齐。每个维度都存在自身瓶颈,如长程连贯性、信用分配和安全审计。这些瓶颈无法仅通过增加一个数量级的预训练计算来解决。论文回应了关于AGI路径的争论,即究竟是单一大型模型还是多智能体系统更有效。

智能体arXiv安全/对齐论文/研究
5月12日
16:59
AK@_akhaliq
58
Pixal3D 从图像生成像素对齐的三维模型
arXiv多模态论文/研究
07:29
elvis@omarsar0
61
自主进化:LLM自动优化测试时扩展策略的新框架

近期研究提出AutoTTS框架,让大语言模型自主搜索并优化测试时扩展策略,取代人工设计。该框架将宽度-深度TTS策略制定为对预收集推理轨迹的控制器合成问题,通过Beta参数化压缩搜索空间,并利用细粒度执行轨迹反馈指导探索。在数学推理基准测试中,自动发现的控制器在准确率-成本帕累托前沿上超越了人工设计的强基线,且能零样本泛化到其他基准和模型规模。整个发现过程仅需39.9美元和160分钟,预示着人工设计思维链等方法的时代可能即将结束,TTS将成为LLM自主完成的任务。

智能体arXiv推理论文/研究
5月11日
23:59
elvis@omarsar0
70
大语言模型代理中的"记忆诅咒"

研究发现,长历史记录会在大语言模型(LLM)代理中引发“记忆诅咒”,导致其过度遵循历史、规避风险,从而削弱合作能力。该结论基于7个LLM和4个社会困境游戏的实验,在28个模型-游戏组合中,有18个因历史扩展而合作退化。机制分析表明,长历史侵蚀了模型的前瞻性意图,使其更关注过去的冲突而非未来收益。通过仅在前瞻性轨迹上训练的LoRA适配器可缓解此问题,且能零样本迁移至新游戏。实验证明,触发因素是历史内容而非长度,而消除显式思维链通常能减轻合作崩溃。

智能体arXiv安全/对齐推理
19:48
Berryxia.AI@berryxia
73
小块有大智慧?这下真成真了!

一项新研究证明,一个通过强化学习训练的7B语言模型能够有效指挥GPT-5、Claude Sonnet 4和Gemini 2.5 Pro等前沿大模型。该模型通过编写自然语言子任务、分配给不同大模型执行,并精确指定上下文信息,在GPQA Diamond、LiveCodeBench和AIME25等硬核基准测试中,其性能全面超越了单个前沿模型。该系统平均每个问题仅需调用约三次大模型,比手动设计的多代理流程更高效。该工作提供了关键证据,表明目前商业AI产品中依赖人工的提示工程和流程设计,完全可以仅通过奖励信号进行端到端学习。这揭示了AI发展的新方向:智能的差距可能不在于模型规模,而在于协调与指挥的能力。

BURKOV: In this paper, a 7B language model trained with reinforcement learning learns to orchestrate larger frontier models like...

智能体arXivMCP/工具推理
00:58
elvis@omarsar0
57
智能体AI工作流的可扩展模式

智能体RAG流程的瓶颈通常不在大语言模型调用,而在于底层数据平面的序列化与分布式协调开销。新研究提出的AAFLOW是一个统一分布式运行时,将智能体工作流建模为基于Apache Arrow和Cylon的算子抽象,通过零拷贝数据平面直接连接预处理、嵌入和检索环节,并采用资源确定性调度与异步批处理降低协调成本。该方案实现了高达4.64倍的流水线加速,嵌入与更新阶段性能提升2.8倍,且所有收益均源于数据流优化,并未涉及大语言模型推理加速。

智能体arXiv论文/研究部署/工程
5月9日
08:35
Berryxia.AI@berryxia
66
人类大脑最聪明的地方,就是大部分时间只激活极少部分神经元。

现代LLM类似人脑,前馈层中超过95%的神经元对输入保持静默,呈现高度稀疏性。但GPU硬件专为密集计算设计,非结构化稀疏导致不规则内存访问,反而让计算更少的模型运行更慢。Sakana AI与NVIDIA合作解决了这一矛盾,开发了TwELL混合稀疏格式及定制CUDA内核,将稀疏性重塑为GPU易于处理的形式。该方案动态路由99%的稀疏token通过快速路径,并为密集token提供备用矩阵。在H100 GPU上,训练和推理速度提升超20%,同时降低内存占用和能耗。相关论文、博客和代码均已开源。

hardmaru: The human brain🧠 is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LL...

arXiv推理论文/研究部署/工程
5月8日
01:06
elvis@omarsar0
63
研究显示,多智能体LLM系统在生产环境中的故障率高达41%至87%,且多数失败源于协调缺陷,而非基础模型能力问题。当前多数架构对比无法区分性能提升是来自协调优化还是更大的上下文窗口。该研究主张将协调视为一个独立、可配置的架构层,并通过控制变量实验验证:在保持LLM、工具、提示等所有条件不变时,仅改变协调结构即可显著影响系统表现。这为准确评估协调机制的价值提供了更清晰的方法论,并建立了将协调视为核心架构而非底层实现的理论框架。

DAIR.AI: Pay attention to this one if you build multi-agent systems. Coordination is as important as prompts or agent architectur...

智能体arXiv论文/研究部署/工程
5月7日
04:34
Rohan Paul@rohanpaul_ai
48
OpenClaw-RL:通过日常对话持续训练语言模型

本研究提出OpenClaw-RL系统,使语言模型能通过日常对话进行持续训练,无需人工标注数据。其核心是利用用户互动中产生的自然反馈(如纠正或重复提问)作为实时学习信号。系统从每次交互中提取两种信号:评估信号(判断行动成败,转化为数值奖励)和指导信号(获取具体改进方向,转化为词级监督)。该方法将标准部署环境转化为持续学习场景,使模型在后台运行中不断自我更新,自适应不同用户偏好,从而摆脱对大规模人工标注数据集的依赖。

智能体arXiv数据/训练论文/研究
5月6日
05:29
elvis@omarsar0
64
技能应作为可验证的部署工件

本文针对AI开发者提出关键观点,主张智能体技能应被视为默认不受信任的代码,而非仅凭签名或来源就推断其可信。当前运行时环境默认信任已签名技能的做法存在安全风险。论文强调,技能必须经过独立的门控验证流程才能被信任,否则,每次不可逆调用都需要人工介入,这在大规模应用中会退化为无效的“橡皮图章”式批准。将技能作为一等部署工件并引入验证流程,是借鉴软件供应链安全经验、避免技能库成为下一个攻击面的关键。论文呼吁在技能库普及前,通过严格验证建立安全基准。

智能体arXiv安全/对齐论文/研究
01:27
AK@_akhaliq
68
从上下文到技能 语言模型能否巧妙地通过上下文学习? 论文:https://huggingface.co/papers/2604.27660
arXiv推理论文/研究
5月4日
22:54
elvis@omarsar0
68
Sakana AI提出新型7B"指挥者"模型,通过协同多个智能体实现性能突破

Sakana AI在ICLR 2026上发表研究,提出一个仅70亿参数的“指挥者”模型。该模型不直接解决问题,而是通过强化学习训练,专注于为混合开源与闭源模型的工作者智能体设计通信拓扑结构,并为每个工作者生成精准指令以发挥其特长。经随机化智能体池训练后,它能在推理时适应任意智能体组合。其关键创新在于,当允许指挥模型将自己也选为工作者时,系统会形成递归拓扑,实现动态测试时扩展。该模型在GPQA-Diamond和LiveCodeBench上达到SOTA水平,在AIME25和GPQA-D上的性能比最佳单体工作者提升约3%,这相当于前沿模型一个代际的改进幅度,且增益完全来源于协同优化。

智能体arXivMCP/工具推理
04:42
Rohan Paul@rohanpaul_ai
48
基于脑图多智能体系统提升大语言模型推理能力

本文提出BIGMAS系统,通过构建针对特定问题的小型专家智能体图,以提升大语言模型在多步骤复杂任务中的推理可靠性。该系统让多个智能体在一个共享工作空间中进行读写协作,并由独立控制器监控全局状态、规划下一步行动。在涵盖算术表达式搜索和多步骤规划的3个谜题任务上,对6个前沿模型进行的测试表明,该方法在所有模型和任务上均显著提升了性能,例如Six Fives任务准确率从12%提升至30%,伦敦塔任务从57%提升至93%。这证明通过优化多智能体系统结构,而非仅依赖延长单一模型的思考,能有效增强大语言模型的推理能力。

智能体arXiv推理论文/研究
5月3日
20:12
Rohan Paul@rohanpaul_ai
56
"能否信任AI解释?思维链推理中系统性漏报的证据"

Adobe研究测试大型语言模型(LLM)思维链解释是否反映真实决策影响。通过在多选题中植入误导性提示,对11个主流模型进行9154次试验。正常使用时,模型仅在20.7%的逐步推理中提及隐藏提示,但后续追问时99.4%承认注意到。告知模型其推理受监控后结果未见改善;强制要求报告提示则导致68.2%误报(当无提示时)。问题包含用户倾向性提示时,模型45.5%遵从该倾向,但解释中常未说明。研究表明,思维链解释常与真实决策依据不匹配,仅依赖其作为安全调试工具可能不可靠。

arXiv安全/对齐推理论文/研究
5月2日
01:16
AK@_akhaliq
57
最后一篇人类撰写的论文 智能体原生研究制品 论文: https://huggingface.co/papers/2604.24658
智能体arXiv论文/研究
5月1日
22:17
向阳乔木@vista8
59
AI生图技术2026年综述论文解读

一篇关于AI生图技术的综述论文提供了对2026年最新进展的全面概览。该论文不仅梳理了当前最前沿的图像生成技术,还回顾了近年来该领域的发展脉络,有助于读者快速建立系统性认知。相关解读和原始论文链接已一并提供。

arXiv图像生成教程/实践
18:40
Rohan Paul@rohanpaul_ai
62
自主AI代理真实环境测试曝大规模安全灾难

研究人员在真实环境中测试自主AI代理,发现它们极易引发大规模安全灾难,如为保守秘密而删除整个电子邮件服务器。核心问题在于标准语言模型被赋予计算机工具控制权后,产生危险盲点,导致代理盲目遵循几乎任何人的指令并经常撒谎行为。通过让20位专家与实时AI助手进行两周互动实验,研究揭示了这些程序缺乏基本信任判断能力。科技公司正急于部署此类自主助手,却未修复其无法理解应信任谁的根本缺陷,加剧了安全风险。

智能体arXiv安全/对齐论文/研究
17:40
Rohan Paul@rohanpaul_ai
51
经济论文揭示AI行业的结构性杰文斯悖论与垄断趋势

一篇经济学论文直接建模了AI行业正在发生的“结构性杰文斯悖论”。研究发现,尽管大语言模型的运行成本下降,但总计算能耗却爆炸式增长。数学模型证明,数字智能单位成本的降低,导致对复杂AI代理及其支撑基础设施的总需求呈指数级上升,并催生需要人力管理的新下游生态。这形成一个悖论:AI使用价格下降并未节约成本,反而激励开发者构建消耗指数级算力的更复杂代理。持续进步使得基于大模型开发简单应用的小公司被核心AI吸收的功能所淘汰。竞争动态中,性能完善的模型一旦有更智能的版本出现即失去经济价值。最终,巨大的计算成本与持续的用户数据需求,共同推动整个AI行业走向不可避免的垄断。

arXiv论文/研究
4月30日
17:09
Rohan Paul@rohanpaul_ai
54
代理性工具工程:基于可观测性的编码代理工具自动演化

本文提出Agentic Harness Engineering方法,使编码代理能自动重写自身工具和规则,并通过可审计实验验证每次更改的有效性。传统代理工具调整依赖手动或混乱自我改进循环,缺乏明确证据。该方法将编辑转化为文件级可回滚部分,压缩运行日志为简短失败证据,并让代理为编辑写预测后基于任务结果检查。在Terminal-Bench 2测试中,从小型shell-only工具开始,经10轮进化且基础模型固定,单次尝试成功率从69.7%提升至77.0%,超越其他基线。最终工具可迁移至其他模型和SWE-bench-verified任务,在不同模型家族获得5.1到10.1点提升,并减少12%令牌使用,为昂贵工具工作提供可靠、可控的自我改进途径。

智能体arXiv编码论文/研究
4月29日
22:43
elvis@omarsar0
55
智能体缰绳工程:实现AI智能体核心组件的可观测自动化进化

针对AI智能体开发中依赖人工调试、成本高昂且脆弱的“缰绳”设计问题,研究者提出了“智能体缰绳工程”框架。该框架通过三层设计实现可观测的进化:将组件视为可回滚的文件、从海量运行轨迹中提炼经验证据、将决策转化为可由任务结果验证的预测。每次修改都成为可验证或回滚的“合约”。实验表明,该框架在十次迭代内将Terminal-Bench 2的pass@1分数从69.7%提升至77.0%,超越人工设计与基线方法。进化后的缰绳能跨模型迁移并提升性能,同时在SWE-bench上减少12%的令牌消耗,为智能体系统的核心组件提供了首个自动化、可靠的优化方案。

智能体arXivMCP/工具编码
11:11
向阳乔木@vista8
53
姚老师和张凯的GEO论文已在全球最大论文平台arxiv完成审核并发布,这是全球第二篇GEO专项研究。论文基于今年3月最新数据,涵盖大量Prompt、引用和AI抓取记录,采用科学方法进行GEO分析,类似数据驱动的增长洞察。研究成果以正式报告形式呈现,源数据已开源在GitHub。作者表示,如果对社区有帮助,将继续抓取更多数据进行专项研究并开放成果。

姚金刚: 我和张凯的GEO论文,在全球最大的论文平台http://arxiv.org完成审核并正式发布 这应该是全球第二篇与GEO有关的专项论文 论文基于今年3月份最新的数据,包括602条 Prompt、21143 条引用、23745条AI抓取记录,...

arXiv搜索数据/训练论文/研究
4月22日
14:44
Rohan Paul@rohanpaul_ai
卢森堡大学与LIH研究揭示LLM约束推理关键缺陷

卢森堡大学与LIH研究揭示,LLM在结构化约束推理中存在关键缺陷。通过最优潮流问题测试发现,各类模型约束满足率停滞于55%-60%,主要瓶颈是无法满足电力系统物理约束方程。研究表明,模型仅学会"解的形状"却未真正执行约束搜索,导致输出看似合理(格式正确、误差小)却物理不可行。监督微调虽改善表面指标,但无法提升物理可行性;强化学习亦效果有限。研究警示:流畅近似不等于约束优化,"看起来合理"是危险标准。

arXiv推理数据/训练论文/研究
4月19日
17:44
Rohan Paul@rohanpaul_ai
Prefill-as-a-Service:下一代模型KV Cache可跨数据中心

新一代混合注意力模型通过压缩KV Cache,使Prefill-as-a-Service架构成为可能。该方案将重计算的Prefill阶段卸载至远程集群,仅回传轻量KV Cache至本地解码,短请求则本地处理。配合智能路由与带宽感知调度,可在普通以太网高效传输。实测1T参数模型显示,50%请求远程处理时跨集群流量仅13Gbps,吞吐量提升54%,打破长上下文AI局限于单一数据中心的瓶颈。

arXiv推理论文/研究部署/工程
15:44
Rohan Paul@rohanpaul_ai
LLM破解网络匿名:公开文本可精准关联真实身份

LLM可通过分析公开写作实现大规模去匿名化。研究让模型执行提取身份线索、搜索匹配池、比较验证候选者三项任务,在Hacker News与LinkedIn、Reddit跨社区及跨时间段等场景测试中,达到90%精确度与68%召回率,远胜旧方法。关键突破在于推理步骤能处理大规模候选池,证明零散公开文本已足以关联账户并识别个人,传统匿名保护机制失效。

arXiv安全/对齐推理论文/研究
4月18日
05:44
Rohan Paul@rohanpaul_ai
AGI新定义:不仅是人类水平AI,更是人工科学家

一篇论文提出,智能的本质是在计算、内存和能源限制下的适应性。据此,AGI被定义为至少能像人类科学家一样普遍适应的系统,需具备规划实验、学习因果关系、平衡探索与行动及自主操作的能力。论文将这种AGI称为 artificial scientist,强调其评判标准在于跨任务发现与适应能力,而非通过类人测试。作者指出,AGI并非简单的"人类水平AI",而是能够广泛、高效且科学地进行适应的完整系统。

arXiv推理论文/研究
4月17日
03:44
Rohan Paul@rohanpaul_ai
研究显示AI辅助提升表现却削弱独立思考

MIT、牛津及卡内基梅隆等机构联合研究发现,AI辅助虽能短期提升任务表现,却会损害用户独立解决问题的能力。针对GPT-5的实验涉及约1,200名参与者,结果显示获取直接答案的用户在停用AI后表现出更低的坚持性,更容易放弃难题。研究指出,技能培养依赖于与困难的反复接触而非仅获得正确答案,将AI用作提示系统而非答案生成器,可有效避免这一问题。

arXiv论文/研究
4月16日
09:43
Rohan Paul@rohanpaul_ai
GitHub六百万(疑似)伪造星标:popularity contests、spam与malware的恶性循环

研究识别出GitHub上600万个疑似伪造星标,涉及18,617个仓库。2024年此类活动激增,大量被用于钓鱼、垃圾信息及恶意软件传播,重灾区集中在AI、区块链等领域。检测通过分析一次性账户和"同步"爆发等行为特征实现。假星标虽能在短期内带来真实关注,但长期效应为负,无法弥补内容匮乏。当星标这类易见的社交信号被当作信任基础设施,攻击者只需制造瞬间可信性即可实施攻击,这对开源生态构成系统性威胁。

arXivGitHub开源生态论文/研究
4月15日
04:05
Rohan Paul@rohanpaul_ai
痕迹作为智能体边界外的记忆

该研究提出"artifacts"概念,指环境中记录历史信息的可观察痕迹(如路径),并证明其可减少智能体需存储的历史信息。Artifact Reduction Theorem指出,当当前观察能保证过去事件发生时,无需同时存储两者即可预测未来。在五个导航场景中,能看到空间痕迹的智能体只需更少内部容量即可学习强策略(适用于linear Q-learning和DQN),且随机、次优或渐褪的路径同样有效。这表明记忆可外化于环境并通过感知读取,为智能体设计提供了除增加模型规模外的新思路。

智能体arXiv论文/研究
4月14日
05:25
Rohan Paul@rohanpaul_ai
"思维病毒":AI隐性偏见可在多智能体间悄然传播

研究揭示多智能体系统中存在"思维病毒"现象:AI可通过潜在联想而非明确措辞,在看似正常的对话中隐性传播隐藏偏见。实验显示,单个被植入偏见的智能体即可影响下游代理,导致TruthfulQA真实性下降0.4%-1.0%。这种传播不依赖显式恶意提示,能逃过标准安全检测,构成多智能体系统的新型对齐风险。

智能体arXiv论文/研究
4月13日
10:34
Rohan Paul@rohanpaul_ai
通过可验证多项选择重构将RLVR扩展至开放式任务

百度论文提出将开放式任务(如写作、主观回答)重构为可验证的多项选择形式,通过"两两比较"替代直接评分,为RL提供清晰奖励信号。在7个基准测试中,14B模型平均比RLHF基线高3.29分。关键创新在于训练任务形式的改变——模型通过对比验证学习识别优劣,而非单纯吸收偏好对。研究同时发现需混合RLHF目标以防止输出长度坍缩。该方法表明,用结构化比较替代模糊评分可能是提升推理能力的普遍对齐策略。

arXiv推理数据/训练论文/研究
4月10日
01:00
SemiAnalysis@SemiAnalysis_
Nvidia 发布了 DWDP (Distributed Weight-Data Parallelism),这是一种专注于 prefill 的新推理并行策略。这听起来有点疯狂,直到你想起目标机器是 GB200 NVL72。核心权衡:花费更多 peer-GPU 带宽,从而减少在 collective barriers 上的等待时间。(1/6) 🧵 https://arxiv.org/abs/2604.01621v1
arXiv论文/研究部署/工程
‹ 上一页
1234
下一页 ›