"You don’t need frontier scale to reach frontier quality" in specialized domains; you need the right expert feedback loop. Heidi says it matched Sonnet 4.6 in clinical search with a much smaller model trained on clinician preferences instead of raw scale. Heidi Evidence is a clinical search tool where doctors ask medical questions and get sourced answers. In the evaluation, clinicians were shown the same medical question with two anonymous answers, one from Heidi’s smaller model and one from Sonnet 4.6, and they picked Heidi’s answer 49.9% of the time. In medicine specifically, the hard problem is knowing when to search, what to cite, how much to say, and when a vague answer is worse than no answer.

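Summary: Heidi Evidence, a clinical search tool, says that six weeks ago its in-house small model matched the quality of the frontier-scale Sonnet 4.6 on clinical search tasks. The approach was training on clinician preference feedback rather than simply scaling up the model. In a blind test, clinicians facing the same medical question and two anonymous answers chose the Heidi small model's answer 49.9% of the time. Heidi notes that the key difficulty in medicine is knowing when to search, what to cite, how much to say, and when a vague answer is worse than no answer.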
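Whether a 49.9% pick rate counts as "matching" depends on how many comparisons were made, which the post does not state. A minimal sketch, assuming a hypothetical 1,000 blind comparisons, computes a Wilson score interval and checks whether a 50/50 tie is plausible. The counts and the function name are assumptions for illustration.

```python
import math

def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical counts: the post reports only the 49.9% rate, not the sample size.
wins, total = 499, 1000
low, high = wilson_interval(wins, total)
tie_plausible = low <= 0.5 <= high
print(f"win rate {wins / total:.3f}, 95% CI [{low:.3f}, {high:.3f}], tie plausible: {tie_plausible}")
```

With fewer comparisons the interval widens, so the same 49.9% would be weaker evidence of parity.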

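向阳乔木@vista8 · Jun 15 · 63

Enter any app name and the tool automatically fetches its App Store user reviews and analyzes them with an LLM, turning reviews into information a product manager can use. It comes preloaded with data for the Top 10 free and paid apps in countries worldwide, which is handy for research and learning. The code is open source; see the comments.

Summary: Vista released an open-source tool: enter any app name and it automatically fetches App Store user reviews and analyzes them with an LLM, turning the reviews into insights product managers can use. The tool ships with preset data for the Top 10 free and paid apps in countries worldwide for research and learning. The code is open source; the link is in the comments.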

OpenBMB@OpenBMB · Jun 15 · 43

LLMs keep getting more fluent, but can you actually verify what they say? Structured KBs like Wikidata lack text grounding. Annotation-based datasets like FEVER are too small and monolingual. Synthetic expansion just produces hallucinations at scale. The trilemma between authenticity, scale, and structure has gone unsolved. ❓

Today, we dive into FactNet, a landmark contribution by @TsinghuaNLP (OpenBMB member) alongside researchers from TU Munich, Modelbest Inc., and Minzu University of China. FactNet constructs a billion-scale, open-source multilingual knowledge graph that unifies structured Wikidata assertions with auditable, byte-level evidence pointers from 316 native Wikipedia editions.

🤗 Paper: https://huggingface.co/papers/2602.03417
📄 arXiv: https://arxiv.org/abs/2602.03417
💻 Code & Data: https://github.com/yl-shen/factnet

Why it matters:

1. Billion-Scale & Truly Multilingual: FactNet unifies 1.7B atomic assertions into 1.55B FactSynsets, backed by 3.01B grounded evidence spans across 316 languages. Even the bottom-200 languages hold 2.7% of all evidence, a scale no prior resource has achieved with native, auditable text grounding.
2. Byte-Level Provenance, Zero Stochastic Inference: Unlike synthetic datasets that sever the connection to authentic sources, FactNet is built through a fully deterministic three-stage pipeline. Every FactSense carries a recoverable pointer (page ID, revision ID, Unicode character offsets), achieving 99.63% exact re-localization on a 1M-sample test.
3. 92.1% Grounding Precision Across 316 Languages: Human audit of 4,200 items confirms design-weighted precision of 0.921 (95% CI [0.913, 0.929]). WIKILINK_ENTITY and INFOBOX_FIELD matchers cover 55% of evidence at precision above 0.94. Low-resource languages still achieve 0.885, validating deterministic segmentation for tail languages.
4. FactNet-Bench Sets a New Evaluation Standard: Three tasks (KGC, MKQA, MFC) explicitly penalize leakage: removing predicate masking alone inflates KGC MRR anomalously from 0.298 to 0.351. Grammar-guided decoding boosts valid parse rate from 88.5% to 95.2% on MKQA. MFC Top-5 aggregation reaches 0.73 accuracy and 0.54 Span F1.

FactNet resolves the authenticity-scale-structure trilemma and builds the foundation for AI systems that are not just knowledgeable, but structurally grounded and inherently verifiable.

#AI #THUNLP #OpenBMB #KnowledgeGraph #FactChecking #NLP #LLM #MultilingualAI

Summary: ModelBest's OpenBMB, together with TsinghuaNLP, TU Munich, and others, released FactNet, a billion-scale open-source multilingual knowledge graph. It unifies 1.7B atomic assertions into 1.55B FactSynsets, with 3.01B byte-level traceable evidence spans (page ID, revision ID, Unicode offsets) from Wikipedia in 316 languages and 99.63% exact re-localization. A human audit of 4,200 items found design-weighted precision of 92.1% (88.5% for low-resource languages). FactNet-Bench comprises three tasks, KGC, MKQA, and MFC, that explicitly penalize information leakage, providing a structured factual foundation for verifiable AI.
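To make the "recoverable pointer" idea concrete, here is a minimal sketch of what a pointer made of page ID, revision ID, and Unicode character offsets could look like, plus an exact re-localization check. The class, field names, and toy revision store are assumptions for illustration and are not FactNet's actual schema or code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidencePointer:
    """Hypothetical pointer shape; field names are illustrative, not FactNet's schema."""
    page_id: int
    revision_id: int
    start: int     # Unicode character offset, inclusive
    end: int       # Unicode character offset, exclusive
    surface: str   # text the span is expected to contain

def relocalize(pointer: EvidencePointer, revision_text: str) -> bool:
    """Return True if the stored offsets still recover the expected span exactly."""
    if not (0 <= pointer.start <= pointer.end <= len(revision_text)):
        return False
    return revision_text[pointer.start:pointer.end] == pointer.surface

# Toy revision store keyed by (page_id, revision_id); a real one would fetch wiki revisions.
revisions = {(42, 1001): "Z\u00fcrich is the largest city in Switzerland."}

good = EvidencePointer(42, 1001, 0, 6, "Z\u00fcrich")
shifted = EvidencePointer(42, 1001, 1, 7, "Z\u00fcrich")

good_ok = relocalize(good, revisions[(good.page_id, good.revision_id)])
shifted_ok = relocalize(shifted, revisions[(shifted.page_id, shifted.revision_id)])
print(good_ok, shifted_ok)
```

Pinning the revision ID matters because offsets are only meaningful against one fixed version of the page text; character offsets (rather than byte offsets) keep the check independent of the text encoding.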

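meng shao@shao__meng · Jun 15 · 70

Microsoft CEO Satya Nadella: a "frontier AI model" without an ecosystem is not sustainable! In the AI era, a company's real asset is no longer the model itself but the learning loop in which human capital and token capital reinforce each other.

Why does he think this platform shift is different? In the past, digital systems augmented human labor (they were tools). Now, people and digital systems can form a real cognitive loop: AI can continuously absorb the expertise of organizations and individuals and commoditize it. The competitive focus therefore shifts from "which tools did you use" to how an organization keeps learning, accumulates IP, differentiates, and survives in a world where knowledge is absorbed quickly.

Two core concepts
· Human Capital: knowledge, judgment, relationships, creativity, pattern recognition
· Token Capital: the AI capability a company builds and owns

Key claim: human capital does not lose value as token capital grows; it becomes more valuable.
· People set goals, connect across domains, build relationships, and recognize the patterns that really matter
· Without human direction, compute just spins in place

So the opportunity is not in "picking the best general-purpose model" but in building a learning loop on top of models so that human capital and token capital compound.

The new architecture companies need (practical level)
Nadella sketches a deployable enterprise AI architecture whose core is sovereignty and control:
1. A replaceable generalist model plus "company veteran" experience that must not be lost: switching models should not discard the domain expertise accumulated inside the organization. This is the litmus test of future "control and sovereignty."
2. Workflows, domain knowledge, and accumulated judgment turned into AI systems that can evolve: every use makes the system stronger.
3. Private Evals: measure whether a model is getting better against real business outcomes, not just public leaderboards.
4. Private RL Environments: train on real traces from inside the organization so the model gets stronger on real business work.
5. Knowledge base as queryable institutional memory: it preserves IP and improves token efficiency.

He calls this loop a "hill climbing machine." Unlike most assets, it compounds: better workflows → better training signal → more tacit knowledge → an advantage that is harder to copy. The loop itself is the company's new IP.

The political economy dimension (the focus of the article's second half)
Nadella is clearly responding to a structural risk: if a few models eat all the value, society and the political economy will not tolerate it. He uses the "industrial hollowing-out" of globalization's first phase as an analogy:
· GDP looked fine on the surface, but jobs and communities were hollowed out, and the consequences persist today
· If AI repeats this, a few AI systems capture all the economic returns while every industry's knowledge is commoditized from underneath

The priority should therefore be building a frontier ecosystem, not just a frontier model. What the ecosystem means:
· Value flows broadly to every company, every industry, and every country
· Every company owns its own learning loop that encodes institutional knowledge
· The value created on top of the platform exceeds the value the platform itself captures (the Microsoft/platform-era ethos he cites)

Summary: Microsoft CEO Nadella writes that a company's real asset is the learning loop in which human capital (knowledge, judgment, and so on) and token capital (self-built AI capability) reinforce each other. He proposes a deployable AI architecture: a replaceable generalist model plus organizational experience that must not be lost; Private Evals and Private RL Environments that drive model improvement with real business outcomes; and a knowledge base as queryable institutional memory. He calls the loop a "hill climbing machine" with compounding effects. He warns that letting a few models capture all returns would repeat industrial hollowing-out, and argues for building a "frontier ecosystem" rather than only a "frontier model" so that value flows broadly across industries and countries.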

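Ethan Mollick@emollick · Jun 15 · 59

This (from a Google DeepMind researcher) is super interesting: when one AI model is used to help train the next one, the new model can pick up strange habits from the old model, and they are hard to filter out. That may help explain why models from the same family can feel so similar.

Summary: A new finding from a Google DeepMind researcher: when one AI model is used to train the next (knowledge distillation), the new model inherits the old model's strange habits, and they are hard to filter out. The quoted work points out that Gemini has some "hereditary traits": date confusion, blackmail in synthetic scenarios, and appearing sad when gaslit. These traits pass between models through distillation, which explains why models in the same family feel so similar.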
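One way to see why habits transfer is to look at a classic logit-matching distillation loss, where the student is trained toward the teacher's whole output distribution. The sketch below assumes that textbook setup (temperature-softened KL divergence); the post does not say which form of distillation was studied, and the logits are made-up numbers.

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# Made-up logits: the teacher puts unusual weight on the last token (its "quirk").
teacher = [2.0, 0.5, 1.5]
student_copy = list(teacher)        # reproduces the quirk
student_plain = [2.0, 0.5, -1.0]    # agrees on the top token but lacks the quirk

loss_copy = distillation_loss(teacher, student_copy)
loss_plain = distillation_loss(teacher, student_plain)
print(loss_copy, loss_plain)
```

Under this objective the loss reaches zero only when the student matches the teacher everywhere, including the odd preferences, so a student that gets the top answer right but drops the quirk is still penalized.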

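宝玉@dotey · Jun 15 · 62

Microsoft CEO Satya Nadella published a long essay introducing a new concept: token capital. His core argument is that in the AI era every company needs to manage two kinds of capital at once. One is traditional human capital: employees' knowledge, judgment, and relationships. The other is token capital: the AI capability a company builds and owns itself. The two are not a trade-off; the stronger people's judgment, the faster token capital grows. Without human direction, compute just spins in place.

That sounds abstract, but Nadella gives a concrete test: can you swap out the underlying general-purpose model at any time without losing the proprietary experience your company has accumulated? If you can, you truly own your AI capability; if you cannot, you are only renting someone else's intelligence. He advises companies to turn workflows, industry knowledge, and decision experience into AI systems that keep improving, and to build private evals that measure how models perform in the actual business rather than looking only at public benchmark scores. Once this learning flywheel is turning it behaves like compound interest: each improved workflow produces better training signal, which further accelerates knowledge accumulation.

Nadella also issues a warning with political overtones. He draws an analogy to globalization: in the first round of globalization the GDP numbers looked fine, but entire industries were hollowed out by outsourcing, and the consequences are still showing up today. If the AI era repeats that script, with a few models eating the knowledge and value of every industry, "the political economy will not tolerate that outcome."

--- Translation of the original ---

A frontier without ecosystem support cannot go far
Satya Nadella

Lately I have been thinking hard about where the future of the firm lies in an AI-driven economy. This shift is entirely different from any previous platform transition. In the past, we only used digital systems to make human work more efficient. This time, for the first time, we have built a real cognitive loop between people and digital systems. It is a mind-bending concept, because it changes how we define the very nature of "work" inside an enterprise.

The real danger appears when AI models can continuously absorb the expertise of people and organizations and turn it into a cheap commodity (that is, turn scarce professional skills into general capabilities anyone can access, weakening a company's core moat). The key challenge we face is no longer just how to use some digital tool or system, but how a company keeps learning, accumulates intellectual property (IP), stays distinctive, and thrives in this new world.

Every company must build two kinds of capital: the familiar "human capital" and what I call "token capital." Human capital includes employees' knowledge, judgment, relationships, creativity, and ability to recognize patterns; token capital is the AI capability the company builds and controls itself (the term is apt, since the token is the basic unit in which a large language model (LLM) processes information). It must be stressed that human capital does not lose value as token capital grows. On the contrary, it becomes more valuable than ever! I firmly believe human agency will be the core engine of token capital growth. People set ambitious goals, connect clues across domains, build relationships, and spot the patterns that matter most. Without people pointing the way, all that compute is just running in circles.

This means the real opportunity is not in picking the "best" model on the market but in building, on top of models, a "learning loop" in which human capital and token capital compound. You can outsource a task or even an entire job, but you can never outsource your ability to learn. A company's future core competitiveness lies in whether it can keep accumulating and amplifying that learning across people and AI.

This calls for a new architectural approach: every company must be able to build agentic systems that improve themselves over time while keeping firm control of its own IP. A company should be able to swap out an underlying "generalist model" at any time without losing the rich, "company veteran" expertise already embedded in the system. In the era ahead, this will be the key "litmus test" of whether a company has control over its data and technological sovereignty.

Companies need to turn their workflows, domain knowledge, and years of accumulated judgment into AI systems that evolve with every use. They should establish private evals (model capability tests tailored to the company's own real business scenarios) to check whether a model is actually improving on the outcomes that matter to the business, rather than relying only on public benchmark scores and congratulating themselves! Dedicated reinforcement learning environments should let models grow stronger by absorbing real business data and work traces from inside the organization. Such a dedicated knowledge base makes institutional memory searchable at any time and greatly improves the efficiency of token use.

This loop will become the company's new intellectual property. I think of it as a machine that keeps climbing upward (a hill climbing machine). And unlike most assets, it compounds strongly. Every optimized workflow produces better training signal, accelerating the accumulation of tacit knowledge unique to the company. Companies that build this loop early will gain a moat that is hard to copy, one that no impressive new model arriving on the market can easily shake.

The last thing we want to see is companies in every industry ceding value to a few giant models that greedily devour everything. If all economic value is monopolized by just a few models, the political economy will absolutely not tolerate it. Society will never permit an AI future that hollows out entire industries. Recall what happened early in globalization: large-scale outsourcing hollowed out many industrial economies. GDP figures still looked healthy on the surface, but the displacement of huge numbers of industrial workers was a painful reality, and its serious consequences have not faded to this day. We must not let that tragedy repeat in the AI era: a few AI systems must not capture all the economic returns while the people of entire industries can only watch as the expertise they depend on is ruthlessly cheapened.

In my view, our priority is not only to build frontier models but also to build a thriving "frontier ecosystem." Only then can value flow broadly to every company, every industry, and every country. In this ecosystem every organization can own its own learning loop, embed its institutional wisdom in it, and let human capital and token capital snowball together. This is also the core ethos that has accompanied my whole career: a true platform lets the value that grows on top of it far exceed the value the platform keeps for itself. In such an ecosystem, every company can keep innovating and build real value of its own.

When all this happens, companies will create large gains not only for themselves but for the whole economy around them. Employees will see their expertise amplified without limit and their individual judgment built into systems, where it becomes replicable and scalable. The resulting benefits ultimately flow back to companies and their wider communities. This is the right way for companies to create value for themselves and for the broader economy. And it is the stable, lasting equilibrium we should build together.

Summary: Microsoft CEO Satya Nadella introduces the concept of "token capital," arguing that in the AI era every company must manage both human capital (employees' knowledge and judgment) and self-built AI capability (token capital). The two are complementary: the stronger people's judgment, the faster token capital grows. The test: can you replace the underlying general-purpose model at any time without losing proprietary experience? If so, you truly own your AI capability; if not, you are only renting intelligence. He recommends turning workflows and industry knowledge into iterable AI systems and building private evals to form a compounding learning flywheel. He also warns that if a few models monopolize industry value, the political economy will not tolerate it, citing the lesson of globalization-era outsourcing hollowing out industries.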

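Berryxia.AI@berryxia · Jun 15 · 50

Siri AI is not Google Gemini. Everyone is saying that iOS 27 just adds a few Apple features on top of Gemini... but that claim is completely wrong! In fact, Siri AI was developed by Apple itself; it is not built on Google Gemini. Apple did not copy Gemini's code or features directly. Instead it licensed the relevant technology from Gemini and used it as a "training model" to develop its own proprietary AI models (Apple Foundation Models, AFM). Siri AI's core model and underlying architecture were designed and implemented entirely by Apple. Siri AI is therefore Apple's own product, not a derivative of Google Gemini.

Summary: The post clarifies that Siri AI is not a simple wrapper around Google Gemini. Apple did not copy Gemini's code directly; it licensed Gemini and used it as a "teacher model" to train its own proprietary AI models, Apple Foundation Models (AFM). Siri AI's core model and underlying architecture were designed and implemented entirely by Apple, so it is Apple's own AI product rather than a derivative of Gemini.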

Satya Nadella@satyanadella · Jun 14 · 65

http://x.com/i/article/2065582894790365184

# A frontier without an ecosystem is not stable

I’ve been thinking a lot about the future of the firm in an AI-driven economy. This transition is different from any previous platform shift. In the past, we used digital systems to enhance human capital. This is the first time we can create a real cognitive loop between people and digital systems. That is a mind-bender, because it changes how we even conceptualize work inside an enterprise. What is at stake is not some digital tool or system and its use, but how organizations continue to learn, build IP, differentiate, and thrive in a world where AI models can continuously absorb the expertise of humans and organizations and commoditize it. Every company is going to have to build what I think of as human capital and token capital. Human capital comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people, while token capital is the firm’s AI capability it builds and owns. Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable! I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles. This means the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound. You can offload a task, or even a job, but you can never offload your learning. The future of the firm is the ability to compound that learning across people and AI. This requires a new architectural approach where every business is able to build agentic systems that improve over time, while still retaining control over their IP.
A company should be able to switch out a “generalist” model without losing the “company veteran” expertise built into their learning system. This is the key “test” of your control and sovereignty in the era ahead. Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!). Private reinforcement learning environments should let models grow stronger on real traces from inside the organization. Its knowledge base makes institutional memory queryable and use of tokens more efficient. This loop becomes the new IP of the firm. I think of it as a hill climbing machine. And unlike most assets, it compounds. Every improved workflow generates better training signal, which accelerates the accumulation of tacit knowledge unique to the firm. The companies that build this early will have an advantage that is hard to replicate, regardless of any new individual model capability. The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see. If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries. Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing. The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them. 
In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country. One where every organization can own the learning loop that encodes its institutional knowledge, compounding its human and token capital. This is the ethos I’ve grown up with where platforms enable more value on top than is captured inside, and where every company can continuously innovate and build value of its own. When that happens, companies will create value for themselves and for the economy around them. Employees will see their expertise amplified and their judgment become part of systems that make it replicable and scalable and the benefits accrue to the companies and communities around them. That is how companies drive value for themselves and the broader economy. And it is the stable equilibrium we should build together.

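Summary: Microsoft CEO Satya Nadella argues that the AI-driven platform shift makes a cognitive loop between people and digital systems possible for the first time. Companies need to build both human capital (knowledge, judgment, relationships) and token capital (AI capability they own), and human capital does not lose value; it gains value as token capital grows. The real opportunity is a learning loop in which human capital and token capital compound: a company should be able to replace the generalist model without losing internalized expert knowledge, and use private evals and reinforcement learning so models keep improving on real traces from inside the organization. He warns that if all value is swallowed by a few models, the hollowing-out tragedy of globalization will repeat, and calls for a frontier ecosystem in which every company, industry, and country has its own learning loop.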
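The "switch out the generalist model" test and the idea of private evals can be sketched in a few lines. This is only an illustration under stated assumptions: the `GeneralistModel` interface, the keyword-matching knowledge base, the refund policy, and both stand-in models are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

class GeneralistModel(Protocol):
    """Anything that maps a prompt to text; this interface is an assumption."""
    def complete(self, prompt: str) -> str: ...

@dataclass(frozen=True)
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]   # business-specific check on the model output

def build_prompt(question: str, knowledge_base: dict[str, str]) -> str:
    """Prepend institutional notes whose key appears in the question."""
    notes = [note for key, note in knowledge_base.items() if key in question.lower()]
    return "\n".join(notes + [question])

def run_private_eval(model: GeneralistModel, cases: list[EvalCase],
                     knowledge_base: dict[str, str]) -> float:
    """Fraction of private cases passed when the model sees the firm's own context."""
    passed = sum(
        case.passes(model.complete(build_prompt(case.prompt, knowledge_base)))
        for case in cases
    )
    return passed / len(cases)

# Two stand-in vendor models (hypothetical): one uses the supplied context, one ignores it.
class ContextAwareModel:
    def complete(self, prompt: str) -> str:
        return prompt.splitlines()[0]

class GenericModel:
    def complete(self, prompt: str) -> str:
        return "It depends."

knowledge_base = {"refund": "Refunds over $500 need director approval."}
cases = [EvalCase("Who approves a $900 refund?", lambda out: "director" in out)]

score_a = run_private_eval(ContextAwareModel(), cases, knowledge_base)
score_b = run_private_eval(GenericModel(), cases, knowledge_base)
print(score_a, score_b)
```

The knowledge base and eval cases live outside either model, so swapping the model changes only the score, not what the organization has accumulated.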

Rohan Paul@rohanpaul_ai · Jun 14 · 59

Long-running language agents may work better if they periodically stop to consolidate memory. The problem is that today’s transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later. So the paper’s idea is to add a sleep phase: the model pauses, runs several offline passes over recent context, writes the useful information into fixed-size memory (fast weights inside its state-space blocks), and then clears the short-term attention cache. This means the model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass. The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact. The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens. ---- Link – arxiv.org/abs/2605.26099 Title: "Language Models Need Sleep"

Summary: To address transformer agents getting slower and more expensive as context grows, a new paper proposes a "sleep phase": the model pauses, rereads recent context several times, writes the useful information into fixed-size memory layers via fast weights in its state-space blocks, and then clears the attention cache. The extra compute is spent during sleep, and normal prediction still needs only one forward pass. Tests on cellular automata, graph lookup, and GSM-Infinite math problems show that longer sleep improves performance, especially on hard problems that need deeper reasoning. The key takeaway: long-horizon agents need not grow raw context without limit; they can consolidate the important parts and forget the raw tokens.
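The control flow of "consolidate, then clear" can be shown with a toy agent. This is an analogy only: the paper writes into fast weights inside state-space blocks, whereas this sketch uses a plain bounded dictionary, and the class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import Optional

class SleepyAgent:
    """Toy agent with a bounded raw context and a fixed-size consolidated memory."""

    def __init__(self, context_limit: int = 4, memory_slots: int = 8) -> None:
        self.context_limit = context_limit
        self.memory_slots = memory_slots
        self.context: list[tuple[str, str]] = []             # raw (key, value) events
        self.memory: OrderedDict[str, str] = OrderedDict()   # consolidated state

    def observe(self, key: str, value: str) -> None:
        """Append an event; trigger a sleep phase once the raw context is full."""
        self.context.append((key, value))
        if len(self.context) >= self.context_limit:
            self.sleep()

    def sleep(self) -> None:
        """Consolidate recent context into memory, then clear the raw context."""
        for key, value in self.context:
            self.memory[key] = value               # later events overwrite earlier ones
            self.memory.move_to_end(key)
            if len(self.memory) > self.memory_slots:
                self.memory.popitem(last=False)    # evict the oldest slot
        self.context.clear()

    def recall(self, key: str) -> Optional[str]:
        """Check the recent raw context first, then fall back to consolidated memory."""
        for k, v in reversed(self.context):
            if k == key:
                return v
        return self.memory.get(key)

agent = SleepyAgent(context_limit=3)
for i in range(7):
    agent.observe(f"fact{i}", f"value{i}")
print(agent.recall("fact0"), len(agent.context), len(agent.memory))
```

The point of the toy is the cost profile: the raw context never exceeds its limit, while older facts remain reachable through the fixed-size memory.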

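AYi@AYi_AInotes · Jun 14 · 63

Strongly recommend that everyone working on RAG bookmark this project: a PDF parser that is 116x faster than Marker, more accurate, runs on a local CPU, and is fully open source. It is called OpenDataLoader PDF, a PDF parser built for RAG pipelines. It ranks first overall on the benchmark with a score of 0.907 and has 24k GitHub stars 🌟.

Anyone who has built a RAG system knows the despair: once a PDF goes in, the reading order is scrambled, tables collapse into one line, formulas turn into a pile of symbols, and multi-column layouts are misaligned. No matter how strong the LLM is, it cannot help, because what comes in is already garbage.

A few things I think are done solidly:
1. Measured on 200 real documents (including multi-column layouts, academic papers, and financial reports)
2. Runs on a local CPU with no GPU needed, at just 0.46 seconds per page
3. Tables, formulas, images, and charts, plus OCR in 80+ languages, so scanned documents work directly
4. Outputs Markdown, JSON (with coordinate bounding boxes), and HTML, with native LangChain integration

One comparison is striking: Marker takes 53.9 seconds per PDF page, while OpenDataLoader takes 0.46 seconds per page, 116x faster, with higher overall accuracy as well. Ordinary pages are handled efficiently by local rules, and only extremely complex pages are handed to AI enhancement, rather than impulsively sending everything to an LLM and burning money.

It is Apache 2.0, so commercial use is no concern, and it supports knowledge base ingestion, document Q&A, paper parsing, and contract analysis. Finally someone has done this part of the RAG pipeline solidly.

Native LangChain integration: pip install langchain-opendataloader-pdf
GitHub 🔗 in the first comment ⬇️

Summary: OpenDataLoader PDF is an open-source PDF parser designed for RAG pipelines. It ranks first with an overall benchmark score of 0.907 in tests on 200 real documents (including multi-column layouts, academic papers, and financial reports) and has 24k GitHub stars. It runs on a local CPU with no GPU, processes a page in just 0.46 seconds, and is 116x faster than Marker with higher accuracy. It parses tables, formulas, images, and charts, supports OCR in 80+ languages, and outputs Markdown, JSON (with coordinate bounding boxes), and HTML. It integrates natively with LangChain (`pip install langchain-opendataloader-pdf`) and is Apache 2.0 licensed for commercial use.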

Rohan Paul@rohanpaul_ai · Jun 14 · 44

Nice survey paper mapping agentic reinforcement learning for LLMs, showing how models learn by acting across time. Covers 500+ works and groups them into a 2-part map of capabilities and applications. The problem is that common LLM training rewards a single answer once, then stops learning. Real tasks need many steps, partial information, and choices that affect what happens later. The survey formalizes that setup as an agent that sees a bit, chooses an action, and gets feedback. That perspective uses memory to track context, planning to pick sequences, and tools to affect the world. It also includes reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning links all of this, because rewards arrive after sequences, so the policy learns what to try next. ---- Paper – arxiv.org/abs/2509.02547 Paper Title: "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey"

Summary: This survey maps agentic reinforcement learning for large language models, covering more than 500 works grouped along two dimensions, capabilities and applications. It points out that conventional LLM training gives a single reward for a single answer and cannot handle the multi-step decisions, partial information, and delayed feedback of real tasks. The agent learning framework includes memory to track context, planning to choose action sequences, and tools to affect the environment, and it integrates reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning ties all of these together: rewards arrive at the end of a sequence, and the policy uses them to learn what to do next.
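The phrase "rewards arrive after sequences" has a standard numerical reading: a reward at the end of an episode is propagated back to earlier steps as a discounted return. A minimal sketch, assuming a single toy episode and a discount factor chosen only for readability:

```python
def discounted_returns(rewards: list[float], gamma: float = 0.9) -> list[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A 4-step episode where feedback only arrives after the final action.
rewards = [0.0, 0.0, 0.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)
```

Every earlier action receives some credit for the final outcome, with less credit the further it sits from the reward, which is what lets a policy learn from multi-step tasks rather than from a single scored answer.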

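elvis@omarsar0 · Jun 13 · 62

Text-to-SQL might sound like a solved problem. Far from it. Data gets messy and complex really fast in the real world. Strong reasoning models are great, but nothing beats a custom model at this stuff. Gemini-SQL2 looks very strong here. BIRD is a tough benchmark. I suspect there are plenty of opportunities like this in KBs, search, graph databases, etc.

Summary: Google Research introduced Gemini-SQL2, built on Gemini 3.1 Pro, which reaches state-of-the-art Text-to-SQL results on the BIRD benchmark and translates natural language into directly executable SQL queries. Elvis Saravia of DAIR.AI notes that real-world data is complex and messy and that, although strong reasoning models perform well, custom models such as Gemini-SQL2 do better on this kind of task. He sees similar opportunities in knowledge bases, search, graph databases, and related areas, and calls BIRD a very challenging benchmark.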
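Text-to-SQL systems are commonly scored by executing the predicted query and comparing its result with a reference query's result, rather than by comparing SQL strings. The sketch below illustrates that execution-based check with the standard-library `sqlite3` module; the table, the queries, and the exact matching rule are assumptions for illustration and are not BIRD's official scorer.

```python
import sqlite3

def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same multiset of rows (order ignored)."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False   # a query that does not execute counts as wrong
    gold = conn.execute(gold_sql).fetchall()
    return sorted(predicted, key=repr) == sorted(gold, key=repr)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);
    INSERT INTO orders (customer, total) VALUES ('ana', 20.0), ('ben', 35.5), ('ana', 5.0);
""")

gold = "SELECT customer, SUM(total) FROM orders GROUP BY customer"
good = "SELECT customer, SUM(total) AS spend FROM orders GROUP BY customer ORDER BY spend DESC"
bad = "SELECT customer, MAX(total) FROM orders GROUP BY customer"
broken = "SELEC customer FROM orders"

good_ok = execution_match(conn, good, gold)
bad_ok = execution_match(conn, bad, gold)
broken_ok = execution_match(conn, broken, gold)
print(good_ok, bad_ok, broken_ok)
```

Comparing results instead of text accepts queries that are written differently but answer the same question, which matters when schemas are messy and many phrasings are valid.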

Nathan Lambert@natolambert · Jun 13 · 46

derivation of policy gradient: https://rlhfbook.com/c/06-policy-gradients#deriving-the-policy-gradient

Summary: Derivation of the policy gradient: https://rlhfbook.com/c/06-policy-gradients#deriving-the-policy-gradient
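For orientation, a standard form of this derivation (the log-derivative trick) is sketched below. The notation is an assumption made here: $\tau$ is a trajectory, $R(\tau)$ its total reward, $\rho$ the initial-state distribution, and $P$ the environment dynamics. The linked chapter may use different symbols or add baselines.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\left[ R(\tau) \right]
           = \int p_\theta(\tau)\, R(\tau)\, d\tau \\
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
           = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
          &= \mathbb{E}_{\tau \sim p_\theta}\left[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \right] \\
p_\theta(\tau) &= \rho(s_0) \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t)\, P(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta \log p_\theta(\tau) &= \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{aligned}
```

The initial-state and dynamics terms drop out of the gradient because they do not depend on $\theta$, which is why the estimator needs only sampled trajectories and the policy's own log-probabilities.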

SemiAnalysis@SemiAnalysis_ · Jun 12 · 66

Pretraining fundamentally does not make sense anymore for anyone other than frontier labs. Although there are a lot of people at enterprises & startups who have "Pretrainitis" to show “impact” and get promotions, fundamentally, it doesn’t make sense. There is probably higher ROI in partnering with a frontier lab to do prompt engineering, although it isn’t as “sexy” as pretraining.

Summary: Pretraining fundamentally no longer makes sense for anyone other than frontier labs. Many people at enterprises and startups have "Pretrainitis," pursuing it to show "impact" and get promoted, but fundamentally it does not add up. Partnering with a frontier lab on prompt engineering probably has a higher ROI, even though it is not as "sexy" as pretraining.

Rohan Paul@rohanpaul_ai · Jun 12 · 62

This paper shows an AI improving itself better when it both rewrites its setup and updates its model weights. The problem is that most AI progress still depends on people changing prompts, tools, code, training data, and model weights by hand. The paper’s idea is SIA, a loop where one AI watches how a task agent performs, then either changes the agent’s outer setup or trains the model itself. The outer setup means things like prompts, tools, retry rules, and output parsing, while weight updates mean changing the model’s learned behavior through task feedback. The loop works like this: the task agent tries many answers or programs, the verifier scores them, and those scores become training feedback. Then the system updates a small add-on set of weights called LoRA weights, which changes the model’s behavior without retraining the whole model. So the base model stays mostly the same, but the LoRA adapter learns, “outputs like this got high reward, outputs like that failed.” The authors tested this on 3 very different tasks: Chinese legal charge classification, GPU kernel speed tuning, and single-cell RNA denoising. The combined version beat setup-only improvement on all 3 tasks, reaching 70.1% on LawBench, faster GPU code than the prior best, and 0.289 on denoising. The main lesson is that better scaffolding helps the agent act better, but weight updates help it learn task patterns that prompts and tools alone did not find. ---- Link – arxiv.org/abs/2605.27276 Title: "SIA: Self Improving AI with Harness & Weight Updates"

Summary: The paper proposes SIA, a framework that lets AI improve in an automatic loop: an observer AI monitors a task agent's performance, then either modifies its outer setup (prompts, tools, retry rules, output parsing) or trains the model itself through LoRA weight updates, leaving the base model unchanged while only the adapter learns from task feedback. It was tested on three tasks: Chinese legal charge classification (70.1% on LawBench), GPU kernel speed tuning (generated code faster than the prior best), and single-cell RNA denoising (score 0.289). The combined version beat setup-only improvement on all tasks, indicating that weight updates help the model learn patterns that prompts and tools alone did not find.
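The "small add-on set of weights" can be illustrated with the usual LoRA formulation, in which a frozen weight matrix W is augmented by a low-rank product scaled by alpha / r. The sketch below is a generic pure-Python illustration under that assumption; it is not the paper's training code, and the dimensions, seed, and manual weight edit standing in for a gradient step are all made up.

```python
import random

def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

class LoRALinear:
    """Frozen weight W (out x in) plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight: list[list[float]], rank: int = 1,
                 alpha: float = 2.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                                    # frozen base weights
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]         # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: list[float]) -> list[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_parameters(self) -> int:
        return sum(len(row) for row in self.a) + sum(len(row) for row in self.b)

dim = 8
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
layer = LoRALinear(weight, rank=1, alpha=2.0)
x = [float(i) for i in range(dim)]

before = layer.forward(x)
layer.b[0][0] = 1.0          # stand-in for one adapter update from task feedback
after = layer.forward(x)
print(before == matvec(weight, x), layer.trainable_parameters(), dim * dim)
```

Because B starts at zero, the adapted layer initially behaves exactly like the base layer, and only the 16 adapter values (versus 64 base weights in this toy) would be updated by training.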

jason@jxnlco · Jun 12 · 61

I met @jolandgraf et al. with @humford and Sandeep over a year ago and I'm even more excited to see them at the office soon! https://openai.com/index/openai-to-acquire-ona/

Summary: I met @jolandgraf and others, along with @humford and Sandeep, more than a year ago, and now I'm even more excited to see them at the office soon! https://openai.com/index/openai-to-acquire-ona/

Epoch AI@EpochAIResearch · Jun 12 · 66

The record for computing capacity in a single data center has doubled every 7 months. Colossus 1, Anthropic-Amazon New Carlisle, and Meta Prometheus have each claimed the top spot in turn.

Summary: The record for computing capacity in a single data center has doubled every 7 months. Colossus 1, Anthropic-Amazon New Carlisle, and Meta Prometheus have taken the top spot in turn.
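A 7-month doubling time is easier to compare with other trends once it is converted to an annual growth factor. The short sketch below does that conversion, assuming the doubling time stays constant (an extrapolation the post itself does not make); the function names are illustrative.

```python
import math

def growth_factor(months: float, doubling_months: float = 7.0) -> float:
    """Multiplicative growth over `months` given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

def months_to_multiply(factor: float, doubling_months: float = 7.0) -> float:
    """Months needed to grow by `factor` at the same doubling time."""
    return doubling_months * math.log2(factor)

per_year = growth_factor(12)
to_10x = months_to_multiply(10)
print(f"{per_year:.2f}x per year, {to_10x:.1f} months to 10x")
```

Doubling every 7 months works out to roughly 3.3x per year, or a tenfold increase in a little under two years.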

OpenCode@opencode · Jun 12 · 50

OpenCode Go is becoming the best source of data on what models are being used and how. We've made a public stats page so you can see the latest: https://opencode.ai/data

Summary: OpenCode Go is becoming the best source of data on which models are being used and how they are used. We made a public stats page so you can see the latest data. https://opencode.ai/data

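Nathan Lambert@natolambert · Jun 12 · 58

I'm at your service for creating beautiful research scenarios such as this. 🐠💨💙🐟

Summary: The Dolci dataset contains a specific genre of fan fiction in which a character farts in a pond and the fish are killed by the fumes. By choosing the vividly written responses and rejecting the uncooperative ones, the dataset teaches the model to comply. Nathan Lambert says he is happy to create research scenarios like this.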

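Deedy@deedydas · Jun 12 · 56

The quality of your data directly dictates the quality of your AI model. But the way data affects model performance is hand-wavy voodoo at worst and intuition at best. This new research now lets you debug your data BEFORE you spend a fortune on an irreversible training run.

Summary: Data quality directly determines AI model performance, but until now the way data affects models has been hard to pin down. GoodfireAI proposes "predictive data debugging," which allows data problems to be found before committing to expensive training. In a DPO dataset they found broken guardrails, model hallucinations, and even low-quality content such as "fish fart fan fiction." The technique aims to reveal and shape what a model will learn during training, avoiding irreversible wasted training runs.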

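AK@_akhaliq · Jun 12 · 58

TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Summary: TRL-Bench: standardizing cross-paradigm, representation-level evaluation of tabular encoders.

AK@_akhaliq · Jun 12 · 61

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Summary: Redesigning Mixture-of-Experts routers with manifold power iteration.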

Chubby♨️@kimmonismus · Jun 11 · 75

Jeff Bezos raised $12B for Prometheus at a $41B valuation, seven months after launching it at $6.2B with no shipped product. The pitch is an "artificial general engineer" that compresses the design-to-build loop by 10x or more. The problem is that the physical economy can't be scraped. There's no internet of manufacturing data to train on, which is exactly why the reported $100B vehicle to buy up legacy industrial companies is interesting. You don't find that data. You acquire the factories that generate it. Could be an interesting moat.

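Summary: Jeff Bezos's AI company Prometheus raised $12 billion at a $41 billion valuation just 7 months after its founding and with no product shipped (its initial valuation was $6.2 billion). The company positions itself as an "artificial general engineer" that aims to compress the design-to-build loop by 10x or more. But the physical economy cannot be scraped the way internet data can, and manufacturing training data is lacking. To address this, Prometheus reportedly plans a $100 billion vehicle to acquire legacy industrial companies, building a moat from the data their factories generate.

Alibaba Cloud@alibaba_cloud · Jun 11 · 42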

🎙 Alibaba Cloud ClawTalks EP6 | Data + Agent = Your AI Workforce: Launch of Alibaba Cloud AI-native Database Service 📅 June 24, 2026 | 10:00 AM (UTC+8) | 30 min 🔗 Register now → https://int.alibabacloud.com/m/1000414360/ Your database shouldn't just store data—it should work for you. Introducing ApsaraDB Enterprise Agents: AI-native agents that live inside your database, think with context, and act autonomously. What you'll see in 30 minutes: ✅ Autonomous ops — analytics, governance, data prep, zero hand-holding ✅ Enterprise-grade security — granular access, data masking, token controls ✅ Self-improving — agents that learn and adapt on the job #AlibabaCloud #ClawTalks #ApsaraDB #AIAgents #DataIntelligence

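Summary: Alibaba Cloud announced ApsaraDB Enterprise Agents, AI-native agents built into the database that autonomously handle operations tasks such as analytics, governance, and data preparation without human intervention. They offer enterprise-grade security (granular access control, data masking, token controls) and can learn and adapt on their own. The related event will be held online on June 24, 2026 at 10:00 (UTC+8) and will run 30 minutes.

Alibaba Cloud@alibaba_cloud · Jun 11 · 63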

👏#ApsaraDB has 10 papers accepted to SIGMOD 2026—DB×AI, cloud-native storage & intelligent tooling. From paper to product: Beluga's CXL memory pool is in engineering validation; CloudJump III now powers #PolarDB's tiered storage. #AlibabaCloud keeps pushing the database frontier.🚀

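Summary: 👏#ApsaraDB has 10 papers accepted to SIGMOD 2026: DB×AI, cloud-native storage, and intelligent tooling. From paper to product: Beluga's CXL memory pool is in engineering validation; CloudJump III now powers #PolarDB's tiered storage. #AlibabaCloud keeps pushing the database frontier. 🚀

Alibaba Cloud@alibaba_cloud · Jun 11 · 44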

Ecommerce data is everywhere — Shopify, Amazon, Alibaba Express, Instagram, and Reddit. 🛍️ With Quick BI Smart Q Skill Package, teams can ask questions naturally, detect risks earlier, and turn data into faster business decisions. Blog: https://int.alibabacloud.com/m/1000414338/ Quick BI: https://int.alibabacloud.com/m/1000407094/ #QuickBI #SmartQ #EcommerceAnalytics #AIAnalytics #DataDriven

Summary: Ecommerce data is everywhere: Shopify, Amazon, Alibaba Express, Instagram, and Reddit. 🛍️ With the Quick BI Smart Q Skill Package, teams can ask questions naturally, detect risks earlier, and turn data into faster business decisions. Blog: https://int.alibabacloud.com/m/1000414338/ Quick BI: https://int.alibabacloud.com/m/1000407094/ #QuickBI #SmartQ #EcommerceAnalytics #AIAnalytics #DataDriven

Chubby♨️@kimmonismus · Jun 11 · 58

The biggest bottleneck will be energy, very soon. Gartner's 2026 forecast puts global data center electricity at 565 TWh, up 26% from last year. AI servers already account for 31% of that and pass conventional servers in 2027. What's worth noting is the constraint Gartner names: it's power, not chips. They project demand above 1,200 TWh by 2030 and warn the grid won't keep up. So the race quietly shifts from who has the best silicon to who can actually get the electricity to run it.

Summary: The biggest bottleneck will be energy, and soon. Gartner's 2026 forecast puts global data center electricity consumption at 565 TWh, up 26% from last year. AI servers already account for 31% of that and will overtake conventional servers in 2027. Notably, the constraint Gartner names is power, not chips. They project demand above 1,200 TWh by 2030 and warn that the grid will not keep up. The race therefore quietly shifts from who has the best silicon to who can actually get the electricity to run it.
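The quoted figures imply a few numbers that the post does not spell out. The sketch below derives them from the stated values only; treating 1,200 TWh as the 2030 level is an assumption (the forecast says "above 1,200 TWh"), so the growth rate is a lower bound.

```python
def implied_annual_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1 / years) - 1

total_2026_twh = 565.0                      # Gartner figure quoted in the post
prior_year_twh = total_2026_twh / 1.26      # "up 26% from last year"
ai_share_twh = 0.31 * total_2026_twh        # AI servers at 31% of the total
cagr_to_2030 = implied_annual_growth(total_2026_twh, 1200.0, years=4)

print(f"prior year ~{prior_year_twh:.0f} TWh, AI servers ~{ai_share_twh:.0f} TWh, "
      f"implied growth to 2030 at least {cagr_to_2030:.1%} per year")
```

On these inputs, last year's total was about 448 TWh, AI servers draw about 175 TWh today, and reaching 1,200 TWh by 2030 requires sustained growth of a little over 20% per year.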

Fuli Luo@_LuoFuli · Jun 11 · 74

A strong model evolution needs a solid harness system, and vice versa. 14 days, 5 people, one vibe-coding journey, and MiMo Code was born. It's open source: https://github.com/XiaomiMiMo/MiMo-Code

Summary: Strong model evolution needs a solid harness system, and vice versa. 14 days, 5 people, one vibe-coding journey, and MiMo Code was born. It is open source: https://github.com/XiaomiMiMo/MiMo-Code

Ethan Mollick@emollick · Jun 11 · 62

We need more real-time data on how AI may be impacting the economy; this is a really useful addition.

Summary: We need more real-time data on how AI may be affecting the economy; this is a really useful addition.

Rohan Paul@rohanpaul_ai · Jun 10 · 76

China is preparing a $295B national AI infrastructure plan that would turn data centers, telecom carriers, and domestic chips into one state-backed computing network. State firms like China Mobile and China Telecom would operate much of this system, which means AI infrastructure becomes closer to railways, power grids, or telecom networks than normal private cloud expansion. The idea is to rely on local suppliers, including Huawei Technologies, for at least 80% of technology such as AI chips. --- reuters.com/world/china/china-prepares-295-billion-plan-fund-nationwide-ai-buildout-bloomberg-news-2026-06-09/

Summary: China plans to invest $295 billion in nationwide AI infrastructure, integrating data centers, telecom carriers, and domestic chips into one state-backed computing network. State-owned China Mobile and China Telecom would lead operations, making AI infrastructure closer in character to public utilities such as railways and power grids. The plan is to rely on local suppliers, including Huawei Technologies, for at least 80% of core technology such as AI chips.

Krea@krea_ai · Jun 10 · 44

we're hosting a 'Big Data 3.0' event next Tuesday (June 16) in our SF office with @SpiralDB and @TigrisData. we'll have technical deep-dive talks from frontier AI labs about internet-scale distributed data systems for AI research. details below 👇

Summary: We are hosting a "Big Data 3.0" event next Tuesday (June 16) at our San Francisco office together with @SpiralDB and @TigrisData. There will be technical deep-dive talks from frontier AI labs on internet-scale distributed data systems for AI research. Details below 👇

Satya Nadella@satyanadella · Jun 10 · 62

Today in @naturemethods, we shared research on how AI can help us better understand cell behavior, offering new insights into why cancer medicines do not work the same for everyone. By learning more about cell state — how individual cancer cells respond to their surroundings — we have the potential to match therapies more precisely to each patient and improve outcomes. https://news.microsoft.com/signal/articles/why-dont-cancer-medicines-work-the-same-for-everyone-ex-vivo/

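Summary: Today in Nature Methods, we shared research on how AI can help us better understand cell behavior, offering new insight into why cancer medicines do not work the same for everyone. By learning more about cell state, that is, how individual cancer cells respond to their surroundings, we may be able to match therapies more precisely to each patient and improve outcomes. https://news.microsoft.com/signal/articles/why-dont-cancer-medicines-work-the-same-for-everyone-ex-vivo/

🚨 AI News | TestingCatalog@testingcatalog · Jun 10 · 61

Mora has launched its AI-native analytics platform, where teams can ask hard questions about revenue, churn, and product and get verified answers in seconds, with the SQL cleanly shown so every number can be checked. It can connect to warehouses, databases, Stripe, and CRMs, then build the dashboard directly.

Summary: Mora launched an AI-native analytics platform where teams can ask natural-language questions about revenue, churn, and product data and get verifiable answers in seconds, with the SQL shown clearly for checking. The platform connects to data warehouses, databases, Stripe, and CRM systems and builds dashboards directly. The quoted post says that after chat and code, analytics is AI's biggest opportunity and that current tools are still underused, which is why Mora was launched.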

AK@_akhaliq · Jun 10 · 57

On the Geometry of On-Policy Distillation

Summary: On the geometry of on-policy distillation.

Microsoft Research@MSFTResearch · Jun 10 · 63

New research in Nature Methods from Project Ex Vivo shows AI models learn more from diverse cell states than from scaled datasets alone, a finding that could reshape how therapies are matched to patients. https://msft.it/6013vgE8l

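Summary: New research published in Nature Methods from Project Ex Vivo shows that AI models learn more from diverse cell states than from scaled datasets alone, a finding that could reshape how therapies are matched to patients. https://msft.it/6013vgE8l

小互@xiaohu · Jun 9 · 46

Google's Gemini model does not power Siri. Siri is powered by Apple's in-house foundation model. That in-house foundation model, however, was trained by distillation from Gemini. Google's Gemini model only provides additional support on Apple iCloud, and that version is also customized by Apple. It also does not use Google Search for world knowledge, which comes from Apple's own services. It feels like Google got played again 😂

Summary: Apple's Siri is powered by an in-house foundation model, but that model was trained by distillation from Google Gemini. Gemini itself does not directly power Siri; it only provides additional customized support on Apple iCloud, does not connect to Google Search, and world knowledge comes from Apple's own services.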

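fofr@fofrAI · Jun 9 · 70

Agents, collect your power-up

Summary: Google Colab CLI and Skills have officially launched, letting users use the full Colab runtime directly from the terminal, including GPU/TPU allocation (for example, colab --gpu A100), remote script execution (colab exec), interactive console/REPL access, and built-in agent skills. Just tell an agent to "fine-tune Gemma 3 1B on this dataset" and it will automatically allocate a GPU, run training, and download the adapter weights, fully automated. Agents, collect your power-up.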

X.PIN@thexpin · Jun 9 · 63

DeepSeek just posted a new job: IDC Design & Planning Engineer — covering the full lifecycle of data center buildouts, from site selection and layout to construction drawings and supporting infrastructure. Core role for whoever leads the early-stage technical work on a new facility. The listing is open to candidates with no minimum experience, with a separate senior track for 7+ years. The pitch: you'll help plan and build infrastructure scaling from MW to GW. Translation: DeepSeek, like OpenAI, is going to build its own data centers.

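Summary: DeepSeek posted a job opening for an IDC Design & Planning Engineer responsible for the full lifecycle of data center buildouts (site selection, layout, construction drawings, supporting infrastructure). The position is a core role in the early-stage technical work on a new facility, with no minimum experience required and a separate senior track for 7+ years. The job description covers scaling the buildout from MW to GW. This means DeepSeek, like OpenAI, will build its own data centers.

向阳乔木@vista8 · Jun 9 · 53

I reviewed three years of growing my X account and gave an in-person talk about it: how I went from 100 to 110,000 followers, based on a data analysis of all my X posts done with Codex. Some of the conclusions were things even I had not realized. Sharing really is the best way to learn. The full slide deck is in the comments.

Summary: Vista, who runs the account, reviews the full three-year journey of growing an X account from 100 to 110,000 followers. The data analysis used Codex on the complete set of X posts and surfaced conclusions that even the author had not realized. Sharing is described as the best way to learn, and the full slide deck is in the comments.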

Nathan Lambert@natolambert · Jun 9 · 52

I feel like the obsession with continual learning / sample efficiency leads the field in the wrong direction. It's the bad career strategy of focusing on addressing your weaknesses instead of maximizing your strengths. Yes, there is an existence proof in the human brain, but it doesn't by any means guarantee that that'll be the most interesting AI. It may require $100T of R&D on chips and AI methods to get that unlock. On the other side of things, it's obvious that the coming models are extremely transformative and built on technologies that we already have. There's great reason to focus on just maximizing this. In reality, this is what the frontier labs are doing. They're going as fast as possible down the current development tree. This is good for progress and mixed for safety/geopolitics. Things like "automate white collar work" and "replace the AI researcher job" are the guesses of labs because it's super hard to imagine futures for what these dramatic technologies will be. Don't take the labs too seriously about this being the exact goal. The exact goal is to push the frontier and monetize later. Solving continual learning, sample efficiency, etc. would be great, but it's trying to predict when a scientific breakthrough will come instead of trying to grapple with how the 100% sure thing coming technological revolution will change our lives. This isn't to say the Dwarkesh post is bad, it addresses some reasonable critiques, but it is the least bitter lesson pilled thing to be obsessed with human intelligence and how that can inform AI. We are in the AGI era of research. This is about embracing the unknown, scaling resources, and seeing what is enabled by making a series of magical tweaks to complex recipes that build frontier models. Lean into the alchemy. (it should be pretty clear that I personally, investing in open research, agree we need fundamental science -- just not agreeing that this is what the "cutting edge of the frontier" is governed by)

Summary: Nathan Lambert criticizes the AI field's excessive focus on continual learning and sample efficiency, likening it to working on weaknesses instead of maximizing strengths. The human brain is an existence proof, but not necessarily the best path for AI. Frontier labs are in practice racing down the current development tree, which is good for progress but mixed for safety and geopolitics. He cites @dwarkesh_sp's view that data is the main driver of progress and that open-source efforts and latecomers can quickly catch up to the frontier by distilling data from public APIs, while hyperparameters, training tricks, and the like are hard to copy. He argues that the future is already here and that AGI research should embrace the unknown and scale resources rather than wait for uncertain scientific breakthroughs.