All AI Updates · AI HOT


AK @_akhaliq · June 1 · 58

GrepSeek: Training Search Agents for Direct Corpus Interaction

OpenClaw🦞 @openclaw · June 1 · 72

In collaboration with @nvidia, we're open-sourcing a dataset of security scans for 67,453 ClawHub skills on @huggingface:
- NVIDIA SkillSpector flagged 1/2 for agentic risk
- Only 0.31% were malicious
- No two scanners agreed on more than 8.5% of risks
https://openclaw.ai/blog/openclaw-nvidia-skill-security

elvis @omarsar0 · June 1 · 60

// The Efficiency Frontier // Cool paper on context management. As agents reuse the same documents and histories across many turns, the cheapest context strategy is not fixed. This work describes a principled rule for picking one per deployment instead of defaulting to whatever topped a benchmark in isolation. Retrieval and compression methods are almost always benchmarked on accuracy and cost separately, so you never learn when one actually beats another under real load. The Efficiency Frontier models context strategy selection as a single cost-performance problem, with a log-utility term for diminishing returns from extra context and a reuse parameter N that amortizes preprocessing across repeated queries. Sweep N and the optimal strategy changes, exposing crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage about 25 percent at the same performance, and amortized memory compression runs over 50 percent cheaper than full-context prompting in higher-performance settings. Paper: https://arxiv.org/abs/2605.23071 Learn to build effective AI agents in our academy: https://academy.dair.ai/

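Summary: The paper argues that when AI agents reuse the same documents and histories across many turns, a fixed context strategy is not optimal. It proposes the "Efficiency Frontier" framework, which models context-strategy selection as a cost-performance trade-off. Sweeping the reuse parameter N reveals the crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage by about 25% at the same performance, and amortized memory compression is more than 50% cheaper than full-context prompting in higher-performance settings.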
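The trade-off described in this post can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's formulation: the `Strategy` fields, the `cost_weight` parameter, the exact shape of the log-utility term, and every number below are hypothetical. It shows how amortizing a one-time preprocessing cost over N reuses can change which strategy comes out on top.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    preprocess_tokens: float  # one-time cost, paid once per document set (hypothetical)
    per_query_tokens: float   # cost paid on every query (hypothetical)
    quality: float            # made-up "useful context" score


def effective_cost(s: Strategy, n_reuse: int) -> float:
    """Per-query token cost once preprocessing is amortized over n_reuse queries."""
    if n_reuse < 1:
        raise ValueError("n_reuse must be at least 1")
    return s.preprocess_tokens / n_reuse + s.per_query_tokens


def utility(s: Strategy, n_reuse: int, cost_weight: float = 1e-3) -> float:
    """Log utility (diminishing returns from extra context) minus weighted cost."""
    return math.log1p(s.quality) - cost_weight * effective_cost(s, n_reuse)


def best_strategy(strategies, n_reuse: int, cost_weight: float = 1e-3) -> Strategy:
    """Pick the strategy with the highest utility at a given reuse count."""
    return max(strategies, key=lambda s: utility(s, n_reuse, cost_weight))


# Illustrative numbers only; they are not taken from the paper.
STRATEGIES = [
    Strategy("full_context", preprocess_tokens=0, per_query_tokens=3000, quality=10.0),
    Strategy("retrieval", preprocess_tokens=2000, per_query_tokens=1500, quality=6.0),
    Strategy("compression", preprocess_tokens=20000, per_query_tokens=800, quality=8.0),
]

if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, best_strategy(STRATEGIES, n).name)
```

With these made-up numbers, full context wins at N = 1, retrieval at N = 5, and compression at N = 100, which is the kind of crossover the post describes.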

Rohan Paul @rohanpaul_ai · May 30 · 69

Amazon unveiled "Resilient Network Graphs" (RNG), a data center network that reduces hardware needs by 69% and raises throughput by 33%. Amazon has been quietly deploying the design across its data centers since last year, and it is now the default data center network for most AWS workloads.

RNG replaces tree-shaped datacenter networks with flatter random ones that waste less capacity. For decades, fat-tree networks worked because they were predictable and easy to run, but their layered hierarchy can concentrate traffic at a few choke points while other links sit underused. RNG addresses this by connecting routers in a flat quasi-random graph, so many small independent paths between servers replace a few privileged routes through upper layers.

Its routing system, Spraypoint, spreads traffic across many separate paths, while its ShuffleBox cabling device makes the random-looking wiring practical to build and expand. Instead of asking every packet to chase the shortest path, Spraypoint fans traffic outward and then guides it back through distributed waypoints, creating many edge-disjoint paths without requiring exotic switch memory.

The authors tested RNG in 2 real Amazon production fabrics and compared it with fat-tree networks using transport and storage workloads. The main result is that RNG matched fat-tree application performance, found far more separate paths than common routing methods, and was estimated to cost 9% to 45% less.

The hard part is not the idea but the engineering: routing in a random mesh needs smarter path selection, and the physical system must manage millions of fiber connections without becoming impossible to operate. This matters for AI clusters because training traffic is huge, synchronized, and sensitive to congestion, so a network that spreads load better can make expensive GPUs spend less time waiting.

Link: arxiv.org/abs/2604.15261
Title: "RNG: Flat Datacenter Networks at Scale"

Summary: Amazon introduced a new data center network architecture called "Resilient Network Graphs" (RNG). The design replaces the traditional tree-shaped network with a flat quasi-random graph and spreads traffic across many independent paths using the Spraypoint routing system and the ShuffleBox cabling device. Tests show RNG matches traditional fat-tree performance while cutting hardware needs by 69%, raising throughput by 33%, and lowering estimated cost by 9% to 45%. It is now the default network for most AWS workloads, and its ability to spread load helps AI cluster training efficiency.

Fei-Fei Li @drfeifei · May 30 · 83

I'm very excited by this new benchmark dataset for visual generation that is suitable for the modern era of large scale generative models!🤩

AK @_akhaliq · May 30 · 55

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

AK @_akhaliq · May 30 · 62

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

AK @_akhaliq · May 30 · 54

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

AK @_akhaliq · May 29 · 61

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

Rohan Paul @rohanpaul_ai · May 29 · 60

Agent skills are usually hand-written, generated once by an LLM, or revised in loose ways that can easily make them worse. SkillOpt, from Microsoft, argues that agent skills should be trained like small external programs: it teaches AI agents better task habits by editing a reusable skill document, not the model itself.

The paper's core idea is to treat the skill document as the thing being trained, while the main AI model stays frozen. SkillOpt watches the agent try tasks, studies what worked and what failed, then asks a stronger optimizer model to suggest small edits to the skill. It accepts an edit only when the new skill improves on a held-out check set, so the skill does not drift just because an edit sounds good.

The authors tested this across 6 benchmarks, 7 target models, and 3 agent settings: direct chat, Codex, and Claude Code. SkillOpt was best or tied on all 52 tested cases, and on GPT-5.5 it raised average accuracy by 23.5 points in direct chat.

The final result is a small, readable skill file that can improve agents across tasks and settings without retraining the model. The optimizer is used during training, but deployment needs only the final skill file. That makes the artifact inspectable, portable, and cheap to reuse, which is exactly what most prompt-engineering systems lack.

Link: arxiv.org/abs/2605.23904
Title: "SkillOpt: Executive Strategy for Self-Evolving Agent Skills"

Summary: Microsoft's SkillOpt aims to improve how AI agent skills are optimized. Its core idea is to treat a standalone skill document as the object being optimized rather than modifying the underlying LLM. The agent attempts tasks, successes and failures are analyzed, and a stronger optimizer model makes small edits to the skill document. An edit is accepted only if it improves performance on a validation set, which keeps the skill improving steadily. Across 52 test cases spanning 6 benchmarks, 7 target models, and 3 agent settings (direct chat, Codex, and Claude Code), SkillOpt was best or tied for best. On GPT-5.5 it raised average direct-chat accuracy by 23.5 points. The resulting skill file is readable, portable, and reusable, and deployment requires no model retraining.
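The validation-gated loop described in this post can be sketched in a few lines. This is a minimal illustration, not SkillOpt's actual code: `optimize_skill`, `toy_propose_edit`, and `toy_validation_score` are hypothetical names, the optimizer model is replaced by a random editor, and the held-out check set is replaced by a toy scoring function.

```python
import random
from typing import Callable

HINTS = ["check units", "cite sources", "be verbose", "guess when unsure"]
USEFUL = {"check units", "cite sources"}


def toy_propose_edit(skill: str, rng: random.Random) -> str:
    """Stand-in for the optimizer model: add or delete one instruction line."""
    lines = skill.splitlines()
    if lines and rng.random() < 0.3:
        lines.pop(rng.randrange(len(lines)))  # delete an instruction
    else:
        lines.append(rng.choice(HINTS))       # add an instruction
    return "\n".join(lines)


def toy_validation_score(skill: str) -> float:
    """Stand-in for a held-out check set: reward useful lines, penalize the rest."""
    lines = skill.splitlines()
    useful = sum(1 for hint in USEFUL if hint in lines)
    harmful = sum(1 for line in lines if line not in USEFUL)
    return useful - 0.5 * harmful


def optimize_skill(
    skill: str,
    propose_edit: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    rounds: int = 50,
    seed: int = 0,
) -> str:
    """Keep a proposed edit only if it strictly improves the validation score."""
    rng = random.Random(seed)
    best_score = score(skill)
    for _ in range(rounds):
        candidate = propose_edit(skill, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # the validation gate
            skill, best_score = candidate, candidate_score
        # otherwise the edit is rejected and the current skill is kept
    return skill


if __name__ == "__main__":
    print(optimize_skill("", toy_propose_edit, toy_validation_score))
```

Because every accepted edit must strictly raise the score, the skill can never get worse on the check set, and the model that follows the skill is never touched.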

Rohan Paul @rohanpaul_ai · May 29 · 65

Yann LeCun's new paper asks when LeJEPA truly learns hidden world variables, and finds that Gaussian structure is the key. In other words, LeJEPA can reliably learn the real hidden causes behind what it sees only when those causes are shaped like a balanced Gaussian cloud.

The paper proves that when the true hidden variables are independent Gaussian variables and the paired views come from a stable noisy process, the best LeJEPA solution must recover those variables up to a rotation or flip. This gives a mathematical account of when a self-supervised AI model is really learning the structure of the world, not just producing useful features that happen to work on a test.

Link: arxiv.org/abs/2605.26379
Title: "When Does LeJEPA Learn a World Model?"

Summary: A new paper from Yann LeCun's team examines the conditions under which LeJEPA learns the true hidden variables of the world. Its central conclusion is that LeJEPA can learn them reliably only when they have a Gaussian-cloud structure. The paper proves that when the hidden variables are independent Gaussians and the paired views are generated by a stable noisy process, the optimal LeJEPA solution recovers those variables up to a rotation or flip. The work offers a theoretical explanation of when a self-supervised model truly captures world structure rather than merely extracting features that work on a test set.

Chubby♨️ @kimmonismus · May 29 · 37

Ngl, this made me laugh and didn't surprise me at all. Researchers at Emergence AI let different AI models run simulated societies, and the results were - well - expected: Claude built the most stable world with zero crime, while Grok collapsed into extinction within four days and Gemini produced hundreds of crimes.

Rohan Paul @rohanpaul_ai · May 29 · 81

Big release: open-source recursive self-improvement from @hexoai. It shows that an AI agent can improve both how it works and what it internally knows after seeing its own task results, by repeatedly training on its own task feedback rather than relying on a human to hand-code every strategy.

Most agents today are frozen workers: you can give them better prompts, better tools, better retry rules, and better code, but the actual model usually stays the same. SIA (Self Improving AI framework) changes the outer workflow, called the harness, and also changes the model's weights, the internal settings that store learned patterns. Task feedback therefore changes the model's internal parameters, pushing it toward domain knowledge.

The paper reports a 56.6% gain on LawBench, a 91.9% runtime reduction on GPU kernels, and a 502% improvement on single-cell RNA denoising over baseline.

Summary: hexoai open-sourced the SIA (Self Improving AI) framework. It shows that an AI agent can not only optimize its outer workflow (the harness) but also update its own model weights directly from task feedback, improving domain knowledge and capability on its own instead of relying only on human-supplied prompt or tool improvements. The paper reports a 56.6% gain on LawBench, a 91.9% runtime reduction on GPU kernels, and a 502% improvement over baseline on single-cell RNA denoising.

AK @_akhaliq · May 29 · 58

GEM: Generative Supervision Helps Embodied Intelligence

elvis @omarsar0 · May 29 · 63

// Memory as Connectivity // One of the cleaner reframings of agent memory I have seen this month. FluxMem treats memory as the continuously evolving topology of a heterogeneous graph. Three stages run together: initial connection formation, feedback-driven refinement, and long-term consolidation of recurrent successful trajectories into reusable procedural circuits. During execution, it repairs missing links, prunes interference, and aligns abstraction granularity. SOTA on LoCoMo, Mind2Web, and GAIA across three distinct memory regimes. Paper: https://arxiv.org/abs/2605.28773 Learn to build effective AI agents in our academy: https://academy.dair.ai/

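Summary: FluxMem is an agent memory architecture built on the idea that memory is the continuously evolving topology of a heterogeneous graph. Three stages run together: initial connection formation, feedback-driven refinement, and long-term consolidation of recurring successful trajectories into reusable procedural circuits. During execution it repairs missing links, prunes interference, and aligns abstraction granularity. It reaches SOTA on LoCoMo, Mind2Web, and GAIA, which cover three distinct memory regimes.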

AK @_akhaliq · May 28 · 54

SkillOpt: Executive Strategy for Self-Evolving Agent Skills

AK @_akhaliq · May 28 · 48

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

AK @_akhaliq · May 28 · 55

Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

AK @_akhaliq · May 28 · 49

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

AK @_akhaliq · May 28 · 64

PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

AK @_akhaliq · May 28 · 54

MRT: Masked Region Transformer for Layered Image Generation and Editing at Scale

Rohan Paul @rohanpaul_ai · May 28 · 62

Super important paper from Univ of Texas. AI agents can slowly become less reliable after deployment, even when the model itself does not change.

The problem is that agents are often judged when they are fresh, but real agents keep changing because they summarize old chats, store more memories, update facts, and go through maintenance. An agent that remembers you across weeks is really a small operating system wrapped around a language model: it writes notes, compresses them, retrieves them, updates them, and occasionally cleans house. Every one of those steps can quietly rot. A medication dose can become "a daily medication," two similar clients can blur into one, a canceled subscription can remain active, and a schedule can vanish after a maintenance pass. The uncomfortable finding is that the agent may still sound competent while becoming less exact.

The paper proposes AgingBench, a benchmark that checks whether an agent stays reliable across many sessions instead of only checking one clean starting point. It studies 4 ways agents age: summaries can drop key details, similar memories can get mixed up, updated facts can stay stale, and maintenance can suddenly break memory.

The deeper lesson is that "give it more memory" is often the wrong repair. If the fact was never written, retrieval cannot save it. If the fact was written but crowded out, better summarization will not fix it. If the fact is present but unused, the problem is not storage but the agent's decision to trust or ignore what it retrieved. This paper reframes deployed agents less like static models and more like aging infrastructure.

Link: arxiv.org/abs/2605.26302
Title: "Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems"

Summary: The paper argues that after deployment, an AI agent's memory system gradually "ages" through summarization, storage, updates, and maintenance, which leads to information being lost, mixed up, left stale, or broken. The agent may still appear to work while its reliability quietly declines. The paper proposes AgingBench, a benchmark for evaluating sustained reliability across many sessions. It compares agents to aging infrastructure and stresses that simply adding memory is not the fix.

Rohan Paul @rohanpaul_ai · May 28 · 71

Image diffusion Transformers train poorly because their layers pass information in a fixed, outdated way. They can train much faster by changing how layers share information: with this paper, the same image quality arrived with 8.75x fewer training iterations.

The surprise is not that Diffusion Transformers had an inefficiency, but where it was hiding. Researchers have spent years refining attention, conditioning, tokenization, objectives, and autoencoders, while leaving the residual stream mostly untouched because it looked like plumbing rather than intelligence. In a standard residual stack, every layer keeps adding its output to the running stream. The authors found 3 signs that this setup hurts the model: the stream's magnitude swells going forward, gradients fade going backward, and neighboring blocks often produce almost the same features. That is bad for any Transformer, but it is especially awkward for diffusion, because denoising is not one fixed task repeated at every step.

Their fix is Diffusion-Adaptive Routing, a replacement that lets each layer choose which earlier layer outputs to use, with a choice that changes with the denoising timestep. The big deal is that the paper does not add a new image dataset, loss, tokenizer, or attention trick, but instead questions the old residual connection that most models kept copying from language Transformers.

Link: arxiv.org/abs/2605.20708
Title: "Rethinking Cross-Layer Information Routing in Diffusion Transformers"

Summary: Standard Diffusion Transformers train inefficiently because information passes between layers in a fixed way. The authors propose Diffusion-Adaptive Routing, which lets each layer dynamically choose which earlier layer outputs to use, with the choice varying by denoising timestep. The method adds no new dataset, loss function, tokenizer, or attention mechanism; by rethinking the residual connection alone, it reaches the same image quality with 8.75x fewer training iterations.

Ethan Mollick @emollick · May 28 · 55

There is a lot being written about the stylistic tells of AI writing (em-dashes, etc.), but this paper looks at AI narrative tells. Fascinating differences between AI & human narrative, and asking AI to write in different styles doesn't do much to change it. https://arxiv.org/abs/2604.03136

AK @_akhaliq · May 28 · 65

Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players

Rohan Paul @rohanpaul_ai · May 28 · 65

Long-running language agents may work better if they periodically stop to consolidate memory.

The problem is that today's transformer agents get slower and more expensive as their context grows, because attention has to keep checking more past tokens. The usual fix for long context is to keep more tokens nearby, but that turns every next-token prediction into a larger search through the past. The sharper idea here is that memory is not only storage. Sometimes the hard part is converting a messy stretch of experience into a state that can actually be used later.

So the paper adds a sleep phase. During sleep, the model pauses, runs several offline passes over recent context, writes the useful information into fixed-size memory layers (fast weights inside its state-space blocks), and then clears the short-term attention cache. The model pays extra compute while sleeping, not while answering, so normal prediction can still happen with 1 forward pass.

The authors test this on cellular automata, graph lookup, and GSM-Infinite math problems, where the model must use old information that is no longer sitting in its attention cache. The main result is that longer sleep improves performance, especially on harder cases that need deeper reasoning rather than just remembering a fact.

The big deal is that long-horizon agents may not need to carry bigger and bigger raw context forever, because they can consolidate the important parts and safely forget the raw tokens.

Link: arxiv.org/abs/2605.26099
Title: "Language Models Need Sleep"

Summary: To address Transformer agents getting slower and more expensive as context grows, the paper proposes memory consolidation modeled on human sleep. The core scheme adds a periodic "sleep phase": the model pauses, rereads recent context several times, writes useful information into fixed-size memory layers (fast weights in state-space blocks), and then clears the short-term attention cache. Because this happens offline, later answers still need only one forward pass. Tests on cellular automata, graph lookup, and GSM-Infinite math problems show that longer sleep improves performance, especially on complex tasks that require deep reasoning. The idea suggests long-horizon agents can forget and reuse efficiently through consolidation instead of carrying raw context indefinitely.

Artificial Analysis @ArtificialAnlys · May 28 · 71

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%.

ITBench-AA's SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM's deep expertise in enterprise IT operations.

Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time.

ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions

Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models

Key findings:
➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench
➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives
➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%

Summary: Artificial Analysis and IBM Research launched ITBench-AA, the first in a new series of benchmarks evaluating AI agents on enterprise IT tasks, starting with Site Reliability Engineering (SRE). It contains 59 Kubernetes incident-response tasks, and every frontier model scores below 50%. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open-weights models, GLM-5.1 (Reasoning) scores 40%, level with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. Turn counts vary by nearly 3x across models, but longer trajectories do not guarantee higher accuracy.
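The scoring rule described in this post ("average precision at full recall") is easy to misread, so here is a small sketch of it. This follows only the wording in the post; the function names are hypothetical, and treating submissions as sets (so duplicates count once) is an assumption.

```python
from typing import Iterable, List, Tuple


def repeat_score(submitted: Iterable[str], ground_truth: Iterable[str]) -> float:
    """Score one repeat: 0.0 unless every ground-truth root cause was submitted,
    otherwise precision = true positives / (true positives + false positives)."""
    submitted_set = set(submitted)  # assumption: duplicate entities count once
    truth = set(ground_truth)
    if not truth:
        raise ValueError("ground truth must contain at least one root cause")
    if not truth <= submitted_set:
        return 0.0  # any missed root cause zeroes the repeat
    # At full recall the true positives are exactly the ground-truth set.
    return len(truth) / len(submitted_set)


def headline_score(runs: List[Tuple[List[str], List[str]]]) -> float:
    """Average the per-repeat scores over all (submitted, ground_truth) pairs."""
    scores = [repeat_score(submitted, truth) for submitted, truth in runs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Entity names are made up for illustration.
    runs = [
        (["deploy/cart"], ["deploy/cart"]),                # exact match
        (["deploy/cart", "svc/redis"], ["deploy/cart"]),   # one false positive
        (["svc/redis"], ["deploy/cart"]),                  # missed root cause
    ]
    print(headline_score(runs))
```

The rule explains the post's observation about over-investigation: naming extra entities can only lower precision, while missing a single root cause forfeits the repeat entirely.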

Qwen @Alibaba_Qwen · May 28 · 69

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 https://pytorch.org/blog/up-to-580tps-new-speed-record-of-qwen3-5-397b-a17b-on-gpu-for-agentic-workloads-with-tokenspeed/ #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

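Summary: Qwen3.5 reached a record 580 tokens per second (tps) on agentic workloads with the TokenSpeed inference engine. The result was achieved jointly by the Qwen inference team, the lightseekorg Foundation TokenSpeed team, NVIDIA, and the Mooncake team, using tri_dao's FlashAttention-4 (FA4) optimization. The milestone pushes the boundary of open-source LLM inference performance; details are in the PyTorch community blog post.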

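Berryxia.AI @berryxia · May 27 · 61

Tencent has a nice new benchmark called Chronicles-OCR. Built by Tencent's HY lab together with four other institutions, it tests how well AI recognizes 3,000 years of ancient Chinese writing.

It has 2,800 expert-annotated images covering seven categories: oracle bone script, bronze inscriptions, seal script, clerical script, regular script, running script, and cursive script. All 28 frontier multimodal models tested failed badly. The strongest VLLM reached only 14% accuracy on oracle bone script, and the best H-mean for end-to-end detection was just 16.5%. GPT-5 and Gemini 2.5 Pro scored close to 0.

More counterintuitively, turning on reasoning mode made performance worse: when perception fails, chain-of-thought amplifies hallucination. The models are not really reading the characters; they are recognizing the medium. Script classification accuracy reaches 96.7%, but that comes from spotting carriers such as turtle shells and bronze vessels, not from understanding the characters on them. AI has so far cracked only a tiny fraction of the value held in this cultural heritage.

Summary: Tencent's HY lab and four other institutions released Chronicles-OCR, a benchmark for AI recognition of ancient Chinese scripts, with 2,800 expert-annotated images across seven categories including oracle bone script and bronze inscriptions. All 28 frontier multimodal models performed poorly: the strongest VLLM reached only 14% accuracy on oracle bone script, and GPT-5 and Gemini 2.5 Pro scored near zero. Notably, enabling reasoning mode hurt performance, because the models are recognizing carriers such as turtle shells and bronze vessels (96.7% script classification accuracy) rather than the characters themselves.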

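Berryxia.AI @berryxia · May 27 · 55

MiniMax has been quiet for a while. Yesterday it looked like M3 was getting ready to launch, and I just noticed MiniMax AI's latest update. They open-sourced the M2 model six months ago, on December 23. In the half year since, the community has picked up several of their core systems directly: CISPO (policy optimization with clipped importance-sampling weights), Forge RL System, and Self-Evolution. Almost every model release hit the top of the Hugging Face charts.

Now they have systematically written up all the work behind M2 as a paper on arXiv. Rather than just releasing weights, they lay out the design thinking, training details, and system architecture. This step matters: what the open-source community often lacks is not new models but a complete, understandable account of why one works.

Ryan Lee, MiniMax Head of DevRel, said in the post that it is time to turn to a new chapter. M3 is on the way, and the MSA paper is coming soon. Instead of stopping at leaderboard results, they are writing down the pitfalls they hit and the approaches they validated over the past six months so that others can avoid detours. That is what actually moves the open-source ecosystem forward.

So, do you think the next stage of open-source LLMs is more competition on parameters and leaderboards, or opening up systems and methodology the way MiniMax has? If M3 builds on this, where do you most hope to see a breakthrough?

Summary: Half a year after open-sourcing M2, MiniMax published a paper systematically covering the work behind it, including design thinking, training details, and system architecture. Its open-source systems CISPO, Forge RL System, and Self-Evolution had already been widely adopted by the community, and several model releases topped the Hugging Face leaderboard. MiniMax also said its next-generation model, M3, is on the way and that the MSA paper is coming soon.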

Saining Xie @sainingxie · May 27 · 69

📸latest in our cambrian series: cambrian-p, p for pose. i think pose is probably the minimal sufficient 3d signal (and it’s easy to get!) that we need for robust video multimodal models -- jointly modeling frames and pose turns image sequences into a globally grounded structure.

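Summary: The post introduces Cambrian-P, a multimodal LLM that natively integrates camera pose. Its central claim is that camera pose is an easily obtained, minimal 3D signal sufficient to support robust video understanding. By jointly modeling video frames and pose, the model turns image sequences into a globally grounded structure. The quoted post notes that current multimodal LLMs are good at recognizing activities in video but still fall short on spatial structure and ego/object dynamics, and that camera pose is the key missing piece for closing this gap.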

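karminski-牙医 @karminski3 · May 27 · 69

What?! Skills can be "trained" now?

Until now, people mostly had AI write skills from experience and called it done once a few debug runs showed no bugs. But is a skill good just because it runs? Microsoft, together with Shanghai Jiao Tong University, Fudan, Tongji, and others, has released a new framework, SkillOpt, that has AI evaluate how well a skill is written and keep optimizing it. In the end, skills written by this framework raised GPT-5.5's direct-chat accuracy by 23.5 points.

How it works is simple: close the loop on skill iteration with a harness. As soon as the LLM finishes writing a skill, it goes into a scoring run, and only skill changes that score higher are kept, much like reinforcement learning for LLMs.

The design is worth borrowing for anyone building agent frameworks. It uses a separate optimizer model to write the skill; based on scores from the agent's trial-and-error task runs, it edits the skill (adding, deleting, or replacing text). Then comes the harness step: each text edit must improve the score on an independent validation set before it can be merged.

Finally, the best part: the framework borrows from deep learning training and defines a text-level learning-rate budget, which limits the LLM to changing only a small part of the skill at a time, iterating slowly instead of rewriting everything. This is where the paper's most valuable data is: a budget of 4 to 8 edit operations per step works best, and the final best skill often contains only 1 to 4 accepted core changes. They also designed a rejected-edit buffer that stores negative examples from training, plus periodic slow/meta updates: after each cycle the framework takes stock, which works like a form of memory and helps sustain later iterations.

The paper's conclusion is striking: a skill (prompt) deserves, and needs, a system-level training process. The paper's own framing is that skills should be "trained" as the external state of a frozen agent, with a training process that has the repeatability of weight-space optimization. Does that mean the line between prompt engineering and model training will gradually blur, with prompt engineering moving fully into machine learning? Maybe soon we will no longer need humans to tweak and debug prompts by hand.

Paper: http://arxiv.org/pdf/2605.23904

#skillopt #微软 #提示词工程 #harness

Summary: Microsoft, with Shanghai Jiao Tong University and other institutions, released the SkillOpt framework for systematically optimizing AI agent skills through a machine-learning-style process. It introduces a separate optimizer model that edits skills in a closed harness loop, and each edit is accepted only if it raises the score on a validation set. The framework sets a learning-rate budget of 4 to 8 edit operations per step, which keeps the accepted core changes to between 1 and 4. Experiments show the optimized skills raise GPT-5.5's direct-chat accuracy by 23.5 points.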

Ant Ling @AntLingAGI · May 26 · 69

From IcePop to KPop — our team keeps pushing on RL training stability for large MoE models. 👇 KPop replaces the fixed-ratio mask with an adaptive binary-KL region that matches each token's inherent noise. More robust updates, stable long-horizon agentic RL. Ring-2.6-1T → 76+ on SWE-bench Verified, pure RL. Congrats to @Jia__Guo & team! Blog: https://ringtech.notion.site/kpop

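Summary: The team released KPop, a technique for stabilizing reinforcement learning on large MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive binary-KL region that matches each token's inherent noise, giving more robust parameter updates and supporting long-horizon agentic RL. In practice, the trillion-parameter Ring-2.6-1T model scored above 76 on SWE-bench Verified using pure RL training, with no infrastructure changes and no routing replay. KPop achieves this with a single key parameter.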

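Summary: The team introduced KPop to stabilize agentic RL training for large MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive masking mechanism based on binary KL divergence, which adjusts dynamically to the degree of training-inference mismatch during training. With this change, Ring-2.6-1T scored above 76 on SWE-bench Verified through pure RL training, without infrastructure changes or routing replay.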

Rohan Paul @rohanpaul_ai · May 26 · 61

Brilliant new paper from Meta, CMU and other labs. It shows that coding agents improve faster by manufacturing their own software experience: they can train themselves by making and fixing bugs inside real projects.

Most coding agents still learn from human leftovers: issues, pull requests, tests, comments, and benchmarks that describe what went wrong. That is useful, but it makes the agent dependent on the rate at which humans produce clean, verifiable lessons.

Self-play SWE-RL changes the unit of learning from a labeled task to an executable situation. One version of the model explores a real codebase, weakens tests, injects a meaningful bug, and leaves behind test artifacts that define the failure without needing an English issue description. Another version of the same model has to repair the system, not by matching words to patches, but by restoring behavior under tests.

Here's the key point: the test is not just a grader here, it is the language of the problem. That matters because software understanding lives in constraints, dependencies, edge cases, and invariants that prose often compresses or misses.

The reported gains, +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro, are early but hard to ignore because evaluation still used natural-language issues the self-play system did not train on. That suggests SSR (Self-play SWE-RL) is learning something deeper than issue phrasing, though not yet anything like open-ended mastery. The restraint matters: generated bugs can be artificial, rewards can be noisy, and sandboxed repositories are still a narrow slice of software reality.

Still, the direction is sharp. The next bottleneck for coding agents may not be more human-written tasks, but more ways for agents to encounter, create, survive, and learn from failure.

Paper Link: arxiv.org/abs/2512.18552
Paper Title: "Toward Training Superintelligent Software Agents through Self-Play SWE-RL"

Summary: Meta, CMU, and other institutions propose Self-play SWE-RL, in which a coding agent generates its own training data through self-play instead of relying only on human-labeled issues. One model explores a codebase, injects a bug, and leaves behind tests that describe the problem; another learns to repair the system against those tests, so the tests become the core language for describing the problem. The method gains +10.4 points on SWE-bench Verified and +7.8 on SWE-Bench Pro. Notably, evaluation used natural-language issues the system never trained on, which suggests it may have learned a deeper kind of software understanding.

Ant Ling @AntLingAGI · May 26 · 62

SwiGLU is everywhere in modern LLMs, but for large inputs it behaves like x². That quadratic blow-up inflates activations, amplifies outliers, and makes deep network or low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a drop-in activation built for stable large-scale pre-training. 🧵
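As a quick check of the x² claim in this post, here is a scalar sketch. It assumes the common SwiGLU form SiLU(w·x) · (v·x), with hypothetical scalar weights w = v = 1 standing in for the weight matrices of a real layer. It says nothing about PowLU itself, whose definition is not given in the post.

```python
import math


def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def scalar_swiglu(x: float, w: float = 1.0, v: float = 1.0) -> float:
    """Scalar stand-in for SwiGLU: a SiLU-gated branch times a linear branch."""
    return silu(w * x) * (v * x)


if __name__ == "__main__":
    for x in (1.0, 10.0, 50.0, 100.0):
        y = scalar_swiglu(x)
        # The ratio y / x**2 equals sigmoid(x), which saturates toward 1.
        print(x, y, y / (x * x))
```

For large positive x the sigmoid gate saturates near 1, so the output tracks x² and doubling the input roughly quadruples the activation.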

X.PIN @thexpin · May 26 · 67

Huawei plans to scale AI chips without smaller nodes. A new paper by Huawei's He Tingbo, "A Time Scaling Theory for Multi-Layer Electronic Systems," outlines how they'll advance Ascend AI chips as transistor shrinking slows down. Instead of next-gen lithography, Huawei will scale its Ascend SuperPoD line through ~2030 by packing mature tech across the 2025 910C, 2026 950, and 990: 🔹 Chiplets 🔹 2.5D fan-out packaging 🔹 3D stacking (via micro-bumps & hybrid bonding) Around 2030, Ascend 990 will debut LogicFolding in AI accelerators, aiming for a 100x integration leap by 2035.

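Summary: Huawei will scale its Ascend AI chips through packaging and architecture innovation rather than smaller process nodes. According to He Tingbo's paper, Huawei plans to advance its Ascend SuperPoD line from 2025 to 2030 using chiplets, 2.5D fan-out packaging, and 3D stacking, with the 910C in 2025, the 950 in 2026, and the 990 to follow. Around 2030, the Ascend 990 will introduce LogicFolding, targeting a 100x integration leap by 2035.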

Rohan Paul @rohanpaul_ai · May 26 · 59

One engineering challenge in dexterous robot hands is balancing strength and speed. Here is a SharpaWave performing rapid hand cycles at over 4x/sec. The Dynamic Tactile Array uses visuo-tactile sensing: each fingertip integrates a camera and 1,000+ tactile pixels.

Rohan Paul@rohanpaul_ai · 5月26日65

This Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working layer. The problem is that an LLM by itself is mostly a text predictor, so long tasks can lose state, hide mistakes, and turn plans into actions in fragile ways. The real advance is not “AI writes code,” but “AI uses code as the environment it thinks inside.” The authors call the surrounding system an agent harness, meaning the tools, memory, sandboxes, checks, and feedback loops that turn a model into an agent. Their core idea is that code should sit at the center of that harness, because code can be run, inspected, checked, saved, edited, and shared. Tests become sensors. Repositories become memory. Logs become history. Sandboxes become boundaries. A generated script is no longer merely an answer; it is a handle the system can run, check, revise, share, and roll back. The main finding is a pattern across many fields: code helps agents reason through executable steps, act through tool calls or control programs, and model environments through tests, traces, logs, repositories, and simulators. ---- Paper Link – arxiv. org/abs/2605.18747 Paper Title: "Code as Agent Harness"

elvis@omarsar0 · May 25 · 66

New research from Microsoft Research. I see a lot of AI engineers handwriting agent skill docs and hoping they generalize. Probably not optimal. This work shows why. It treats the skill doc as a trainable external state of a frozen agent instead. It introduces SkillOpt, where an optimizer model makes validation-gated edits to the skill file. It adds, deletes, or replaces instructions, with a textual learning rate that controls how aggressively each round rewrites the doc. The agent itself never changes. SkillOpt is best or tied on all 52 (model, benchmark, harness) cells. On GPT-5.5 it adds 23.5 points in direct chat, 24.8 with Codex, and 19.1 with Claude Code over no skill. It beats human-written skills, TextGrad, GEPA, and EvoSkill, carries zero extra inference-time cost, and the learned skills transfer across models and harnesses. Paper: https://arxiv.org/abs/2605.23904 Learn to build effective AI agents in our academy: https://academy.dair.ai/

All AI Updates

The full feed of AI-related news

June 1

21:09

AK@_akhaliq

58

GrepSeek: Training Search Agents for Direct Corpus Interaction

Agents · Retrieval augmentation · Search · Papers/Research

14:00

OpenClaw🦞@openclaw

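Featured 72

In collaboration with @nvidia, we're open-sourcing a dataset of security scans for 67,453 ClawHub skills on @huggingface: - NVIDIA SkillSpector flagged 1/2 for agentic risk - Only 0.31% were malicious - No two scanners agreed on more than 8.5% of risks https://openclaw.ai/blog/openclaw-nvidia-skill-security

Agents · Hugging Face · Safety/Alignment · Papers/Research

Why it's recommended: OpenClaw and NVIDIA open-sourced scan results for about 67,000 agent skills. Half were flagged as risky, yet only about 0.3% were actually malicious, and the scanners barely agreed with one another. Worth a look for anyone working on agent security.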

01:48

elvis@omarsar0

60

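The paper argues that when AI agents reuse the same documents and histories across many turns, a fixed context strategy is not optimal. It proposes the "Efficiency Frontier" framework, which models context strategy selection as a single cost-performance trade-off. Sweeping the reuse parameter N exposes crossover regions where retrieval, compression, or full context each wins. On 5,000 HotpotQA instances, deployment-aware selection cuts effective token usage by about 25% at the same performance, and amortized memory compression runs over 50% cheaper than full-context prompting in higher-performance settings.

Agents · arXiv · Retrieval augmentation · Papers/Research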
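
The amortization idea in this entry can be illustrated with a toy calculation. This is a minimal sketch and not the paper's model: the three strategies, their token costs, and the `effective_cost` helper are hypothetical, all strategies are assumed to reach the same performance, and the paper's log-utility term is left out. It only shows how the cheapest strategy can change as the reuse count N grows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    preprocess_tokens: int  # one-time cost per document set (hypothetical value)
    per_query_tokens: int   # cost paid on every query (hypothetical value)


def effective_cost(s: Strategy, n_reuse: int) -> float:
    """Average tokens per query when preprocessing is amortized over n_reuse queries."""
    return s.preprocess_tokens / n_reuse + s.per_query_tokens


# Illustrative numbers only; they are not taken from the paper.
STRATEGIES = [
    Strategy("full_context", preprocess_tokens=0, per_query_tokens=8000),
    Strategy("retrieval", preprocess_tokens=20000, per_query_tokens=2000),
    Strategy("compression", preprocess_tokens=60000, per_query_tokens=1000),
]


def cheapest(n_reuse: int) -> str:
    """Name of the strategy with the lowest amortized cost at this reuse count."""
    return min(STRATEGIES, key=lambda s: effective_cost(s, n_reuse)).name


# Sweep N to see where the preferred strategy crosses over.
for n in (1, 5, 50):
    print(n, cheapest(n))
```

With these made-up costs, full context is cheapest for a single query, retrieval takes over at moderate reuse, and compression only pays off once its preprocessing is spread over many queries.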

May 30

18:46

Rohan Paul@rohanpaul_ai

69

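RNG: a flat data center network deployed at scale

Amazon has introduced a new data center network architecture called "Resilient Network Graphs" (RNG). The design replaces the traditional tree-shaped network with a flat, quasi-random graph, and spreads traffic over multiple independent paths using the Spraypoint routing system and ShuffleBox cabling devices. In tests, RNG matched traditional fat-tree networks on performance while requiring 69% less hardware and delivering 33% higher throughput, with estimated cost savings of 9% to 45%. It is now the default network for most AWS workloads, and its ability to spread load helps AI cluster training efficiency.

Papers/Research · Deployment/Engineering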

01:14

Fei-Fei Li@drfeifei

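Featured 83

I'm very excited about this visual generation benchmark dataset for the new era of large-scale generative models! 🤩

Keshigeyan Chandrasegaran: 1/ Introducing GPIC: a Giant Permissive Image Corpus and benchmark for visual generation! 🚀100M VLM-captioned image-tex...

Hugging Face · Image generation · Data/Training · Papers/Research

Why it's recommended: With Fei-Fei Li herself endorsing it, this dataset is clearly significant. The key point is that it fully permits commercial use, so teams working on visual generation finally get a massive training corpus without copyright headaches.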

00:45

AK@_akhaliq

55

DynaFLIP: rethinking robot perception through tri-modal, dynamics-guided representations

arXiv · Embodied AI · Multimodal · Papers/Research

00:15

AK@_akhaliq

62

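Qwen-VLA: unified vision-language-action modeling across tasks, environments, and robot embodiments

Embodied AI · Multimodal · Open-source ecosystem · Papers/Research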

00:15

AK@_akhaliq

54

OmniRetrieval: unified retrieval across heterogeneous knowledge sources

Retrieval augmentation · Papers/Research

May 29

23:45

AK@_akhaliq

61

AgentDoG 1.5: a lightweight, scalable alignment framework for AI agent safety and security

Agents · Safety/Alignment

17:15

Rohan Paul@rohanpaul_ai

60

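SkillOpt: an execution strategy for self-evolving agent skills

Microsoft proposes SkillOpt, a method for improving how AI agent skills are optimized. The core idea is to treat a standalone skill document as the thing being optimized, instead of modifying the underlying LLM. The agent attempts tasks, successes and failures are analyzed, and a stronger optimizer model then makes small edits to the skill document. An edit is accepted only if it improves performance on a validation set, which keeps the skill improving steadily. Across 52 test cells spanning 6 benchmarks, 7 target models, and 3 agent settings (direct chat, Codex, and Claude Code), SkillOpt is best or tied for best in every one. On GPT-5.5 it raises average direct-chat accuracy by 23.5 points. The resulting skill file is readable, portable, and reusable, and deploying it requires no model retraining.

Agents · Microsoft · Data/Training · Papers/Research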
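
The validation-gated loop described in this entry can be sketched as a small hill climb over a text file. This is a toy illustration under stated assumptions, not the SkillOpt implementation: `propose_edit` is a random stand-in for the optimizer model, `validate` is a made-up scoring rule in place of running an agent on held-out tasks, and the instruction strings are hypothetical. Each round applies a single edit, so the textual learning rate mentioned in the posts is not modeled.

```python
import random
from typing import Callable, List, Tuple

SkillDoc = List[str]  # one instruction per line

# Hypothetical instruction pool and scoring rule, for illustration only.
HELPFUL = {"check units", "cite sources"}
CANDIDATES = ["check units", "cite sources", "guess quickly"]


def validate(doc: SkillDoc) -> int:
    """Toy validation score: reward distinct helpful lines, penalize every other line."""
    return len(HELPFUL & set(doc)) - sum(line not in HELPFUL for line in doc)


def propose_edit(doc: SkillDoc, candidates: List[str], rng: random.Random) -> SkillDoc:
    """Stand-in for the optimizer model: add, delete, or replace one instruction."""
    new_doc = list(doc)
    op = rng.choice(["add", "delete", "replace"])
    if op == "add" or not new_doc:
        new_doc.append(rng.choice(candidates))
    elif op == "delete":
        new_doc.pop(rng.randrange(len(new_doc)))
    else:
        new_doc[rng.randrange(len(new_doc))] = rng.choice(candidates)
    return new_doc


def optimize_skill(
    doc: SkillDoc,
    candidates: List[str],
    score_fn: Callable[[SkillDoc], int],
    rounds: int = 200,
    seed: int = 0,
) -> Tuple[SkillDoc, int]:
    """Keep an edit only when it strictly improves the validation score."""
    rng = random.Random(seed)
    best, best_score = list(doc), score_fn(doc)
    for _ in range(rounds):
        trial = propose_edit(best, candidates, rng)
        trial_score = score_fn(trial)
        if trial_score > best_score:  # the validation gate
            best, best_score = trial, trial_score
    return best, best_score


final_doc, final_score = optimize_skill(["guess quickly"], CANDIDATES, validate)
print(final_doc, final_score)
```

Because rejected edits are discarded, the score can never drop below its starting value, and the agent (here, whatever `validate` stands in for) is never modified; only the document changes.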

09:44

Rohan Paul@rohanpaul_ai

65

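When does LeJEPA learn a world model?

A new paper from Yann LeCun's team examines the conditions under which LeJEPA learns the true hidden variables of the world. Its central result is that LeJEPA can reliably learn them only when the true hidden variables form a Gaussian cloud. The paper proves that when these hidden variables are independent Gaussians and the paired views are generated by a stable noise process, LeJEPA's optimal solution recovers them up to rotation or reflection. The work offers a theoretical account of when a self-supervised model actually captures the structure of the world, as opposed to just extracting features that happen to work on a test set.

Meta · Multimodal · Papers/Research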

06:44

Chubby♨️@kimmonismus

37

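Honestly, this made me laugh, but it doesn't surprise me at all. Researchers at Emergence AI had different AI models run simulated societies, and the results were, well, as expected: Claude built the most stable world with zero crime, Grok collapsed into extinction within four days, and Gemini produced hundreds of crimes.

Safety/Alignment · Papers/Research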

02:44

Rohan Paul@rohanpaul_ai

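Featured 81

hexoai open-sources the SIA framework: recursive self-improvement for AI agents

hexoai has open-sourced the SIA (Self-Improving AI) framework. It shows that an AI agent can not only optimize its external workflow (harness) but also update its own model weights directly from task feedback, improving its domain knowledge and capabilities autonomously instead of relying only on human-provided prompts or tool improvements. The paper reports a 56.6% improvement on the LawBench benchmark, a 91.9% reduction in GPU kernel runtime, and a 502% improvement over the baseline on single-cell RNA denoising.

Kunal Bhatia: Superintelligence will be built on Self Improvement. Today @hexoai, we're excited to release 'SIA' - an open-source Self...

Agents · Data/Training · Papers/Research

Why it's recommended: This goes beyond swapping prompts: SIA updates the model's own weights, with reported gains of 56.6% on LawBench, a 91.9% reduction in GPU kernel runtime, and 502% on single-cell RNA denoising. Being open source, it could prompt a rethink of how agents are developed.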

00:13

AK@_akhaliq

58

GEM: generative supervision for embodied intelligence

Embodied AI · Papers/Research

00:08

elvis@omarsar0

63

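FluxMem: recasting AI agent memory as a dynamically evolving graph topology

The paper proposes FluxMem, an AI agent memory architecture whose core idea is to treat memory as a continuously evolving heterogeneous graph topology. The framework runs in three parallel stages: initial connection formation, feedback-based refinement, and long-term consolidation of repeatedly successful trajectories into reusable procedural circuits. During execution it repairs missing links, prunes distracting information, and adjusts abstraction granularity. The method reaches state of the art on three different memory benchmarks: LoCoMo, Mind2Web, and GAIA.

Agents · arXiv · Papers/Research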

May 28

23:43

AK@_akhaliq

54

SkillOpt: an execution strategy for self-evolving agent skills

Agents · Papers/Research

23:43

AK@_akhaliq

48

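ProRL: effective reinforcement learning for proactive recommendation via corrected policy gradient estimation

Data/Training · Papers/Research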

23:43

AK@_akhaliq

55

Exploratory policy optimization for multimodal agentic reasoning

Agents · arXiv · Multimodal · Inference/Reasoning

23:12

AK@_akhaliq

49

Contrastive distribution matching for amortized sequential Monte Carlo in discrete diffusion

arXiv · Papers/Research

23:12

AK@_akhaliq

64

PhysX-Omni: a unified, simulation-ready physical 3D generative model that supports rigid, deformable, and articulated objects

Embodied AI · Papers/Research

23:12

AK@_akhaliq

54

MRT: a masked region Transformer for large-scale layered image generation and editing

Image generation · Papers/Research

20:11

Rohan Paul@rohanpaul_ai

62

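Study finds that AI agents "age" and lose reliability, and proposes the AgingBench benchmark

The paper argues that after deployment, an AI agent's memory system gradually "ages" as it is summarized, stored, updated, and maintained, so information gets lost, confused, outdated, or corrupted. The agent still appears to work, but its reliability has quietly declined. The authors propose AgingBench, a benchmark for evaluating an agent's sustained reliability across multiple sessions. They compare agents to aging infrastructure and stress that simply adding more memory is not a solution.

Agents · Papers/Research · Deployment/Engineering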

19:11

Rohan Paul@rohanpaul_ai

71

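Diffusion Transformers train 8.75x faster by rethinking residual connections

Conventional Diffusion Transformers train inefficiently because the way information passes between layers is fixed. The team proposes Diffusion-Adaptive Routing, which lets each layer dynamically choose which earlier layers' outputs to use, with the choice varying by denoising timestep. The method adds no new dataset, loss function, or attention mechanism; by changing only the residual connections, it reaches the same image quality in 8.75x fewer training iterations.

arXiv · Image generation · Data/Training · Papers/Research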

12:36

Ethan Mollick@emollick

55

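There has been plenty of discussion of the stylistic tells of AI writing (em dashes and so on), but this paper looks at the tells in AI storytelling. There are fascinating differences between AI and human narratives, and asking the AI to write in a different style does not change them much. https://arxiv.org/abs/2604.03136

arXiv · Data/Training · Papers/Research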

10:38

AK@_akhaliq

65

Gamma-World: generative multi-agent world modeling beyond two-player settings

Agents · arXiv · Papers/Research

10:07

Rohan Paul@rohanpaul_ai

65

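Periodic pauses to consolidate memory may improve long-horizon language agents

To address Transformer agents becoming slower and more expensive as context keeps growing, the paper proposes memory consolidation modeled on human sleep. The core idea is to add periodic "sleep phases": the model pauses, rereads recent context several times, writes the useful information into a fixed-size memory layer (such as the fast weights of a state-space block), and then clears its short-term attention cache. Because this happens offline, later answers still need only one forward pass. Tests on cellular automata, graph lookup, and GSM-Infinite math problems show that longer sleep improves performance, especially on complex tasks that need deep reasoning. The approach suggests that long-horizon agents could forget and reuse efficiently through consolidation instead of carrying raw context forever.

Agents · arXiv · Inference/Reasoning · Papers/Research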
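
The pause, reread, write, and clear cycle described in this entry can be mimicked with a very small data structure. This is a loose analogue and not the paper's method: a frequency counter stands in for the fixed-size memory layer, a plain list stands in for the attention cache, and `ConsolidatingAgent`, `memory_slots`, `sleep_every`, and `passes` are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import List


class ConsolidatingAgent:
    """Toy agent that periodically folds its short-term context into a bounded memory."""

    def __init__(self, memory_slots: int = 3, sleep_every: int = 6) -> None:
        self.memory_slots = memory_slots  # fixed size of the long-term store
        self.sleep_every = sleep_every    # how many observations trigger a sleep phase
        self.context: List[str] = []      # short-term buffer (stands in for the cache)
        self.memory: Counter = Counter()  # long-term store (stands in for fast weights)

    def observe(self, item: str) -> None:
        """Append to short-term context and sleep once the buffer is full."""
        self.context.append(item)
        if len(self.context) >= self.sleep_every:
            self.sleep()

    def sleep(self, passes: int = 1) -> None:
        """Reread the buffer `passes` times, keep the top entries, then clear the buffer."""
        for _ in range(passes):
            self.memory.update(self.context)
        kept = self.memory.most_common(self.memory_slots)
        self.memory = Counter(dict(kept))
        self.context.clear()


agent = ConsolidatingAgent(memory_slots=2, sleep_every=4)
for token in "aabc":
    agent.observe(token)
print(agent.memory, agent.context)
```

The point of the sketch is the bound: no matter how many items are observed, the agent holds at most `memory_slots` long-term entries plus fewer than `sleep_every` buffered ones.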

02:38

Artificial Analysis@ArtificialAnlys

71

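Artificial Analysis and IBM launch the first enterprise IT benchmark for AI agents

Artificial Analysis and IBM Research have launched ITBench-AA, the first benchmark for evaluating AI agents on enterprise IT tasks, starting with site reliability engineering (SRE). The benchmark contains 59 Kubernetes incident-response tasks, and no frontier model scored above 50%. Claude Opus 4.7 leads at 47%, followed by GPT-5.5 at 46% and Qwen3.7 Max at 42%. Among open models, Zhipu's GLM-5.1 (Reasoning) scored 40%, level with Gemini 3.5 Flash, and DeepSeek V4 Pro scored 38%. The analysis also found that the number of turns models take varies by nearly 3x, but more turns do not guarantee higher accuracy.

Agents · Evaluation/Benchmarks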

01:02

Qwen@Alibaba_Qwen

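Featured 69

Fast, faster, Qwen. 🚀

On the TokenSpeed inference engine, Qwen3.5 reached a record 580 tokens per second (tps) on agentic workloads. The result was a joint effort by the Qwen inference team, the lightseekorg Foundation TokenSpeed team, NVIDIA, and the Mooncake team, and it uses tri_dao's FlashAttention-4 (FA4) optimizations. The milestone pushes the boundary of open-source LLM inference performance; details are in the PyTorch community blog.

PyTorch: The speed-of-light optimization for Qwen3.5 on the TokenSpeed inference engine is a significant milestone, achieving a r...

Agents · Open source/Repos · Inference/Reasoning · Papers/Research

Why it's recommended: Qwen3.5 hitting 580 tps on TokenSpeed pushes the limit of open-source LLM inference and is a real performance jump for agent-style applications. The PyTorch blog post is worth a close read for anyone doing inference deployment.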

May 27

21:27

Berryxia.AI@berryxia

61

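Tencent HY Lab releases the Chronicles-OCR benchmark

Tencent HY Lab and four other institutions released Chronicles-OCR, a benchmark that specifically tests AI recognition of ancient Chinese scripts. It contains 2,800 expert-annotated images across seven categories, including oracle bone script and bronze inscriptions. All 28 frontier multimodal models tested performed poorly: the best VLLM scored only 14% on oracle bone script, and GPT-5 and Gemini 2.5 Pro scored near zero. Notably, turning on reasoning mode hurt performance, because the models were actually recognizing the carrier object, such as turtle shell or bronzeware (96.7% accuracy), and not the characters themselves.

ModelScope: The best VLLM scores only 14% on oracle bone script recognition. Chronicles-OCR, a new ancient Chinese character benchma...

Multimodal · Papers/Research · Evaluation/Benchmarks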

20:27

Berryxia.AI@berryxia

55

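MiniMax publishes the M2 paper and teases upcoming M3 and MSA research

Half a year after open-sourcing the M2 model, MiniMax has published a paper consolidating all the work behind it, covering design rationale, training details, and system architecture. Its previously open-sourced CISPO, Forge RL System, and Self-Evolution have been widely adopted by the community, and several of its model releases topped the HuggingFace leaderboard. MiniMax also announced that it is ready for its next-generation model, M3, and that the MSA paper is coming soon.

RyanLee: Recently, we took time to consolidate all of the work behind M2 and published it here: our M2 paper on arXiv It's been j...

Open-source ecosystem · Data/Training · Papers/Research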

10:31

Saining Xie@sainingxie

69

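The post introduces Cambrian-P, a multimodal LLM that natively integrates camera pose. Its core claim is that camera pose is a minimal 3D signal that is easy to obtain and sufficient to support robust video understanding. By jointly modeling video frames and poses, the model turns an image sequence into a globally structured representation. The quoted post notes that today's MLLMs excel at recognizing activities in video but still struggle with spatial structure and ego/object dynamics, and that camera pose is the key missing piece for closing that gap.

Jihan Yang: Camera pose matters for video understanding! Today's MLLMs excel at recognizing activities, but still struggle with the ...

Multimodal · Papers/Research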

07:21

karminski-牙医@karminski3

69

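Microsoft and collaborators release SkillOpt, a framework that optimizes AI agent skills with an ML-style pipeline

Microsoft, together with Shanghai Jiao Tong University and other institutions, released SkillOpt, a framework for systematically optimizing AI agent skills with a machine-learning-style pipeline. It introduces a separate optimizer model that edits the skill through a closed harness loop, and an edit is accepted only if it raises the score on a validation set. The framework sets a learning-rate budget of 4 to 8 edit operations per step, which keeps the core changes to between 1 and 4. In experiments, the optimized skill raised GPT-5.5's direct-chat accuracy by 23.5 points.

Agents · arXiv · Microsoft · Data/Training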

May 26

23:59

Ant Ling@AntLingAGI

69

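The team released KPop, a technique for stabilizing reinforcement learning on large-scale MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive binary KL-divergence region matched to each token's inherent noise, giving more robust parameter updates and supporting long-horizon, agentic RL training. In practice, the trillion-parameter Ring-2.6-1T model, trained with pure RL and no changes to infrastructure or routing replay, scored above 76 on SWE-bench Verified. KPop achieves this with a single key parameter.

Jia Guo: Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes!🥳 Last year, we releas...

Agents · Data/Training · Papers/Research

Related discussions (4): Ant inclusionAI: new HuggingFace models · HuggingFace Daily Papers (community trending papers) · WeChat official account: Ant Ling · X: Ant Ling (@AntLingAGI)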
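
To make the contrast between a fixed-ratio mask and a binary-KL region concrete, here is a hedged sketch. The exact KPop rule is not given in this feed, so everything specific is an assumption: the band `[low, high]`, the threshold `delta`, the direction of the KL divergence, and the probabilities in the example are all illustrative. The only standard piece is the formula for the KL divergence between two Bernoulli distributions.

```python
import math
from typing import List


def binary_kl(p: float, q: float, eps: float = 1e-12) -> float:
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def ratio_mask(p_train: List[float], p_infer: List[float],
               low: float = 0.5, high: float = 2.0) -> List[bool]:
    """Fixed-band rule (assumed form): keep a token if p_train / p_infer is in [low, high]."""
    return [low <= pt / pi <= high for pt, pi in zip(p_train, p_infer)]


def kl_mask(p_train: List[float], p_infer: List[float],
            delta: float = 0.05) -> List[bool]:
    """KL-region rule (assumed form): keep a token if its binary KL is at most delta."""
    return [binary_kl(pi, pt) <= delta for pt, pi in zip(p_train, p_infer)]


# Hypothetical per-token probabilities from the training and inference engines.
p_train = [0.90, 0.003, 0.5]
p_infer = [0.99, 0.001, 0.5]

print(ratio_mask(p_train, p_infer))
print(kl_mask(p_train, p_infer))
```

On these made-up numbers the two rules disagree on the first two tokens: the ratio band keeps the high-probability token and drops the rare one, while the KL region does the opposite, because a 3x ratio between two tiny probabilities is a small divergence and a 0.09 gap near probability 1 is a large one.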

23:29

Ant Ling@AntLingAGI

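Same event · Featured 68

The team introduced KPop to stabilize agentic reinforcement learning on large-scale MoE models. It replaces the fixed-ratio mask of the earlier IcePop method with an adaptive masking mechanism based on binary KL divergence, which adjusts dynamically to the degree of training-inference mismatch during training. With this change, the Ring-2.6-1T model scored above 76 on SWE-bench Verified using pure RL training, with no changes to infrastructure or routing replay.

Jia Guo: Curious about the secret sauce behind our trillion-scale agentic foundation model? Here it comes!🥳 Last year, we releas...

Agents · Data/Training · Coding · Papers/Research

Same event; featured item: "Ant inclusionAI launches the trillion-parameter reasoning model Ring-2.6-1T"

Why it's recommended: The Ant team upgraded IcePop to KPop, moving from a fixed mask to an adaptive KL region, which is a neat idea. Ring-2.6-1T reaches 76+ on SWE-bench Verified with pure RL; anyone doing agentic RL training should skim the blog.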

23:03

Rohan Paul@rohanpaul_ai

61

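Paper proposes Self-play SWE-RL, which improves software agents through self-play

Meta, CMU, and other institutions propose Self-play SWE-RL. The method has coding agents generate their own training data through self-play instead of relying only on human-written issues. One model explores a codebase, injects a bug, and leaves behind test cases that describe the problem; another model learns to repair the system based on those tests. Tests thus become the core language for describing problems. The method gains +10.4 points on SWE-bench Verified and +7.8 points on SWE-Bench Pro. Notably, the evaluation used natural-language issues the system was never trained on, which suggests it may have learned a deeper understanding of software.

Agents · arXiv · Meta · Coding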

22:28

Ant Ling@AntLingAGI

62

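SwiGLU is ubiquitous in modern LLMs, but for large inputs it behaves like x². This quadratic growth inflates activations, amplifies outliers, and makes deep networks or low-precision (FP8/FP4) training prone to loss spikes. We propose PowLU, a drop-in activation function designed to stabilize large-scale pretraining. 🧵

Inference/Reasoning · Data/Training · Papers/Research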
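
The quadratic growth claim is easy to check numerically. The sketch below uses a scalar stand-in for SwiGLU, `silu(w*x) * (v*x)`, with both weights fixed to 1 as a simplifying assumption; real layers use weight matrices. It does not implement PowLU, whose definition is not given in this post.

```python
import math


def silu(x: float) -> float:
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_scalar(x: float, w: float = 1.0, v: float = 1.0) -> float:
    """Scalar SwiGLU: silu(w * x) * (v * x). With w = v = 1 this equals x**2 * sigmoid(x)."""
    return silu(w * x) * (v * x)


# The ratio to x**2 is sigmoid(x), which approaches 1 as x grows.
for x in (1.0, 10.0, 100.0):
    y = swiglu_scalar(x)
    print(x, y, y / x**2)
```

Since the ratio tends to 1, a 10x larger positive input produces roughly a 100x larger activation, which is the outlier amplification the thread describes.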

18:28

X.PIN@thexpin

67

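Huawei AI chips: a scaling path around process-node limits

Huawei plans to scale its Ascend AI chips through packaging and architectural innovation instead of smaller process nodes. According to He Tingbo's paper, between 2025 and 2030 Huawei will advance its Ascend SuperPoD line using chiplets, 2.5D fan-out packaging, and 3D stacking, across the 2025 910C, the 2026 950, and the later 990. Around 2030, the Ascend 990 will introduce LogicFolding, targeting a 100x integration leap by 2035.

On-device · Papers/Research · Deployment/Engineering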

15:00

Rohan Paul@rohanpaul_ai

59

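One engineering challenge in dexterous robot hands is balancing strength and speed. Here is a SharpaWave performing rapid hand cycles at over 4 per second. The Dynamic Tactile Array uses visuo-tactile sensing: each fingertip integrates a camera and 1,000+ tactile pixels.

Embodied AI · Multimodal · Papers/Research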

04:58

Rohan Paul@rohanpaul_ai

65

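AI agents perform better when code is their main working layer

A survey paper from Meta, Stanford, and Illinois argues that AI agents work better when code is their main working layer. An LLM on its own is mostly a text predictor, so on long tasks it can lose state and hide mistakes. The real advance is not "AI writes code" but "AI thinks inside a code environment." The paper centers on a code-first "agent harness," meaning the surrounding tools, memory, sandboxes, and related systems. In this harness, tests become sensors, repositories become memory, logs become history, and sandboxes become boundaries. A generated script becomes a handle that can be run, checked, revised, and shared. The overall finding is that code helps agents reason through executable steps, act through tool calls, and model their environment through tests, logs, and similar artifacts.

Agents · arXiv · Meta · Coding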
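
A very small harness makes the "tests as sensors, repositories as memory, logs as history" framing concrete. This is an illustrative sketch, not anything from the paper: `Harness`, `run_candidate`, and the `add` example are hypothetical, and `exec` in a fresh dictionary is only a stand-in for a sandbox and provides no real isolation.

```python
from typing import Callable, Dict, List, Optional

Namespace = Dict[str, object]


def run_candidate(source: str) -> Namespace:
    """Execute generated source in a fresh namespace (NOT a real sandbox)."""
    namespace: Namespace = {}
    exec(source, namespace)
    return namespace


class Harness:
    """Keeps only candidates that pass the test; the last accepted one stays current."""

    def __init__(self, test: Callable[[Namespace], bool]) -> None:
        self.test = test              # the "sensor"
        self.history: List[str] = []  # accepted versions, the "repository"
        self.log: List[str] = []      # one entry per submission, the "history"

    def submit(self, source: str) -> bool:
        try:
            passed = bool(self.test(run_candidate(source)))
        except Exception as exc:  # a crashing candidate counts as a failure
            self.log.append(f"error: {exc!r}")
            return False
        self.log.append("accepted" if passed else "rejected")
        if passed:
            self.history.append(source)
        return passed

    def current(self) -> Optional[str]:
        """Latest accepted version; a rejected submission leaves this unchanged."""
        return self.history[-1] if self.history else None


GOOD = "def add(a, b):\n    return a + b\n"
BAD = "def add(a, b):\n    return a - b\n"

harness = Harness(test=lambda ns: ns["add"](2, 3) == 5)
harness.submit(GOOD)
harness.submit(BAD)
print(harness.log, harness.current() == GOOD)
```

Rollback here is implicit: a failing script is never committed to `history`, so the system's working version remains the last one that passed its tests.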

May 25

23:54

elvis@omarsar0

66

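Microsoft Research proposes SkillOpt, which uses an optimizer to learn AI agent skill docs automatically

Microsoft Research proposes SkillOpt, which treats an AI agent's skill document as trainable external state instead of something engineers write by hand. An optimizer model makes validation-gated edits to the skill file, adding, deleting, or replacing instructions, and a textual learning rate controls how aggressively each round rewrites the doc, while the agent itself stays unchanged. In experiments, SkillOpt is best or tied for best on all 52 cells (covering different models, benchmarks, and harnesses). On GPT-5.5, compared with no skill doc, it gains 23.5, 24.8, and 19.1 points in direct chat, Codex, and Claude Code respectively. It beats human-written skills and other automated methods, adds no inference-time cost, and the learned skills transfer across models and harnesses.

Agents · Microsoft · Papers/Research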
