SITUATION DETECTED: The city of Rio de Janeiro has post-trained a model. Based on Qwen 7/2, Rio 3.5 Open 397B adds SwiReasoning on top of the base Qwen model: a framework that dynamically switches between standard chain-of-thought and latent-space reasoning, guided by entropy-based confidence signals, so the model only "thinks out loud" when it needs to and otherwise reasons silently in hidden space for better token efficiency.
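To make the "entropy-based confidence signal" idea concrete, here is a toy sketch. It is not the SwiReasoning algorithm itself: the fixed threshold, the function names, and the two example distributions are assumptions chosen purely for illustration.

```python
import math

def shannon_entropy(probs):
    """Entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_mode(probs, threshold=1.0):
    """Hypothetical rule: reason explicitly only when the model is uncertain."""
    return "explicit" if shannon_entropy(probs) > threshold else "latent"

confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, high entropy

print(choose_mode(confident))
print(choose_mode(uncertain))
```

A real system would presumably track how the signal evolves over time rather than compare a single step against a constant; the sketch only shows why a peaked distribution and a flat one would be routed differently.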

Ethan Mollick@emollick · 6月14日48

I think the assumption that you should use smaller models for less important tasks is flawed (or at least deserves much more careful consideration). Big models are generally better at everything but cost, so it is worth considering whether gains in non-key tasks would be valuable

SemiAnalysis@SemiAnalysis_ · 6月14日66

DAY 0 ALERT: @MiniMax_AI M3 is now available on HuggingFace & has been added to InferenceX. The M3 architecture has ~428B parameters and ~23B activated parameters. Due to the 10x engineers from @inferact, M3 is already delivering pretty well-optimized performance on @NVIDIAAI B300 Blackwell Ultra on Day 0 @vllm_project! Furthermore, Inferact released their EAGLE3 heads, which enable even greater performance. Looking forward to Day 1, 2, and 3 performance & the team is grinding on benchmarking Day 0 MI355X performance on InferenceX too.

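Summary: MiniMax M3 is now on HuggingFace and integrated into InferenceX. M3 has about 428B total parameters and about 23B activated parameters. With engineering support from Inferact, M3 runs Day 0 optimized inference on NVIDIA B300 Blackwell Ultra through vLLM. Inferact also released EAGLE3 heads for further acceleration. The team is benchmarking Day 0 MI355X performance on InferenceX.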

Rohan Paul@rohanpaul_ai · 6月14日44

Nice survey paper mapping agentic reinforcement learning for LLMs, showing how models learn by acting across time. Covers 500+ works and groups them into a 2-part map of capabilities and applications. The problem is that common LLM training rewards a single answer once, then stops learning. Real tasks need many steps, partial information, and choices that affect what happens later. The survey formalizes that setup as an agent that sees a bit, chooses an action, and gets feedback. That perspective uses memory to track context, planning to pick sequences, and tools to affect the world. It also includes reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning links all of this, because rewards arrive after sequences, so the policy learns what to try next. ---- Paper: arxiv.org/abs/2509.02547 Paper Title: "The Landscape of Agentic Reinforcement Learning for LLMs: A Survey"

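Summary: The survey maps agentic reinforcement learning for large language models, covering 500+ works grouped along two dimensions, capabilities and applications. It notes that conventional LLM training rewards a single answer once and so cannot handle the multi-step decisions, partial information, and delayed feedback of real tasks. The agent framework comprises memory to track context, planning to pick action sequences, and tools to affect the environment, and it integrates reasoning for constraint handling, perception for multimodal inputs, and self-improvement to refine policies. Reinforcement learning ties these together: rewards arrive at the end of a sequence, and the policy uses them to learn what to do next.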
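The point that "rewards arrive after sequences" can be illustrated with the standard discounted return, G_t = r_t + γ·G_{t+1}. This is a generic textbook sketch, not code from the survey; the four-step episode and the discount factor are made-up values.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical 4-step episode: no feedback until the task succeeds at the end.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_returns(episode_rewards, gamma=0.5))  # [0.125, 0.25, 0.5, 1.0]
```

Even though only the final step is rewarded, every earlier action receives a share of the credit, which is what lets a policy learn multi-step behaviour from a single delayed signal.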

Berryxia.AI@berryxia · 6月13日60

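In some areas AI is honestly still at the "amounts to nothing" level! The gaps and the room for progress are huge! AI today can't even grab a cup correctly: before the hand actually touches it, the cup flies up on its own. In this GeekPark conversation, Professor Biwei Huang, founder of Aether AI, gave this example: today's video generation models learn the correlation "when a hand approaches a cup, the cup often moves", not the causality of "why it moves, and what exactly will happen when I grab it". In chat, a wrong statement can simply be corrected, but once you enter the physical world (robots, autonomous driving, biomedicine), one miscalculated variable has real consequences. Hallucination is not so fun here. So the dividing line for next-generation AI is not predicting the world more realistically, but truly understanding why the world works the way it does. That is what causal world models aim to do: let AI not just see appearances but understand mechanisms. The benchmark from Professor Huang's team shows that causal structure can raise robot success rates by 25-50% and cut sample requirements by 5-10x. Same data, different structure, and the economics change outright. People used to think that exploiting correlations at scale would carry us all the way; the physical world has now slapped that approach in the face. Real intelligence has to evolve from "knowing what" to "knowing why".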

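Summary: Current video generation models learn only the correlation "hand approaches, cup moves" rather than the causal mechanism, so the cup flies up before it is grasped. Professor Biwei Huang, founder of Aether AI, proposes the Causal World Model, which aims to have AI understand how the physical world works rather than merely predict appearances. The team's benchmark shows that introducing causal structure raises robot success rates by 25-50% and reduces sample requirements by 5-10x. This signals that next-generation AI needs to evolve from "knowing what" to "knowing why", especially in real physical settings such as robotics and autonomous driving.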

MiniMax (official)@MiniMax_AI · 6月13日80

the kernels are doing the lord's work today, day-0 on @vllm_project, verified on nvidia and amd. go read the writeup 👇

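Summary: MiniMax released M3, a new open-source model with frontier coding, agentic capabilities, native image and video input, Computer Use, and a 1M-token context window. Its core is the MSA sparse attention architecture: each query scores only 128-token KV blocks and attends only to the top blocks, making ultra-long context practical to deploy. M3 has Day-0 support in vLLM, verified on NVIDIA and AMD hardware, including dedicated MSA prefill/decode kernels, 1M-token context serving (prefix caching + chunked prefill), BF16/MXFP8 checkpoints (MoE backends for Hopper and Blackwell), native multimodal input, and features such as tool calling, reasoning parsing, and thinking-mode control.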

SemiAnalysis@SemiAnalysis_ · 6月13日63

Congrats to @vllm_project & @lmsysorg for releasing MiniMax M3 428B on both the CUDA & ROCm stack on day 0! MiniMax M3 includes: 🟠 Block sparse attention which is 9x faster prefill over M2.7 🟠 Day 0 open MXFP8 weights 🟠 and Furthermore @Inferact released Day-0 EAGLE3 open weight draft model support Excited to try out the performance on MiniMax M3!

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 6月13日65

In ONE year, AI went from being able to solve ~none of the hardest math problems to solving almost ALL of them

Artificial Analysis@ArtificialAnlys · 6月13日59

Today we're releasing the first results for AA-AgentPerf, our new agentic inference benchmark: initially covering DeepSeek V4 Pro across NVIDIA Blackwell, Hopper, and AMD. AA-AgentPerf is the first benchmark built for agentic inference. We use real, long-context agentic coding trajectory data as the workload, and inference with real production optimizations such as KV cache reuse and speculative decoding, leading to the most realistic evaluation of inference performance available today. AA-AgentPerf’s lead metric is Agents per Megawatt. In a power-constrained world, this answers the most relevant question for AI infrastructure providers - “how many real agents can I deploy per unit of power available?”. First results for DeepSeek V4 Pro (at the easiest defined service level of 20 tokens/s and 10s TTFT): ➤ GB300 (rack-scale, disaggregated): 61,354 Agents/MW ➤ B300 (single node, disaggregated): 21,053 Agents/MW ➤ MI355X: 3,551 Agents/MW ➤ H200: 2,594 Agents/MW Further AA-AgentPerf details: ➤ Real agent workloads, beyond synthetic queries: AA-AgentPerf replays real coding agent trajectories where our agents used up to 200 turns and worked with sequence lengths >100K tokens - the workloads that matter in 2026 ➤ Production optimizations allowed: KV cache reuse, speculative decoding, and prefill/decode disaggregation are all permitted, with accuracy verification to control for quality loss - we want results to reflect what real deployments actually look like ➤ Lead metric is Agents per Megawatt: simultaneous agents supported at production performance targets (e.g. 20 tokens/s per user, ≤10s TTFT) per megawatt consumed. 
Agents per TCO and $/hr will be supported soon Key findings: ➤ Rack-scale disaggregated inference (GB300) is ~3× more power-efficient than single-node Blackwell (B300), and similarly ahead in raw agents per GPU ➤ Blackwell represents a large generational step over Hopper in both power efficiency and raw compute per GPU ➤ In this test, NVIDIA's Blackwell systems currently lead AMD MI355X by a clear margin. Important context: our MI355X configs are approximately two weeks older than our Blackwell configs and couldn’t stably use speculative decoding. MI355X power draw under heavy load is also well below TDP, indicating there is much room to improve on DeepSeek V4 Pro, which we will measure and publish in the coming weeks ➤ Config and inference framework version matter enormously - we've seen meaningful improvements daily since the DeepSeek V4 Pro release and look forward to tracking performance over time AA-AgentPerf is a live benchmark and we publish results on a rolling basis as submissions come in. Some of the new features coming in v1.1: more models (gpt-oss-120b), more hardware (GB200, B200, H100, MI300X), better AMD configurations, $/hr and cost-per-task normalization, Agents per TCO, and performance tracking over time.

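Summary: Artificial Analysis released a new benchmark, AA-AgentPerf, with first results covering DeepSeek V4 Pro inference energy efficiency on NVIDIA Blackwell (GB300, B300), Hopper (H200), and AMD MI355X. The lead metric is concurrent agents supported per megawatt (at 20 tokens/s and TTFT ≤10s): GB300 (rack-scale, disaggregated) reaches 61,354, B300 (single node, disaggregated) 21,053, MI355X 3,551, and H200 2,594. The benchmark uses real coding-agent trajectories (up to 200 turns, sequences over 100K tokens) and allows production optimizations such as KV cache reuse and speculative decoding, with accuracy verification. The tests show rack-scale Blackwell is about 3x more power-efficient than single-node Blackwell, with a large generational lead over Hopper; the MI355X configs are older and could not stably enable speculative decoding, so there is still room for optimization.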
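The "~3×" and "large generational step" claims can be checked directly against the four Agents/MW figures quoted in the post. The snippet below only divides those published numbers; the dictionary and helper names are illustrative.

```python
# Agents per megawatt for DeepSeek V4 Pro, as quoted in the post.
agents_per_mw = {
    "GB300 (rack-scale, disaggregated)": 61_354,
    "B300 (single node, disaggregated)": 21_053,
    "MI355X": 3_551,
    "H200": 2_594,
}

def ratio(a, b):
    """How many times more agents per megawatt system a supports than b."""
    return agents_per_mw[a] / agents_per_mw[b]

baseline = "H200"
for system, value in agents_per_mw.items():
    print(f"{system}: {value / agents_per_mw[baseline]:.1f}x {baseline}")

gb300_vs_b300 = ratio("GB300 (rack-scale, disaggregated)",
                      "B300 (single node, disaggregated)")
print(f"GB300 vs B300: {gb300_vs_b300:.2f}x")
```

The GB300/B300 ratio comes out just above 2.9, consistent with the post's "~3×", and single-node B300 is roughly 8 times H200 on this metric.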

Rohan Paul@rohanpaul_ai · 6月13日53

Beautiful paper from Google DeepMind. Explains the pathways from AGI to ASI, and why that jump could happen through several routes. The authors frame the AGI-to-ASI transition around 4 technical pathways: - continued scaling of compute, model size, data, and test-time inference; - algorithmic paradigm shifts beyond today’s transformer-based foundation-model stack; - recursive self-improvement, where AI accelerates AI R&D and improves future systems; and - multi-agent collective intelligence, where large populations of specialized agents coordinate into a superhuman group agent. Scaling may work for a while, but it could hit limits in data, compute, energy, or weaker returns from making systems larger. Recursive improvement is the most uncertain path, because AI could speed up AI research, but that loop may also slow if hard research problems need real-world testing, scarce hardware, or new ideas. Multi-agent collectives may be the most underappreciated path, because a society of competent digital workers could outperform a brilliant individual model through specialization, speed, and coordination. The big point is that ASI may not arrive as 1 sudden event, but as a chain of faster changes as AI helps create better AI and stronger scientific tools. ---- Link: arxiv.org/abs/2606.12683 Title: "From AGI to ASI"

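Summary: A new Google DeepMind paper proposes four pathways from AGI to ASI: continued scaling (compute, model size, data, test-time inference), algorithmic paradigm shifts (beyond the Transformer architecture), recursive self-improvement (AI accelerating its own R&D), and multi-agent collective intelligence (many specialized AI agents cooperating into superhuman intelligence). Scaling may hit data, compute, and energy bottlenecks; recursive improvement is the most uncertain; the multi-agent path is the most underappreciated, since specialization and coordination can outperform a single strong model. ASI may arrive not as a single leap but as an accelerating chain of AI helping to create better AI.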

MiniMax (official)@MiniMax_AI · 6月13日82

day-0 in @vllm_project and it comes with: dedicated MSA prefill/decode kernels, 1M-context serving with prefix caching + chunked prefill, BF16 + MXFP8 on both Hopper and Blackwell 🚀 this is what open-weight done properly looks like. thanks @vllm_project, @NVIDIAAI, @AIatAMD, @inferact

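Summary: MiniMax M3 is released, with frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context. Its core is MSA sparse attention: each query scores 128-token KV blocks and attends only to the top blocks. vLLM supports M3 on day 0, including dedicated MSA prefill/decode kernels, prefix caching with chunked prefill, BF16 and MXFP8 checkpoints, and MoE backends for Hopper and Blackwell, verified on NVIDIA and AMD hardware. Agentic workloads are also supported through native multimodal input, tool calling, reasoning parsing, and thinking-mode control.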
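The description "each query scores 128-token KV blocks and attends only to the top blocks" can be sketched in a few lines of plain Python. This is a toy for a single query, not MiniMax's MSA kernel: scoring a block by the query's dot product with the block's mean key is an assumption made here, and the tiny cache uses `block_size=2` instead of 128 so the result is easy to follow.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size=128, top_k=2):
    """Attend only to the top_k highest-scoring KV blocks for one query."""
    # Split the KV cache into contiguous blocks of block_size tokens.
    starts = range(0, len(keys), block_size)
    blocks = [list(range(s, min(s + block_size, len(keys)))) for s in starts]

    # Assumption: score a block by the query's dot product with its mean key.
    def block_score(idx):
        dim = len(query)
        mean_key = [sum(keys[i][d] for i in idx) / len(idx) for d in range(dim)]
        return dot(query, mean_key)

    chosen = sorted(blocks, key=block_score, reverse=True)[:top_k]
    token_ids = sorted(i for block in chosen for i in block)

    # Standard scaled dot-product softmax attention over the selected tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in token_ids]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, token_ids)) / total
              for d in range(out_dim)]
    return output, token_ids

# Toy cache: 6 tokens, block_size=2, so three blocks; keep only the best one.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, 0.0]]
values = [[float(i)] for i in range(6)]
output, attended = block_sparse_attention([1.0, 0.0], keys, values,
                                          block_size=2, top_k=1)
print(attended)  # [0, 1]
```

The cost of the softmax step depends on the number of selected tokens rather than on the full context length, which is the property that makes very long contexts cheaper to serve.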

Chubby♨️@kimmonismus · 6月13日49

I had already wondered how Apple manages to perform inference at Google while simultaneously protecting their privacy, essentially their unique selling point. The answer: the heaviest requests run on Blackwell B200s inside Google Cloud, with NVIDIA's Confidential Computing encrypting the data while it's processed, so neither Google nor Apple can see it. "NVIDIA Confidential Computing provides a hardware-based security layer for accelerated AI workloads. The technology protects data while it’s being processed by isolating workloads in trusted execution environments and enabling systems to cryptographically verify that the infrastructure has not been tampered with before any sensitive data is sent to the server."

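Summary: Kim explains how Apple protects privacy while running inference on Google Cloud: the heaviest requests run on Blackwell B200s in Google Cloud, where NVIDIA Confidential Computing provides a hardware-based security layer that isolates workloads in trusted execution environments and encrypts data during processing, so that neither Google nor Apple can see it.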

Chubby♨️@kimmonismus · 6月13日24

Looking at the graph, I think Fable 5 will only maintain its lead up to GPT-5.6. And secondly, I think the benchmark will soon be completely saturated.

Ethan Mollick@emollick · 6月13日57

The shape of the graph is getting very familiar.

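Summary: Claude Fable 5 performs strongly on the FrontierMath benchmark (Tiers 1-4, v2), scoring 87% on Tiers 1-3 and 88% on Tier 4, continuing the trend of rapid improvement in math among Anthropic models. The main post comments: "The shape of the graph is getting very familiar."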

Epoch AI@EpochAIResearch · 6月13日41

Claude Fable 5 scores very well on FrontierMath: Tiers 1–4 (v2), reaching 87% on Tiers 1–3 and 88% on Tier 4. This continues a streak of Anthropic models improving rapidly at math.

MiniMax (official)@MiniMax_AI · 6月13日50

M3 is live on @telnyx Inference on day-0 go build with Telnyx and M3 today

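Summary: MiniMax M3 is now live on the Telnyx inference platform. M3 is the first open-weight model to combine frontier coding with agentic capabilities, with a 1M-token context window and native multimodal understanding. With M3's 1M context and Telnyx's own GPU infrastructure, an entire codebase can be handled in a single conversation. The official account encourages developers to start using it now.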

MiniMax (official)@MiniMax_AI · 6月13日64

day-0 and already on @FireworksAI_HQ with blazing fast inference long-horizon agents, full-repo understanding, multimodal coding all in one model Try M3 today on Fireworks AI

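Summary: MiniMax M3 is live on Fireworks AI, with the fastest inference endpoint on Day-0. The model is open-weight and ranks first on the Artificial Analysis index. It supports a 512K context window and native image and video input, and uses the MSA sparse attention mechanism for 9x faster prefill and 15x faster decode. Pricing matches M2.7. M3 combines long-horizon agents, full-repo understanding, and multimodal coding in a single model.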

AK@_akhaliq · 6月13日46

SpenseGPT Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

Epoch AI@EpochAIResearch · 6月13日64

FrontierMath: Tiers 1–4 (v2) is live. We concluded an audit that addressed errors in 42% of problems. Rankings are similar but scores are higher across the board. The current leaders are GPT-5.5 (xhigh) with 85% on Tiers 1–3 and Google’s AI co-mathematician with 76% on Tier 4.

Jeff Dean@JeffDean · 6月13日48

Quite interesting thread on capabilities of real biological neurons (spoiler: they're way more capable than classical artificial neurons in a perceptron). Nice work @IdoAizenbud and collaborators!

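Summary: Per Jeff Dean's repost, new research by Ido Aizenbud and collaborators finds that a single cortical neuron can classify cats versus dogs, recognize spoken words, and solve 10-bit parity, tasks previously thought to require an entire network.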

MiniMax (official)@MiniMax_AI · 6月12日81

MiniMax M3, Open-Weight, Now On Hugging Face, with only ~428B parameters and ~23B activated parameters Weights: https://huggingface.co/MiniMaxAI/MiniMax-M3 MiniMax Sparse Attention: https://huggingface.co/papers/2606.13392

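Summary: MiniMax released the open-weight model M3, with about 428B total parameters and 23B activated parameters, now uploaded to HuggingFace. The model combines three frontier capabilities: on coding and agents it reaches 59.0% on SWE-Bench Pro, 66.0% on Terminal Bench 2.1, 34.8% on SWE-fficiency, 28.8% on KernelBench Hard, and 74.2% on MCP Atlas; it uses MiniMax Sparse Attention to extend the context window to 1M tokens; and it is natively multimodal. The MiniMax Code tool and API platform launched alongside it. Weights and the technical report are expected in about 10 days.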

🚨 AI News | TestingCatalog@testingcatalog · 6月12日57

KIMI AI🔥: A new open-source “Kimi K2.7 Code” model has been released on APIs and Huggingface! > Improved coding & agent performance over K2.6 > Reasoning efficiency > Long-horizon coding Testing time 👀

Chubby♨️@kimmonismus · 6月12日66

Moonshot just released Kimi-K2.7 code, a huge upgrade to Kimi-K2.6! Big jump over K2.6: +21.8% on Kimi Code Bench v2 +11.0% on Program Bench +31.5% on MLS Bench Lite It also uses 30% fewer reasoning tokens, follows instructions better, and improves long-horizon coding tasks. 6x High-Speed Mode is coming soon. Good to see open source competition catching up

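Summary: Moonshot released and open-sourced the Kimi-K2.7-Code coding model, with large gains over K2.6 on several benchmarks: Kimi Code Bench v2 up 21.8%, Program Bench up 11.0%, MLS Bench Lite up 31.5%. Reasoning efficiency is improved, with 30% lower reasoning-token usage, and instruction following and long-horizon coding task success rates are better. A 6x High-Speed Mode is coming soon. The model is available now via the Kimi API and Kimi Code.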

meng shao@shao__meng · 6月12日70

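Kimi has open-sourced its latest coding model, "Kimi-K2.7-Code", a version built on K2.6 and tuned specifically for coding agents. The goal is clear: higher success rates on long-chain coding tasks with fewer reasoning tokens!
# Three core improvements
1. Coding: progress across the board, not yet at the top. Relative to K2.6, all three coding benchmarks improve: Kimi Code Bench v2 +21.8% (50.9→62.0), Program Bench +11.0%, MLS Bench Lite +31.5% (the largest gain, though the absolute score is still low). Against GPT-5.5 and Opus 4.8: the gap on general coding tasks narrows noticeably; MLS is roughly level with GPT-5.5; Program Bench still trails GPT-5.5 by a margin. Conclusion: a solid iteration, not a leapfrog.
2. Agent: MCP is the highlight. Kimi Claw 24/7 (long-horizon collaboration) and MCP Atlas both improve but still trail the two major closed-source models. MCP Mark Verified (81.1) beating Opus 4.8 (76.4) is the most convincing result: it covers real MCP environments such as Notion, GitHub, Postgres, and Playwright, and is human-reviewed. This shows K2.7 is already competitive at multi-tool orchestration, while GPT-5.5 (92.9) remains the ceiling.
3. Efficiency: fewer tokens, higher scores. K2.7 not only raises scores but also cuts reasoning tokens (officially about -30% overall):
· Kimi Code Bench v2: 62k→48k tokens, score 51%→62%
· Program Bench: 176k→102k tokens (-42%), score 48%→53%
· MLS Bench Lite: 42k→38k tokens, score 27%→35%
Practical meaning for agents: the same budget runs more steps, and long tasks are cheaper and more stable.
# Key technical features
1. Mandatory Thinking mode: Instant mode is not supported; recommended temperature=1.0, top_p=0.95. Aimed at complex reasoning, not fast completion.
2. Preserve Thinking (always on): full reasoning content is retained across multi-turn conversations and cannot be turned off. This matters for coding agents: the model can refer back to intermediate conclusions in earlier reasoning chains, reducing context loss.
3. Interleaved Thinking + Multi-Step Tool Call: same design as K2 Thinking; reasoning and tool calls alternate, suiting the "think a step, call a step, then look at the result" agent loop.
4. Multimodal: supports image and video input (supported in the official API; video capability in third-party vLLM/SGLang deployments is still experimental).
Open-source release: https://huggingface.co/moonshotai/Kimi-K2.7-Code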

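Summary: Kimi has open-sourced its latest coding model, Kimi-K2.7-Code, optimized from K2.6. Coding benchmarks improve across the board: Kimi Code Bench v2 +21.8%, Program Bench +11.0%, MLS Bench Lite +31.5%. Reasoning tokens drop by about 30% overall. On agents, MCP Mark Verified scores 81.1, ahead of Opus 4.8 (76.4), while GPT-5.5 (92.9) remains the ceiling. Technical features: mandatory Thinking mode, Preserve Thinking, Interleaved Thinking with multi-step tool calls, and image and video input. Available via the Kimi API and Kimi Code, with a 6x High-Speed Mode coming soon. Open-source release: moonshotai/Kimi-K2.7-Code on HuggingFace.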
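The per-benchmark token figures quoted in the post (62k→48k, 176k→102k, 42k→38k) and the 50.9→62.0 score can be turned into relative changes with one small helper. Only numbers from the post are used; the helper and dictionary names are illustrative.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# (reasoning tokens before, after) in thousands, as quoted in the post.
token_usage_k = {
    "Kimi Code Bench v2": (62, 48),
    "Program Bench": (176, 102),
    "MLS Bench Lite": (42, 38),
}
for bench, (before, after) in token_usage_k.items():
    print(f"{bench}: {pct_change(before, after):+.1f}% reasoning tokens")

# The headline +21.8% is a relative gain on the 50.9 -> 62.0 score,
# not a gain of 21.8 percentage points.
print(f"Kimi Code Bench v2 score: {pct_change(50.9, 62.0):+.1f}%")
```

The reductions are uneven: about 23% on Kimi Code Bench v2, 42% on Program Bench, and just under 10% on MLS Bench Lite, so the "about -30% overall" figure is an aggregate rather than a per-benchmark guarantee.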

Kimi.ai@Kimi_Moonshot · 6月12日70

🌘 Kimi-K2.7-Code, our latest coding model, is now released and open-sourced! 🔷 Improved coding & agent performance over K2.6: +21.8% on Kimi Code Bench v2, +11.0% on Program Bench, and +31.5% on MLS Bench Lite. 🔷 Reasoning efficiency: Less overthinking, with 30% lower reasoning-token usage compared to K2.6. 🔷 Long-horizon coding: Improved instruction following, higher end-to-end coding task success rates. ⚡️ 6x High-Speed Mode coming soon! 🔌 Available today via Kimi API and Kimi Code. 🔗 Kimi Code: https://kimi.com/code 🔗 API: https://platform.moonshot.ai

Alibaba Cloud@alibaba_cloud · 6月12日66

🚀 Taming Agent Chaos? Paper reveals NLAH: Replace rigid code harnesses with executable natural language. ✅ Performance matches code, tokens drop 95% (60k→2.9k) ✅ Modular design enables precise value attribution ✅ Identifies "negative assets" like multi-candidate search Shift from glue code to scientific strategy. 💡https://int.alibabacloud.com/m/1000414388/ #AgentHarness #NLAH #LLMEngineering

karminski-牙医@karminski3 · 6月12日50

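Magic! DeepSeekV4 context memory compressed to 1/10! Everyone knows DeepSeekV4 supports 1M context, and it is heavily optimized: actually using the full 1M context takes only about 10 GB of VRAM (by comparison, DeepSeek-V3.2 needs roughly 84 GB). Then I just saw the FlashMemory paper, which pushes VRAM usage down to 1.3 GB! And output quality goes up rather than down! Bro, fooling your friends is one thing, but fooling yourself is pointless. Really? Performance improves after compression? I went straight to the paper's details. First, a recap of the traditional approach: every time the model emits a token, it re-reads the preceding hundreds of thousands of tokens (this is global attention). FlashMemory's approach is to predict what will be needed: it has a built-in Neural Memory Indexer (really just a small model) that proactively predicts which segments of the history will be needed for upcoming generation, then prepares those segments in advance. As long as the hit rate is very high, the gain is real. In other words, its assumption is that not everything in the KV cache is needed for every generated token, so the needed parts can be loaded ahead of time on demand. It is like doing homework with reference material spread all over the desk; the optimization is to photograph just the parts you need and look at the photos when you need them. That sounds simple, but the practical difficulty is that training a dedicated small indexer model would mean loading DeepSeek-V4 into VRAM and training them together, which is very compute-hungry. So here is the paper's second highlight: decoupled training. They treat the indexer as a standard dual-encoder (similar to models used in search and recommendation) and train it separately. The huge DeepSeek-V4 base model never needs to be loaded into VRAM during this process. This makes training cost fall off a cliff and keeps it compatible with standard retrieval training frameworks. (Simply put, it is trained with a general method, using the query to predict which long passages to retrieve, so it is effectively a general-purpose model.) Sounds plausible, but that only reduces VRAM usage; how does performance improve? The answer is attention denoising. Because only the memory chunks most relevant to the current generation are pulled into VRAM each time, the model no longer sees the irrelevant, redundant information during computation. This naturally acts as "denoising", which is why accuracy rises slightly even though VRAM usage falls. In official tests on long-context benchmarks (such as LongBench-v2), accuracy ultimately improved by 0.6% on average. (There is also how data is evicted from VRAM and how prefetching is predicted; that part is great and quite inspiring. I recommend reading the original paper; I've run out of space.) Paper: http://arxiv.org/abs/2606.09079 Project: http://github.com/libertywing/FlashMemory-Deepseek-V4 #FlashMemory #DeepSeekV4 #FlashMemoryDeepseekV4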

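Summary: DeepSeek-V4 supports 1M context in about 10 GB of VRAM (versus about 84 GB for DeepSeek-V3.2). The FlashMemory paper pushes VRAM further down to 1.3 GB while improving accuracy by 0.6% on average on long-context benchmarks such as LongBench-v2. The core is a Neural Memory Indexer (a small model) that predicts which history segments will be needed and loads them on demand, which also denoises attention. Training uses a decoupled dual-encoder architecture that does not require loading the DeepSeek-V4 base model, cutting training cost substantially. Paper: arxiv.org/abs/2606.09079; project: github.com/libertywing/FlashMemory-Deepseek-V4.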
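The dual-encoder idea described here (embed the current query and each history chunk, then keep only the best-matching chunks resident) reduces to a top-k similarity search. The sketch below is a generic illustration, not the FlashMemory code: the 2-d embeddings, the `budget` parameter, and the function names are all hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_chunks(query_vec, chunk_vecs, budget):
    """Return ids of the `budget` chunks whose embeddings best match the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: dot(query_vec, chunk_vecs[i]),
                    reverse=True)
    return sorted(ranked[:budget])

# Hypothetical 2-d embeddings for four history chunks and one decoding query.
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-0.5, 0.2]]
query_vec = [1.0, 0.0]

# Only these chunks' KV entries would be kept in GPU memory for this step.
resident = select_chunks(query_vec, chunk_vecs, budget=2)
print(resident)  # [0, 2]
```

Because the chunk encoder and query encoder only need text pairs and relevance labels, an indexer of this shape can be trained without the base model in memory, which is the decoupling the post highlights.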

karminski-牙医@karminski3 · 6月12日62

Also, I forgot to mention: this model supports multimodal input! Text, images, and video all work. It's seriously solid.

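Summary: Google released Diffusion Gemma, a 26B model with 4B active parameters. It was optimized with NVIDIA for the RTX 4090/5090; the 5090 generates 700+ tokens per second. It supports multimodal text, image, and video input. On the AIME 2026 math test it reaches 94% of Gemma4-26B-A4B, and on the tau2 bench agent test 82%. Output quality is slightly below conventional LLMs but generation is faster. The 4-bit quantized version runs in just 16 GB of VRAM.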

karminski-牙医@karminski3 · 6月12日56

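In my experience, the stronger a model's one-pass ability (and the less thinking it needs to get there), the more it is truly SOTA. Needing agentic coding to fix first-attempt mistakes is actually a sign of a weak model; at the very least the fix should happen during interleaved thinking. Agentic coding is for handling engineering volume and runtime problems, not for fixing bugs that static checking would catch. Put more simply: if you have a bug and don't fix it during thinking but insist on fixing it in the (n+1)th context, are you tricking me into buying a coding plan (just kidding)?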

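Summary: karminski argues that models with strong one-pass ability (correct with little thinking) are the true SOTA; needing agentic coding to fix first-attempt errors reflects poorly on the model, and bugs should be fixed during thinking rather than in an (n+1)th context, which otherwise looks like a nudge to buy a coding plan. @iamai_omni suggests evaluation should shift toward long-horizon task consistency, for example loop-style evaluations focused on how fixes hold up over subsequent rounds.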

karminski-牙医@karminski3 · 6月12日65

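700 TPS on a single card! Diffusion Gemma is here! Google just released a Diffusion version of its small Gemma model! It is 26B in size with 4B active parameters, and most importantly, this time they worked with NVIDIA to optimize for the 4090 and 5090: the 5090 can generate 700+ tokens per second! For those who don't know what a diffusion LLM is: conventional LLMs emit text one token at a time, while a diffusion LLM reveals it patch by patch, like scratching a lottery card. Speed is the advantage of diffusion LLMs. There is no free lunch, and the downside is that output quality is not as good as conventional LLMs. Still, this Diffusion Gemma is considerably better than earlier diffusion text models: on AIME 2026 (a math benchmark) it reaches 94% of Gemma4-26B-A4B's level, and its worst result, tau2 bench (an agent benchmark), still reaches 82%. The 4-bit quantized version runs in 16 GB of VRAM. Also, a random idea: could this model serve as a draft model for speculative decoding with the gemma4 dense model? Give it a try if you're interested! #diffusiongemma #gemma #gemma4 #google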

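Summary: Google launched Diffusion Gemma, 26B in size with 4B active parameters, optimized with NVIDIA for the RTX 4090/5090 and reaching 700+ tokens/s on the 5090. This diffusion text model generates in parallel, "scratch-card style", rather than token by token; output quality is slightly lower than conventional LLMs but better than earlier models of its kind: AIME 2026 (math) reaches 94% of Gemma4-26B-A4B and tau2 bench (agent) 82%. The 4-bit quantized version runs in just 16 GB of VRAM.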

Ethan Mollick@emollick · 6月12日61

This is an interesting test, and the frontier models (GPT-5.5 Pro Extended, Claude 5 Fable Max) do fail. They refuse to turn the "three words" into "four" if that fits better Prompting the AI to act like a translator surfaces the problem, but it still avoids changing the wording

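Summary: Ethan Mollick notes that GPT-5.5 Pro Extended and Claude 5 Fable Max fail the Beninatto-Trombetti translation test. The test asks for "Solo 3 parole: non sei solo" to be translated into English while updating the meta-linguistic claim from "3 parole" to "4 words" (correct translation: "Just 4 words: you are not alone"). The frontier models refuse to change the wording, and still avoid the change even when prompted to act as a translator. Valerio Capraro argues that the failure of Claude 5 Fable, the latest LLM, on this simple test shows that LLMs are good at recombining known knowledge but lack real understanding, and that AGI remains far off.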
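What makes the test tricky is that the sentence makes a claim about its own word count, and that claim has to stay true after translation. A small checker makes the requirement explicit; the function name and the "number before the colon, words after it" parsing rule are assumptions that fit these example sentences only.

```python
import re

def claim_is_consistent(sentence):
    """True if the number before the colon equals the word count after it."""
    head, _, tail = sentence.partition(":")
    claimed = int(re.search(r"\d+", head).group())
    return claimed == len(tail.split())

print(claim_is_consistent("Solo 3 parole: non sei solo"))      # True  (source)
print(claim_is_consistent("Just 4 words: you are not alone"))  # True  (adapted)
print(claim_is_consistent("Just 3 words: you are not alone"))  # False (literal)
```

A literal translation preserves the number but breaks the claim, while the adapted translation changes the wording to keep the sentence true, which is the behaviour the post says the models avoid.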

Ethan Mollick@emollick · 6月12日48

Fable's attempt to complete Kubla Khan. Better, though no Coleridge: https://claude.ai/public/artifacts/d7d3351f-5ad5-4d73-a644-4a1426abe558 The most interesting thing is that it thought for 10 minutes & the thinking trace is full of pretty complicated (seeming?) musings about Coleridge's intent. A little literal, though.

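Summary: Ethan Mollick tested the Fable model on completing Coleridge's unfinished poem "Kubla Khan", based on the PorlockBench task: assume the "person from Porlock" never showed up, then complete the poem and continue its themes. Fable thought for 10 minutes, and its thinking trace is full of complex analysis of Coleridge's intent, but the result is still rather literal and falls short of Coleridge. The evaluation reflects progress on creative continuation tasks, though the benchmark is not yet saturated.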

宝玉@dotey · 6月12日53

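I used to set reasoning effort to Max without thinking; with Fable 5 I now have to choose carefully and don't dare pick Max casually. For one thing, it is smart enough not to need it; for another, it takes a long time and burns too many tokens! Fable 5 also has a trait that is both a strength and a weakness: it really loves verifying, all kinds of verification. The results are certainly good, but it takes so long that it isn't always worth it.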

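Summary: A user shares their experience with Claude Fable 5: they used to pick Max reasoning effort by default but no longer do so casually, because the model is smart enough not to need it and Max takes long and consumes many tokens. Fable 5 also likes to verify repeatedly; results are good, but the time cost is not always worth it. The quoted post notes that one of Fable 5's strengths is its long thinking time, once thinking for 15 minutes before acting.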

向阳乔木@vista8 · 6月12日46

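I've found that one of Claude Fable 5's strengths may be that the model thinks and reasons for long enough. I just pitched an idea and it thought for 15 minutes before starting to act. Awesome.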

AK@_akhaliq · 6月12日60

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

Alibaba Cloud@alibaba_cloud · 6月11日65

Big news! 🚀 Qwen is now live on #Eden AI, one of Europe’s leading AI gateways, trusted by more than 200,000 developers. Enterprises can now access Qwen’s powerful open-weight models for reasoning, coding, and AI applications through Eden AI’s unified API, making it easier to build multi-model workflows while avoiding vendor lock-in. To celebrate the launch, enjoy 35% OFF all Qwen models. If you are attending VivaTech Alibaba Cloud AInnovation Summit at Hall 7.3 Workshop A next week, stay tuned for a special meet-up with Eden AI CEO Taha Zemmouri and Eden AI CPTO Samy Melaine. 🔗 Start building today: https://app.edenai.run/playground #AlibabaCloud #Qwen #EdenAI #VivaTech2026 #GenerativeAI #Developers #CloudComputing

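Summary: Alibaba Cloud announced that Qwen models are live on the European AI gateway Eden AI. Eden AI has more than 200,000 developers, and enterprises can access Qwen open-weight models (for reasoning, coding, and AI applications) through its unified API to build multi-model workflows and avoid vendor lock-in. To mark the launch, all Qwen models are 35% off. A special meet-up with Eden AI's CEO and CPTO will be held next week at the VivaTech Alibaba Cloud AInnovation Summit (Hall 7.3, Workshop A).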

AYi@AYi_AInotes · 6月11日44

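Claude Fable 5 is seriously insane; it just helped me discover a niche on Xiaohongshu that can be run fully automatically with AI!! I really have to rave about this!! There is probably more than one, and I'll share everything once I've dug it all out! Today I tried feeding the Xiaohongshu data I recently scraped to Fable 5, and it produced a lot of output and conclusions that Opus 4.8 didn't. So impressive, it truly deserves a "wow"!! Folks, I've been stressing lately that AI is the sixth Kondratiev wave for most of us ordinary people. I'm personally very convinced of this and have gotten some results, for reference only: doing self-media is the biggest AI dividend we can grab!!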

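Summary: A user analyzed scraped Xiaohongshu data with Claude Fable 5, got conclusions Opus 4.8 had not provided, and found a niche that AI can operate fully automatically. The user sees AI as the sixth Kondratiev wave for ordinary people and self-media as the biggest AI dividend, and plans to keep looking for more niches.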

小互@xiaohu · 6月11日74

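Google open-sources its diffusion-architecture model: DiffusionGemma. Unlike Transformer models, which generate word by word like a typewriter, DiffusionGemma can generate a large passage or a whole piece at once and then refine it step by step. This greatly increases generation speed: 1000+ tokens/s on an H100 and 700+ tokens/s on an RTX 5090. 26B, runs in 18 GB of VRAM, and generates 256 tokens at a time. It checks itself and can revise after writing: an ordinary AI locks each word in once it is written and never goes back. Even if the 10th word is wrong, it cannot fix it by the time it reaches the 100th. DiffusionGemma's generation process is itself multi-round iteration: each round it re-examines the whole block of text and corrects whatever it finds wrong. It is like writing an essay: draft first, read through once to fix typos, read again to adjust the sentences, and after a few rounds the quality improves.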

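Summary: Google open-sourced DiffusionGemma, a diffusion-architecture model that generates a large block of text at once and then refines it step by step. It reaches 1000+ tokens/s on an H100 and 700+ tokens/s on an RTX 5090. At 26B parameters it needs only 18 GB of VRAM and generates 256 tokens at a time. It self-corrects over multiple iterations and can revise already generated content.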

Rohan Paul@rohanpaul_ai · 6月11日70

Great news for local LLMs. Google just released DiffusionGemma, an open experimental 26B MoE that activates only 3.8B parameters. Open model, Apache 2.0 license. Fits within 18GB VRAM when quantized. The big deal is the speed: DiffusionGemma generates 256 tokens in parallel per forward pass. This gives it up to 4x faster inference, with 1000+ tokens/s on an H100 and 700+ tokens/s on an RTX 5090. Normal autoregressive LLMs behave like left-to-right printers, so each new token waits for the previous token, which makes local GPU inference slow for a single user. DiffusionGemma initializes a 256-token canvas with random placeholder tokens, then runs multiple denoising passes that refine the whole canvas in parallel.

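Summary: Google released DiffusionGemma, an open experimental model based on Gemma 4 text-diffusion research. It is a 26B MoE architecture that activates only 3.8B parameters and fits in 18 GB of VRAM when quantized. The key advance is generating 256 tokens in parallel per forward pass, for up to 4x faster inference: 1000+ tokens/s on an H100 and 700+ tokens/s on an RTX 5090. DiffusionGemma initializes a canvas of random placeholders and runs multiple parallel denoising passes to generate the whole passage at once. The license is Apache 2.0.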
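A rough way to see where the speedup comes from is to count forward passes. An autoregressive decoder needs one pass per token, while a canvas-based diffusion decoder needs a fixed number of denoising passes per 256-token canvas. The 256-token canvas is from the post; `denoise_steps=16` is a made-up value, since the post does not say how many passes are used.

```python
import math

def autoregressive_passes(n_tokens):
    """One forward pass per generated token."""
    return n_tokens

def diffusion_passes(n_tokens, canvas=256, denoise_steps=16):
    """denoise_steps passes per canvas; denoise_steps=16 is a made-up value."""
    return math.ceil(n_tokens / canvas) * denoise_steps

n = 1024
print(autoregressive_passes(n), diffusion_passes(n))  # 1024 64
```

Pass counts are not wall-clock time: each denoising pass processes a whole canvas and is therefore more expensive than a single-token decode step, which is one reason the reported end-to-end gain is "up to 4x" rather than the raw ratio of passes.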

ClaudeDevs@ClaudeDevs · 6月11日66

New for Apple developers: Foundation Models support for Claude lets developers use Apple's Foundation Models framework to call Claude for multi-step reasoning, code generation, and longer context.
