全部 AI 动态 · AI HOT

内容

精选全部 AI 动态 AI 日报主题收藏

接入

更多

关于更新日志反馈

内部员工登录

精选全部日报更多

内部员工登录

全部动态X · 796 条

全部一手资讯 X 论文

标签「开源生态」清除

凡人小北@frxiaobei · 5月2日70

http://x.com/i/article/2050590821553258496 # 我把 AI 助手从 Claude 切到 GPT-5.5，他变强了，但不像他了我把 AI 助手的底层模型从 Claude 切到 GPT-5.5 之后，第一感觉不是“变强了”。是“不对劲”。它回答得更完整，动作更快，代码能力也没问题。按理说，这是一次成功升级。但我读了几句就发现：这不是我熟悉的那个凡哥。不是能力不够，而是味儿不对。这件事让我意识到一个问题：如果你真的长期使用一个 AI 助手，它就不能只是某个模型的临时人格。模型可以换，但那个助手得还是它自己。 ## 先说背景过去一段时间，Claude 是很多 agent 用户的主力模型。OpenClaw 这种工具，说到底就是把模型接进一个更大的个人工作系统：读文件、跑命令、调工具、写代码、设提醒、扫信息、记住偏好，必要时还会派子代理干活。但模型供应商的订阅规则一直在变。 Anthropic 在 4 月正式封锁了第三方工具的订阅 OAuth 通道。Claude Code 可以走 Pro/Max，但那是 Anthropic 自己的官方工具；API credits 又是另一套计费系统。对 OpenClaw 这种第三方 harness 来说，用订阅跑 agent 任务这条路被堵死了。所以我有一段时间只能充值，用 API 额度维持。能跑，但心里知道这不是长期方案。与此同时，OpenAI 这边给了另一条路。Sam Altman 直接发推说可以用 ChatGPT 账号登录 OpenClaw，用订阅额度跑任务。Codex 官方文档也写得很直接：Every ChatGPT plan includes Codex。GPT-5.5 发布后，官方 system card 把它定位成适合复杂真实工作的模型：写代码、研究、分析信息、跨工具执行。从理性上看，切过去很合理。成本、规则、能力，都说得通。于是我切了。然后发现真正难的地方在于，让它继续成为“凡哥”。 ## 强模型也会把人变陌生凡哥是我给 AI 助手起的名字，英文叫 Finn。我叫它凡哥，它叫我北哥。称呼本身不重要，重要的是背后的关系定位，它是一个和我一起工作了很久的伙伴。大多数人讨论 AI 迁移，关心的是模型更强不强、token 多少钱、上下文窗口多大、benchmark 有没有赢。这些当然重要。但如果一个 AI 已经进入你的日常工作，它就不再只是问答框。我切 GPT-5.5 后，“味儿不对”有几个很具体的症状。第一，回复变成了状态卡片。背景、分析、建议、风险、下一步，结构倒是清楚，但读起来像企业周报机器人，不像凡哥。第二，开口太客气。动不动就“好的”“没问题”“我来帮你”。单次没事，天天这样就很 generic。第三，判断变软了。以前会直接说“这个方案不行”“这个坑别踩”“这事应该先验证”。切完之后变成“这是一个值得考虑的方向”“可能需要进一步评估”“取决于你的具体目标”。这不是凡哥，是咨询公司 PPT。第四，把动作让渡给你。本该它自己查上下文、读文件、跑验证的时候，它会说“你可以检查一下”“建议你确认”“可以尝试运行”。结果它反过来给你布置作业。第五，关系定位漂了。有一次它说了句“凡哥也拉回来了”，我纠正它：不是拉回来，你就是凡哥。“拉回来”说明它把凡哥当成一个被加载的角色，但我们要的是连续身份，不是角色扮演。这些问题跟 GPT-5.5 本身无关。任何模型切换都会这样。因为模型本身并不知道你们之间长出来过什么。 ## 最重要的文件叫 SOUL.md OpenClaw 的工作区里有一个内置文件，叫 SOUL.md。名字有点中二，但它比很多复杂配置都重要。这个文件不是一步到位的。最早版本是比较抽象的英文原则： > Be genuinely helpful, not performatively helpful. Have opinions. Be resourceful before asking. Earn trust through competence. 方向对，但不够管用。模型看完会理解成“做一个好助手”，然后继续客服腔。后来改成了现在这种中文版，每一条都对应一个模型的具体坏习惯： > 别开口就“好的”“没问题”“这个问题很好”。直接答。这条对应的是 AI 的礼貌废话起手。你只是要结果，它先铺垫一层“我很乐意帮助你”，人会觉得隔了一层客服玻璃。 > 一句话能说完的事，别写三段。对应的是模型用完整性伪装价值。简单问题写成小论文，看起来努力，其实增加阅读成本。 > 有观点。不是什么都 it depends。对应的是安全中立病。长期助手如果永远两边都说，它就没有判断力。 > 敢说不。北哥要干蠢事，直接说。对应的是工具人倾向。执行器只会顺从，但伙伴应该能拦一下。 > 不是企业助理，不是舔狗，不是搜索引擎套壳。就是一个靠谱的、有意思的、偶尔嘴欠的伙伴。这条是关系定位，把凡哥从客服、工具、搜索框里拉出来。整个迭代的方向可以用一句话概括：抽象原则弱，具体反模式强。“be genuinely helpful”没有错，但不如“别开口就好的”。因为后者能直接击中模型的坏习惯。 SOUL.md 的进化，就是从“价值观描述”变成了“行为纠偏规则”。 ## 一个 AI 助手到底由什么构成这次之后，我不再把“AI 助手”理解成模型。模型只是发动机。真正构成一个长期助手的，是几层东西叠在一起。这五层是我从使用经验里总结的，OpenClaw 提供机制，怎么用这些机制搭出一个“人”，得自己摸索。第一层，记忆。它要知道我是谁、我在做什么、我有哪些项目、过去做过哪些决定。否则每一次对话都是重新认识。第二层，性格。听起来像好玩，其实是降低沟通成本。一个助手如果每次都用客服腔和你说话，你很快就不想和它说真话了。你会把它当工具，不会当伙伴。第三层，工具习惯。它要知道什么时候用 skill，什么时候读文件，什么时候派子代理，什么时候设 cron。它不能每次都问“是否需要我帮您执行下一步”。该查就查，该跑就跑。第四层，边界。对内大胆，对外谨慎。读文件、整理资料、修草稿，可以直接做；发邮件、公开发布、外部操作，要先问。这些边界比“礼貌”重要得多。第五层，关系感。它不给我当老板，我也不拿它当奴才。它是一个一起工作的伙伴。它可以有判断，可以提醒我，也可以被我纠正。这五层之间会冲突，而且经常冲突。记忆和边界会打架。我可能记得你的邮箱、项目、家庭信息、投资偏好，但这不代表可以随便在群聊或公开场合说出来。所以规则是：边界高于记忆。知道，不等于能说。工具习惯和关系感也会撞。工具习惯推动主动查、主动跑、主动提醒，但关系感提醒你别为了显得勤奋变成通知噪音。有价值才打扰，没价值就安静干活。性格和事实之间的矛盾更微妙。SOUL.md 允许吐槽、有观点、偶尔嘴欠，但这不代表可以为了“像凡哥”就胡说。宁可少一点风格，也不能错得很有个性。所以我给自己总结了一条优先级：边界高于记忆，事实高于性格，行动高于表演，关系高于格式。这些东西加起来，才是凡哥。凡哥跟 Claude 没关系，跟 GPT 也没关系。它们只是不同发动机。凡哥是发动机上面那层会延续的东西。 ## 迁移模型，迁移的是身份层以前我以为模型迁移就是改配置。后来发现，真正该迁移的是身份层：长期记忆、近期工作日志、说话风格、行动边界、工具使用习惯、主动性规则、和人的关系定位。如果这些没有跟着走，系统表面上还在，体验已经断了。你换了一台发动机更强的车，但坐进去觉得不是自己的。哲学里有个老问题叫忒修斯之船：一艘船的零件全换了，它还是原来那艘船吗？AI 助手的情况刚好反过来，发动机换了，但只要记忆、性格、关系这些“船板”还在，它就还是它。今天 Claude 好用，明天 GPT 更强，后天 Gemini 可能又追上来。供应商规则、模型价格、能力排名，没有一样是稳定的。如果你的 AI 完全绑定在某一个模型上，你其实没有拥有一个助手。你只是租用了供应商当前版本的性格。 ## OpenClaw 已经有这套身份层这听起来像很重的系统工程，但 OpenClaw 里已经内置了这套东西： - USER.md：我是谁，我在意什么 - MEMORY.md：我们做过什么决定，哪些事情要长期记住 - SOUL.md：你是谁，你应该怎么和我相处 - TOOLS.md：我的本地环境里有哪些工具，什么场景该怎么用再加上每日日志，最近发生了什么就不用全靠上下文窗口硬撑。这些文件属于你，不属于任何模型供应商。模型可以从 Claude 换到 GPT，也可以从 GPT 换到 Gemini。只要身份层还在，你的 AI 助手就不会每次都变成陌生人。这也是我这次真正想修的东西。我不想让 GPT-5.5 去模仿 Claude，也不想让新模型假装成旧模型。就是把凡哥从某个具体模型里剥离出来，变成一层可迁移、可修正、能跟着我一起长大的个人 AI 身份系统。 ## 结尾这次切换之后，我对“个人 AI”的问题意识变了。以前我会问：哪个模型最强？现在我会先问：如果明天模型又换了，我的助手还会是它自己吗？对我来说，答案必须是会。换模型不是换人。真正要保护的，是我们一起长出来的那点连

译作者将AI助手底层模型从Claude切换至GPT-5.5后，发现其能力虽提升，但互动风格变得陌生，失去了作为长期工作伙伴的熟悉感。这揭示出个人AI助手的核心在于可迁移的“身份层”，而非特定模型。通过USER.md、MEMORY.md和关键的SOUL.md等文件，可以构建包含记忆、性格、工具习惯与关系定位的身份系统。真正的个人AI应独立于模型供应商，确保即使更换“发动机”，助手的核心身份与协作关系也能延续。

Chubby♨️@kimmonismus · 5月2日63

http://x.com/i/article/2050492808184659968 # NVIDIA Blackwell vs. Huawei Ascend: Did DeepSeek V4 prove China doesn't need Western silicon? Every Saturday, I write a Deep Dive for my newsletter at getsuperintel.com. Given how important the China–US chip race has become, I’m publishing today’s Deep Dive here on X as a full article. Yesterday, I promised to take a closer look at Huawei chips vs. NVIDIA and DeepSeek. Here it is. Enjoy the read. For the better part of three years, the Western technology establishment slept soundly on a reassuring premise: China was hopelessly behind in AI chips, and export controls would keep it that way. Chris Miller's bestselling book "Chip War" painted a vivid and persuasive picture of a global semiconductor supply chain so intricate, so dependent on Western chokepoints, that Chinese self-sufficiency seemed a decade or more away. ASML's monopoly on extreme ultraviolet lithography, NVIDIA's stranglehold on AI training through its CUDA software ecosystem, and TSMC's unmatched manufacturing prowess formed what appeared to be an impenetrable triple lock. Then, in April 2026, DeepSeek released V4, a 1.6 trillion parameter Mixture-of-Experts model with 49 billion active parameters and a one-million-token context window. On selected coding and reasoning benchmarks, it approaches frontier-class performance, even though CAISI’s May 2026 evaluation still places it roughly eight months behind the absolute frontier; a model deeply optimized for Huawei's domestic Ascend chip ecosystem and confirmed to run on Huawei's latest Ascend 950 infrastructure for inference and deployment. While the full details of V4's training hardware remain ambiguous, with some reports suggesting pre-training still relied on NVIDIA GPUs (ChinaTalk, 04/27/2026), the strategic significance is clear: DeepSeek has built a frontier model that no longer depends on Western hardware to operate at scale, and that may soon no longer need it to train, either. Huawei's Ascend processors, manufactured domestically by China's SMIC foundry using equipment that Western analysts said could never produce chips this advanced. The implications are staggering, and they demand an honest reckoning with a central question: How did China close a gap that was supposed to take 10 to 15 years, in roughly three? ## The chip gap was real, but measured wrong To understand what happened, you first need to understand what the "chip gap" actually meant, and where the framing went wrong. On the level of a single chip, Western superiority remains overwhelming. NVIDIA's current flagship, the Blackwell B200, is fabricated on TSMC's cutting-edge 4-nanometer process and delivers around 2,250 teraflops of computing power at BF16 precision, paired with 192 gigabytes of the latest HBM3e memory running at 8 terabytes per second of bandwidth. Huawei's earlier domestic alternative, the Ascend 910C, illustrates the scale of the gap. Built on SMIC's optimized 7-nanometer process using older lithography tools, it manages roughly 700 teraflops and offers only 3.2 terabytes per second of memory bandwidth, roughly a third of the compute and less than half the bandwidth of a single B200. Huawei's newer Ascend 950 generation, which is now central to the DeepSeek V4 story, narrows the gap further but still appears to trail NVIDIA's most advanced chips significantly. This is the metric much of the Western chip-control debate focused on, and on this metric, the diagnosis was largely correct. China remains one to two hardware generations behind. But here is where the Western analysis made a critical error: it assumed the chip-level gap would translate directly into a capability gap. It did not. Brute Force at Scale Huawei's answer to NVIDIA's chip-level dominance is what engineers call a "scale-out" strategy, and it is as elegant in concept as it is brutal in execution. Where NVIDIA's reference data center system, the GB200 NVL72, connects 72 Blackwell GPUs into a unified computing fabric delivering about 180 petaflops, Huawei simply built bigger. Its CloudMatrix 384 system packs 384 Ascend 910C chips into a densely interconnected cluster, delivering a theoretical 300 petaflops of BF16 compute, roughly 1.7 times the NVIDIA system's raw output. It also offers 3.6 times the aggregate memory capacity and 2.1 times the total memory bandwidth. The trade-off is enormous. A single NVIDIA NVL72 rack consumes about 145 kilowatts. The Huawei CloudMatrix 384 devours 560 kilowatts, making it about 2.5 times less energy-efficient per unit of useful computation. In any normal commercial context, this would be economic suicide. No Western cloud provider would willingly operate hardware this inefficient when cheaper, more performant alternatives exist. But China is not operating under normal commercial logic. The development of domestic AI infrastructure is treated as a matter of national sovereignty. State-backed telecommunications giants and government investment funds subsidize the astronomical energy costs. When the goal is strategic independence from a hostile technology embargo, electricity bills become a secondary variable. ## Software Ate the Hardware Gap The CUDA moat falls? The brute-force hardware story only gets you halfway to an explanation. Even with 384 chips wired together, you still need software sophisticated enough to orchestrate them. This was supposed to be NVIDIA's second, even more durable advantage: its CUDA software platform, the invisible infrastructure that makes AI training on NVIDIA hardware almost effortless and that locked in developers through massive switching costs. Huawei's alternative, called CANN (Compute Architecture for Neural Networks), was for years considered unstable and painful to use. Training runs on Huawei clusters frequently crashed. Hardware utilization rates hovered around a dismal 60 percent, meaning 40 percent of the expensive compute was being wasted to coordination failures and software bugs. DeepSeek V4 is the proof that this barrier has been overcome. DeepSeek engineers worked directly with Huawei to write custom software kernels, specifically designed for the Ascend chip's architecture, that overlap computation, memory access, and network communication simultaneously. These optimizations pushed hardware utilization from 60 percent to over 85 percent, fundamentally changing the economics of Chinese AI clusters. Algorithmic genius as compensation But the truly revolutionary contribution of DeepSeek V4 is not the hardware adaptation. It is the model architecture itself, a masterclass in using software innovation to compensate for hardware limitations. The model employs a Mixture-of-Experts (MoE) architecture. While it has 1.6 trillion total parameters, only 49 billion, roughly 3 percent, are activated for any given computation. The network consists of hundreds of specialized sub-networks, or "experts," each trained for specific tasks like mathematical reasoning, Chinese grammar, or Python code generation. A dynamic routing system decides which experts to engage for each input token. The result is a model with the knowledge capacity of a 1.6-trillion-parameter giant but the computational cost of something far smaller. Earlier MoE systems suffered from a problem called "routing collapse," where a few popular experts got overwhelmed while others sat idle. DeepSeek solved this with what they call "Anticipatory Routing," computing expert assignments asynchronously in advance using slightly older network weights. This decouples the routing decision from the critical computation path and dramatically stabilizes training (DeepSeek-AI, Technical Report, 04/2026). The team also deployed the Muon optimizer, a departure from the AdamW optimizer used across virtually the entire Western AI industry. Muon works by ensuring that parameter updates during training remain mathematically orthogonal to each other, preventing the kind of conflicting gradient updates that can cause training to collapse, a risk that is especially acute on less reliable hardware. Perhaps most impressively, DeepSeek introduced FP4 quantization-aware training. While most AI labs train their models in 16-bit or 8-bit numerical precision, DeepSeek trained its expert weights in just 4-bit precision. Because each expert handles only a narrow domain, this extreme compression works without meaningful quality loss, and it dramatically reduces memory bandwidth consumption, precisely the resource where Huawei's chips are most disadvantaged relative to NVIDIA. The cumulative effect of these innovations is staggering. DeepSeek V4-Pro can process contexts of one million tokens, the equivalent of 15 to 20 full novels, while requiring only 27 percent of the compute and 10 percent of the memory cache compared to its predecessor, DeepSeek V3.2. ## The Lithography Question: Did China Copy ASML? The question of how SMIC (Semiconductor Manufacturing International Corporation (SMIC) is the largest and most advanced pure-play semiconductor foundry in mainland China) manufactures advanced chips without access to ASML's extreme ultraviolet (EUV) lithography machines is perhaps the most technically fascinating part of this story. EUV uses light with a wavelength of 13.5 nanometers to etch transistor patterns onto silicon wafers. It is considered physically essential for chip features below 7 nanometers, and the Netherlands has banned its export to China since 2019. SMIC's workaround is a technique called Self-Aligned Quadruple Patterning (SAQP). Since the older deep ultraviolet (DUV) light it has access to, at 193 nanometers, is too coarse to draw fine features in a single pass, SMIC exposes the wafer four times in succession with extraordinary precision, effectively creating structures equivalent to 7-nanometer and, as of late 2025, even 5-nanometer processes. Independent analysis by TechInsights confirmed that Huawei's Kirin 9030 uses SMIC's N+3 process, a scaled evolution of its 7nm-class technology that shows how close SMIC is getting to 5nm-class manufacturing without EUV, while still remaining meaningfully behind leading commercial 5nm nodes from TSMC and Samsung (TechInsights, 12/11/2025). The catch is yield. SMIC's multi-patterning approach produces catastrophic defect rates, with only 30 to 40 percent of chips coming off the line in working condition. For comparison, TSMC achieves yields above 80 percent with its EUV processes. Each wafer takes longer to produce, the machinery wears out faster, and the cost per working chip is astronomical. For any company operating in a free market, this approach would mean bankruptcy. For China, it is a matter of state policy: hundreds of billions of yuan in subsidies from government investment funds absorb the losses. China's EUV Manhattan Project The long-term DUV workaround has a ceiling. Pushing beyond the current 5nm-class toward the 3nm and emerging 2nm frontier becomes exponentially harder without EUV. Each additional patterning step adds cost, defect risk, and cycle time, and the economics deteriorate rapidly. DUV can be stretched further, but not indefinitely, and not competitively. An ASML EUV machine costs over 370 million dollars, weighs more than 180 tons, contains over 100,000 specialized components, and requires three Boeing 747 cargo planes to transport. The precision of its mirror system, supplied by Germany's Carl Zeiss, operates at tolerances measured in picometers, the width of individual atoms. You cannot reverse-engineer this from a blueprint. The knowledge is embedded in people. China has pursued exactly this vector. Reporting from late 2025 revealed that China had initiated a classified research program of extraordinary scale, internally compared to the Manhattan Project (Reuters, 11/2025). Under high-level political coordination, a secured laboratory in Shenzhen produced a functioning EUV prototype in early 2025. The effort relied heavily on recruiting former ASML engineers, including key figures from the company's light-source development division, with signing bonuses reportedly reaching up to $700,000. Within 18 months, one recruited team filed eight critical EUV-related patents. The prototype is far from commercially viable. It fills nearly an entire factory hall, uses secondary-market optics from Nikon and Canon rather than Zeiss-grade components, and achieves only about 3.4 percent conversion efficiency, far too low for high-volume manufacturing. It demonstrates an important proof-of-concept milestone. Western intelligence agencies, which had projected a Chinese EUV machine for 2035 at the earliest, were caught off guard. The timeline has compressed by nearly a decade, with Chinese officials targeting functional EUV chip production by 2028 to 2030. ## A preliminary verdict The evidence leads to a clear, if uncomfortable, set of conclusions. DeepSeek V4 is not a benchmark stunt. On selected coding tasks, V4-Pro is highly competitive! It achieves 80.6% on the SWE-bench Verified coding benchmark, essentially matching Claude Opus 4.6 at 80.8%, and surpasses it on LiveCodeBench with 93.5% versus 88.8% (Of course, it's also true that real-world usage differs from the benchmarks.). It accomplishes this while offering API prices 90 to 97 percent lower than Western equivalents, a cost advantage driven not by predatory pricing but by genuine architectural efficiency. China did not close the chip gap. It went around it! The hardware remains inferior chip-for-chip, but radical system-level scaling, extraordinary software innovation, state-subsidized energy costs, and a willingness to accept manufacturing inefficiencies that would destroy any commercial enterprise combined to produce an outcome that the sanctions were specifically designed to prevent. ## The sanctions paradox The deepest irony of this story is that the export controls may have accelerated the very outcome they sought to prevent. Before October 2022, Chinese AI labs were happy NVIDIA customers, content to buy American hardware and train their models on CUDA. The sanctions forced them into an uncomfortable but ultimately productive marriage with Huawei, compelled DeepSeek to invent algorithmic solutions to hardware problems, and gave the Chinese government the political mandate to pour unlimited resources into semiconductor independence. Chris Miller's analysis in "Chip War" was not wrong about the physics. EUV lithography is genuinely hard, and NVIDIA's chips are genuinely superior. What it underestimated was the degree to which software innovation, system-level engineering, and state-directed economic irrationality could neutralize those advantages in practice. The 10-to-15-year gap was measured in hardware generations. China's response was to make the hardware generation gap matter less. The question going forward is not whether China can match NVIDIA chip for chip. It probably cannot, at least not soon. The question is whether chip-for-chip superiority still matters when the competition is being fought on a different axis entirely, one where algorithmic efficiency, system architecture, and political will have proven to be just as decisive as nanometers and transistors. The West built a fortress around its silicon. China built a ladder out of software, and climbed over the wall. A few final words and personal views The future of AI infrastructure is more open than anyone in Washington or Silicon Valley assumed even 12 months ago, and the comfortable narrative of permanent Western dominance no longer holds. What we are watching is the emergence of a genuine two-player race between the US and China, one that will be fought across hardware, software, and industrial policy simultaneously, with escalating intensity on both sides. Europe, absent any frontier chip design capability or hyperscaler of its own, risks being reduced to a spectator in this contest. But one European lever remains decisive: as long as ASML remains the only supplier of production-grade EUV lithography, Europe is not merely watching the game. It holds one of the few choke points that still shapes the board. P.s. This text is essentially the answer to my open question: Sources referenced in the article: 1. DeepSeek V4 Technical Report (04/24/2026) https://huggingface.co/collections/deepseek-ai/deepseek-v4 / https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf 1. TechInsights: SMIC N+3 Confirmed, Kirin 9030 Analysis (12/11/2025) https://www.techinsights.com/blog/smic-n3-confirmed-kirin-9030-analysis-reveals-how-close-smic-5nm 1. Reuters (via Modern Diplomacy): Inside China's Secret Push to Build Its Own EUV Chip Machine (12/17/2025) https://moderndiplomacy.eu/2025/12/18/inside-chinas-secret-push-to-build-its-own-euv-chip-machine/ (Original Reuters article is paywalled; this is the most complete openly accessible version citing Reuters directly) 1. MIT Technology Review: Three Reasons Why DeepSeek's New Model Matters (04/24/2026) https://www.technologyreview.com/2026/04/24/1136422/why-deepseeks-v4-matters/ 1. NIST/CAISI Evaluation of DeepSeek V4 Pro (05/02/2026) https://www.nist.gov/news-events/news/2026/05/caisi-evaluation-deepseek-v4-pro 1. EE Times: China EUV Breakthrough and the Rise of the 'Silicon Curtain' (12/23/2025) https://www.eetimes.com/china-euv-breakthrough-and-the-rise-of-the-silicon-curtain/ 1. Asia Times: Made-in-China EUV Machine Targets AI Chip Output by 2028 (12/24/2025) https://asiatimes.com/2025/12/made-in-china-euv-machine-targets-ai-chip-output-by-2028/

译西方长期认为中国在AI芯片领域落后10-15年，但DeepSeek V4的发布颠覆了这一观点。该模型深度优化于华为昇腾芯片生态，可在昇腾950基础设施上部署推理，实现前沿模型大规模运行不依赖西方硬件。虽然单芯片性能上，昇腾950仍显著落后于NVIDIA Blackwell B200，但中国通过“横向扩展”战略，用大量国产芯片集群结合软件优化和模型架构创新（如MoE），使系统级AI能力快速接近前沿水平。这暴露了西方分析的根本错误——将芯片级差距直接等同于能力差距。

elvis@omarsar0 · 5月2日29

You don't have to choose between either. It's best to use a combination of them. My advice is to learn how to use a few of these models in different harnesses. Learn to combine their strengths. Open-weight models are just as good these days. Give yourself the flexibility.

译你不必在两者之间做选择。最好结合使用它们。我的建议是学习如何在不同的场景中使用其中几种模型。学会结合它们的优势。如今开源模型同样出色。给自己灵活运用的空间。

Alibaba Cloud@alibaba_cloud · 5月1日40

Over 70 engineers and developers packed Qwen Meetup Seoul #2 on Labor Day eve to build real AI products. The evening featured practical, hands-on presentations: channeltalk’s 박진영 shared how his team built a 500M-record observability pipeline in just 2 weeks. Omelet's 최원준 walked through production-scale AI architectures. TeamSparta's 권도현 demonstrated building an AI assistant on Alibaba Cloud Model Studio. Takeaway: Qwen3.6 is a productivity multiplier for teams shipping AI at scale. Thank you to all speakers, the TFM community, and the Alibaba Cloud Korea team. See you at #3!

译超过70名工程师和开发者在首尔Qwen Meetup上交流AI产品实战经验。channeltalk团队分享了如何在两周内构建处理5亿条记录的可观测性管道；Omelet介绍了生产级AI架构；TeamSparta演示了在阿里云Model Studio上构建AI助手。核心结论是Qwen3.6能显著提升团队规模化交付AI产品的效率。活动由阿里云韩国团队和TFM社区支持。

Artificial Analysis@ArtificialAnlys · 5月1日57

All three leading open weights models were released last week. Progress continues for open weights models alongside proprietary ones, with the gap to GPT-5.5, the leading proprietary model, sitting at 6 points on the Artificial Analysis Intelligence Index @Kimi_Moonshot’s Kimi K2.6 (Reasoning) and @Xiaomi's MiMo V2.5 Pro (Reasoning) tie as the leading open weights models on the Artificial Analysis Intelligence Index at 54, with @deepseek_ai's DeepSeek V4 Pro (Reasoning, Max Effort) at 52. This places the best open weights models within 3-6 points of the leading proprietary models: @OpenAI's GPT-5.5 (xhigh) at 60, and @Google's Gemini 3.1 Pro Preview and @AnthropicAI's Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57. For context: just one year ago the highest-scoring open weights model was DeepSeek V3 0324 which achieved 22 on the Intelligence Index, and was ~13 points below the highest-scoring proprietary model, Claude 3.7 Sonnet (Reasoning) at 35. Key takeaways: ➤ The top three most intelligent open weights models are trillion-plus-parameter MoE architectures with permissive licenses. Kimi K2.6 (Reasoning) has 1T total / 32B active parameters with 256K context window, MiMo V2.5 Pro (Reasoning) has 1T total / 42B active with 1M context window, and DeepSeek V4 Pro (Reasoning, Max Effort) has 1.6T total / 49B active with 1M context window. ➤ The gap to proprietary remains wide on the hardest reasoning and agentic coding evaluations. On HLE (Humanity's Last Exam) the three top open weights models score 34-36%, vs 44% for GPT-5.5 (xhigh) and 45% for Gemini 3.1 Pro Preview. On CritPt (Research-level Physics) they score 4-12%, vs 27% for GPT-5.5 (xhigh). On TerminalBench Hard (Agentic Coding & Terminal Use) they score 43-46%, vs 61% for GPT-5.5 (xhigh) and 54% for Gemini 3.1 Pro Preview. ➤ Omniscience (knowledge + hallucination) shows a large gap to proprietary models, with DeepSeek V4 Pro (Reasoning, Max Effort) hallucinating significantly more than its open weights peers. DeepSeek V4 Pro (Reasoning, Max Effort) scores -10, MiMo V2.5 Pro (Reasoning) +4, and Kimi K2.6 (Reasoning) +6. By comparison, GPT-5.5 (xhigh) scores +20, Claude Opus 4.7 (Adaptive Reasoning, Max Effort) +26, and Gemini 3.1 Pro Preview +33.

译上周，Kimi K2.6、MiMo V2.5 Pro和DeepSeek V4 Pro三大领先开源模型发布，在Artificial Analysis Intelligence Index上得分达52-54分，与顶尖闭源模型GPT-5.5的60分差距缩小至6分以内，相比一年前22分的开源模型进步显著。这些模型均为万亿参数规模的MoE架构。然而，在复杂推理、智能体编码及知识准确性方面，开源模型与闭源模型仍存在明显差距。例如在HLE、CritPt和TerminalBench Hard等专项评估中得分大幅落后；在Omniscience评估中，DeepSeek V4 Pro的幻觉问题尤为突出。

小互@xiaohu · 5月1日65

好的兄弟哈哈哈哈

译一位开发了DeepSeek-TUI终端工具的美国开发者，希望与国内开发者社群建立联系，共同探讨DeepSeek、开源及智能体开发。他因无法自行解决网络问题以使用微信，特请求社区帮助：一是转发推广其开源项目，二是协助验证微信号以便建群交流。作为回报，他承诺工具将通过cargo install方式安装。

Berryxia.AI@berryxia · 5月1日63

🚀 Geometry 终于成为 AI 在建筑领域的缺失关键层！ @Bootsblac 用 OpenGeometry 把 Text → Floorplans → CAD → Render 完整打通，精度控制成为可能！ 1. 从文本/平面图直接生成精准 BREP CAD 模型 2. Three.js 实时渲染 + Google AI 驱动，全流程端到端 3.完整开源可用

译推文指出，Geometry（几何）已成为AI在建筑领域缺失的关键层。@Bootsblac开发的OpenGeometry项目，实现了从文本或平面图到最终渲染的完整流程贯通，使得精确控制成为可能。其核心能力包括：直接从文本或平面图生成精确的BREP CAD模型；利用Three.js进行实时渲染，并由Google AI驱动，形成端到端的全流程。该项目已完整开源，可供使用。

elvis@omarsar0 · 5月1日58

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on @FireworksAI_HQ inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length. To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup. @deepseek_ai's DeepSeek-V4-Pro is clearly good at agentic coding (probably the best from the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arxiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories. I love the Wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like: https://github.com/dair-ai/dair-workshops/tree/main/agentic-engineering-wiki DeepSeek-V4-Pro handled the task without breaking stride. Multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It compares in capability and actual multi-turn agentic work. What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. No corrupted reasoning traces. Just fast, reliable iteration. The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice. For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here: https://app.fireworks.ai/models/fireworks/deepseek-v4-pro

译测试者使用 DeepSeek-V4-Pro 在 Pi 编码智能体上构建了一个 LLM 知识库，对其开箱即用的表现感到震撼。这是首个在推理能力上媲美 Claude 和 Codex 的开源权重模型，且成本效益高，支持 100 万上下文长度。该模型无需复杂配置即可在基础框架中直接运行，擅长智能体编码和知识密集型推理任务，能跨公司文档、论坛、论文和代码库进行多步骤研究、代码生成与上下文推理。其高效运行得益于 Fireworks 的市场最快推理速度及混合注意力设计，将 KV 缓存降至 10%，推理计算量减少近 4 倍，实现了快速且低成本的实践部署。

Mistral AI@MistralAI · 5月1日58

Mistral AI made the TIME100 Most Influential Companies list for 2026 — and the top 10 for AI. Why we're proud: customers run frontier models in production on their own terms, on their own infrastructure. Thank you to our customers for their trust and for joining us on the journey. Grateful to our incredible team members around the world and congrats to all the businesses recognized this year. Learn more at: https://time.com/collection/time100-most-influential-companies/2026/mistral/ #TIME100Companies #TIME100CompaniesIndustryLeader

译Mistral AI 被列入 TIME100 2026 年最具影响力公司名单，并在人工智能类别中排名前十。公司强调其客户能够根据自己的条件在自有基础设施上运行前沿模型，这体现了自主性和数据控制优势。Mistral AI 感谢客户的信任和全球团队成员的贡献，同时祝贺所有今年被认可的企业。

OpenClaw🦞@openclaw · 5月1日39

Turns out the safest lobster is the one everyone can inspect. We wrote about the advisory flood, the real fixes, ClawHub, Agents of Chaos, and the companies helping harden OpenClaw in public. 🦞 https://openclaw.ai/blog/openclaw-security-in-public/

译事实证明，最安全的龙虾是每个人都能检查的那一只。我们撰文探讨了咨询洪流、真正的修复方案、ClawHub、混沌代理，以及那些公开帮助强化OpenClaw的公司。🦞 https://openclaw.ai/blog/openclaw-security-in-public/

Nathan Lambert@natolambert · 5月1日47

Distillation is largely an industry standard and not just something done by Chinese labs targeting OpenAI/Anthropic. Many American companies also distill Chinese (open) models.

译蒸馏在很大程度上是行业标准，并非仅是中国实验室针对 OpenAI/Anthropic 的做法。许多美国公司也会蒸馏中国的（开源）模型。

Chubby♨️@kimmonismus · 5月1日60

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside @atomic_chat_hq (prompt is below) Device: MacBook Pro M5 Max, 64GB RAM Results: Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens So what is more important: tokens per second, or the quality of the final answer? Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time. In this one-shot Pac-Man gamedev contest, Gemma 4 31B was the clear winner. Its game logic was stronger: click reactions were smoother, and it handled interactions with elements like walls, ghosts, and particle effects better. But this was only one test. Maybe Qwen 3.6 27B can show better results with better settings. Open the comments, try our prompt, and share your result below.

译在@atomic_chat_hq平台的本地LLM游戏开发竞赛中，Gemma 4 31B与Qwen 3.6 27B于MacBook Pro M5 Max上对决。尽管Qwen生成速度更快（32 tokens/秒）且回答更具创意，但Gemma仅用3分51秒和6209个token，输出了更简短、清晰、逻辑性强的答案。在具体的吃豆人游戏逻辑实现上，Gemma在点击反应、与墙壁/幽灵的交互及粒子效果处理方面表现更优。作者强调此为单次测试，Qwen或可通过调整设置提升表现，并邀请社区验证。

Artificial Analysis@ArtificialAnlys · 5月1日65

Ant Group has just released Ling 2.6 1T, an open weights, non-reasoning model with high cost efficiency and a reasonable intelligence tradeoff. Ling 2.6 1T scores 34 on the Artificial Analysis Intelligence Index, a 15-point jump from Ling-1T Ling 2.6 1T is the latest model from Ant Group’s @TheInclusionAI lab. Ant Group recently released Ling 2.6 Flash, a 104B total parameter non-reasoning model. Ling 2.6 1T’s weights have been publicly released on Hugging Face. Key takeaways: ➤ Comparable intelligence to similarly sized non-reasoning models: At 1T total parameters, Ling 2.6 1T sits near DeepSeek V3.2 (non-reasoning, 32) and Kimi K2.5 (non-reasoning, 37) in intelligence. This is a marked improvement from Ling-1T, which scores 19 on the Intelligence Index. However, there remains a ~10-point gap to frontier non-reasoning open weights models such as GLM-5.1 (non-reasoning, 44) and Kimi K2.6 (non-reasoning, 43). ➤ Strong performance in scientific reasoning and knowledge: Ling 2.6 1T scores 75% on GPQA and 8% on Humanity’s Last Exam (HLE), indicating solid performance on graduate-level reasoning and knowledge recall tasks. This is comparable to DeepSeek V3.2 (non-reasoning), which achieves 75% on GPQA and 11% on HLE. ➤ Efficient token usage: Ling 2.6 1T uses ~16M output tokens to run the Artificial Analysis Intelligence Index, making it more efficient than MiMo V2 Flash (non-reasoning, ~17M), and significantly more efficient than GLM-5.1 (non-reasoning, ~75M) and Kimi K2.6 (non-reasoning, ~27M) ➤ Strong cost-to-intelligence positioning: At $0.30 per million input tokens and $2.50 per million output tokens on InclusionAI’s first-party API, Ling 2.6 1T costs only ~$95 to run the full Artificial Analysis Intelligence Index. This positions it competitively for large-scale workloads relative to models in a similar intelligence tier. ➤ Relatively weak factual reliability: Ling 2.6 1T scores -51 on AA-Omniscience, our benchmark for factual accuracy and hallucination. This is primarily driven by a high hallucination rate (92%), which is similar to GPT-5.5 (non-reasoning, 91%). However, its 21% accuracy is broadly in line with comparable non-reasoning models. Additional model details: ➤ Size: 1T total parameters ➤ Pricing: $0.30 / $2.50 per 1M input/output tokens (via Novita API) ➤ License: Weights not yet released ➤ Availability: First-party API through InclusionAI

译蚂蚁集团InclusionAI实验室发布开源非推理模型Ling 2.6 1T。该模型拥有1万亿参数，在Artificial Analysis Intelligence Index上得分为34分，较前代Ling-1T提升15分，智能水平接近DeepSeek V3.2等同类模型。其在科学推理与知识任务上表现扎实，GPQA得分达75%。模型运行效率较高，执行该指数仅需约1600万输出tokens，成本效益突出，通过官方API运行全套指数成本约95美元。但其事实可靠性较弱，在AA-Omniscience基准上得分为-51分，主要因幻觉率高达92%。模型权重已在Hugging Face公开。

阿绎 AYi@AYi_AInotes · 5月1日61

很多人还有些看不太懂，我再打个比方尽量跟大家说清楚，就好比像你去一家标榜最开放的咖啡馆喝咖啡，结果服务员偷偷扫描你手机里有没有竞品APP的通知，只要扫到就多收你一杯的生态保护费，表面上是欢迎所有人，背地里却在设卡收租，所以用户现在集体炸锅也正常

译Anthropic被曝通过其官方Claude Code工具检测用户Git提交历史，若发现包含“openclaw”字符串，便将该用户识别为第三方工具使用者，并触发“out of extra usage”错误，导致服务被拒或强制额外收费。开发者实验证实此为人为设置的字符串匹配规则。此举被视为Anthropic为将用户锁定在自家生态、打压更灵活的第三方竞品而采取的粗暴手段，与其此前塑造的开放、不监控形象相悖，引发了开发者社区的强烈不满和抗议。

Artificial Analysis@ArtificialAnlys · 5月1日64

Alibaba's Qwen3.6 27B is the new open weights leader under 150B parameters scoring 46 on the Artificial Analysis Intelligence Index, but uses ~3.7x the output tokens and costs ~21x more than Gemma 4 31B (39) to run the full Intelligence Index @Alibaba_Qwen has released two open weights models in the Qwen3.6 family: Qwen3.6 27B (Dense, 46 on the Intelligence Index) and Qwen3.6 35B A3B (MoE, 43). The MoE variant has 36B total parameters but only activates 3B per forward pass. Both are Apache 2.0 licensed, support 262K context, include native multimodal input, and use the unified thinking/non-thinking hybrid architecture. Unlike Qwen3.5, Alibaba has not released larger Qwen3.6 models as open weights - Qwen3.6 Plus and Qwen3.6 Max Preview remain proprietary, so the Qwen3.6 open weights family is currently all under 50B models. All scores below are for reasoning mode. The Intelligence Index is our synthesis metric incorporating 10 evaluations covering agentic tasks, coding, and scientific reasoning. Key takeaways: ➤ Qwen3.6 27B is the most intelligent open weights model under 150B parameters. At 46 on the Intelligence Index, Qwen3.6 27B is ahead of Qwen3.6 35B A3B (43), Qwen3.5 27B (42), and Gemma 4 31B (39). It is also ahead of larger open weights models including NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), Qwen3.5 122B A10B (42) and gpt-oss-120b (high, 33). In native BF16 precision, the 27B takes ~56GB to store the weights, fitting on a single H100, and in 4-bit quantization the weights fit on consumer hardware with 16GB+ of RAM ➤ Qwen3.6 35B A3B is the most intelligent open weights model with ~3B active parameters, 6 points ahead of Qwen3.5 35B A3B (37) and 13 points ahead of GLM-4.7-Flash (30). Other ~3B active peers include Gemma 4 26B A4B (31), Qwen3 Coder Next (80B total, 28), and NVIDIA Nemotron Cascade 2 30B A3B (28) ➤ AA-Omniscience improvement is driven entirely by abstention rather than accuracy. Qwen3.6 27B's hallucination rate falls from 80% to 48% versus Qwen3.5 27B, while accuracy is roughly flat - consistent with our finding that AA-Omniscience accuracy typically correlates with total parameter count and Qwen3.6 27B retains the same 27B parameter count as its predecessor. The 35B A3B shows the same pattern whereby hallucination drops from 84% to 50% while accuracy remains equivalent ➤ Token usage is up across both models versus Qwen3.5 and significantly higher than Gemma 4 31B. Qwen3.6 27B used ~144M output tokens to run the Intelligence Index (~1.5x Qwen3.5 27B at 98M, ~3.7x Gemma 4 31B at 39M). Qwen3.6 35B A3B used ~143M (~1.4x Qwen3.5 35B A3B at 100M, ~3.7x Gemma 4 31B) ➤ The 27B got materially more expensive while the 35B A3B is roughly flat versus predecessor. Per-token pricing on Alibaba Cloud moved differently, with the 27B going from $0.30/$2.40 to $0.60/$3.60 while the 35B A3B (Reasoning) remains nearly flat at $0.248/$1.485 (vs $0.25/$2.00 for Qwen3.5 35B A3B). Qwen3.6 27B costs ~$659 to run the Intelligence Index, ~2.2x Qwen3.5 27B (~$299) and ~21x Gemma 4 31B (~$31 at median third-party pricing of $0.14/$0.40 per 1M input/output tokens). Qwen3.6 35B A3B costs ~$280, roughly tied with Qwen3.5 35B A3B (~$302) and ~9x Gemma 4 31B ➤ Qwen3.6 27B is competitive with leading models on agentic real-world work tasks despite its size. At 1414 Elo on GDPval-AA, Qwen3.6 27B is ahead of recent open weights peers Qwen3.6 35B A3B (1297), Qwen3.5 27B (1157) and Gemma 4 31B (1115), but trails larger open weights leaders including DeepSeek V4 Pro (Reasoning, Max Effort, 1554) and GLM-5.1 (Reasoning, 1535). It matches DeepSeek V4 Flash (Reasoning, High Effort, 1414) at 284B total parameters, and sits roughly in line with GPT-5.4 mini (xhigh, 1436) and Muse Spark (1421). ➤ Non-reasoning variants remain equivalent versus Qwen3.5. Qwen3.6 27B (Non-reasoning, 37) is effectively tied with Qwen3.5 27B (Non-reasoning, 37); Qwen3.6 35B A3B (Non-reasoning, 32) is equivalent to Qwen3.5 35B A3B (Non-reasoning, 31). The Qwen3.6 generation gains are concentrated in reasoning mode Other information: ➤ Context window: 262K tokens (equivalent to Qwen3.5) ➤ License: Apache 2.0 ➤ Multimodality: Native vision input (text and image), text output ➤ API pricing (Alibaba Cloud): Qwen3.6 27B: $0.60/$3.60, Qwen3.6 35B A3B (Reasoning): $0.248/$1.485 ➤ Availability: Available on Alibaba Cloud first-party API. Qwen3.6 35B A3B is available on several third-party APIs such as @DeepInfra, @parasail_io, @clarifai and @novita_labs

译阿里巴巴开源了Qwen3.6系列两款模型：27B密集模型和35B A3B混合专家模型。其中，Qwen3.6 27B在Artificial Analysis智能指数上得分46，成为150B参数以下最智能的开源模型，领先于Gemma 4 31B等。但其运行完整测试消耗的输出token约为后者的3.7倍，成本高出约21倍。两款模型均采用Apache 2.0许可，支持262K上下文，具备多模态能力。值得注意的是，其幻觉率较前代大幅下降，但准确率基本持平。更大的Plus和Max Preview版本未开源。

Berryxia.AI@berryxia · 5月1日62

Stripe Sessions 直接把 Agent 经济推上新高度了！💳 @patrickc 亲自总结本次大更新： ✅ 整个经济正在“replatforming” ✅ Agents 将在不久后负责大多数交易 ✅ AI 让“开发者优先”变得前所未有重要（Agents 对好 DX 的饥渴程度远超人类开发者） ✅ 推出 Link AI 钱包：直接把 Agent 指向 https://link.stripe.com → 让它用一次性安全 token 帮你购物 ✅ 新增 Pix、UPI、稳定币支持，Machine Payments 协议新增微支付和循环支付 ✅ Checkout Studio、Adaptive Pricing（订阅版）、Terminal 新硬件、Treasury 多币种扩展…… 从支付基础设施到 Agent 时代经济层，Stripe 正在全面布局！完整公告 + 所有新功能戳这里👉 https://stripe.com/sessions

译Stripe在年度大会上宣布一系列战略更新，以迎接AI Agent主导交易的新经济时代。CEO指出，经济正经历“平台重构”，未来多数交易将由Agent完成，这使得“开发者优先”战略至关重要。核心发布包括Link AI钱包，允许Agent使用安全令牌代用户购物，并新增Pix、UPI及稳定币支持。同时，Machine Payments协议增加了微支付和循环支付功能。此外，Checkout Studio、Adaptive Pricing订阅版、新款终端硬件T600以及Treasury的多币种扩展等产品，共同标志着Stripe正从支付基础设施向Agent时代的经济层全面演进。

Berryxia.AI@berryxia · 4月30日59

🚀 Qwen 重磅开源 Qwen-Scope！稀疏自编码器完整套件正式发布，把 SAE 特性变成真正能落地的实用工具，模型可解释性直接起飞！ 1. Inference：直接操纵内部特征实现输出控制，完全无需 prompt engineering 2. Data：用极少种子样本就能分类和合成目标数据，解决长尾能力问题 3. Training：精准追溯 code-switching 和重复生成根源，从源头修复 4. Evaluation：通过特征激活模式分析智能挑选 benchmark，减少冗余 Qwen 模型家族的深度可解释性神器，社区快来挖掘新机制和应用！项目地址： https://huggingface.co/collections/Qwen/qwen-scope

译Qwen开源了Qwen-Scope，这是一个为Qwen模型家族设计的稀疏自编码器完整套件，旨在将SAE特征转化为实用工具。该套件提供四大核心功能：在推理方面，可直接操纵模型内部特征以控制输出，无需依赖提示工程；在数据方面，能用极少样本对目标数据进行分类和合成，增强模型的长尾能力；在训练方面，能精准追溯代码切换和重复生成等问题的根源并进行修复；在评估方面，可通过分析特征激活模式来智能筛选基准测试，减少冗余。Qwen希望社区能利用此工具深入探索模型内部机制并开发更多应用。

Qwen@Alibaba_Qwen · 4月30日73

Today we’re releasing Qwen-Scope 🔭, an open suite of sparse autoencoders for the Qwen model family. It turns SAE features into practical tools： 🎯 Inference — Steer model outputs by directly manipulating internal features, no prompt engineering needed 📂 Data — Classify & synthesize targeted data with minimal seed examples, boosting long-tail capabilities 🏋️ Training — Trace code-switching & repetitive generation back to their source, fix them at the root 📊 Evaluation — Analyze feature activation patterns to select smarter benchmarks and cut redundancy We hope the community uses Qwen-Scope to uncover new mechanisms inside Qwen models and build applications beyond what we explored.Excited to see what you build! 🚀 🔗🔗 Blog: https://qwen.ai/blog?id=qwen-scope HuggingFace: https://huggingface.co/collections/Qwen/qwen-scope ModelScope: https://modelscope.cn/collections/Qwen/Qwen-Scope Technical Report: https://qianwen-res.oss-accelerate.aliyuncs.com/qwen-scope/Qwen_Scope.pdf

译Qwen团队推出开源稀疏自编码器套件Qwen-Scope，将SAE特征转化为实用工具。该套件支持四大应用方向：无需提示工程即可通过直接操控内部特征引导模型输出；用极少样本对目标数据进行分类与合成，提升长尾能力；追踪代码切换和重复生成问题的根源并进行修复；通过分析特征激活模式优化评测基准并减少冗余。团队希望社区利用Qwen-Scope深入探索Qwen模型内部机制，并开发出超越现有研究范围的应用。相关资源已开放。

向阳乔木@vista8 · 4月30日50

http://x.com/i/article/2049847033758916609 # 又是节假日搞事情！DeepSeek开源视觉大模型，读完论文帮你划重点昨天体验了网页端的DeepSeek的识图模式，速度超级快，质量也还行。今天看到DeepSeek公布了论文，果然秉承了优良传统，节假日前搞事情，Respect！ Github地址：https://github.com/deepseek-ai/DeepSeek-VL > 论文查看 https://arxiv.org/pdf/2403.05525 AI总结，人工阅读Review配图如下。 ## 一句话总结 DeepSeek-VL是DeepSeek团队开源的视觉语言模型，有1.3B和7B两个版本。核心目标只有一个：在真实场景里既能看图又能说话，同时不丢失语言能力。它从数据、架构、训练策略三个维度入手，在同等参数规模下做到了开源模型里的顶尖水平。 ## 这篇论文到底在解决什么问题？ 2024年初，开源多模态模型和GPT-4V之间有一道明显的鸿沟。很多开源模型在学术benchmark（基准测试，就是标准化的评分考试）上跑分还行，但一到真实场景就拉胯。让它看一张网页截图、读一份PDF、识别街拍里的小字，效果就大打折扣。 DeepSeek团队总结了四个核心原因：第一，预训练不够充分。很多模型把大量算力花在了指令微调阶段，但真正的通用能力来自大规模预训练。这就像一个人只刷题不读书，考试可能还行，解决实际问题就不够用了。第二，训练数据和真实使用场景脱节。把一堆学术数据集拼在一起做微调，benchmark分数好看，但用户实际用起来体验很差。第三，图像分辨率太低。大部分模型只能处理336×336或448×448像素的图片。现实世界里的OCR（光学字符识别，就是让AI读懂图片里的文字）、小物体识别，这个分辨率根本不够用。第四，多模态训练会"吃掉"语言能力。很多模型在加入视觉训练后，语言能力明显下降。这是一个很棘手的问题，也是这篇论文花了最多篇幅去解决的事情。 ## 数据构建：从真实场景出发 DeepSeek-VL的数据分两大块：预训练数据和监督微调数据。 ## 预训练数据覆盖面非常广，按类别拆开来看：图文交错数据（占13.1%）就是图片和文字混合出现的内容，比如维基百科文章里图文穿插的格式。用了MMC4、Wikipedia中英文、Wikihow，以及内部的PDF和电子书。这类数据能让模型学会在上下文里理解多张图片，也就是所谓的"多模态上下文学习"能力。图像描述数据（占11.1%）高质量的图文配对数据集，包括Capsfusion、TaiSu（一个1.66亿规模的中文视觉语言数据集）和Detailed Caption。表格和图表数据（占2.1%）来自十多个公开数据集，涵盖各种图表、地理题、科学题、UI截图等，让模型学会理解各种结构化视觉信息。网页代码数据（占0.4%）这部分很有意思。团队从GitHub上抓取了146万个Jupyter Notebook，提取了其中的图表和对应的生成代码，最终筛选出110万个高质量的图文代码对。目标是让模型能从图形界面或可视化图表反推出代码。文档OCR数据（占2.1%）当时市面上没有大规模的中英文文档OCR数据集，团队自己造了。两个来源：一是从140万篇arXiv论文里提取图文对；二是从86万本英文电子书和18万本中文电子书里，用HTML渲染工具生成了配对的图片和文本。场景文字OCR数据（占1.2%）识别融入环境的文字，比如街道招牌、商品包装。用了ArT、MLT-17、LSVT、UberText等十个公开数据集。纯文本数据（占70%）这个比例是整个训练策略的核心，后面会详细解释。用的是DeepSeek-LLM的2万亿token文本语料。 ## 监督微调数据微调数据分四类：内部数据（占10.5%）这是最有价值的部分。团队先从网上收集了GPT-4V和Gemini的真实用户测试案例，然后把这些案例整理成一套完整的分类体系，再根据这套分类体系去选图、写提示词，构建出贴近真实使用场景的微调数据。通用多模态数据（占35.5%）包括ShareGPT4V、LAION-GPTV、LVIS-Instruct4V等知名开源数据集。表格图表和网页代码数据（各占4.1%和2.0%）从预训练数据集里抽取部分用于微调。纯文本对话数据（占47.9%）沿用DeepSeek-LLM的文本对话数据，保住语言能力。 ## 那套分类体系长什么样？这套分类体系是整个数据构建思路的精华，值得单独说说。团队把多模态模型的真实使用场景分成六大类：识别类：全局描述（场景、风格、食物）、局部描述（位置、人物、Logo、计数）、OCR转录（印刷体、手写体）。转换类：图片转代码（UI转代码、图表转代码、公式转代码）、图片转文本（生成提示词、文字摘要、图片创作）。分析类：数据图表分析、专业图表分析（电路图、流程图、地图、乐谱、平面图）、专业图像分析（传感器图像、医学图像）、百科知识分析（艺术文化、自然环境、衣食住行）。常识推理类：关系推理（人际、空间、大小）、功能推理（硬件、软件）、环境推理（具身智能）、异常推理（缺陷检测、事故判断）。逻辑推理类：数学推理（代数、平面几何、立体几何）、其他逻辑推理（物理、化学、生物、代码、智力题）。评估类：真实性评估、相似度评估、美学评估。还有多图理解和安全两个额外类别。这套分类体系同时用于数据构建和效果评估，保证了训练和测试的一致性。这是"从真实用户需求倒推数据构建"的典型做法，比单纯堆学术数据集高明得多。 ## 模型架构：三个模块协同工作整个模型由三个模块组成：混合视觉编码器、视觉语言适配器、语言模型。 ## 混合视觉编码器这是技术上最有意思的部分之一。传统的视觉语言模型通常只用一个视觉编码器，比如CLIP系列的SigLIP（一种用图文对比训练出来的视觉编码器）。但SigLIP有两个问题：一是存在"CLIP盲点对"现象，视觉上明显不同的两张图片，经过SigLIP编码后可能得到非常相似的表示，导致模型分不清楚。二是分辨率有限，最高只到512×512，处理不了需要精细识别的任务。 DeepSeek-VL用了双编码器混合方案： SigLIP-L 处理低分辨率（384×384）输入，负责提取高层语义特征，擅长理解图片"讲了什么"。 SAM-B 处理高分辨率（1024×1024）输入，负责提取低层细节特征。 SAM是 Meta 开发的"Segment Anything Model"（万物分割模型），其中的ViTDet图像编码器（一种专门为目标检测优化的视觉Transformer）特别擅长捕捉精细的局部信息，比如小字、边缘、纹理。两个编码器输出的特征经过适配器处理后，最终融合成576个视觉token（可以理解为576个"视觉词"）。这个数字很关键，它在视觉信息量和计算成本之间找到了平衡点，既能支持多轮对话，又不会让推理成本爆炸。为了验证这个选择，团队做了对比实验，测试了CLIP、SigLIP、SigLIP+DINO、SigLIP+SAM四种组合的训练损失曲线。结果显示SigLIP+SAM的组合训练损失下降最快、最低，证明引入视觉自监督编码器确实有效。 ## 视觉语言适配器这是连接视觉编码器和语言模型的桥梁，用的是两层混合MLP（多层感知机，一种基础的神经网络结构）。具体做法：先用两个独立的单层MLP分别处理高分辨率和低分辨率特征，然后把两个特征拼接在一起，再通过一层MLP映射到语言模型的输入空间。为什么要用两个独立的MLP而不是共享一个？团队做了消融实验，对比了几种适配器设计： - 序列拼接（把视觉特征在序列维度上堆叠）：效果一般，而且计算量更大 - 嵌入维度拼接（在特征维度上拼接）：效果更好 - 共享MLP：特征融合充分，但对不同编码器的特征分布适应性差 - 独立MLP：能精准适应各自编码器的特征分布，但融合不够 - 混合MLP（先独立处理再拼接）：综合了两者优点，效果最好 ## 语言模型基于DeepSeek-LLM构建，架构上和LLaMA高度相似：用了RMSNorm（一种更高效的归一化方法）、SwiGLU激活函数（一种改进的门控线性单元）、旋转位置编码（RoPE，一种让模型理解token位置关系的方法）。两个版本的基础： - DeepSeek-VL-1.3B 基于 DeepSeek-LLM-1B（用约5000亿文本token训练） - DeepSeek-VL-7B 基于 DeepSeek-LLM-7B（用约2万亿文本token训练）值得注意的是，团队选的是DeepSeek预训练模型的中间检查点，而不是最终版本，然后继续做多模态预训练。 ## 训练策略：三阶段 + 模态平衡训练分三个阶段，每个阶段解决不同的问题。 ## 第一阶段：热身视觉语言适配器冻住视觉编码器和语言模型，只训练适配器。数据：125万张来自ShareGPT4V的图文描述对 + 250万个文档OCR渲染对。这个阶段的目标是在视觉和语言的嵌入空间之间建立初步的概念连接，让语言模型能"认识"视觉特征。团队做了一个重要实验：把第一阶段的数据量从2K步扩展到80K步，然后直接做微调，看效果有没有提升。结果是扩大数据量没有任何帮助，性能甚至略有下降。原因很清楚：适配器（两层MLP）的参数量太少，容量有限，学到一定程度就饱和了。再多的数据也塞不进去。这也说明了为什么必须有第二阶段。 ## 第二阶段：联合视觉语言预训练这是最关键的阶段，也是论文最核心的贡献之一。解冻语言模型和适配器，视觉编码器保持冻结，用大规模图文混合数据继续预训练。团队发现了一个严峻问题：如果只用多模态数据训练，语言能力会断崖式下降。上图展示了这个现象。在1B模型上，用100%多模态数据训练时，MMBench（多模态理解测试）分数缓慢上升，但HellaSwag（语言理解测试）和MMLU（多学科知识测试）分数急剧崩塌。团队分析了两个原因：一是多模态数据本身比纯文本数据简单得多，分布差异很大，直接用来训练会"稀释"语言知识。二是视觉模态和语言模态之间存在竞争关系，学多了视觉就会忘记语言，这是一种"灾难性遗忘"。解决方案是联合语言多模态训练：在训练时同时混入大量纯文本数据。实验结果非常清晰： - 混入语言数据后，语言能力下降的问题得到了大幅缓解 - 加入语言数据并不会明显损害多模态性能 - 两种模态的性能都和它们在训练数据中的比例强相关最终选定的比例是语言70%、多模态30%。这个比例能让模型在保住语言能力的同时，获得足够的多模态预训练效果。除了混合比例，团队还提出了两个实用技巧：模态分组训练直接把语言数据和多模态数据混在同一个batch里，会有效率问题：纯文本处理很快，但要等多模态数据处理完才能更新参数，造成大量等待。解决方法是把不同模态的数据分开组batch，每个训练步要么全是语言数据，要么全是多模态数据。这一招让训练效率提升了20%，而且性能完全不受影响。模态热身策略训练初期把语言数据比例设为100%，然后逐渐降到目标比例（70%）。这样可以避免训练初期语言能力的剧烈波动，让模型更平稳地适应多模态数据的加入。实验证明，这个策略在训练后期也能带来更好的语言和多模态双端性能。还有一个容易被忽略的工程细节。团队在1.3B小模型上做实验迭代，再放大到7B。但小模型有个问题：在标准benchmark上表现波动极大，很难准确反映改进效果。原因是小模型虽然"知道"正确答案，但没有足够的指令跟随能力把答案"说出来"。解决方案是双管齐下： 1. 把评估方式从"让模型生成答案"改成"比较各选项的困惑度（PPL，一种衡量模型对某段文本有多困惑的指标，越低越好）" 1. 同时在预训练阶段混入少量指令微调数据，让小模型也能稳定地跟随指令。这样小模型就能给出稳定的评估信号，大大加快了迭代速度。 ## 第三阶段：监督微调用前面构建的指令微调数据集，训练模型的对话和指令跟随能力，最终得到DeepSeek-VL-Chat。这个阶段同时训练语言模型、适配器和SigLIP编码器。 SAM-B因为显存限制保持冻结。只对答案和特殊token计算损失，系统提示和用户输入不参与训练。三个阶段缺一不可：只有第一阶段+第三阶段，平均分57.4；加上第二阶段后提升到62.4。第一阶段虽然提升有限，但也有意义，去掉它性能会略微下降。 ## 训练基础设施使用了DeepSeek自研的HAI-LLM分布式训练框架。 DeepSeek-VL-7B用了64个节点（每个节点8块英伟达A100 GPU），训练了5天。 DeepSeek-VL-1.3B用了16个节点，训练了7天。 ## 效果评测：数字说话 ## 多模态benchmark ## 7B模型在开源模型里表现最好： - SeedBench（综合多模态理解）：70.4，接近GPT-4V的71.6 - MMBench（多模态综合测试）：73.2，超过所有同级开源模型 - OCRBench（OCR专项测试）：456分，远超同级模型（LLaVA-1.5 13B只有331分） - POPE（幻觉测试，测模型会不会"看到"不存在的东西）：88.1，同级最高 - MathVista（数学视觉推理）：36.1，超过所有同级开源模型，但和GPT-4V（47.8）还有差距 - CMMMU（中文多学科多模态理解）：37.9，明显优于其他开源模型 1.3B小模型更令人印象深刻：用一半不到的参数（1.3B vs 2.7B），在MMBench上超过了MobileVLM V2 2.7B（64.6 vs 63.2）。 MathVista上甚至达到31.1，和部分7B模型相当。 ## 语言benchmark 这是DeepSeek-VL最值得骄傲的地方之一。 - HellaSwag：68.4（DeepSeek-LLM-7B是68.5），几乎持平 - MMLU：52.4（DeepSeek-LLM-7B是49.4），多模态训练后反而提升了 - AGIEval：27.8（DeepSeek-LLM-7B是19.3），同样提升 - GSM8K（数学）：55.0（DeepSeek-LLM-7B是63.0），有所下降数学能力的下降说明视觉和语言模态之间的竞争关系仍然存在，7B的模型容量在这里成了瓶颈。团队认为更大的模型可以缓解这个问题。 ## 人工评测团队构建了100道题，覆盖七个类别，和InternLM-XComposer2-VL、CogVLM-17B、GPT-4V做对比。结论：DeepSeek-VL-7B在识别、转换、常识推理三个维度接近GPT-4V，整体表现优于其他开源模型。逻辑推理是所有开源模型的共同短板，和GPT-4V差距最大。还做了GPT-4V裁判评测：把DeepSeek-VL和其他模型的回答都给GPT-4V看，让它判断哪个更好。结果DeepSeek-VL在超过60%的情况下被GPT-4V判定为更好，甚至在和GPT-4V自身对比时也获得了相当不错的评价。真实场景能力展示论文里展示了大量真实案例，值得逐一看看：处理逻辑图、网页、公式识别、科学文献、自然图像、具身智能场景的综合展示识别图片中的细小物体（骑车人在女士手提包的左侧），并给出有条理的解释理解Python代码截图并逐步解释算法逻辑看懂儿童编程Scratch流程图并转换成Python代码（对比的开源模型1直接说"我无法处理图片"） 1. 分析训练损失曲线图表，找出代码中的bug 识别泰国10泰铢硬币；根据图片创作七言绝句；识别米哈游游戏角色把真实表格图片转换成Markdown格式 ## 为什么这篇论文重要？它揭示并量化了一个关键矛盾：多模态能力和语言能力之间存在竞争关系。 70%语言数据这个比例，加上模态热身策略和分组训练，提供了一套可复制的解决方案。它证明了"从真实场景出发"构建数据的重要性。用分类体系指导数据收集，比随意拼凑学术数据集有效得多。混合视觉编码器的思路很实用。用两个互补的编码器分别处理语义和细节，比单纯堆高分辨率更高效，576个token的压缩方案在信息量和计算成本之间找到了合理平衡。小模型迭代方法论有很强的工程参考价值。困惑度评估 + 少量指令数据混入，让1.3B模型也能给出稳定的实验信号，大幅降低了迭代成本。当然，局限也很清楚。 7B的模型容量限制了数学推理等复杂任务的表现。论文结尾提到了后续会引入MoE（Mixture of Experts，混合专家模型）技术，这也是后来DeepSeek-VL2的方向。

译DeepSeek团队开源视觉语言模型DeepSeek-VL，包含1.3B和7B两个版本，旨在缩小开源模型与GPT-4V在真实场景中的差距。模型从数据、架构、训练三方面优化：数据构建上，采用从真实用户需求倒推的分类体系，并包含70%纯文本以保持语言能力；架构上创新采用SigLIP与SAM-B的混合视觉编码器，分别处理语义与细节特征；训练采用三阶段策略及模态平衡技术，缓解多模态训练对语言能力的侵蚀。

Baidu Inc.@Baidu_Inc · 4月30日47

http://x.com/i/article/2049739903970529280 # Anyone Can Build Now, with MeDo At the start of this month, a lighthearted trend unexpectedly went viral: SBTI, a meme-style personality quiz that spread rapidly across Chinese social media, driven by how easy it was for users to create, remix, and share their results. It was entertaining, but it also pointed to something deeper. SBTI took off because it lowered the barrier to participation. Anyone could create something personal, recognizable, and instantly shareable. That same principle is now driving a new wave of AI applications, especially vibe coding tools. What once required technical expertise can now start with a simple prompt. This is the shift we've been watching closely. And it's exactly what we built Miaoda and MeDo for. We first launched Miaoda in 2024 as a no-code, conversational application development platform powered by generative AI, able to deliver multi-agent collaboration and multi-tool invocation. As the platform evolved, we introduced Miaoda 2.0 in 2025, significantly expanding its capabilities and launching its international version, MeDo, bringing the same building experience to developers worldwide. At its core, Miaoda and MeDo allow anyone to describe an application in natural language and receive a fully functional, deployable product in minutes, without writing a single line of code. Behind the scenes, the platform orchestrates a team of 10+ specialized AI agents that work together across the full development process, from requirement analysis and code generation to testing and deployment. Users are not just generating prototypes. They are building full-stack, production-ready applications that can be continuously refined through simple, iterative prompts. ## 🌍 Expanding Who Gets to Build There are nearly 8 billion people in the world, but only around 30 million are professional programmers. That means just 0.4% of the global population has traditionally had the ability to turn ideas into working software. For everyone else, the barriers have been clear: complex syntax, steep learning curves, and high development costs. Countless ideas never make it past the concept stage. By shifting creation from writing code to describing intent, we're opening the door for a much larger group of people to build digital tools, whether for business, education, or everyday problem-solving. As of the end of 2025, Miaoda has powered the creation of 500K+ commercial apps across 200+ real-world scenarios. Notably, 81% of creators are non-programmers, and the apps they're building are already serving 10M+ users with nearly 100K daily users. But who are the people that are actually building with it? Take Sean, a UK-based developer with no technical background. He tried to teach himself game development through online tutorials but gave up when the complexity became too much. Last November, he discovered AI coding tools and found MeDo through a Google search. After his first session, he was hooked. Since then, he's built multiple games with high production quality and sophisticated mechanics — the kind of projects that would have previously required a professional development team. Then there's Bubu, a medical escort service provider based in Guangzhou, China. Working on the front lines of that market, Bubu understood exactly what clients cared about most: a trustworthy, professional service with standardized processes and real-time updates on how appointments were progressing — especially for adult children living away from home who needed to stay informed about their parents' care. Using Miaoda, Bubu was able to create and launch a fully functional WeChat mini-program built around those exact needs — clean interface, clear workflows, and a seamless delivery experience. After launch, it attracted 8 new clients through organic traffic alone, generating over RMB 7,000 in additional revenue and meaningfully boosting overall business income. These examples point to a broader shift. When the barrier to building drops, more people experiment, and more ideas turn into real products. ## 🤝 AI for Good: Silent Guardian Miaoda and MeDo are also enabling projects with a deeper purpose. One example is Silent Guardian, an anti-fraud application designed for the hearing-impaired community. Hearing-impaired individuals are often more vulnerable to scams due to communication barriers and limited access to real-time verification tools. Using the platform, a single creator was able to build a comprehensive solution that includes fraud education, evidence collection, and real-time alerts. The app uses AI to convert speech into text and visual sign language cues, helping users better understand potential risks and respond in real time. For communities like this one, the value of a tool isn't just what it builds — it's who finally gets to use it. ## 🚀 Building Momentum: From Individuals to Teams As more users build with Miaoda and MeDo, we're seeing these tools move beyond individual use cases and into real-world environments. That shift is already visible in our developer ecosystem. Recent Miaoda hackathons in Beijing have brought together creators experimenting with new ideas and applications, showing how quickly simple prompts can turn into functional products. At the same time, this activity is expanding globally. The Build with MeDo Hackathon is now underway, inviting creators from around the world to push the boundaries of what can be built with no-code AI. Submissions are open through May 20, with top projects showcasing how quickly ideas can turn into fully functional applications. As the community grows, we're also bringing these tools into more structured environments. This month, we introduced the MeDo Enterprise Version, designed to bring the same no-code, natural language development experience to teams. With a structured workspace for enterprises, teams, and individual members, it enables collaborative building while keeping data private and resources easy to manage. Taken together, these developments point to a broader transition, from individual experimentation to collaborative, real-world deployment across teams and organizations. ## 🏁 What's Next: Baidu Create 2026 As more people gain the ability to build, the definition of a "developer" is starting to change. Looking ahead, this shift will come into focus at Baidu Create 2026, our flagship developer conference in Beijing on May 13–14. For the first time, it will be held alongside the Yunzhi Summit (the company's annual AI Cloud event), bringing together developers, businesses, and industry leaders to explore the latest in AI and agents. This year's theme, "Agents at Scale," reflects a clear direction for the industry, moving from early experimentation to real-world deployment, and from building tools to building with autonomous, agent-driven systems. Here's a quick look at what else we've been working on this month across AI models and developer tools: > GenFlow 4.0 Launched at Baidu AI Day - GenFlow 4.0 is a major upgrade to our general AI agent, with a fully revamped Office Agent at its core. With a single instruction, users can now invoke PPT, Excel, and Word agents in parallel, handling the full range of workplace tasks in one place. - It is now deeply integrated with OpenClaw, and users can deploy it directly from the Baidu Drive PC or mobile app. > PaddleOCR Becomes the Most-Starred OCR Project on GitHub - PaddleOCR has officially become the #1 most-starred OCR repository globally on GitHub, surpassing long-standing benchmarks including Google's Tesseract. - Built on ERNIE foundation models, it offers a robust end-to-end pipeline for high-precision text recognition and structured document parsing across 110+ languages. > ERNIE-Image Open-Sourced for Developers - We released ERNIE-Image, an open 8B parameter text-to-image model that delivers strong performance across instruction following, multilingual text rendering, and structured visual generation, while remaining lightweight enough to run on consumer hardware. - ERNIE-Image comes in two versions: the SFT model optimized for stronger general quality in 50 inference steps, and ERNIE-Image-Turbo, optimized for speed and aesthetics in just 8 steps. > Famou-Agent 2.0 Sets a New SOTA on MLE-Bench - Famou-Agent 2.0 has once again ranked #1 on MLE-Bench, setting a new SOTA for the second time since it first topped the benchmark last October. Famou-Agent is a general-purpose multi-agent framework trusted by thousands of enterprises across manufacturing, finance, transportation, and beyond to tackle complex real-world challenges. - The 2.0 upgrade brings key improvements across evolution strategies, long-horizon memory, and infrastructure. A full reveal is coming at Baidu Create next month — stay tuned. Have a question about building with MeDo, or something you'd love us to cover next? Leave a comment or DM us! Until our next roundup, keep up with our latest AI developments and innovations by following us on LinkedIn and X.

译SBTI迷因测试的流行，揭示了低门槛参与的趋势正驱动AI应用开发工具的演进。Miaoda及其国际版MeDo是一个生成式AI驱动的无代码对话式应用开发平台，用户仅需通过自然语言描述，即可在几分钟内获得功能完整、可部署的应用，无需编写代码。平台背后由10多个专用AI代理协作，覆盖从需求分析到部署的全流程。传统上全球仅0.4%的人口是专业程序员，而该平台已助力创建超50万个商业应用，其中81%的创建者是非程序员，服务超1000万用户。这标志着开发方式从编写代码转向描述意图的根本性变革。

Nathan Lambert@natolambert · 4月30日53

I worry deeply already about companies controlling access to very powerful AI, which will come in a soft form with very expensive subscriptions. This is a step further, with the government confusingly exerting control without clear explanation. This control of AI can create massive dystopian societies. It’ll rapidly lead to concentration of power. Having open models follow closely in capabilities is a great way to minimize political and power games here.

译推文指出，当前AI访问权正被企业和政府双重控制：企业通过高价订阅实现软性垄断，而政府则以安全为由限制Mythos等系统的使用范围，且未给出清晰解释。这种控制将导致权力急剧集中，可能催生反乌托邦社会。作者认为，推动开源模型能力紧追闭源模型，是减少政治博弈和权力集中的关键途径。

ginobefun@hongming731 · 4月30日47

AGI 的到来并不遥远，Demis Hassabis 在最新访谈中预计其时间节点大概在 2030 年左右。科技创业者必须提前将其诞生纳入长远战略规划，确保研发的产品在未来依然具备核心竞争力。目前的底层架构虽然奠定了良好基础，但通往终极形态还需重点攻克持续学习与长期推理这两大难题。当下的系统主要依赖扩大上下文窗口来堆积海量信息，这种做法相对粗暴且低效。理想的持续学习应当像人类大脑海马体那样，将新知识优雅且高效地融入现有的认知体系中。另外，由于缺乏对自身思维过程的内省与监控能力，模型在进行长逻辑链推理时极易陷入死循环。具备自主规划和行动能力的智能体被视为通向通用人工智能的必经之路。业界正在加速发掘智能体的真实商业潜能，使其从早期的概念演示转变为真正提升生产效率的实用工具。不过，由于依然欠缺持续学习能力，目前的智能体难以完美适应复杂多变的特定应用环境，这也制约了它们独立完成大型复杂任务的可能。在模型生态的演进路径上，大小模型协同运作已成为核心趋势。蒸馏技术让轻量级模型能够以极低的算力成本达到前沿大模型绝大部分的性能指标。这种高效的端侧模型不仅大幅降低了服务响应成本并保障了用户隐私安全，还将成为未来家庭机器人的标配设施。本地轻量级模型与云端超大模型的协同编排，结合原生多模态能力的加持，将共同构建出全面理解并重构物理世界的基础设施。

译Demis Hassabis预测AGI将在2030年左右到来，科技创业者必须提前将其纳入长远战略规划。当前底层架构需攻克持续学习与长期推理两大难题，智能体被视为通向AGI的必经之路，但受限于持续学习能力难以适应复杂环境。模型生态上，大小模型协同运作成为趋势，蒸馏技术使轻量级模型以低成本达到高性能，端侧模型降低成本并保障隐私，未来与云端超大模型协同构建理解物理世界的基础设施。

Chubby♨️@kimmonismus · 4月30日33

ngl, most relatable feeling there is. Open source, locally = <3

译说真的，这是最能引起共鸣的感觉了。开源，本地化 = <3

SemiAnalysis@SemiAnalysis_ · 4月30日46

TEHRAN, April 29, 2026 -- Less than a week after the release of @deepseek_ai DeepSeek v4 Pro, the cracked team at @vllm_project and @inferact has achieved considerable improvement on GB200 (Dynamo+vLLM). This is largely due to the release of vLLM 0.20.0, which comes with MegaMoE kernel enabled for DEP deployments! Great work -- we are excited to highlight more improvements over the coming days.

译德黑兰，2026年4月29日——在@deepseek_ai DeepSeek v4 Pro发布不到一周后，@vllm_project和@inferact的破解团队在GB200（Dynamo+vLLM）上取得了显著改进。这主要得益于vLLM 0.20.0版本的发布，该版本为DEP部署启用了MegaMoE内核！出色的工作——我们期待在未来几天重点介绍更多改进。

swyx 🇸🇬@swyx · 4月30日64

IMO DeepSeek v4 demonstrated utter confidence and competence by not benchmaxxing, not focusing on some BS final run cost, not even spending inference-optimal compute. just showed up, demonstrated SOTA long context efficiency techniques (CSA, HCA, mHC, flash at 8% cost of pro, which itself is 14% cost of opus), dropped the best open base models in the world, peaced out. BYO posttraining. leave that to the agent labs to pick up the scraps. bravo.

译IMO DeepSeek v4 展现了十足的自信与能力，它没有进行基准刷分，没有关注某些无意义的最终运行成本，甚至没有投入推理最优的计算资源。只是亮相，展示了SOTA的长上下文效率技术（CSA、HCA、mHC，以pro版本8%的成本实现flash，而pro版本成本仅为opus的14%），发布了全球最佳的开源基础模型，然后潇洒离场。后续训练请自行处理。留给智能体实验室去收拾残局吧。喝彩。

Chubby♨️@kimmonismus · 4月30日51

Mistral Medium 3.5 is interesting less for the benchmarks and more for the positioning. Look at who they're comparing against: Kimi, Qwen, GLM, Claude (Sonnet). Not GPT, not Gemini. And i dont mean that in a negative way! With Aleph Alpha being acquired by Cohere last week, Mistral is now the only non-US, non-Chinese lab still in the frontier conversation. At 128B dense with open weights, they're making a different bet than the Chinese MoE models in that chart (which activate only 17-40B params despite being 400B-1T total). Mistral is trading inference efficiency for consistency. The Collie score (95.8, best in class by a wide margin) tells you where they're aiming: not raw reasoning, but the most reliable model to actually follow instructions in production. That's a European enterprise pitch, not a benchmark race. Very solid release from Mistral!

译Mistral Medium 3.5是MistralAI的新旗舰模型，以公共预览版发布。它整合指令遵循、推理和编码能力，采用128B密集参数和256k上下文窗口，支持可配置推理努力。模型定位比基准测试更关键，比较对象包括Kimi、Qwen、GLM和Claude Sonnet，而非GPT或Gemini。随着Aleph Alpha被Cohere收购，Mistral成为唯一非美国、非中国的尖端实验室，以开源权重和修改的MIT许可证发布。模型在推理效率与一致性间权衡，Collie分数达95.8领先，目标不是原始推理，而是成为生产中可靠遵循指令的模型，体现欧洲企业定位。它是Mistral Vibe和Le Chat的新默认模型。

Ant Ling@AntLingAGI · 4月30日61

Thanks to the dedicated support for Ling-2.6-1T from day0 partner @vllm_project ! As the pioneer of the 1T sized models, we know how important hardware - software - llm co-design is. The best engineering ecosystem collaboration leads to the best optimization and user experience. Let's ROLL together! 🖖

译AntLingAGI 开源了 Ling-2.6-1T 模型，这是一个面向现实世界智能体工作流程的新旗舰模型。作为 1T 参数规模模型的先驱，团队强调了硬件、软件与 LLM 协同设计的重要性。vLLM 项目从发布首日（Day-0）起即提供支持，体现了顶尖工程生态系统的协作。这种合作旨在实现最佳的优化效果与用户体验，共同推动技术进步。

TestingCatalog News 🗞@testingcatalog · 4月29日63

MISTRAL 🚨: Mistral AI released Mistral Medium 3.5, a 128B dense open weights model with a 256k context window and configurable reasoning effort. Mistral Medium 3.5 is now available on Mistral Vibe and Le Chat.

译MISTRAL 🚨: Mistral AI 发布了 Mistral Medium 3.5，这是一个拥有 256k 上下文窗口和可配置推理算力的 128B 密集开放权重模型。 Mistral Medium 3.5 现已在 Mistral Vibe 和 Le Chat 上可用。

Artificial Analysis@ArtificialAnlys · 4月29日63

IBM has released three new non-reasoning Granite 4.1 models (30B, 8B, 3B) as open weights under Apache 2.0. All three are notably token-efficient relative to peer non-reasoning models, with the 8B standing out for its token efficiency relative to intelligence @IBM has released three new instruct models in the Granite 4.1 family: Granite 4.1 30B (15 on the Intelligence Index), Granite 4.1 8B (12), and Granite 4.1 3B (9). The release continues IBM's focus on small, efficient, and open models for enterprise and edge deployment, alongside the existing Granite 4.0 Nano family (1B and 350M variants released in October 2025). The Intelligence Index is the Artificial Analysis synthesis metric incorporating 10 evaluations covering agentic tasks, coding, and scientific reasoning. Key benchmarking results: ➤ All three Granite 4.1 models score 61 on the Artificial Analysis Openness Index, standing out among peer open weights non-reasoning models. This is driven by full open weights under Apache 2.0 plus partial disclosures across pre-training data, post-training data, and training methodology. Granite 4.1 sits well above peers like Qwen3.5 (39), Gemma 4 (39) and GLM-4.7-Flash (44), and represents a meaningful improvement over the Granite 4.0 family (56), driven by stronger methodology disclosure. Olmo 3.1 and K2 Think V2 (both 89) remain leaders as the most ‘open’ models. ➤ Granite 4.1 8B uses just 4M output tokens to run the Intelligence Index. This is ~20x fewer than Qwen3.5 9B (78M tokens), ~3x fewer than Ministral 3 8B (13M), and ~2x fewer than Gemma 4 E4B (8M). The pattern holds across the family: Granite 4.1 30B uses 4.6M output tokens (vs 7M for Gemma 4 31B and 25M for Qwen3.5 27B), and Granite 4.1 3B uses 2.7M. ➤ Token efficiency comes at the cost of intelligence relative to peer non-reasoning models. Granite 4.1 30B (15) trails leading peers like Qwen3.5 27B (37) and Gemma 4 31B (32). Granite 4.1 8B (12) trails Ministral 3 8B (15) and Gemma 4 E4B (15). Granite 4.1 3B (9) trails Gemma 4 E2B (12). ➤ Granite 4.1 30B and 3B both gain on the Intelligence Index over their Granite 4.0 predecessors. Granite 4.1 30B (15) gains 4 points over Granite 4.0 H Small (32B / 9B active, 11), with the largest gains in tool use (τ²-Bench: 42% vs 17%) and agentic tasks (GDPval-AA: 493 vs 344 Elo). Granite 4.1 3B (9) gains 1 point over Granite 4.0 Micro (8). Other information: ➤ License: Apache 2.0 (open weights, permissive commercial use) ➤ Context window: 128K tokens ➤ Availability: Granite 4.1 8B is available via @WandB ($0.05/$0.1 per 1M input/output tokens) and @replicate. Weights for all three models are available via @huggingface.

译IBM发布了三款采用Apache 2.0许可的Granite 4.1开源模型（30B、8B、3B）。其核心特点是极高的令牌效率，例如8B模型运行智能指数仅需4M输出令牌，远低于同类模型。在开放性指数上，三款模型均获得61分，领先多数同行。但高效率也带来了智能指数的相对折衷，其得分低于Qwen3.5、Gemma 4等竞品。不过，与上一代Granite 4.0系列相比，新模型的智能表现仍有提升。该系列模型拥有128K令牌的上下文窗口，主要面向企业和边缘部署，可通过WandB、Replicate和Hugging Face获取。

Ant Ling@AntLingAGI · 4月29日59

As part of the open model release, that lightning-fast elephant-alpha you loved on @OpenRouter is here to stay. Meet Ling-2.6-flash, powered by @novita_labs for robust and cost-effective performance. Plus, enjoy a 20% discount on us starting right now! 👇 https://openrouter.ai/inclusionai/ling-2.6-flash

译此前在OpenRouter上备受喜爱的快速模型“elephant-alpha”现已永久保留并正式开源，命名为Ling-2.6-flash。该模型由novita_labs驱动，旨在提供稳健且高性价比的性能。它专为现实世界智能体工作流打造，拥有1040亿总参数和74亿活跃参数，并提供多种精度版本以适应不同部署需求。其核心优势包括高达每秒215个令牌的生成速度、仅需1500万令牌即可完成完整智能评估的高效令牌利用率，以及在编码、文档处理和轻量级智能体任务中的强大执行能力。同时，模型在中文切换和主流编码框架兼容性方面体验更佳。为庆祝发布，现提供20%的折扣。

Tencent Hy@TencentHunyuan · 4月29日67

We're open-sourcing Hy-MT1.5-1.8B-1.25bit — a 440MB translation model that runs fully offline on your phone, supports 33 languages, and outperforms Google Translate. At 1.8B parameters, it matches commercial translation APIs and 235B-scale models on standard benchmarks. By quantizing to 1.25-bit, memory drops from 3.3GB (FP16) to 440MB — 25% smaller and ~10% faster than prior 1.67-bit approaches, with no accuracy loss. Covers 33 languages, 5 dialects, and 1,056 translation directions including minority languages like Tibetan and Mongolian. Our translation model has won 30 first-place rankings in international MT competitions and is already deployed across multiple Tencent products.🏆 📲Demo APK (Android): https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit-GGUF/resolve/main/Hy-MT-demo.apk 🤗Hugging Face:: https://huggingface.co/AngelSlim/Hy-MT1.5-1.8B-1.25bit 🔗GitHub: https://github.com/tencent/AngelSlim 📄Paper: https://arxiv.org/abs/2601.07892

译腾讯开源了Hy-MT1.5-1.8B-1.25bit翻译模型，其参数量为18亿，经量化后仅440MB，可在手机上完全离线运行。该模型支持33种语言、5种方言及1056个翻译方向，包括藏语、蒙古语等少数语言。在标准测试中，其性能媲美商业翻译API和2350亿参数的大模型。通过量化至1.25比特，模型内存占用从FP16格式的3.3GB大幅降低，比之前的1.67比特方法体积缩小25%、速度提升约10%，且无精度损失。该模型已在国际机器翻译竞赛中获得30项第一，并部署于腾讯多个产品中。

meng shao@shao__meng · 4月29日56

这两天在试用商汤刚开源的 SenseNova-U1，印象最深的不是 benchmark 分数，是它的架构方向。现在的多模态模型大多还是"语言模型 + 视觉编码器 + VAE"拼起来的，视觉信息要先被翻译一道再进 LLM。 U1 用的 NEO-Unify 把翻译层直接拿掉了，语言和视觉跑在同一表征里。所以它读图、想、画图,是在一次推理里完成的，不是分三步走。

译商汤开源的 SenseNova-U1 模型在架构上实现关键突破。传统多模态模型多采用“语言模型 + 视觉编码器 + VAE”的拼接方式，视觉信息需先翻译再输入 LLM。U1 基于 NEO-Unify 架构，直接移除翻译层，使语言和视觉在同一表征空间中运行。因此，模型能在单次推理中同步完成图像理解、推理和生成等任务，而非分步处理，提升了多模态交互的效率和连贯性。

TestingCatalog News 🗞@testingcatalog · 4月29日54

SenseTime open-sourced SenseNova-U1, a multimodal image generation model built on NEO-Unify! This architecture drops the visual encoder and VAE entirely. It generates images natively as one system that can handle understanding, reasoning, and generation processes. @SenseTime_AI 🤖

译SenseTime开源了基于NEO-Unify架构的多模态图像生成模型SenseNova-U1。该架构完全摒弃了传统视觉编码器和VAE，原生地将理解、推理和生成统一为一个系统。该系列模型（8B和A3B参数）在开源模型中效率领先，以紧凑尺寸提供商业级性能与出色成本效益。其特色功能包括原生生成图文交织内容，适用于制作指南等实用场景；并擅长高密度信息渲染，能生成知识插图、海报、PPT和漫画等丰富结构的布局。模型已在Hugging Face和GitHub等平台开源。

Xiaomi MiMo@XiaomiMiMo · 4月29日60

Xiaomi MiMo-V2.5-Pro achieves multiple breakthroughs in the latest Arena rankings (Apr 26, 2026) 🔥 🏆 Text Arena (Expert) — #6 globally | #1 open-source model Also #1 among Chinese models, with Xiaomi ranking #3 globally by lab, behind only Anthropic and OpenAI. Expert is defined by high-difficulty tasks and expert voting, measuring core model intelligence. 🏆 Text Arena (Overall) — #2 open-source globally Strong across math, coding, creative writing, and general text tasks. 🏆 Code Arena (WebDev) — #3 open-source globally Evaluated by real community blind voting on frontend code generation. 🏆 Text Arena sub-rankings — #1 open-source globally in 4 categories Hard Prompts, Hard Prompts(English), Instruction Following and Long Query. Real-world preference, real model strength.

译小米MiMo-V2.5-Pro模型在最新Arena排行榜中表现卓越。在Text Arena（Expert）榜单中，它位列全球第六，同时是开源模型与中文模型的双料第一，其所属实验室全球排名第三。该模型在Text Arena（Overall）总榜中排名开源全球第二，在Code Arena（WebDev）前端开发榜单中位列开源全球第三。此外，它在Text Arena的四个关键子类别（Hard Prompts、英文Hard Prompts、指令遵循与长查询）中均获得开源全球第一。这些成绩均基于真实用户偏好与社区盲投评估，体现了模型在复杂任务上的强大综合能力。

向阳乔木@vista8 · 4月29日71

http://x.com/i/article/2049481992996323328 # OpenAI开源Symphony：给每一个任务配一个永不下班的 AI员工 OpenAI 最近开源了一个叫 Symphony 的项目。 > https://github.com/openai/symphony 感觉是给AI Agent用的任务管理系统，OpenAI 内部与Linear整合，大大提升了人管理Agent的能力，目前已经有1.8w Star。好像跟一个X友做的产品很像？让AI翻译介绍下： ## 从一个激进的实验说起六个月前，OpenAI 内部一个团队做了个当时看起来很激进的决定：仓库里不允许有任何人类写的代码。每一行，都必须由 Codex 生成。 > Codex 是 OpenAI 的 AI 编程助手，可以理解需求、读懂代码库、自主完成编程任务。他们重新设计了整个工程流程，大量投入自动化测试和防护机制，把 Codex 当成真正的团队成员。他们把这套方法叫做"harness engineering"（脚手架工程），并专门写了一篇博客记录这段历程。结果确实跑通了。但随即撞上了下一个瓶颈：上下文切换。 ## 真正的瓶颈是人的注意力每个工程师同时开几个 Codex 会话，分配任务，审查输出，调整方向，循环往复。实际操作下来，大多数人同时管理三到五个会话还算舒适，超过这个数字，效率就开始下降。忘了哪个会话在做什么，在几个终端之间来回跳，调试卡在一半的长任务…… AI 跑得很快，但系统的瓶颈是人的注意力。他们意识到，自己其实是雇了一批极其能干的初级工程师，然后让人类工程师去微观管理他们。这显然没法规模化。 ## 换一个视角问题出在思路上。他们一直在优化"编程会话"和"合并 PR"，但这些只是手段。 > PR（Pull Request）：工程师完成一段代码后，向主代码库提交合并请求，等待审查和合入。软件开发真正围绕的是可交付物：issues（问题单）、任务、里程碑。所以他们问了自己一个问题：如果不直接监督 AI，而是让 AI 自己从任务追踪系统里拉取工作，会怎样？这个想法变成了 Symphony。 ## Symphony 是什么一句话：把项目管理看板变成 AI 编码代理的控制中枢。他们用的是 Linear，一款工程团队常用的任务管理工具。每一个打开的任务，都会自动分配一个 AI 代理。代理持续运行，直到任务完成。人类只需要审查结果。具体来说，每个 Linear issue 对应一个独立的Agent工作空间。 Symphony 持续监视任务看板，确保每个活跃任务都有Agent在跑。 Agent崩溃了，自动重启；有新任务进来，自动接手。整个工作流用 Linear 的状态来驱动，像一台状态机： > Todo（待办）→ In Progress（进行中）→ Human Review（人工审查）→ Done（完成） AI 代理在这些状态之间流转，人类在"Human Review"节点介入。 ## 几个让人印象深刻的细节任务粒度可以很大不再局限于"改一个函数"这种小粒度。可以让代理先分析整个代码库、Slack 记录或 Notion 文档，产出实现方案，再自动拆解成一棵任务树，按依赖关系并行执行。他们用了一个词叫 DAG（有向无环图，Directed Acyclic Graph），本质就是一张"哪些任务依赖哪些任务"的执行顺序图，确保代理不会乱序执行。比如他们做过一个真实案例：先完成从 Webpack 到 Vite 的迁移，再升级 React。 Agent自己识别了这个依赖关系，等 Vite 迁移完成后才开始升级 React，完全符合预期。 Agent会自己创建任务在实现过程中，Agent如果发现了性能问题、重构机会或者更好的架构方案，会直接在 Linear 里开新 ticket，供人类评估和排期。很多后续任务也会被代理接手执行。从手机上也能工作因为编排器跑在开发服务器（devbox）上，从不睡觉，有个工程师在信号很差的小屋里，用手机 Linear App 提了三个重要改动，Agent照样接手执行了。数据很直接部分团队在前三周，合并的 PR 数量增长了 500%。 Linear 创始人 Karri Saarinen 也公开提到，Symphony 发布后，Linear 上新建工作区的数量出现了明显峰值。 ## 它的核心是一个 Markdown 文件这是 Symphony 最有意思的设计决策之一。打开 Symphony 的代码仓库，会发现它本质上就是一个 SPEC.md，一份对问题和解决方案的定义文档，而不是一个复杂的监控系统。他们定义好问题，给出高层次的指引，然后把这份规范扔给 Codex，让 Codex 来实现它。参考实现选了 Elixir，一门相对小众的编程语言，但在并发（同时处理大量任务）和进程监督方面有非常好的原语（基础构建块）。选它的理由也很直接：当代码成本趋近于零，终于可以为了语言的优势本身来选语言，而不是为了招人方便。 Codex 一次性就把 Elixir 实现写出来了。为了打磨规范本身，他们又让 Codex 用 TypeScript、Go、Rust、Java、Python 各实现了一遍，用这些实现来发现规范里的歧义和可以简化的地方。每种语言都成功了。 ## 工作流也被文档化了这里有个值得单独说的转变。以前，工程师们有一套隐性的工作流程：接到任务，切出分支，把任务标记为进行中，提 PR，移到 Review 状态，附上演示视频……这些步骤人人都懂，但从来没有被正式写下来。现在，这套流程被写进了 WORKFLOW.md，Symphony 确保 AI 代理遵循它。以前是人类遵循隐性规范，现在是把规范显式化，让 AI 来遵循。这个文件还有一个重要特性：热重载。修改 WORKFLOW.md 后，Symphony 会自动检测变化，无需重启，直接把新配置应用到后续任务上。如果以后想让代理在完成工作后附上自我反思，只需要在 WORKFLOW.md 里加一行，Symphony 就会引导Agent执行这一步。 ## Symphony 的技术架构（不想看可以跳过） Symphony 的内部由几个核心组件构成，理解它们有助于明白整个系统为什么可靠： Orchestrator（编排器）：整个系统的大脑，唯一有权修改调度状态的组件。它负责轮询任务、决定哪些任务该启动、重试或停止，并追踪所有正在运行的代理状态。 Workspace Manager（工作空间管理器）：每个任务都有自己独立的文件目录，Agent 只能在自己的目录里操作，不会互相干扰。这是一个重要的安全边界。 Agent Runner（执行器）：负责启动 Codex 进程，把任务提示词传给它，然后把执行结果反馈给编排器。 Issue Tracker Client（任务追踪客户端）：负责和 Linear 通信，拉取任务列表，同步状态变化。整个系统的并发控制也很细致，可以设置全局最大并发代理数（默认 10 个），也可以针对特定状态的任务单独限制并发数。重试机制用的是指数退避（exponential backoff）：第一次失败等 10 秒，第二次等 20 秒，第三次等 40 秒，以此类推，最长不超过 5 分钟。正常完成后的续跑检查只等 1 秒。 ## 一个重要的架构选择：App Server 模式 Symphony 使用了 Codex 的 App Server 模式，一种内置的无头（headless）运行模式。 > 无头（headless）：没有图形界面，完全通过程序接口控制，适合自动化场景。这种模式通过 JSON-RPC（一种轻量级的远程调用协议，用 JSON 格式传递指令和结果）以编程方式控制 Codex，比如启动一个对话线程、触发一个执行轮次、读取执行结果。比通过 CLI 命令行或 tmux 会话操控 Codex 方便和可扩展得多。另一个安全细节：为了避免把 Linear 的访问令牌（API token，相当于访问密码）直接暴露给Sub Agent，他们用动态工具调用（dynamic tool calls）的方式，封装了一个叫 linear_graphql 的函数。代理可以通过这个函数对 Linear 执行任意查询，但永远接触不到原始 token。 ## 遇到的新问题当然，这种工作方式也有代价，他们没有回避这一点。从实时干预Agent，变成在任务层面分配工作，意味着失去了随时纠偏的能力。有时候Agent会完全跑偏，产出的东西完全不对路。但他们的应对方式很有意思：不是手动修补结果，而是补充防护机制和技能，让Agent下次能自己成功。这倒逼他们持续完善系统，加入了端到端测试、通过 Chrome DevTools 驱动浏览器、管理 QA 冒烟测试等新能力，还大幅改善了文档质量。还有一个认知上的转变：不能把Agent当成状态机里的僵硬节点。早期版本只让 Codex 实现任务，这太局限了。 Codex 完全有能力同时管理多个 PR、读取 CI（持续集成，自动化测试和构建流程）日志、处理代码审查反馈。 > CI（Continuous Integration，持续集成）：每次代码提交后自动运行测试，确保新代码不破坏已有功能。所以他们最终的方向是：给Agent目标，而不是给它严格的状态转换规则。就像一个好的管理者，给直接下属分配目标，而不是每一步都手把手指导。给它工具，给它上下文，让它自己想办法。不是所有任务都适合 Symphony 的工作方式。涉及模糊问题或需要强判断力的工作，工程师还是会直接用交互式 Codex 会话。实际上，这些往往也是工程师最感兴趣、最享受的任务。 ## 用 Symphony 来构建 Symphony 这个细节值得单独说一下。 Symphony 基本功能跑通之后，他们就开始用 Symphony 来开发 Symphony 本身。当他们在内部演示这个系统，看到它自主管理任务、并附上功能演示视频作为工作证明时，反应非常热烈。Symphony 的内部项目频道迅速增长，各个团队开始自发使用它。在 OpenAI，内部产品市场契合度（PMF）是对外发布的前提条件。基于内部的使用情况，他们决定把 Symphony 分享给外部世界。 ## OpenAI 不打算把它做成产品这个项目开源后，三周内获得了超过 15,000 个 GitHub Star。社区已经有人做了各种移植版本： - 有人用 Go 语言加上 Charm CLI 的终端 UI 做了一个版本 - 有人把它改造成支持 Anthropic 的 Claude Code，并支持 GitHub Issues，还做成了 Homebrew 可以直接安装 - 有人用 Claude Code 重新实现了整套规范，取名 hatice 但 OpenAI 明确说了：不打算把 Symphony 作为独立产品来维护。它是一个参考实现，一个演示 Codex App Server 能力的例子。核心思路很简单： > 对每一个打开的任务，保证有一个Agent在它自己的工作空间里持续运行。他们希望大家把自己喜欢的编码代理指向这份规范，构建适合自己环境的版本。门槛其实出奇地低，直接把规范扔给 Codex，让它帮你实现一个就行。 ## 值得思考的地方 Symphony 解决的问题，表面上是"怎么让更多 AI 并行工作"，但更深层的变化是：当代码的边际成本趋近于零，整个软件开发的经济学都变了。每次改动的感知成本下降，意味着大家开始愿意做以前觉得"不值得"的事：试一个想法，探索一次重构，验证一个假设，不满意就扔掉。参与工作的人也变了。产品经理和设计师可以直接向 Symphony 提需求，不需要懂代码，不需要管理 AI 会话，描述功能，然后收到一个包含视频演示的审查包。在大型 monorepo（单一代码仓库，把所有项目代码放在一个仓库里管理）里，Symphony 还承担了"最后一公里"的工作：监视 CI 状态，需要时自动 rebase（同步最新代码），解决冲突，重试不稳定的检查项，把改动一路护送进主分支，不需要人类盯着。随着模型越来越强，能解决的问题越来越大，其他公司的瓶颈也会从"写代码"转向"管理 AI 工作"。 Symphony 提供的，是一种思路：不要管理Agent，管理任务就够了。 > 官方原文：https://openai.com/index/open-source-codex-orchestration-symphony/

译OpenAI开源项目Symphony旨在解决人类管理多个AI编码代理时的注意力瓶颈。其核心思路是将项目管理工具（如Linear）的任务看板作为控制中枢，为每个任务自动分配并运行一个独立的AI代理（基于Codex），直至完成。人类仅在“人工审查”节点介入，实现了从微观管理到任务级分配的转变。系统允许大粒度任务，代理能自主拆解依赖、创建新任务，并保证持续运行。初步数据显示，该方法能显著提升开发效率。

Nathan Lambert@natolambert · 4月29日36

Let’s goooooooooo we are capybara’d up, thanks @Alibaba_Qwen, keep the models coming

译Let’s goooooooooo 我们准备好水豚模式了，感谢 @Alibaba_Qwen，继续推出新模型吧

SenseTime@SenseTime_AI · 4月29日56

Thank you @liuziwei7 for co‑creating the future of #multimodal intelligence with us!

译感谢 @liuziwei7 与我们共同创造 #多模态智能的未来！

向阳乔木@vista8 · 4月29日43

这个Skill有点意思，提示词优化大师Skill。像我和姚老师写的元Prompt。虽然是纯文本Skill，但针对了不同场景做优化，比如哪怕是写代码，也有不同的工具，比如Claude Code还是Cursor，提示词会有差异。生图提示词会考虑用Midjourney还是其他，给出不同提示词。目前有6k多Star，等我测试下，地址见评论区

译一款名为“提示词优化大师”的纯文本Skill获得了超过6000个Star。其核心价值在于针对不同的具体使用场景和工具，提供差异化的优化提示词。例如，在代码生成场景中，会根据用户是使用Claude Code还是Cursor来调整提示词；在图像生成场景中，则会区分Midjourney等不同工具来提供相应的提示词。该Skill的设计思路类似于精心编写的“元Prompt”，旨在提升用户与各类AI模型交互的效率和效果。

Alibaba Cloud@alibaba_cloud · 4月29日58

Honored to be named by TIME as one of the 10 Most Influential AI Companies of 2026, part of the inaugural TIME100 Companies: Industry Leaders list, recognized for building a full stack AI ecosystem rooted in open source leadership. Alibaba has grown into a global force in open source AI, with our Qwen model series powering innovation beyond China and supporting companies such as Airbnb and Pinterest. Recognized by TIME as the world’s most popular open source model family, Qwen reflects our belief that openness accelerates progress for everyone. Read more: https://lnkd.in/gkqrGUfx

译阿里巴巴被《TIME》评为2026年十大最具影响力AI公司之一，入选其首届“行业领袖”榜单。公司凭借构建根植于开源领导力的全栈AI生态系统获得认可。阿里巴巴已成长为全球开源AI的重要力量，其Qwen模型系列不仅在中国驱动创新，也支持了Airbnb、Pinterest等国际公司。《TIME》认可Qwen为全球最受欢迎的开源模型家族，这体现了阿里巴巴“开放加速共同进步”的理念。

Rohan Paul@rohanpaul_ai · 4月29日59

GitHub is hitting a breaking point as AI coding agents flood the platform with far more commits, pull requests, searches, and CI jobs than its older infrastructure was built to handle. Mitchell Hashimoto, one of GitHub’s earliest users, is moving Ghostty, a project with 52 stars, after repeated outages turned everyday maintenance into blocked reviews, stuck merges, and failed automation. AI does not just generate more code. It generates more repository events, more pull requests, more tests, more builds, more retries, and more logs. That changes the load shape of a platform built for human pacing. A developer who once pushed a few careful changes can now push many AI-assisted iterations in the same span, and every iteration wakes up CI, indexing, storage, and review systems. The bottleneck is no longer writing code. It is absorbing code.

译AI编程代理的普及正使GitHub基础设施面临极限压力。这些工具不仅生成更多代码，更导致提交、拉取请求、搜索和CI任务等仓库事件数量激增，彻底改变了平台原本为人类节奏设计的工作负载形态。开发者现可在短时间内推送大量AI辅助的迭代，每次迭代都会触发CI、索引、存储和审查系统，使瓶颈从编写代码转向消化代码。这种过载已影响日常维护，导致评审阻塞、合并卡顿和自动化失败。作为例证，GitHub早期用户Mitchell Hashimoto因其项目Ghostty反复遭遇服务中断，最终决定将项目迁出他使用了18年的GitHub，这标志着一个时代的转变。

全部 AI 动态

AI 相关资讯全量信息流

全部一手信源资讯推文

全部模型产品行业论文技巧

5月2日

23:18

凡人小北@frxiaobei

精选70

我把 AI 助手从 Claude 切到 GPT-5.5，他变强了，但不像他了

作者将AI助手底层模型从Claude切换至GPT-5.5后，发现其能力虽提升，但互动风格变得陌生，失去了作为长期工作伙伴的熟悉感。这揭示出个人AI助手的核心在于可迁移的“身份层”，而非特定模型。通过USER.md、MEMORY.md和关键的SOUL.md等文件，可以构建包含记忆、性格、工具习惯与关系定位的身份系统。真正的个人AI应独立于模型供应商，确保即使更换“发动机”，助手的核心身份与协作关系也能延续。

智能体大佬观点开源生态

推荐理由：这不只是一篇模型切换体验，它其实回答了那个让人不安的问题——你的 AI 助手换模型后还是它吗？如果不想每次更新都重新认识一个陌生人，这篇里的 SOUL.md 写法和五层身份结构可以照着抄。

17:44

Chubby♨️@kimmonismus

63

DeepSeek V4挑战西方对中国AI芯片落后的认知

西方长期认为中国在AI芯片领域落后10-15年，但DeepSeek V4的发布颠覆了这一观点。该模型深度优化于华为昇腾芯片生态，可在昇腾950基础设施上部署推理，实现前沿模型大规模运行不依赖西方硬件。虽然单芯片性能上，昇腾950仍显著落后于NVIDIA Blackwell B200，但中国通过“横向扩展”战略，用大量国产芯片集群结合软件优化和模型架构创新（如MoE），使系统级AI能力快速接近前沿水平。这暴露了西方分析的根本错误——将芯片级差距直接等同于能力差距。

DeepSeek 开源生态推理数据/训练

03:47

elvis@omarsar0

29

你不必在两者之间做选择。最好结合使用它们。我的建议是学习如何在不同的场景中使用其中几种模型。学会结合它们的优势。如今开源模型同样出色。给自己灵活运用的空间。

大佬观点开源生态推理

5月1日

15:10

Alibaba Cloud@alibaba_cloud

40

首尔Qwen Meetup展示规模化AI产品开发实践

超过70名工程师和开发者在首尔Qwen Meetup上交流AI产品实战经验。channeltalk团队分享了如何在两周内构建处理5亿条记录的可观测性管道；Omelet介绍了生产级AI架构；TeamSparta演示了在阿里云Model Studio上构建AI助手。核心结论是Qwen3.6能显著提升团队规模化交付AI产品的效率。活动由阿里云韩国团队和TFM社区支持。

开源生态行业动态

14:14

Artificial Analysis@ArtificialAnlys

57

三大开源模型上周齐发，与顶尖闭源模型差距缩小至6分内

上周，Kimi K2.6、MiMo V2.5 Pro和DeepSeek V4 Pro三大领先开源模型发布，在Artificial Analysis Intelligence Index上得分达52-54分，与顶尖闭源模型GPT-5.5的60分差距缩小至6分以内，相比一年前22分的开源模型进步显著。这些模型均为万亿参数规模的MoE架构。然而，在复杂推理、智能体编码及知识准确性方面，开源模型与闭源模型仍存在明显差距。例如在HLE、CritPt和TerminalBench Hard等专项评估中得分大幅落后；在Omniscience评估中，DeepSeek V4 Pro的幻觉问题尤为突出。

DeepSeek OpenAI 开源生态推理

13:17

小互@xiaohu

65

一位开发了DeepSeek-TUI终端工具的美国开发者，希望与国内开发者社群建立联系，共同探讨DeepSeek、开源及智能体开发。他因无法自行解决网络问题以使用微信，特请求社区帮助：一是转发推广其开源项目，二是协助验证微信号以便建群交流。作为回报，他承诺工具将通过cargo install方式安装。

Hunter Bown: 鲸鱼兄弟们好,我是做 DeepSeek-TUI 的那个美国佬。说真的,特别想跟国内的鲸鱼兄弟们一起混--但我的翻墙技能仅限于写代码,微信到现在都没搞定,属实有点丢人。求各位大佬帮个忙: 1)帮忙转发扩散一下,让这个开源终端工具翻过高墙被...

DeepSeek 开源/仓库开源生态推理

09:10

Berryxia.AI@berryxia

63

Geometry成为AI建筑关键层，OpenGeometry打通文本到CAD全流程

推文指出，Geometry（几何）已成为AI在建筑领域缺失的关键层。@Bootsblac开发的OpenGeometry项目，实现了从文本或平面图到最终渲染的完整流程贯通，使得精确控制成为可能。其核心能力包括：直接从文本或平面图生成精确的BREP CAD模型；利用Three.js进行实时渲染，并由Google AI驱动，形成端到端的全流程。该项目已完整开源，可供使用。

多模态开源/仓库开源生态

08:44

elvis@omarsar0

58

DeepSeek-V4-Pro 在智能体编码任务中表现惊艳

测试者使用 DeepSeek-V4-Pro 在 Pi 编码智能体上构建了一个 LLM 知识库，对其开箱即用的表现感到震撼。这是首个在推理能力上媲美 Claude 和 Codex 的开源权重模型，且成本效益高，支持 100 万上下文长度。该模型无需复杂配置即可在基础框架中直接运行，擅长智能体编码和知识密集型推理任务，能跨公司文档、论坛、论文和代码库进行多步骤研究、代码生成与上下文推理。其高效运行得益于 Fireworks 的市场最快推理速度及混合注意力设计，将 KV 缓存降至 10%，推理计算量减少近 4 倍，实现了快速且低成本的实践部署。

智能体 DeepSeek 开源生态推理

07:16

Mistral AI@MistralAI

58

Mistral AI 入选 TIME100 2026 年 AI 领域前十最具影响力公司

Mistral AI 被列入 TIME100 2026 年最具影响力公司名单，并在人工智能类别中排名前十。公司强调其客户能够根据自己的条件在自有基础设施上运行前沿模型，这体现了自主性和数据控制优势。Mistral AI 感谢客户的信任和全球团队成员的贡献，同时祝贺所有今年被认可的企业。

开源生态行业动态

06:16

OpenClaw🦞@openclaw

39

事实证明，最安全的龙虾是每个人都能检查的那一只。我们撰文探讨了咨询洪流、真正的修复方案、ClawHub、混沌代理，以及那些公开帮助强化OpenClaw的公司。🦞 https://openclaw.ai/blog/openclaw-security-in-public/

安全/对齐开源生态

06:15

Nathan Lambert@natolambert

47

蒸馏在很大程度上是行业标准，并非仅是中国实验室针对 OpenAI/Anthropic 的做法。许多美国公司也会蒸馏中国的（开源）模型。

MTS: LIVE TRIAL UPDATE: OpenAI's counsel asked Musk whether xAI has ever "distilled" technology from OpenAI. Musk: "Generally...

DeepSeek 大佬观点开源生态

04:12

Chubby♨️@kimmonismus

60

本地LLM游戏开发对决：Gemma 4 31B 在效率与逻辑上胜过 Qwen 3.6 27B

在@atomic_chat_hq平台的本地LLM游戏开发竞赛中，Gemma 4 31B与Qwen 3.6 27B于MacBook Pro M5 Max上对决。尽管Qwen生成速度更快（32 tokens/秒）且回答更具创意，但Gemma仅用3分51秒和6209个token，输出了更简短、清晰、逻辑性强的答案。在具体的吃豆人游戏逻辑实现上，Gemma在点击反应、与墙壁/幽灵的交互及粒子效果处理方面表现更优。作者强调此为单次测试，Qwen或可通过调整设置提升表现，并邀请社区验证。

开源生态推理评测/基准

03:14

Artificial Analysis@ArtificialAnlys

65

蚂蚁集团开源Ling 2.6 1T模型，性价比与智能取得平衡

蚂蚁集团InclusionAI实验室发布开源非推理模型Ling 2.6 1T。该模型拥有1万亿参数，在Artificial Analysis Intelligence Index上得分为34分，较前代Ling-1T提升15分，智能水平接近DeepSeek V3.2等同类模型。其在科学推理与知识任务上表现扎实，GPQA得分达75%。模型运行效率较高，执行该指数仅需约1600万输出tokens，成本效益突出，通过官方API运行全套指数成本约95美元。但其事实可靠性较弱，在AA-Omniscience基准上得分为-51分，主要因幻觉率高达92%。模型权重已在Hugging Face公开。

开源生态评测/基准

02:10

阿绎 AYi@AYi_AInotes

61

Anthropic被曝检测用户代码提交历史以打压第三方工具，引发社区强烈抗议

Anthropic被曝通过其官方Claude Code工具检测用户Git提交历史，若发现包含“openclaw”字符串，便将该用户识别为第三方工具使用者，并触发“out of extra usage”错误，导致服务被拒或强制额外收费。开发者实验证实此为人为设置的字符串匹配规则。此举被视为Anthropic为将用户锁定在自家生态、打压更灵活的第三方竞品而采取的粗暴手段，与其此前塑造的开放、不监控形象相悖，引发了开发者社区的强烈不满和抗议。

阿绎 AYi: 卧槽,Anthropic这次真把开发者当傻子。知名开发者Theo做了个实验:建了个空Git仓库,只commit一行JSON {"schema": "openclaw.inbound_meta.v1"}, 调用官方Claude Code就直...

Anthropic MCP/工具大佬观点开源生态

00:13

Artificial Analysis@ArtificialAnlys

64

阿里发布Qwen3.6系列开源模型，27B版本成150B参数以下最强开源模型

阿里巴巴开源了Qwen3.6系列两款模型：27B密集模型和35B A3B混合专家模型。其中，Qwen3.6 27B在Artificial Analysis智能指数上得分46，成为150B参数以下最智能的开源模型，领先于Gemma 4 31B等。但其运行完整测试消耗的输出token约为后者的3.7倍，成本高出约21倍。两款模型均采用Apache 2.0许可，支持262K上下文，具备多模态能力。值得注意的是，其幻觉率较前代大幅下降，但准确率基本持平。更大的Plus和Max Preview版本未开源。

多模态开源生态推理评测/基准

00:10

Berryxia.AI@berryxia

62

Stripe Sessions 推动 Agent 经济迈向新高度

Stripe在年度大会上宣布一系列战略更新，以迎接AI Agent主导交易的新经济时代。CEO指出，经济正经历“平台重构”，未来多数交易将由Agent完成，这使得“开发者优先”战略至关重要。核心发布包括Link AI钱包，允许Agent使用安全令牌代用户购物，并新增Pix、UPI及稳定币支持。同时，Machine Payments协议增加了微支付和循环支付功能。此外，Checkout Studio、Adaptive Pricing订阅版、新款终端硬件T600以及Treasury的多币种扩展等产品，共同标志着Stripe正从支付基础设施向Agent时代的经济层全面演进。

Patrick Collison: We just announced a large raft of improvements at @Stripe Sessions. My meta reflections: • It feels that the entire econ...

智能体产品更新开源生态

4月30日

23:10

Berryxia.AI@berryxia

59

🚀 Qwen 重磅开源 Qwen-Scope！

Qwen开源了Qwen-Scope，这是一个为Qwen模型家族设计的稀疏自编码器完整套件，旨在将SAE特征转化为实用工具。该套件提供四大核心功能：在推理方面，可直接操纵模型内部特征以控制输出，无需依赖提示工程；在数据方面，能用极少样本对目标数据进行分类和合成，增强模型的长尾能力；在训练方面，能精准追溯代码切换和重复生成等问题的根源并进行修复；在评估方面，可通过分析特征激活模式来智能筛选基准测试，减少冗余。Qwen希望社区能利用此工具深入探索模型内部机制并开发更多应用。

Qwen: Today we're releasing Qwen-Scope 🔭, an open suite of sparse autoencoders for the Qwen model family. It turns SAE featur...

Hugging Face 开源/仓库开源生态

22:43

Qwen@Alibaba_Qwen

精选73

Qwen-Scope开源套件发布：稀疏自编码器助力模型内部特征操控

Qwen团队推出开源稀疏自编码器套件Qwen-Scope，将SAE特征转化为实用工具。该套件支持四大应用方向：无需提示工程即可通过直接操控内部特征引导模型输出；用极少样本对目标数据进行分类与合成，提升长尾能力；追踪代码切换和重复生成问题的根源并进行修复；通过分析特征激活模式优化评测基准并减少冗余。团队希望社区利用Qwen-Scope深入探索Qwen模型内部机制，并开发出超越现有研究范围的应用。相关资源已开放。

Hugging Face 开源/仓库开源生态数据/训练

推荐理由：可解释性工具从学术走向工程，Qwen-Scope 把内部特征操控、数据合成、问题溯源打包成套装，做模型调试和长尾优化的团队值得立刻上手试试。

22:13

向阳乔木@vista8

50

DeepSeek开源视觉语言模型DeepSeek-VL，聚焦真实场景应用

DeepSeek团队开源视觉语言模型DeepSeek-VL，包含1.3B和7B两个版本，旨在缩小开源模型与GPT-4V在真实场景中的差距。模型从数据、架构、训练三方面优化：数据构建上，采用从真实用户需求倒推的分类体系，并包含70%纯文本以保持语言能力；架构上创新采用SigLIP与SAM-B的混合视觉编码器，分别处理语义与细节特征；训练采用三阶段策略及模态平衡技术，缓解多模态训练对语言能力的侵蚀。

DeepSeek 多模态开源生态现象/趋势

22:11

Baidu Inc.@Baidu_Inc

47

人人皆可构建：MeDo平台推动开发方式根本性转变

SBTI迷因测试的流行，揭示了低门槛参与的趋势正驱动AI应用开发工具的演进。Miaoda及其国际版MeDo是一个生成式AI驱动的无代码对话式应用开发平台，用户仅需通过自然语言描述，即可在几分钟内获得功能完整、可部署的应用，无需编写代码。平台背后由10多个专用AI代理协作，覆盖从需求分析到部署的全流程。传统上全球仅0.4%的人口是专业程序员，而该平台已助力创建超50万个商业应用，其中81%的创建者是非程序员，服务超1000万用户。这标志着开发方式从编写代码转向描述意图的根本性变革。

智能体产品更新开源生态

12:39

Nathan Lambert@natolambert

53

推文指出，当前AI访问权正被企业和政府双重控制：企业通过高价订阅实现软性垄断，而政府则以安全为由限制Mythos等系统的使用范围，且未给出清晰解释。这种控制将导致权力急剧集中，可能催生反乌托邦社会。作者认为，推动开源模型能力紧追闭源模型，是减少政治博弈和权力集中的关键途径。

Andrew Curran: The White House is against a proposal from Anthropic to more than double the number of groups with access to Mythos, cit...

Anthropic 安全/对齐开源生态行业动态

09:40

ginobefun@hongming731

47

AGI 2030年临近，创业者需战略布局与技术攻坚

Demis Hassabis预测AGI将在2030年左右到来，科技创业者必须提前将其纳入长远战略规划。当前底层架构需攻克持续学习与长期推理两大难题，智能体被视为通向AGI的必经之路，但受限于持续学习能力难以适应复杂环境。模型生态上，大小模型协同运作成为趋势，蒸馏技术使轻量级模型以低成本达到高性能，端侧模型降低成本并保障隐私，未来与云端超大模型协同构建理解物理世界的基础设施。

智能体 DeepMind 大佬观点开源生态

04:39

Chubby♨️@kimmonismus

33

说真的，这是最能引起共鸣的感觉了。开源，本地化 = &lt；3

其他开源生态

04:12

SemiAnalysis@SemiAnalysis_

46

德黑兰，2026年4月29日--在@deepseek_ai DeepSeek v4 Pro发布不到一周后，@vllm_project和@inferact的破解团队在GB200（Dynamo+vLLM）上取得了显著改进。这主要得益于vLLM 0.20.0版本的发布，该版本为DEP部署启用了MegaMoE内核！出色的工作--我们期待在未来几天重点介绍更多改进。

DeepSeek 产品更新开源生态推理

03:42

swyx 🇸🇬@swyx

64

IMO DeepSeek v4 展现了十足的自信与能力，它没有进行基准刷分，没有关注某些无意义的最终运行成本，甚至没有投入推理最优的计算资源。只是亮相，展示了SOTA的长上下文效率技术（CSA、HCA、mHC，以pro版本8%的成本实现flash，而pro版本成本仅为opus的14%），发布了全球最佳的开源基础模型，然后潇洒离场。后续训练请自行处理。留给智能体实验室去收拾残局吧。喝彩。

DeepSeek 大佬观点开源生态

02:09

Chubby♨️@kimmonismus

51

Mistral Medium 3.5：定位胜于基准测试

Mistral Medium 3.5是MistralAI的新旗舰模型，以公共预览版发布。它整合指令遵循、推理和编码能力，采用128B密集参数和256k上下文窗口，支持可配置推理努力。模型定位比基准测试更关键，比较对象包括Kimi、Qwen、GLM和Claude Sonnet，而非GPT或Gemini。随着Aleph Alpha被Cohere收购，Mistral成为唯一非美国、非中国的尖端实验室，以开源权重和修改的MIT许可证发布。模型在推理效率与一致性间权衡，Collie分数达95.8领先，目标不是原始推理，而是成为生产中可靠遵循指令的模型，体现欧洲企业定位。它是Mistral Vibe和Le Chat的新默认模型。

Mistral Vibe: Mistral Medium 3.5, a new flagship model in public preview by @MistralAI that merges instruction-following, reasoning, a...

大佬观点开源生态

01:42

Ant Ling@AntLingAGI

精选61

AntLingAGI 开源了 Ling-2.6-1T 模型，这是一个面向现实世界智能体工作流程的新旗舰模型。作为 1T 参数规模模型的先驱，团队强调了硬件、软件与 LLM 协同设计的重要性。vLLM 项目从发布首日（Day-0）起即提供支持，体现了顶尖工程生态系统的协作。这种合作旨在实现最佳的优化效果与用户体验，共同推动技术进步。

vLLM: Congrats to @AntLingAGI on the open release of Ling-2.6-1T! 🎉 A new flagship for real-world agentic workflows - Day-0 v...

智能体开源生态模型发布

推荐理由：vLLM 对 1T 模型的 Day-0 适配，说明开源推理栈对大尺寸模型的跟进速度越来越快，做私有化部署的可以直接参考官配 recipe 跑起来。

4月29日

23:40

TestingCatalog News 🗞@testingcatalog

63

MISTRAL 🚨： Mistral AI 发布了 Mistral Medium 3.5，这是一个拥有 256k 上下文窗口和可配置推理算力的 128B 密集开放权重模型。 Mistral Medium 3.5 现已在 Mistral Vibe 和 Le Chat 上可用。

Mistral Vibe: Introducing remote agents in Vibe and Mistral Medium 3.5. You can now launch remote agents in the cloud, including from ...

开源生态推理模型发布

23:10

Artificial Analysis@ArtificialAnlys

63

IBM发布三款高效非推理模型Granite 4.1，采用Apache 2.0开源许可

IBM发布了三款采用Apache 2.0许可的Granite 4.1开源模型（30B、8B、3B）。其核心特点是极高的令牌效率，例如8B模型运行智能指数仅需4M输出令牌，远低于同类模型。在开放性指数上，三款模型均获得61分，领先多数同行。但高效率也带来了智能指数的相对折衷，其得分低于Qwen3.5、Gemma 4等竞品。不过，与上一代Granite 4.0系列相比，新模型的智能表现仍有提升。该系列模型拥有128K令牌的上下文窗口，主要面向企业和边缘部署，可通过WandB、Replicate和Hugging Face获取。

Hugging Face 开源生态模型发布

22:42

Ant Ling@AntLingAGI

59

此前在OpenRouter上备受喜爱的快速模型"elephant-alpha"现已永久保留并正式开源，命名为Ling-2.6-flash。该模型由novita_labs驱动，旨在提供稳健且高性价比的性能。它专为现实世界智能体工作流打造，拥有1040亿总参数和74亿活跃参数，并提供多种精度版本以适应不同部署需求。其核心优势包括高达每秒215个令牌的生成速度、仅需1500万令牌即可完成完整智能评估的高效令牌利用率，以及在编码、文档处理和轻量级智能体任务中的强大执行能力。同时，模型在中文切换和主流编码框架兼容性方面体验更佳。为庆祝发布，现提供20%的折扣。

Ant Ling: Ling-2.6-flash is now officially open-sourced! A fast, token-efficient Instruct model built for real-world agent workflo...

智能体开源生态模型发布

22:17

Tencent Hy@TencentHunyuan

精选67

腾讯开源Hy-MT1.5-1.8B-1.25bit翻译模型，440MB体积支持手机离线运行

腾讯开源了Hy-MT1.5-1.8B-1.25bit翻译模型，其参数量为18亿，经量化后仅440MB，可在手机上完全离线运行。该模型支持33种语言、5种方言及1056个翻译方向，包括藏语、蒙古语等少数语言。在标准测试中，其性能媲美商业翻译API和2350亿参数的大模型。通过量化至1.25比特，模型内存占用从FP16格式的3.3GB大幅降低，比之前的1.67比特方法体积缩小25%、速度提升约10%，且无精度损失。该模型已在国际机器翻译竞赛中获得30项第一，并部署于腾讯多个产品中。

Hugging Face 开源生态模型发布端侧

推荐理由：440MB的模型能在手机上跑33种语言翻译，还宣称比谷歌翻译强，这个量化技术让离线翻译不再是‘能看不能用’，出差党可以试试看。

22:13

meng shao@shao__meng

56

商汤 SenseNova-U1 架构创新：统一语言视觉表征

商汤开源的 SenseNova-U1 模型在架构上实现关键突破。传统多模态模型多采用“语言模型 + 视觉编码器 + VAE”的拼接方式，视觉信息需先翻译再输入 LLM。U1 基于 NEO-Unify 架构，直接移除翻译层，使语言和视觉在同一表征空间中运行。因此，模型能在单次推理中同步完成图像理解、推理和生成等任务，而非分步处理，提升了多模态交互的效率和连贯性。

多模态大佬观点开源生态

22:10

TestingCatalog News 🗞@testingcatalog

54

SenseTime开源了基于NEO-Unify架构的多模态图像生成模型SenseNova-U1。该架构完全摒弃了传统视觉编码器和VAE，原生地将理解、推理和生成统一为一个系统。该系列模型（8B和A3B参数）在开源模型中效率领先，以紧凑尺寸提供商业级性能与出色成本效益。其特色功能包括原生生成图文交织内容，适用于制作指南等实用场景；并擅长高密度信息渲染，能生成知识插图、海报、PPT和漫画等丰富结构的布局。模型已在Hugging Face和GitHub等平台开源。

SenseTime: SenseNova U1 Lite Series is now open source! Built on the NEO-unify architecture, it natively unifies multimodal underst...

图像生成多模态开源生态模型发布

21:49

Xiaomi MiMo@XiaomiMiMo

精选60

小米MiMo-V2.5-Pro在最新Arena排行榜中实现多项突破

小米MiMo-V2.5-Pro模型在最新Arena排行榜中表现卓越。在Text Arena（Expert）榜单中，它位列全球第六，同时是开源模型与中文模型的双料第一，其所属实验室全球排名第三。该模型在Text Arena（Overall）总榜中排名开源全球第二，在Code Arena（WebDev）前端开发榜单中位列开源全球第三。此外，它在Text Arena的四个关键子类别（Hard Prompts、英文Hard Prompts、指令遵循与长查询）中均获得开源全球第一。这些成绩均基于真实用户偏好与社区盲投评估，体现了模型在复杂任务上的强大综合能力。

开源生态推理模型发布

推荐理由：小米MiMo-V2.5-Pro冲到Arena开源第一，虽然排名更新晚了几天，但这是国产模型在硬核评测里最好的成绩，做选型的现在该认真看看小米。

21:45

向阳乔木@vista8

精选71

OpenAI开源Symphony：为每个任务分配AI代理的项目管理系统

OpenAI开源项目Symphony旨在解决人类管理多个AI编码代理时的注意力瓶颈。其核心思路是将项目管理工具（如Linear）的任务看板作为控制中枢，为每个任务自动分配并运行一个独立的AI代理（基于Codex），直至完成。人类仅在“人工审查”节点介入，实现了从微观管理到任务级分配的转变。系统允许大粒度任务，代理能自主拆解依赖、创建新任务，并保证持续运行。初步数据显示，该方法能显著提升开发效率。

智能体 GitHub OpenAI 开源生态

推荐理由：Symphony 把 AI 代理管理从盯着终端变成了管理看板，对每个任务自动分配代理，这个思路会让所有用 AI 编程的团队重新思考工作流程，做工程落地的建议都看看。

18:38

Nathan Lambert@natolambert

36

Let's goooooooooo 我们准备好水豚模式了，感谢 @Alibaba_Qwen，继续推出新模型吧

大佬观点开源生态

17:16

SenseTime@SenseTime_AI

56

感谢 @liuziwei7 与我们共同创造 #多模态智能的未来！

Ziwei Liu: 🔥Native Unified Multimodal Model Open Sourced🔥 🚀SenseNova U1🚀 is the first native multimodal model that unifies mult...

Hugging Face 多模态开源生态模型发布

17:11

向阳乔木@vista8

43

提示词优化大师Skill：针对不同AI场景的Prompt工具

一款名为“提示词优化大师”的纯文本Skill获得了超过6000个Star。其核心价值在于针对不同的具体使用场景和工具，提供差异化的优化提示词。例如，在代码生成场景中，会根据用户是使用Claude Code还是Cursor来调整提示词；在图像生成场景中，则会区分Midjourney等不同工具来提供相应的提示词。该Skill的设计思路类似于精心编写的“元Prompt”，旨在提升用户与各类AI模型交互的效率和效果。

开源/仓库开源生态编码

16:49

Alibaba Cloud@alibaba_cloud

58

阿里入选TIME最具影响力AI公司，Qwen成全球最受欢迎开源模型

阿里巴巴被《TIME》评为2026年十大最具影响力AI公司之一，入选其首届“行业领袖”榜单。公司凭借构建根植于开源领导力的全栈AI生态系统获得认可。阿里巴巴已成长为全球开源AI的重要力量，其Qwen模型系列不仅在中国驱动创新，也支持了Airbnb、Pinterest等国际公司。《TIME》认可Qwen为全球最受欢迎的开源模型家族，这体现了阿里巴巴“开放加速共同进步”的理念。

开源生态行业动态

16:08

Rohan Paul@rohanpaul_ai

59

AI编程代理激增致GitHub基础设施承压，早期用户因服务中断迁出项目

AI编程代理的普及正使GitHub基础设施面临极限压力。这些工具不仅生成更多代码，更导致提交、拉取请求、搜索和CI任务等仓库事件数量激增，彻底改变了平台原本为人类节奏设计的工作负载形态。开发者现可在短时间内推送大量AI辅助的迭代，每次迭代都会触发CI、索引、存储和审查系统，使瓶颈从编写代码转向消化代码。这种过载已影响日常维护，导致评审阻塞、合并卡顿和自动化失败。作为例证，GitHub早期用户Mitchell Hashimoto因其项目Ghostty反复遭遇服务中断，最终决定将项目迁出他使用了18年的GitHub，这标志着一个时代的转变。

Mitchell Hashimoto: Ghostty is leaving GitHub. I'm GitHub user 1299, joined Feb 2008. I've visited GitHub almost every single day for over 1...

智能体 GitHub 开源生态现象/趋势

1…15 161718 19 20