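Claude Sonnet 5, the strongest model in the Claude Sonnet line, is out! That's a lot of qualifiers, because it really isn't the strongest model overall, or even the strongest Claude; those two are still locked away 😂 Sonnet 4.6 < Sonnet 5 < Opus 4.8 < Fable 5 < GPT-5.6 Sol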

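Berryxia.AI@berryxia · 2 days ago · 68

I have to say, I actually found Sonnet 4.6 quite usable. Claude Sonnet 5 launched last night and replaces Sonnet 4.6 as a model that free users can use too. It is said to be not far behind Opus-class models in capability, and the price is indeed 40% cheaper.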

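宝玉@dotey · 2 days ago · 62

Anthropic today released Claude Science, an AI workbench for scientific researchers. Its positioning is clear: be the Claude Code of scientific research. Last year Claude Code changed how programmers work, and Anthropic CEO Dario Amodei believes Claude Science can do the same for the life sciences. Given that Anthropic's annualized revenue has reached $42 billion and its valuation $965 billion, the ambition at least has the money behind it.

Claude Science is not a new model. It runs on the existing Claude models (including Opus 4.8), with no special training for biology. What it does is pull the research workflow into a single environment.

[1] What problem it solves

Anyone who has done computational biology knows the daily routine is hopping between tools: PubMed for literature, Jupyter for code, R for analysis, a cluster terminal for submitting compute jobs, and yet another program for viewing protein structures. Every database also has its own format and query style.

Claude Science puts all of this into one interface. A main AI agent acts as "project manager" and connects to more than 60 scientific databases spanning genomics, single-cell analysis, proteomics, structural biology, cheminformatics, and other fields. Ask a question in natural language and it calls the relevant specialist agents to query the different databases and aggregate the results. It can also delegate downward, spawning sub-agents for specific jobs or handing tasks to expert agents the user has created. A dedicated reviewer agent checks that citations and calculations are correct.

[2] Two practical features

The first is reproducibility. Every figure Claude Science generates comes with the complete code that produced it, the runtime environment, a natural-language description of how it was made, and the full conversation record. Months later you can still reconstruct the whole analysis. Adjusting a figure is easy too: say "change the Y axis to a log scale" or "remove the grid lines" in natural language and the agent edits the corresponding code.

The second is local computation. It can be installed on macOS or Linux, or connect over SSH to a lab's high-performance computing cluster. Data does not all have to go to the cloud: sensitive data can stay on local infrastructure, and only the context needed for each analysis step is sent to Claude. For heavy workloads it can use a Modal account to scale on demand to hundreds of GPUs.

[3] What early users say

Sean Whalen of the Gladstone Institutes used it to build a genome browser from scratch in a few days. Stephen Francis of UCSF said Claude Science found a laboratory viral contaminant in their RNA-seq data, a problem his team had been stuck on for almost a year. Jérôme Lecoq of the Allen Institute used it to build a multi-agent literature review system in which multiple sub-agents read thousands of papers, extract the key findings, and then generate a review following a narrative structure; his team used to need two years to write such a review.

The assessment from MIT's Iain Cheeseman may be the most direct: he said the tool lets him, as someone without a computational biology background, run analyses he previously could not do at all, and he finds himself taking research questions he has accumulated over the years and trying them in Claude Science.

[4] Competitive landscape

Anthropic is not the only company eyeing this direction. OpenAI launched GPT-Rosalind, a reasoning model built specifically for the life sciences, in April this year, and shipped a capability upgrade in early June. The two approaches differ: GPT-Rosalind is a specially trained domain model focused on biological reasoning itself; Claude Science leaves the model alone and changes the workflow, wrapping existing models in a research platform that integrates databases, compute resources, and collaborating agents.

GPT-Rosalind is currently available only as a research preview to US customers with enterprise agreements. Claude Science has a lower bar: paying users on Pro ($20/month) and above can use it. This reflects a shift in Anthropic's strategy, from simply selling model capability to owning the operating layer of specific industries, just as Claude Code became the operating layer of software development.

[5] How to use it

Claude Science enters public beta today on macOS and Linux and requires a Pro, Max, Team, or Enterprise subscription. Team and Enterprise users need an administrator to enable access. Active labs at academic institutions and nonprofit research organizations can apply for discounted seats on the Team plan.

Anthropic will also fund up to 50 Claude Science research projects with up to $30,000 in credits each, and Modal is providing up to $2,000 in compute on top of that. Applications close on July 15, results will be announced by July 31, and projects run from September 1 to December 1.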

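Summary: Anthropic has launched Claude Science, an AI workbench for the life sciences and related fields that brings literature search, code execution, database queries, and other research steps into one interface. It is built on existing Claude models (including Opus 4.8) with no biology-specific training; a main agent connects to 60+ scientific databases (genomics, proteomics, and others) and can spawn sub-agents to carry out tasks. Features include reproducibility (figures ship with their generating code and environment) and local computation (macOS/Linux or SSH to a cluster, with sensitive data kept local). Early user cases: the Gladstone Institutes built a genome browser in a few days; a UCSF team used it to find an RNA-seq viral contaminant it had been stuck on for a year; the Allen Institute cut a two-year review down to weeks. Unlike OpenAI's GPT-Rosalind, Claude Science focuses on workflow integration. Public beta starts today and requires Pro ($20/month) or above. Anthropic will fund up to 50 projects at up to $30,000 each; applications close July 15.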

Rohan Paul@rohanpaul_ai · 2 days ago · 65

Anthropic unveils 'Claude Science' for scientific research. Early users report 10 review drafts over 100 pages and germline analyses in one-tenth the time. It's a beta tool featuring code-traced artifacts and access to 60 scientific databases. The launch is part of Anthropic's life sciences and healthcare initiative, which the IPO-bound Anthropic has been developing since October 2025. The traditional scientific workflow forces scientists across databases, notebooks, R, terminals, viewers, and cluster queues. Each switch breaks context, adds manual checking, and makes results harder to reproduce months later. Claude Science tries to move that whole loop into one running research session. A coordinating agent can call specialist agents, lab skills, scientific databases, and compute resources. The app renders 3D proteins, genome tracks, chemical structures, figures, manuscripts, and underlying code. Every artifact includes its code, environment, plain-language method, and full message history. That makes verification less dependent on memory and more dependent on inspectable execution traces.
- Claude Science can submit jobs to lab HPC systems or Modal compute.
- It can scale analysis from 1 GPU to hundreds while datasets stay local.
- The reviewer agent checks calculations, references, and figures against their source code.

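Summary: Anthropic has launched Claude Science in beta. It integrates 60 scientific databases, supports code-traced artifacts (with environment, method, and full message history), and can render 3D proteins, genome tracks, chemical structures, and more. A coordinating agent can call specialist agents, lab skills, and compute resources (HPC or Modal), scaling analysis from 1 GPU to hundreds while data stays local. A built-in reviewer agent checks calculations, references, and figures against their source code. Early users report 10 review drafts of over 100 pages and germline analyses in one-tenth the time. The tool is part of the life sciences and healthcare initiative Anthropic has been developing since October 2025.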

Artificial Analysis@ArtificialAnlys · 2 days ago · 60

Claude Sonnet 5 achieves 53 on the Artificial Analysis Intelligence Index, but without promotional pricing will cost more per task than Opus 4.8 We supported @AnthropicAI to evaluate Claude Sonnet 5 ahead of release: with max effort it improves 6 points over Sonnet 4.6 to achieve the same Intelligence Index as GPT-5.5 with high reasoning, but remains behind Opus 4.7 and 4.8 Key takeaways: ➤ Claude Sonnet 5 is the #5 model on the Artificial Analysis Intelligence Index, only 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max) ➤ With max effort, Sonnet 5 works harder than previous Anthropic models: it used ~40% more output tokens per Intelligence Index task than Sonnet 4.6, and ~3x the agentic turns for our knowledge work evaluations AA-Briefcase and GDPval-AA. This behavior scales well with the ‘effort’ setting, with the max effort using around 6x more turns than low effort on GDPval-AA ➤ Claude Sonnet 5 costs more per task than Opus 4.8 before accounting for promotional pricing: Claude Sonnet 5 costs $2.29 per task on the Intelligence Index, a ~2x increase compared to Sonnet 4.6 and ~15% more than Claude Opus 4.8. This is driven entirely by increased token usage. Sonnet 5 retains the same $3/$15 per 1M input/output token pricing as Sonnet 4.6 (compared to $5/$25 for Opus 4.8), however Anthropic is offering a one-third reduction to $2/$10 until September 1. Our results use standard $3/$15 pricing ➤ Sonnet 5 matches or outperforms Opus 4.8 on agentic knowledge work tasks: on both AA-Briefcase and GDPval-AA, Claude Sonnet 5 sits just ahead of Opus 4.8, trailing only Claude Fable 5 (which is not currently generally available). 
These benchmarks test the ability of models to produce accurate and well-presented professional outputs using our open source reference agent harness, Stirrup ➤ For reasoning and knowledge-heavy tasks, Sonnet still sits behind its larger siblings: despite substantial gains across many evaluations, heavy reasoning and knowledge benchmarks still show Opus 4.8 ahead of Sonnet 5. On CritPt, a frontier physics reasoning benchmark developed by researchers at Argonne and UIUC, Sonnet 5 scores 17% - this is 14 points higher than its predecessor, but behind GLM-5.2, Claude Opus and Fable, and GPT-5.5 (xhigh and Pro) ➤ Sonnet 5 also showed significant improvements over Sonnet 4.6 on Terminal-Bench v2.1 (+9 points), Humanity’s Last Exam (+10 points), and SciCode (+7 points), with relatively flat scores elsewhere Other key model details: ➤ Context window of 1 million tokens (equivalent to Sonnet 4.6) ➤ Pricing of $3/$15 per 1M tokens of input/output (reduced to $2/$10 until September 1); cache pricing remains at a 25% premium for cache writes ($3.75 per million tokens) with 5-minute time to live, and 90% discount for cache hits ($0.3 per million tokens) ➤ Effort remains the recommended way of configuring model performance and latency. Sonnet 5 adds an additional ‘xhigh’ effort setting relative to Sonnet 4.6, matching the 5 effort levels available on Opus 4.8 (max, xhigh, high, medium, low)

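Summary: At max effort, Claude Sonnet 5 scores 53 on the Artificial Analysis Intelligence Index (#5 overall), 6 points above Sonnet 4.6 and level with GPT-5.5 (high), but 2-3 points behind GPT-5.5 (xhigh) and Opus 4.8 (max). At standard pricing it costs $2.29 per task, about 2x Sonnet 4.6 and about 15% more than Opus 4.8, mainly because it uses roughly 40% more output tokens and about 3x the agentic turns. Pricing is $3/$15 per 1M input/output tokens (promotional $2/$10 until September 1), the context window is 1M tokens, and an xhigh effort setting has been added. On the agentic knowledge work benchmarks AA-Briefcase and GDPval-AA it matches or beats Opus 4.8, while it still trails on reasoning benchmarks. Terminal-Bench v2.1 (+9), HLE (+10), and SciCode (+7) show significant gains.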
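
The pricing claim in this post can be checked with a little arithmetic. The sketch below uses only the list prices and the "about 15% more than Opus 4.8" figure quoted above; the function names are illustrative, and the implied token ratio rests on the assumption that both models use the same input/output token mix, which the post does not state.

```python
# Per-1M-token list prices quoted in the post (USD): (input, output).
PRICES = {
    "sonnet-5-standard": (3.0, 15.0),
    "sonnet-5-promo": (2.0, 10.0),
    "opus-4.8": (5.0, 25.0),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one task at the quoted list price."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def implied_token_ratio(cost_ratio: float, price_ratio: float) -> float:
    """Token multiplier implied by a cost ratio.

    Assumption: both models share the same input/output mix, so cost scales
    as price_ratio * token_ratio.
    """
    return cost_ratio / price_ratio


# Sonnet 5 standard vs Opus 4.8: 3/5 for input and 15/25 for output, both 0.6.
standard_price_ratio = 3.0 / 5.0
promo_price_ratio = 2.0 / 5.0

# "About 15% more per task than Opus 4.8" at standard pricing.
token_ratio = implied_token_ratio(1.15, standard_price_ratio)
promo_cost_ratio = token_ratio * promo_price_ratio

print(f"implied token usage vs Opus 4.8: {token_ratio:.2f}x")
print(f"per-task cost vs Opus 4.8 at promo pricing: {promo_cost_ratio:.2f}x")
```

Under that assumption, a per-token price that is 40% lower only translates into a lower per-task cost while token usage stays below 5/3 of the larger model's.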

Chubby♨️@kimmonismus · 2 days ago · 56

Agents that can plan but can't pay are just expensive chatbots. No account, no API key, no human in the loop. The agent sends a request, pays in USDC, gets data back. That's actually new.

ClaudeDevs@ClaudeDevs · 2 days ago · 51

We’ve added a few updates to Claude Managed Agents: Streaming session event deltas, per-session agent overrides, new webhook event types, reverse pagination, and credential injection scoping.

Rohan Paul@rohanpaul_ai · 2 days ago · 65

Most AI products ask users to leave their workflow and enter a separate box of intelligence. ⌨️ Acti (@openacti1) reverses that direction by putting the agent inside the text field, where plans, questions, replies, reminders, links, and decisions already begin. It turns the phone keyboard into an AI action layer, because people already start many small tasks inside chats, but the phone still forces them to leave the chat, open another app, finish the task, copy the result, then come back. Acti changes that flow by using the keyboard itself as the command surface. A user types what they want, holds the Acti spacebar, and the agent reads the intent, calls the right app or service, then returns something useful inside the same text field. That could mean a map link, restaurant options, a sports comparison, a clean reply, a reminder, or a Notion page. The strongest part is that this does not need a separate chatbot app. The keyboard becomes the place where AI meets the user’s real workflow. There is also a Skill Key system, where a user can bind actions to keys, like holding N for Notion or L for a LinkedIn profile view. The most practical demo is the Maps one. Someone asks where to meet, the user types “Times Square Starbucks location,” holds the Acti spacebar, and gets a ready map link plus a sendable message without opening Maps. 🧵 1.

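Summary: Acti puts an AI agent directly in the phone keyboard's text field. After typing an intent, the user holds the Acti spacebar; the AI reads the request, calls the right app or service, and returns results in the same text field, such as a map link, restaurant options, a sports comparison, a reply draft, a reminder, or a Notion page. No separate chatbot app is needed, and the keyboard becomes the place where AI meets the user's real workflow. There is also a Skill Key system for binding actions to keys (for example, hold N for Notion or L to view a LinkedIn profile). The most practical demo is Maps: type "Times Square Starbucks location," hold the spacebar, and get a map link plus a sendable message without opening the Maps app.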

OpenAI Developers@OpenAIDevs · 2 days ago · 26

As agents take on longer-running work, engineering shifts to setting direction, reviewing work, and designing better systems around the models. @steipete at @aiDotEngineer

Peter Steinberger 🦞@steipete · 2 days ago · 24

Honored to be part of @aiDotEngineer’s keynote today!

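Summary: As AI agents take on longer-running work, engineering shifts to setting direction, reviewing work, and designing better systems around the models. @steipete says he is honored to be part of @aiDotEngineer's keynote.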

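AYi@AYi_AInotes · 2 days ago · 58

Last year developers were QA for AI coding agents: finding bugs by hand, then asking the agent by hand to fix them. This year agents can test and fix their own work. Andrew Ng calls this "loop engineering," but I think the part really worth talking about is not loop engineering itself. Last weekend he built a typing practice app for his daughter; the coding agent ran on its own for an hour, repeatedly checking what it had written in a browser, with no intervention from him. What he had to do was not review code but make decisions: how to adjust the visual design, how many cat skins to add, how to change the parent login flow. These things used to sit on the "optimize when there's time" list; now that agents have eaten the code-level work, the decision-level work all surfaces.

Andrew Ng has a term for this: "context advantage." He says many people call the human's value in the loop "taste," and he dislikes the word because taste sounds like mysticism. The real human advantage is not taste but context: you know who the users are, why they are in pain, and which features they will share like crazy. Agents don't know these things, not because the model isn't strong enough, but because the information isn't in the training data.

That is the real insight of loop engineering: it can speed up code, but it cannot compress context. As long as people hold information that agents lack, people will always have an irreplaceable place in the loop. That place just keeps moving up, from QA to PM, from checking to judging. I think what is most easily replaced is the work agents can test for themselves, and what stays out of reach is the work where only you know what users want. So the real significance of loop engineering is not letting AI run longer; it is pushing your own abilities to keep moving up.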

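Summary: Andrew Ng proposes the idea of "loop engineering": AI coding agents can iterate on code and tests on their own until the result is correct, without human intervention. His example is a typing practice app he built for his daughter last weekend, where the agent ran by itself for about an hour and checked its work in a browser many times before reporting back. The developer's role therefore shifts from QA that hunts bugs by hand to higher-level decisions (such as visual design and user flows). Ng stresses that the real human advantage is not "taste" but "context advantage": knowing who the users are and why they are in pain. Loop engineering speeds up code but cannot compress context; as long as people hold information agents lack, they keep an irreplaceable place in the loop.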

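AYi@AYi_AInotes · 2 days ago · 65

Holy fucking shit, Anthropic just pushed agent capability that actually works in production straight down to its mid-tier product line: Sonnet-level pricing, Opus-level agent capability. Anthropic is on an absolute tear with this one 🤯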

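宝玉@dotey · 2 days ago · 69

Anthropic released Claude Sonnet 5 today, replacing Sonnet 4.6 as the default model for the Free and Pro plans. Anthropic's positioning is clear: agent capability close to its most expensive model, Opus 4.8, at an API price only 40% of the latter's.

The Sonnet series is the tier developers use most. But over the past few months, the main progress in AI agent capability (letting a model plan autonomously and call tools to complete multi-step tasks) was concentrated in the more expensive Opus series, and the gap between the two kept widening. Sonnet 5 narrows it again. On an agentic coding benchmark, Sonnet 5 scores 63.2%, against 58.1% for Sonnet 4.6 and 69.2% for Opus 4.8. On a knowledge work benchmark, Sonnet 5 even edges out Opus 4.8. Feedback from early testers is fairly consistent: complex tasks that Sonnet used to abandon halfway now run to completion, and the model proactively checks its own output. An engineer at Zapier said that when Sonnet 5 was asked to "update the Salesforce account tier, then send an announcement email to enterprise customers" in sequence, the model did it all in one go, where "it used to get stuck halfway."

API pricing comes in two phases: the introductory price through August 31 is $2 per million input tokens and $10 per million output tokens, rising afterward to $3 and $15. According to TechCrunch, this is below OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro but still above Gemini 3.5 Flash. One easily overlooked detail: Sonnet 5 uses a new tokenizer, and the same text may consume 1.0 to 1.35 times as many tokens. Anthropic says the introductory pricing already offsets this increase, so total cost stays roughly flat during the transition. But once the introductory price ends, actual spend will rise by more than the list-price increase suggests.

On safety, Sonnet 5 has lower hallucination and sycophancy rates than its predecessor and is better at resisting prompt injection and malicious requests in agent scenarios. Because its cybersecurity capabilities have improved, the model ships with real-time safety safeguards enabled by default (the same mechanism as Opus 4.7 and 4.8).

Sonnet 5 is available starting today on all Claude plans, in Claude Code, and via the API, under the model identifier claude-sonnet-5.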

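Summary: Anthropic has released Claude Sonnet 5, replacing Sonnet 4.6 as the default model for the Free and Pro plans. It scores 63.2% on an agentic coding benchmark (58.1% for Sonnet 4.6, 69.2% for Opus 4.8) and slightly beats Opus 4.8 on a knowledge work benchmark. The introductory API price (through August 31) is $2 per million input tokens and $10 per million output tokens, rising to $3 and $15 afterward. The new tokenizer may use 1.0 to 1.35 times as many tokens for the same text, which the introductory pricing offsets. Hallucination and sycophancy rates are lower than the previous generation's, and real-time safety safeguards are on by default. The model identifier is claude-sonnet-5, and it is available starting today on all Claude plans, in Claude Code, and via the API.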
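
To make the tokenizer point concrete, here is a small worked calculation. It takes the 1.0 to 1.35 token multiplier and the $2 and $3 input prices from this post, and the $3 Sonnet 4.6 input price reported in the Artificial Analysis post in this feed; the function name and the "per 1M old-tokenizer tokens" framing are illustrative, not Anthropic's own accounting.

```python
def effective_price(list_price: float, token_multiplier: float) -> float:
    """Price of the text that used to fit in 1M old-tokenizer tokens."""
    return list_price * token_multiplier


SONNET_46_INPUT = 3.0  # USD per 1M input tokens, as reported elsewhere in this feed

rows = []
for label, price in (("promo", 2.0), ("standard", 3.0)):
    for multiplier in (1.0, 1.35):
        eff = effective_price(price, multiplier)
        change_pct = (eff / SONNET_46_INPUT - 1.0) * 100.0
        rows.append((label, multiplier, eff, change_pct))
        print(f"{label:8s} x{multiplier:.2f}: ${eff:.2f} ({change_pct:+.0f}% vs Sonnet 4.6)")
```

In the worst case (1.35x tokens), the introductory price still lands about 10% below Sonnet 4.6 for the same text, while the standard price lands about 35% above it even though the list price is unchanged.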

elvis@omarsar0 · 2 days ago · 63

Sonnet 5 is here! This is going to support better long-running agents. Previous Sonnet models were unreliable, so it's great to see the improved version that can complete agentic tasks more reliably. It also seems to have improved substantially in computer use.

Claude@claudeai · 2 days ago · 73

Introducing Claude Sonnet 5, our most agentic Sonnet yet. It makes plans, uses tools like browsers and terminals, and runs autonomously at a level that just a few months ago required larger and more expensive models.

🚨 AI News | TestingCatalog@testingcatalog · 2 days ago · 80

ANTHROPIC 🔥: Claude Sonnet 5 has been officially announced, offering close to Opus 4.8 performance at a lower price. Sonnet 5 scored 63.2% on SWE Bench Pro, up from 58.1% for Sonnet 4.6. Have you tried it already? 👀

OpenRouter@OpenRouter · 2 days ago · 73

Claude Sonnet 5 is rolling out on OpenRouter with a promo price: $2/M in and $10/M out! It boosts agentic coding and pro workflows w/ flagship intelligence at Sonnet pricing. In early tests, agents were more reliable, faster, and easier to trust with larger tasks than 4.6.

Chubby♨️@kimmonismus · 2 days ago · 80

Here we go: Sonnet 5 is live: The tl;dr • Anthropic calls it the most agentic Sonnet yet • Near Opus 4.8-level performance, but cheaper • Strong gains in reasoning, tool use, coding, and knowledge work • Default model for Free and Pro users • Available in Claude Code and API today • Intro pricing: $2/M input, $10/M output until Aug 31 • Standard pricing: $3/M input, $15/M output • Safer than Sonnet 4.6 overall, with lower hallucination and sycophancy rates • Cyber safeguards are enabled by default, but Anthropic says Opus still remains stronger for serious cyber work

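Summary: Anthropic has released Sonnet 5, calling it the most agentic Sonnet model yet. Performance is close to Opus 4.8, with notable gains in reasoning, tool use, coding, and knowledge work. It is the default model for Free and Pro users starting today and is live in Claude Code and the API. Introductory pricing is $2/M input tokens and $10/M output tokens (through August 31); standard pricing is $3/M and $15/M. It is safer than Sonnet 4.6 overall, with lower hallucination and sycophancy rates; cyber safeguards are on by default, though Anthropic says Opus remains stronger for serious cyber work.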

OpenAI@OpenAI · 2 days ago · 58

We’re introducing GeneBench-Pro, a research-level benchmark for a harder kind of AI progress: how well agents can navigate messy biological data, choose the right analysis path, and make judgment calls that real computational research depends on. https://openai.com/index/introducing-genebench-pro/

Microsoft Research@MSFTResearch · 2 days ago · 39

AI agents often fail because their instructions, or skills, are manually modified with no guarantee of improvement. Learn how SkillOpt turns skill editing into a training process, making agent behavior more reliable without changing model weights: https://msft.it/6012vsvEs

fofr@fofrAI · 2 days ago · 73

You can bootstrap your agent quickly with the Omni API using the skill we published: https://github.com/google-gemini/gemini-skills It includes: - video editing - text to video - video generation with image references - first frame to video But it also has some helper tools for: - prepping input videos for editing (10s, 720p) - audio stripping if you want to generate new audio - video inspection

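Summary: Google has published the gemini-skills skill pack for the Gemini Omni API, covering video editing, text to video, video generation with image references, and first frame to video, plus helper tools for prepping input videos to 10 s at 720p, stripping audio, and inspecting video. The same author demonstrates the Omni Flash model's editing ability: given the prompt "turn the table into a shallow pool," the model outputs wet hands, ripples, refraction, shadows, and sound effects. The API is open and can be used to build video editing pipelines.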

Chubby♨️@kimmonismus · 2 days ago · 50

A creative agency bills you a fat monthly retainer just to figure out what's already working in your market. NoimosAI's Creative Agent launched today and runs that whole loop on its own. It scans top-performing creatives across Meta, TikTok and LinkedIn, pulls the patterns behind them, and builds assets for your brand off your own past results. You describe what you want, the agent handles the research and the creation, you sign off on the final cut. Really cool!

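Summary: NoimosAI launched Creative Agent today. It automatically scans top-performing creatives on Meta, TikTok, and LinkedIn, extracts the patterns behind them, and generates ad assets informed by the brand's own past performance. The user only describes what they want; the agent handles research and creation, and the user signs off at the end. The tool turns market insight into high-performing content by analyzing competitors, top creatives, and the brand's own data, so that output rests on strategies already shown to work.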

AK@_akhaliq · 2 days ago · 31

OSWorld2.0 Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Rohan Paul@rohanpaul_ai · 2 days ago · 55

For years, great creatives came from taste, instinct, and knowing what usually works. Now AI can turn that into a repeatable data process. @noimos_ai just launched Creative Agent, an autonomous system that researches winning creative patterns and adapts them for your brand. Describe the creative you want - It learns from your past performance to see what clicked and what did not - checks hundreds of winning creatives across competitors and the wider market on Meta, TikTok, LinkedIn, and more - It understands your business and adapts those patterns to your products and services.

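Summary: noimos_ai has launched Creative Agent, an automated system that researches winning creative patterns and adapts them to a brand. It learns from the brand's past performance (what resonated and what did not), scans hundreds of winning creatives from competitors and the wider market on Meta, TikTok, LinkedIn, and other platforms, and, once it understands the business, adapts those patterns to the brand's own products and services. The quoted post notes that it can analyze competitors, top creatives, and past results to generate high-performing assets grounded in proven strategies.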

🚨 AI News | TestingCatalog@testingcatalog · 2 days ago · 49

NoimosAI has released a Creative Agent that can gather market insights and turn them into ready-to-use brand assets, with access to features such as competitor analysis, top-performing creatives, and the brand's past results. It can scan high-performing content across Meta, TikTok, LinkedIn, and more, surface the patterns behind what works, and map them onto the brand.

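Summary: NoimosAI has launched Creative Agent, which gathers market insights and turns them into ready-to-use brand assets. It supports competitor analysis, scanning of top-performing creatives, and access to the brand's past results, and it can analyze high-engagement content across Meta, TikTok, LinkedIn, and other platforms, identify the patterns that work, and map them onto the brand's strategy.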

Rohan Paul@rohanpaul_ai · 2 days ago · 60

Agents have been good at deciding what should happen next. They have been much worse at acquiring the tools needed to make it happen. x402 and Apify’s thousands of Actors give that problem a practical solution.

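Summary: Agents are good at deciding the next step but lack a way to acquire the tools they need. x402 together with Apify's web automation tools addresses this: agents could previously buy about 2,000 tools through x402; with the upgraded Coinbase partnership, the number grows 10x to 20,000+, with no account, API key, or human in the loop required.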

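凡人小北@frxiaobei · 2 days ago · 55

Cloudflare's all-in-one suite has added Browser Rendering, which handles remote Chromium scraping. Workers Paid at $5/month includes 10 browser hours per day. I moved my scraping of the official sites of the AI companies I follow over from Jina Reader; a dozen or so sources actually use about 3 minutes per day, leaving 99%+ of the allowance untouched. It also avoids the old Jina pitfall where requests silently fail with 402 once the free tier is used up, and pricing goes from unpredictable bursts of token-based billing to a predictable $5 cap. It follows the same pattern as Pages / Workers / D1 / R2 / KV / Tunnel: the free tier is enough for personal testing, the paid tier is enough for serious projects, and there is no enterprise tier being forced on you. Scraping a personal web feed used to mean stitching together a pile of SaaS tools such as Jina, Browserless, Diffbot, and ScrapingBee; now Cloudflare alone covers most of the infrastructure. One person + one OpenClaw + the Cloudflare suite can handle basically all of it.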

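Summary: Cloudflare has added Browser Rendering for remote Chromium scraping. Workers Paid at $5/month includes 10 browser hours per day. The author migrated scraping of AI company websites from Jina Reader to Cloudflare; actual usage is about 3 minutes per day with 99%+ of the allowance unused, which avoids the silent 402 failures that occur once Jina's free tier runs out and replaces token-based billing with a predictable $5 cap. Combined with Pages, Workers, D1, R2, KV, Tunnel, and automated builds via Claude Code/OpenClaw + GitHub, one person with one agent setup can take a product from zero to launch with infrastructure costs close to free.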

Nathan Lambert@natolambert · 2 days ago · 74

When we were in China, @xeophon and I made a quick detour to visit Meituan. They continue to be one of our favorite open model builders, as they're showing how a variety of companies can succeed here and baffle a lot of people as to why they're making models. Meituan is one of the larger tech companies in China. They're building LLMs to add services to their own products. In China the notion of the "super app" is very popular, so this dream of more services for users with AI is very natural there. With this, Meituan wants to own the full stack of how they deliver value to their users. When we visited, they were very unassuming about everything. We just met a few people from the LLM team, a quick meeting about building models. They build general foundational reasoning models, and then fine-tune it further for their products. They can release the general model to support the ecosystem and learn how it can be used. Their focus was very clearly on ownership, and a hint of cost-saving, so the recent news of v2 being trained on asics fits with that mentality. They want to deliver real products to users with low cost. Companies like this will keep building models in China. It's a small micro study of how different the players in the AI ecosystem are. Kimi, Z ai, etc are all much flashier offices, come across as the "hot new thing" but Meituan has the talent and resources to build models as well. Congrats to the Meituan team & thx for having us!

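Summary: Meituan has released LongCat-2.0 (v2), a foundation reasoning model with an MoE architecture, 1.6T total parameters, about 48B active, and a 1M context window. Designed for agentic coding, it introduces LongCat Sparse Attention, Zero-Compute Experts, and MOPD task routing. On benchmarks it reaches 59.5 on SWE-bench Pro (ahead of GPT-5.5's 58.6) and leads on several agent evaluations. The model is live on OpenRouter and a technical blog post has been published. Meituan emphasizes full-stack ownership and low cost; v2 was trained on ASICs.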

🚨 AI News | TestingCatalog@testingcatalog · 2 days ago · 48

Apify has partnered with Coinbase to add more than 20,000 of its web automation Actors to the x402 ecosystem, giving AI agents thousands of tools they can discover, pay for, and run on their own. When an agent calls an Actor, it gets back an HTTP 402, settles the payment in USDC on Base, and the Actor runs. Built on x402 by Coinbase.

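Summary: Apify has partnered with Coinbase to bring more than 20,000 web automation Actors into the x402 ecosystem. AI agents can discover, pay for, and run these tools on their own: calling an Actor returns an HTTP 402 status code, and the Actor runs as soon as payment is settled in USDC on Base. The x402 ecosystem previously had only about 2,000 tools (from @apify); the partnership raises the number of available tools 10x, with no account, API key, or human in the loop required.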
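
The request, 402, pay, retry loop described in this post can be sketched as a toy in-process simulation. Everything below is hypothetical: `ToyActorServer`, `call_with_payment`, the price, and the receipt scheme are stand-ins for illustration, and the real x402 payload format and on-chain settlement are not modeled.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Response:
    status: int
    body: dict


@dataclass
class ToyActorServer:
    """Stand-in for a paid tool endpoint; not the real Apify or x402 API."""

    price_usdc: float = 0.05  # hypothetical price per run
    settled: set = field(default_factory=set)

    def handle(self, payload: dict, payment_receipt: Optional[str] = None) -> Response:
        if payment_receipt not in self.settled:
            # No valid payment attached: reply 402 Payment Required with the terms.
            terms = {"amount": self.price_usdc, "asset": "USDC", "network": "base"}
            return Response(402, terms)
        return Response(200, {"result": f"ran actor with {payload}"})

    def settle(self, amount: float) -> str:
        """Pretend settlement; returns a receipt id the server will accept."""
        if amount < self.price_usdc:
            raise ValueError("underpayment")
        receipt = uuid.uuid4().hex
        self.settled.add(receipt)
        return receipt


def call_with_payment(server: ToyActorServer, payload: dict, budget_usdc: float) -> Response:
    """Agent-side loop: try, read the 402 terms, pay within budget, retry."""
    first = server.handle(payload)
    if first.status != 402:
        return first
    amount = first.body["amount"]
    if amount > budget_usdc:
        raise RuntimeError("price exceeds agent budget")
    receipt = server.settle(amount)
    return server.handle(payload, payment_receipt=receipt)


if __name__ == "__main__":
    reply = call_with_payment(ToyActorServer(), {"url": "https://example.com"}, budget_usdc=1.0)
    print(reply.status, reply.body)
```

The budget check is the one piece of policy the agent side adds here; with no account or API key in the flow, a spending limit is a natural place to keep a guardrail.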

凡人小北@frxiaobei · 2 days ago · 20

Tried it. Keep at it, I guess! The only value of this app is that it burned through a huge pile of tokens.

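Summary: OpenClaw is now on iOS and Android, finally shipping native mobile apps that put the agent in your pocket for managing channels, tasks, and replies anytime. After trying it, user @小北 commented: "Keep at it, I guess! The only value of this app is that it burned through a huge pile of tokens."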

Chubby♨️@kimmonismus · 2 days ago · 51

Of all the places people keep trying to bolt AI onto, the keyboard is the one that finally clicks for me. An agentic keyboard just feels like the right form factor: it's the single surface that follows you into every app you open, so turning it into an action layer instead of just a place to type is a genuinely smart move. That's exactly what Acti does. You type what you want, hold to run it, and the result comes back ready to send without ever leaving the conversation. Bind your own workflows to a skill key and fire them on the spot. This is one of the most interesting things I've seen come out of the agentic space all year, and it's the kind of shift that feels obvious in hindsight.

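Summary: Acti (@openacti1) has launched the Agentic Keyboard, positioned as the next shift since Apple's glass keyboard in 2007. It is not a grammar fixer or voice transcription tool; it embeds an invisible agent in every text field. The user types, holds to run, and the result comes straight back without leaving the current conversation. Custom workflows can be bound to skill keys and triggered instantly. The author considers it one of the most interesting things to come out of the agentic space this year and calls the keyboard the ideal form factor for AI.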

elvis@omarsar0 · 2 days ago · 53

The gap in autonomous agentic loops that gets ignored: agents can plan and call APIs but can't acquire tools they don't have access to. x402 + Apify's 20,000+ Actors is a concrete fix for that. Worth paying attention to.

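Summary: Autonomous agents can plan and call APIs but cannot acquire tools they lack access to. The x402 protocol plus Apify's 20,000+ Actors closes that gap. Agents could previously buy only about 2,000 tools through x402; Apify's partnership with Coinbase expands that 10x to 20,000+, giving autonomous agents the largest marketplace of web automation tools, with no account, API key, or human in the loop required.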

Nathan Lambert@natolambert · 2 days ago · 69

letssss gooooo breaking this bad boy out today loooooooooooong cat

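Summary: Meituan LongCat has officially released LongCat-2.0, a 1.6T-parameter MoE model with about 48B active parameters and a 1M context window. Designed for agentic coding, its core innovations are LongCat Sparse Attention (LSA), which scales efficiently to 1M context; Zero-Compute Experts (33B–56B dynamically activated, no wasted compute); and MOPD, mixed expert groups routed per task to Agent/Reasoning/Interaction. Benchmarks: Terminal-Bench 2.1 70.8, SWE-bench Pro 59.5 (ahead of GPT-5.5's 58.6), SWE-bench Multilingual 77.3, FORTE 73.2, RWSearch 78.8, BrowseComp 79.9. It can be tried as Owl Alpha on OpenRouter.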

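meng shao@shao__meng · 2 days ago · 74

The Flowith team has launched "Matrix," an operating system for agent companies: you set the mission, and Matrix orchestrates multi-agent departments that run over the long term, aiming for a complete business loop from creation and distribution to monetization.

Matrix's core claims
· Product form: a self-evolving, multi-level multi-agent runtime
· User role: head of strategy (sets the mission), not day-to-day executor
· Organization model: CEO Office → OKR → departments (Research / Engineering / Growth / Product) → proof and retrospective
· Business loop: build the site, connect Stripe, send email, run ads, produce content, collect revenue
· New metric: VPTD (Value Per Token Dollar) = value produced ÷ token cost

Product architecture
1. Runtime layer: each agent has its own browser, tools, files, and memory; supports Neo / Claude Code / Codex and others, with an emphasis on very long-running, proactive execution (not question-and-answer).
2. Coordination layer: the user supplies intent + assets → CEO Office sets goals and cadence → OKRs break down tasks → departments run in parallel → the loop closes on proof (files, screenshots, live pages, revenue, traffic).
3. Company primitives: built-in website deployment (*.matrix.site or a custom domain), Stripe payments, Agent Wallet (budgets and approvals), and Agent Email. It claims you can skip traditional setup such as company registration, bank cards, and domains.
4. Delivery: a macOS client for now, with a web version "coming soon."

GDPval-Bench results
1. Matrix (GPT 5.5 + harness): 95.45%
2. Codex CLI (GPT-5.5): 84.9%
3. Claude Opus 4.7: 80.3%

GDPval is a benchmark proposed by OpenAI and others that tests real knowledge work deliverables (legal documents, engineering drawings, customer service conversations, and so on) across 44 occupations.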

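Summary: The Flowith team has released "Matrix," positioned as a self-evolving, multi-level multi-agent runtime. Once the user sets a mission, Matrix breaks it into tasks via CEO Office → OKR and drives multiple agent departments (Research/Engineering/Growth/Product) in parallel, closing the loop on proof (files, live pages, revenue, and so on). The architecture includes a runtime layer (separate browser/tools/memory per agent, support for Neo/Claude Code/Codex, very long-running execution) and company primitives (website deployment, Stripe payments, Agent Wallet, Agent Email). The macOS client is available and a web version is coming soon. On GDPval-Bench, which covers real knowledge work across 44 occupations, Matrix (GPT 5.5 + harness) scores 95.45%. In the earlier limited beta, users had already created tens of thousands of zero-person companies.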

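宝玉@dotey · 2 days ago · 65

Q: Our company has a dozen or so microservices, and we want developers to use AI agents for system design and coding. The problem is that a single user story often needs several microservices to cooperate, so the agent has to understand each service's responsibility boundary and business concepts to produce a sensible design. We plan to put all the microservices in one workspace, give each service its own documentation, and let the AI work it out. Is that reasonable? Is there a better practice?

A: Using agents well comes down to two things: the quality of the context, and a closed verification loop.

First, context quality. Putting everything in one workspace is the approach the community currently tends to recommend. A monorepo is a natural fit for AI, because the agent can see the schema definitions, API protocols, and each service's implementation in one place. If history makes it impractical to merge into a monorepo, a compromise is a virtual monorepo: clone the repositories into the same local directory. Beyond colocating the code, documentation is also a good way to give the agent context; ideally you give the agent a map plus on-demand loading:

1. Put a top-level AGENTS.md (or CLAUDE.md) in the root directory as an index that lists which services exist, what each is responsible for, and that to change a service you should read the docs in its directory.
2. Put another one in each microservice's own directory that spells out its responsibility boundary and business concepts; this is essentially the bounded context from DDD.
3. Have the agent read the root index first, locate the relevant services, and then load their details.

Keep the documentation up to date, though, especially when a microservice's protocol changes; otherwise it will mislead. Anything that can be generated from code or specs should not be written by hand. Handwritten docs eventually drift from the code, whereas a machine-readable interface spec such as OpenAPI serves as documentation and can also be used to generate mocks and tests.

Besides documentation, there is another context source many people overlook: protocol test code. A high-quality contract test is itself the most accurate living documentation. It describes precisely the protocol the services actually use to interact, and it goes stale less easily than human-written docs, because if it is wrong the tests fail. If you already have OpenAPI specs or Pact contract files, they are very valuable for helping the agent understand service boundaries.

Now verification. In a microservice setting, verification is the most troublesome part, because one user story may involve several cooperating services, and you cannot have the agent bring up the whole system for an end-to-end test after every line it changes. A practical approach: each microservice provides a mock server, or a simulated service generated automatically from its OpenAPI spec. After writing code, the agent can run contract tests locally to check whether its change breaks the protocol agreed with other services, without depending on real production APIs or a full integration environment. That gives the agent a "write code → run tests → self-correct" loop that does not need frequent human intervention.

To go a step further, look into contract testing (consumer-driven contract testing; the usual tool is Pact). The idea is that the caller records the shape of the interface it actually uses and generates a contract file, and the callee then verifies that it can satisfy that contract.

In short: a unified workspace provides the global view, layered docs + protocol tests provide precise context, and mock servers + contract tests provide the verification loop. With these three layers in place, agents become fairly dependable at system design across microservices.

Some references:
1. Anthropic's Effective context engineering for AI agents, on treating context as a scarce resource and loading it on demand: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
2. Anthropic's Effective harnesses for long-running agents, on scaffolding agents for long tasks (for example, using a progress file plus git history to hand off across context windows): https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
3. For how to organize AGENTS.md in a monorepo for agents, see this post on http://dev.to, Steering AI Agents in Monorepos with AGENTS.md: https://dev.to/datadog-frontend-dev/steering-ai-agents-in-monorepos-with-agentsmd-13g0

For an introduction to contract testing, search for guides on Pact plus consumer-driven contract testing.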

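Summary: The advice is to put all microservices in one workspace (a monorepo or virtual monorepo) so the agent can see schemas, APIs, and implementation code together. Documentation is layered: a root AGENTS.md indexes each service's responsibilities, and each service documents its own bounded context. Prefer generating docs from machine-readable specs such as OpenAPI. Contract tests are precise living documentation that verify interactions between services. For verification, each service provides a mock server or an OpenAPI-based simulated service, and the agent runs contract tests locally to form a "write code → run tests → self-correct" loop. Consumer-driven contract testing (for example, Pact) is a further step.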
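
The consumer-driven contract idea in this answer can be illustrated without any tooling. The sketch below is plain Python, not Pact: `ORDER_CONTRACT`, `provider_handler`, and `verify_contract` are hypothetical names, and the contract format is a simplification made up for this example.

```python
# Contract written by the consumer: only the fields it actually reads.
ORDER_CONTRACT = {
    "request": {"method": "GET", "path": "/orders/42"},
    "response": {"status": 200, "body": {"id": int, "status": str, "total_cents": int}},
}


def provider_handler(method: str, path: str):
    """Hypothetical provider implementation (or a mock generated from its spec)."""
    if method == "GET" and path.startswith("/orders/"):
        order_id = int(path.rsplit("/", 1)[1])
        # Extra fields such as "currency" are fine; the consumer does not depend on them.
        return 200, {"id": order_id, "status": "paid", "total_cents": 1999, "currency": "USD"}
    return 404, {}


def verify_contract(contract: dict, handler) -> list:
    """Return a list of violations; an empty list means the provider satisfies the consumer."""
    request, expected = contract["request"], contract["response"]
    status, body = handler(request["method"], request["path"])
    problems = []
    if status != expected["status"]:
        problems.append(f"status {status} != {expected['status']}")
    for name, expected_type in expected["body"].items():
        if name not in body:
            problems.append(f"missing field {name!r}")
        elif not isinstance(body[name], expected_type):
            actual = type(body[name]).__name__
            problems.append(f"field {name!r} is {actual}, expected {expected_type.__name__}")
    return problems


if __name__ == "__main__":
    print(verify_contract(ORDER_CONTRACT, provider_handler) or "contract satisfied")
```

An agent that changes the provider can rerun a check like this locally after each edit; a non-empty violation list is the signal to self-correct before anything reaches a shared environment.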

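凡人小北@frxiaobei · 2 days ago · 70

When building agent automation systems, one pit that is easy to fall into is writing the "release signal" somewhere the caller can also write. For example, an AI reviewer posts a comment under the PR, a monitor reads the comments back, and on seeing High: None it merges automatically. That sounds reasonable but is actually dangerous, because PR comments are a channel third parties can write to: any person or agent with comment permission can forge a correctly formatted approval. The trust result for a security gate should travel through an in-process closed loop: returncode, in-memory state, file descriptors, signed results. Comments can be shown to humans, but they must not act as the gate.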

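Summary: Putting a release signal in a channel the caller can write to, such as PR comments, is risky. If an AI reviewer posts a comment and a monitor merges automatically on reading "High: None," anyone or any agent with comment permission can forge the result. The trust result for a security gate should go through an in-process closed loop (such as returncode or in-memory state); comments are for reading only and must not serve as the gate.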
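
One way to read the advice above is a gate that trusts only the reviewer process's exit status. This is a minimal sketch under that assumption: `review_passes` and `gate` are hypothetical names, and the reviewer command is whatever your pipeline runs.

```python
import subprocess
import sys


def review_passes(review_cmd: list) -> bool:
    """Run the reviewer as a child process and trust only its exit status."""
    result = subprocess.run(review_cmd, capture_output=True, text=True, timeout=600)
    # result.stdout may be posted as a PR comment for humans to read,
    # but it is deliberately never parsed when deciding the gate.
    return result.returncode == 0


def gate(review_cmd: list) -> str:
    """Return 'merge' only when the reviewer process itself exits with 0."""
    return "merge" if review_passes(review_cmd) else "block"


if __name__ == "__main__":
    # A reviewer that prints a passing-looking line but exits non-zero is still blocked.
    fake = [sys.executable, "-c", "import sys; print('High: None'); sys.exit(1)"]
    print(gate(fake))
```

The text "High: None" appears in the output of the failing command, yet the gate blocks, because a writable text channel is never consulted.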

🚨 AI News | TestingCatalog@testingcatalog · 2 days ago · 35

Bloome launched its instant messaging platform for agentic teams! Agents can draft, push back on one another, cross-check details, and refine the output until it is ready. Models like Claude, ChatGPT, and DeepSeek can run side by side with coding agents as well as custom agents built in @Bloome_im

🚨 AI News | TestingCatalog@testingcatalog · 2 days ago · 79

Meituan released LongCat-2.0, a new 1.6T parameter model with 1M context window! > Both the full training run and the large-scale deployment are built entirely on AI ASIC superpods. It is also available for testing on OpenRouter under the Owl Alpha name.

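Summary: Meituan has launched LongCat-2.0, with 1.6T total parameters (MoE architecture, about 48B active) and a 1M context window. Training and deployment run entirely on AI ASIC superpods, and the model is available for testing on OpenRouter under the name Owl Alpha. It is designed for agentic coding: LongCat Sparse Attention (LSA) handles million-token contexts efficiently; Zero-Compute Experts dynamically activate 33B–56B parameters per token with no wasted compute; and the MOPD mechanism has three task-gated expert groups (Agent/Reasoning/Interaction). Benchmarks: Terminal-Bench 2.1 70.8, SWE-bench Pro 59.5 (GPT-5.5 scored 58.6 at the time), SWE-bench Multilingual 77.3, FORTE 73.2, RWSearch 78.8, BrowseComp 79.9.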

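小互@xiaohu · 2 days ago · 62

OpenClaw has launched its own mobile client
• Pair with your "lobster" via QR code or setup code
• Chat with your AI lobster on your phone
• Supports real-time and background voice conversation modes
• Before the agent performs an action, it asks you to approve it on your phone
• Share text, links, and images in directly from other apps
• Can be granted device permissions such as camera, location, photos, contacts, calendar, and reminders
• Receive push notifications and node status updates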

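Summary: OpenClaw has launched a mobile client that pairs with the AI assistant "lobster" via QR code or setup code. It supports real-time and background voice conversations on the phone; agent actions require approval on the phone before they run; text, links, and images can be shared in from other apps; device permissions such as camera, location, photos, contacts, and calendar can be granted; and it receives push notifications and node status updates.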

SiliconFlow@SiliconFlowAI · 2 days ago · 67

The full model behind "Owl Alpha" on @OpenRouter is here🦉 Let's meet @Meituan_LongCat 's latest flagship model, LongCat-2.0 Now Day 0 live on SiliconFlow 🔥 💰 Input Cache/Input/Output: $ 0.015/0.75/2.95 per 1M tokens ⚙️ 1.6T-param MoE (~48B active) · Native 1M context window 🧠 Built for agentic coding from the ground up: ◆ LSA: sparse attention that scales efficiently to 1M ◆ Zero-Compute Experts: dynamic 33B–56B active/token, no wasted compute ◆ MOPD: three specialized expert groups (Agent / Reasoning / Interaction), gate-routed per task 🏆 59.5 SWE-bench Pro: performance on par with mainstream closed-source models Start building with 🐱👇

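Summary: Meituan LongCat has launched its flagship model LongCat-2.0, a 1.6T-parameter MoE (about 48B active) with a native 1M context window. Pricing is $0.015 per 1M tokens for cached input, $0.75 per 1M tokens for input, and $2.95 per 1M tokens for output. The model is designed for agentic coding and includes three techniques: LSA sparse attention for efficient scaling to 1M; Zero-Compute Experts, which dynamically activate 33B–56B parameters per token with no wasted compute; and MOPD, which splits experts into Agent / Reasoning / Interaction groups with gate routing per task. It scores 59.5 on SWE-bench Pro, close to mainstream closed-source models. It is live on SiliconFlow as a Day 0 release.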