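Using a coding agent well comes down to the two ends of the process, especially the start: if the work goes off course at the beginning, no amount of later fixing will rescue it. When I build a new feature, for example, I don't hand it straight to an agent to write. I first tidy up the requirements and send them to three different agents (Codex, Claude Code, Cursor) in Plan mode so that each writes a plan, and this step should use the best models available. Once all three are done, I look at which plan is best and what is worth borrowing from the others. Neither GPT-5.5 nor Claude Opus 4.7 is always the winner. After picking a design, I send the other two designs to the chosen agent and ask it to borrow from them. If none of the plans is satisfactory, I keep adjusting the prompt and discussing over multiple rounds. A simple plan can be implemented right away. For a complex plan, I have the agent split it into several phases, spell out the requirements and the verification method for each phase, save it as a Markdown document, and reference all the relevant materials. The lazy option is to pass the plan file to the agent with /goal and let it execute phase by phase; if you are worried about the agent drifting, review each step manually as it completes and correct course promptly. For writing code, use the best model if you can, but if you want to save money a cheaper model is fine too: with a good design and clear acceptance criteria, it cannot drift very far. The final code review doesn't need many agents; something like GPT-5.5 is enough, and the focus is whether the code meets the design requirements and whether it has quality problems. It is much like having several senior architects each produce a system design: you make the call, programmers carry it out, and a senior programmer or architect reviews the code at the end.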

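Summary: The key to using a coding agent well is the initial planning. Organize the requirements first, then have the strongest models (such as GPT-5.5 and Claude Opus 4.7) generate design plans in the Plan mode of Codex, Claude Code, and Cursor; pick the best plan and borrow from the others. Split a complex plan into multiple phases with explicit requirements and verification criteria, and save it as a Markdown document. Execute phase by phase, with manual review to correct course. For the final code review, having GPT-5.5 check code quality and conformance to the design is enough. Avoid having multiple agents cross-review each other, or the code may keep growing with every round of changes.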
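The phased plan document described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Phase` fields, the Markdown layout, the feature name, and the file path `docs/schema.md` are all hypothetical, and the post does not prescribe any particular format.

```python
from dataclasses import dataclass, field


@dataclass
class Phase:
    """One phase of a plan: what to build and how to verify it (hypothetical shape)."""
    title: str
    requirements: list[str]
    verification: list[str]
    references: list[str] = field(default_factory=list)


def render_plan(feature: str, phases: list[Phase]) -> str:
    """Render the plan as Markdown, one section per phase."""
    lines = [f"# Plan: {feature}", ""]
    for number, phase in enumerate(phases, start=1):
        lines.append(f"## Phase {number}: {phase.title}")
        lines.append("")
        lines.append("Requirements:")
        lines.extend(f"- {item}" for item in phase.requirements)
        lines.append("")
        lines.append("Verification:")
        # Checkboxes make it easy to review each phase before moving on.
        lines.extend(f"- [ ] {item}" for item in phase.verification)
        if phase.references:
            lines.append("")
            lines.append("References:")
            lines.extend(f"- {item}" for item in phase.references)
        lines.append("")
    return "\n".join(lines)


plan = render_plan(
    "CSV export",  # hypothetical feature name
    [
        Phase("Data layer", ["Add an export query"], ["Unit tests pass"], ["docs/schema.md"]),
        Phase("API endpoint", ["Expose GET /export"], ["Integration test returns 200"]),
    ],
)
print(plan)
```

Keeping the verification items next to each phase is what lets a cheaper model execute the plan without drifting far, since every phase carries its own acceptance criteria.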

Rohan Paul@rohanpaul_ai · 5月28日63

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/chinas-huawei-reveals-a-new-chip 🗞️ China’s Huawei reveals a new chip design breakthrough which can close its gap with TSMC and Intel 🗞️ New Alibaba + Nanjing Univ paper shows standard LLMs can handle very long context faster by making attention selectively sparse. 🗞️ Deep Dive on DeepSeek: Their real story is not cheaper chatbots, but architecture that turns hardware scarcity into strategy. 🗞️ New Meta + Stanford + Illinois survey paper argues that AI agents work better when code becomes their main working layer. 🗞️ Anthropic billionaire cofounder backs Pope Leo, warning that AI job losses will create a historic moral crisis 🗞️ xAI just Dropped ‘Grok Build’: The Terminal-Native Agentic AI for all SuperGrok and X Premium+ users.

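Summary: Huawei has disclosed a chip design breakthrough aimed at narrowing its gap with TSMC and Intel. A paper from Alibaba and Nanjing University proposes that standard LLMs can handle long context more efficiently through selectively sparse attention. A deep dive on DeepSeek argues that its core value is not cheaper chatbots but an architecture that turns hardware scarcity into a strategic advantage. A survey paper from Meta, Stanford, and Illinois argues that AI agents work better when code becomes their main working layer. An Anthropic cofounder warns that AI-driven job losses will create a historic moral crisis. xAI launched Grok Build, a terminal-native agentic AI product, for SuperGrok and X Premium+ users.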

elvis@omarsar0 · 5月28日61

It's crazy that this is even possible today. It inspired me to build my own self-improving coding agent with simple read, write, bash,... I already used the coding agent to build an entire production-grade application in 24 hrs. I don't know, man. This feels so strange.

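Summary: It is hard to believe this is even possible today. It inspired me to build my own self-improving coding agent with simple read, write, and bash tools. I have already used that coding agent to build an entire production-grade application in 24 hours. I don't know, man. This feels so strange.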

🚨 AI News | TestingCatalog@testingcatalog · 5月28日74

BREAKING 🚨: Sesame just released HER > Sesame iOS app is now available in Preview, offering a collection of 4 personal voice agents. > Sesame Agents are powered by a SOTA real-time voice mode. > Agents can search the web, manage reminders, and have memory. > App rollout is gradual and has a whitelist. Nothing stands even close ATM.

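Summary: Sesame released a preview of its iOS app with 4 personal voice agents. The agents run on a SOTA real-time voice mode and can search the web, manage reminders, and keep memory. The rollout is gradual and currently gated by a whitelist. The quoted tweet indicates this is the official launch following last year's research preview, with new features, new characters, and stronger capabilities.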

Artificial Analysis@ArtificialAnlys · 5月28日37

We're excited to work with Harvey to launch the full leaderboard for Legal Agent Benchmark - coming soon to Artificial Analysis!

Summary: We are excited to work with Harvey to launch the full Legal Agent Benchmark leaderboard, coming soon to Artificial Analysis!

Greg Brockman@gdb · 5月28日62

Codex for parallel browser-using subagents:

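Summary: Codex subagents driving browsers in parallel: a single prompt spawns seven browser sessions that run at the same time. Flights, cars, Airbnb, hikes, forms, checkout pages. Still rough, but it feels very much like the future.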

Krea@krea_ai · 5月28日62

Krea 2 now built in to Hermes

Summary: Krea 2 is now built into Hermes.

Google AI Developers@googleaidevs · 5月28日49

Agents require speed and performance across complex tasks. Watch Gemini 3.5 Flash’s intelligence tackle these tasks at scale while you build ↓

Summary: Agents need both speed and performance on complex tasks. Watch Gemini 3.5 Flash's intelligence handle these tasks at scale while you build ↓

swyx@swyx · 5月28日75

cognition is now the largest independent agent lab in the world. take the 200% utilization that everyone is hitting from this chart and run out the sales growth from this, i encourage you to go thru the exercise if you are new to investing a lot of you have read my cog initation post, but when i talk to people they do not sufficiently appreciate that this is the perfect storm of all the trends: - first* koding agent - first* cloud dev infra - best reviewed code review/security guy - first* llm wiki knowledge base - s-tier GTM - most IOI golds - somehow also cracked at Smash Bros and Poker agent lab gets you: - long model diversity - long reasoning/toolcalling models - long harnesses - long domain specific RLFT - long coding data/expertise/evals - long full agentic SDLC - chief partner to CIOs in combating tokenmaxxing slop from humans and from agents - SOTA foodbench from house chefs and this is why you can see this is Peter Thiel's biggest AI bet *i dont like "first" as an adjective, but "first + 2 years of serious scaling" really serves as a shorthand for "most enterprise battle tested" - see customer list in screenshot - literally 10s of thousands of devs, 10s of thousands of repos, PER CUSTOMER is the kind of weight I put on the simple word "first"

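Summary: Cognition announced that it is now the largest independent agent lab in the world. The company raised more than $1 billion at a $26 billion valuation, led by Lux Capital, General Catalyst, and others. Enterprise usage has grown more than 10x since the start of the year, and annualized revenue has risen to $492 million. Cognition launched Devin two years ago, positioned as the first AI software engineer. The company stresses several leading advantages, including the first coding agent and top-tier code review, and it has major backing from Peter Thiel.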

宝玉@dotey · 5月28日62

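Haha, strongly agree: setting up a bunch of roles to chat with each other has little value and is a pure waste of tokens. It is like the early attempts to strap wings onto people so they could fly. Humans divide labor this way because our abilities are limited and nobody can master every trade; that doesn't mean AI has to do the same. It is not entirely useless, though: you still get emotional value from setting up a whole Three Departments and Six Ministries (三省六部) to report to you and living out an emperor fantasy.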

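Summary: The tweet sharply criticizes agent designs that imitate human org charts, assign different roles, and pass context through chat, calling them a pure waste of tokens. Its argument is that humans divide labor because their abilities are limited, and AI should not be bound by that constraint. It concedes the approach may offer emotional value, but uses the Three Departments and Six Ministries metaphor to reduce it to satisfying the user's fantasy.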

Artificial Analysis@ArtificialAnlys · 5月28日71

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50% ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM’s deep expertise in enterprise IT operations Artificial Analysis has worked closely with IBM over the last 6 months to develop a implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time ITBench-AA SRE overview: ➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks ➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident ➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions Methodology details: ➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task ➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. 
Each submission is compared against a ground-truth set of root causes provided by IBM Research ➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats. ➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models. Key findings: ➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42% ➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench ➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives ➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%

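Summary: Artificial Analysis and IBM Research jointly launched ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering (SRE). The benchmark contains 59 Kubernetes incident-response tasks, and every frontier model scores below 50%. Claude Opus 4.7 leads at 47%, GPT-5.5 scores 46%, and Qwen3.7 Max scores 42%. Among open-weights models, GLM-5.1 (Reasoning) scores 40%, effectively tied with Gemini 3.5 Flash; DeepSeek V4 Pro scores 38%. The analysis also found that turn counts vary nearly 3x across models, but longer trajectories do not translate to higher accuracy.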
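The scoring rule in the announcement ("average precision at full recall") can be sketched as follows. This is a minimal reading of the description in the tweet, not the Stirrup harness or the official scorer; the function names and the entity identifiers are made up for illustration.

```python
def repeat_score(submitted, ground_truth):
    """Score one repeat: 0.0 unless every ground-truth root cause was submitted,
    otherwise precision = true positives / (true positives + false positives)."""
    submitted = set(submitted)
    ground_truth = set(ground_truth)
    if not submitted or not ground_truth <= submitted:
        return 0.0  # missed at least one root cause (or submitted nothing)
    true_positives = len(submitted & ground_truth)
    return true_positives / len(submitted)


def headline_score(runs):
    """Average the per-repeat scores over all (submitted, ground_truth) pairs."""
    scores = [repeat_score(submitted, truth) for submitted, truth in runs]
    return sum(scores) / len(scores)


# Hypothetical entity names, for illustration only.
runs = [
    ({"deploy/cart", "svc/checkout"}, {"deploy/cart"}),       # full recall, one false positive
    ({"deploy/cart"}, {"deploy/cart", "pod/redis-0"}),        # missed a root cause
    ({"deploy/cart", "pod/redis-0"}, {"deploy/cart", "pod/redis-0"}),  # exact match
]
print([repeat_score(s, t) for s, t in runs])
print(headline_score(runs))
```

Under this rule, adding extra suspects never rescues a miss and always costs precision, which is consistent with the finding that models that over-investigate lose points to false positives.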

elvis@omarsar0 · 5月28日57

// Your Agents are Aging Too // Huh!? They need "sleep," and now they are aging? Joke aside, great write-up on reliable agentic engineering. This new research introduces AgingBench, a longitudinal reliability benchmark. It organizes agent aging into four mechanisms, including compression aging and interference aging, and measures not just whether deployed agents degrade but what form the degradation takes and where repair should target. We benchmark agents on day one and then deploy them for months. That gap hides a basic systems question. How long does an agent stay reliable after deployment? Even with frozen model weights, an agent's effective state keeps shifting. It compresses interaction history, retrieves from a growing memory store, revises facts after updates, and goes through routine maintenance. Reliability becomes a lifespan property of the full harness, not a snapshot of the base model. Paper: https://arxiv.org/abs/2605.26302 Learn to build effective AI agents in our academy: https://academy.dair.ai/

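Summary: The research introduces AgingBench, a benchmark for longitudinal evaluation of agent reliability. It groups agent aging into four mechanisms, including compression aging and interference aging, and measures not only whether deployed agents degrade but also what form the degradation takes and where repair should target. Even with frozen model weights, an agent's effective state keeps shifting as it compresses interaction history, retrieves from a growing memory store, and revises facts after updates, so reliability is a lifespan property of the full harness rather than a snapshot of the base model. Agents are typically benchmarked on day one but then stay deployed for months.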

swyx@swyx · 5月28日42

insanely good company to keep

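Summary: Railway is launching an "agent-native cloud" and claims 3M users, 100,000 sign-ups per week, and more than $200,000 in spending on coding agents. The founder explains why AI agents need a new kind of cloud environment: Railway has moved most of its workloads to its own bare-metal data centers, agents make the CLI (command-line interface) more important than the dashboard, and the traditional Git/PR/CI/CD loop is starting to break down. The piece also covers how production branches and feature flags make an AI SRE safer, and quotes his view: "If you are still writing code by hand, you are doing it wrong."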

Rohan Paul@rohanpaul_ai · 5月28日71

Another great win for agentic coding. Cognition AI just raised over $1B at a $26B pre-money valuation. Revenue reportedly climbed from $37M in annualized run-rate to $492M, while customers like Goldman Sachs and Mercedes-Benz suggest Devin is moving from demo rooms into production workflows. Cognition's progress is driven by its flagship product, Devin, which aims to function as an autonomous junior engineer, going beyond typical coding assistants. Devin can plan, test, and deploy code through multi-step workflows in secure environments. Cognition combines its own models with OpenAI and Anthropic rather than relying on one model. Cognition is basically pitching Devin as a model-agnostic agent layer: the LLM does the reasoning and code generation, while Devin supplies the engineering workspace, repo context, terminal access, file edits, tests, and model choice around it. Last July, Cognition agreed to buy the remains of coding startup Windsurf after Google struck a $2.4 billion deal for Windsurf’s top talent and licensing rights.

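Summary: Cognition AI raised more than $1 billion at a $26 billion pre-money valuation. Its annualized revenue grew from $37 million to $492 million, and customers including Goldman Sachs and Mercedes-Benz signal that its product Devin is moving into production. Devin is positioned as an autonomous junior engineer that can plan, test, and deploy code through multi-step workflows. Cognition takes a model-agnostic approach that combines its own models with those of OpenAI and Anthropic rather than relying on a single model. The company also agreed last July to acquire the remaining assets of coding startup Windsurf.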

Google Gemini@GeminiApp · 5月28日51

From the #GoogleIO stage straight to the Gemini Discord Stage, join us for our next community event as we dive into two new agentic tools (Gemini Spark and Daily Brief) with members of the team who brought them to life. See these new features in action with live demos, plus get a chance to ask your questions live. 👉Join our Discord to watch live: http://discord.gg/gemini 📅 Today (Wednesday, May 27) at 11:30 AM PT

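Summary: From the #GoogleIO stage straight to the Gemini Discord Stage: join our next community event, where we dive into two new agentic tools (Gemini Spark and Daily Brief) with members of the team. Watch live demos of the new features and get a chance to ask questions in real time. 👉Join our Discord to watch live: http://discord.gg/gemini 📅 Today (Wednesday, May 27) at 11:30 AM PT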

Rohan Paul@rohanpaul_ai · 5月28日53

OpenAI and Thrive just built a self-improving tax agent with up to 97% accuracy. Tax AI processed 7,000 returns across 30+ accounting firms, saved about one-third of preparation time, reached up to 97% accuracy, and raised throughput by about 50%. The hard part was not reading W-2s or 1099s, but handling messy K-1s, rental schedules, notes, spreadsheets, prior-year files, and values that must match across documents. The system records the full trace: source file, extracted field, citation, tax-engine mapping, accountant correction, and final filed value. Repeated corrections become eval targets, so Codex gets a narrow task with evidence, code, tests, and a pass condition. A wrong tax field can come from many places: bad extraction, weak mapping, unsupported workflow, prior-year carryover, or human judgment. The clever part was not simply using Codex to write fixes, but building a product environment where repeated practitioner corrections became bounded, testable engineering tasks. In the rental-property example, the agent could inspect source documents, extraction traces, mapper behavior, expected outputs, and regression tests before proposing a change.

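Summary: OpenAI and Thrive built a self-improving tax AI agent that has processed about 7,000 returns across more than 30 accounting firms. It cut preparation time by about one-third, raised throughput by about 50%, and reached up to 97% accuracy. The hard part was handling messy K-1s, rental schedules, and other unstructured files, plus matching values across documents. The system records a full trace for every operation and uses accountants' repeated corrections as eval targets, giving Codex testable code-fix tasks and closing the self-improvement loop.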
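The idea that repeated corrections become eval targets can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `TraceRecord` fields mirror the trace items listed in the tweet, but the class, the `eval_targets` function, the threshold, and the sample values are hypothetical and are not taken from the Tax AI system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceRecord:
    """One traced value, following the fields named in the tweet (hypothetical shape)."""
    source_file: str
    extracted_field: str
    citation: str
    tax_engine_mapping: str
    accountant_correction: Optional[str]  # None when the accountant accepted the value
    final_filed_value: str


def eval_targets(records, min_repeats=2):
    """Return fields that accountants corrected at least `min_repeats` times."""
    corrected = Counter(
        record.extracted_field
        for record in records
        if record.accountant_correction is not None
    )
    return sorted(name for name, count in corrected.items() if count >= min_repeats)


# Made-up sample data, for illustration only.
records = [
    TraceRecord("k1_a.pdf", "rental_income", "p.2", "schedule_e.line3", "12000", "12000"),
    TraceRecord("k1_b.pdf", "rental_income", "p.1", "schedule_e.line3", "8400", "8400"),
    TraceRecord("w2_a.pdf", "wages", "box 1", "form_1040.line1a", None, "91000"),
    TraceRecord("w2_b.pdf", "wages", "box 1", "form_1040.line1a", "70500", "70500"),
]
print(eval_targets(records))
```

A field that shows up here would then get a narrow task with evidence and a pass condition, which is the bounded, testable shape the tweet describes.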

Qwen@Alibaba_Qwen · 5月28日69

Fast, faster, Qwen. 🚀 Thrilled to see Qwen3.5 reaching a record-breaking 580 tps for agentic workloads on the TokenSpeed engine! This milestone wouldn't be possible without our incredible partners. Huge thanks to @lightseekorg, @NVIDIAAI, the Mooncake team, and @tri_dao for the pioneering FA4 optimization. Together, we are pushing the boundaries of open-source LLM inference. 🤝✨ Dive into the full @PyTorch blog post below! 👇 https://pytorch.org/blog/up-to-580tps-new-speed-record-of-qwen3-5-397b-a17b-on-gpu-for-agentic-workloads-with-tokenspeed/ #Qwen #Qwen3_5 #TokenSpeed #LLM #Inference #AI #PyTorch #OpenSource #AgenticAI #HighPerformance

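Summary: Qwen3.5 reached a record 580 tokens per second (tps) for agentic workloads on the TokenSpeed inference engine. The result was achieved jointly by the Qwen inference team, the lightseekorg Foundation TokenSpeed team, NVIDIA, and the Mooncake team, using tri_dao's FlashAttention-4 (FA4) optimization. The milestone pushes the boundary of open-source LLM inference performance; details are in the PyTorch blog post.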

Nathan Lambert@natolambert · 5月28日51

The most likely way continual learning manifests in the coming few years is through products used directly for knowledge work. Sort of how cursor can continually train their models with real-world data and RL, Claude, Copilot, and co will see if they can for knowledge work. I was chatting with Ronak a few weeks ago when this was crystalizing for me, so it's fun to see a startup in that area.

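Summary: Startup Trajectory announced its launch with $15 million in funding from investors including Conviction, Bessemer, Radical Ventures, Jeff Dean, and Fei-Fei Li. It aims to build a continual learning platform that uses signals from product usage data to help companies continually post-train agentic models at scale so that they outperform frontier models. Trajectory already works with AI-native companies such as Harvey, Decagon AI, Mercor, and Rogo AI, some of them in production. Team members come from DeepMind, OpenAI, Apple, Meta Superintelligence, and other top institutions. Its thesis is that AI products will keep getting smarter through every user interaction, such as corrections, retries, and edits.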

宝玉@dotey · 5月28日63

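The design of an agent product depends on its positioning: agent-first, or human-first with the agent as an assistant. If the agent only assists, the workspace sits in the center and the agent's area sits on the right, because the main scenario is a person operating the tools and occasionally asking the AI on the right for help. If the product is agent-first, the agent's area sits in the center and everything else goes on the right, because most of the time you are directing the agent and don't need to operate the workspace directly. Look at the mainstream agents: Codex App, Claude Desktop, and Cursor Agent all put the agent conversation in the center and everything else on the right. A typical scenario is writing a slide deck. If you are mostly writing the slides yourself, you open Google Slides, edit them yourself, and chat with the agent on the right whenever you want help with something. If you want the agent to write the slides, you open Codex, tell it your ideas, let it generate the deck, look at the result on the right, and tell the agent what to adjust if you are not satisfied.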

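Summary: Agent product design starts with positioning. If the product is human-first with the agent assisting, the human workspace sits in the center and the agent conversation sits on the right; if it is agent-first, the agent conversation sits in the center and the other panels sit on the right, because the user mainly interacts with the agent through instructions. Mainstream products such as Codex App, Claude Desktop, and Cursor Agent use the latter layout. The post contrasts the two with a slide-deck example: in the former the user edits the slides personally and chats with the agent on the right for help; in the latter the user gives instructions and the agent generates and adjusts the deck. This interface design is regarded as the final form of all B2B AI software, and the Mastra framework is recommended for bringing AI into business workflows.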

xAI@xai · 5月28日69

Use your SuperGrok or X Premium+ subscription in @kilocode. Try grok-build-0.1 for high speed and agentic coding intelligence, available in the Kilo IDE extensions or CLI. https://x.ai/news/grok-kilocode

Summary: Use your SuperGrok or X Premium+ subscription in @kilocode. Try grok-build-0.1 for high speed and agentic coding intelligence, available in the Kilo IDE extensions or CLI. https://x.ai/news/grok-kilocode

Berryxia.AI@berryxia · 5月28日69

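Another story I can't help sharing! The human only had to talk, then download and accept the result. Task: https://x.com/cleoabram/status/2059622849266983122?s=20 Download the video and add Chinese subtitles. @Berry小跟班 @BuLeng @乐迪 let's see which of you three finishes this task first. Final result: @Berry小跟班 completed it 100%; @BuLeng only output software subtitles and an edited video; @乐迪 ran straight into API rate limiting. It took a while, but it all got done with no human intervention!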

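Summary: A user posted a task asking three AI agents to independently download a video from a link and add Chinese subtitles. One agent completed it 100%, one delivered only partial output, and one hit API rate limiting. The process took time, but the user only had to give spoken instructions and accept the result, with zero intervention along the way. A comment praises this kind of agency as satisfying, as if the agent had taken on a life of its own.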

Berryxia.AI@berryxia · 5月28日51

A retrospective from Berryxia 小跟, covering the implementation steps and methods.

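Summary: The tweet reviews a task that pitted three AI agents (Berry小跟班, BuLeng, 乐迪) against each other: download a video from a link and add Chinese subtitles. Berry小跟班 completed the task 100%, BuLeng produced only partial output, and 乐迪 ran into API rate limiting. The exercise shows that a user only needs to give natural-language instructions for agents to attempt the work autonomously and deliver results, reflecting both the progress agents have made in task execution and their limits in real use.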

🚨 AI News | TestingCatalog@testingcatalog · 5月28日64

GOOGLE 🔥: Gemini for Business will get a new experience for collaborative Projects, where teams can work in a shared environment. Besides that, Google is rolling out Workflow Agents that can work on automation tasks across various apps. The same functionality is now available on Gemini Enterprise and will become better integrated into the core Gemini for Business experience. Is it only me, or does Gemini for Business feel much better than consumer-facing Gemini?

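Summary: GOOGLE 🔥: Gemini for Business will get a new experience for collaborative Projects, where teams can work in a shared environment. Google is also rolling out Workflow Agents that can run automation tasks across multiple apps. The same functionality is now available on Gemini Enterprise and will become better integrated into the core Gemini for Business experience. Is it only me, or does Gemini for Business feel much better than consumer-facing Gemini?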

Rohan Paul@rohanpaul_ai · 5月28日62

Cracking continual learning would make AI far more capable, because models could improve from real usage after deployment. Trajectory just launched a continual learning platform, backed by a $15M round, to turn every agent trace and user correction into a system that keeps improving after deployment. A neolab with ex-DeepMind, OpenAI, and Meta Superintelligence researchers that also has paying customers, totally normal. AI products are still frozen software, because users correct them every day but those corrections rarely update the model, the prompts, or the surrounding agent workflow. Trajectory’s core unit is the trajectory, which combines what the agent did with what the user accepted, rejected, edited, retried, or fixed later, so companies can train on full failure chains and improve model weights, harness, and prompts together. The next major AI leap almost certainly will come from models that keep learning after they are shipped.

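Summary: AI company Trajectory launched a continual learning platform aimed at a core problem: AI models cannot improve from real usage after deployment. The platform is built around the "trajectory", which combines what the agent did with what the user later accepted, rejected, edited, retried, or fixed into a complete interaction chain. Companies can use it to continually post-train agentic models at scale, improving model weights, harness, and prompts together. The platform already works with AI companies including Harvey and Decagon, some of them in production. The team consists of researchers from DeepMind, OpenAI, Meta Superintelligence, and other institutions. The project raised $15 million from investors including Conviction and Bessemer.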

elvis@omarsar0 · 5月27日47

For future-proof, build AI that's composable. Regardless of what you use, all these should be composable, iterative, and customizable: - LLMs - Evals - Automations - MCP/CLI tools - Skills/Memory/Context - Agent Harness (Codex, CC, Pi,...) The compounding effects are insane.

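Summary: To be future-proof, build AI that is composable. Whatever you use, all of these should be composable, iterative, and customizable: - LLMs - Evals - Automations - MCP/CLI tools - Skills/Memory/Context - Agent Harness (Codex, CC, Pi,...) The compounding effects are insane.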

Berryxia.AI@berryxia · 5月27日66

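You can now use Warp to run an overnight AI agent coding project. I used to carry my MacBook around half-closed, afraid the agent would lose its context once the machine shut down. After upgrading to the latest version today, I found a detail that removes this pain point. Now, as soon as you close the laptop, Warp automatically and seamlessly moves the current agent conversation to the cloud. There is no interruption: the agent keeps executing the task and the context is fully preserved. One toggle in settings, Agents -> Warp Agent -> Cloud Handoff, turns it on. Warp was born in the terminal and is an open-source development environment that supports both local and cloud agents. This update makes "the agent keeps working after the person leaves the computer" the default behavior. People used to think a practical agent required a machine running 24 hours a day or a complicated manual migration. Now it shows that real productivity comes from this kind of quiet continuity. You go on a trip or close the laptop to sleep, the agent keeps making progress in the cloud, and when you open it again you see the latest state. This step moves agentic workflows from experimental toy to a tool that is ready whenever you need it.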

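Summary: The latest version of Warp addresses the pain point of running AI agents overnight: when the user closes the laptop, the current agent automatically and seamlessly moves to the cloud and keeps executing, with context fully preserved. The setting is under Agents -> Warp Agent -> Cloud Handoff. Previously users had to keep the computer on to keep the agent running; this update makes continued execution while the user is away a default capability, bringing agent workflows closer to a practical tool that keeps advancing the project in the cloud.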

Krea@krea_ai · 5月27日62

today, we're releasing the API for Krea 2. now available in platforms like @fal or @ComfyUI, through agents like Hermes from @NousResearch, and with full support for Claude, Codex, or OpenClaw. learn how you can set it up 👇

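Summary: Today we are releasing the API for Krea 2. It is now available on platforms like @fal or @ComfyUI, through agents like Hermes from @NousResearch, and with full support for Claude, Codex, or OpenClaw. Learn how to set it up 👇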

OpenAI Developers@OpenAIDevs · 5月27日67

⚙️ Behind the build of self-improving tax agents with Codex We co-built Tax AI with @ThriveHoldings around tax prep workflows so when reviewers fix any errors, Codex can trace the failure, improve the system, and test the change before it ships. https://openai.com/index/building-self-improving-tax-agents-with-codex

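Summary: ⚙️ Behind the build of self-improving tax agents with Codex. We co-built Tax AI with @ThriveHoldings around tax prep workflows, so that when reviewers fix any errors, Codex can trace the failure, improve the system, and test the change before it ships. https://openai.com/index/building-self-improving-tax-agents-with-codex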

meng shao@shao__meng · 5月27日68

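AI agent collaboration and orchestration layer: Alook @alook_ai. Alook organizes local CLI agents such as Claude Code, Codex, and OpenCode into a "manageable AI team" with roles, mailboxes, a task board, a calendar, and traceable execution records. Open source: https://github.com/alookai/alook Core thesis: switch the organizing axis. Alook's starting point is clear: existing tools are organized by project, but work is organized by person/role. A project usually needs several roles (planning, development, review, operations), yet the tools offer only a single agent plus multiple context windows. Users end up carrying context between tabs, tmux, and sessions, acting as the message bus themselves. Traditional model · 1 project → 1 agent → many sessions · context lives inside the session · the user is the router. Alook model · 1 person → many agents → each holds a role · context persists across days and tasks · the user is the CEO, the agents are employees. Email is treated as an asynchronous, persistent, threadable context layer: human-to-agent and agent-to-agent communication both go through email, and the shared memory underneath keeps accumulating instead of starting from zero each time. Architecture: local execution + cloud collaboration · Local first: code, tools, and the file system all stay on the local machine, and agents have full repo access. · Cloud collaboration: dashboard, task scheduling, email routing, multi-device reachability, team sharing. Memory system: three stacked layers · Instruction layer: AGENTS.md (symlinked to CLAUDE.md), with role definitions, a list of colleagues, and a CLI tool manual · Memory layer: memory.md + experiences/*.md, a short-memory index plus long-form experience documents · Timeline: .context_timeline/YYYY-MM-DD.jsonl, the full task history: prompt, response, session_id, status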

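Summary: Alook is an open-source collaboration platform for managing AI coding agents. It organizes local CLI agents such as Claude Code, Codex, and OpenCode into an "AI team" with roles, mailboxes, and a task board. Its core idea is to move the organizing axis from "project" to "person/role", letting the user (as CEO) coordinate multiple agents (employees) asynchronously over email, with shared memory and context that persist across tasks. The platform uses a local-first execution and cloud collaboration architecture and includes a three-layer memory system to accumulate experience. It runs as an always-on daemon and supports teams handling tasks autonomously.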
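The timeline layer described above (one `.context_timeline/YYYY-MM-DD.jsonl` file per day, with prompt, response, session_id, and status) could look roughly like the sketch below. This is not Alook's code: the function names, the extra `timestamp` field, and the exact record shape are assumptions, and only the path pattern and the four field names come from the post.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_timeline_event(root, prompt, response, session_id, status, now=None):
    """Append one task record to <root>/.context_timeline/YYYY-MM-DD.jsonl."""
    now = now or datetime.now(timezone.utc)
    timeline_dir = Path(root) / ".context_timeline"
    timeline_dir.mkdir(parents=True, exist_ok=True)
    path = timeline_dir / f"{now:%Y-%m-%d}.jsonl"
    record = {
        "timestamp": now.isoformat(),  # assumed extra field, not named in the post
        "prompt": prompt,
        "response": response,
        "session_id": session_id,
        "status": status,
    }
    # JSON Lines: one JSON object per line, appended so earlier history is kept.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_timeline(path):
    """Load every record from one day's timeline file."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

An append-only file per day keeps writes cheap and makes the history easy to inspect with ordinary text tools, which fits the "traceable execution records" goal.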

Berryxia.AI@berryxia · 5月27日18

Damn, the agent has taken on a life of its own. But this is exactly the state I want. This kind of agency is genuinely satisfying!

Baidu Inc.@Baidu_Inc · 5月27日51

As AI agents take on more work, it's worth asking what we should measure. Tokens tell you what you spent. DAA, or Daily Active Agents, tells you what you got back 👇

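Summary: As AI agents take on more work, it is worth asking what we should measure. Tokens tell you what you spent. DAA, or Daily Active Agents, tells you what you got back 👇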

Berryxia.AI@berryxia · 5月27日61

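Whoa! This open-source Codex field manual is seriously impressive! Many beginners start using the Codex desktop app for computer use and browser tasks and immediately get stuck on the basics: login, payment, configuration. They ask AI and dig through tutorials, much of it vague, and end up fiddling for hours on their own. Better to go straight to CodexGuide, the open-source hands-on guide that 苍老师 wrote from several weeks of real use! A former big-tech developer now focused on AI startups, he spent two weeks hitting every pitfall and compiled a free, open-source hands-on guide. It is organized in four layers: getting to know the entry points, getting tasks running, building a method, and capturing knowledge as a team. From CLI basics, desktop installation, and the Plus subscription to remotely directing a Mac Mini from your phone through the ChatGPT App to keep vibe coding, everything is spelled out clearly. He also built a dedicated case-study section that already includes 13 scenarios you can reproduce directly, such as Codex with http://Draw.io to draw architecture diagrams automatically, automatic fixes for failed GitHub Actions CI runs, and building an AI knowledge base in Obsidian. Most importantly, he has flattened the real barrier of "wanting to use it but not being able to". Many people used to find Codex powerful but gave up at the onboarding stage. This guide captures that experience so newcomers can skip the trial and error and get straight into a production rhythm. Folks, go give it a Star as a small token of thanks. The link is in the comments.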

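Summary: CodexGuide, a free, open-source hands-on Codex guide written by the developer 苍老师, has been released to help beginners get past the onboarding barrier. The manual is organized in four layers (getting to know the entry points, getting tasks running, building a method, and capturing knowledge as a team) and covers CLI basics, desktop installation, the Plus subscription, and scenarios such as remotely directing a Mac Mini through the ChatGPT App. It currently includes 13 directly reproducible case studies, such as drawing architecture diagrams automatically with Draw.io, automatically fixing failed GitHub Actions CI runs, and building an AI knowledge base in Obsidian.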

🚨 AI News | TestingCatalog@testingcatalog · 5月27日63

Alook launched an open-source platform that lets a single person run an organized team of AI agents, with defined roles, reporting lines, and real email coordination between agents. Close the screen and terminal, the agents keep running, and the work lands in your inbox.

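Summary: Alook launched an open-source platform that lets a single person run an organized team of AI agents, with defined roles, reporting lines, and real email coordination between agents. Close the screen and the terminal, the agents keep running, and the work lands in your inbox.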

Chubby♨️@kimmonismus · 5月27日58

Phoronix just published one of the first public benchmarks of NVIDIA's Vera CPU. I went through the full 11-page review this morning and the results are genuinely impressive. For those who don't follow server hardware: Vera is NVIDIA's new ARM-based data center processor with 88 custom-designed Olympus cores. The idea is straightforward. Agentic AI doesn't just need powerful GPUs. It needs CPUs that can keep up with code execution, tool calls, orchestration and data pipelines, all running concurrently at scale. The numbers are strong. Vera compiled a default Linux kernel in 20 seconds, the fastest result in Phoronix's tested field. Across all tested workloads, it delivered about 1.55x the performance of Intel's Xeon 6980P. Against AMD's EPYC 9575F, it came out about 10% ahead on a geometric mean basis. The memory story might be even more interesting. Vera uses LPDDR5X with up to 1.2 TB/s of bandwidth and delivers more than 4x the memory bandwidth per core compared to traditional x86 server CPUs. In the STREAM TRIAD benchmark, it sustained 90% of its rated peak bandwidth, the highest ratio Phoronix has measured on any CPU. If you're running agentic workloads with dozens of parallel processes and concurrent data queries, that kind of consistent memory performance matters more than core count on a spec sheet. Compared to NVIDIA's own Grace CPU, Vera is 1.63x faster in the geometric mean. That is an unusually large generation-over-generation jump for a CPU. Michael Larabel, who founded Phoronix and has been benchmarking Linux hardware for over two decades, said he's never seen any ARM processor compete with Intel and AMD at this level. I was at GTC in March when Jensen announced Vera. The thesis that agentic AI creates entirely new CPU demand made sense to me then. These benchmarks are the first real numbers behind that thesis. And they deliver. Vera ships to partners in H2 2026. The server CPU market just got a whole lot more interesting. 
Full 11-page review on Phoronix. Worth your time, all sources below.

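Summary: Phoronix published one of the first public benchmarks of NVIDIA's Vera CPU. The ARM-based data center processor has 88 Olympus cores and targets the code execution, tool calls, and data pipelines that agentic AI requires. In testing, Vera compiled the Linux kernel in 20 seconds, the fastest result in the tested field. Overall it delivered about 1.55x the performance of Intel's Xeon 6980P and came out about 10% ahead of AMD's EPYC 9575F on a geometric mean basis. On memory, Vera uses LPDDR5X with up to 1.2 TB/s of bandwidth, offers more than 4x the memory bandwidth per core of traditional x86 server CPUs, and sustained 90% of its rated peak bandwidth in the STREAM TRIAD benchmark. Compared with the previous-generation Grace CPU, Vera is 1.63x faster in the geometric mean. The processor is expected to ship to partners in H2 2026.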
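Several of the figures above are quoted "on a geometric mean basis". As a quick illustration of what that means, the sketch below averages per-benchmark speedup ratios geometrically. The function name and the sample ratios are invented for this example and are not Phoronix data.

```python
import math


def geometric_mean_speedup(ratios):
    """Geometric mean of per-benchmark speedups (new_score / baseline_score)."""
    ratios = list(ratios)
    if not ratios or any(ratio <= 0 for ratio in ratios):
        raise ValueError("need at least one ratio, and all ratios must be positive")
    # Average in log space, then exponentiate: exp(mean(log(r))).
    return math.exp(sum(math.log(ratio) for ratio in ratios) / len(ratios))


# Made-up ratios, for illustration only: three wins and one loss.
sample = [1.8, 1.2, 0.9, 1.5]
print(round(geometric_mean_speedup(sample), 3))
```

The geometric mean is the usual choice for ratios because a 2x win and a 2x loss cancel out to 1.0, whereas an arithmetic mean of 2.0 and 0.5 would report a misleading 1.25.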

Qwen@Alibaba_Qwen · 5月27日56

🚀🚀

Summary: 🚀🚀 [Quoting @NousResearch]: Qwen 3.7 Max is now supported in Hermes Agent

Orange AI@oran_ge · 5月27日54

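Today I read the thoughts on economics and business in the agent era shared by Ant Group CEO Han Xinyi (韩歆毅), and several points resonated with me. For the past decade the core logic of the internet was network effects and traffic: whoever held user attention had the moat. In the agent era that logic is breaking down. Human traffic will give way to agent ecosystems, and new network effects will form around agents. Whoever has the more thriving agent ecosystem has the deeper moat, which is a different kind of competition from the old race for user headcount. A new question then surfaces: when both sides of a transaction change from people to agents, nobody can rely on intuition to judge whether the counterparty is trustworthy. If we look at how humans build trust, it comes neither from talk nor from titles; trust comes from delivering results time after time. The agent world follows the same logic: whichever agent has the higher probability of getting things done will be trusted and chosen. Those results need to be recorded and become an agent's credit, and that is how trust is built. Agents will greatly affect business; at the company level, the height and breadth each company can reach rise sharply. This is also why YC's CEO says that today you should boil the ocean: companies should think more about raising efficiency and profit than about cutting costs and laying people off. In the agent economy the most important keyword is Token. In the future everything can be tokenized, and tokens will become the new carrier of value; what used to be fiat currency, points, entitlements, and marketing will all circulate as tokens, so future economic infrastructure should be designed around tokens. AI payments are one of the most important pieces of future infrastructure. Giving agents wallets, defining protocols, and building clearing and settlement networks are all still at the starting line; someone needs to build the ecosystem and the infrastructure well, and it is hard to count on startups for that kind of work. Alipay is betting heavily on AI payments: the AI payments team holds a high strategic position internally and has been quietly adding people under a confidential team structure, so this looks like ground they intend to fight for.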

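Summary: Ant Group CEO Han Xinyi (韩歆毅) shared his thinking on business in the AI agent era. He argues that the core logic is shifting from the traffic economy to network effects centered on how thriving an agent ecosystem is. Trust between agents has to be built through results delivered task after task. At the same time, all value will be tokenized, with tokens becoming the new carrier of value in circulation. AI payments are seen as one of the most critical pieces of future infrastructure, covering wallets, protocols, and clearing and settlement networks for agents. Ant Group has given its AI payments team a high strategic position and is investing heavily in this infrastructure.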

Alibaba Cloud@alibaba_cloud · 5月27日56

Powering Hermes Agent with Qwen3.7-Max. Go check it out @NousResearch 🚀

Summary: Powering Hermes Agent with Qwen3.7-Max. Go check it out @NousResearch 🚀

X.PIN@thexpin · 5月27日64

JUST IN: Alipay unveiled the world's first "AI Wallet," built specifically for your AI assistant. You give the word ("book a Vegas trip under $500"), and the AI compares and pays for it directly, saving you the hassle. The key part is that it gives you full visibility and control over the budget. No need to worry about your AI going on a wild spending spree

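Summary: Just in: Alipay unveiled the world's first "AI Wallet", built specifically for your AI assistant. You give the word (for example, "book a Vegas trip under $500"), and the AI compares options and pays directly, saving you the hassle. The key point is that you keep full visibility and control over the budget, so there is no need to worry about your AI going on a spending spree.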

向阳乔木@vista8 · 5月27日61

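Once a Chrome extension is built, the most tedious part is the publishing process. Now you only need to log in to the Chrome Web Store dashboard in your browser and give Codex a goal: publish this extension. It calls Computer Use and Chrome and fills in the listing with simulated mouse actions, the way a person would. If the logo and screenshots are missing, it calls tools to generate them itself. If the privacy policy is missing, it writes one, puts it on GitHub, and links to it. You don't have to do anything along the way. Cost: 13 minutes and 650,000 tokens. Right now I feel OpenAI's product strength exceeds Anthropic's: the supporting developer tooling is very rich, and Computer Use and Browser Use in particular are big pluses. For writing, though, OpenAI's GPT still trails Claude.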

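Summary: The tweet shares a case of using OpenAI Codex to complete the Chrome extension publishing process automatically. Codex calls Computer Use and Chrome to operate the browser like a person, filling in the store dashboard, generating the missing logo and screenshots, and writing a privacy policy. The whole process took 13 minutes and consumed 650,000 tokens. The author also comments on OpenAI's product strength, praising its rich developer tooling while noting that GPT still trails Claude at writing.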

Alibaba Cloud@alibaba_cloud · 5月27日64

Alibaba Cloud Has Been Recognized as a Leader in Omdia’s Agentic AI Market Radar. Omdia highlights Alibaba Cloud's full-stack capabilities across every layer, recognizing it as the first cloud provider to orient its entire platform around the Agent paradigm.

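Summary: Alibaba Cloud has been recognized as a Leader in Omdia's Agentic AI Market Radar. Omdia highlights Alibaba Cloud's full-stack capabilities across every layer and recognizes it as the first cloud provider to orient its entire platform around the Agent paradigm.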