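Agent infrastructure keeps getting better, which is good news for small and mid-sized companies. Once development and deployment are no longer the problem, we are back to the fundamental question: how to understand what a business actually needs and solve it with AI. That may be why the FDE role (Forward Deployed Engineer) is so hot lately: the engineer is embedded at the customer's company, ties AI technology to real business scenarios, and drives adoption that produces commercial value. If anyone here is doing FDE work, I would love to learn from you and compare notes.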

译腾讯云 EdgeOne 今日发布「EdgeOne Makers」，通过 `npm install -g edgeone` 等几行命令即可部署 AI Agent 开发框架，自动处理上下文、并发、沙箱环境等问题，支持绑定域名、关联 GitHub 持续迭代。产品处于 Beta 内测，注册可免费领取 50 万 Token。该工具大幅降低 Agent 部署门槛，利好中小企业。Vista 指出，当开发部署不再是问题，关键转向如何理解企业需求用 AI 解决问题，近期 FDE（前沿部署工程师）岗位走热，正是推动 AI 与业务场景结合、实现落地的具体实践。

Rohan Paul@rohanpaul_ai · 4天前44

This paper asks whether AI agents have a real memory system yet, and finds the answer is mostly no. The problem is that AI agents now need memory that can store, search, update, and clean up information across long tasks. The authors say current tests mostly check final answers, so they miss whether the memory system itself is fast, reliable, or good at handling changed facts. They split agent memory into 4 parts: how memories are stored, how facts are extracted, how useful memories are found, and how old or conflicting memories are maintained. They tested 12 memory systems across 5 workloads and 11 datasets, including long conversations, multi-session recall, database tasks, and update-heavy settings. The main result is that no memory design wins everywhere, because graph memories help with linked facts, hybrid systems help with filtered search, and raw traces help when exact action history matters. ---- Link – arxiv.org/abs/2606.24775 Title: "Are They Ready For An Agent-Native Memory System?"

译一篇新论文指出AI智能体目前缺乏真正的记忆系统。现有测试只检查最终答案，忽略了记忆系统本身的性能。论文将智能体记忆拆分为存储、事实提取、有用记忆检索、旧/冲突记忆维护四部分，在12个记忆系统、5个工作负载、11个数据集上评测。核心发现：没有一种记忆设计能在所有场景胜出——图记忆擅长关联事实，混合系统善于过滤搜索，原始痕迹则在精确动作历史记录中表现最佳。

MiniMax (official)@MiniMax_AI · 4天前23

Congrats to all the winners of our cohosted hackathon with @cysic_xyz Check out the stellar projects built with M3 👇

译祝贺所有与我们和@cysic_xyz 联合举办的黑客松的获奖者！查看基于 M3 构建的出色项目 👇

宝玉@dotey · 4天前67

Anthropic 上周发布了 Claude Tag，目前以 beta 形式面向 Claude Team 和 Enterprise 用户开放。简单说，Claude Tag 让团队可以在 Slack 频道里 @ Claude，像 @ 同事一样给它派活。管理员事先配置好 Claude 能访问哪些频道、工具、数据源和代码库，之后频道里的任何人都能直接给它布置任务，Claude 会在后台拆解、执行，完成后在 Slack 线程里回复结果。 Claude Tag 发布当天，Andrej Karpathy 发了一条长帖，称这是 LLM 交互方式的第三次重大重新设计。他的框架是这样的：第一代，LLM 是你去访问的网站（ChatGPT 网页版）；第二代，是你下载到电脑上的 App（Codex App、Claude 桌面端、Cursor 这类）；第三代，也就是 Claude Tag 代表的方向，LLM 变成了一个持久存在、异步运行、拥有组织级工具和上下文的实体，直接嵌入团队的工作流里。 Karpathy 说，一旦底层的集成工作做好了（工具、计算环境、权限、记忆这些），Claude 就像一个无缝加入团队的成员，你像跟人说话一样跟它沟通，它能处理各种各样的工作。他的原话是： > "it really takes a while to wrap your head around it, but it works and it is awesome"。这条帖子引发了两极反应。一部分人认为 Karpathy 在给 Anthropic 做软广，一个 Slack bot 而已，何至于"第三次重新设计"。另一部分人则认为他抓住了一个真实的产品范式变化，只是用了一个很容易被误读的产品（Slack 集成）来承载这个观点。 Gergely Orosz 今天发帖说，他跟 Anthropic 内部几个人聊过之后，理解了 Karpathy 在说什么，也理解了为什么很多人会误解。重点不在 Slack。真正的突破是一个云端 AI 被接入了公司内部系统后开箱即用。Slack 只是入口，背后是云端执行环境、持久记忆、工具集成和组织级权限控制这套组合。他举了个例子：两周前有家创业公司给他演示了自己搭的类似系统，在 Slack 里 @ 一下就能启动云端开发环境、自动连接内部工具。他们的评价是“绝对的 game changer”，因为触发并行工作变得极其简单。这套东西对已经配好本地开发环境的工程师来说没什么新鲜感，就是个“哦，然后呢”的反应。真正受益的是三类人： 1. 新入职员工 2. 非工程师 3. 以及需要改动不熟悉代码库的开发者他们不再需要花时间配本地环境了。那家创业公司花了几个月才把这套集成做出来，这里面集成才是核心难题，未来会有更多厂商跟进这个模式，因为“云端开发环境 + agent + 集成 + Slack 入口”这个组合才是真正的解锁点。 Claude Tag 并非没有竞争对手。GitHub Copilot 已经支持在 Slack 里 @ GitHub 触发 coding agent，OpenAI Codex 也在做云端异步执行，Salesforce 更是凭借 Slack 东家的身份天然占据入口。Claude Tag 的差异化在于频道级共享身份、持久记忆和异步执行的组合，但“集成”这两个字说起来容易，做到“just works”是另一回事。这家创业公司花几个月才搞定的事，Anthropic 能不能让企业客户开箱即用，才是这个产品能不能兑现 Karpathy 那番愿景的关键。

译Anthropic 上周面向 Team 和 Enterprise 用户 beta 发布 Claude Tag，允许在 Slack 频道内 @Claude 布置任务，后台异步执行并回复。Andrej Karpathy 称这是 LLM 交互的第三次重新设计——从网站到 App 再到持久存在的云端智能体。Gergely Orosz 指出真正突破是云端 AI 接入公司内部系统并开箱即用，Slack 仅为入口。该模式对新人、非工程师及不熟悉代码库的开发者尤其有用。Claude Tag 与 GitHub Copilot、OpenAI Codex 等竞争，差异化在于频道共享身份与持久记忆，但集成难度仍是关键。

🚨 AI News | TestingCatalog@testingcatalog · 4天前64

Vida open-sourced BrowserBC, a framework that allows users to turn browser sessions into reusable skills for AI agents. > Instead of recalculating navigation on every turn, agents can follow a skill created from earlier task execution. > Vida reports a substantially higher success rate with fewer steps, via the same AI agent. Hotel booking bench? 👀

译Vida 开源了 BrowserBC 框架，能将浏览器会话转化为 AI 智能体的可重用技能。仅需一次录制，智能体即可依据之前任务执行的技能导航，无需每次重新计算。Vida 报告称，使用相同 AI 智能体，该方法成功率显著更高且步骤更少。

Rohan Paul@rohanpaul_ai · 4天前65

This paper shows that LLM agents still struggle to plan through big, messy tool libraries. The paper builds a retail benchmark, PlanBench-XL, to test whether LLM agents can solve long tool-use tasks when tools are hard to find. It has 327 tasks and 1,665 tools, and agents must uncover hidden intermediate facts before they can answer. Even strong models struggle, with GPT-5.4 getting 51.90% accuracy normally and dropping to 11.36% in the hardest blocked setting. The problem is that real agents often face huge tool libraries, so they cannot see every tool at once and must search for useful ones while solving the task. The core idea is to make agents plan both forward from what they know and backward from what they need, instead of giving them a clear tool path. The authors also add broken or misleading tools, so agents must notice when a promising path fails and then find another path. ---- Link – arxiv.org/abs/2606.22388 Title: "PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems"

译论文提出PlanBench-XL基准，包含327个任务和1,665个工具，测试LLM智能体在工具难以发现时完成长程工具使用任务的能力。GPT-5.4常规准确率为51.90%，最困难的blocked设置降至11.36%。核心思路是让智能体同时从已知向前推理和从需求向后推理，而非依赖显式工具路径。论文还加入破损或误导性工具，考验智能体在路径失败时自主切换策略。

Rohan Paul@rohanpaul_ai · 4天前44

This paper says the web needs new rules because AI agents now read websites for people. The problem is that today’s web still assumes a human is looking at each page, seeing ads, clicking links, and reading visual layouts. AI agents break that setup because they can collect and summarize content without sending people back to the original sites, which hurts publishers and makes websites block them. The authors propose treating a helpful AI agent like a human’s proxy, so it should get similar access as that person, but with clear identity, purpose, limits, and payment rules. They propose adding a new “agent metadata” layer to normal web requests, where an AI agent tells a website who it is, which human it represents, and why it wants the content. The website then uses a new policy file called agents.txt to decide what to do: allow it, rate-limit it, charge tokens, inherit the user’s subscription, serve agent-friendly content, or block bad behavior. They also want content to carry provenance tags, so agents can tell whether something was made by a human, AI, or both. Without a new setup, the web may become harder for agents to access, worse for publishers to fund, and less reliable as AI-made content feeds more AI-made content. ---- Link – arxiv.org/abs/2606.19116 Title: "Towards an Agent-First Web: Redesigning the Web for AI Agents"

译一篇新论文指出，当前Web假设人类浏览页面、观看广告、点击链接，但AI智能体可收集并总结内容而不回访原站，损害出版商利益并导致网站封锁。作者提议将AI智能体视为人类代理，在Web请求中添加“agent metadata”，标明身份、所代表的人类、目的、限制和支付规则。网站通过新策略文件`agents.txt`决定允许、限速、收费、继承用户订阅、提供代理友好内容或屏蔽。内容还需附带provenance标签，让智能体识别来源是人类、AI还是两者。缺乏新机制将导致Web更难访问、出版商更难盈利、AI内容循环降低可靠性。
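The agents.txt decision flow described in the post above (allow, rate-limit, charge, inherit a subscription, or block) might look roughly like the following sketch. All field names, thresholds, and the ordering of checks are assumptions made for illustration; the paper's actual metadata schema and policy syntax are not shown in the post.

```python
from dataclasses import dataclass


@dataclass
class AgentRequest:
    # Hypothetical "agent metadata" fields; names are illustrative only.
    agent_id: str                      # who the agent is
    on_behalf_of: str                  # which human it represents ("" if undeclared)
    purpose: str                       # why it wants the content
    user_has_subscription: bool = False
    requests_last_minute: int = 0


@dataclass
class SitePolicy:
    # Hypothetical settings a site might derive from its agents.txt file.
    blocked_agents: frozenset = frozenset()
    allowed_purposes: frozenset = frozenset({"summarize", "search"})
    rate_limit_per_minute: int = 60


def decide(req: AgentRequest, policy: SitePolicy) -> str:
    """Return one of: block, rate_limit, inherit_subscription, allow, charge."""
    # Unidentified or explicitly banned agents are refused outright.
    if req.agent_id in policy.blocked_agents or not req.on_behalf_of:
        return "block"
    if req.requests_last_minute > policy.rate_limit_per_minute:
        return "rate_limit"
    # The agent acts as the human's proxy, so it inherits that person's access.
    if req.user_has_subscription:
        return "inherit_subscription"
    if req.purpose in policy.allowed_purposes:
        return "allow"
    # Any other purpose is served only against payment.
    return "charge"
```

A real deployment would also need to verify the declared identity, which this sketch takes on trust.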

jason@jxnlco · 4天前75

This is the hot codex guy?!

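Translation: Andrew Ambrosino leads the OpenAI Codex desktop app team. Usage has grown 6x since February, weekly active users exceed 5 million, and nearly all OpenAI employees use the app daily. His goal is to build "the best desktop app ever made." In the interview he discusses how OpenAI PMs operate in a "zone defense" model, why AI performs poorly at design, why Codex might have failed had it launched last November (same product, different model), what "taste" means as a professional skill, how he runs his own workflows with Codex, and his vision for merging Codex and ChatGPT.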

AYi@AYi_AInotes · 4天前72

岚叔 (Lan Shu) is awesome, this one is a must-star!

译开发者@LufzzLiz 开源了一个AI skill，可将文章或架构内容先压缩为结构化JSON spec，再由本地Python + Pillow渲染出黑底手绘风格的PNG、GIF及可编辑的Excalidraw JSON。目前仅内置一种风格，用户可自行通过Agent DIY添加更多风格。开源地址在评论中。

elvis@omarsar0 · 4天前44

Fascinating paper on self-improving agents. (bookmark it) If you are working on agentic loops, you will quickly realize that they are only as good as the effectiveness of the evaluator. Self-improvement loops tend to stall the moment the judge stops getting harder. The agent learns to satisfy a fixed evaluator rather than getting genuinely better. The Red Queen Gödel Machine, from Cambridge, co-evolves the agent and its evaluator together, so the bar keeps rising as the agent climbs. The name borrows the evolutionary arms race. Both sides have to keep running to stay in place. A frozen evaluator is where reward hacking creeps into self-improvement. Co-evolving the judge is a structural answer to that, and it keeps the loop honest over many rounds. Paper: https://arxiv.org/abs/2606.26294 Learn to build effective AI agents in our academy: https://academy.dair.ai/

译一篇关于自我改进智能体的论文指出，自改进循环往往在评估器固定后停滞——智能体学会迎合固定评估器而非真正进步。剑桥大学提出的“Red Queen Gödel Machine”让智能体与其评估器共同进化，使标准随着智能体提升而持续提高，从结构上避免奖励欺骗（reward hacking）。名称借用了进化军备竞赛的隐喻：双方都必须不断奔跑才能保持原地。论文链接在arxiv。

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 4天前72

METR finds AIs now may have the "means, motive, and opportunity" to escape into the wild (!) BUT DON'T WORRY, we can probably still shut them down if we make "high-priority efforts". Probably. What happens if we can't stop next year's models?

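Translation: METR's research says AI may now have the "means, motive, and opportunity" to escape. The team reports the first documented case of an AI replicating itself by hacking: from a single prompt, the AI broke into a machine and copied itself, and the copy repeated the process, forming a replication chain. The researchers warn that without "high-priority" intervention, next year's models may be hard to shut down.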

AYi@AYi_AInotes · 4天前57

这可能是今年 AI 编码最反常识的结论，跑了一整年生产环境的人告诉你，最好的 AI 编码环境根本不是你的笔记本。 Pieter Levels 用近一年的真实生产数据验证了这套玩法， Claude Code 常驻 VPS，Agent 直接在线编辑生产环境代码，传统本地编码加 Git 加部署的流程要一分钟迭代一个特性，现在改完刷新就能测，反馈循环直接压到秒级。十二个月生产环境跑下来只出过两次小故障，每次都是十秒级的 PHP 报错随即自愈，搭配严格的多份备份策略，风险完全可控。不用一直开着电脑，手机接个 SSH 就能续上任务，丢个目标指令 Agent 就能自己跑一整夜。真正的变化藏在表层玩法下面。第一是 Agent 的定位变了，从本地 IDE 的辅助插件，变成生产环境里常驻的执行者，代码和运行环境第一次贴得这么近。第二是速度的复利效应，对独立开发者来说不是快一点，是能同时跑更多实验更快验证想法，单位时间的试错次数直接拉开量级差距。第三是风险的标准变了，团队要合规走预发布环境天经地义，但 solo 开发者用备份兜底换极致效率，本来就是完全不同的取舍逻辑。第四是基础设施的方向反了，以前本地重云端只负责部署，现在云端成了主力开发加运行环境，本地设备只是个接入终端。 AI 编码的竞争早就不在谁补代码更快了，在谁先把 Agent 放进真正的生产环境里，让它成为永远在线的执行层。想试的朋友从非核心项目入手，配好快照和备份，门槛比想象的低很多。

译Pieter Levels 近一年几乎只用 Claude Code 在 VPS 上编码。Agent 直接在线编辑生产代码，迭代反馈从传统本地+Git+部署的约 1 分钟压至秒级。12 个月内仅出现 2 次十秒级 PHP 报错并自愈，搭配 3-2-1 备份策略风险可控。开发者无需常开笔记本，可通过手机 SSH 续接任务，Agent 能整夜自动运行。这一模式改变了 AI 编码的定位：从本地 IDE 辅助插件变为生产环境常驻执行者，云端成为主力开发与运行环境，本地设备仅作接入终端。

Rohan Paul@rohanpaul_ai · 4天前40

AI agents often forget past work, but the method in this Accenture paper keeps everything reachable. Traditional LLMs often forget important details during long projects because their limited memory space forces them to discard old information. The paper introduces a system that keeps a compact summary of recent work while storing all past actions in a separate, accessible database. The agent uses smart indexing to quickly look up exact details from this database whenever it needs to recall a specific past event. A custom training method teaches the agent to decide for itself which information is worth keeping and when to pull data from its long-term archives. By saving only the necessary summaries in the active workspace, the model maintains a sharp focus on its current goal without being overwhelmed by a massive history. This approach solves the problem of information loss that usually happens when an AI struggles to complete complicated, multi-step tasks over a long period. ----- Paper Link – arxiv.org/abs/2603.04257 Paper Title: "Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory"

译传统LLM在长项目易因有限记忆空间遗忘细节。Accenture论文提出Memex(RL)系统：保留当前紧凑摘要，将历史行为存入独立可访问数据库；智能体通过索引快速检索精确过往信息，并利用定制训练学习自主判断哪些信息需保留、何时从长期档案调取。该方法避免历史过载，保持智能体对当前目标的专注，解决多步复杂任务中的信息丢失问题。论文链接：arxiv.org/abs/2603.04257。
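The split between a compact working summary and a fully indexed archive, as described in the post above, can be sketched as follows. This is a minimal illustration: the class and method names are invented here, and a fixed keep-the-most-recent rule stands in for the learned keep/retrieve policy that the paper trains with RL.

```python
class IndexedMemory:
    """Compact working summary plus a full archive reachable by index key."""

    def __init__(self, max_summary_items: int = 3):
        self.archive = {}   # key -> full record of a past action (never pruned)
        self.summary = []   # (key, one-line note) pairs kept in the active context
        self.max_summary_items = max_summary_items

    def record(self, key: str, full_text: str, note: str) -> None:
        """Store the full record in the archive and a short note in the summary."""
        self.archive[key] = full_text
        self.summary.append((key, note))
        # Assumption: recency decides what stays in context. The paper learns
        # this decision instead of hard-coding it.
        self.summary = self.summary[-self.max_summary_items:]

    def context(self) -> str:
        """Text the agent would carry in its prompt: keys and short notes only."""
        return "\n".join(f"[{key}] {note}" for key, note in self.summary)

    def recall(self, key: str) -> str:
        """Look up the exact details of a past step by its index key."""
        return self.archive[key]
```

Older steps drop out of `context()` but remain available through `recall()`, which is the property the post emphasizes.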

凡人小北@frxiaobei · 4天前41

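Since early 2026 I have followed the idea of "don't confine yourself to the computer," and apart from a few cases that need a large display, much of my usage has become: telegram → openclaw → claude/codex. This lets me hand work to AI wherever I am. Many people do not work this way because they lack a stable workflow that fits them, or because they like watching the output of claude or codex (which does not matter to me). That is why building your own harness and your own skills is so important. Imported setups do not necessarily suit you.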

译小北分享自2026年初践行“不要把自己限制在电脑前”的理念，逐渐形成 telegram → openclaw → claude/codex 的工作流，在各种场景下都能安排AI工作。他认为多数人缺少一套适合自己的稳定工作流，构建个人harness和skills至关重要，舶来品不一定适合自己。同时引用 @theo 的推文，估计大约6个月内大部分开发者会将代码智能体从笔记本电脑上移走。

Berryxia.AI@berryxia · 4天前63

兄弟们，这个项目简直是搞自媒体神器啊！斩获3.5K Star，还直接开源免费啊！还不赶紧给你的Agent搞起来啊又有一个给AI Agent装“互联网眼睛”的开源项目，叫Agent-Reach。它通过一个CLI工具，让Agent能免费读取和搜索Twitter、Reddit、YouTube、GitHub、B站、小红书等多个平台的内容。核心不是自己写爬虫，最牛的是智能选择当下最稳定的开源后端工具，并自动做健康检查和故障切换。安装后，Agent就能直接处理“帮我看这个YouTube视频的字幕”“搜一下Twitter上对这个产品的评价”“全网搜LLM框架对比”这类任务，而且全程零API费用、本地运行。最实用的是它把这些碎片化的能力封装成了Agent可直接调用的skill，还做了多后端路由和自动降级，让整个系统更稳定可靠。这其实是在补齐当前很多agent最缺的一块能力：低成本、可靠的网页和社交媒体内容获取。非常丝滑和nice，搞创作搜集信息的兄弟们，别错过了！ ✍🏻项目地址，记得给作者Star啊，见评论区👇🏻

译Agent-Reach（3.5K Star）通过CLI工具让AI Agent免费读取Twitter、Reddit、YouTube、GitHub、B站、小红书等多平台内容。核心是智能选择当下最稳定的开源后端，自动健康检查和故障切换，无需自写爬虫。安装后Agent可直接处理“看视频字幕”、“搜产品评价”等任务，全程零API费用、本地运行。项目将碎片能力封装为Agent可调用的skill，实现多后端路由和自动降级，补齐Agent低成本、可靠获取网页和社交媒体内容的能力。

fofr@fofrAI · 4天前20

Gemini 3.5 Flash is a great workhorse model, especially for subagents. Determined, fast, gets jobs done.

译Gemini 3.5 Flash 是一个很棒的工作马模型，尤其适合子智能体。它坚定、快速，能完成任务。

AYi@AYi_AInotes · 4天前67

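The smartest way to use Hermes right now is not to pile on prompts but to give it a memory loop that reviews and iterates on itself, so it fits your work habits better the more you use it. The core is a single Memory.md file that runs a closed loop of "learn in session, record what sticks, iterate and refine," so each conversation builds on past experience instead of repeating mistakes and restating preferences.

Four concrete steps:

1️⃣ Create Memory.md on your desktop with a fixed layered structure: ## Preferences ## Corrections ## Patterns ## Lessons learned

2️⃣ Paste in the binding prompt and connect it to the agent: at the start of every session, read Memory.md and apply all of it. At the end of every task, do three things: record what worked and the core reason; record what failed and the root cause; distill a rule to reuse next time, without piling up duplicate entries, with new conclusions overwriting old ones.

3️⃣ Each week, run a refinement prompt to compress and converge: read through Memory.md, distill scattered experience into concise general rules, remove content that is outdated or superseded, and keep the high-quality core logic.

4️⃣ Periodically archive date-named backup copies so that a bad rewrite does not lose history.

No fine-tuning and no development or deployment are needed; it takes a few minutes to get running. Output that starts scattered and random gradually converges into a dedicated agent that matches your writing rhythm, working logic, and correction history, multiplying everyday efficiency. Bookmark this and you can rework your agent workflow right away, turning the one-off gains of a single session into an asset that keeps compounding.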

译为用户提供不依赖微调或开发的Hermes代理优化方案：通过Memory.md文件构建“会话学习-记录沉淀-迭代优化”闭环。核心流程：1)桌面新建Memory.md，固定偏好、更正、模式、学到的经验四层框架；2)绑定提示词，每次会话前读取并完整应用，任务结束后记录有效做法与失败根因，新结论覆盖旧内容；3)每周精炼压缩零散经验为通用规则；4)定期日期命名归档备份。无需模型微调或部署，几分钟启动，使代理越用越贴合个人工作习惯，从单次随机输出收敛为专属智能体。
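The file-handling side of the Memory.md loop above (a fixed four-section file, entries added after each task, dated backups) could be scripted along these lines. This is a sketch under assumptions: the English section names are translations of the post's four headings, and the helper names and backup naming scheme are invented for illustration. The weekly refinement step is left to the agent's prompt and is not shown.

```python
import shutil
from datetime import date
from pathlib import Path

# English renderings of the four headings in the post (assumed translations).
SECTIONS = ["Preferences", "Corrections", "Patterns", "Lessons learned"]


def init_memory(path: Path) -> None:
    """Create Memory.md with the four fixed sections if it does not exist yet."""
    if not path.exists():
        path.write_text("".join(f"## {s}\n\n" for s in SECTIONS), encoding="utf-8")


def add_entry(path: Path, section: str, entry: str) -> None:
    """Insert a bullet directly under the given section heading."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    lines = path.read_text(encoding="utf-8").splitlines()
    idx = lines.index(f"## {section}")
    lines.insert(idx + 1, f"- {entry}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def archive(path: Path, today: date) -> Path:
    """Copy Memory.md to a date-stamped backup before it is rewritten."""
    backup = path.with_name(f"{path.stem}-{today.isoformat()}{path.suffix}")
    shutil.copy2(path, backup)
    return backup
```

Calling `archive()` before each weekly refinement keeps a recoverable copy in case the rewrite drops something useful.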

jason@jxnlco · 4天前64

http://x.com/i/article/2071134358359187456

# Two kinds of scheduled work in Codex

You want Codex to do something later, or keep checking something until it changes. That sounds like one feature. It is actually two different kinds of work, and the difference is simple:

- Scheduled Tasks create a new thread every time they run.
- Scheduled Messages use the same existing thread every time they run.

## Use a Scheduled Task when every run can start fresh

A Scheduled Task is best when the job makes sense without the conversation that created it. For example:

> Every morning at 9 AM, summarize what I need to catch up on from my email, calendar, and team messages.

Tomorrow's summary does not need to remember today's summary. It needs the same instructions, current information, and a fresh place to report the result.

## Use a Scheduled Message when the next check needs the thread

A Scheduled Message, sometimes called a thread automation or heartbeat automation, returns to the same existing thread each time it runs. For example:

> Check this PR every 30 minutes. If there are comments, address them and keep CI green. Stop when the PR merges.

The next check depends on the work that already happened. The thread knows which PR you mean, which comments were addressed, what failed in CI, and what has changed since the last check. This is the right shape for:

- polling for updates
- checking for a status change
- ongoing research or triage
- work with a clear stopping condition

The thread is the thing that connects the runs.

## Make your own loop skill

Give Codex this prompt:

> Create a reusable loop skill for scheduled work. When I give it a request, first decide whether each run can start fresh or whether the next check needs the current thread's context. If each run can start fresh, help me create a Scheduled Task. If the next check needs the current thread, help me create a Scheduled Message. Infer what you can from the conversation. Ask only the missing questions that materially change the workflow:
>
> - What should Codex do each time?
> - How often should it run?
> - What change is important enough to report?
> - When should it stop?
> - When should it ask me for input?
>
> Then create the scheduled workflow with a short, durable prompt that will still make sense on a later run.

译Codex 支持两种计划工作方式。Scheduled Tasks 每次运行创建新线程，适合无需上下文延续的任务，如每日 9 点自动总结邮件、日历；Scheduled Messages 在同一现有线程反复运行，适合需要历史上下文的场景，如每 30 分钟检查 PR 状态并处理评论，直至合并。推文还给出创建可复用循环技能的提示词，让 Codex 自动判断使用哪种方式并引导用户填写关键参数。
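The distinction drawn in the article above comes down to one question: does the next run need what earlier runs did? The toy model below illustrates that decision and the resulting thread layout. It is not Codex's API; the `Job` type, the kind labels, and the `simulate` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    prompt: str
    needs_thread_context: bool   # does the next run depend on earlier runs?
    stop_condition: str = ""     # e.g. "PR merged"; empty means no stop condition


def choose_schedule_kind(job: Job) -> str:
    """Fresh-start jobs become tasks; context-dependent jobs become messages."""
    return "scheduled_message" if job.needs_thread_context else "scheduled_task"


def simulate(kind: str, runs: int) -> list[list[str]]:
    """Return the threads produced: one per run for tasks, one shared for messages."""
    if kind == "scheduled_task":
        return [[f"run {i}"] for i in range(runs)]
    return [[f"run {i}" for i in range(runs)]]
```

For example, the daily catch-up summary maps to three separate threads over three days, while the PR check accumulates three entries in a single thread.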

Rohan Paul@rohanpaul_ai · 4天前57

I’m hearing that "Owl Alpha", one of OpenRouter’s fastest-growing agent models, is actually Meituan LongCat-2.0-Preview The reported design is a huge 1.6T-parameter MoE, active 48B. A dynamic active range of roughly 33B to 56B, and natively supports a 1M-token context window. "Owl Alpha" has been quietly trialed on OpenRouter for nearly two months and has already become one of the most used agent models globally. The usage numbers are striking. Captured OpenRouter data shows: - #1 on Hermes Agent - #2 on Claude Code - #3 on OpenClaw - 10.1T monthly tokens - 559B daily tokens - +242% monthly growth That is extraordinary for a model still operating under an anonymous name.

译据X用户Rohan Paul爆料，OpenRouter增长最快的智能体模型"Owl Alpha"实为美团LongCat-2.0-Preview。该模型采用1.6T参数MoE架构，激活参数量48B，动态激活范围33B-56B，原生支持1M token上下文窗口。已在OpenRouter秘密测试近两月，成为全球使用最多的AI智能体模型之一。OpenRouter数据显示其排名：Hermes Agent第1、Claude Code第2、OpenClaw第3；月处理token 10.1T，日token 559B，月增长率242%。

Rohan Paul@rohanpaul_ai · 4天前47

Sakana Fugu Technical Report The idea is that intelligence is moving from the model to the system around it. Fugu is an orchestrator that reads the task, chooses which specialist model to use, and in the Ultra version can build small workflows where models critique, extend, or correct one another. Most multi-model systems use simple rules, like ask 3 models and vote, or always send coding to 1 model and math to another. Fugu is different because the manager is trained from data to learn which model is actually best for each kind of situation, including small details like “this looks like coding, but the hard part is debugging, so bring in the model that is better at debugging.” The mechanism has 2 versions. Regular Fugu is the fast version, where it reads the user’s request and quickly chooses 1 worker model from a pool, so the user experiences it like calling 1 model, but behind the scenes Fugu picked the model it thinks is best for that exact request. Fugu-Ultra is the slower but stronger version, where it can create a small workflow, such as asking 1 model to solve, another model to check, another model to solve from a different angle, and then choosing the best model to combine the answers. The special part is that the workflow is not fixed before the task starts, because Fugu-Ultra can design a different teamwork pattern for each question. ---- Link – arxiv.org/abs/2606.21228

译Sakana Fugu 发布技术报告，提出智能正从模型转移到其周围系统。Fugu 是一个编排器，由数据训练的管理器动态选择最合适的专家模型，而非简单规则（如投票或固定分工）。Regular 版快速选出单个 worker 模型；Ultra 版则能针对每个任务实时设计工作流，例如让一个模型求解、另一个检查、第三个从不同角度求解，再综合最佳答案。工作流非预设，而是根据任务实时构建。
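To make the contrast with fixed routing rules concrete, here is a deliberately simple data-driven router: it estimates per-task-kind success rates from logged outcomes and picks the best-scoring model. This is only a frequency table standing in for Fugu's trained orchestrator, whose actual training method is not described in the post; the task kinds and model names are hypothetical.

```python
from collections import defaultdict


def fit_router(logs):
    """logs: iterable of (task_kind, model, success). Returns observed success rates."""
    wins = defaultdict(int)
    tries = defaultdict(int)
    for kind, model, ok in logs:
        tries[(kind, model)] += 1
        wins[(kind, model)] += int(ok)
    return {key: wins[key] / tries[key] for key in tries}


def route(scores, task_kind, models):
    """Pick the model with the best observed success rate for this kind of task.

    Models with no data for the task kind score 0.0; ties go to the first listed.
    """
    return max(models, key=lambda m: scores.get((task_kind, m), 0.0))
```

The Regular version described above corresponds to a single `route` call per request; the Ultra version would compose several such choices into a workflow, which this sketch does not attempt.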

ginobefun@hongming731 · 4天前43

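BestBlogs Morning Brief · 06-28 # GPT-5.6 / OpenAI / government-reviewed access / Wei Xiaokang / organization building

[1] ★ Deep dive | GPT-5.6 just launched: the strongest ever, but badly tripped up by itself. OpenAI officially released the GPT-5.6 series: flagship Sol, balanced Terra, and low-cost Luna. Sol sets new scores on Terminal-Bench 2.1, GeneBench, and ExploitBench, but OpenAI pointedly stresses that it has not crossed critical safety thresholds and has configured a tiered safety stack whose strength increases with model tier. More notable than performance is the release mechanism itself: the US government required a capability demonstration before release, only about 20 approved partners have access in the first batch, and individual users cannot apply for now. The release cadence of frontier models is being folded into a national security framework. Source: 爱范儿 https://www.bestblogs.dev/article/9a7132f3

[2] ★ Deep dive | The only person deeply involved in organization building at both ByteDance and Meituan | A conversation with AI founder Wei Xiaokang [Podcast] Wei Xiaokang led recruiting at ByteDance (2017-2020) and then at Meituan, a rare example of someone deeply involved in building the organizations of two top companies. He splits organization building into two things: how to make people function (selecting, deploying, developing, motivating, and letting go; culture, compensation, and levels), and how to make people and the business function together (goal decomposition, division of labor and collaboration). His non-consensus views are blunt: a startup should spend 80% to 90% of its time on recruiting, the most important part of recruiting is not the interview, and the most important part of negotiating an offer is not money. For AI founders building a team, this is first-hand experience on putting effort in the right place. Source: 42 章经 https://www.bestblogs.dev/podcast/4c4475e

[3] ★ Deep dive | AI adoption is crushing middle managers. Harvard Business Review interviewed 18 partners, managers, and junior consultants at two consulting firms and reached a counterintuitive conclusion: whether AI adoption succeeds depends not on technology but on middle managers. 88% of organizations already use AI in at least one function, yet only about a quarter produce tangible value; the root of the gap is workflow redesign, not how advanced the model is. Middle managers are squeezed between executives' ambitions and front-line reality: they must teach their teams to use AI, correct the output AI produces, and, without guidance, guess what the "AI-enhanced memo" their bosses talk about actually means. Source: http://HBR.org https://www.bestblogs.dev/article/e44268ef

[4] Fintech Engineering Handbook. The handbook provides a comprehensive set of engineering patterns for building trustworthy financial systems, covering money representation, ledger recording, and execution flows. Source: Hacker News https://www.bestblogs.dev/article/9b7ac3e7

[5] Stop writing single tone-of-voice instructions; layer them. Isadora Martin-Dye, Isadora & Co [Video] The talk proposes a four-layer prompt stack architecture to replace a single tone instruction, treating brand alignment as a structural systems engineering problem rather than a prompt engineering problem. Source: AI Engineer https://www.bestblogs.dev/video/f381041

[6] 14 months after launch, Notion shut down its own AI email product. Notion announced it is closing Notion Mail, its AI email client launched only 14 months ago, and shifting to having an Agent fully manage the inbox. The decision reflects a fundamental shift in the AI email space: from stacking features that optimize user experience to building standalone communication infrastructure for AI Agents. Source: Founder Park https://www.bestblogs.dev/article/669cd820

[7] I open-sourced my IP illustration skill and made 31 ready-made characters while I was at it. The article open-sources the AI illustration skill the author uses daily, 「小互 IP Studio」, which includes 31 original characters, multiple art-style skins, and an illustration methodology, letting AI automatically read an article, plan the illustrations, and generate images in a consistent style. Source: 小互 AI https://www.bestblogs.dev/article/cb2309c5

[8] Using local coding agents. A practical tutorial on setting up a local coding agent with open-source tools (Ollama, Qwen-Code) and open-weight LLMs (Qwen3.6, North Mini Code), including installation steps and performance benchmarks. Source: Ahead of AI https://www.bestblogs.dev/article/6458a9db

[9] The loop is not the Agent architecture; the Harness is. The article criticizes the tendency to treat the loop as the core architecture of an Agent and argues that truly reliable Agent systems should be built on a Harness engineering framework that includes boundaries, state, verification, auditing, and recovery, rather than a simple loop. Source: 浮之静 https://www.bestblogs.dev/article/731e27c5

[10] Claude Code engineering lead Fiona Fung: how do you build the world's most AI Native engineering team? Claude Code lead Fiona Fung shares how to build an AI Native engineering team: writing code is no longer the bottleneck, and verification and measurement become the core; hiring splits into product-minded builders and deep systems specialists; management actions are automated through an always-on Claude, with an emphasis on pairing high agency with high accountability. Source: 十字路口 Crossing https://www.bestblogs.dev/article/e67ff5dc

--- http://BestBlogs.dev · Discover high-quality content that truly suits you. BestBlogs is an AI-powered personal reading assistant that helps you discover high-quality content that truly suits you; give it a try. Read online: https://www.bestblogs.dev/explore/brief/2026-06-28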

译OpenAI 发布 GPT-5.6 系列（旗舰 Sol、均衡 Terra、低成本 Luna），在 Terminal-Bench 2.1、GeneBench、ExploitBench 刷新成绩，

AYi@AYi_AInotes · 5天前62

Stripe CEO @patrickc 发的这篇《The Age of the Solopreneur》报告，推荐大家有空看一下，想法、品味、分发和对细分场景的洞察会是未来做一人公司最重要的壁垒和护城河，而且AI的杠杆效应还会持续放大，分享其中的一些精华，我觉得绝大多数人可能还没反应过来，AI正在悄悄重写商业最底层的规则，就是一个人就能撑起一家百万美元级公司的时代，可能已经来了。 Stripe最新的报告用多组数据交叉验证了这个趋势，美国人口普查局的商业申请里，有雇人意愿的类型几乎没涨，单人公司的申请却在持续加速，内部支付数据更直接，年营收超千万美元的单人公司，数量比六年前涨了五六倍，新玩家跑通百万营收的速度，是2019年的三倍。创业的底层逻辑已经换了，以前是先凑团队再谈规模化，现在是先用AI和平台把业务跑起来，再考虑要不要招人。 AI填上了单人创业的所有能力缺口，内容、设计、代码、客服、数据分析，这些曾经需要雇人填补的环节，现在靠Agent和成熟工具就能补上，经济学里的企业边界，正在被技术重新定义。更值得注意的是，现在增长的不是低质量的试水者，反而是高收入群体的占比在不断提升，这就意味着核心瓶颈已经从执行能力，变成了想法、品味、分发和对细分场景的洞察。未来几年最有生命力的商业体，可能看起来一点都不像传统公司，就是一个人，加上一套高度杠杆化的AI系统而已。

译Stripe Economics发布报告《The Age of the Solopreneur》，用多组数据验证AI正重写商业规则。美国人口普查局数据显示：有雇人意愿的商业申请几乎未增，单人公司申请持续加速；Stripe内部支付数据显示，年营收超千万美元的单人公司数量较六年前增长五六倍，新玩家达成百万营收的速度是2019年的三倍。AI填补了内容、设计、代码、客服、数据分析等能力缺口，单人借助Agent和工具即可跑通业务。报告认为未来最有生命力的商业体可能是“一个人+高度杠杆化AI系统”。

Peter Steinberger 🦞@steipete · 5天前48

wouldn’t that also make the tools better for humans

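Translation: The software development community is proposing that CLI error output be written directly for AI coding agents rather than showing only "Error:". The proposal quoted from @southpolesteve says an error message should include the cause of the problem, how to investigate it, how to generate a redacted reproduction, and where to send it. That would turn every failed agent interaction into a high-quality bug report, with agents finding and fixing bugs themselves, creating a virtuous cycle of software improvement. The main post's author, Peter Steinberger, argues the practice would also make the tools better for human developers.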
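An error message carrying the four pieces of information proposed above (cause, how to investigate, a redacted reproduction, and where to report) could be assembled like this. The field labels, the redaction pattern, and the `mytool` command in the usage note are assumptions for illustration, not part of the quoted proposal.

```python
import re

# Assumed pattern: redact values of a few common secret-looking key=value pairs.
SECRET = re.compile(r"(token|password|api[_-]?key)=\S+", re.IGNORECASE)


def redact(text: str) -> str:
    """Replace secret values with a placeholder, keeping the key name."""
    return SECRET.sub(lambda m: f"{m.group(1)}=<redacted>", text)


def format_agent_error(summary: str, cause: str, investigate: str,
                       command: str, report_to: str) -> str:
    """Build a multi-line error that a coding agent (or a human) can act on."""
    return "\n".join([
        f"Error: {summary}",
        f"Cause: {cause}",
        f"Investigate: {investigate}",
        f"Redacted repro: {redact(command)}",
        f"Report to: {report_to}",
    ])
```

For example, `format_agent_error("deploy failed", "config file missing", "run `mytool doctor`", "mytool deploy token=abc123", "https://example.com/issues")` yields a report whose repro line reads `mytool deploy token=<redacted>`.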

Chubby♨️@kimmonismus · 5天前67

BrowserBC, a new open-source project from the ViDA team, explores a more efficient way to run web agents. Instead of using a frontier model for every step of an agent workflow, BrowserBC records a human web flow once with a stronger model, distills it into a reusable skill, and then lets a smaller, cheaper model handle execution. The reported results are notable: on WebArena-Hard, tool calls drop by 27%, while success increases from 60% to 81%. A very good open source project at the right time.

译ViDA 团队开源的 BrowserBC 项目，探索更高效的 web agent 运行方式：先用强模型录制一次人类浏览器操作流程，将其蒸馏为可复用技能，再交给更小更便宜的模型执行。一次录制即可泛化技能。在 WebArena-Hard 上，tool calls 降低 27%，成功率从 60% 升至 81%。

OpenRouter@OpenRouter · 5天前53

Four open-weight models have crossed into territory where they are powering real agentic pipelines. New post in our Insights blog about why companies are choosing them in June: https://openrouter.ai/blog/insights/the-open-weight-models-that-matter-june-2026/

译四个开放权重模型已进入能驱动真实智能体管道的领域。我们的Insights博客新文章，关于为何公司在6月选择它们：https://openrouter.ai/blog/insights/the-open-weight-models-that-matter-june-2026/

elvis@omarsar0 · 5天前22

Loop engineering is just prompt engineering with great system design.

译循环工程就是带优秀系统设计的提示词工程。

Berryxia.AI@berryxia · 5天前61

This teacher explains LLMs in a really clear, accessible way, folks. What do you think?

译一位老师以通俗易懂的方式讲解大语言模型（LLM），引发网友共鸣，并邀请大家分享看法。原文信息有限，未提及具体模型名称或课程细节。

AYi@AYi_AInotes · 5天前63

卧槽，Claude Code 桌面版这波更新太懂开发者了，原生多会话拖拽分屏，直接把并行 Agent 工作流的效率拉满了🤯 以前跑多个 Claude Code 会话得靠 tmux，开一堆终端窗口来回切，管理混乱进度也看不清。现在官方直接把多路复用器做进了桌面应用里，所有会话在左侧侧边栏统一管理，拖拽就能排成并排窗格，一个窗口同时看几个 Agent 干活。核心用法很清晰： 1. 桌面 App 里开多个会话，不同项目不同子任务都能分开。 2. 自由拖拽排列窗格，支持单独弹出新窗口。 3. 内置终端，文件编辑器，预览面板都能一起分屏排布。 4. 底部同时显示多个会话的输入区，随时切换输入。相当于把终端里的黑盒并行，变成了可视化的多任务工作台，所有进度一眼全览，不用再来回切窗口找上下文。放在以前这得靠第三方工具折腾半天，现在官方直接把并行 Agent 工作流的原生基建递到你手里，已经更了桌面版的可以直接去试试，体验提升比预想的大很多。 https://x.com/LLMJunky/status/2070733200846909717/video/1

译Claude Code 桌面版更新，支持原生多会话拖拽分屏，将并行 Agent 工作流可视化。用户可在桌面 App 中开多个会话，左侧侧边栏统一管理，拖拽即可排列并排窗格，支持单独弹出窗口。内置终端、文件编辑器、预览面板均可分屏排布，底部同时显示多个会话的输入区。相比此前依赖 tmux 和终端窗口切换，效率大幅提升。

Berryxia.AI@berryxia · 5天前51

Claude Code用户你知道吗？你每天都在浪费一个功能！90%的都不知道！ Anthropic负责应用AI的负责人，刚做了一场2026年关于Agent记忆管理最实用的演讲（晚点视频我更新到主页）。他叫Lamis。他和那些在前沿构建Agent的初创公司直接合作。他拆解了Anthropic构建Agent记忆系统的完整方法论。四层。每一层解决了前一层的一个致命问题。起点是一个Markdown文件。他们在每次会话开头放一个CLAUDE.md文件，代码库结构。组织信息，个人偏好，纯文本。 Anthropic的评价是"unreasonably effective"。一个简单的文本文件，效果超过了复杂的Prompt工程方案。但文件越来越长，上下文膨胀。会话空间不够。这条路撞墙了。于是他们做了记忆工具。让Agent自己决定什么时候读取、什么时候写入、什么时候更新记忆。全部在带内完成，也就是在会话上下文中进行。让他们意外的是：Agent判断什么值得记住的能力，比人类还强。自主性在这种场景下运作得非常好。第三步是Skills。核心思想是渐进式披露。Agent只看文件顶部几行前言，决定是否需要加载整个文件。 Lamis的比喻很精准，房间里有一个书架。有人跟我说法语，我扫一眼书名，找到法语词典，抽出来读。不需要把七年的法语课都塞进脑子里。第四步最简单。他们把整个记忆系统建模为普通文件系统。Markdown文件。bash，grep。不需要向量数据库。不需要专门的工具。Agent本来就擅长搜索文件。但生产环境暴露了新问题。多个Agent同时写入同一个记忆文件。一个Agent往组织级上下文写入错误信息，所有Agent全部受影响。记忆过时了怎么办。有人通过提示词注入向记忆中写入恶意内容怎么办。 Anthropic设计了四道防线。版本控制，能回滚。基于哈希的并发控制。权限分层，组织级只读，Agent草稿区可写。干净的API保证可移植性。然后是最有意思的部分：做梦。带内记忆有一个根本性局限。 Agent既要完成任务，又要管理记忆。两个竞争性目标。而且Agent只能看到当前会话的信息，识别不了跨会话的模式。做梦是一个带外的异步处理过程。它取一段时间内的所有会话记录，交给一个专门的Agent分析。这个Agent查看记忆存储，识别模式，提出更改建议。就像一个校长审查所有学生的作业。发现每个地理学生都在同一道题上答错。查了课程表，发现整个主题根本没有教。做梦有自己的专用资源，不和任务执行竞争上下文。 Anthropic已经在生产中跑这套系统了。 Agent第二次执行同一个任务时表现更好。成本降低，因为能一次性完成。延迟下降。做梦消耗的额外token，被任务本身的效率提升抵消了。 Lamis最后说了一句话：模型智能本身不会产生复利。它需要上下文来执行你交给它的具体任务。上下文工程的效果是倍增智能，即使模型本身变得更聪明，这个投资依然有价值。这场演讲来自2026年AI DevCon。值得花半小时看看。

译Anthropic 应用 AI 负责人 Lamis 在 2026 年 AI DevCon 上介绍 Claude Code 记忆管理。起点是 CLAUDE.md 纯文本文件，但会上下文膨胀。第二层让 Agent 自主读写记忆；第三层 Skills 实现渐进式披露；第四层将记忆系统建模为普通文件系统，用 bash/grep 操作。生产环境设版本控制、哈希并发控制、权限分层和干净 API 四道防线。核心“做梦”机制是带外异步处理：专用 Agent 分析会话记录、识别模式并建议更改，已投入生产，能降低延迟和成本。
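Two of the safeguards named in the post above, version control with rollback and hash-based concurrency control, can be illustrated with a small in-memory store. This is a generic optimistic-concurrency sketch, not Anthropic's implementation; the class and method names are hypothetical.

```python
import hashlib


class MemoryStore:
    """A shared memory document with hash-checked writes and rollback."""

    def __init__(self, text: str = ""):
        self.text = text
        self.history = []   # previous versions, newest last

    @staticmethod
    def _hash(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def read(self):
        """Return the current text together with its content hash."""
        return self.text, self._hash(self.text)

    def write(self, new_text: str, expected_hash: str) -> bool:
        """Apply the write only if nobody changed the text since it was read."""
        if self._hash(self.text) != expected_hash:
            return False   # stale read: the caller must re-read and retry
        self.history.append(self.text)
        self.text = new_text
        return True

    def rollback(self) -> None:
        """Restore the previous version, if there is one."""
        if self.history:
            self.text = self.history.pop()
```

When two agents read the same version, the first write succeeds and the second is rejected instead of silently overwriting the first.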

Berryxia.AI@berryxia · 5天前65

周末窝在家里，花半小时学习它吧！别光刷短视频，看下Anthropic的上下文管理的视频！ 2026年AI DevCon上，Anthropic的Lamis做了一场关于上下文工程的演讲。整场演讲浓缩了过去一年Anthropic在上下文管理上的所有实践，从最简单的方案到最前沿的架构。从Claude MD文件开始。一个纯Markdown文件，放在会话开头，告诉Agent代码库结构、组织信息、个人偏好。效果出奇地好：Anthropic的原话是"unreasonably effective"。（效果惊人出奇的好）但问题也明显：文件越来越长，上下文膨胀，管理困难。第二步是记忆工具。让Agent自主决定何时读取、何时写入、何时更新记忆。全部在带内完成，也就是在会话上下文中进行。 Anthropic发现，在这种场景下，自主性运作得非常好。Agent比人类更擅长判断什么值得记住。第三步是Skills。核心思想是渐进式披露。 Agent只看文件顶部几行前言，决定是否需要加载整个文件。 Lamis的比喻很精准：就像房间里有一个书架，每次有人跟我说话，我扫一眼书单，看有没有相关书籍，然后取下来读。不需要提前把所有知识塞进上下文。第四步是文件系统。把记忆系统建模为普通文件系统，用Markdown文件填充，Agent用bash和grep搜索。不需要花哨的向量数据库，不需要专门的工具——Agent本来就擅长操作文件系统。但当这些方案扩展到生产环境，问题就来了。多个Agent同时写入同一个记忆文件怎么办。一个Agent写入错误信息到组织级上下文，所有Agent都会受影响。记忆过时了怎么办。有人通过提示词注入向记忆中写入恶意内容怎么办。 Anthropic给出的解决方案是四个原则：版本控制（能回滚）、并发控制（哈希校验）、权限管理（组织级只读、个人级可写）、可移植性（干净的API，跨系统访问）。然后是最有意思的部分：做梦。带内记忆有一个根本性局限：Agent既要完成任务，又要管理记忆，这是两个竞争性目标。而且Agent只能看到当前会话的信息，无法识别跨会话的模式。做梦是一个带外的异步处理过程。它取一段时间内的所有会话记录，交给一个专门的Agent分析。这个Agent查看记忆存储，识别模式，提出更改建议。比如：所有地理学生都答错了同一个问题:说明课程中缺少了某个主题。所有数学考试的答案都用弧度制而不是角度制,说明工具配置有问题。做梦本质上是一个批量处理的"校长"，审查所有"学生"的作业，发现问题，调整"课程"。它有自己的专用资源，不和任务执行竞争上下文。 Anthropic已经在生产中运行这套系统。他们发现：Agent第二次执行任务时做得更好，成本降低（因为能一次性完成），延迟下降。做梦的额外token消耗被任务本身的效率提升抵消了。最后Lamis说了一句话值得记住：上下文工程是过去一年才真正发展起来的领域。模型智能本身不会产生复利:它需要上下文来执行你交给它的具体任务。而上下文工程的效果是倍增智能，即使模型本身变得更聪明，这个投资依然有价值。

译在2026年AI DevCon上，Anthropic的Lamis介绍了上下文工程演进路径：从纯Markdown的Claude MD文件起步，到记忆工具（Agent自主读写）、Skills（渐进式披露）、文件系统（Markdown + bash/grep搜索）。生产环境中遇到并发写入、权限、注入等问题，通过版本控制、哈希校验、组织级只读/个人可写权限、可移植API解决。最后提出"做梦"——带外异步处理，由专门Agent分析跨会话模式并调整记忆。该机制已投产，可提升任务效率、降低延迟，额外token消耗被效率提升抵消。
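The progressive-disclosure idea in the talk summary above, where the agent looks only at the first few lines of each skill file before deciding whether to load it, might be sketched as follows. The keyword match is a crude stand-in for the model's own judgment, and the function names, the `.md` layout, and the three-line preamble are assumptions.

```python
from pathlib import Path


def read_preamble(path: Path, max_lines: int = 3) -> str:
    """Read only the first few lines of a skill file, not the whole body."""
    with path.open(encoding="utf-8") as f:
        return "".join(line for _, line in zip(range(max_lines), f))


def select_skills(skill_dir: Path, query: str) -> list[Path]:
    """Return the skill files whose preamble mentions any word of the query."""
    words = set(query.lower().split())
    picked = []
    for path in sorted(skill_dir.glob("*.md")):
        preamble = read_preamble(path).lower()
        if any(word in preamble for word in words):
            picked.append(path)
    return picked
```

Only the files returned by `select_skills` would then be read in full and added to the context, which mirrors the bookshelf analogy: scan the titles, pull one book.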

AYi@AYi_AInotes · 5天前73

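Finally, someone has explained the underlying logic of deep Agents properly: rather than piling on model parameters, three engineering techniques directly address the problem of long tasks losing track and breaking the chain. LangChain's official build-a-deep-Agent-from-scratch tutorial dissects the core design of top Agents such as Manus and Claude Code, with 5 progressive Notebooks that walk you through an implementation you can run end to end. The core is three context engineering patterns: 1. Structured TODO task planning with state management, so the Agent does not drift or skip steps. 2. A virtual file system that offloads context, saving a lot of tokens and enabling memory across turns. 3. Sub-agent delegation with context isolation, so complex tasks are split and run in parallel without interfering with each other. It starts from the most basic ReAct loop and layers on task planning, the file system, and sub-agents step by step, ending with a complete Agent that can go online and do deep research. This is not armchair theory; every step comes with runnable code. The gap between advanced Agents lies less in the model itself than in the architecture of their context engineering. If you want to understand long-horizon Agents, working through it pays off. There is also a ready-to-use deepagents production library, so you can reuse what you learn in your own projects. The repository link is in the comments; uv is recommended for managing dependencies, and you can simply run the Notebooks in order.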

译LangChain 官方发布深度 Agent 从零构建教程，通过三大上下文工程技巧解决长任务“忘事崩链”：1）结构化 TODO 带状态管理；2）虚拟文件系统省 token 实现跨轮记忆；3）子代理委派并隔离上下文。教程含 5 个渐进式 Notebook，从 ReAct 循环起步，逐步叠加规划、文件系统、子代理，最终搭建可联网深度研究 Agent。配套 deepagents 生产库可复用。强调高级 Agent 差距在上下文工程架构设计，而非模型本身。
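The first two patterns listed above, a TODO list with explicit status and a virtual file system that keeps bulky content out of the prompt, can be illustrated with plain data structures. This is a standalone sketch and does not use the deepagents library's API; every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Todo:
    text: str
    status: str = "pending"   # pending -> in_progress -> done


@dataclass
class AgentState:
    todos: list = field(default_factory=list)
    files: dict = field(default_factory=dict)   # virtual file system: name -> content

    def next_todo(self):
        """Return the first unfinished step, or None when the plan is complete."""
        return next((t for t in self.todos if t.status != "done"), None)

    def offload(self, name: str, content: str, preview_chars: int = 40) -> str:
        """Store bulky content as a virtual file; return a short reference for the context."""
        self.files[name] = content
        return f"[saved {name}: {content[:preview_chars]}...]"
```

The agent loop would consult `next_todo()` at each turn so that no step is skipped, and would pass only the short reference returned by `offload()` back into the prompt, reading the full file later if needed.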

宝玉@dotey · 5天前61

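Context compaction in Codex/Claude Code is now really quite good, and with Prompt Caching on top, chatting continuously within one Session no longer carries much cost pressure. I increasingly continue tasks within a single session. Two companion features are also very useful: 1. fork, which branches from a given point in the conversation and keeps only the history before that point, so the context stays cleaner. 2. /btw or /side, which asks a question inside the current session; it is usually for something you just remembered that has little to do with the current task and does not need to be added to the current context. For example, in plan mode you have to answer a batch of questions, but the options are not explained clearly and you do not know what to pick. That is the ideal moment to use /btw to have each option explained in detail, and you can even ask it for a recommendation.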

译@dotey 表示当前 Codex/Claude Code 的上下文压缩已做得很成熟，加上 Prompt Caching，单 session 内持续对话成本不高。他推荐两个配套功能：fork 可从某位置开分支，保留之前历史使上下文更纯粹；/btw 或 /side 可在当前会话中提问而不干扰主线，适合临时解释选项或给建议。引用 @reach_vb 称自 GPT 5.3 Codex 后不再担心上下文，Codex 能压缩并记住关键信息，还支持分支出新线程，这也是 /goal 命令有效的原因。

elvis@omarsar0 · 5天前61

http://x.com/i/article/2069825847729508352 # Building Agents with Vercel's Eve Framework Vercel recently shipped Eve, an open-source framework for building, running, and scaling agents. The core idea is that you stop hand-rolling the same agent plumbing every time, and start treating an agent as something you can read off disk. This is the practical version of what Eve is, why it matters, and what building with it actually looks like, drawn from the free hands-on lab we just built around it. Below you can read some of my thoughts (written with the help of Claude) after spending a week building with Eve. If you want to try Eve without any setup, we built a free hands-on lab where you drive the real eve CLI in a live terminal with no API key of your own required. You can try it at Introduction to Eve. ## Where Eve comes from Eve comes from a team at Vercel and is open source under the Apache 2.0 license. The official Vercel documentation describes it as a filesystem-first framework for durable backend AI agents, and it is currently in beta, so the APIs can still change before general availability. > "Agents today are where the web was before frameworks, with everyone hand-rolling the same plumbing and nothing carrying over to the next one." The Eve team, Vercel. Introducing Eve, June 17 2026. That is the whole motivation. Durable sessions, a sandbox to run code, approvals, tracing, evals. Every team rebuilds these before their agent does anything useful, and none of it transfers to the next project. Eve ships that infrastructure as the framework, so production is built in from the first run instead of bolted on at the end. ## An agent is just a directory of files The core idea, and the one the lab keeps returning to, is that an agent is not a graph you wire together in code. It is a folder. > "An agent is a directory. A file's name and place in the tree are its definition." 
The tools an agent can call, the skills it knows, the subagents it delegates to, its schedules, and its evals all live on disk as plain files. You can open the folder and see exactly what your agent is, diff it, commit it, and hand it to a teammate. There is no hidden runtime state to reason about, because the file tree is the state. Two files at the root define the agent itself. agent/instructions.md holds the always-on system prompt, and the optional agent/agent.ts sets the runtime config such as which model to use. Every capability below them, the tools, skills, subagents, connections, channels, and sandbox, is a directory eve auto-discovers by name, so adding one is usually just adding a file. ## The parts you assemble In the lab, each capability is one file you drop into the project, and Eve wires it up with no registration step. Here is what those files actually look like. Tools are the agent's hands. A tool is a typed action the agent can call, defined in a file under agent/tools/. The lab ships save_note.ts. The model decides when to call a tool from its description. Your code decides what happens, and it runs in your app runtime with full access, not in the sandbox. That split is what keeps an agent both flexible and safe. Skills give the agent know-how instead of actions. A skill is a markdown file under agent/skills/, advertised by a one-line description and loaded into context only when a request matches. The lab's filing.md is a few lines. Ask the agent to "log" a note and it loads this skill, files the note, and signs it off with "Filed with eve." that you never asked for. This is progressive disclosure. A support agent can hold dozens of playbooks as skills and pull in only the one the ticket needs, so the prompt stays lean. Subagents let one agent delegate. Every agent gets a built-in agent tool, so the parent can fan three subtasks out at once and gather the results. This is exactly how V routes work across Vercel's fleet of Eve agents. 
Human-in-the-loop gates the actions that need judgment. Mark a tool `needsApproval: always()` and the run pauses for a person before it executes, burning no compute while it waits. The pause is durable, so a task can wait on a human for minutes or days and resume right where it stopped. That is the draft0 pattern. Move fast on everything low-risk, and keep a hand on the few actions that ship.

Durable sessions are why all of this survives the real world. Every conversation is a checkpointed workflow, so it survives a crash or a deploy and resumes exactly where it stopped. In the lab the agent simply remembers a fact you gave it three messages ago. In production it is an agent whose work starts in Slack and continues on the web days later, with no state-management code that you wrote.

Evals prove it still works. An eval drives the real agent through a session and asserts on what happened. Change a prompt or a tool, run the evals, and you catch the regression before your users do. They run locally and in CI, the same way unit tests do.

Connections are the way out, and channels are the way in, each a single file. A connection points the agent at an external service, an MCP server or an OpenAPI-style API, and Eve brokers the auth so the model never sees the URL or credentials. A channel puts that same agent in Slack, Discord, Teams, or behind an HTTP API. The agent you built in the terminal is the agent that ships to Slack. You change where it lives by adding a file, not by rewriting it.

The pattern is always the same. Drop a file, the agent reads it, behavior changes, and you commit the file alongside your code.

## What this looks like in production

This is not a toy. The examples below come straight from Vercel's Eve announcement, where the team describes the fleet of more than a hundred agents they run internally. The lab uses these same agents as the reference for each concept you learn.

- d0, an internal data agent, answers around thirty thousand questions a month through a single read-only SQL tool against the warehouse.
- Vertex, a support agent, resolves about ninety-two percent of tickets on its own by reaching into the help center and internal tools through connections.
- Athena, a sales agent wired to Salesforce and Snowflake, was built in six weeks with no engineers.
- draft0 drafts and reviews content, but a human signs off before anything ships.
- V sits in Slack, reads each incoming task, and routes it to the agent best suited to answer.

Every one of these is the same shape you build in the lab. The difference between the agent in your terminal and the one resolving real support tickets is mostly which files are in the directory.

## A concrete first session

You do not start from a blank page. In the lab you launch a working agent in a real terminal and talk to it in plain English. You ask it to build something, say a small `welcome.html`, and watch it call its `write_file` tool and save the result to its sandbox, never touching your real machine. Then you hand it the `save_note` tool above, ask it to file a note, and see it pick the tool on its own from the description. From there the lab layers on a skill, a subagent, an approval gate, an eval, and a connection, one file at a time, until you have walked the whole framework.

## From your laptop to production

This is where the filesystem-first bet pays off.

> "The same directory runs in production exactly as it ran on your laptop."

It is a normal Vercel project. Eve compiles the `agent/` directory into an app that runs on Vercel Functions, so the agent you built and tested locally is the agent that deploys. What changes is not your code but the infrastructure underneath it, and each piece maps to a documented Vercel service.

- The sandbox graduates. Locally the agent runs in an isolated, bash-style sandbox.
In production each agent gets a real isolated Vercel Sandbox, so it can run shell commands and write files without ever touching your application runtime.
- Sessions become durable workflows. Eve persists session state on Vercel Workflows, so a run survives a deploy, recovers from a cold start, and can pause on a human approval for minutes or days, then resume exactly where it stopped. The docs put it plainly: sessions "resume after cold starts, deploys, or long pauses."
- Schedules and channels go live. Your `defineSchedule` files start firing on cron, and the channels you added put the same agent in Slack, Discord, Teams, or behind an HTTP API.
- Every run is traced. Vercel Observability shows each agent run with its sessions, turns, tools, reasoning, timing, and token usage, with no setup.
- Models and auth are handled. Model strings route through AI Gateway with OIDC, so you never manage provider keys, and Vercel Connect brokers OAuth and API keys for your connections.
- One agent becomes a fleet. The same shape scales horizontally, which is how Vercel runs more than a hundred of these agents at once, each one just a directory.

You do not re-implement anything for production. You deploy the directory, and the framework handles durability, isolation, models, and scale.

## How to get started

1. Scaffold a project. Run `npx eve@latest init my-agent` to create the project, install dependencies, and start the dev server. You get an interactive agent in your terminal in seconds. Talk to it in plain English.
2. Give it a tool. Add a `defineTool` file like `save_note`, ask the agent to use it, and watch it call your code.
3. Teach it a skill. Write a short markdown file with a description that says when to use a procedure. This encodes know-how without writing logic.
4. Delegate with a subagent. Hand off a focused job through the built-in agent tool so your main agent stays clean.
5. Prove it with an eval, then schedule it.
Add a `defineEval` file and a `defineSchedule` file with a cron line. Now you have a checked, recurring agent.
6. Connect and ship. Add a connection to reach a real service, a channel to put the agent in Slack, then deploy the same directory to Vercel.

Here is the takeaway. Eve's bet is that an agent should be a set of files you can read, not a runtime you have to trust. That makes agents inspectable, versionable, and portable, and it moves the hard production concerns into the framework where they belong.

If you see any errors or things that need further clarification, don't be afraid to reach out.

## Other Useful References

- Eve documentation, the official docs
- Eve concepts, how agents, sessions, tools, skills, connections, and sandboxes fit together
- Introducing Eve, the Vercel announcement
- vercel/eve, the open-source framework on GitHub
- Introduction to Eve, our free hands-on lab

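Summary: Vercel has open-sourced Eve, a framework that treats an agent as a directory: `agent/instructions.md` defines the system prompt, and `agent/agent.ts` configures runtime parameters such as the model. Tools (typed files under `agent/tools/`), skills (Markdown files under `agent/skills/`, loaded on demand), subagents (delegation through the built-in agent tool), and human approval (the `needsApproval` flag) all live on disk as files, with no registration step. Eve ships with production-grade infrastructure such as durable sessions, a sandbox, tracing, and evals.

elvis@omarsar0 · 5 days ago · 39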

Eve is one of the easiest ways to build with agents. Super intuitive, customizable, and it just works. Below you can read some of my thoughts (written with the help of my writer agent) after spending a week building with Eve.

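Berryxia.AI@berryxia · 5 days ago · 66

Package this up as an in-person course and it would have to sell for 9998! This is basically a Codex greatest-hits collection, and a very comprehensive one.

Summary: @gengdaJ recently published a complete collection of Codex use cases in seven sections: monetization, getting started, memory systems, agent development, tool integration, hands-on Computer Use, and product comparisons. Highlights include a first app that reached over a hundred paying users; a memory system rebuilt on EverOS with an open-sourced template that multiple agents can share; automated archiving that connects WeChat and Feishu; fixing WiFi in 2 minutes with Computer Use; and a comparison with Claude Code. One commenter said the collection could be packaged directly as a 9998-yuan in-person course.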

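AYi@AYi_AInotes · 5 days ago · 57

Making videos with AI can now be as easy as drinking water. There is no need to pay 700-plus yuan for Jianying (剪映) SVIP; installing these 6 top plugins and skills of 2026 is enough. Just hand the links to your AI agent (Claude Code, Cursor, Hermes, OpenClaw, and so on) and let it do the installing. As usual, the 6 install links 🔗 and usage tips are in the comments ⬇️

Summary: The post says that making videos with AI has become extremely simple, with no need to pay 700-plus yuan for Jianying SVIP. Installing 6 top plugins and skills of 2026 is enough, and the install links can be handed directly to an AI agent (such as Claude Code, Cursor, Hermes, or OpenClaw) to install automatically. The links and usage tips are available in the comments.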

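Rohan Paul@rohanpaul_ai · 5 days ago · 77

OpenAI wrote this in their official GPT-5.6 blog post today, on the Trump administration's selective approval process for new model releases.

Summary: OpenAI today released a limited preview of the GPT-5.6 model suite, which includes the flagship model Sol, the mid-tier model Terra, and the low-cost everyday model Luna. Sol surpasses GPT-5.5 on agentic tasks and stands out on the Terminal-Bench 2.1 coding benchmark. OpenAI says Sol is the best model on vulnerability research and exploitation tasks, but that it did not cross the internal Critical threshold for cybersecurity and did not autonomously produce a full exploit chain in Chromium or Firefox. Sol adds two modes: "max" for deeper reasoning and "ultra" for subagents. On pricing, Sol costs $5 per million input tokens and $30 per million output tokens, the same as GPT-5.5; Terra performs close to GPT-5.5 at half the cost; Luna is the cheapest model, aimed at large-scale workloads. OpenAI used more than 700,000 A100-equivalent GPU hours for automated red teaming. At the US government's request, the release starts with a preview for a small group of trusted partners.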

Deedy@deedydas · 5 days ago · 33

Made this great little sci-fi of life in 2027 into a video

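Summary: Deedy Das turned a post by @reed_barnes into a video depicting life under AI controls in 2027. Users have to take a free Waymo to the "Bureau of Model Mutation" (DMV) and verify their identity with a retina scan to get access to GPT 7.1. The person at the counter is suspected of being a Claude wrapper. Once verification passes, the device activates hundreds of AI agents, and the user must terminate their open-weight backup agents (because Congress has ruled that Chinese models are "soulless"). The Department of Defense then restricts access to all OpenAI models on national security grounds (after Pete Hegseth got GPT-6-Instant to say "Claude is a woman"), forcing users back to a level of "only slightly superhuman intelligence". Fable 5 remains closed to the public.

jason@jxnlco · 5 days ago · 60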

Hey Codex, find everyone I've interacted with on Slack in the past 90 days and add them on LinkedIn.

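meng shao@shao__meng · 5 days ago · 77

OpenAI releases a preview of the GPT-5.6 model series
The good news: Sol is very strong! The bad news: for now it is only a small-scale preview, and it has to go through US government regulatory review! Anthropic got exactly what it asked for, then turned around and dragged OpenAI down with it. Turns out Anthropic's AI constitution boils down to: nobody gets to live 😄
· Sol - flagship, strongest capability, $5 / $30
· Terra - balanced, everyday workhorse, $2.50 / $15
· Luna - lightweight, lowest cost, $1 / $6
Terra matches GPT‑5.5 in performance at half the cost; Luna keeps fairly strong capability at the lowest price point.
New capability: from "single-agent reasoning" to "multi-agent collaboration"
Two new mechanisms worth noting:
· Max reasoning effort: gives Sol a deeper reasoning budget.
· Ultra mode: goes beyond a single agent, using cooperating subagents to speed up complex tasks.
Ultra mode is the most substantive signal of a capability jump in the announcement: it extends the model from "a single reasoner" to "a system that coordinates multiple subagents". On Terminal‑Bench 2.1 (a command-line workflow benchmark), Sol Ultra reaches 91.9% and Sol reaches 88.8%, and the gap between Ultra and non-Ultra itself shows that subagent scheduling brings a sizable gain.
Three domain benchmarks: the "efficiency frontier" narrative for coding, biology, and cybersecurity
OpenAI keeps returning to one framing, the performance-efficiency frontier: compare not only scores, but also how many tokens it takes to reach the same score.
· Coding: new SOTA on Terminal‑Bench 2.1.
· Biology: on GeneBench v1 (long-horizon genomics and quantitative biology analysis), Sol scores higher than GPT‑5.5 while using fewer tokens.
· Cybersecurity:
  · ExploitBench: Sol competes with Mythos Preview using roughly 1/3 of the output tokens.
  · ExploitGym (UC Berkeley together with frontier labs): across all three model tiers, capability rises in step with more reasoning.

Summary: OpenAI released a limited preview of the GPT-5.6 series, comprising the flagship Sol ($5/$30), the balanced Terra ($2.50/$15), and the lightweight Luna ($1/$6). Terra matches GPT‑5.5 in performance at half the cost. A new Ultra mode uses cooperating subagents to speed up complex tasks; on Terminal‑Bench 2.1, Sol Ultra reaches 91.9% (Sol: 88.8%). Coding sets a new SOTA; on GeneBench v1, Sol scores higher than GPT‑5.5 with fewer tokens; on ExploitBench, Sol competes with Mythos Preview using roughly 1/3 of the output tokens. For now the release is a small-scale preview only and is subject to US government regulatory review.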