AIHOT
All updates · X · 2,395 posts
Tag: Expert opinions
François Chollet @fchollet · 2 hours ago · 43

Eventually, much of AI will converge towards intuition-guided symbolic world modeling, i.e. deep learning-guided program synthesis. It is inevitable. Symbolic modeling lets a system construct a compact, reusable, highly generalizable mental model of a problem space using minimal data.

DogeDesigner @cb_doge · 2 hours ago · 46

"In 5 years, digital intelligence will exceed the sum of all human intelligence. Within five years, there may be at least 100 million humanoid robots, possibly even 1 billion. The economy could double in size within 5 to 7 years because AI and robotics may increase output dramatically. The pace of change will be so fast that the world could look very different in just a few years." — Elon Musk

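Summary: Elon Musk predicts that within 5 years AI (digital intelligence) will exceed the sum of all human intelligence, and that there may be 100 million to 1 billion humanoid robots over the same period. Because AI and robotics could raise output dramatically, the global economy could double within 5 to 7 years. Ultimately, AI plus robots will be able to do everything, resulting in universal high income, with work becoming optional.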

Ethan Mollick @emollick · 3 hours ago · 48

AI implementation advice on my X feed is divided between those who "feel the exponential" and those whose (unconscious?) mental model of AI is that this is about as good as it is going to get, so it is time to build around the limitations & cost structures of today's capabilities

jason @jxnlco · 3 hours ago · 54

Let’s fucking go

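Summary: Developer @vig_xyz describes automating several workflows with Codex: reading email and drafting proposals in Google Drive based on the content; auto-generating contract redlines that, once confirmed by a lawyer, are filled into DocuSign via computer use; monitoring a Slack feedback channel to fix bugs automatically; writing unit tests overnight to reach 100% code coverage; and launching 6 parallel threads on worktrees so that PRs can be merged independently. He says it is hard to imagine going back to an IDE, or even vim.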

elvis @omarsar0 · 5 hours ago · 53

LLM Wikis are being slept on. I argue that creating knowledge bases with LLMs or coding agents is one of the most valuable applications of AI today. It's about being intentional in building and scaling your intelligence stack.

To showcase this, I wanted to share an LLM Wiki I have built over the last couple of months. It's called PaperWiki, and I use it across all my research workflows, along with my research agents. In fact, I also use it to curate papers I share with my communities, newsletter, and on X.

The PaperWiki is updated regularly with automations, so I basically have agents on a loop maintaining it. All the entries are ingested from different sources, stored in a vault (Obsidian), indexed using qmd, and then presented via an HTML artifact. So all of it is easily accessible to all my agents and easily searchable through full-text search and rich semantic search.

The structure of the wiki has proven significantly useful for starting interesting and exciting cutting-edge research projects with my research agents (from building tiny and more efficient gpt/diffusion llms to building out SoTA harnesses and memory systems). It turns out that agents love markdown files and can more easily navigate the papers given the rich metadata structure of the wiki.

I am just getting started on this, but it's clear to me that we should all be experimenting with LLM Wikis. Here's why:

Building LLM knowledge bases gets you into the habit of leveraging AI outputs in all kinds of creative ways. It's the good kind of tokenmaxxing we should all be pushing for.

LLM Wikis can be maintained automatically in a loop. I use an automation that updates the wiki every day based on papers I curate. The curation is another automation I run in a loop (with a bit of human in the loop), so I get to build on all my previous knowledge and expertise, and all of it compounds the deeper the integration/layers.
One interesting result of this process is that I feel like I can better spot high-quality papers and remove noise more easily. Social media could never solve that. And most paper aggregators use metrics I simply don't trust. I like that agents can help with the noise vs. signal problem. This is important for research.

Lots of people consider agents to produce mostly slop. But it doesn't have to be that way. Careful curations, prompts, automations, verifiers, and human-in-the-loop can produce some astonishing results. And you really don't need frontier models for this. I use a combination of frontier models (opus-4.8) and open-weight models (deepseek-v4-flash) to maintain this.

An exciting piece of future work (we are working on this @dair_ai) is to tune specialized models on top of this so that LLMs can quickly understand cutting-edge research ideas and better conceptualize research strategies that further accelerate scientific research agents.

I plan to open-source a bunch of this work, including the artifact, but this is currently work in progress, and I was excited to share some thoughts as I continue working on it. Sharing more as I go. Stay tuned!

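Summary: Elvis Saravia of DAIR.AI shares PaperWiki, a knowledge base he has built over the past few months with LLMs and coding agents for his research workflows. Automations update it daily, ingesting papers from multiple sources into an Obsidian vault, indexing them with qmd, and presenting them as an HTML artifact with full-text and semantic search. Saravia maintains it with a mix of frontier models (opus-4.8) and open-weight models (deepseek-v4-flash) and plans to open-source it. He argues that LLM Wikis are among the most valuable applications of AI today.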
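The idea of storing papers as markdown notes with rich metadata and searching them by full text can be sketched in a few lines. This is only an illustration, not PaperWiki's actual code: the `PaperNote` class, its fields, the front-matter layout, and the `search` helper are all assumptions made for the example, and it does not use Obsidian or qmd.

```python
from dataclasses import dataclass, field


@dataclass
class PaperNote:
    """Hypothetical wiki entry: one paper plus the metadata an agent can filter on."""
    title: str
    source: str
    tags: list[str] = field(default_factory=list)
    summary: str = ""

    def to_markdown(self) -> str:
        """Render the note as markdown with a simple front-matter block."""
        front = [
            "---",
            f"title: {self.title}",
            f"source: {self.source}",
            f"tags: [{', '.join(self.tags)}]",
            "---",
        ]
        return "\n".join(front) + f"\n\n# {self.title}\n\n{self.summary}\n"


def search(notes: list[PaperNote], query: str) -> list[PaperNote]:
    """Naive case-insensitive full-text search over the rendered markdown."""
    q = query.lower()
    return [n for n in notes if q in n.to_markdown().lower()]


# Made-up entries, used only to exercise the sketch.
notes = [
    PaperNote("Speculative decoding survey", "arxiv", ["inference"], "Draft-then-verify methods."),
    PaperNote("Agent memory systems", "blog", ["agents", "memory"], "How agents persist context."),
]
hits = search(notes, "memory")
print([n.title for n in hits])
```

A real setup would write each note to a file in the vault and replace the substring match with a proper full-text or semantic index; the point here is only that plain markdown plus metadata is easy for both agents and scripts to navigate.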

Ethan Mollick @emollick · 6 hours ago · 49

Fable in Claude Code is capable of really amazing things, including for non-coders, but the interface is not really designed for managing 5+ hour long autonomous tasks. Really hard to observe what is happening and intervene in real time, you often have to wait until the outputs.

Ethan Mollick @emollick · 6 hours ago · 52

Continual learning is probably the biggest barrier to explosive AI adoption (& may have big implications for recursive self-improvement as well). As long as you deal with amnesiac models that require humans to do the learning for them, adoption will be gated by human processes.

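Summary: Ethan Mollick argues that continual learning is probably the biggest barrier to explosive AI adoption and may have major implications for recursive self-improvement. As long as models are amnesiac and need humans to do the learning for them, adoption will be gated by human processes. The quoted post from Epoch AI introduces EBR-bench, which measures on-the-fly learning by having AI repeatedly play the board game Earthborne Rangers. Initial results: the AI fails to improve from its mistakes and so far shows no sign of getting better.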

Chubby♨️ @kimmonismus · 6 hours ago · 29

seriously wtf anthropic? No wonder they were able to re-release Fable 5.

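Summary: The quoted post says Fable 5 isn't nerfed, it's slaughtered. The problem isn't even the model itself but the hard guardrails Anthropic has set. Commenters are shocked.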

Emad @EMostaque · 6 hours ago · 23

OpenAI and Anthropic should put 10% of their equity each in Invest America accounts for the children of the USA. Valuation will go up more than 10%.

elvis @omarsar0 · 7 hours ago · 35

So much alpha in tuning/building LLM verifiers and judges. I use them on top of my harness, and it has unlocked agentic coding workflows that are beyond anything that exists in the market today. Building verifiers and LLM judges is starting to become a skill in high demand.

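Summary: Elvis Saravia (DAIR.AI) says that tuning and building LLM verifiers and judges is becoming a skill in high demand. He layers them on top of his own harness, which he says has unlocked agentic coding workflows beyond anything on the market today. The quoted post cites Bridgewater, which used its financial expertise and partnered with Tinker API to fine-tune a model that helps analysts focus on key tasks: a loop in which experts improve AI and AI empowers experts.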

elvis @omarsar0 · 7 hours ago · 36

Yesterday, I saw a lot of early excitement on Fable 5. But as I predicted, that wore off super fast. My timeline is full of disappointments around limitations, guardrails, capabilities, costs, and much more. I miss the aura of the Opus 4.5 launch. It just worked.

Chubby♨️ @kimmonismus · 7 hours ago · 25

And we are still waiting for Gemini 3.5 pro, which I actually expected at the end of June.

数字生命卡兹克 @Khazix0918 · 8 hours ago · 63

Watching Claude Fable 5 file a support ticket on Volcano Engine (火山引擎) by itself and then talk with Volcano's engineers to solve a problem left me stunned...

Ethan Mollick @emollick · 8 hours ago · 50

You really need your own benchmarks. If you are translating hieroglyphics, use Gemini 3.5 Flash. If you are running a vending machine use Opus 4.8. (This is one reason why I am skeptical of just swapping out models to optimize costs or generic benchmarks without testing first)

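Summary: Ethan Mollick argues for evaluating models with your own benchmarks rather than relying on generic benchmarks or swapping models untested. His examples: use Gemini 3.5 Flash for translating Egyptian hieroglyphics and Opus 4.8 for running a vending machine. In the quoted post, Jake Boggs's HieroglyphBench shows Anthropic's Fable 5 effectively tied with GPT-5.5, with both far behind the Gemini series; Gemini 3.5 Flash scores more than twice as high as Fable 5.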
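A personal benchmark does not need much machinery. The sketch below shows one possible shape: a list of task cases, a scoring function, and a comparison across candidate models. Everything in it is hypothetical: the cases are toy placeholders for your own task data, and `model_a` and `model_b` are plain functions standing in for real model calls so that the example runs without any API.

```python
from typing import Callable

# Hypothetical test cases: (prompt, expected answer). Replace with your own task data.
CASES = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]


def score(model: Callable[[str], str], cases=CASES) -> float:
    """Return the fraction of cases where the model's answer matches exactly."""
    correct = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return correct / len(cases)


# Stand-in "models": lookup tables instead of API calls.
def model_a(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "?")


def model_b(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}.get(prompt, "?")


results = {name: score(fn) for name, fn in [("model_a", model_a), ("model_b", model_b)]}
best = max(results, key=results.get)
print(results, best)
```

Exact-match scoring is the simplest choice and is rarely enough for open-ended tasks; in practice the scoring function is where most of the design effort goes.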

elvis @omarsar0 · 8 hours ago · 61

AI sovereignty isn’t optional. Don’t give away your alpha so easily. Protect it as much as you can. Open source models are critical and should be an important part of any individual’s, organisation’s, or country’s AI strategy.

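Summary: DAIR.AI founder Elvis Saravia says AI sovereignty isn't optional and that open-source models should be a core part of any individual's, organization's, or country's AI strategy. He cites Palantir CEO Alex Karp's view that technical customers really want full control over their compute, models, data stack, and their own "alpha" (core advantage), that is, owning the means of production rather than handing them to someone else. Karp asks: if the models are so valuable, why do frontier labs charge only per token instead of taking a share of profits? This raises key questions about data ownership and prompt security.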

fofr @fofrAI · 9 hours ago · 42

The more I talk with agents, the better I get at compressing my intent into minimal tokens. I'm learning claudish accidentally.

meng shao @shao__meng · 9 hours ago · 52

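Three paradigms of LLM interaction:
1. Web chatbots
2. Standalone AI apps
3. AI embedded in the organization (Claude Tag, Glean Agents)

Core changes in Claude Tag
· From "one AI per person" to "one AI per channel": the team shares a single agent instance, with continuous context that can be handed off
· From "passive response" to "ongoing participation": it remembers discussions, follows up on silent threads, and stays present in the channel over time

Why channel-level is not enough
Organizational knowledge is scattered across Jira, Confluence, GitHub, and Slack history. An agent that reads only one channel misses most of the context. The real difficulty is building a cross-system, permission-aware, real-time organizational context layer.

Four pillars of production-grade standalone agents (Glean)
1. Identity: the agent has its own identity, permissions, and tool access; different functions can be configured with different agents, and every action is traceable.
2. Memory: it learns enterprise runbooks and SOPs, and corrects and reinforces itself from every interaction, accumulating institutional knowledge.
3. Proactivity: it does not wait for prompts; it monitors, flags, follows up, and executes on its own initiative.
4. Accountability: every tool call and decision is visible and explainable, with a one-click "emergency stop".

Practical example: OnCall Assistant
After an alert fires, the agent reads PagerDuty, Jira, Confluence, GitHub, and Slack simultaneously, investigates multiple root causes in parallel, drafts a fix, and tags the owner. By the time the engineer opens their laptop, the investigation is done.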

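Summary: meng shao outlines three stages of LLM interaction: web chatbots, standalone AI apps, and AI embedded in the organization. Claude Tag moves from "one AI per person" to "one AI per channel", with a team sharing one agent instance whose context is continuous and can be handed off, and from passive response to ongoing participation, tracking threads and staying present. Glean Agents proposes four pillars for production-grade standalone agents: Identity (its own identity and permissions), Memory (learning enterprise SOPs and correcting itself iteratively), Proactivity (monitoring and acting on its own initiative), and Accountability (traceable tool calls, including an emergency stop). In the OnCall Assistant example, after an alert fires the agent reads PagerDuty, Jira, Confluence, GitHub, and Slack in parallel, investigates root causes automatically, and tags the owner.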

Tibo @thsottiaux · 14 hours ago · 26

Can't wait to see what people will do with GPT-5.6 Sol Ultra. Stash your hardest prompts somewhere.

Berryxia.AI @berryxia · 15 hours ago · 37

For ordinary small tasks, the agent is smart enough; one sentence is all it takes.

Chubby♨️ @kimmonismus · 15 hours ago · 33

Normally, I wouldn’t pay much attention to statements like this from Sam Altman. But given all the extraordinary developments we’re witnessing right now, I truly believe we’re at the forefront of an unprecedented revolution. Sam Altman: "In another year or two, we expect to have built systems with astonishing power, capable of delivering tremendous value to the world. Artificial intelligence will reshape the material conditions of human life on a scale that no technology has accomplished since the harnessing of electricity, and perhaps beyond even that"

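Summary: In a Financial Times interview, Sam Altman says that within a year or two OpenAI expects to have built AI systems of astonishing power, which will reshape the material conditions of human life on a scale no technology has matched since the harnessing of electricity, and perhaps beyond. The quoted post adds that AGI (replacing most white-collar jobs) is expected in 2029, and that OpenAI is targeting an August release of GPT-6 that would beat GPT-5 on every benchmark, with another step change in the months after. The poster believes we are at the forefront of an unprecedented revolution.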

Rohan Paul @rohanpaul_ai · 16 hours ago · 44

Robotics is hard. Every unit must behave reliably after it leaves the clean geometry of the production line. Cars repeat a narrow task inside a heavily engineered road system, while humanoid robots are being asked to generalize across spaces that were designed for human bodies, human judgment, and human tolerance for failure. A robot must survive contact with kitchens, stairs, tools, dust, people, hesitation, bad lighting, dropped objects, and all the small chaos that factories spend decades trying to remove.

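Summary: After leaving the production line, robots must cope with the chaos of the real world: kitchens, stairs, tools, dust, people, hesitation, bad lighting, and dropped objects. That is entirely different from cars repeating a narrow task inside a heavily engineered road system. The quoted post from Elon Musk says Optimus production will be extremely slow at first because everything is new, unlike making a car.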

swyx @aiDotEngineer WF @swyx · 17 hours ago · 16

for what it's worth, i only invite double-length track keynotes when I'm very sure that both speaker and content deserve it. Today, @chrmanning and @abshkbh did double duty at AIE and by all accounts* people loved the opportunity to go deeper on sandboxing and world models. Look at this insane room - and the online audience is going to be >1000x this!! *i unfortunately have to do show duties so rely on secondhand accounts

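Summary: At the AIE conference, swyx invited Chris Manning and Abhishek to give double-length keynotes going deeper on sandboxing and world models. The in-room audience responded enthusiastically, and the online audience is expected to be more than 1000x the room. In the quoted post, swyx calls the talk exceptional and thanks them for sharing their sandboxing teaching material for free.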

数字生命卡兹克 @Khazix0918 · 17 hours ago · 30

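Possibly the highest-value use of your time in these 7 days: use Claude Fable 5 to optimize and iterate on all of your workflows, SOPs, Skills, project plans, and project code. I can already clearly feel that a $200 Max account isn't enough to burn through; it hit bottom in an hour and a half... So I signed up for a second $200 Max account to pedal hard through these 7 days...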

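Summary: Khazix recommends using Claude Fable 5 to iterate on all of your workflows, SOPs, Skills, project plans, and code. He says a $200 Max account ran dry in just an hour and a half, so he registered a second one to make the most of the 7 days.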

Rohan Paul @rohanpaul_ai · 17 hours ago · 45

Palantir CEO Alex Karp: A company does not just want a clever model answering questions inside a polished interface. A serious technical customer wants control over the data, prompts, system access, and the workflow that creates value.

Ethan Mollick @emollick · 18 hours ago · 43

Been reading all sorts of posts about the best ways to develop workflows for Fable and it reminds me of how little we actually know about the best ways to organize work for long-running agents. Nobody has enough experience or has done enough testing to reach any real conclusions.

Peter Steinberger 🦞 @steipete · 19 hours ago · 14

Never thought I'd give @Steve_Yegge a shoutout. He was just early, like most visionaries. Now everyone is building factories.

Rohan Paul @rohanpaul_ai · 22 hours ago · 41

Today’s edition of my newsletter just went out.
🔗 https://www.rohan-paul.com/p/anthropic-brings-out-claude-sonnet
🗞️ Anthropic brings out Claude Sonnet 5 as a cheaper model for running agents.
🗞️ OPINION: Mapping the Foundation Model Landscape: July 2026
🗞️ Claude Sonnet 5 upgrades are not uniform across every skill, e.g. it's weaker than Sonnet 4.6 on CyberGym.
🗞️ Claude Sonnet 5 is more expensive (around +15%) per task than Opus 4.8 and much more expensive (2X) than Sonnet 4.6, even though its per-token price is lower than Opus.
🗞️ Claude Code allegedly fingerprints China-linked custom routes through tiny prompt formatting changes.
🗞️ “Are We Ready For An Agent-Native Memory System?”
🗞️ “Towards Automating Scientific Review with Google’s Paper Assistant Tool”

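Summary: Anthropic releases Claude Sonnet 5, positioned as a cheaper model for running AI agents. The upgrade is uneven, however: it is weaker than Sonnet 4.6 on the CyberGym benchmark. Its per-task cost is about 15% higher than Opus 4.8 and about 2x that of Sonnet 4.6, even though its per-token price is lower than Opus. Separately, Claude Code is alleged to fingerprint China-linked custom routes through tiny prompt formatting changes. This edition of the newsletter also covers "agent-native memory systems" and automating scientific review with Google's Paper Assistant Tool.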
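A model can be cheaper per token yet more expensive per task if it consumes more tokens to finish the same work. The arithmetic below illustrates the effect with invented numbers; the prices and token counts are assumptions chosen for the example, not figures from the newsletter or from any vendor's price list.

```python
def cost_per_task(price_per_million_tokens: float, tokens_per_task: int) -> float:
    """Cost of one task in dollars: per-token price times tokens consumed."""
    return price_per_million_tokens * tokens_per_task / 1_000_000


# Hypothetical numbers, chosen only to illustrate the effect.
cheap_model = cost_per_task(price_per_million_tokens=10.0, tokens_per_task=230_000)
pricey_model = cost_per_task(price_per_million_tokens=20.0, tokens_per_task=100_000)

ratio = cheap_model / pricey_model
print(f"cheaper-per-token model: ${cheap_model:.2f} per task")
print(f"pricier-per-token model: ${pricey_model:.2f} per task")
print(f"ratio: {ratio:.2f}")
```

With these made-up inputs, the model at half the per-token price ends up about 15% more expensive per task because it uses 2.3 times as many tokens, which is why per-task cost is the figure worth measuring on your own workload.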

Hao AI Lab @haoailab · 23 hours ago · 51

http://x.com/i/article/2072448547069599744

# DSpark vs. JetSpec, which is better?

Authors: @Lanxiang_Hu @aaronzhfeng @YuYangQian_ai @Jensen_Yuan @haozhangml

TL;DR: Speculative decoding (SD) techniques have proliferated recently. SD accelerates autoregressive generation by letting a lightweight draft model propose future tokens, while the target model verifies them in parallel. Among recent efforts, DSpark and JetSpec emerged almost concurrently around the same bottleneck: once drafting becomes cheap, how do we preserve enough causal consistency for parallel proposals to survive verification?

This naturally raises the question: which one is better? Or, more interestingly, are they actually complementary? The fact that both works converge in this direction suggests that causality is becoming a central lever for next-generation speculative decoding. They approach it from complementary sides of the throughput–latency frontier.

DSpark targets high-concurrency serving: on Qwen3-8B and AIME25, DSpark improves accepted length from 4.07 (DFlash) to 5.01 at budget 7, using causal recurrent state for confidence-scheduled verification.

JetSpec targets the latency-oriented, compute-budget-rich regime: by building causality directly into the parallel draft head, it turns larger draft budgets into longer accepted prefixes. On the same settings, it scales accepted length from 7.23 at budget 16 to 9.82 at budget 128, up from DFlash's 7.34 (DDTree's 8.66) at budget 128, for low-latency generation.

1. Causality in DSpark and JetSpec

Traditional drafters like the EAGLE series often preserve draft quality through autoregressive generation, but this makes longer drafts require more sequential draft steps. DFlash changes the cost structure: by using a lightweight block-parallel drafter to predict many future positions in one pass, it opens the door to making draft cost cheap. But cheap drafting is not enough.
Once the draft cost drops, the bottleneck shifts to whether parallel proposals can survive verification. When future positions are weakly conditioned on earlier draft tokens, they may appear plausible in isolation but become inconsistent as a sequence. Here is where causality becomes important.

DSpark keeps the parallel drafting backbone cheap, while adding a lightweight sequential head and confidence estimation to better decide which proposals should be sent for verification, thereby controlling the per-request compute budget. As a result, DSpark consistently improves throughput over MTP-style pure autoregressive drafting, where longer drafts require more sequential draft steps (Figure 1).

On the other hand, under a latency-oriented Service Level Objective (SLO) with low concurrency, the system is more FLOPs-rich, so the goal shifts toward maximizing accepted rate per verification step. In this regime, we can afford to spend more on draft compute to raise the acceptance rate and maintain high acceptance at deeper positions. This is where causal parallel drafting, as in JetSpec, becomes especially important: the draft budget is used for generating a path-conditioned tree, making it more likely to produce long accepted prefixes.

2. How Causality Helps

Once drafting becomes cheap, the next question is how to spend limited compute intensity: should we squeeze more throughput under high concurrency, or push lower latency when more FLOPs are available per request? This is where causality becomes the key lever.

Pushing the Throughput Limit: DSpark for Budget-Aware Correction

DSpark targets the high-concurrency, budget-constrained regime. It uses a lightweight Markov-style correction head and confidence head (or an RNN-head variant that carries recurrent prefix state across positions). For each draft position i, the parallel drafter first produces base logits z_i^0, and a corresponding draft hidden state h_i.
The confidence head estimates prefix-dependent confidence scores c_i, and the Markov head then injects a small causal correction from the previous draft token to generate the corrected logits. The verification budget is then scheduled by keeping only the longest confident prefix under budget B and threshold rho. This makes it suitable for budget-aware serving: the draft backbone stays parallel, while the correction path improves local or prefix-dependent consistency.

Pushing the Latency Limit: JetSpec Turns Draft Budget into Higher Acceptance

With low concurrency, modern AI accelerators come with more spare FLOPs, so the key question becomes: how to translate higher compute budget into more accepted tokens per draft-verification step? This is where JetSpec takes a different path. JetSpec uses a causal parallel draft head to produce a path-conditioned draft tree, where deeper nodes are conditioned on earlier tokens along the same branch.

The effect shows up clearly in the depth-wise acceptance profile (Figure 4). JetSpec consistently maintains higher acceptance than DFlash on both coding and math reasoning workloads. On AIME25, JetSpec starts with near-perfect acceptance at draft depth 1 (q_1 at around 99%) and still maintains roughly 50% acceptance at depth 8 (q_8). Here q_i denotes the survival probability that at least the first i draft tokens are accepted.

The empirical acceptance length is computed from these survival probabilities. Under the constant per-token acceptance rate assumption used in the original speculative decoding analysis, the theoretical acceptance length depends only on that rate and the draft depth. We define alpha_eff by fitting the theoretical and empirical acceptance lengths. This corresponds to an estimated effective per-token acceptance rate of about 93%, substantially higher than DFlash. In this low-cost, high-acceptance regime, even a 5% gain in per-token acceptance can have an outsized impact on speculative decoding: it significantly increases the maximum theoretical acceptance length (Figure 4), which in turn directly reduces generation latency.
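Fitting an effective per-token acceptance rate to a depth-wise survival profile can be sketched as follows. The article's own equations did not survive in this copy, so the formulas here are assumptions based on the standard analysis: the empirical accepted length is taken as the sum of the survival probabilities q_i, the theoretical length under a constant rate alpha is the sum of alpha^i over the draft depth, and neither counts the bonus token from the target model. The q values are invented for illustration and are not the article's measurements.

```python
def empirical_length(q: list[float]) -> float:
    """Expected number of accepted draft tokens: sum of survival probabilities q_i."""
    return sum(q)


def theoretical_length(alpha: float, depth: int) -> float:
    """Expected accepted draft tokens if each token is accepted with probability alpha."""
    return sum(alpha ** i for i in range(1, depth + 1))


def fit_alpha_eff(q: list[float], tol: float = 1e-10) -> float:
    """Bisection for the alpha in [0, 1] whose theoretical length matches the empirical one."""
    target, depth = empirical_length(q), len(q)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # theoretical_length is increasing in alpha, so move toward the target.
        if theoretical_length(mid, depth) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical survival profile from depth 1 to depth 8 (not taken from the article).
q = [0.99, 0.95, 0.90, 0.84, 0.77, 0.68, 0.59, 0.50]
alpha_eff = fit_alpha_eff(q)
print(f"empirical length: {empirical_length(q):.2f}, alpha_eff: {alpha_eff:.4f}")
```

Because the theoretical length grows steeply as alpha approaches 1, small gains in per-token acceptance translate into noticeably longer accepted prefixes, which is the effect the article attributes to causal parallel drafting.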
Up Next: Enabling Both Throughput- and Latency-Oriented Parallel Drafting

A foreseeable next step is to build a dynamic serving framework that can push both ends of the throughput–latency Pareto frontier: low-concurrency settings that demand higher per-user TPS, and high-concurrency settings that require higher aggregate throughput under tight verification budgets. In this direction, JetSpec and DSpark are naturally complementary: JetSpec strengthens the parallel drafting backbone for low-latency budget scaling, while DSpark adds lightweight sequential confidence checking and budget control for high-concurrency serving.
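The budget control attributed to DSpark, keeping only the longest confident prefix under a budget and a threshold, might look roughly like the sketch below. The exact rule is not given in this copy of the article, so this is an assumed reading: each confidence score is treated as a prefix-level score compared directly against `rho`, and the function name `schedule_prefix` is hypothetical.

```python
def schedule_prefix(confidences: list[float], budget: int, rho: float) -> int:
    """Return how many leading draft tokens to send for verification.

    Keeps the longest prefix whose confidence scores all stay at or above rho,
    capped at `budget` tokens.
    """
    kept = 0
    for c in confidences[:budget]:
        if c < rho:
            break  # the first low-confidence position ends the prefix
        kept += 1
    return kept


# Hypothetical confidence scores for five drafted positions.
scores = [0.98, 0.95, 0.91, 0.60, 0.97]
print(schedule_prefix(scores, budget=7, rho=0.9))
print(schedule_prefix(scores, budget=2, rho=0.9))
```

Stopping at the first low-confidence position reflects how verification works: once one draft token is rejected, everything after it is discarded, so spending verification budget past that point is wasted under high concurrency.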

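Summary: DSpark and JetSpec appeared almost simultaneously, and both address causal consistency when a lightweight draft model proposes tokens in parallel. DSpark targets high concurrency, controlling the budget with a lightweight Markov correction head and confidence estimation; on Qwen3-8B and AIME25 at budget 7 it raises accepted length from DFlash's 4.07 to 5.01. JetSpec targets low latency by building causality directly into the parallel draft head; its accepted length is 7.23 at budget 16 and 9.82 at budget 128, above DFlash's 7.34 and DDTree's 8.66. The two optimize causality from the throughput side and the latency side respectively.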

meng shao @shao__meng · 1 day ago · 14

Fable 5 is out. Did you spot GPT-5.6 in there? Is it coming soon too?

Ethan Mollick @emollick · 1 day ago · 41

Since it's back, here were my impressions I posted a couple weeks ago of Fable after my time as an early access user (yes, it really is very impressive, but that shows off best in longer, harder tasks) https://open.substack.com/pub/oneusefulthing/p/what-it-feels-like-to-work-with-mythos?r=i5f7&utm_medium=ios

elvis @omarsar0 · 1 day ago · 33

I really wish GPT-5.5 had a bit more "taste" in design and planning. For everything else related to code, it's the best model. I hope GPT-5.6 closes the gap. It would feel more complete then. For now, I switch to Opus 4.8/GLM-5.2 to fix design issues or when I plan.

Ethan Mollick @emollick · 1 day ago · 47

Yes! Pre-classifying routers are going to result in a lot of bad work because routing is hard and routers tend to underestimate the value of intelligence on many problems. OpenAI learned this with GPT-5, now it seems routers are hot again.

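Summary: Ethan Mollick says pre-classifying routing (judging a task's difficulty first, then assigning a model) looks like it saves cost and latency, but routing is hard in practice and tends to underestimate the value of intelligence on many problems. OpenAI already learned this lesson with GPT-5, and now the idea is popular again. In the quoted post, @MParakhin adds that running a pre-classifier reliably requires solving the task itself first, and that the only right way is the advisory model approach.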

elvis @omarsar0 · 1 day ago · 38

There is no if. You can just combine the latest OpenAI model (even GPT-5.5) with other models like Opus-4.8 / GLM-5.2, and you are good. GPT-5.6 or the next frontier model will only elevate things further. Direct model comparison is just the wrong way to think going forward.

Ethan Mollick @emollick · 1 day ago · 47

Yes! Pre-classifying routers are going to result in a lot of bad work because routing is hard and routers tend to underestimate the value of intelligence on many problems. OpenAI learned this with GPT-5, now it seems routers are hot again.

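Summary: Ethan Mollick says pre-classifying routers lead to bad results because routing itself is hard and often underestimates the value of intelligence. OpenAI was already burned by this with GPT-5, and now the idea is hot again. The quoted @MParakhin likewise argues that using a pre-classifier to decide whether a task is simple before calling a small model looks like it saves money and latency, but doing it reliably requires solving the task itself first; the only workable option is the advisory model approach.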

Chubby♨️ @kimmonismus · 1 day ago · 15

Okay Anthropic, I forgive you for the bad Sonnet 5 launch. Fable 5 is just so much fun.

Summary: Fable 5 is back. The poster forgives Anthropic for the bad Sonnet 5 launch because Fable 5 is so much fun.

elvis @omarsar0 · 1 day ago · 50

My prediction: the excitement for Fable 5 will wear off really fast. Reposting this to help those who will be extremely disappointed after they play with Fable 5 and run out of tokens or can't do much with it. Just a bit of advice on how to leverage a combination of AI models to get the same or better results. The best part is that there are many ways to do this now, including mixing with frontier open-weight models.

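Summary: The author predicts the excitement around Fable 5 will fade quickly and warns users about token limits and functional restrictions. He recommends combining multiple AI models (for example, Opus 4.8 for planning and GPT-5.5 for execution) to get the same or better results, optionally mixing in frontier open-weight models. He adds that breaking tasks into smaller sub-steps to raise quality is underrated, which is why dynamic workflows matter.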

Ethan Mollick @emollick · 1 day ago · 48

Formal organizational structures are a useful way to think about the challenges of agents. They provide a template for thinking about how work gets delegated up and down between smart expensive agents & cheaper weaker ones, as well as between narrow specialists & generalists.

elvis @omarsar0 · 1 day ago · 23

Not really excited about this nerfed and limited Fable 5. One of the most confusing AI launches of all time. But we carry on.

Ethan Mollick @emollick · 1 day ago · 54

The discussion here on AI futures can be a little too credulous of company visions. People tend to push what they have. The three big AI labs will say bigger models are the future. Every other firm has only small models to sell, so they will tell you small models are the future.

Chubby♨️ @kimmonismus · 1 day ago · 46

Palantir CEO Alex Karp says enterprises are fed up with AI labs that "oversold" models and pushed tokenmaxxing. Customers want to own the full AI stack with Palantir + NVIDIA at the center. Absolute cinema. Worth watching until Fable is back.

All AI updates
Full feed of AI-related news
July 3
04:41 · François Chollet @fchollet · 43
Tags: Expert opinions, Reasoning/Inference
04:39 · DogeDesigner @cb_doge · 46
Quoted: Elon Musk: @chamath AI+Robots will be able to do everything, resulting in universal high income. Work will be optional.
Tags: xAI, Embodied AI, Expert opinions
04:35 · Ethan Mollick @emollick · 48
Tags: Expert opinions, Trends
04:04 · jason @jxnlco · 54
Quoted: Vignesh Mohankumar: i've got codex... - reading all my emails to figure out proposals to write, directly in google drive - auto-drafting con...
Tags: Agents, OpenAI, Expert opinions, Coding
01:40 · elvis @omarsar0 · 53
Elvis Saravia of DAIR.AI shares PaperWiki: a research knowledge base built on LLMs and agents
Tags: Agents, Expert opinions, Deployment/Engineering
01:04 · Ethan Mollick @emollick · 49
Tags: Agents, Anthropic, Expert opinions, Coding
01:04 · Ethan Mollick @emollick · 52
Quoted: Epoch AI: Introducing EBR-bench, our new benchmark to measure on-the-fly learning. AI repeatedly plays a challenging board game ca...
Tags: Expert opinions, Reasoning/Inference, Trends
00:59 · Chubby♨️ @kimmonismus · 29
Quoted: ħεsam: Fable 5 isn't nerfed, it's SLAUGHTERED. the problem isn't even the model itself, but the hard guardrails Anthropic has s...
Tags: Anthropic, Expert opinions, Safety/Alignment
00:33 · Emad @EMostaque · 23
Tags: Anthropic, OpenAI, Expert opinions
00:09 · elvis @omarsar0 · 35
Quoted: Mira Murati: Bridgewater used their unique financial knowledge and partnered with us on @tinkerapi to fine-tune a model that helps th...
Tags: Expert opinions, Reasoning/Inference
00:09 · elvis @omarsar0 · 36
Tags: Expert opinions, Trends
July 2
23:59 · Chubby♨️ @kimmonismus · 25
Quoted: Chubby♨️: The only question remaining now is: will GPT-5.6 also have guardrails as strict as Fable 5's, or does OpenAI have better...
Tags: Google, OpenAI, Expert opinions
23:30 · 数字生命卡兹克 @Khazix0918 · 63
Tags: Agents, Anthropic, Expert opinions
23:03 · Ethan Mollick @emollick · 50
Quoted: Jake Boggs: Fable 5 is a large step for Anthropic's vision capabilities and effectively ties with GPT-5.5 on HieroglyphBench, my ben...
Tags: Multimodal, Expert opinions, Evals/Benchmarks
22:39 · elvis @omarsar0 · 61
Quoted: Palantir: Palantir CEO Alex Karp on what customers actually want, the real business of frontier labs, and the importance of open s...
Tags: Expert opinions, Open-source ecosystem
22:30 · fofr @fofrAI · 42
Tags: Agents, Expert opinions
22:09 · meng shao @shao__meng · 52
Three paradigms of LLM interaction: from web chat to AI embedded in the organization
Quoted: Sumanth: http://x.com/i/article/2072078677047926784
Tags: Agents, Expert opinions
17:35 · Tibo @thsottiaux · 26
Tags: OpenAI, Expert opinions
16:31 · Berryxia.AI @berryxia · 37
Quoted: Bloome: Most tools give you a draft. This chat gave back a launch asset. From "we launch this week" to a post-ready card, withou...
Tags: Agents, Expert opinions
15:52 · Chubby♨️ @kimmonismus · 33
Sam Altman predicts an AI transformation comparable to electricity; GPT-6 targeted for August
Quoted: Chris: Sam Altman in the financial times: "In another year or two, we expect to have built systems with astonishing power, capa...
Tags: OpenAI, Expert opinions
15:06 · Rohan Paul @rohanpaul_ai · 44
The real-world challenge of humanoid robots: Optimus production will be extremely slow at first
Quoted: Elon Musk: @DoctorJack16 No, Optimus production will be extremely slow at first, as everything is new. This is not like making a ca...
Tags: Embodied AI, Expert opinions
14:37 · swyx @aiDotEngineer WF @swyx · 16
Quoted: swyx @aiDotEngineer WF: i havent watched all the online talks yet but am binging this one now and it is exceptional. we are very lucky to have a...
Tags: Expert opinions, Safety/Alignment
14:24 · 数字生命卡兹克 @Khazix0918 · 30
Optimizing workflows with Claude Fable 5: a Max account runs dry in 1.5 hours
Tags: Anthropic, Expert opinions, Coding
14:06 · Rohan Paul @rohanpaul_ai · 45
Tags: Expert opinions, Deployment/Engineering
13:00 · Ethan Mollick @emollick · 43
Tags: Agents, Expert opinions
12:26 · Peter Steinberger 🦞 @steipete · 14
Tags: Expert opinions, Trends
08:34 · Rohan Paul @rohanpaul_ai · 41
Anthropic releases Claude Sonnet 5: a cheaper model for running agents, but with uneven upgrades
Tags: Expert opinions, Model releases
08:10 · Hao AI Lab @haoailab · 51
DSpark vs. JetSpec: two speculative decoding techniques built around causal consistency
Tags: Expert opinions, Reasoning/Inference, Deployment/Engineering
07:37 · meng shao @shao__meng · 14
Quoted: Claude: Fable 5 is back.
Tags: Other, Expert opinions
07:00 · Ethan Mollick @emollick · 41
Tags: Expert opinions, Evals/Benchmarks
06:07 · elvis @omarsar0 · 33
Tags: Anthropic, OpenAI, Expert opinions, Coding
05:29 · Ethan Mollick @emollick · 47
Quoted: Mikhail Parakhin: I have this struggle with my own teams, too: many think it is a great idea to save money/latency/sanity by running a pre...
Tags: OpenAI, Expert opinions, Reasoning/Inference
05:07 · elvis @omarsar0 · 38
Quoted: Tyler: If GPT-5.6 matches Fable 5 performance, but without the 50% limit + 7 days restriction, it's over for Anthropic
Tags: Anthropic, OpenAI, Expert opinions
04:59 · Ethan Mollick @emollick · 47
Quoted: Mikhail Parakhin: I have this struggle with my own teams, too: many think it is a great idea to save money/latency/sanity by running a pre...
Tags: OpenAI, Expert opinions, Reasoning/Inference
04:52 · Chubby♨️ @kimmonismus · 15
Quoted: Chubby♨️: FABLE 5 IS BACK
Tags: Anthropic, Expert opinions
04:37 · elvis @omarsar0 · 50
Quoted: elvis: Same here. Happy with Opus 4.8 (planning) and GPT-5.5 (execution). Also, breaking steps into smaller ones for increasing...
Tags: Anthropic, OpenAI, Expert opinions, Reasoning/Inference
04:29 · Ethan Mollick @emollick · 48
Tags: Agents, Expert opinions
04:07 · elvis @omarsar0 · 23
Quoted: Claude: Fable 5 is back.
Tags: Anthropic, Expert opinions
03:59 · Ethan Mollick @emollick · 54
Tags: Expert opinions, Data/Training
03:52 · Chubby♨️ @kimmonismus · 46
Tags: Expert opinions, Trends