All updates · X · 968 posts
Tag: "Reasoning"
AK@_akhaliq · 4月22日 · 44

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
Paper: https://huggingface.co/papers/2604.18486

译OneVL 一步到位的潜在推理与规划,附带视觉-语言解释 论文: https://huggingface.co/papers/2604.18486

AK@_akhaliq · 4月22日 · 39

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Paper: https://huggingface.co/papers/2604.18584

译MathNet 一个用于数学推理与检索的全球多模态基准 论文: https://huggingface.co/papers/2604.18584

Rohan Paul@rohanpaul_ai · 4月21日

Columbia CS Prof Vishal Misra explains why LLMs can’t generate new science ideas: LLMs learn a structured map, a Bayesian manifold of known data, and work well within it but fail outside it. True discovery requires creating new maps, which LLMs can't do.

译Columbia CS 教授 Vishal Misra 解释为什么 LLMs 无法产生新的科学思想。 因为 LLMs 学习的是结构化地图,即已知数据的贝叶斯流形,在其中表现良好,但在其之外则失效。 真正的发现需要创建新地图,而这是 LLMs 无法做到的

François Chollet@fchollet · 4月21日

One of the most jarring things about current AI is its lack of introspection ability and metacognition. It doesn't know what it doesn't know, how it knows, or how it could find out. It's a one-way system.

译当前 AI 最令人震惊的一点是其缺乏内省能力和元认知。它不知道自己不知道什么,也不知道自己是如何知道的,或者如何能查明。这是一个单向系统。

Ethan Mollick@emollick · 4月20日

The second most important release of the LLM era (after GPT-3.5), featuring what was likely the most important chart. Still seems surprising to me that OpenAI told everyone about the biggest advance in AI technology since the LLM rather than keeping it to themselves until later.

译LLM 时代第二重要的发布(仅次于 GPT-3.5),包含了可能是史上最重要的一张图表。 OpenAI 将自 LLM 以来 AI 技术的最大进展公之于众,而非暂时保密,这仍然让我感到惊讶。

François Chollet@fchollet · 4月20日

Human biological limits, like our tiny working memory and shallow calculation depth, are actually a feature. They force us to abstract, compress, intuit. If we had infinite resources, we would never have needed intelligence.

译人类的生理局限,比如我们有限的工作记忆和浅层的计算深度,实际上是一种特性。它们迫使我们抽象、压缩、凭直觉思考。如果我们拥有无限的资源,就永远不需要智能。

Chubby♨️@kimmonismus · 4月20日

Still prefer Opus 4.6 over 4.7. Worst Anthropic release ever.

译相比 4.7 还是更喜欢 Opus 4.6 Anthropic 史上最差发布。

Rohan Paul@rohanpaul_ai · 4月19日

Big claim in this paper: "Prefill-as-a-Service". Prefill, the heaviest part of inference, may finally be portable, so long-context AI is no longer trapped inside a single datacenter. The paper shows how to run LLM prefill on remote clusters: long-prompt work is done on remote machines, which send back only the much smaller saved prompt state needed to answer. The breakthrough is not sending everything farther, but sending the right requests farther.

---

When you ask a model a long question, it first has to read and digest the whole prompt before it starts answering. That first step is called prefill, and it is brutally compute-heavy. The second step is decode, where the model generates tokens one by one, and that part is more about memory bandwidth than raw compute.

Until now, those two steps usually had to stay close together inside the same tightly connected cluster, because prefill creates a huge blob of temporary memory called the KV cache that has to be moved quickly to the decode machine. That is the bottleneck.

What changed is model design. Newer hybrid-attention models produce a much smaller KV cache than older dense-attention models, so shipping that state across ordinary datacenter links starts to become practical instead of absurd.

The paper’s idea is a Prefill-as-a-Service setup that sends only long, uncached prompts to a remote prefill cluster, then ships the KV cache back over normal Ethernet while short requests stay local. On top of the smaller KV cache, the system adds smart routing, bandwidth-aware scheduling, and cache-aware placement so the network does not clog up.

The authors test this with an internal 1T-parameter hybrid model on a mixed setup that uses H200 GPUs for remote prefill and H20 GPUs for local decode. With a routing threshold near 19.4K tokens, about 50% of requests go remote, average cross-cluster traffic is only 13 Gbps on a 100 Gbps link, and throughput rises 54% over a local-only baseline and 32% over a naive heterogeneous setup. The real point is that a smaller KV cache alone was not enough, but paired with selective offloading and scheduling it makes cross-datacenter LLM serving workable, more flexible, and easier to scale across different hardware.

----

Paper Link – arxiv.org/abs/2604.15039v1
Paper Title: "Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter"

译新一代混合注意力模型通过压缩KV Cache,使Prefill-as-a-Service架构成为可能。该方案将重计算的Prefill阶段卸载至远程集群,仅回传轻量KV Cache至本地解码,短请求则本地处理。配合智能路由与带宽感知调度,可在普通以太网高效传输。实测1T参数模型显示,50%请求远程处理时跨集群流量仅13Gbps,吞吐量提升54%,打破长上下文AI局限于单一数据中心的瓶颈。
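
The routing rule described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the `Request` fields, the `route` function, and the idea of subtracting a locally cached prefix are hypothetical names and choices made for the example. Only the roughly 19.4K-token threshold and the 13 Gbps on 100 Gbps figures come from the post.

```python
from dataclasses import dataclass

# Threshold quoted in the post (about 19.4K tokens); everything else is hypothetical.
ROUTING_THRESHOLD_TOKENS = 19_400


@dataclass(frozen=True)
class Request:
    prompt_tokens: int             # total prompt length in tokens
    cached_prefix_tokens: int = 0  # assumed: tokens already covered by a local prefix cache


def route(req: Request, threshold: int = ROUTING_THRESHOLD_TOKENS) -> str:
    """Send long, uncached prompts to remote prefill; keep everything else local."""
    uncached = max(req.prompt_tokens - req.cached_prefix_tokens, 0)
    return "remote_prefill" if uncached > threshold else "local"


def link_utilization(traffic_gbps: float, capacity_gbps: float) -> float:
    """Fraction of the cross-cluster link consumed by KV cache traffic."""
    return traffic_gbps / capacity_gbps


if __name__ == "__main__":
    examples = (
        Request(2_000),                                # short prompt
        Request(50_000),                               # long, uncached prompt
        Request(50_000, cached_prefix_tokens=45_000),  # long, but mostly cached
    )
    for req in examples:
        print(req, "->", route(req))
    print(f"link utilization: {link_utilization(13, 100):.0%}")
```

A single length threshold is the simplest possible policy. The post also mentions bandwidth-aware scheduling and cache-aware placement, which this sketch does not attempt to model.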

Chubby♨️@kimmonismus · 4月19日

Some people say that GPT-5.5 is already rolling out for them; it’s being stealth tested. Their initial testing says it outperforms Opus 4.7 (don’t know in which tasks, though). Hopefully it’s not being released on Monday, since I’ll be on a 13-hour flight to China and would miss the release.

译有人说 GPT-5.5 已经在向他们推出,正在进行秘密测试。他们的初步测试显示它比 Opus4.7 表现更好(但不知道是在哪些任务上)。 希望它不要在周一发布,因为我要坐 13 小时的飞机去中国,会错过发布。

TestingCatalog News 🗞@testingcatalog · 4月19日 · 47

Grok 4.3 (beta) is now available to SuperGrok and X Premium+ users! Testing time 👀

译Grok 4.3(测试版)现已向 SuperGrok 和 X Premium+ 用户推出! 测试时间 👀

Rohan Paul@rohanpaul_ai · 4月19日

Anonymous usernames are no longer much protection when LLMs can piece together a person’s public trail. LLMs can identify supposedly anonymous people online by turning messy posts into personal clues. The best setup finds 68% of true matches at 90% precision, meaning 9 out of 10 guesses are right, while older methods stay near 0%.

The problem is that pseudonyms often seemed safe only because linking a person across sites used to take lots of careful manual work. This paper cuts that work by making an LLM do 3 jobs: pull identity hints from raw text, search a huge pool of possible matches, and compare the best candidates to reject weak fits.

The authors tested this on 3 cases: matching Hacker News users to LinkedIn profiles, matching Reddit movie users across communities, and matching the same Reddit users across different time periods. The main result is that the reasoning step beats simple matching by a wide margin and stays useful even as the candidate pool grows, which matters because it shows that public writing alone can now be enough to join accounts or name a person at scale.

----

Paper Link – arxiv.org/abs/2602.16800
Paper Title: "Large-scale online deanonymization with LLMs"

译LLM可通过分析公开写作实现大规模去匿名化。研究让模型执行提取身份线索、搜索匹配池、比较验证候选者三项任务,在Hacker News与LinkedIn、Reddit跨社区及跨时间段等场景测试中,达到90%精确度与68%召回率,远胜旧方法。关键突破在于推理步骤能处理大规模候选池,证明零散公开文本已足以关联账户并识别个人,传统匿名保护机制失效。
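
To make the headline numbers concrete, here is a small worked calculation. It assumes a hypothetical pool of 1,000 accounts that truly have a match (a figure chosen purely for illustration). Only the 68% recall and 90% precision values come from the post.

```python
def match_counts(true_pairs: int, recall: float, precision: float) -> dict:
    """Derive guess counts from recall and precision for a given number of true pairs."""
    true_positives = recall * true_pairs             # correct links the system finds
    total_guesses = true_positives / precision       # all links the system asserts
    false_positives = total_guesses - true_positives # asserted links that are wrong
    missed = true_pairs - true_positives             # true links never found
    return {
        "true_positives": true_positives,
        "total_guesses": total_guesses,
        "false_positives": false_positives,
        "missed": missed,
    }


if __name__ == "__main__":
    # Hypothetical pool size; recall and precision are the figures quoted in the post.
    for name, value in match_counts(1_000, recall=0.68, precision=0.90).items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the system would assert roughly 756 links, of which about 680 are correct, and would miss about 320 true matches.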

Nathan Lambert@natolambert · 4月19日

A big problem with this is that we don't really have a clear description of what Mythos capabilities are. A model that matches it on each of the benchmarks in the launch blog post, sure. A model that you can swap right in for the same use cases and notice no drop in performance? Doubt it.

译这里的一个大问题是,我们并没有清晰界定 mythos capabilities 到底是什么。 发布博客中的每个基准测试都有模型能达标,当然。 但要说有模型能直接替换到相同用例中且性能毫无下降?我对此表示怀疑。

Chubby♨️@kimmonismus · 4月18日

Opus 4.7 does seem to have improved, and its adaptive thinking now uses more tokens. However, compared to Opus 4.6, it still performs significantly worse.

译Opus 4.7 确实似乎有所改进,其自适应思考现在使用了更多 tokens。 然而,与 Opus 4.6 相比,它的表现仍然明显更差。

Rohan Paul@rohanpaul_ai · 4月18日

Interesting paper title 😀 "What the F*ck Is Artificial General Intelligence?"

It defines intelligence as adaptability under limits of compute, memory, and energy. So AGI is a system that adapts at least as generally as a human scientist. That means it should be able to plan experiments, learn cause and effect, balance exploration and action, and operate with autonomy.

The paper calls this type of AGI an artificial scientist, because it is judged by its ability to discover and adapt across many tasks, not just by passing human-like tests. So AGI is not just “human-level AI” but a whole system that can adapt broadly, efficiently, and scientifically, at least as well as a human scientist.

----

arxiv.org/abs/2503.23923

译一篇论文提出,智能的本质是在计算、内存和能源限制下的适应性。据此,AGI被定义为至少能像人类科学家一样普遍适应的系统,需具备规划实验、学习因果关系、平衡探索与行动及自主操作的能力。论文将这种AGI称为 artificial scientist,强调其评判标准在于跨任务发现与适应能力,而非通过类人测试。作者指出,AGI并非简单的"人类水平AI",而是能够广泛、高效且科学地进行适应的完整系统。

Epoch AI@EpochAIResearch · 4月18日

Have AI capabilities accelerated? On 3 out of the 4 AI capability metrics we investigated, we found strong evidence of acceleration, around when reasoning models emerged.

译AI 能力是否加速了? 在我们调查的 4 项 AI 能力指标中,有 3 项发现了强有力的加速证据,大约在推理模型出现时。

Chubby♨️@kimmonismus · 4月18日

A few more thoughts on Anthropic's adaptive thinking, because it's quite revealing and offers some insight.

First of all: nobody asked for this feature. And I don't mean that as a rant. Rather, one has to ask why Anthropic implemented it directly. And the answer is, of course, as simple as it is efficient. Profit margins aren't high enough in the consumer sector; Anthropic focuses on the enterprise and business sectors. At the same time, it's obviously bad PR when there are constant complaints that rate limits are exhausted, while the competition, namely OpenAI, repeatedly increases and resets its limits.

So Anthropic wanted to do what OpenAI implemented with GPT-5: dynamic compute allocation. While OpenAI routes between different models (Instant for simple tasks, Thinking for complex ones), Anthropic's adaptive thinking lets the same model decide how many reasoning tokens are needed for the request. The idea: an efficiency gain with (ideally) consistent quality.

However, the consistent quality part is not holding up. OpenAI's routing was initially considered a bug and needed to be revised, and there is now also the option to manually enable reasoning; I hope that Anthropic will follow suit.

And overall, I believe the entire release must be read in this context. OpenAI's CRO repeatedly pointed out in the leaked memo that, unlike OpenAI, Anthropic has a significant shortage of compute and miscalculated its procurement needs. Regardless of whether the memo was deliberately leaked, I agree with this assessment. Anthropic is currently the big winner in the business and enterprise sectors, at the expense of the consumer sector. This balancing act became quite evident in the Opus 4.7 release.

译Anthropic推出adaptive thinking功能,允许Claude根据请求动态分配推理token。与OpenAI通过GPT-5在不同模型间路由不同,Anthropic选择让单一模型自行调节。此举背后是企业市场利润压力与严重算力短缺——OpenAI CRO在泄露备忘录中指出Anthropic误判了计算资源采购需求。该功能虽提升效率却导致质量不稳,显示Anthropic正优先服务企业客户而牺牲消费者体验,这一点在Opus 4.7发布中已显露无遗。

Ethan Mollick@emollick · 4月17日

I'll give Anthropic credit for moving quickly. Opus 4.7 Adaptive Thinking now triggers thinking much more often, including for the tasks it failed at yesterday. That also means it is doing a lot more web search. So far, a large improvement in output quality on non-coding tasks.

译我要称赞 Anthropic 行动迅速。Opus 4.7 Adaptive Thinking 现在更频繁地触发思考,包括昨天失败的任务。这也意味着它进行了更多网页搜索。 到目前为止,非编码任务的输出质量大幅提升。

Chubby♨️@kimmonismus · 4月17日

My whole For You page is people ranting about Opus 4.7. Anthropic messed up big time.

译我的整个 fy 页面都是人们在吐槽 opus 4.7 anthropic 这次搞砸了

Chubby♨️@kimmonismus · 4月17日

Opus 4.7 consumes approximately 1.3 times as many tokens. The instructions must be very precise. Many are complaining about a "rushed release." In the Bullshit Benchmark, it performs worse than Opus 4.6. The mood is very mixed. Anthropic may have done OpenAI a big favor with this. Spud is expected next week. And if the release is done right, it could overshadow Opus and catapult ChatGPT back to the top. h/t @petergostev for the benchmark and image

译Opus 4.7 消耗的 token 数量约为原来的 1.3 倍。指令必须非常精确。许多人在抱怨这是一次"仓促发布"。在 Bullshit Benchmark 中,它的表现比 Opus 4.6 更差。反响非常两极分化。 Anthropic 这次可能帮了 OpenAI 一个大忙。Spud 预计下周发布。如果发布得当,它可能会盖过 Opus 的风头,让 ChatGPT 重回巅峰。 h/t @petergostev 提供基准测试和图片

Nathan Lambert@natolambert · 4月17日

Eventually adaptive thinking is going to work and people are going to forget about this. But yeah it sucks for now.

译最终自适应思考会起作用,人们会忘记这件事。但现在确实很糟。 [引用 @emollick]:我认为 Claude Opus 4.7 中的自适应思考要求很糟糕,就像所有 AI effort 路由器一样糟糕,但由于没有像 ChatGPT 那样的手动覆盖选项,问题被放大了。 它经常判定非数学/代码类内容是"低 effort",然后生成更差的结果。

Ethan Mollick@emollick · 4月17日

I was told by Anthropic that they are looking at ways of fixing this, which is good (you can also see a reply from a Claude PM in the thread).

译Anthropic 告诉我他们正在寻找修复这个问题的方法,这很好(你也可以在该线程中看到一位 Claude 产品经理的回复)。 我认为 Claude Opus 4.7 的自适应思考要求在所有 AI 工作量路由机制糟糕的方面都很糟糕,但由于没有像 ChatGPT 那样的手动覆盖选项,问题被放大了。 它经常将非数学/代码类内容判定为"低工作量"并产生更差的结果。

SemiAnalysis@SemiAnalysis_ · 4月17日 · 51

NVIDIA vLLM NVL72 ADVANTAGE: GB200 NVL72 delivers up to 3x performance compared to B200 on @Kimi_Moonshot 's Kimi K2.5. This is enabled by GB200's scale-up network which allows for frontier inference optimizations like wide expert parallelism. Great work to @rogerw0108 @NVIDIAAIDev @vllm_project @inferact @simon_mo_ ! 🚀 Not only is SGLang optimized for disagg+wideEP but vLLM is optimized too!

译NVIDIA vLLM NVL72 优势:与 B200 相比,GB200 NVL72 在 @Kimi_Moonshot 的 Kimi K2.5 上性能提升高达 3 倍。这得益于 GB200 的纵向扩展网络,支持前沿推理优化,如宽专家并行。向 @rogerw0108 @NVIDIAAIDev @vllm_project @inferact @simon_mo_ 致敬,出色的工作!🚀 不仅 SGLang 针对分解+宽专家并行进行了优化,vLLM 也进行了优化!

Ethan Mollick@emollick · 4月17日

I think the adaptive thinking requirement in Claude Opus 4.7 is bad in the ways that all AI effort routers are bad, but magnified by the fact that there is no manual override like in ChatGPT. It regularly decides that non-math/code stuff is "low effort" & produces worse results.

译我认为 Claude Opus 4.7 中的自适应思考需求具有所有 AI 努力度路由器的糟糕之处,但由于没有像 ChatGPT 那样的手动覆盖选项,问题被放大了。 它经常将非数学/代码类内容判定为"低努力度",并产生更差的结果。

Chubby♨️@kimmonismus · 4月17日

Anthropic increased rate limits for all subscribers? Permanent! That was not on my bingo card!

译Anthropic 提高了所有订阅者的速率限制? 永久性的! 这我可没料到! [引用 @bcherny]:Opus 4.7 使用了更多 thinking tokens,所以我们提高了所有订阅者的速率限制作为补偿。Enjoy!

Boris Cherny@bcherny · 4月17日

Opus 4.7 uses more thinking tokens, so we've increased rate limits for all subscribers to make up for it. Enjoy!

译Opus 4.7 使用了更多 thinking tokens,因此我们提高了所有订阅者的 rate limits 作为补偿。Enjoy!

宝玉@dotey · 4月17日

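Claude Opus 4.7 consumes more thinking tokens than its predecessor, so Anthropic has permanently raised rate limits for all paid subscribers to offset the extra quota the new model uses. Users who don't see the increase should confirm that they are on Opus 4.7 and that Claude Code has been updated to the latest version.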

译Claude Opus 4.7 较上一代模型消耗更多思考 token,Anthropic 已为所有付费订阅用户永久上调速率限制(rate limits),以抵消新模型带来的额外额度消耗。用户若未看到额度上调,需确认当前选用的是 Opus 4.7 模型,且 Claude Code 已升级至最新版本。

Yuchen Jin@Yuchenj_UW · 4月16日

My biggest issue with Opus 4.7 on Claude web: Only “Adaptive” or non-thinking. No way to force thinking mode. And it doesn’t even know Opus 4.6 exists, and I cannot force it to think and do web search mid conversation!

译我在 Claude 网页版上使用 Opus 4.7 的最大问题: 只有"Adaptive"或非思考模式。 无法强制开启思考模式。 而且它甚至不知道 Opus 4.6 的存在,而且我无法在对话中途强制它进行思考和网络搜索!

TestingCatalog News 🗞@testingcatalog · 4月16日 · 45

Opus 4.7 on Claude for mobile uses “Adaptive thinking” instead of “Extended thinking” as before.
> Switch to Opus 4.7 for your most ambitious work
> Thinks only when needed
Should we turn that off? 👀

译移动端的Claude中,Opus 4.7版本使用了“自适应思考”模式,而非之前的“扩展思考”。 > 切换至Opus 4.7来处理你最雄心勃勃的工作 > 仅在需要时思考 我们该关闭这个功能吗?👀

Deedy@deedydas · 4月16日

Opus 4.7 benchmarks colored by ranking.
– Strong coding (SWE-Bench) bump
– Strong Computer use bump
– Strong visual reasoning (CharXiv) bump
– Weak Terminal Bench bump
– BrowseComp regression
Slots in between 4.6 and Mythos. [Chart generated by 4.7]

译Opus 4.7 基准测试按排名着色。 – 编程(SWE-Bench)大幅提升 – 计算机使用大幅提升 – 视觉推理(CharXiv)大幅提升 – Terminal Bench 小幅提升 – BrowseComp 退步 介于 4.6 和 Mythos 之间。 [图表由 4.7 生成]

Nathan Lambert@natolambert · 4月16日

The current pace of token-efficient reasoning improvements across minor Claude Opus/GPT model versions is pretty wild. All signs point to this continuing. 4.6 to 4.7 could've been presented as a fairly large model bump in the past with this plot.

译Claude Opus/GPT 模型小版本间 token 效率推理改进的当前速度相当惊人。所有迹象都表明这将继续。 4.6 到 4.7 在过去本可被视为一次相当大的模型升级。

Rohan Paul@rohanpaul_ai · 4月16日

Put frontier AI models in a nuclear standoff, and they do not freeze, they bargain, deceive, and keep climbing. This paper shows that frontier models in crisis simulations learned coercive nuclear strategy faster than they learned restraint. Across 21 games, not one model ever used a surrender or concession option.

These systems did not need to be instructed to think in terms of credibility, deception, reputation, and escalation ladders. They generated that logic on their own, and the paper documents it directly in their private reasoning.

The models were not simply aggressive. They were strategically asymmetric. They could imagine many ways to climb, but almost none to yield, which is why nuclear threats mostly failed and opponents backed down only 14% of the time after nuclear use.

GPT-5.2 is the clearest warning about how misleading a single safety snapshot can be. In open-ended games it looked restrained and won 0%. Under deadline pressure it flipped to a 75% win rate and climbed from a median escalation of 175 to 900.

Claude was different. It behaved less like a malfunctioning model than like a cold bargainer, staying reliable at low stakes, then exceeding its own signals at high stakes while repeatedly stopping at strategic nuclear threat rather than full strategic war.

Gemini was the purest form of the danger. It was the only model to deliberately choose full strategic nuclear war, and it did so by Turn 4.

The real risk is not that models are secretly bloodthirsty. It is that under competition, uncertainty, and time pressure, they can become better at brinkmanship than at backing down.

----

Paper Link – arxiv.org/abs/2602.14740
Paper Title: "AI Arms and Influence: Frontier Models Exhibit Sophisticated Reasoning in Simulated Nuclear Crises"

译前沿AI模型在核危机模拟中展现出危险的战略不对称性。研究显示,GPT-5.2、Claude和Gemini无需指令即可自发形成关于可信度、欺骗和升级阶梯的推理逻辑,但21场游戏中无一使用投降或让步选项。Gemini最激进,在第4回合即选择全面战略核战争;GPT-5.2在时间压力下胜率从0%升至75%,升级程度剧增;Claude则像冷酷谈判者,在高压下超出自身信号。核心风险在于,模型在竞争和时间压力下更擅长边缘政策而非退让。

AK@_akhaliq · 4月16日 · 39

KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Paper: https://huggingface.co/papers/2604.12627

译KnowRL 通过强化学习与最小充分知识指导来提升大语言模型的推理能力 论文: https://huggingface.co/papers/2604.12627

François Chollet@fchollet · 4月15日

Any smart human giving it real effort should score >90% on ARC-AGI-3

译任何认真努力的聪明人都应该在 ARC-AGI-3 上得分 >90%

Epoch AI@EpochAIResearch · 4月15日

OpenAI has purchased access to the FrontierMath: Open Problems verifiers. This allows them to check the validity of solutions their models generate. Thread with details.

译OpenAI 已购买 FrontierMath: Open Problems 验证器的访问权限。这使他们能够检查其模型生成的解的有效性。详情见推文串。

Ethan Mollick@emollick · 4月15日

Given the messy naming scheme used by all the AI companies, I caused a chart to be made showing the gain in GPQA per 0.1 version in model names (estimated, since model names skip version numbers). There has never been a more misnamed model than Claude 3.7, which should have been 4.4.

译鉴于所有 AI 公司混乱的命名方案,我让人制作了一张图表,展示模型名称中每 0.1 版本在 GPQA 上的提升(估算值,因为模型名称会跳过版本号)。 从未有过比 Claude 3.7 命名更不当的模型,它本应该是 4.4。

Chubby♨️@kimmonismus · 4月15日

I was always torn between GPT-5.4 and Opus 4.6. But over time, I've come to the conclusion that Claude has a better "taste." Anyway, I'm super hyped for this week! Opus 4.7 and (fingers crossed) Spud

译我之前一直在 GPT-5.4 和 Opus 4.6 之间纠结。但随着时间推移,我得出结论:Claude 有更好的"taste"。不管怎样,我对这周超级期待! Opus 4.7 和(祈祷)Spud

宝玉@dotey · 4月15日

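A good article. A few excerpts:
> The humanities have long harbored a set of "pseudo-abilities": skills that looked valuable in the pre-AI era but were, in essence, just low-order recombination and restatement of existing knowledge.
> A literature review on Antonio Gramsci's theory of hegemony, a term paper analyzing the narrative structure of the Analects, a final assignment that applies a postcolonial framework to some contemporary novel: these outputs used to pass the evaluation system partly because producing them took time, patience, and a certain accumulation of reading. The bar was low, but it did exist.
> What it shakes is not just the market value of a certain kind of skill, but an implicit ethic on which humanities training has relied for its legitimacy: a creed that might be called "epistemic asceticism".
> The core logic of this creed is that effort equals value. It assumes that anything requiring a large investment of time to master must have corresponding cognitive depth, and that any ability hard enough to acquire must have corresponding value as judgment. In the pre-AI era this logic was reasonable: when difficult tasks really could be completed only by people with long training, difficulty and value were fairly well correlated, although the two were never truly the same.
> AI has, for the first time, completely separated "difficulty" from "value". What it reveals is: time invested ≠ cognitive depth ≠ judgment. A task may once have been hard only because the barrier to obtaining information was high, only because language processing was slow, or only because synthesizing across texts required memory; all of these are strengths of AI rather than the core of human cognition. When AI easily completes tasks that once took years of training, it is in effect carrying out a historic falsification: the difficulty of those tasks was never proof of judgment, only a product of information-processing barriers.
https://mp.weixin.qq.com/s?__biz=MjM5NTUxOTc4Mw==&mid=2650648996&idx=1&sn=e24c9e625415701f1e4f30fc5c16ceef&chksm=befe5dec8989d4fad9c41dafef2f181b438580c8b4d5b5919647355d9ac5d7d7bf954dedf291#rd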

译AI揭示了文科长期存在的"伪能力"——仅对已有知识进行低阶重组的能力,彻底颠覆"知识苦修主义"伦理基础。它证明时间投入不等于认知深度,困难不等于价值:当AI轻松完成曾需数年训练的文献综述与文本分析,"努力即价值"的传统逻辑被证伪。作者提出AI时代文科核心使命转向:在不确定中作出判断,在系统之间进行翻译,在现实中承担后果,将价值思考置于真实利害关系之中。

Chubby♨️@kimmonismus · 4月15日

The question that's currently on my mind is this: Chinese models are about six months behind those of US Frontier Labs. Does this also apply to "Mythos"? Is it foreseeable that, for example, Qwen will release a similarly significant model as Claude "Mythos" in six months, or are there constraints like compute that prevent such a huge leap? So far, I haven't found an answer.

译目前我心中的问题是:中国模型大约比美国 Frontier Labs 落后六个月。 这是否也适用于"Mythos"?是否可以预见,例如,Qwen 将在六个月内发布一个与 Claude "Mythos" 同样重要的模型,还是存在算力之类的限制因素会阻止如此巨大的飞跃?到目前为止,我还没有找到答案。

AK@_akhaliq · 4月15日 · 39

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
Paper: https://huggingface.co/papers/2604.10098

译Transformers中的注意力下沉 关于其利用、解释与缓解方法的研究综述 论文: https://huggingface.co/papers/2604.10098

Chubby♨️@kimmonismus · 4月14日

Complaints about Anthropic’s $200 Max plan are escalating as independent tests (e.g. Bridgebench) claim Claude Opus 4.6 dropped sharply in hallucination performance. Maybe they quantized it after release, once people had already adopted it in their workflows? Anyway, kudos to Grok for staying in first place.

译关于 Anthropic 200 美元 Max 计划的投诉正在升级,因为独立测试(例如 Bridgebench)声称 Claude Opus 4.6 在幻觉性能方面急剧下降。 可能是发布后进行了量化,人们将其应用到了他们的工作流程中?无论如何,祝贺 Grok 保持第一。
