Grok 4.20 Reasoning just took the #1 spot on the BridgeBench reasoning benchmark. 🔥 Beating GPT-5.4, Claude Opus 4.6, Google Gemini and others. Week after week, Grok keeps climbing across benchmarks. 🚀

AK@_akhaliq · Apr 14

Process Reward Agents for Steering Knowledge-Intensive Reasoning paper: https://huggingface.co/papers/2604.09482

Rohan Paul@rohanpaul_ai · Apr 13

This Baidu paper found a way to use the clean, reliable rewards of RL on tasks like writing and subjective answers, where there is usually no single “correct” output. Instead of asking “is this response correct?”, they ask “which of these two responses is better?”, and that simple reformulation appears to improve open-ended reasoning better than standard reward-model training on their benchmarks. In other words, it turns open-ended writing into verifiable choices, and RL starts working there too. Across seven open-ended benchmarks, the method beats a matched RLHF baseline by an average 3.29 points on a 14B reasoning model.
The clever part is not a better reward model. It is a change in what the model is asked to do during training. Instead of grading a poem or subjective answer directly, the system sees two candidate responses, one preferred and one rejected, and learns to identify which is better. Multiple choice creates a clean binary signal, so the model can be trained with the same kind of verifiable reward that made RL powerful in math and code, without pretending open-ended tasks have one canonical answer.
The gain is probably not just better taste imitation. The paper’s DPO ablation underperforms badly, which suggests the benefit comes from learning a contrastive verification habit, not merely absorbing preference pairs.
The authors also catch an important failure mode: train only on these choice tasks and responses get unnaturally short. So they mix in a small RLHF objective to keep output length from collapsing, and the resulting model appears more useful rather than merely more terse.
The strongest claim here is not that open-ended evaluation is solved. It is that reasoning can be improved when you replace fuzzy scoring with structured comparison, which may be a more general lesson for alignment than this paper admits.
----
Paper link: arxiv.org/abs/2511.02463
Paper title: "Extending RLVR to Open-Ended Tasks via Verifiable Multiple-Choice Reformulation"

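Summary (translated): The Baidu paper reformulates open-ended tasks such as writing and subjective answers as verifiable multiple-choice problems, replacing direct scoring with pairwise comparison to give RL a clean reward signal. Across 7 benchmarks, the 14B model averages 3.29 points above the RLHF baseline. The key change is the form of the training task: the model learns to tell better from worse through contrastive verification rather than simply absorbing preference pairs. The authors also find that an RLHF objective must be mixed in to keep output length from collapsing. The result suggests that replacing fuzzy scoring with structured comparison may be a general alignment strategy for improving reasoning.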
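The reformulation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the prompt wording, the `ChoiceTask` type, and both function names are assumptions made for the example. It shows how a preference pair becomes a two-option question whose answer can be checked mechanically, which is what yields a binary reward.

```python
import random
from dataclasses import dataclass


@dataclass
class ChoiceTask:
    prompt: str         # text shown to the policy model
    correct_label: str  # "A" or "B", whichever slot holds the preferred response


def build_choice_task(question: str, preferred: str, rejected: str,
                      rng: random.Random) -> ChoiceTask:
    """Place the preferred and rejected responses in random A/B order.

    Randomizing the order keeps the model from learning a position shortcut.
    The prompt template is hypothetical, not taken from the paper.
    """
    preferred_first = rng.random() < 0.5
    first, second = (preferred, rejected) if preferred_first else (rejected, preferred)
    prompt = (
        f"Question: {question}\n"
        f"Response A: {first}\n"
        f"Response B: {second}\n"
        "Which response is better? Answer with A or B."
    )
    return ChoiceTask(prompt, "A" if preferred_first else "B")


def verifiable_reward(model_answer: str, task: ChoiceTask) -> float:
    """Binary reward: 1.0 if the model picked the preferred slot, else 0.0."""
    picked = model_answer.strip().upper()[:1]
    return 1.0 if picked == task.correct_label else 0.0


rng = random.Random(0)
task = build_choice_task("Write a haiku about rain.", "a vivid haiku", "a flat haiku", rng)
print(task.prompt)
print("reward for answering", task.correct_label, "=", verifiable_reward(task.correct_label, task))
```

The post also notes that training on choice tasks alone shortens responses, so a full setup would mix this reward with a conventional RLHF objective; that part is not sketched here.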

Ethan Mollick@emollick · Apr 13

Currently, ChatGPT has the best way of viewing thinking traces: a short summary of steps in the main window, and a detailed audit in the sidebar if you want it. Claude does almost as well, but is more summarized, and it is harder to see calculations and code. It's a big weak spot for Gemini.

Yuchen Jin@Yuchenj_UW · Apr 13

Seeing rumors that Claude Opus 4.6 got nerfed. Usually this boils down to 3 cases:
- Unintentional. For example, a regression caused by changes in the inference stack or Claude Code. This is what evals are for before rolling out.
- Intentional “optimizations” (quantization, reduced reasoning). If so, say it. If users pay for a model, they should get that model.
- User psychology. The more you use a model, the dumber it feels.

Rohan Paul@rohanpaul_ai · Apr 12

Reasoning tokens in LLMs are not equal. Models seem to know which parts of their own reasoning matter most. What survives pruning is usually the part doing actual computational work, not the fluent narration wrapped around it. The method is clever in a plain way. Start with a full chain of thought, delete one token at a time, and keep deleting whichever removal hurts the model’s likelihood least. The resulting order becomes a functional ranking, not of what sounds important to us, but of what the model itself seems to need. Here’s the interesting part. If a model’s reasoning were just verbose decoration, pruning should look mostly random once you preserve the answer. Instead, the paper finds structure. Symbolic math tokens survive pruning far more than grammar, narration, and referential bookkeeping, which means the model is not treating all tokens as equally useful. That matters because the test is behavioral, not rhetorical. Students trained on these greedily pruned chains do better than students trained on several other pruning baselines, including a method supervised by a frontier model, at the same reasoning length. So the pruning signal is not merely interpretable. It is useful. The deeper point is that importance is dynamic. A token that looks expendable early can become important later as surrounding context disappears, which argues against the comforting idea that reasoning has a fixed salience map you can read off once and reuse forever. And yet the signal is not inaccessible. The paper shows attention patterns alone can predict pruning scores surprisingly well, suggesting that functional importance is partly visible in the model’s internals before you do the expensive deletion game. So this is less about making chain-of-thought shorter than about making it legible. The claim is not that pruned tokens are causally irrelevant in any philosophical sense. 
The cleaner claim is better: LLMs appear to encode a workable internal ranking of which reasoning tokens are carrying the load.

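Summary (translated): The study uses greedy pruning (repeatedly deleting the token whose removal least affects model likelihood) to assess the functional importance of LLM reasoning tokens. Symbolic math tokens survive pruning better than grammar and narration, indicating that the model has an internal importance ranking. Importance is dynamic: a token that is expendable early can become critical once surrounding context is removed. Attention patterns can predict pruning scores, so functional importance is partly visible in the model's internals. The finding is more about making chain-of-thought legible than merely shortening it.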
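The greedy deletion procedure described above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_score` is a hypothetical stand-in for the model's likelihood of the answer given the surviving tokens (a real run would query an LLM), and the function names are invented for the example.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_prune_order(tokens: Sequence[str],
                       score: Callable[[Tuple[str, ...]], float]) -> List[int]:
    """Return original token indices in the order they were removed.

    At every step, remove the single remaining token whose deletion leaves
    the highest score, i.e. hurts least. Scores are recomputed after each
    removal, so a token's importance can change as its context shrinks.
    Early removals are the least important; late removals the most.
    """
    remaining = list(range(len(tokens)))
    removal_order: List[int] = []
    while remaining:
        def score_without(idx: int) -> float:
            return score(tuple(tokens[i] for i in remaining if i != idx))

        cheapest = max(remaining, key=score_without)  # ties go to the earliest token
        remaining.remove(cheapest)
        removal_order.append(cheapest)
    return removal_order


def toy_score(kept: Tuple[str, ...]) -> float:
    """Hypothetical likelihood proxy: counts surviving symbolic-math tokens."""
    essential = {"3", "*", "4", "=", "12"}
    return float(sum(1 for t in kept if t in essential))


chain = ["so", "3", "*", "4", "=", "12", "therefore"]
order = greedy_prune_order(chain, toy_score)
print([chain[i] for i in order])  # narration tokens are removed before math tokens
```

The loop makes on the order of n² scoring calls for n tokens, which is consistent with the post calling the deletion game expensive and with the interest in predicting pruning scores from attention patterns instead.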

Rohan Paul@rohanpaul_ai · Apr 12

Mark Zuckerberg: Most businesses will not own frontier AI in the way Meta or OpenAI does. But many will end up with something that feels like their own AI: a customized operational layer that reflects how that company actually works.
He says, "OpenAI, Google, they're building an AI. But I think we're gonna have a lot of different AI systems, just like we're gonna have, we have a lot of different apps. I think in the future, every business, just like I have a website and a phone number and an email address, a social media account, is also going to have an AI that can interact with their customers to help them sell things, help them give support."
---
What he is really describing is that a company’s “own AI” will usually not be a frontier model trained from scratch, but a layer built on top of shared models, shaped by its products, policies, customer history, and way of working. Support, sales, and basic operations can be handled through a system that knows the business well enough to answer, route, recommend, and escalate without feeling generic.
---
From 'Cleo Abram' YT channel (link in comment)

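Summary (translated): Mark Zuckerberg says that businesses will not own frontier foundation models but will build customized operational layers on top of shared models, reflecting their own processes and customer history, for customer interaction and support. Separately, Meta released Muse Spark, a natively multimodal reasoning model with a multi-agent orchestration architecture in which multiple copies can reason in parallel and compare results. It reaches capabilities similar to Llama 4 Maverick with more than 10x less training compute, marking a shift in AI performance gains from scaling a single model toward allocating compute intelligently at run time.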

Rohan Paul@rohanpaul_ai · Apr 11

People using AI for Premier League bets are losing badly. A new betting benchmark suggests today’s best AI models still unravel when prediction has to survive a whole season. In KellyBench, every tested model lost money, and some went completely bust. KellyBench forced agents through a changing 100-150 matchday season where they had to predict outcomes, size bets, and protect a £100,000 bankroll. That setup tests something normal benchmarks miss: whether an LLM can stay coherent, adapt to new data, and manage risk over time. Claude Opus 4.6 was best at -11% ROI, GPT-5.4 came next at -13.6%, and several models hit -100%.

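Summary (translated): KellyBench tests how well leading LLMs handle long-horizon prediction and risk management when betting on a Premier League season. Every tested model lost money, and some went to zero. Claude Opus 4.6 did best at -11% ROI; GPT-5.4 was at -13.6%. Through a changing 100-150 matchday season simulation, the test exposes significant weaknesses in coherence, adaptation to new data, and risk control over sustained decision-making.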
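For readers unfamiliar with bet sizing, here is a minimal sketch of the Kelly criterion, which the benchmark's name appears to reference. That connection is an assumption: the post does not say how agents were expected to size bets. The formula is the standard one for a binary bet, f* = (b·p − q) / b, where b is the net odds, p the estimated win probability, and q = 1 − p. The function names and the half-Kelly default are illustrative choices.

```python
def kelly_fraction(win_prob: float, decimal_odds: float) -> float:
    """Fraction of bankroll to stake on a binary bet under the Kelly criterion.

    decimal_odds is the total payout per unit staked (e.g. 2.2 returns 2.2
    for every 1 staked, a net profit of 1.2). Returns 0.0 when the bet has
    no positive edge, since Kelly never stakes on a losing proposition.
    """
    if not 0.0 <= win_prob <= 1.0:
        raise ValueError("win_prob must be in [0, 1]")
    if decimal_odds <= 1.0:
        raise ValueError("decimal_odds must be greater than 1")
    net_odds = decimal_odds - 1.0
    lose_prob = 1.0 - win_prob
    fraction = (net_odds * win_prob - lose_prob) / net_odds
    return max(0.0, fraction)


def stake(bankroll: float, win_prob: float, decimal_odds: float,
          kelly_multiplier: float = 0.5) -> float:
    """Stake size using fractional Kelly (half Kelly by default) to cut variance."""
    return bankroll * kelly_multiplier * kelly_fraction(win_prob, decimal_odds)


# A bettor who believes a side wins 50% of the time at decimal odds of 2.2,
# starting from the £100,000 bankroll mentioned in the post.
print(f"full Kelly fraction: {kelly_fraction(0.5, 2.2):.4f}")
print(f"half-Kelly stake:    £{stake(100_000, 0.5, 2.2):,.2f}")
```

Kelly sizing only protects a bankroll when `win_prob` is well calibrated; an overconfident probability estimate produces oversized stakes, which is one plausible route to the -100% outcomes reported above.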

Noam Brown@polynoamial · Apr 11

What we really need is a benchmark where AI models make AI models that play poker.

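Summary (translated): GTOWizard testing showed that mainstream models including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, and Grok 4 all lost to a specialized poker AI over 5,000 hands of heads-up no-limit Texas hold'em. The poster quips that, since playing poker directly does not work, we should instead test whether AI can build AIs that play poker.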

AK@_akhaliq · Apr 11

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability paper: https://huggingface.co/papers/2604.06628

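Summary (translated): The paper analyzes the generalization of reasoning SFT along three conditioning dimensions (optimization, data composition, and model capability), re-examining how supervised fine-tuning generalizes on reasoning tasks and which factors matter most.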

AK@_akhaliq · Apr 10

DMax: Aggressive Parallel Decoding for dLLMs paper: https://huggingface.co/papers/2604.08302

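Summary (translated): DMax proposes an aggressive parallel decoding scheme for diffusion language models (dLLMs) that moves past the limits of conventional sequential generation and significantly speeds up inference. The paper has been released.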

Ethan Mollick@emollick · Apr 10

Things that make the jagged intelligence of AI harder to deal with than the jaggedness of humans:
1) Weaknesses are not always intuitive or identifiable in advance
2) All LLMs have similar weaknesses, so you can't just hire a different one
3) Jagged frontier is moving outward

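Summary (translated): AI's jagged intelligence is harder to deal with than human jaggedness: weaknesses are hard to identify intuitively, LLMs share similar flaws so swapping models does not help, and the capability frontier keeps moving outward. Humans are uneven too, but their patterns of unevenness are more familiar.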

AK@_akhaliq · Apr 10

RAGEN-2: Reasoning Collapse in Agentic RL paper: https://huggingface.co/papers/2604.06268

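Summary (translated): The RAGEN-2 paper studies "reasoning collapse" in agentic reinforcement learning (Agentic RL), where an agent's reasoning ability degrades during training. The paper is available on Hugging Face.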

AK@_akhaliq · Apr 10

Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning paper: https://huggingface.co/papers/2604.04746

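Summary (translated): The paper proposes a process-driven image generation method that uses interleaved reasoning to mimic the stroke-by-stroke process of painting instead of generating pixels directly, producing images in a way closer to how humans draw.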

Noam Brown@polynoamial · Apr 9

I'm surprised that, more than a year later, it's still the norm to compare reasoning models on evals by a single number.

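Summary (translated): The author complains that the field still evaluates reasoning models with a single number. The quoted post argues that benchmarks such as MMLU and GSM8K are long outdated yet still reported, that Intelligence/$ is a better metric, and points to the multi-dimensional comparison chart from the o1-mini release as an example.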

Haider.@haider1 · Apr 9

quick questions: if anthropic already puts opus 4.6 at a "20%" chance of being conscious, where does mythos score on that eval? and if gpt-5.4 and opus 4.6 are already helping with phd-level research alongside people like terence tao, what will spud and mythos be capable of?

Ethan Mollick@emollick · Apr 9

After playing with it a bit, Meta's Muse Spark Thinking is fine so far, but really doesn't match the current Big Three models. It also is a bit... weird. Like some strange language & tone, a little loose with facts, etc. And here is how it does on the neo-gothic shader test.

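Summary (translated): First impressions of Meta's Muse Spark Thinking are that it falls short of the current top three models, with odd language and tone and shaky factual accuracy. On the neo-gothic shader generation test, it trails GPT 5.2 Pro by a clear margin.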

Haider.@haider1 · Apr 9

another example of how AI is starting to change mathematics: "gpt-5.4 Pro, alongside Aristotle, helped solve two research-level mathematics problems, including Erdős Problem #650, which had remained open for more than 60 years". even terence tao recently said that AI is no longer hype when it comes to mathematical discovery

Epoch AI@EpochAIResearch · Apr 9

We had pre-release access to Meta’s new Muse Spark model and evaluated it on FrontierMath. It scored 39% on Tiers 1-3 and 15% on Tier 4. This is competitive with several recent frontier models, though behind GPT-5.4.

Artificial Analysis@ArtificialAnlys · Apr 8

🇰🇷 South Korean AI lab Upstage has launched Solar Pro 3!
Solar Pro 3 scores 26 on the Artificial Analysis Intelligence Index, a significant improvement over Solar Pro 2, and is currently the second strongest model released by a Korean lab.
Key benchmarking takeaways:
➤ Strength in agentic tool use and instruction following: @upstageai's Solar Pro 3 scores 71% on IFBench, which signals strong instruction following capabilities. Solar Pro 3 ranks near the frontier models in this category, scoring similarly to GLM-5 (71%) and Kimi K2.5 (70%), and is the leader among Korean models. Solar Pro 3 also scores 86% on τ²-Bench Telecom, demonstrating strong performance on agentic tool use and making it a strong candidate for incorporation into agentic workflows.
➤ Relatively high token usage: Solar Pro 3 demonstrates relatively high token usage compared to other models in the same intelligence tier, using ~100M reasoning tokens across the Artificial Analysis Intelligence suite. This is comparable to LG’s K-EXAONE (100M reasoning tokens), another Korean model.
➤ Modest accuracy and reliability: Solar Pro 3 scores -54 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. However, with an 18% accuracy component score, Solar Pro 3 does outperform Korean competitors on this metric.
➤ First-party and third-party API access: Solar Pro 3 is a proprietary model and is currently available through Upstage’s first-party API.
Other relevant model details:
➤ Model type: Mixture of Experts (MoE)
➤ Size: 102B total parameters (12B active parameters)
➤ Context length: 128k
➤ Training data cut-off: July 2025
See below for further analysis

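Summary (translated): Korean AI lab Upstage released Solar Pro 3, which scores 26 on the Artificial Analysis Intelligence Index, making it the second strongest model from a Korean lab. It uses an MoE architecture (102B total / 12B active parameters) and supports a 128k context. Its main strengths are agentic tool use and instruction following: 71% on IFBench, on par with GLM-5 and Kimi K2.5, and 86% on τ²-Bench Telecom. Token usage is high (about 100M reasoning tokens) and reliability is weak (AA-Omniscience score of -54), though its 18% accuracy beats other Korean models. It is available through Upstage's API.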

Haider.@haider1 · Apr 8

some important takeaways from anthropic's "mythos" model: 1) the model is extremely strong across benchmarks, so scaling has not hit a wall 2) but better scaling also brings much higher training and inference costs, and its setup was strong partly because it's expensive

François Chollet@fchollet · Apr 8

Join the ARC Prize team -- help us build ARC-AGI-4 and ARC-AGI-5

Deedy@deedydas · Apr 8

Claude Mythos just obliterated every single benchmark in AI. I can't believe what I'm reading.

François Chollet@fchollet · Apr 7

With curve-fitting, you are recording a lossy approximation of the output of some generative program. With symbolic learning, you are losslessly reverse-engineering the source code of the generative program. Symbolic learning won't be the best fit for all problems, but for the ones where the latent program is reasonably simple, it will outperform by many orders of magnitude.

François Chollet@fchollet · Apr 7

Paper below tested a variety of base LLMs (no TTA) on generalization-focused math problems and found that they can't reason and can't do math. All true... but the fact that base LLMs have zero fluid intelligence, while extremely controversial back in 2024, is now well established. An interesting experiment here would have been to try current LRMs on the same problems and measure the delta. I bet latest LRMs can solve most of these problems. https://arxiv.org/abs/2604.01988

AK@_akhaliq · Apr 7

Test-Time Scaling Makes Overtraining Compute-Optimal paper: https://huggingface.co/papers/2604.01411

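Summary (translated): The paper argues that test-time scaling can make overtraining compute-optimal. The conventional Chinchilla-optimal setting treats training and inference compute as fixed, whereas this work suggests that, if extra compute is allowed at inference time, overtrained models perform better for a given total cost.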

François Chollet@fchollet · Apr 6

Science went from the initial observation of radioactivity to a working atom bomb over 47 years via only about 9 distinct key experiments -- extremely few data points -- and symbolic models concise enough they would fit on a single page. This is what extreme generalization looks like, and it is powered entirely by symbolic compression. Turn a handful of data points (deliberately collected) into a tractable plan to completely reshape reality, by reverse-engineering the causal symbolic rules behind the data.

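Summary (translated): Using the atom bomb as an example, the post describes what extreme generalization looks like: science went from the observation of radioactivity to a nuclear weapon in 47 years with only about 9 key experiments. That progress did not rely on big data but on symbolic compression, distilling a few deliberately collected data points into causal symbolic rules that fit on a single page. The core claim is that, by reverse-engineering the causal logic behind the data, humans can turn minimal information into a complete plan to reshape reality.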

François Chollet@fchollet · Apr 4

First update of the call, from Sachin: Gemma 4 is out now on KerasHub! Best open-source model so far for reasoning and agentic workflows.

Artificial Analysis@ArtificialAnlys · Apr 3

India enters the open-weights AI race with its largest models pre-trained from scratch: Sarvam 105B and Sarvam 30B @SarvamAI's Sarvam 105B and Sarvam 30B score 18 and 12 on the Artificial Analysis Intelligence Index respectively. Announced at the India AI Impact Summit 2026 and open-sourced under Apache 2.0, both are Mixture-of-Experts models trained entirely in India using compute provided under the IndiaAI Mission (@OfficialINDIAai). Both support reasoning and non-reasoning modes. These are an improvement from Sarvam's previous model, Sarvam M (8 on Intelligence Index, 23.6B parameters), which was based on Mistral Small rather than pre-trained from scratch. Sarvam 105B has 106B total parameters with ~10B active per token and a 128K context window. Sarvam 30B has 32B total parameters with ~2.4B active per token and a 65K context window. Alongside the text models, Sarvam also announced Saaras v3 (Speech to Text) and Bulbul v3 (Text to Speech) with a focus on Indic languages. Key takeaways in reasoning mode: ➤ Sarvam 105B scores 18 on the Intelligence Index. Among ~100B-class open-weights reasoning models, it trails GLM-4.5-Air (23), INTELLECT-3 (22), Mistral Small 4 (27), and gpt-oss-120B (High, 33). All four peers also activate more parameters per token ➤ Sarvam 30B scores 12 on the Intelligence Index. Among ~30B-class open-weights reasoning models, it trails GLM-4.7-Flash (30), Nemotron Cascade 2 30B A3B (28), Qwen3 30B A3B 2507 (22), and Qwen3 32B (17). Sarvam 30B activates fewer parameters than these peers. ➤ Sarvam 105B's relative strength is in select agentic tasks. Its agentic index of 25 places it ahead of INTELLECT-3 (20) and GLM-4.5-Air (21) despite trailing both on overall intelligence. Its GDPval index of 773 also edges ahead of GLM-4.5-Air (665). Both new models are a large step up from Sarvam M (Reasoning), which scored 8 on the Intelligence Index. 
➤ Compared to peers, both models score lower on TerminalBench Hard (Agentic Coding & Terminal Use) and AA-Omniscience. Sarvam 105B scored 1.5% and Sarvam 30B scored 2.3% on TerminalBench Hard, compared to GLM-4.5-Air (20.5%) and INTELLECT-3 (9.1%). The AA-Omniscience Index is -60 for Sarvam 105B and -72 for Sarvam 30B. Both models have high hallucination rates relative to their accuracy, and both attempt to answer far more questions rather than abstaining, which drives the negative scores. Key model details: ➤ Modality: Text input and output only. ➤ Context window: 128K tokens (Sarvam 105B) and 65K tokens (Sarvam 30B). ➤ Pricing: Currently free on Sarvam's first-party API. ➤ License: Apache 2.0. ➤ Availability: Sarvam's first-party API; weights available on @huggingface and AIKosh.

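Summary (translated): Sarvam AI released Sarvam 105B and Sarvam 30B, open-weights MoE models pre-trained from scratch and trained entirely in India. They score 18 and 12 on the Intelligence Index respectively and support both reasoning and non-reasoning modes. The 105B model beats some peers on agentic tasks but trails on the TerminalBench Hard coding test and has a high hallucination rate. The models are open-sourced under Apache 2.0, have 128K and 65K token context windows, and are currently free through the API.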

Greg Brockman@gdb · Apr 3

OpenAI for helping resolve longstanding open mathematical problems, with short elegant proofs. Feels like we are on the edge of a new age of scientific discovery.

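Summary (translated): An internal OpenAI model solved three classic Erdős problems, giving a short, elegant proof for each. The accompanying paper has been posted to arXiv, and the author reflects that we are on the edge of a new age of scientific discovery.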

Google DeepMind@GoogleDeepMind · Apr 3

Meet Gemma 4: our new family of open models you can run on your own hardware. Built for advanced reasoning and agentic workflows, we’re releasing them under an Apache 2.0 license. Here’s what’s new 🧵

François Chollet@fchollet · Mar 30

Let me explain what I mean using your chess analogy... Imagine a world where chess doesn't exist. In this world, humanity encounters an alien species, and they say "let's play a game of Glurg, it's our traditional pastime. Here are the rules, see you tomorrow" -- and it's the rules of chess. My claim is that following this interaction, a working group of the world's best minds, leveraging current externalized cognitive infrastructure (computers, the internet, etc.) would be able to analyze the rules and develop a working 3000 Elo chess engine within 24 hours, in time for the match. Give them an extra 3 weeks and they'd have a 3500 Elo engine that's 10x more compute efficient. So human intelligence is already at a level where we can go from "here are the rules" to "I can play at 3000 Elo" immediately. Not optimal yet, but not too far off.

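Summary (translated): Using a hypothetical game called "Glurg" (actually chess), the author argues that, with today's externalized cognitive infrastructure (computers, the internet, etc.), a team of the world's best minds could go from analyzing the rules to a working 3000 Elo engine within 24 hours, and with three more weeks to a 3500 Elo engine that is 10x more compute efficient. This suggests human intelligence can already master a complex strategic system almost immediately instead of evolving slowly from scratch. The argument responds to a debate over whether the real world is closer to chess than to Go, and stresses the immediate advantage humans gain by using tools to extend cognition.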

Deedy@deedydas · Mar 29

Legendary Don Knuth has now used AI to fully solve his Hamiltonian decomposition problem for odd and even cases. Opus 4.6 / 5.4 Pro solved the even case, wrote a proof in Lean and an “apparently flawless 14 page paper”. Knuth: “We are living in very interesting times indeed.”

Epoch AI@EpochAIResearch · Mar 28

We have removed a problem from FrontierMath: Open Problems. The problem was solved by AI, but upon review we determined that the problem didn’t meet our minimum bar for mathematical notability. This is a different problem from the one whose solution we announced on Monday.

Deedy@deedydas · Mar 27

I sat down with Sam Altman (@sama) and Francois Chollet (@fchollet) at the ARC-AGI-3 launch to talk about: – how to raise kids in this new world – “spud” / Sora – what they’re bullish on and not – research we need more of – AGI timelines and more. Video dropping soon.

Artificial Analysis@ArtificialAnlys · Mar 26

OpenAI released GPT-5.4 mini and nano, cheaper variants of GPT-5.4 with the same reasoning modes. GPT-5.4 nano is the standout, scoring ahead of both Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview with lower per token pricing @OpenAI released GPT-5.4 mini (xhigh, 48) and nano (xhigh, 44), the first mini and nano updates since GPT-5. Both are multimodal with image input support and feature a 400K token context window. They support the same reasoning effort levels as GPT-5.4 (xhigh, high, medium, low, none) and are priced significantly lower: mini at $0.75/$4.50 per 1M input/output tokens and nano at $0.20/$1.25, compared to GPT-5.4 at $2.50/$15. We evaluated these models across three reasoning variants: xhigh, medium, none. While both models are more intelligent than their peers in the highest reasoning efforts, they are more verbose, using 200M+ output tokens to run the Intelligence Index, higher than even select frontier models Key benchmarking takeaways from the highest reasoning variants: ➤ GPT-5.4 nano (xhigh, 44) jumps 18 points from GPT-5 nano (high, 27), with improvements across all evaluations. Compared to Claude Haiku 4.5 (Reasoning, 37) and Gemini 3.1 Flash-Lite Preview (34), GPT-5.4 nano leads on τ²-Bench (81% vs 55% and 31%), IFBench (76% vs 54% and 77%), and TerminalBench (42% vs 27% and 24%) ➤ GPT-5.4 mini (xhigh, 48) gains 7 points over GPT-5 mini (high, 41), with gains across most evaluations. Compared to Gemini 3 Flash Preview (Reasoning, 46) and Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 52), GPT-5.4 mini leads on TerminalBench (52% vs 39% and 53%) and CritPt (10% vs 9% and 3%) ➤ Both models perform less on AA-Omniscience compared to peers, driven primarily by high hallucination rates. GPT-5.4 mini scores -18.7 with a 90% hallucination rate, well behind Claude Sonnet 4.6 (Adaptive Reasoning, max effort, +12.4, 46% hallucination rate) and Gemini 3 Flash Preview (Reasoning, +11.6, 92% hallucination rate but 54% accuracy). 
GPT-5.4 nano scores -29.6 with a 74% hallucination rate, behind Claude Haiku 4.5 (Reasoning, -4.2, 26% hallucination rate) and Gemini 3.1 Flash-Lite Preview (-15.5, 82%). Both GPT-5.4 models attempt to answer far more questions than Claude Haiku 4.5 and Claude Sonnet 4.6 rather than abstaining, which drives the higher hallucination rates ➤ Both models show strong agentic performance. GPT-5.4 mini scores 1405 on GDPval-AA (Agentic Real-World Work Tasks), ahead of Gemini 3 Flash Preview (Reasoning, 1191) but behind Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 1633). GPT-5.4 nano scores 1169, close to Claude Haiku 4.5 (Reasoning, 1173) and well ahead of Gemini 3.1 Flash-Lite Preview (944) ➤ Token usage with xhigh reasoning effort is higher for both models compared to peers with highest reasoning efforts. GPT-5.4 mini used 235M output tokens to run the Intelligence Index, ~3.4x GPT-5 mini (high, 69M) and more than Claude Sonnet 4.6 (Adaptive Reasoning, max effort, 198M) despite scoring 4 points lower. GPT-5.4 nano used 210M output tokens, ~2.4x Claude Haiku 4.5 (Reasoning, 87M) and ~4x Gemini 3.1 Flash-Lite Preview (53M) ➤ Effective cost to run the Intelligence Index reflects the higher token usage. GPT-5.4 mini (xhigh) cost ~$1,406, compared to ~$278 for Gemini 3 Flash Preview (Reasoning) and ~$3,959 for Claude Sonnet 4.6 (Adaptive Reasoning, max effort). GPT-5.4 nano (xhigh) cost ~$376, compared to ~$584 for Claude Haiku 4.5 (Reasoning) and ~$94 for Gemini 3.1 Flash-Lite Preview. GPT-5.4 nano is cheaper than Claude Haiku 4.5 on an effective cost basis despite using ~2.4x more tokens, due to its significantly lower pricing. Overall, GPT-5.4 nano is the standout offering a better Intelligence vs. Cost to Run Intelligence Index tradeoff than peers and GPT-5.4 mini

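Summary (translated): OpenAI released the lighter GPT-5.4 mini and nano models, which keep the multiple reasoning effort levels and a 400K context window; nano is priced at $0.20/$1.25 per million input/output tokens. In benchmarks, GPT-5.4 nano leads Claude Haiku 4.5 and Gemini 3.1 Flash-Lite Preview on several tests including τ²-Bench, but has a high hallucination rate and heavy token usage. Thanks to its very low unit price, nano's effective cost to run the Intelligence Index is still lower than Claude Haiku 4.5's, giving it a strong intelligence-to-cost tradeoff.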
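The relationship between token usage, per-token price, and effective cost can be checked with simple arithmetic. The sketch below uses only figures quoted in the post (output tokens, output price, reported total cost) and computes the output-token portion of each bill. Treating the remainder as input-token and other costs is an assumption; the post does not break the totals down.

```python
def output_token_cost_usd(output_tokens_millions: float,
                          price_per_million_usd: float) -> float:
    """Cost of output tokens only; input-token cost is deliberately ignored."""
    return output_tokens_millions * price_per_million_usd


# Figures quoted in the post: output tokens (millions), output price
# (USD per 1M tokens), and the reported total cost to run the index (USD).
runs = {
    "GPT-5.4 mini (xhigh)": (235, 4.50, 1406),
    "GPT-5.4 nano (xhigh)": (210, 1.25, 376),
}

for name, (tokens_m, price, reported_total) in runs.items():
    output_cost = output_token_cost_usd(tokens_m, price)
    share = output_cost / reported_total
    print(f"{name}: output tokens ~${output_cost:,.2f} "
          f"({share:.0%} of the reported ~${reported_total:,})")
```

Output tokens alone come to about $1,057.50 for mini and $262.50 for nano, so nano can use nearly as many tokens as mini and still cost far less, which matches the post's point that low pricing offsets high verbosity.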

Epoch AI@EpochAIResearch · Mar 24

AI has solved one of the problems in FrontierMath: Open Problems, our benchmark of real research problems that mathematicians have tried and failed to solve. See thread for more.

Artificial Analysis@ArtificialAnlys · Mar 20

Mistral has released Mistral Small 4, an open weights model with hybrid reasoning and image input, scoring 27 on the Artificial Analysis Intelligence Index @MistralAI's Small 4 is a 119B mixture-of-experts model with 6.5B active parameters per token, supporting both reasoning and non-reasoning modes. In reasoning mode, Mistral Small 4 scores 27 on the Artificial Analysis Intelligence Index, a 12-point improvement from Small 3.2 (15) and now among the most intelligent models Mistral has released, surpassing Mistral Large 3 (23) and matching the proprietary Magistral Medium 1.2 (27). However, it lags open weights peers with similar total parameter counts such as gpt-oss-120B (high, 33), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), and Qwen3.5 122B A10B (Reasoning, 42). Key takeaways: ➤ Reasoning and non-reasoning modes in a single model: Mistral Small 4 supports configurable hybrid reasoning with reasoning and non-reasoning modes, rather than the separate reasoning variants Mistral has released previously with their Magistral models. In reasoning mode, the model scores 27 on the Artificial Analysis Intelligence Index. In non-reasoning mode, the model scores 19, a 4-point improvement from its predecessor Mistral Small 3.2 (15) ➤ More token efficient than peers of similar size: At ~52M output tokens, Mistral Small 4 (Reasoning) uses fewer tokens to run the Artificial Analysis Intelligence Index compared to reasoning models such as gpt-oss-120B (high, ~78M), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, ~110M), and Qwen3.5 122B A10B (Reasoning, ~91M). In non-reasoning mode, the model uses ~4M output tokens ➤ Native support for image input: Mistral Small 4 is a multimodal model, accepting image input as well as text. On our multimodal evaluation, MMMU-Pro, Mistral Small 4 (Reasoning) scores 57%, ahead of Mistral Large 3 (56%) but behind Qwen3.5 122B A10B (Reasoning, 75%). Neither gpt-oss-120B nor NVIDIA Nemotron 3 Super 120B A12B support image input. 
All models support text output only ➤ Improvement in real-world agentic tasks: Mistral Small 4 scores an Elo of 871 on GDPval-AA, our evaluation based on OpenAI's GDPval dataset that tests models on real-world tasks across 44 occupations and 9 major industries, with models producing deliverables such as documents, spreadsheets, and diagrams in an agentic loop. This is more than double the Elo of Small 3.2 (339) and close to Mistral Large 3 (880), but behind gpt-oss-120B (high, 962), NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 1021), and Qwen3.5 122B A10B (Reasoning, 1130) ➤ Lower hallucination rate than peer models of similar size: Mistral Small 4 scores -30 on AA-Omniscience, our evaluation of knowledge reliability and hallucination, where scores range from -100 to 100 (higher is better) and a negative score indicates more incorrect than correct answers. Mistral Small 4 scores ahead of gpt-oss-120B (high, -50), Qwen3.5 122B A10B (Reasoning, -40), and NVIDIA Nemotron 3 Super 120B A12B (Reasoning, -42) Key model details: ➤ Context window: 256K tokens (up from 128K on Small 3.2) ➤ Pricing: $0.15/$0.6 per 1M input/output tokens ➤ Availability: Mistral first-party API only. At native FP8 precision, Mistral Small 4's 119B parameters require ~119GB to self-host the weights (more than the 80GB of HBM3 memory on a single NVIDIA H100) ➤ Modality: Image and text input with text output only ➤ Licensing: Apache 2.0 license

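Summary (translated): Mistral released Mistral Small 4, an open-weights model with a 119B-parameter MoE architecture (6.5B active parameters per token) that supports switchable reasoning and non-reasoning modes plus image input. In reasoning mode it scores 27 on the Artificial Analysis Intelligence Index, ahead of Mistral Large 3 but behind peers such as gpt-oss-120B. It is more token efficient than comparable models, has a lower hallucination rate (AA-Omniscience score of -30), supports a 256K context window, and is released under the Apache 2.0 license.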
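The self-hosting figure in the post follows from one byte per parameter at FP8. A minimal sketch of that arithmetic, counting weights only: KV cache, activations, and runtime overhead are ignored, so real deployments need more memory than this. The function names are illustrative, and 1 GB is taken as 10^9 bytes.

```python
import math


def weight_memory_gb(total_params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes).

    One billion parameters at one byte each is one GB, so the result is
    simply parameters (in billions) times bytes per parameter.
    """
    return total_params_billions * bytes_per_param


def min_gpus_for_weights(weights_gb: float, gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory can hold the weights."""
    return math.ceil(weights_gb / gpu_memory_gb)


fp8_gb = weight_memory_gb(119, 1)   # FP8: 1 byte per parameter
bf16_gb = weight_memory_gb(119, 2)  # BF16: 2 bytes per parameter, for comparison
print(f"FP8 weights:  ~{fp8_gb:.0f} GB -> at least {min_gpus_for_weights(fp8_gb, 80)} x 80 GB GPUs")
print(f"BF16 weights: ~{bf16_gb:.0f} GB -> at least {min_gpus_for_weights(bf16_gb, 80)} x 80 GB GPUs")
```

Note that all 119B parameters must be resident even though only 6.5B are active per token: a mixture-of-experts model saves compute per token, not weight memory.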

Sam Altman@sama · Mar 17

Great first week for 5.4 in the API. Builders building fast.

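Summary (translated): In its first week in the API, GPT-5.4 reached 5T tokens processed per day, more traffic than the entire API handled a year earlier, and added more than $1 billion in annualized net new revenue, the fastest ramp of any OpenAI model launch.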

Google DeepMind@GoogleDeepMind · Mar 13

What does it take to build AI for scientific discovery? 🧠 To celebrate 10 years of AlphaGo, @ThoreG and @Pushmeet joined @fryrsquared on our podcast to discuss how mastering games has paved the way for it to help solve more complex problems. ↓
00:00 The AlphaGo match
02:15 Why Go?
10:58 Lee Sedol vs AlphaGo
14:58 Move 37
20:55 Move 78
24:38 Reaction from the Go community
30:45 Never before seen footage
32:05 AlphaGo to protein folding
34:00 Matrix multiplication
38:00 AlphaGo and algorithmic discovery
41:40 How to verify new discoveries
47:38 Role of mathematicians
51:43 Would we be here without AlphaGo?

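Summary (translated): To mark AlphaGo's tenth anniversary, DeepMind scientists Thore Graepel and Pushmeet Kohli discuss the path from game-playing AI to tools for scientific discovery. The conversation revisits landmark moments such as Move 37 and Move 78 and explains how AlphaGo led on to protein folding, matrix multiplication, and algorithmic discovery. It also covers how AI-generated discoveries are verified, the role of mathematicians, and how game intelligence changed the approach to complex scientific problems.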