The recipe behind today’s frontier reasoning models is surprisingly similar to AlphaGo: 1) Imitate large amounts of human data 2) Scale inference compute to reason better (back then it was Monte Carlo Tree Search, today it's Chain of Thought) 3) Use RL to go beyond imitation

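Summary: The training path of today's frontier reasoning models closely mirrors AlphaGo's: first imitate large amounts of human data, then scale inference compute (from Monte Carlo Tree Search to Chain of Thought), and finally use reinforcement learning to go beyond the ceiling of imitation. Demis Hassabis says that AlphaGo's "Move 37" ten years ago signaled that AI could tackle real scientific problems, and that these ideas remain critical to building AGI.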

Google DeepMind @GoogleDeepMind · Mar 10

Ten years after AlphaGo, we’re still building on its foundations to advance AI. The techniques pioneered have helped us prove mathematical statements and are now assisting the scientific community in making new discoveries. Read more from @DemisHassabis ↓ https://goo.gle/40nljjK

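Summary: Ten years after AlphaGo, the techniques it pioneered are helping to prove mathematical statements and assisting the scientific community in making new discoveries, continuing to push the boundaries of AI capability.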

Demis Hassabis @demishassabis · Mar 10

Ten years ago, AlphaGo’s legendary match in Seoul heralded the start of the modern era in AI. Its famous ‘Move 37’ signaled to us that AI techniques were ready to tackle real-world problems in areas like science - and ideas inspired by these methods are critical to building AGI

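Summary: AlphaGo's match in Seoul ten years ago opened the modern era of AI. Its famous "Move 37" signaled that AI techniques were ready to tackle real-world problems in areas like science, and ideas inspired by these methods remain critical to building AGI.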

OpenAI @OpenAI · Mar 6

We're publishing a new evaluation suite and research paper on Chain-of-Thought (CoT) Controllability. We find that GPT-5.4 Thinking shows low ability to obscure its reasoning—suggesting CoT monitoring remains a useful safety tool. https://openai.com/index/reasoning-models-chain-of-thought-controllability/

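Summary: OpenAI has released a CoT controllability evaluation suite and research paper. Testing found that GPT-5.4 Thinking has little ability to obscure its reasoning, suggesting that CoT monitoring remains a useful safety tool.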

OpenAI @OpenAI · Mar 6

GPT-5.4 Thinking and GPT-5.4 Pro are rolling out now in ChatGPT. GPT-5.4 is also now available in the API and Codex. GPT-5.4 brings our advances in reasoning, coding, and agentic workflows into one frontier model.

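Summary: GPT-5.4 Thinking and GPT-5.4 Pro are rolling out to ChatGPT users, and GPT-5.4 is also available through the API and Codex. The release brings reasoning, coding, and agentic workflow capabilities together in a single frontier model.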

OpenAI @OpenAI · Mar 4

5.4 sooner than you Think.

Summary: OpenAI hints that version 5.4 will arrive sooner than widely expected. The post gives no release date or feature details.

Saining Xie @sainingxie · Jan 30

if you are building video diffusion / world simulators, try this new sampler. temporal consistency pins videos to a low-dimensional manifold in the total pixel space. self-refinement sampling keeps them there.

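Summary: If you are building video diffusion models or world simulators, try this new sampler. Temporal consistency pins videos to a low-dimensional manifold in the total pixel space, and self-refinement sampling keeps them there. [Quoting @jangsangwon7]: What if your video generator could refine itself at inference time? ❌ No new model. ❌ No retraining. ❌ No external verifier. 💡 Introducing self-refining video sampling. By reinterpreting a pretrained generator (Wan2.2, Cosmos) as a denoising autoencoder, we enable iterative self-refinement at inference time ➡️ markedly better physical realism and over 70% human preference! 🧵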

Saining Xie @sainingxie · Dec 23

not getting into a philosophical debate, but this book really changed how I see the topic and made me feel more humble. human intelligence is impressive, but calling it ‘general’ isn’t very objective. my cat would disagree. to me human intelligence is better seen as socially driven cognitive adaptations, and there’s a huge WORLD of intelligence we still don’t understand, and are nowhere near recreating with current AI

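Summary: Not getting into a philosophical debate, but this book really changed how I see the topic and made me feel more humble. Human intelligence is impressive, but calling it "general" is not very objective; my cat would disagree. To me, human intelligence is better seen as socially driven cognitive adaptations, and there is a huge WORLD of intelligence that we still do not understand and are nowhere near recreating with current AI. [Quoting @demishassabis]: Yann is completely wrong here; he is confusing general intelligence with universal intelligence. Brains are the most exquisite and complex phenomena we know of in the universe (so far), and they are in fact extremely general. Obviously no one can get around the no free lunch theorem, so in a practical and finite system there always has to be some degree of specialization around the target distribution being learned. But the point about generality is that in theory, in the Turing Machine sense, the architecture of such a general system can learn anything computable given enough time and memory (and data), and human brains (and AI foundation models) are approximate Turing Machines. Finally, regarding Yann's comments about chess players, it is amazing that humans could invent chess in the first place (and every other aspect of modern civilization, from science to 747s!), let alone play it as brilliantly as someone like Magnus. He may not be strictly optimal (after all, he has finite memory and limited time to make a decision), but considering that our brains evolved for hunting and gathering, what he and we can do with them is incredible.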

Lilian Weng @lilianweng · Oct 28

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.

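Summary: On-policy distillation offers an elegant way to use the teacher model as a process reward model that provides dense reward, while preventing SFT-style "OOD shock" during rollout. [Quoting @thinkymachines]: Our latest post explores on-policy distillation, a training approach that combines the error-correcting relevance of RL with the reward density of SFT. When using it for math reasoning and for training an internal chat assistant, we find that on-policy distillation can outperform other approaches at a fraction of the cost. https://thinkingmachines.ai/blog/on-policy-distillation/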
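A minimal sketch of the dense-reward idea above, under stated assumptions: the student samples the tokens, and each sampled token is scored by how much more (or less) likely the teacher finds it than the student does. That difference is a single-sample estimate of the negative per-token reverse KL. The function names and toy logits are hypothetical illustrations and are not taken from the linked post.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def per_token_rewards(student_logits, teacher_logits, sampled_tokens):
    """Dense reward for each token the student sampled (assumed formulation).

    reward_t = log p_teacher(x_t) - log p_student(x_t)
    Zero when teacher and student agree, negative when the student is
    more confident in its own token than the teacher is.
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        rewards.append(log_softmax(t_row)[tok] - log_softmax(s_row)[tok])
    return rewards

# Toy rollout over a 2-token vocabulary (hypothetical numbers).
student = [[2.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, 2.0], [0.0, 0.0]]
tokens = [0, 1]  # tokens drawn from the student's own distribution
rewards = per_token_rewards(student, teacher, tokens)
print(rewards)
```

Because the reward is computed at every position of the student's own rollout, the signal is dense like SFT while the states visited are on-policy like RL.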

Epoch AI @EpochAIResearch · Oct 11

We manually evaluated three compute-intensive model settings on our extremely hard math benchmark. FrontierMath Tier 4: Battle Royale! GPT-5 Pro set a new record (13%), edging out Gemini 2.5 Deep Think by a single problem (not statistically significant). Grok 4 Heavy lags. 🧵

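Summary: On the extremely hard FrontierMath Tier 4 math benchmark, GPT-5 Pro set a new record at 13% accuracy, edging out Gemini 2.5 Deep Think by a single problem (not statistically significant), while Grok 4 Heavy lags behind.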
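To illustrate why a one-problem gap can fail to be statistically significant, here is a sketch of a two-sided Fisher exact test. The counts are assumptions for illustration only: the post does not state how many problems were evaluated, so the example supposes a 48-problem set with 6 versus 5 solved (6/48 rounds to 13%).

```python
from math import comb

def fisher_exact_two_sided(solved_a, total_a, solved_b, total_b):
    """Two-sided Fisher exact p-value for two success counts."""
    total_solved = solved_a + solved_b
    denom = comb(total_a + total_b, total_solved)

    def prob(k):
        # Hypergeometric probability that model A accounts for k of the solves.
        return comb(total_a, k) * comb(total_b, total_solved - k) / denom

    p_obs = prob(solved_a)
    lo = max(0, total_solved - total_b)
    hi = min(total_a, total_solved)
    # Sum every outcome that is at least as unlikely as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

# Hypothetical counts, not reported in the post.
p_value = fisher_exact_two_sided(6, 48, 5, 48)
print(f"p = {p_value:.3f}")
```

With counts this small and this close, the p-value is far above the usual 0.05 threshold, so the ordering of the two models could easily flip on a different problem set.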

Jeff Dean @JeffDean · Oct 1

The proof is in the evolutionary pudding!

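Summary: Google Research used AlphaEvolve to iteratively evolve code, automatically generating automatically verifiable elements of complexity-theory proofs and demonstrating the use of evolutionary algorithms in discovering mathematical proofs.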

Hao AI Lab @haoailab · Sep 24

[1/N]🚀New decoding paradigm drop!🚀 Introducing Lookahead Reasoning(LR): step-level speculation that stacks with Speculative Decoding(SD). It has been accepted to #NeurIPS2025 🎉 📖 Blog: https://hao-ai-lab.github.io/blogs/lookaheadreasoning/ 💻 Code: https://github.com/hao-ai-lab/LookaheadReasoning 📄 Paper: https://arxiv.org/abs/2506.19830

Summary: [1/N] 🚀 New decoding paradigm drop! 🚀
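For context on the baseline that Lookahead Reasoning stacks with, here is a toy sketch of token-level speculative decoding with greedy verification. It is not the Lookahead Reasoning algorithm itself, which speculates at the level of reasoning steps; `draft_next` and `target_next` are hypothetical stand-ins for a small draft model and a large target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Draft k tokens cheaply, then keep the prefix the target agrees with."""
    # 1) The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2) The target model checks each position. A real system would do this
    #    in one batched forward pass; here it is a plain loop for clarity.
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: take the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft token matched, so the target contributes one more token.
    accepted.append(target_next(ctx))
    return accepted

def generate(prefix, draft_next, target_next, k, max_new):
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prefix) + max_new]

# Toy "models": the target cycles 0, 1, 2; the draft is only right early on.
target = lambda ctx: len(ctx) % 3
draft = lambda ctx: len(ctx) % 3 if len(ctx) < 2 else 0
tokens = generate([], draft, target, k=3, max_new=7)
print(tokens)
```

With greedy verification the output always equals what the target model would have produced alone; the draft only changes how many target calls are needed per emitted token.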

Sam Altman @sama · Sep 22

Over the next few weeks, we are launching some new compute-intensive offerings. Because of the associated costs, some features will initially only be available to Pro subscribers, and some new products will have additional fees. Our intention remains to drive the cost of intelligence down as aggressively as we can and make our services widely available, and we are confident we will get there over time. But we also want to learn what's possible when we throw a lot of compute, at today's model costs, at interesting new ideas.

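Summary: Over the next few weeks, new compute-intensive offerings will launch; some features will initially be limited to Pro subscribers, and some new products will carry additional fees. The long-term goal remains to drive down the cost of intelligence and make services widely available, but for now the aim is to learn what becomes possible when a lot of compute is applied to new ideas.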

Hao AI Lab @haoailab · Sep 22

🚀 Thrilled to share that our lab has THREE papers accepted at #NeurIPS2025 on AI efficiency from reasoning to video generation. Come hang out with us, it's going to be a lot of fun this year here local to UCSD! 😎 📊 Efficiently Scaling LLM Reasoning with Certaindex Introduces Certaindex, an algorithm-agnostic metric measuring evolving stability that signals when further computation won't change results, plus Dynasor serving system achieving up to 50% compute savings and 3.3x higher efficiency 📎 https://arxiv.org/abs/2412.20993 @FuYichao123 @Junda_Chen_ ⚡ Scaling Speculative Decoding with Lookahead Reasoning Exploits step-level parallelism to overcome token-level speculative decoding limitations, boosting speedup from 1.4x to 2.1x on GSM8K 📎 https://arxiv.org/abs/2506.19830 @FuYichao123 🎥 VSA: Faster Video Diffusion with Trainable Sparse Attention is a hardware-efficient sparse attention for video DiTs that cuts training FLOPS by 2.53× with zero loss in diffusion quality 📎 https://arxiv.org/abs/2505.13389 @PY_Z001 @BrianChen112900 Congrats to all collaborators! 🎉

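Summary: 🚀 Thrilled to share that our lab has three papers accepted at #NeurIPS2025 on AI efficiency, from reasoning to video generation. Come hang out with us; the conference is local to UCSD this year and it will be a lot of fun! 😎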

Noam Brown @polynoamial · Sep 18

12/12 problems solved, which would be equivalent to a 1st place performance. GPT-5's solutions were responsible for solving 11/12 of them.

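Summary: OpenAI's reasoning system solved 12/12 problems at the 2025 ICPC World Finals, equivalent to a first-place finish among human contestants. GPT-5 was responsible for solving 11 of them.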

OpenAI @OpenAI · Sep 18

Our general-purpose reasoning models solved all 12 problems at the 2025 International Collegiate Programming Contest (ICPC) World Finals, the world’s top university programming competition which was enough for a 1st-place human ranking.

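Summary: OpenAI's reasoning system solved all 12 algorithmic problems at the 2025 ICPC World Finals, a perfect 12/12. That result would rank first among all human teams, enough to take the title.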

Jeff Dean @JeffDean · Sep 18

Very excited to see our Gemini models getting better and better at coding! An advanced version of Gemini 2.5 Deep Think at the 2025 International Collegiate Programming Contest (ICPC) World Finals achieved gold-medal level performance! 🎉 https://deepmind.google/discover/blog/gemini-achieves-gold-level-performance-at-the-international-collegiate-programming-contest-world-finals/

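Summary: An advanced version of Gemini 2.5 Deep Think achieved gold-medal level performance at the 2025 ICPC World Finals, a sign that Gemini models keep getting better at coding, including on competition-level programming tasks.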

Google DeepMind @GoogleDeepMind · Sep 18

An advanced version of Gemini 2.5 Deep Think has achieved gold-medal level performance at the ICPC 2025 - one of the world’s most prestigious programming contests. 🏅 Building on the model's success in math at the IMO, this marks another historic milestone for advanced AI. 🧵

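Summary: An advanced version of Gemini 2.5 Deep Think achieved gold-medal level performance at ICPC 2025. Following the model's success at the IMO math competition, this is another historic milestone in competitive settings.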

Lilian Weng @lilianweng · Sep 11

Besides the fun fact that Connectionism is connected with the early days of the AI field and highlights similarities between neural networks and human brains, the flagship product of the (first) Thinking Machines is named Connection Machine. — 🧑‍🎓Enjoy reading and more is coming!

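Summary: Besides the fun fact that Connectionism is tied to the early days of the AI field and highlights similarities between neural networks and human brains, the flagship product of the (first) Thinking Machines was named Connection Machine. 🧑‍🎓 Enjoy reading; more is coming! [Quoting @thinkymachines]: Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is "Defeating Nondeterminism in LLM Inference". We believe that science is better when shared. Connectionism will cover topics as varied as our research is, from kernel numerics to prompt engineering. Here, we share what we are working on and connect with the research community frequently and openly. The name Connectionism is a tribute to the early era of AI; it was the name of the subfield in the 1980s that studied neural networks and their similarity to biological brains. https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/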

Noam Brown @polynoamial · Aug 25

GPT-5 Thinking definitely isn’t perfect, but it’s the first AI model I can trust more than many common sources of truth on the internet.

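Summary: GPT-5 Thinking is not perfect, but the author considers it the first AI model that is more trustworthy than many common sources of information on the internet.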

Noam Brown @polynoamial · Aug 21

AI assistance is already transforming software engineering. It appears that mathematics is next.

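Summary: GPT-5-pro can already prove new mathematical theorems. Tested on an open problem in convex optimization, it produced a better bound than the original paper, and the result was verified as correct. This suggests that, after software engineering, mathematical research will be the next frontier for AI assistance.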

Eric @ericmitchellai · Aug 17

bullish for science & humanity! definitely recommend 5 pro for deep review/analysis/critique/second opinion for important code/papers/proposals/etc, it shines there

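Summary: Bullish for science and humanity! Strongly recommend 5 Pro for deep review, analysis, critique, or a second opinion on important code, papers, proposals, and the like; it shines there.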

Hao AI Lab @haoailab · Aug 13

[Lmgame Bench] 🔥 We tested Openai’s GPT-5-thinking-high and two recent open-source models in our Lmgame Bench! Across 26 models and 6 games (Sokoban, Tetris, 2048, Candy Crush, Mario, Ace Attorney), Here’s where they landed: GPT-5-thinking-high → #2 Qwen3‑235B‑A22B‑Thinking‑2507 → #10 glm4.5 → #18

Summary: [Lmgame Bench] 🔥 We tested OpenAI's GPT-5-thinking-high and two recent open-source models in our Lmgame Bench!

Eric @ericmitchellai · Aug 12

neat

Summary: Neat.

Eric @ericmitchellai · Aug 11

> GPT-5 is the first series of models that actually doesn’t hallucinate basically at all *real-world utility-maxxing instead of benchmark-maxxing intensifies* Disclaimer: GPT-5 is still not perfect and may make (far fewer now) mistakes

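Summary: > GPT-5 is the first series of models that basically does not hallucinate at all *real-world utility-maxxing instead of benchmark-maxxing intensifies* Disclaimer: GPT-5 is still not perfect and may make mistakes (far fewer now).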

Eric @ericmitchellai · Aug 11

AI Twitter meets The ChatGPT User Population fascinating encounter

Summary: AI Twitter meets the ChatGPT user population: a fascinating encounter.

Eric @ericmitchellai · Aug 11

Go forth and generate moar tokens!!!! Put gpt-5 thinking to the test for your real world problems!!!

Summary: Go forth and generate more tokens! Put GPT-5 Thinking to the test on your real-world problems!

Hao AI Lab @haoailab · Aug 8

[Lmgame Bench] 🏆Congratulations to o3 for dominantly championing the first-ever AI Chess Tournament! Also to grok-4 and gemini-2.5-pro for the second and third place! This result highly aligns with our lmgame-Bench leaderboard! This shows that games aren't just for fun: They're reliable and consistent signals of LLM’s intelligence, and our benchmark is an effective predictor of LLM’s gaming capability! https://huggingface.co/spaces/lmgame/lmgame_bench

Summary: [Lmgame Bench] 🏆 Congratulations to o3 for decisively winning the first-ever AI Chess Tournament, and to grok-4 and gemini-2.5-pro for taking second and third place!

Noam Brown @polynoamial · Aug 8

I'm more optimistic than ever that we at @OpenAI can eliminate hallucinations. There's still more research to be done, but GPT-5 is solid progress.

Summary: Noam Brown is more optimistic than ever that OpenAI can eliminate hallucinations. GPT-5 is solid progress, but more research remains to be done.

Hao AI Lab @haoailab · Aug 7

[Lmgame Bench] 🔥 OpenAI has just released two open‑weight reasoning models: gpt‑oss‑120B (~117 B) and gpt‑oss‑20B (~21 B),They are the first OpenAI models with open weights since GPT‑2. We tested both in Lmgame Bench, across 4 interactive games: 🧱 Sokoban | 🟦 Tetris | 🔢 2048 | 🍬 Candy Crush Here’s how they ranked (out of 25): → gpt‑oss‑120b → #12 → gpt‑oss‑20b → #13

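Summary: [Lmgame Bench] 🔥 OpenAI has just released two open-weight reasoning models, gpt-oss-120B (about 117 billion parameters) and gpt-oss-20B (about 21 billion parameters), its first open-weight models since GPT-2. We tested both in Lmgame Bench across 4 interactive games: 🧱 Sokoban | 🟦 Tetris | 🔢 2048 | 🍬 Candy Crush. Here is how they ranked (out of 25 models): → gpt-oss-120b → #12 → gpt-oss-20b → #13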

Jim Fan @DrJimFan · Aug 7

This may be a testament to the “Reasoning Core Hypothesis” - reasoning itself only needs a minimal level of linguistic competency, instead of giant knowledge bases in 100Bs of MoE parameters. It also plays well with Andrej’s LLM OS - a processor that’s as lightweight and fast as possible, and maximally relies on knowledge lookup, tool use, agentic flow, etc. Now I’m curious - what’s the absolute smallest model we can squeeze that still functions as a competent LLM OS Kernel?

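Summary: Qwen released the 4B-parameter models Qwen3-4B-Instruct-2507 and Thinking-2507, which support a 256K context and come in separate instruct and reasoning versions. The author suggests this may support the "Reasoning Core Hypothesis": reasoning needs only a minimal level of linguistic competency rather than a knowledge base spread across hundreds of billions of parameters. That fits the lightweight LLM OS idea of keeping the model as small as possible and relying as much as possible on tool use and knowledge lookup.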

Hao AI Lab @haoailab · Jul 25

[Lmgame Bench] 🧐 Kimi-k2-0711-preview shows stellar performance on math, coding and tool-using agentic benchmarks. But we found gaming environments still serves as a challenge for non-reasoning models like Kimi-k2, on Lmgame Bench, it ranks only #18 out of all 19 models we evaluated on our leaderboard.

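Summary: [Lmgame Bench] 🧐 Kimi-k2-0711-preview shows stellar performance on math, coding, and tool-using agentic benchmarks. But we found that gaming environments remain a challenge for non-reasoning models like Kimi-k2: on Lmgame Bench, it ranks only #18 of the 19 models evaluated on our leaderboard.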

Noam Brown @polynoamial · Jul 23

It can be hard to “feel the AGI” until you see an AI master a domain you care deeply about. Everyone will have their Lee Sedol moment at a different time.

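Summary: OpenAI's IMO breakthrough has thrown professional mathematicians into an identity crisis. The author likens the sense of loss at seeing a unique skill commoditized by AI to a person who can talk to dogs discovering that a translator sells for $4.99 at Walmart. Over the coming years this grief over the end of a profession will spread to all mathematicians, programmers, and knowledge workers, and may even make people confront the fear of mortality early.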

Noam Brown @polynoamial · Jul 22

Congrats to the GDM team on their IMO result! I think their parallel success highlights how fast AI progress is. Their approach was a bit different than ours, but I think that shows there are many research directions for further progress. Some thoughts on our model and results 🧵

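Summary: Congratulations to the GDM team on their IMO result; this parallel success shows how fast AI is progressing. GDM's approach differed from ours, which suggests there are many research directions for further progress. Thoughts on our model and results follow in the thread.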

Jim Fan @DrJimFan · Jul 19

My bar for AGI is far simpler: an AI cooking a nice dinner at anyone’s house for any cuisine. The Physical Turing Test is very likely harder than the Nobel Prize. Moravec’s paradox will continue to haunt us, looming larger and darker, for the decade to come.

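Summary: The bar for AGI is not winning a Nobel Prize but being able to cook a nice dinner of any cuisine in anyone's home. The Physical Turing Test is very likely harder than the Nobel Prize, and Moravec's paradox will keep haunting AI for the decade to come.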

Noam Brown @polynoamial · Jul 19

Their bet allowed for formal math AI systems (like AlphaProof). In 2022, almost nobody thought an LLM could be IMO gold level by 2025.

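Summary: In their 2022 bet, Paul Christiano and Yudkowsky put the probability of an IMO gold medal by 2025 at only 8% and 16%. The bet allowed for formal math AI systems such as AlphaProof; at the time, almost nobody believed an LLM could reach IMO gold level by 2025.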

Noam Brown @polynoamial · Jul 19

It takes us a few months to turn the experimental research frontier into a product. But progress is so fast that a few months can mean a big difference in capabilities.

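Summary: Turning experimental research into a product takes a few months, but AI capabilities advance so quickly that a few months can mean a big difference in capability. In a test on the new IMO problems, every model performed worse than humans, and Grok-4 did poorly even with a best-of-n strategy.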

Noam Brown @polynoamial · Jul 19

I think it's safe to say this @OpenAI IMO gold result came as a bit of a surprise to folks

Summary: OpenAI's IMO gold result came as a surprise to many. The post notes this in a lighthearted tone.

Lilian Weng @lilianweng · Jul 14

I still find it mysterious whether and how intelligence and capabilities transfer between domains and skills - from meta learning during early days to more recent question like whether solving maths helps writing a good essay. Sometime I feel a bit pessimistic given not enough evidence I’ve seen. Would like to get more suggestions and pointers to papers on this topic of generalization in the thread! 🧵

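Summary: I still find it mysterious whether and how intelligence and capabilities transfer between domains and skills, from meta-learning in the early days to more recent questions such as whether solving math problems helps with writing a good essay.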

Saining Xie @sainingxie · Jul 12

yes

Summary: Yes.