The recipe for “classic” reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines. What next? In this week’s Gradient Update, @GregHBurnham argues it’s as easy as dropping one of these four ingredients.

Rohan Paul @rohanpaul_ai · May 6 · 68

GPT-5.5 & Opus 4.7 score <1% on ARC-AGI-3

Artificial Analysis @ArtificialAnlys · May 6 · 58

MiniMax-M2.7 is now available across six inference providers on Artificial Analysis, with significant differentiation in speed and price. @SambaNovaAI leads on speed at 435 output tokens/s, >3x faster than any other provider. @FireworksAI_HQ, @novita_labs, @togethercompute, and @GMI_cloud have all matched @MiniMax_AI's first-party API pricing, while SambaNova is 2x higher. Key takeaways:
➤ Fireworks and SambaNova are on the Pareto frontier for Speed vs. Price. At 127 output tokens/s and ~$0.22 per 1M tokens blended, Fireworks is ~2.2x faster than MiniMax's first-party API at the same blended price, whereas SambaNova delivers 435 output tokens/s but at ~2-3.5x the blended price of the other providers (depending on cache usage).
➤ SambaNova is the fastest provider at 435 output tokens/s, ~3.4x the next fastest provider (Fireworks at 127 output tokens/s). The remaining providers run substantially slower: MiniMax’s first-party API at 57 output tokens/s, Novita at 54, GMI at 41, and Together AI at 29.
➤ Cache discounts vary across providers. Fireworks, MiniMax, Novita, and Together AI offer 80% cache hit discounts, while GMI and SambaNova do not offer a discount. For cache-heavy workloads, this can materially increase the relative pricing for GMI and SambaNova.
➤ Optimal provider choice depends on workload. SambaNova may be more suited to latency-sensitive deployments, albeit at a higher cost, while Fireworks may be more suitable for high-volume workloads that are not as latency-sensitive.

Summary: MiniMax-M2.7 is now live on six inference providers, with clear differences in speed and price. SambaNova leads on speed at 435 output tokens/s, more than 3x faster than any other provider, but at roughly 2x the price. Fireworks, Novita, and two other providers match MiniMax's first-party API pricing. Fireworks and SambaNova sit on the Pareto frontier for speed versus price: the former offers strong value, the latter trades a higher price for top speed. Cache discount policies also differ across providers, which materially affects cost for cache-heavy workloads. The best choice therefore depends on how sensitive a given workload is to latency and cost.
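The Pareto-frontier claim can be checked in a few lines of Python. This is only a sketch under stated assumptions: the speeds are the figures quoted in the post, but the blended prices are placeholders (the providers that match first-party pricing are set to ~$0.22 per 1M tokens and SambaNova to 2x that), not published per-provider numbers, and `pareto_frontier` is a hypothetical helper.

```python
# Speeds (output tokens/s) are quoted in the post. Prices (USD per 1M tokens,
# blended) are ASSUMED: ~0.22 for the matched providers, 2x that for SambaNova.
providers = {
    "SambaNova": (435, 0.44),
    "Fireworks": (127, 0.22),
    "MiniMax": (57, 0.22),
    "Novita": (54, 0.22),
    "GMI": (41, 0.22),
    "Together AI": (29, 0.22),
}


def pareto_frontier(points):
    """Return the names of providers not dominated by any other provider.

    A provider is dominated if another one is at least as fast and at least
    as cheap, and strictly better on one of the two axes.
    """
    frontier = []
    for name, (speed, price) in points.items():
        dominated = any(
            s >= speed and p <= price and (s > speed or p < price)
            for other, (s, p) in points.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)


frontier = pareto_frontier(providers)
print(frontier)  # ['Fireworks', 'SambaNova']
```

With these placeholder prices, every provider other than Fireworks and SambaNova is dominated by Fireworks (same price, higher speed). Real blended prices shift with cache usage, so the frontier could differ for cache-heavy workloads.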

Luma @LumaLabsAI · May 5 · 71

Multimodal at the frontier. Built around your business.

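Summary: Luma Labs' UNI-1.1-Max and UNI-1.1 multimodal models rank third overall across text-to-image and image editing in the Image Arena, without using agentic search. In the text-to-image arena the two models rank sixth and seventh respectively; in the multi-image and single-image editing arenas both place in the top eleven, with UNI-1.1-Max ranking seventh in single-image editing. The result marks solid progress for Luma Labs at the multimodal frontier.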

Deedy @deedydas · May 5 · 62

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

OpenRouter @OpenRouter · May 5 · 65

We analyzed GPT 5.5 vs GPT 5.4 and found that costs increased by 49-92%. The 2x price hike of GPT 5.5 is mitigated by the model generating 19-34% fewer completion tokens for longer prompts. More analysis here: https://openrouter.ai/announcements/gpt55-cost-analysis
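A back-of-the-envelope check of these figures, as a hedged sketch: it assumes that both input and output per-token prices doubled, that prompt length is unchanged, and that completion tokens fell by 19-34%. `cost_ratio` and its `output_share` parameter (the fraction of the old bill spent on completion tokens) are illustrative names, not part of the OpenRouter analysis.

```python
def cost_ratio(output_share, token_reduction, price_multiplier=2.0):
    """Return new cost divided by old cost for one request.

    output_share:     fraction of the OLD cost spent on completion tokens (assumed)
    token_reduction:  fractional drop in completion tokens (0.19-0.34 in the post)
    price_multiplier: ASSUMED uniform price increase for input and output
    """
    input_part = (1.0 - output_share) * price_multiplier
    output_part = output_share * price_multiplier * (1.0 - token_reduction)
    return input_part + output_part


# Lower bound: all spend on completion tokens, 34% fewer of them.
print(round(cost_ratio(1.0, 0.34), 2))   # 1.32
# Prompt-heavy request: 25% of old spend on output, 34% fewer tokens.
print(round(cost_ratio(0.25, 0.34), 2))  # 1.83
# Upper bound: no completion-token savings at all.
print(cost_ratio(0.0, 0.34))             # 2.0
```

Under these assumptions the ratio always lies between 1.32 and 2.0, so the reported 49-92% increase is consistent with requests whose bills mix prompt and completion spend.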

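Berryxia.AI @berryxia · May 5 · 58

Honestly, AI tools that aren't super hyped are great, because they don't get dumbed down! Grok 4.3 has been good lately.

Summary: On Vals AI's private benchmarks, Grok 4.3 recently showed leading reasoning ability in the legal and financial domains. On CaseLaw (v2), which is built on real Canadian court cases, it scored 79.31% accuracy, ahead of GPT-5.1; on CorpFin (v2), built on complex multi-page credit agreements, it reached 68.53% accuracy. These tests target hard real-world tasks such as deep legal reasoning and financial contract comprehension, and the results indicate strong performance by Grok 4.3 in high-stakes real-world domains, in line with xAI's stated goal of building a world-class reasoning engine.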

SemiAnalysis @SemiAnalysis_ · May 5 · 71

MINECRAFT STEVE ALERT: GB300 ultra NVL72 is already 2.7x faster 🚀 than GB200 NVL72 on @vllm_project, one of the industry-standard inference engines. On paper, GB300 has only ~1.5x the NVFP4 FLOPs, 1.5x the HBM capacity, and the same HBM bandwidth as GB200, but thanks to full-stack optimization with compounding gains, in the middle of the curve where most providers serve, GB300 is up to 2.7x faster. End-to-end performance is the gold standard, not theoretical on-paper FLOPs. Thanks to the 10x engineers at NVIDIA, @inferact & @coreweave for this temporary GB300 for open source projects!

Summary: Tests on the industry-standard inference engine vLLM show NVIDIA GB300 NVL72 delivering up to 2.7x the measured end-to-end performance of GB200 NVL72. On paper it has only ~1.5x the NVFP4 compute, 1.5x the HBM capacity, and the same bandwidth, but in the middle of the curve where most providers actually serve, compounding gains from full-stack optimization take GB300 well beyond the theoretical compute uplift. The tests ran on temporary GB300 systems that NVIDIA, Inferact, and CoreWeave provided for open source projects, and the results support the view that measured end-to-end performance, not paper FLOPs, is the gold standard for hardware.

swyx 🇸🇬 @swyx · May 5 · 61

seeing lot of people saying that Opus 4.7 is a net regression vs 4.6, but it seems quite anecdotal. offline and online evals point towards a clean step up. what's not being captured? "personality"?

Artificial Analysis @ArtificialAnlys · May 5 · 69

A new anonymous model debuts at #8 in the Artificial Analysis Text to Image Arena! Peanut’s weights are expected to be released soon, which would make it the leading Text to Image Open Weights Model. Peanut is positioned to be the new leading open weights Text to Image model, surpassing Z-Image Turbo, Qwen-Image, and FLUX.2 [dev]. Further details (and weights) coming soon. See example generations from Peanut in the Artificial Analysis Image Arena below 🧵

Elon Musk @elonmusk · May 5 · 41

Try Grok

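Summary: On Vals AI's private benchmarks, Grok 4.3 showed leading performance in law and finance. It ranked first on CaseLaw (v2) with 79.31% accuracy, ahead of GPT-5.1; the test is based on real Canadian court cases and evaluates deep legal reasoning and understanding of precedent. It also took first place on CorpFin (v2) with 68.53% accuracy; that test covers complex long-term credit agreements and evaluates understanding of the terms and risks in multi-page financial contracts. These tests simulate high-stakes real-world challenges and indicate that Grok 4.3 reasons strongly on the hardest tasks. xAI is working to build the reasoning engine the world needs.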

Epoch AI @EpochAIResearch · May 5 · 46

Are AI benchmarks doomed? @GregHBurnham and @tmkadamcz join @ansonwhho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like. (0:00:00) - Preview (0:00:36) - Intro: Are AI benchmarks doomed? (0:03:13) - The costs and benefits of benchmark development (0:11:48) - MirrorCode and scalable benchmarks (0:20:57) - AI speed-up in benchmark development (0:23:28) - The benchmark-reality gap (0:38:26) - Can an AGI benchmark exist? (0:43:18) - Beyond automated scoring (1:00:45) - How AI changes benchmark building in practice

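Summary: The speakers push back on the pessimistic view that AI benchmarks are doomed and explore what the next generation of benchmarks could look like. Topics include the costs and benefits of benchmark development, building scalable benchmarks such as MirrorCode, how AI speeds up benchmark development itself, and the gap between benchmarks and real-world capability. The conversation also touches on whether an AGI benchmark can exist and looks ahead to evaluation methods that go beyond automated scoring.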

Chubby♨️ @kimmonismus · May 4 · 62

A little-known startup just landed on the @ArtificialAnlys AI Video leaderboard, now ranked among the top 6 in the world. Very cool @video_rebirth

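Summary: Bach-1.0 Preview, a text-to-video model from the startup Video Rebirth, debuted at sixth place on the Artificial Analysis global AI video leaderboard. Its performance is comparable to well-known models such as Vidu Q3 Pro, Kling 3.0 Omni 1080p (Pro), and grok-imagine-video. A broad release is planned for late May.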

Ethan Mollick @emollick · May 3 · 57

It's getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive and there are differences between using models in harnesses versus via APIs. I suspect benchmarks understate progress: they are built for models, not harnessed agents.

Eric @ericmitchellai · May 3 · 50

What has ChatGPT helped you learn? How does it fall short as a learning or teaching tool?

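Summary: Comparing how GPT-5.4 and GPT-5.5 teach, a user points out a key difference in how they explain concepts. GPT-5.4 tends to describe the concept first and only then give its label, so the learner has to go back and attach the label, which adds cognitive load. GPT-5.5 gives the label first (for example, "derivative") and immediately follows with the explanation (for example, "describes the rate of change"). This label-first structure keeps explanations flowing without forcing the reader to backtrack and reassemble information, which helps hold the learner's attention over a long teaching conversation.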

Chubby♨️ @kimmonismus · May 2 · 51

Nice! Google is preparing for I/O. New models soon

Elon Musk @elonmusk · May 2 · 39

Grok Voice is used by Starlink right now

[Quoting @XFreeze]: Grok Voice dominates the τ-voice benchmark. Grok scores 67.3%, versus 43.8% for Gemini and 35.3% for GPT Realtime. That is far ahead of the competition, by a huge margin. The best real-time reasoning voice assistant right now.

TestingCatalog News 🗞 @testingcatalog · May 2 · 66

GOOGLE 🚨: A new Gemini Flash model has been spotted on LM Arena. Besides that, Vertex AI customers who still use Gemini Flash 2 received an email that it will be distributed soon. > Transition to Gemini 3.1 Flash Lite - Generally Available soon! Soon 🔜 h/t @hishtadlut

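Summary: A new Gemini Flash model from Google has appeared on LM Arena. Separately, Vertex AI customers received an email saying Gemini 3.1 Flash Lite will be generally available soon. The quoted tweet notes that although the model still shows up in the arena as "Gemini 3 Flash", its output quality has jumped two tiers and is closer to the current Gemini 3.1 Pro; it is a major upgrade, and the actual version could be 3.1, 3.2, or 3.5 Flash.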

François Chollet @fchollet · May 2 · 37

If you want to help the world make sense of AGI and accelerate its arrival, consider joining the ARC Prize foundation. Two roles currently open: Game Platform Engineering Lead, and Model Testing & Analysis Lead https://arcprize.org/jobs

François Chollet @fchollet · May 2 · 56

The latest crop of models remains below 1% on ARC-AGI-3 -- for now. Where will the scores be by the end of the year?

François Chollet @fchollet · May 2 · 70

RL is a bit of a double edged sword: in known territory performance increases, but in unknown territory the model tends to hallucinate that it is performing a completely different task it was trained on

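Summary: Reinforcement learning improves model performance in known territory, but in unknown territory it can lead a model to hallucinate that it is performing a different task it was trained on. This shows up in ARC AGI 3 results for large models such as GPT-5.5, which scored only 0.43%, similar to Claude 4.6, Gemini 3.1, and others. The analysis attributes GPT-5.5's failures mainly to three causes: correct local effects but a wrong world model, abstractions pulled from training data at the wrong level, and solving the problem without reinforcing the reward mechanism. Studying such failure cases in depth helps build a full picture of where large models fall short in specific modalities and how they might improve.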

PixVerse @PixVerse_ · May 1 · 49

Thanks @TomLikesRobots ! Wishing you a happy and cozy weekend!

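Summary: The main tweet thanks @TomLikesRobots for sharing a comparison of text-to-video models. The comparison pits SeeDance 2.0 against HappyHorse 1.0, using the same prompt to generate videos with a lo-fi, cozy, cel-style anime aesthetic. HappyHorse is provided by @PixVerse_ and is currently free for members. Because neither model's built-in audio was good, the creator used @Suno to generate the background track.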

TestingCatalog News 🗞 @testingcatalog · May 1 · 55

Grok 4.3 got to the 7th spot on the Artificial Analysis Index, surpassing Muse Spark from Meta.

Rohan Paul @rohanpaul_ai · May 1 · 43

The LongCat team just released LARYBench, a benchmark built to test whether an AI model truly learns action from video, instead of only looking good when attached to a robot policy later. It evaluates latent actions, meaning the hidden motion signals a model extracts from video, across 1.2M+ clips, 620K+ image pairs, 595K trajectories, 151 action classes, and 11 robot platforms. A latent action representation tries to store the change between frames as something like reach, pick, place, move left, or close gripper, rather than memorizing raw pixels. The key point is that robot training data is scarce, while human and robot videos are abundant, so the whole field wants a way to turn cheap video into useful action knowledge. The paper argues that older evaluations mixed too many things together, because a robot succeeding on a task depends on the policy, training recipe, environment, and controller, so you could not tell whether the action representation itself was actually good. LARYBench splits the problem into 2 cleaner tests, where one asks whether the representation knows what happened and the other asks whether it preserves enough detail for how to move. The biggest result is that general self-supervised vision models beat specialized embodied LAMs, with V-JEPA 2 reaching 76.62% average action classification accuracy, while DINOv3 gives the best overall control regression score at 0.19 MSE, far ahead of embodied models clustered around 0.87 to 0.97. The deeper point is that strong visual representations already contain a surprising amount of action knowledge, and the paper also shows that latent feature spaces map to robot control better than pixel reconstruction spaces, which helps explain why some robotics systems may be building on the wrong intermediate representation. 🧵 1.

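Summary: The LongCat team released LARYBench, a benchmark for evaluating whether an AI model truly learns action from video rather than only looking good once attached to a downstream robot policy. It focuses on the latent action representations a model extracts from video and, using more than 1.2M video clips among other data, splits evaluation into two clean tests: action classification and control regression. The key finding is that general self-supervised vision models (such as V-JEPA 2 and DINOv3) outperform specialized embodied models, which suggests that strong visual representations already contain rich action knowledge and that latent feature spaces map to robot control better than pixel reconstruction. This points to a way of using abundant video data to address the scarcity of robot training data.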

Artificial Analysis @ArtificialAnlys · May 1 · 57

All three leading open weights models were released last week. Progress continues for open weights models alongside proprietary ones, with the gap to GPT-5.5, the leading proprietary model, sitting at 6 points on the Artificial Analysis Intelligence Index @Kimi_Moonshot’s Kimi K2.6 (Reasoning) and @Xiaomi's MiMo V2.5 Pro (Reasoning) tie as the leading open weights models on the Artificial Analysis Intelligence Index at 54, with @deepseek_ai's DeepSeek V4 Pro (Reasoning, Max Effort) at 52. This places the best open weights models within 3-6 points of the leading proprietary models: @OpenAI's GPT-5.5 (xhigh) at 60, and @Google's Gemini 3.1 Pro Preview and @AnthropicAI's Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57. For context: just one year ago the highest-scoring open weights model was DeepSeek V3 0324 which achieved 22 on the Intelligence Index, and was ~13 points below the highest-scoring proprietary model, Claude 3.7 Sonnet (Reasoning) at 35. Key takeaways: ➤ The top three most intelligent open weights models are trillion-plus-parameter MoE architectures with permissive licenses. Kimi K2.6 (Reasoning) has 1T total / 32B active parameters with 256K context window, MiMo V2.5 Pro (Reasoning) has 1T total / 42B active with 1M context window, and DeepSeek V4 Pro (Reasoning, Max Effort) has 1.6T total / 49B active with 1M context window. ➤ The gap to proprietary remains wide on the hardest reasoning and agentic coding evaluations. On HLE (Humanity's Last Exam) the three top open weights models score 34-36%, vs 44% for GPT-5.5 (xhigh) and 45% for Gemini 3.1 Pro Preview. On CritPt (Research-level Physics) they score 4-12%, vs 27% for GPT-5.5 (xhigh). On TerminalBench Hard (Agentic Coding & Terminal Use) they score 43-46%, vs 61% for GPT-5.5 (xhigh) and 54% for Gemini 3.1 Pro Preview. 
➤ Omniscience (knowledge + hallucination) shows a large gap to proprietary models, with DeepSeek V4 Pro (Reasoning, Max Effort) hallucinating significantly more than its open weights peers. DeepSeek V4 Pro (Reasoning, Max Effort) scores -10, MiMo V2.5 Pro (Reasoning) +4, and Kimi K2.6 (Reasoning) +6. By comparison, GPT-5.5 (xhigh) scores +20, Claude Opus 4.7 (Adaptive Reasoning, Max Effort) +26, and Gemini 3.1 Pro Preview +33.

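Summary: Three leading open weights models, Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro, were released last week, scoring 52-54 on the Artificial Analysis Intelligence Index. That leaves the best of them 6 points behind the top proprietary model, GPT-5.5 at 60, a marked improvement over a year ago, when the top open weights model scored 22. All three are trillion-parameter-scale MoE architectures. However, a clear gap to proprietary models remains in complex reasoning, agentic coding, and knowledge accuracy: scores on HLE, CritPt, and TerminalBench Hard lag substantially, and DeepSeek V4 Pro's hallucination problem stands out on the Omniscience evaluation.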

elvis @omarsar0 · May 1 · 58

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on @FireworksAI_HQ inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length. To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup. @deepseek_ai's DeepSeek-V4-Pro is clearly good at agentic coding (probably the best from the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arxiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories. I love the Wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like: https://github.com/dair-ai/dair-workshops/tree/main/agentic-engineering-wiki DeepSeek-V4-Pro handled the task without breaking stride. Multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It compares in capability and actual multi-turn agentic work. What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. 
No corrupted reasoning traces. Just fast, reliable iteration. The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice. For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here: https://app.fireworks.ai/models/fireworks/deepseek-v4-pro

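Summary: The tester used DeepSeek-V4-Pro in the Pi coding agent to build an LLM wiki and was blown away by how well it works out of the box. In the tester's view it is the first open-weight model to reason at the level of Claude and Codex, while being cost-effective and supporting a 1M-token context. It runs in a basic harness with no complex configuration and is strong at agentic coding and knowledge-intensive reasoning, handling multi-step research, code generation, and context-heavy reasoning across company docs, forums, papers, and code repositories. The tester credits Fireworks' inference speed, described as the fastest in the market, and the hybrid attention design, which cuts the KV cache to 10% and inference FLOPs by nearly 4x, for making the loop fast and cheap enough to run in practice.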

Ethan Mollick @emollick · May 1 · 61

The new Grok comes in below the latest Chinese open weights models; Grok 4 was at the frontier when released. (& Artificial Analysis: please stop using GDPval-AA which is not a useful test of anything except a model’s ability to impress Gemini as a judge)

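Summary: xAI released Grok 4.3, which scores 53 on the Artificial Analysis Intelligence Index, ahead of Grok 4.20, Muse Spark, and other models. The core improvement is value for money: input and output prices are about 40% and 60% lower than the previous generation, and the cost of running the benchmark suite has fallen. The release improves markedly on real-world agentic tasks such as GDPval-AA and is strong at instruction following and customer service tasks. The tweet, however, points out that it still trails the latest Chinese open weights models and criticizes GDPval-AA itself as being of limited value.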

OpenRouter @OpenRouter · May 1 · 68

The new Grok-4.3 from @xai is live on OpenRouter! Grok-4.3 releases at a lower price than Grok-4.2, while seeing a large jump in agentic performance: a 321 point increase to 1500 ELO on @ArtificialAnlys GDPval-AA, surpassing other top models despite the lower price.

Rohan Paul @rohanpaul_ai · May 1 · 58

Frontier AI can now autonomously chain complex, expert-level cyber attacks end-to-end, at superhuman speed and near-zero marginal cost. GPT-5.5 essentially tied with Mythos Preview - within the margin of error — both far ahead of earlier models (GPT-4o, Claude Opus 4.x, etc.). - GPT-5.5: 71.4% (±8.0%) - Mythos Preview: 68.6% (±8.7%) AISI has been running controlled, realistic cybersecurity evaluations on the latest AI models. These include: - Narrow CTF-style tasks (expert-level challenges like exploiting memory corruptions, breaking crypto, reverse-engineering stripped binaries, etc.). - Multi-step “cyber range” simulations — a full 32-step corporate network attack chain (recon → initial access → lateral movement → privilege escalation → full network takeover). A human expert needs ~20 hours for this. They previously tested Mythos Preview, and now OpenAI’s GPT-5.5. One hard reverse-engineering task (custom virtual machine) takes a human expert ~12 hours with professional tools. GPT-5.5 solved it in under 11 minutes at a cost of $1.73.

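Summary: Frontier AI can now autonomously carry out complex, expert-level cyber attack chains end-to-end at superhuman speed and near-zero marginal cost. In AISI's cybersecurity evaluations, GPT-5.5 performed on par with Mythos Preview, and both were far ahead of earlier models such as GPT-4o. GPT-5.5 completed the end-to-end attack in a 32-step corporate network attack simulation that takes a human expert about 20 hours. On a reverse-engineering task that takes a human expert about 12 hours, GPT-5.5 finished in under 11 minutes at a cost of $1.73.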
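The "within the margin of error" claim can be illustrated with a simple interval check. This is an informal sketch, not a significance test: it assumes each ± figure describes a symmetric interval around the reported score, and `interval` and `overlaps` are hypothetical helper names.

```python
def interval(mean, margin):
    """Return the (low, high) range implied by mean ± margin."""
    return (mean - margin, mean + margin)


def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


gpt_55 = interval(71.4, 8.0)  # roughly (63.4, 79.4)
mythos = interval(68.6, 8.7)  # roughly (59.9, 77.3)

print(overlaps(gpt_55, mythos))          # True
print(abs(71.4 - 68.6) < min(8.0, 8.7))  # True: the gap is smaller than either margin
```

The 2.8-point gap is well inside both margins, which is why the two models are described as essentially tied.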

Chubby♨️ @kimmonismus · May 1 · 60

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside @atomic_chat_hq (prompt is below) Device: MacBook Pro M5 Max, 64GB RAM Results: Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens So what is more important: tokens per second, or the quality of the final answer? Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time. In this one-shot Pac-Man gamedev contest, Gemma 4 31B was the clear winner. Its game logic was stronger: click reactions were smoother, and it handled interactions with elements like walls, ghosts, and particle effects better. But this was only one test. Maybe Qwen 3.6 27B can show better results with better settings. Open the comments, try our prompt, and share your result below.

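Summary: In a local LLM gamedev contest on @atomic_chat_hq, Gemma 4 31B faced Qwen 3.6 27B on a MacBook Pro M5 Max. Qwen generated faster (32 tokens/sec) and gave a more creative answer, but Gemma produced a shorter, clearer, and more logical answer in just 3m 51s and 6,209 tokens. In the Pac-Man game logic itself, Gemma did better on click reactions, interactions with walls and ghosts, and particle effects. The author stresses that this was a single test, that Qwen might do better with different settings, and invites the community to verify.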
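The tokens-per-second versus total-time trade-off follows from simple arithmetic: generation time is roughly total tokens divided by decode speed. A small sketch using the figures from the post; the helper names are illustrative, and the estimate ignores prompt processing and other overhead, which is presumably why the reported times are slightly longer.

```python
def generation_seconds(total_tokens, tokens_per_second):
    """Estimate decode time, ignoring prompt processing and other overhead."""
    return total_tokens / tokens_per_second


def fmt(seconds):
    """Format a duration in seconds as 'Xm YYs', rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


qwen = generation_seconds(33_946, 32)   # reported wall-clock: 18m 04s
gemma = generation_seconds(6_209, 27)   # reported wall-clock: 3m 51s

print(fmt(qwen))   # 17m 41s
print(fmt(gemma))  # 3m 50s
```

Although Qwen decodes about 19% faster per token, it emits roughly 5.5x as many tokens, so its answer takes more than four times as long to arrive.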

Artificial Analysis @ArtificialAnlys · May 1 · 65

Ant Group has just released Ling 2.6 1T, an open weights, non-reasoning model with high cost efficiency and a reasonable intelligence tradeoff. Ling 2.6 1T scores 34 on the Artificial Analysis Intelligence Index, a 15-point jump from Ling-1T Ling 2.6 1T is the latest model from Ant Group’s @TheInclusionAI lab. Ant Group recently released Ling 2.6 Flash, a 104B total parameter non-reasoning model. Ling 2.6 1T’s weights have been publicly released on Hugging Face. Key takeaways: ➤ Comparable intelligence to similarly sized non-reasoning models: At 1T total parameters, Ling 2.6 1T sits near DeepSeek V3.2 (non-reasoning, 32) and Kimi K2.5 (non-reasoning, 37) in intelligence. This is a marked improvement from Ling-1T, which scores 19 on the Intelligence Index. However, there remains a ~10-point gap to frontier non-reasoning open weights models such as GLM-5.1 (non-reasoning, 44) and Kimi K2.6 (non-reasoning, 43). ➤ Strong performance in scientific reasoning and knowledge: Ling 2.6 1T scores 75% on GPQA and 8% on Humanity’s Last Exam (HLE), indicating solid performance on graduate-level reasoning and knowledge recall tasks. This is comparable to DeepSeek V3.2 (non-reasoning), which achieves 75% on GPQA and 11% on HLE. ➤ Efficient token usage: Ling 2.6 1T uses ~16M output tokens to run the Artificial Analysis Intelligence Index, making it more efficient than MiMo V2 Flash (non-reasoning, ~17M), and significantly more efficient than GLM-5.1 (non-reasoning, ~75M) and Kimi K2.6 (non-reasoning, ~27M) ➤ Strong cost-to-intelligence positioning: At $0.30 per million input tokens and $2.50 per million output tokens on InclusionAI’s first-party API, Ling 2.6 1T costs only ~$95 to run the full Artificial Analysis Intelligence Index. This positions it competitively for large-scale workloads relative to models in a similar intelligence tier. ➤ Relatively weak factual reliability: Ling 2.6 1T scores -51 on AA-Omniscience, our benchmark for factual accuracy and hallucination. 
This is primarily driven by a high hallucination rate (92%), which is similar to GPT-5.5 (non-reasoning, 91%). However, its 21% accuracy is broadly in line with comparable non-reasoning models. Additional model details: ➤ Size: 1T total parameters ➤ Pricing: $0.30 / $2.50 per 1M input/output tokens (via Novita API) ➤ License: Weights not yet released ➤ Availability: First-party API through InclusionAI

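Summary: Ant Group's InclusionAI lab released Ling 2.6 1T, an open weights non-reasoning model. It has 1 trillion parameters and scores 34 on the Artificial Analysis Intelligence Index, 15 points above its predecessor Ling-1T and close to peers such as DeepSeek V3.2. It performs solidly on scientific reasoning and knowledge tasks, scoring 75% on GPQA. It is also token-efficient, needing only about 16M output tokens to run the index, and cost-effective, at about $95 to run the full index through the first-party API. Its factual reliability is weak, however: it scores -51 on AA-Omniscience, mainly because of a 92% hallucination rate. The model weights are publicly available on Hugging Face.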

Artificial Analysis @ArtificialAnlys · May 1 · 46

GPT-5.5 Pro achieves a small bump on GPT-5.4 Pro with 60% lower cost and token use in our frontier science eval, CritPt. CritPt tests models on graduate-level physics research problems contributed by 60+ researchers from 30+ institutions globally. When CritPt was released in November 2025, the highest score was 9% (Gemini 3 Pro Preview). ~4 months later, GPT-5.4 Pro (xhigh) more than tripled this score, reaching 30%. Now, GPT-5.5 Pro (xhigh) has surpassed this result by half a percentage point at 60% lower cost. The model is priced identically per token, but used fewer tokens to complete the evaluation. According to OpenAI, GPT-5.5 Pro “uses more compute to think harder and provide consistently better answers” than GPT-5.5. Congratulations @OpenAI and @sama on this result.

Summary: On the frontier science evaluation CritPt, GPT-5.5 Pro (xhigh) improved on GPT-5.4 Pro (xhigh) by 0.5 percentage points, to 30.5%, at 60% lower cost and token use. CritPt consists of graduate-level physics problems contributed by 60+ researchers from 30+ institutions worldwide. Since its release in November 2025, the top score has risen from 9% for Gemini 3 Pro Preview to 30% for GPT-5.4 Pro. According to OpenAI, GPT-5.5 Pro "uses more compute to think harder and provide consistently better answers" than GPT-5.5. Per-token pricing is unchanged; the model simply used fewer tokens to complete the evaluation.

Chubby♨️ @kimmonismus · May 1 · 46

GPT-5.5 on par with Claude Mythos on multi-step cyber-attack simulations? OpenAI: comeback of the year.

Sam Altman @sama · May 1 · 43

lisan say more mean things about us you're being too nice

[Quoting @scaling01]: GPT-5.5 is on par with Claude Mythos - GPT-5.5: average pass rate 71.4% (±8.0%) - Mythos Preview: 68.6% (±8.7%) - GPT-5.5 completed a task that takes a human expert about 12 hours in 11 minutes at a cost of $1.73

Artificial Analysis @ArtificialAnlys · May 1 · 64

Alibaba's Qwen3.6 27B is the new open weights leader under 150B parameters scoring 46 on the Artificial Analysis Intelligence Index, but uses ~3.7x the output tokens and costs ~21x more than Gemma 4 31B (39) to run the full Intelligence Index @Alibaba_Qwen has released two open weights models in the Qwen3.6 family: Qwen3.6 27B (Dense, 46 on the Intelligence Index) and Qwen3.6 35B A3B (MoE, 43). The MoE variant has 36B total parameters but only activates 3B per forward pass. Both are Apache 2.0 licensed, support 262K context, include native multimodal input, and use the unified thinking/non-thinking hybrid architecture. Unlike Qwen3.5, Alibaba has not released larger Qwen3.6 models as open weights - Qwen3.6 Plus and Qwen3.6 Max Preview remain proprietary, so the Qwen3.6 open weights family is currently all under 50B models. All scores below are for reasoning mode. The Intelligence Index is our synthesis metric incorporating 10 evaluations covering agentic tasks, coding, and scientific reasoning. Key takeaways: ➤ Qwen3.6 27B is the most intelligent open weights model under 150B parameters. At 46 on the Intelligence Index, Qwen3.6 27B is ahead of Qwen3.6 35B A3B (43), Qwen3.5 27B (42), and Gemma 4 31B (39). It is also ahead of larger open weights models including NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), Qwen3.5 122B A10B (42) and gpt-oss-120b (high, 33). In native BF16 precision, the 27B takes ~56GB to store the weights, fitting on a single H100, and in 4-bit quantization the weights fit on consumer hardware with 16GB+ of RAM ➤ Qwen3.6 35B A3B is the most intelligent open weights model with ~3B active parameters, 6 points ahead of Qwen3.5 35B A3B (37) and 13 points ahead of GLM-4.7-Flash (30). Other ~3B active peers include Gemma 4 26B A4B (31), Qwen3 Coder Next (80B total, 28), and NVIDIA Nemotron Cascade 2 30B A3B (28) ➤ AA-Omniscience improvement is driven entirely by abstention rather than accuracy. 
Qwen3.6 27B's hallucination rate falls from 80% to 48% versus Qwen3.5 27B, while accuracy is roughly flat - consistent with our finding that AA-Omniscience accuracy typically correlates with total parameter count and Qwen3.6 27B retains the same 27B parameter count as its predecessor. The 35B A3B shows the same pattern whereby hallucination drops from 84% to 50% while accuracy remains equivalent ➤ Token usage is up across both models versus Qwen3.5 and significantly higher than Gemma 4 31B. Qwen3.6 27B used ~144M output tokens to run the Intelligence Index (~1.5x Qwen3.5 27B at 98M, ~3.7x Gemma 4 31B at 39M). Qwen3.6 35B A3B used ~143M (~1.4x Qwen3.5 35B A3B at 100M, ~3.7x Gemma 4 31B) ➤ The 27B got materially more expensive while the 35B A3B is roughly flat versus predecessor. Per-token pricing on Alibaba Cloud moved differently, with the 27B going from $0.30/$2.40 to $0.60/$3.60 while the 35B A3B (Reasoning) remains nearly flat at $0.248/$1.485 (vs $0.25/$2.00 for Qwen3.5 35B A3B). Qwen3.6 27B costs ~$659 to run the Intelligence Index, ~2.2x Qwen3.5 27B (~$299) and ~21x Gemma 4 31B (~$31 at median third-party pricing of $0.14/$0.40 per 1M input/output tokens). Qwen3.6 35B A3B costs ~$280, roughly tied with Qwen3.5 35B A3B (~$302) and ~9x Gemma 4 31B ➤ Qwen3.6 27B is competitive with leading models on agentic real-world work tasks despite its size. At 1414 Elo on GDPval-AA, Qwen3.6 27B is ahead of recent open weights peers Qwen3.6 35B A3B (1297), Qwen3.5 27B (1157) and Gemma 4 31B (1115), but trails larger open weights leaders including DeepSeek V4 Pro (Reasoning, Max Effort, 1554) and GLM-5.1 (Reasoning, 1535). It matches DeepSeek V4 Flash (Reasoning, High Effort, 1414) at 284B total parameters, and sits roughly in line with GPT-5.4 mini (xhigh, 1436) and Muse Spark (1421). ➤ Non-reasoning variants remain equivalent versus Qwen3.5. 
Qwen3.6 27B (Non-reasoning, 37) is effectively tied with Qwen3.5 27B (Non-reasoning, 37); Qwen3.6 35B A3B (Non-reasoning, 32) is equivalent to Qwen3.5 35B A3B (Non-reasoning, 31). The Qwen3.6 generation gains are concentrated in reasoning mode Other information: ➤ Context window: 262K tokens (equivalent to Qwen3.5) ➤ License: Apache 2.0 ➤ Multimodality: Native vision input (text and image), text output ➤ API pricing (Alibaba Cloud): Qwen3.6 27B: $0.60/$3.60, Qwen3.6 35B A3B (Reasoning): $0.248/$1.485 ➤ Availability: Available on Alibaba Cloud first-party API. Qwen3.6 35B A3B is available on several third-party APIs such as @DeepInfra, @parasail_io, @clarifai and @novita_labs

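Summary: Alibaba released two open weights models in the Qwen3.6 family: a 27B dense model and a 35B A3B mixture-of-experts model. Qwen3.6 27B scores 46 on the Artificial Analysis Intelligence Index, making it the most intelligent open weights model under 150B parameters, ahead of Gemma 4 31B and others. Running the full index, however, takes about 3.7x the output tokens and costs about 21x as much as Gemma 4 31B. Both models are Apache 2.0 licensed, support 262K context, and accept multimodal input. Notably, hallucination rates fell sharply versus the previous generation while accuracy stayed roughly flat. The larger Plus and Max Preview versions have not been released as open weights.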
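The memory and cost figures in this post can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes a nominal 27B parameter count and ignores runtime overhead such as the KV cache and activations; `weight_gb` is a hypothetical helper. The quoted ~56GB for BF16 is slightly above this estimate, presumably because the actual parameter count is a bit higher than the nominal one.

```python
def weight_gb(params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes), ignoring overhead."""
    return params * bits_per_param / 8 / 1e9


print(weight_gb(27e9, 16))  # 54.0  (BF16)
print(weight_gb(27e9, 4))   # 13.5  (4-bit quantization)

# Ratios quoted in the post, recomputed from its own numbers.
print(round(659 / 31, 1))   # 21.3  cost to run the index, Qwen3.6 27B vs Gemma 4 31B
print(round(144 / 39, 1))   # 3.7   output tokens in millions, same comparison
```

The 4-bit estimate of about 13.5 GB is consistent with the claim that the quantized weights fit on hardware with 16GB+ of RAM.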

Artificial Analysis @ArtificialAnlys · April 30 · 56

Tencent has released Hy3-preview, an open weights reasoning model scoring 42 on the Artificial Analysis Intelligence Index, trailing recent open weights peers Hy3-preview is the latest model from @TencentHunyuan. It is a 295B total / 21B active parameter Mixture-of-Experts model, smaller than its December 2025 predecessor Tencent HY 2.0 (406B total / 32B active). Recent leading open weights reasoning models include Qwen3.6 27B (Reasoning, 46), DeepSeek V4 Flash (Reasoning, Max Effort, 47, 284B / 13B) and GLM-5.1 (Reasoning, 51, 744B / 40B). The Intelligence Index is the Artificial Analysis synthesis metric incorporating 10 evaluations covering agentic tasks, coding and scientific reasoning. Key takeaways: ➤ Hy3-preview trails recent open weights peers on GDPval-AA. Hy3-preview scores an Elo of 1235 on GDPval-AA, our agentic real-world work tasks benchmark, behind Qwen3.6 27B (Reasoning, 1414), DeepSeek V4 Flash (Reasoning, Max Effort, 1388) and GLM-5.1 (Reasoning, 1535). GDPval-AA tests models on real-world tasks across 44 occupations and 9 major industries. ➤ Hy3-preview ties GLM-5.1 (Reasoning) on CritPt despite scoring nearly 10 Intelligence Index points lower. Hy3-preview scores 4.6% on CritPt (research-level physics), matching GLM-5.1 (Reasoning, 51 on the Intelligence Index) and ahead of Qwen3.6 27B (Reasoning, 1.1%) but behind DeepSeek V4 Flash (Reasoning, Max Effort, 7.1%). It trails the open weights leaders, including DeepSeek V4 Pro (Reasoning, Max Effort, 12.9%) and Kimi K2.6 (8.0%). ➤ Hy3-preview used ~125M output tokens to run the Intelligence Index. This is ~12% more than GLM-5.1 (Reasoning, 112M) and less than Qwen3.6 27B (Reasoning, 144M) and DeepSeek V4 Flash (Reasoning, Max Effort, 241M). ➤ AA-Omniscience is a relative weakness compared to peers. Hy3-preview scores -35 on the Artificial Analysis Omniscience Index with 28% accuracy and an 87% hallucination rate. 
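This trails DeepSeek V4 Flash (Reasoning, Max Effort, -23), Qwen3.6 27B (Reasoning, -20) and GLM-5.1 (Reasoning, 2). Other information: ➤ Size: 295B total parameters, 21B active parameters ➤ Context window: 256K tokens ➤ License: Tencent HY Community License Agreement, with restricted commercial use ➤ Availability: Weights are available on @huggingface and the model is also available on @SiliconFlowAI at $0/$0 per 1M input/output tokens

Summary: Tencent released Hy3-preview, an open weights mixture-of-experts model with 295B total and 21B active parameters. It scores 42 on the Artificial Analysis Intelligence Index, behind recent open weights reasoning models such as GLM-5.1, DeepSeek V4 Flash, and Qwen3.6 27B. Results are uneven: it trails its main competitors on the real-world task benchmark GDPval-AA but ties the higher-scoring GLM-5.1 on the research-level physics evaluation CritPt; its relative weakness is the AA-Omniscience index, where its hallucination rate is high. The model uses the Tencent HY Community License Agreement, which restricts commercial use, and is available on Hugging Face and SiliconFlowAI.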

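阿绎 AYi @AYi_AInotes · April 30 · 64

Honestly, I really did not expect to see Baidu in first place, hahaha

Summary: The LMArena text leaderboard shows Baidu's ERNIE 5.1 Preview at 1476 points, first among Chinese models and in the global top fifteen, making it the only Chinese model on the list and placing it above GPT-5.5 and others. Although current AI hype centers on agents, multimodality, and similar areas, DeepSeek V4 and ERNIE 5.1 Preview still focus on text. The post argues that text ability is the foundation of large models: capabilities such as code, reasoning, and multimodality all "grow" out of it, gaps in text ability directly determine the level of higher-layer capabilities, and text therefore remains the key dividing line between models.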

SemiAnalysis @SemiAnalysis_ · April 30 · 53

GB300 NVL72 Rack Scale Dynamo SGLang disaggregation has up to 6.5x better performance than B200 on DeepSeekv4 Pro 1.6T 🚀 The high throughput configuration uses @deepseek_ai's MegaMoe kernels, which fully fuse & overlap EP dispatch, EP combine & the GEMMs into a single kernel. This performance comes from the 10x engineers @BanghuaZ, Tom & the rest of the team at @radixark, @lmsysorg & @NVIDIAAI, who rapidly enabled it! Big shoutout to @CoreWeave for contributing temporary GB300 NVL72 racks towards open source performance optimization for all to benefit!

Summary: On the DeepSeek-V4 Pro 1.6T model, a GB300 NVL72 system with rack-scale disaggregation delivers up to 6.5x the performance of B200. The high-throughput configuration relies on DeepSeek's MegaMoe kernels, which fully fuse and overlap EP dispatch, EP combine, and the GEMMs in a single kernel. Engineers at Radixark, LMSYS, and NVIDIA AI enabled the result quickly, and CoreWeave contributed temporary GB300 NVL72 racks to the open source optimization effort for the benefit of the whole community.

Ethan Mollick @emollick · April 30 · 56

Gemini now can create documents, and it is a nice start, but not up to the frontier yet, as you can see from my "LBO of Hogwarts" test. PowerPoints are substantially worse than NotebookLM, spreadsheets are primitive, still no thinking trace, it doesn't think hard enough, either.
