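Logan Kilpatrick @OfficialLoganK · Jun 7 · 54

you could build a top tier venture firm just focusing investment decisions short and long term based on deep model benchmarking / evals find capability overhang, find areas models suck and track trajectory, etc

Summary: You could build a top-tier venture firm whose short- and long-term investment decisions rest entirely on deep model benchmarking and evals: spotting capability overhang, finding areas where models perform poorly, tracking their trajectory, and so on.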

向阳乔木 @vista8 · Jun 6 · 37

An AI summary of tonight's livestream with @tuturetom, where we shared all of our experience without holding anything back. The most common use cases for Open Design: front-end design and prototyping, slide decks (PPT), posters, and so on. During the stream we also discussed a highly subjective ranking of LLM front-end aesthetics, for reference only: Claude opus 4.8 > kimi2.6 > GPT 5.5 > Deepseek v4 pro > GLM 5.1 > Deepseek v4 Flash

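AYi @AYi_AInotes · Jun 6 · 60

Be sure to give your "lobster" (龙虾) or Hermes a multimodal LLM. In my hands-on testing today, the best-value multimodal LLMs right now are the Qwen3-VL / Qwen3.5 VL series: output is 22x cheaper than Gemini 3.5 Flash, and image-reading ability holds up just as well. I use qwen/qwen3.5-flash ($0.1/$0.4, multimodal image + video, 1M context), for your reference.

Summary: Based on hands-on testing, the user recommends the Qwen3-VL / Qwen3.5 VL series as the best-value multimodal LLMs at present, with output priced 22x below Gemini 3.5 Flash and comparable image-reading ability. The author's specific model is qwen/qwen3.5-flash, priced at $0.1/$0.4, supporting multimodal image + video input with a 1M context window.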

Artificial Analysis @ArtificialAnlys · Jun 6 · 52

Google’s newly released open weights model, Gemma 4 12B, supports transcription but is far from the frontier, scoring 8.8% on AA-WER (#58) Gemma 4 12B is the latest release from @GoogleDeepMind in the Gemma 4 family. With a score of 8.8% on AA-WER, it is able to capture a reasonable amount of conversation context, but underperforms compared to transcription-focused open weights models like Voxtral Mini Transcribe 2 (3.6% WER, with 4B parameters) and slightly larger open weights language models like Voxtral Small (2.8% WER, with 12B parameters). The new model launched alongside their local dictation app, Eloquent, available on MacOS and iOS. Gemma 4 12B is the largest in the Gemma 4 family to support transcription, alongside Gemma 4 E4B and Gemma 4 E2B, with Gemma 4 31B and Gemma 4 26B A4B supporting text, image and video input only. These models are available on a variety of platforms including Hugging Face, Ollama and LMStudio. We are currently running Gemma 4 12B through the full Artificial Analysis Intelligence Index and will share results soon.

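Summary: Google DeepMind has released the open weights model Gemma 4 12B, which supports speech transcription and scores 8.8% on the AA-WER benchmark (rank #58), well behind the transcription-focused open weights model Voxtral Mini Transcribe 2 (4B parameters, 3.6% WER) and Voxtral Small (12B parameters, 2.8% WER). It is the largest model in the Gemma 4 family to support transcription (alongside E4B and E2B), while 31B and 26B A4B support text, image, and video input only. Google launched the local dictation app Eloquent (MacOS/iOS) at the same time. The model is available on Hugging Face, Ollama, and LMStudio.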
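
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below implements that textbook definition only; how AA-WER normalizes text (casing, punctuation, numerals) is not stated in the post, so the lowercasing here is an assumption and the sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()  # assumption: simple lowercase + whitespace split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from any benchmark): one substitution, one deletion.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% on very poor transcripts, which is one reason published scores depend on the normalization rules used.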

Rohan Paul @rohanpaul_ai · Jun 6 · 76

Arena just released a real-world agent leaderboard that ranks AI models by how well they complete actual user jobs, not isolated benchmark questions. The system tracks agents using web search, files, and terminal tools while people ask them to write code, build apps, research topics, create documents, and analyze files. The problem with almost all traditional AI benchmarks is that they test clean tasks, while agents now handle messy work like coding, research, documents, web browsing, files, and terminal commands. Agent Arena tries to measure agents inside real work sessions, where users correct them, approve results, complain, download files, and expose tool failures as the task unfolds. Its core idea is to treat each model choice like a test condition, then estimate how much that model improves task outcomes compared with a baseline. The leaderboard combines 5 signals: confirmed task success, praise versus complaint, ability to follow corrections, recovery from terminal errors, and whether the agent invents tools that do not exist. The data is large enough to show real behavior patterns, with 300K+ tasks, 2M+ tool calls, and 40M lines of code produced by agents. The score combines task success, steerability, bash recovery, praise vs. complaint, and tool hallucination, which means the model is judged by whether it finishes, recovers, accepts correction, and avoids fake tool calls. GPT-5.5 High leads with +10.7% net improvement, followed by Claude Opus 4.7 Thinking at +9.5% and GPT-5.4 High at +8.9%. The most useful detail is that agents fail like workers under pressure: they can leave one part incomplete, claim the job is done, or sound confident while backing down after correction. Arena’s strongest contribution is treating agents as working systems, where model choice, tool use, recovery behavior, and user satisfaction all count together.

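Summary: Arena has launched an agent leaderboard based on real user tasks, evaluating how models perform in work such as writing code, building apps, and analyzing documents rather than on isolated benchmarks. The leaderboard draws on 300K+ tasks, 2M+ tool calls, and 40M lines of code, and combines signals including task success, following corrections, error recovery, user praise versus complaint, and tool hallucination. Top three: GPT-5.5 High (+10.7%), Claude Opus 4.7 Thinking (+9.5%), GPT-5.4 High (+8.9%).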

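Berryxia.AI @berryxia · Jun 5 · 70

LLMs are no longer competing on reasoning; they are all competing on planning now! Tencent Hunyuan, together with Renmin University's Gaoling School of Artificial Intelligence, has open-sourced PlanningBench, a framework built specifically to test and train the real planning ability of LLMs. It packs in more than 30 planning tasks drawn from the real world, covering six categories including scheduling, production, travel, resource allocation, and emergency response, each with clear success criteria and a fully automated verification mechanism. You can use it to measure just how weak today's strongest models are at planning, or take it straight into further fine-tuning so that a model truly evolves from "can talk" to "can do". Until now the whole industry competed on parameters, context, and tool calling, as if planning ability would simply emerge on its own. PlanningBench now lays the truth bare with 30+ verifiable tasks: planning is the real dividing line between agents as toys and agents as productive tools. By putting the paper, code, and dataset on GitHub and Hugging Face, Tencent has effectively pulled this hardest, most central capability out of the black box and onto an open track.

Summary: Tencent Hunyuan and Renmin University's Gaoling School of Artificial Intelligence have open-sourced PlanningBench, a scalable, verifiable framework for evaluating and training the real planning ability of large language models (LLMs). It contains 30+ real-world planning tasks across six categories including scheduling, production, travel, resource allocation, and emergency response, each with clear success criteria and a fully automated verification mechanism. Users can use it to assess where today's strongest models fall short on planning, or use it directly for fine-tuning so that models evolve from "can talk" to "can do". The paper, code, and dataset are all open-sourced on GitHub and Hugging Face.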

Logan Kilpatrick @OfficialLoganK · Jun 5 · 40

the amount of alpha you can have right now creating good public AI benchmarks is wild, such a big opportunity

Summary: The amount of alpha available right now from creating good public AI benchmarks is wild; it is a huge opportunity.

Rohan Paul @rohanpaul_ai · Jun 5 · 53

Nemotron 3 Ultra vs GPT-5.5 on atomic[.]chat, a desktop app that runs LLMs locally. Nemotron 3 Ultra gave a nearly identical result on a test to build an HTML5 canvas with real physics, while being 10X cheaper.
- Nemotron 3 Ultra: 11.3k tokens, $0.051
- GPT 5.5: 11.0k tokens, $0.57
Nemotron 3 Ultra has 550 bn total parameters (55 bn active per token), because it is a Mixture-of-Experts model.

Summary: On atomic.chat, a local desktop app, Nemotron 3 Ultra (MoE architecture, 550B total parameters, 55B active per token) and GPT-5.5 performed almost identically on a task to build an HTML5 canvas with a physics engine (spinning bucket, Galton board, extreme-mass block collisions). Nemotron 3 Ultra used 11.3k tokens at a cost of $0.051 and GPT-5.5 used 11.0k tokens at $0.57; the former costs only about 1/10 of the latter, and the quality gap is far smaller than the price gap.
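
The "10X cheaper" figure can be checked directly from the token counts and dollar amounts quoted in the post. The short calculation below uses only those numbers; the per-million-token values it derives are effective costs for this single run, not list prices, and the variable names are illustrative.

```python
# Figures quoted in the post for one HTML5-canvas test run.
runs = {
    "Nemotron 3 Ultra": {"tokens": 11_300, "cost_usd": 0.051},
    "GPT-5.5": {"tokens": 11_000, "cost_usd": 0.57},
}

# How many times more the GPT-5.5 run cost than the Nemotron 3 Ultra run.
cost_ratio = runs["GPT-5.5"]["cost_usd"] / runs["Nemotron 3 Ultra"]["cost_usd"]

# Effective cost per million tokens for this run (not a published price).
effective_usd_per_million = {
    name: run["cost_usd"] / run["tokens"] * 1_000_000 for name, run in runs.items()
}

# Share of parameters active per token for the 550B-total / 55B-active MoE.
active_fraction = 55 / 550

print(f"cost ratio: {cost_ratio:.1f}x")
for name, usd in effective_usd_per_million.items():
    print(f"{name}: ${usd:.2f} per 1M tokens (effective)")
print(f"active parameters per token: {active_fraction:.0%}")
```

On these numbers the gap is slightly more than tenfold, so "10X" and "about 1/10" are both fair roundings.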

DogeDesigner @cb_doge · Jun 5 · 31

ChatGPT vs Grok: Asked both to turn this pixelated logo into a high-resolution image. ChatGPT failed badly while Grok delivered a clean, sharp, high-resolution image. Grok is the clear winner.

Summary: ChatGPT vs Grok: both were asked to turn a pixelated logo into a high-resolution image. ChatGPT failed badly, while Grok produced a clean, sharp, high-resolution image. Grok is the clear winner.

swyx @swyx · Jun 5 · 55

Finally! the first eval ship from cog!!!!!!!!!! 👼🏼 To contextualize: @METR_Evals cap out at ~16 hours. Cog has private enterprise evals up to 100hrs, and is confident enough to put a financial guarantee on it 🤯 METR dataset: ML eng, GPU kernels, cybersecurity > "METR (2026) used a combination of GPT-4o and GPT-5 to estimate the human-equivalent times from compressed Claude Code transcripts. These transcripts were collected from 7 METR technical staff on 34 sessions labeled on human ground truth". rlog of 0.83 Cog dataset: real life java/typescript/python/c# feature dev, bugfixes, migrations > "We collected a ground-truth dataset by asking Devin users to review recent representative sessions, and estimate how long each completed session would have taken without Devin. Our dataset consists of 258 sessions from 126 users across a diverse set of enterprise customers." rlog of 0.74 on held out set this is pioneering real world evals work and part 1 of a broader frontier code evals drop that I'm really looking forward to writing up. huge kudos to @annarmitchell and @ryanbai1412 for leading the unglamorous last mile data collection!!

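Summary: Cognition has released an enterprise-grade AI coding eval that supports in-depth testing up to 100 hours (METR caps out at about 16 hours) and comes with a financial guarantee: if the value Devin produces falls below the fee, Cognition will make up the difference, up to $10 million. The METR dataset covers ML engineering, GPU kernels, and cybersecurity, and used GPT-4o and GPT-5 to estimate human-equivalent times from Claude Code transcripts, with rlog = 0.83. The Cognition dataset comes from 258 real sessions by 126 Devin users (Java/TS/Python/C# feature development, bug fixes, migrations), with rlog = 0.74 on the held-out set.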

Artificial Analysis @ArtificialAnlys · Jun 5 · 65

Nemotron 3 Ultra was launched today, including a focus on low latency agentic performance. We tested it against peers under restricted turn-usage limits on Terminal-Bench v2.1 - @NVIDIA Nemotron 3 Ultra completes tasks at a much faster pace than peers due to its high inference speed while scoring competitively on the benchmark. In this analysis each model is given a ‘turn limit’ within which it can complete tasks, inside a customized version of the Terminus 2 harness which advises it of this limit. We apply 4 increasing turn limits and trace each result’s tradeoff of task latency and performance. Time per task, on the X axis, is calculated as decode time based on token usage and measured endpoint output speeds (for Nemotron 3 Ultra, speeds were measured on a pre-release deployment on @blackboxai), plus the actual time spent executing tools to complete the benchmark. Nemotron 3 Ultra is the fastest across all turn limits and sits on the Pareto frontier for performance versus time per task for this evaluation.

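Summary: NVIDIA released Nemotron 3 Ultra today with a focus on low-latency agentic performance. On Terminal-Bench v2.1 the model was tested against peers under 4 increasing turn limits. Thanks to its high inference speed (time per task is based on token usage and endpoint output speeds measured on a pre-release blackboxai deployment, plus actual tool execution time), Nemotron 3 Ultra completed tasks faster than its peers at every turn limit while keeping a competitive benchmark score, and sits on the performance-versus-time Pareto frontier for this evaluation.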
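
The time-per-task definition in the post (decode time derived from token usage and measured output speed, plus the time actually spent executing tools) reduces to a one-line formula. The sketch below is a minimal reading of that description; the `TaskRun` type and all numbers are hypothetical, and the real analysis may account for details the post does not mention.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    output_tokens: int    # tokens the model generated for the task
    tool_seconds: float   # wall-clock time spent executing tools


def time_per_task(run: TaskRun, output_tokens_per_second: float) -> float:
    """Decode time (tokens / output speed) plus tool execution time, in seconds."""
    decode_seconds = run.output_tokens / output_tokens_per_second
    return decode_seconds + run.tool_seconds


# Hypothetical task: the same token usage and tool time served at two speeds.
run = TaskRun(output_tokens=12_000, tool_seconds=45.0)
for speed in (400, 100):
    print(f"{speed} tok/s -> {time_per_task(run, speed):.0f} s per task")
```

Under this formula, faster serving only shortens the decode term, so tasks dominated by tool execution benefit less from a high output speed.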

Artificial Analysis @ArtificialAnlys · Jun 4 · 74

NVIDIA has just released Nemotron 3 Ultra, the new most intelligent US open weights model, with leading speed for its intelligence Nemotron 3 Ultra scores 47.7 on the Artificial Analysis Intelligence Index, well ahead of the next strongest US open weights models, Gemma 4 31B (39.2), Nemotron 3 Super (36.0) and gpt-oss-120b (33.3), but behind the Chinese-led open weights frontier (Kimi K2.6 at 53.9). We partnered with @NVIDIA to evaluate this model for intelligence and speed ahead of its public release. These figures use the final NVFP4 weights that NVIDIA recommends for inference, but our tests show minimal intelligence impact compared to BF16 testing, with higher precision resulting in an Artificial Analysis Intelligence Index score of 48.2 vs. the NVFP4 score of 47.7. Key Takeaways: ➤ Nemotron 3 Ultra leads in speed for its intelligence: through BlackBox AI ahead of release, Nemotron 3 Ultra is served at over 400 output tokens per second - this is slightly faster than the typical serving speed of gpt-oss-120b despite being >4X larger, and comes with significantly greater intelligence ➤ Largest Nemotron 3 model so far: with approximately 550 billion total parameters and 55 billion active, Nemotron 3 Ultra is significantly larger than its siblings and is the largest and most intelligent US open weights model release ever ➤ Nemotron 3 Ultra is the leading US open weights model on the Artificial Analysis Intelligence and Agentic Indexes by far, but Gemma 4 31B scores ~1 point higher on the Coding Index (comprised of Terminal-Bench Hard and SciCode)

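Summary: NVIDIA has released Nemotron 3 Ultra, currently the most intelligent US open weights model. It scores 47.7 on the Artificial Analysis Intelligence Index, ahead of Gemma 4 31B (39.2), Nemotron 3 Super (36.0), and gpt-oss-120b (33.3) but behind the Chinese open weights model Kimi K2.6 (53.9). The model has about 550B total parameters with 55B active and is served at over 400 tokens/s, slightly faster than gpt-oss-120b with significantly higher intelligence. It scores 47.7 at NVFP4 precision and 48.2 at BF16, a minimal difference.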

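karminski-牙医 @karminski3 · Jun 4 · 64

Here is my hands-on test of MiniMax-M3! This round covers complex front-end work, back-end agentic coding, agent capability tests, and a summary of my own usage experience. The conclusions: On the front end, it fully handles the KCORES2026p2 front-end test set; spatial understanding, modeling precision, and scene aesthetics are all on point. What I like most is the aesthetics: its use of color is very good. The main weakness is that complex requirements (a ray-tracing engine, for example) do not come out right in one shot, though a round of iteration fixes that. Back-end results also jumped this time, scoring above deepseek-v4-pro and a range of other Chinese models and slightly below GPT-5.4-Pro(xhigh). Agent capability is equally strong: it reached the second-highest number of accepted jobs on the leaderboard, which shows that its planning ability is especially strong. Below is the M3 usage experience I gathered from testing and real use, for reference: My impression is that M3 really likes to reason and can run extremely long reasoning in a single pass. In these front-end tests its longest output even hit the 64k-token cap I had set. So do not start with one super-complex prompt; first turn the requirements into a plan, then let an agent swarm execute it to get good results. M3 is therefore a natural fit for coding agents with a plan mode. If you embed it in an agent framework, prompt orchestration has to be done well: do not dump a large number of tool calls or a huge system prompt on it all at once; it takes real effort to orchestrate properly. M3 is a big improvement over the earlier 2.7 version. In terms of model tendencies, M3 is an extremely strong planner, so it suits planning-oriented agent frameworks such as task decomposition, schedule management, and workflow design. The weakness exposed this time is that constraints are not enforced strongly enough during execution: complex rules set in the prompt must be backed by a code-level harness with a closed feedback loop, rather than relying on the model to police its own behavior. #minimaxm3 #minimax #agenticcoding #aiagent #harness

Summary: MiniMax-M3 hands-on test: on the front end it handles KCORES2026p2 with strong spatial understanding, modeling precision, and aesthetics, including good use of color; complex requirements such as a ray-tracing engine need iteration. Its back-end score beats deepseek-v4-pro and other Chinese models and is slightly below GPT-5.4-Pro (xhigh). Agent capability reached the second-highest number of accepted jobs on the leaderboard, with planning a standout. Usage notes: M3 favors long reasoning, with a single output reaching 64k tokens; it suits coding agents with a plan mode; prompt orchestration needs care, avoiding large numbers of tool calls; execution constraints are weak, so a code-level harness loop should be added.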

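宝玉 @dotey · Jun 4 · 57

Lately Codex GPT-5.5 feels worse at getting work done than Claude Opus 4.8, though that may be because I am developing a Mac app, which Opus is better at.

Summary: 宝玉 (@dotey) says Codex GPT-5.5 is not as good as Claude Opus 4.8 at getting work done, with Opus especially stronger for Mac app development. @jesselaunz also reports that Codex suddenly "got dumber": a goal expected to take 2 days was delivered in only 20 minutes, and the user gave it 5/10, the lowest score since they began rating.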

Chubby♨️ @kimmonismus · Jun 4 · 67

A blind Stanford-led study of nearly 3,000 anonymized matchups found law professors across 16 schools preferred AI-generated answers to student contract-law questions over those written by fellow professors 75% of the time, and judged the AI responses far less likely to be pedagogically harmful (3.5% vs. 12%). "The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM." Now imagine the performance of models in 6-12 months.

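Summary: A blind study led by Stanford University, analyzing nearly 3,000 anonymized matchups, found that law professors at 16 law schools preferred AI-generated answers to contract-law questions over answers written by professors 75% of the time, and judged the AI answers far less likely to be pedagogically harmful (3.5% vs 12%). "The team tested a range of systems, including commercial tutoring tools and Google's NotebookLM." Now imagine how models will perform in 6-12 months.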

Artificial Analysis @ArtificialAnlys · Jun 4 · 67

StepFun's Step 3.7 Flash sits on the Intelligence vs Output Speed Pareto frontier, scoring 43 on the Artificial Analysis Intelligence Index, and is served at over 400 output tokens/s
Step 3.7 Flash (open weights, Apache 2.0) is a significant upgrade on Step 3.5 Flash and stands out for its speed and gains in agentic performance (particularly GDPval-AA). 400 output tokens/s is more than double other models of a similar size class. Contributing to this speed is that the model has only 11B active parameters and the model ships with trained Multi-Token Prediction heads (3) that predict several tokens in a single forward pass, letting it decode multiple tokens at once using speculative decoding.
Key results for Step 3.7 Flash with the high reasoning level:
➤ 4 point Intelligence Index improvement: Step 3.7 Flash scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash 2603 (38.5). It is equivalent to Qwen3.5 122B A10B (41.6) and trails MiniMax-M2.7 (49.6) and DeepSeek V4 Flash (Max Effort, 46.5)
➤ Speed-intelligence frontier: Step 3.7 Flash achieves ~400 output tokens/s on StepFun's first-party API, placing the model on the Intelligence vs Output Speed Pareto frontier. StepFun has released the weights for this model and we expect several third-party providers to serve this model
➤ Agentic capability improvements: Step 3.7 Flash improves over Step 3.5 Flash 2603 across our agentic evaluations, in both GDPval-AA (real-world agentic tasks) and TerminalBench Hard (agentic coding and terminal use). It achieves a GDPval-AA Elo of 1298, up from 1070 for Step 3.5 Flash 2603, and its TerminalBench Hard score increases to 35.6% from 32.6%. AA-LCR (Long Context Reasoning) improves to 63.7% from 54.3%. Scores for other evals remain relatively flat
➤ Weaker on knowledge and hallucination than peers: While Step 3.7 Flash trails competitors overall on AA-Omniscience (-38), it improves from Step 3.5 Flash 2603 (-44). It has an AA-Omniscience accuracy of 25.4% and a hallucination rate of 84.4%
➤ Native multimodal support, new in this generation: Step 3.7 Flash introduces a 1.8B-parameter vision encoder for native image understanding, where Step 3.5 Flash was text-only. On MMMU-Pro (multimodal reasoning) it scores 75.3%, roughly matching Qwen3.5 122B A10B (75.0%). Among its same-size open weights peers, MiniMax-M2.7, DeepSeek V4 Flash, and gpt-oss-120b are text-only
Key model details:
➤ Context window: 256K tokens
➤ Parameters: 198B total, 11B active (MoE). At BF16 native precision, Step 3.7 Flash requires ~400GB to store the weights. StepFun has also released FP8 (~200GB) and NVFP4 (~100GB) versions for lower-memory deployment
➤ License: Apache 2.0
➤ Availability: Currently Step 3.7 Flash is available on @StepFun_ai 's first-party API

Summary: StepFun has released Step 3.7 Flash as open weights (Apache 2.0), with 198B total parameters, 11B active (MoE), and a 256K context window. It scores 42.6 on the Artificial Analysis Intelligence Index, up 4 points from Step 3.5 Flash, with output speed above 400 tokens/s, accelerated by Multi-Token Prediction (3 heads). A new 1.8B vision encoder adds native multimodal support, scoring 75.3% on MMMU-Pro. Agentic gains: GDPval-AA Elo rises from 1070 to 1298, TerminalBench Hard reaches 35.6%, and AA-LCR 63.7%. Knowledge and hallucination remain weak: AA-Omniscience accuracy is 25.4% with a hallucination rate of 84.4%. Weights are provided at BF16, FP8, and NVFP4 precision to lower deployment cost.
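
The quoted storage figures (~400GB at BF16, ~200GB at FP8, ~100GB at NVFP4) follow from multiplying the 198B parameter count by the bytes each format uses per weight. The estimate below is a rough sketch that assumes 2, 1, and 0.5 bytes per parameter and ignores per-block scale factors, activations, and KV cache, so real deployments need somewhat more memory.

```python
TOTAL_PARAMS = 198e9  # total parameters quoted for Step 3.7 Flash

# Assumed storage per weight; NVFP4 is treated as a plain 4-bit format here,
# without the small overhead of its scaling metadata.
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "NVFP4": 0.5}


def weight_gigabytes(total_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return total_params * bytes_per_param / 1e9


sizes = {fmt: weight_gigabytes(TOTAL_PARAMS, b) for fmt, b in BYTES_PER_PARAM.items()}
for fmt, gigabytes in sizes.items():
    print(f"{fmt}: ~{gigabytes:.0f} GB")
```

Note that all 198B parameters must be stored even though only 11B are active per token; sparsity in an MoE model lowers compute per token, not the weight footprint.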

AI Notkilleveryoneism Memes ⏸️ @AISafetyMemes · Jun 4 · 44

PAPER: We used state-of-the-art LLMs to prove AI still can't do X
THE STATE-OF-THE-ART LLMS:

Summary: Paper: we used state-of-the-art LLMs to prove that AI still cannot do X. The state-of-the-art LLMs:

Artificial Analysis @ArtificialAnlys · Jun 4 · 71

Jensen Huang’s keynote at Computex used Artificial Analysis benchmarks to communicate the performance of Nemotron 3 Ultra Jensen used our Artificial Analysis Intelligence Index vs. Output Speed chart to communicate the performance of NVIDIA’s new Nemotron 3 Ultra model. The presentation also highlighted GDPval-AA, Artificial Analysis' benchmark that uses OpenAI's GDPval dataset to evaluate models on economically valuable tasks NVIDIA additionally highlighted Artificial Analysis Text to Image and Image to Video Arena Elos to promote the NVIDIA Cosmos 3 model family. Congratulations @NVIDIAAI on the launches!

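Summary: In his Computex keynote, Jensen Huang used Artificial Analysis's Intelligence Index vs. Output Speed chart to present the performance of NVIDIA's new Nemotron 3 Ultra model. The keynote also mentioned GDPval-AA, Artificial Analysis's benchmark that evaluates models on economically valuable tasks using OpenAI's GDPval dataset. NVIDIA also used Artificial Analysis Text to Image and Image to Video Arena Elo scores to promote the Cosmos 3 model family.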

StepFun @StepFun_ai · Jun 4 · 44

Great demo by @atomic_chat_hq. Step 3.7 Flash was designed for real-world agentic coding tasks — not just generating code fast, but keeping logic, visuals, and execution coherent across complex outputs. Love seeing builders test it in creative ways!

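Summary: StepFun (阶跃星辰) says its Step 3.7 Flash won across the board in a physics-programming test against DeepSeek V4-Flash. The test asked for a self-contained HTML5 canvas animation, written without libraries, with three scenes (a Galton board, a ball bouncing in a rotating hexagon, and synchronized metronomes) and real physics. Step 3.7 Flash output 59.6k tokens (9 min 57 s), and DeepSeek V4-Flash output 52.5k tokens (6 min 21 s). Although DeepSeek was faster, the StepFun model was ahead on physics simulation, visuals, and logic rendering. The main tweet notes that Step 3.7 Flash was designed for real-world agentic coding tasks and keeps logic, visuals, and execution coherent across complex outputs.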

Saining Xie @sainingxie · Jun 3 · 67

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

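Summary: The research team has introduced the VSTAT benchmark for evaluating how well multimodal large language models (MLLMs) track dynamic state in video. The tasks look simple, such as counting cups, identifying typed text, and counting page turns; humans complete them easily, but current MLLMs perform poorly. The benchmark aims to advance visual state tracking as a frontier direction, addressing the core challenge of building and updating an internal world state from incomplete, noisy visual observations.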

DogeDesigner @cb_doge · Jun 3 · 21

Grok Imagine is pretty cool with logos. 🔥

Summary: Grok Imagine handles logos quite well. 🔥

fofr @fofrAI · Jun 3 · 37

The way K2 handles style reference strength is really nice.

Summary: The way K2 handles style reference strength is really good.

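Ethan Mollick @emollick · Jun 3 · 47

Law professors wrote questions they were asked during office hours. Gemini 2.5 & humans answered them, then other law professors blindly judged the results:
- Gemini had a 75% win rate vs. professors
- Gemini's answers were rated LESS harmful than humans
- Newer models do even better

Summary: Law professors wrote down questions students had asked them during office hours. Gemini 2.5 and humans each answered, and other law professors then judged the results without knowing who wrote each answer. Gemini won 75% of the time against the professors, its answers were rated less harmful than the human answers, and newer models do even better.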

Lee Robinson @leerob · Jun 3 · 58

Quick rant on AI model benchmarks:
- Some of the most popular ones are no longer helpful (SWE-bench¹)
- It can be very hard to reproduce reported results (so lots of variance)
- Take them with a grain of salt, look at the average across many
We need some creative new ideas for AI model marketing. Supportive of a Survivor spin-off (who is the AI Jeff Probst!?).
I get why every model release shows benchmark scores as the headline. It's actually pretty hard to describe how a model has improved without it sounding like fluff. And also it sounds boring to say the same thing over and over ("it's better at following instructions" repeat x10). Benchmarks make it very clear there is a number, which likely started bad, and is now going up. Yay!
The reality is that benchmarks are most useful to those *training* the model so they know where to improve. Model labs use these benchmarks to measure progress, which is why having non-saturated benchmarks is extremely helpful. If you see models getting 90% on an eval, it's probably time to make a harder version.
I do think there's a word of caution for everyone interpreting benchmarks. It's very hard to get exactly the same scores, which is why some benches show error bars and do the average over multiple runs. But even further, the hardware and GPUs the evals are running on really matter! Small differences there, or minor tweaks to the prompt, can swing scores by multiple percentage points².
All of that to say, it's important to look at many different benchmarks, and then actually use the model to make your own opinion. For example, there's recently been a lot of debate on here about Opus 4.8 not benchmarking as well as other models. But personally I've found the model really good from my own usage. Your mileage may vary!
There aren't many high-quality public benchmarks that measure things like the UX of the model responses, the style of the messages, the warmth or directness of the "personality". These things matter *a lot* for the day-to-day usage. How the model performs in the real world is often different from very specific benches.
In summary, benchmarks matter but they are not a substitute for extensively testing the model yourself with real work.
¹: https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified
²: https://www.anthropic.com/engineering/infrastructure-noise

Summary: Lee Robinson criticizes the limitations of current AI model benchmarks: SWE-bench, for example, is outdated, and results are hard to reproduce. Scores are sensitive to hardware and GPU differences and to small prompt changes, and can swing noticeably. Benchmarks are valuable to model trainers for measuring progress, but for ordinary users they lose their reference value once scores saturate. He notes that important factors such as a model's interaction style and personality are not adequately measured by existing public benchmarks. He therefore advises users to consult several benchmarks together and to use the model themselves to form a judgment.
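
The point about error bars and averaging over multiple runs can be made concrete with a few lines of standard-library Python. The sketch below reports a mean and a standard error for repeated runs of one benchmark; the five scores are invented for illustration and do not come from any real model or eval.

```python
import math
import statistics

# Hypothetical scores (percent) from five repeated runs of the same benchmark.
run_scores = [71.2, 73.5, 69.8, 72.9, 70.6]

mean_score = statistics.mean(run_scores)
# Sample standard deviation describes the run-to-run spread.
spread = statistics.stdev(run_scores)
# Standard error of the mean: how uncertain the averaged score itself is.
std_error = spread / math.sqrt(len(run_scores))

print(f"mean = {mean_score:.1f}%, run-to-run stdev = {spread:.1f}, "
      f"standard error = {std_error:.2f}")
```

With a spread of this size, a one-point difference between two models on a single run says very little, which is the reason to compare averages with their error bars rather than individual headline numbers.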

Artificial Analysis @ArtificialAnlys · Jun 3 · 62

Krea 2 Medium debuts at #6 on the Artificial Analysis Text to Image Leaderboard, trailing only models from OpenAI, Google, and NVIDIA! Krea 2 is @krea_ai's first image model family trained entirely from scratch (Krea 1 was developed in collaboration with Black Forest Labs). Krea 2 is available in two variants: Krea 2 Medium, and Krea 2 Large, which is more comparable to FLUX.2 [pro] in our arena. Notably, Krea 2 Medium outranks the larger, more expensive Krea 2 Large in our arena. Krea describes Medium as smaller and faster, with extensive post-training that makes its outputs especially stable and consistent across generations. While Large is positioned as the more capable model, our leaderboard results align with Krea's view that Medium "handles the broadest range of use cases reliably." Both models generate at 1K resolution and share a distinct set of generation controls via the API: ➤ Style transfer: Krea can extract the style of up to 10 reference images, with each image being able to be weighted in terms of importance ➤ Creativity Setting: A configurable API parameter (raw, low, medium, high) that sets how closely the model follows the prompt versus reinterpreting it ➤ Moodboards: A collection of images that can be collected in the application to apply a style transfer onto the image (separate from individual style reference images) At $30 per 1k images via Krea's API, Krea 2 Medium is priced below comparable models such as Nano Banana Pro at $134/1k images or grok-imagine-image-quality at $50/1k images. Krea 2 Large is priced at $60 per 1k images, and both models' prices increase with the use of the Style Transfer and Moodboard features. Both models are available in the Krea app, via Krea's API, and on official third-party launch partners. Congratulations to @krea_ai on the launch! See below for comparisons between Krea 2 and other leading models in our Artificial Analysis Image Arena 🧵

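Summary: Krea 2 Medium, a text-to-image model developed in-house by Krea AI, ranks #6 on the Artificial Analysis leaderboard, trailing only models from OpenAI, Google, and NVIDIA. Notably, the smaller, faster Medium variant outranks the Large variant, which is positioned as more capable. Both models support style transfer, creativity control, and similar operations via the API and generate images at 1K resolution. Pricing: Krea 2 Medium is $30 per 1k images and Krea 2 Large is $60 per 1k images.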

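Krea @krea_ai · Jun 3 · 57

Krea 2 is now on @ArtificialAnlys. #1 image model from an independent research lab and #6 globally on text-to-image leaderboard. open-source cooking and coming soon.

Summary: Krea 2 is now listed on @ArtificialAnlys: the #1 image model from an independent research lab and #6 globally on the text-to-image leaderboard. An open-source version is in the works and coming soon.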

Rohan Paul @rohanpaul_ai · Jun 2 · 65

Most video models look better than they understand and Video quality is only the easiest thing to notice. LongCat just released WBench, it turned video world model testing from a beauty contest into a stress test for control, multi-turn memory, instruction-following, and physical plausibility. It exposed the gap between beautiful video generation and controllable world simulation. A pretty clip is not enough, because a usable world model must keep the same scene, obey later actions, move the camera correctly, preserve objects, and avoid impossible cause-and-effect. WBench tests this with 289 cases, 1,058 interaction turns, 20 models, 5 dimensions, and 22 automatic metrics, covering navigation, subject actions, event edits, perspective switches, and both viewpoints. Across all those 20 evaluated models, the paper finds that no model dominates all dimensions, which means current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability. Its design separates the world setup from the user action, so researchers can identify whether a failure comes from weak rendering, poor scene setup, bad control, lost state, or broken physics. Navigation has near-zero connection with visual quality, consistency, or physics, meaning a model can look strong while still failing to move on command. The key shift: stop asking only “does the video look good?” and start asking “can the model keep a controllable world alive across many turns?” 🧵 1.

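Summary: Meituan's LongCat has released WBench, an evaluation benchmark for video world models. It shifts the focus of testing from visual appeal to core capabilities such as control, multi-turn memory, instruction following, and physical plausibility. It comprises 289 cases and 1,058 interaction turns and evaluates 20 models on 5 dimensions such as navigation, subject actions, and event edits, using 22 automatic metrics. The study finds that no model dominates all dimensions, indicating that current systems have not yet merged high-quality rendering, reliable control, long-horizon memory, and physical rule-following into one stable capability. WBench's design can tell whether a failure stems from rendering, scene setup, control, or physics, and it finds that navigation ability is essentially unrelated to visual quality.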

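宝玉 @dotey · Jun 2 · 59

Cursor is increasing usage quotas for its users. I have been using Cursor's Agent heavily lately, and it works quite well. Among the GUI agents I use regularly: Codex App > Cursor > Claude Desktop. A few highlights:
1. Its multitask mode can run several background tasks in parallel, and it is fast.
2. It lets you choose freely among models, unlike Codex and Claude Code, which only offer their own; composer 2.5 is decent in both capability and speed on ordinary tasks.
3. Plan mode is fairly thorough and lists detailed Steps; combined with multitask mode, results are usually stable.
Shortcomings: no /goal support yet, no mobile version, and no Chrome use + Computer use debugging like Codex's, only debugging in the built-in browser.

Summary: Cursor announced higher usage quotas for all Teams users and, inspired by its Ultra plan, will launch a Premium team seat offering 5x the usage at 3x the price. One user shared their experience of heavy Agent use, finding it effective; highlights include a multitask mode that runs tasks in parallel, flexible choice among models (such as composer 2.5), and a Plan mode with detailed steps that gives stable results in combination. Current gaps are no support for /goal, no mobile version, and no Chrome use and Computer use debugging like Codex's, only built-in browser debugging.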

MiniMax (official) @MiniMax_AI · Jun 2 · 54

26% improvement on BU Bench 👀 more to come

Summary: A 26% improvement on BU Bench 👀 with more to come.

Artificial Analysis @ArtificialAnlys · Jun 2 · 61

Overview of our recently launched AA-WER Streaming benchmark, measuring streaming Speech to Text models on accuracy and latency for voice agent use cases Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts keep responses feeling natural and free up the response-time budget for reasoning and tool calls. Accuracy matters too, since errors can compound downstream. Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower. Models from Cartesia, ElevenLabs, and Deepgram sit on the accuracy-latency Pareto frontier. Cartesia Ink-2 leads on final transcript accuracy at 3.59% WER (210ms), closely followed by ElevenLabs Scribe v2 Realtime at 3.64% WER (140ms). Deepgram Flux is fastest at ~20ms on final transcript latency (7.36% WER). In this video, Kiriill Butler, Member of Technical Staff at Artificial Analysis, walks through the benchmark and key results.

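Summary: The Artificial Analysis team has launched the AA-WER Streaming benchmark to evaluate streaming speech-to-text models in voice agent scenarios, focusing on accuracy and latency, which streaming models must balance. In the results, Cartesia Ink-2 leads on final transcript accuracy with a 3.59% word error rate at 210ms latency; ElevenLabs Scribe v2 Realtime follows closely at 3.64% WER and 140ms; Deepgram Flux has the lowest latency (about 20ms) but a 7.36% WER. Models from these three providers sit on the accuracy-latency Pareto frontier.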
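
A model sits on the accuracy-latency Pareto frontier when no other model is at least as good on both axes and strictly better on one. The sketch below applies that definition to the three results quoted above, where lower is better for both WER and latency; the fourth entry is a hypothetical model added only to show what a dominated point looks like, and Deepgram's "~20ms" is entered as 20.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SttModel:
    name: str
    wer_percent: float   # lower is better
    latency_ms: float    # lower is better


def dominates(a: SttModel, b: SttModel) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.wer_percent <= b.wer_percent and a.latency_ms <= b.latency_ms
    strictly_better = a.wer_percent < b.wer_percent or a.latency_ms < b.latency_ms
    return no_worse and strictly_better


def pareto_frontier(models: List[SttModel]) -> List[SttModel]:
    """Keep every model that no other model dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


models = [
    SttModel("Cartesia Ink-2", 3.59, 210),
    SttModel("ElevenLabs Scribe v2 Realtime", 3.64, 140),
    SttModel("Deepgram Flux", 7.36, 20),
    SttModel("Hypothetical Model X", 6.00, 300),  # invented, for contrast only
]

frontier = pareto_frontier(models)
print([m.name for m in frontier])
```

The three quoted models trade accuracy against latency without any one beating another on both, so all three remain on the frontier, while the invented entry is dominated by Cartesia Ink-2.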

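karminski-牙医 @karminski3 · Jun 1 · 56

Here is my hands-on test of Qwen3.7-Max! This time I used a brand-new front-end test set. Straight to the conclusions: Qwen3.7-Max is probably one of the biggest improvements among the versions I have tested; in this front-end round it even completed test items that version 3.6 could not. And on the back-end test it went straight to first place! Of the 34 models tested, it is the only one to implement an IVF-PQ + ADC indexing scheme! That lifted the top back-end score from the 4000 points previously held by GPT-5.5-Pro(xhigh) to 6947 now! Note, however, that the distribution of its test performance is not very stable, so review the code often in use to get higher output quality. Agent capability also improved this time, reaching the first tier. Finally, I used Qwen3.7-Max to build an AI-based disk recovery system to test the model's real engineering ability. The build went smoothly with no blockers; you can see the result in the video. #qwen #阿里千问 #qwen37max #AIAgent

Summary: Hands-on testing shows Qwen3.7-Max has improved markedly over version 3.6 on front-end tests. On the back-end test it stood out among the 34 participating models, taking first place with 6947 points, far above the 4000 previously scored by GPT-5.5-Pro (xhigh), and it was the only model to implement an IVF-PQ + ADC indexing scheme. The test also notes that the stability of its output distribution needs work and recommends reviewing code frequently. In addition, its agent capability has reached the first tier, and it can be used for real engineering tasks such as building an AI disk recovery system.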
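
For context on the indexing scheme named above: product quantization (PQ) compresses each vector into a few small centroid ids, one per sub-vector, and asymmetric distance computation (ADC) scores an uncompressed query against those codes through a per-query lookup table. The sketch below shows only the PQ encoding and ADC lookup; the IVF coarse-partitioning step is omitted, the codebooks are hand-picked rather than learned with k-means, and none of this reflects the code the model actually produced in the test.

```python
from typing import List, Sequence

Codebook = List[List[float]]  # centroids for one subspace


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vector: Sequence[float], num_subspaces: int) -> List[Sequence[float]]:
    """Cut a vector into equal-length sub-vectors, one per subspace."""
    size = len(vector) // num_subspaces
    return [vector[i * size:(i + 1) * size] for i in range(num_subspaces)]


def pq_encode(vector: Sequence[float], codebooks: List[Codebook]) -> List[int]:
    """Replace each sub-vector with the index of its nearest centroid."""
    parts = split(vector, len(codebooks))
    return [
        min(range(len(cb)), key=lambda k: squared_distance(part, cb[k]))
        for part, cb in zip(parts, codebooks)
    ]


def adc_table(query: Sequence[float], codebooks: List[Codebook]) -> List[List[float]]:
    """Distances from each raw query sub-vector to every centroid, computed once."""
    parts = split(query, len(codebooks))
    return [[squared_distance(part, c) for c in cb] for part, cb in zip(parts, codebooks)]


def adc_distance(code: List[int], table: List[List[float]]) -> float:
    """Approximate squared distance to a stored code via table lookups only."""
    return sum(table[m][k] for m, k in enumerate(code))


# Hand-picked toy codebooks: 2 subspaces of 2 dimensions, 2 centroids each.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
database = [[0.1, 0.0, 1.9, 2.1], [0.9, 1.1, 0.2, -0.1]]
codes = [pq_encode(v, codebooks) for v in database]

table = adc_table([1.0, 1.0, 0.0, 0.0], codebooks)
distances = [adc_distance(code, table) for code in codes]
print(codes, distances)
```

The query stays uncompressed while the database is quantized, which is what makes the comparison "asymmetric"; each stored vector then costs only a handful of table lookups to score.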

MiniMax (official) @MiniMax_AI · Jun 1 · 62

I could watch SVG tests all day! Send me more with M3 👀

Summary: I could watch SVG tests all day! Send me more made with M3 👀

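MiniMax (official) @MiniMax_AI · Jun 1 · 53

love to see it 🙌 go try M3 in @orca_build with @opencode

Summary: Love to see it 🙌 Go try M3 in @orca_build with @opencode. [Quoting @JinjingLiang]: MiniMax M3 amazed me. I have been using it for free in @orca_build with the @opencode agent, so far mainly for UI tasks and code review, and performance feels on par with Opus-4.7. I did not expect it to be this good. (And it is currently free.)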

Artificial Analysis @ArtificialAnlys · Jun 1 · 81

NVIDIA just announced the release of Nemotron 3 Ultra in Jensen Huang's Computex keynote: at 550B parameters (55B active), this is the largest Nemotron 3 model to date, and it is the most intelligent US open weights model We partnered with @nvidia to evaluate this model for intelligence and speed - these figures use the model’s BF16 weights, but as with Nemotron 3 Super the model will be made available in NVFP4 quantization as well for higher inference performance. ➤ New leader for US open weights intelligence: Nemotron 3 Ultra scores 48 on the Artificial Analysis Intelligence Index. This is well ahead of the next strongest US open weights models, Gemma 4 31B (39), Nemotron 3 Super (36) and gpt-oss-120b (33), but behind the Chinese-led open weights frontier (Kimi K2.6 at 54). ➤ Leading speed for its intelligence: on a pre-release @DeepInfra endpoint, Nemotron 3 Ultra served over 300 tokens per second. Peer models in its size class from China-based labs such as DeepSeek and Moonshot (Kimi) are generally served at speeds of 50-100 tokens per second in the market today. gpt-oss-120b is served at speeds similar to this level, but with significantly lower intelligence. ➤ Largest Nemotron 3 model so far: at approximately 550 billion total parameters and 90% sparsity, Nemotron 3 Ultra is significantly larger than its siblings and is the largest recent US open weights model release We’ll be sharing additional analysis and full benchmarks at release.

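Summary: NVIDIA announced Nemotron 3 Ultra at Computex, with 550B total parameters (55B active), the largest Nemotron 3 model to date. It is the most intelligent US open weights model, scoring 48 on the Artificial Analysis Intelligence Index, ahead of Gemma 4 31B (39) but still behind Moonshot (Kimi) K2.6 (54). On inference speed, it exceeded 300 tokens/s on a pre-release endpoint, well above the 50-100 tokens/s typical of Chinese models in its class. The model will be available with BF16 weights and in NVFP4 quantization for higher inference performance.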

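swyx @swyx · Jun 1 · 39

every evals/analytics startup is going through a onetime generational upgrade into a continual learning platform in 2026
many will fail but as always the tasteful ones win

Summary: In 2026 every evals/analytics startup is going through a one-time generational upgrade into a continual learning platform. Many will fail, but as always, the ones with taste will win.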

DogeDesigner @cb_doge · May 31 · 70

NEW: Grok Imagine Video 1.5 Preview just hit #1 in the Image-to-Video Benchmark on Video Arena. A massive +52 point jump over the previous Grok Imagine Video model, beating Seedance 2.0, HappyHorse, and Veo 3.1. xAI is moving fast. 🚀

Summary: Grok Imagine Video 1.5 Preview has just reached #1 on Video Arena's image-to-video benchmark, up 52 points on the previous Grok Imagine Video model and ahead of Seedance 2.0, HappyHorse, and Veo 3.1. xAI is moving fast. 🚀

Chubby♨️ @kimmonismus · May 31 · 59

Opus 4.8 is a solid jump over Opus 4.7 on DeepSWE, while also lowering the average cost per task. However, GPT-5.5 xhigh still beats it by a pretty clear margin while being cheaper. OpenAI has been cooking insanely hard with its models lately. Really excited to see what GPT-5.6 brings. That said, I have to admit: I’m starting to really like Opus 4.8 as well. We’ve entered a moment where both frontier labs keep shipping genuinely impressive models.

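Summary: Anthropic's Opus 4.8 shows a marked improvement over Opus 4.7 on the DeepSWE benchmark while lowering the average cost per task. Specifically, at the default high thinking effort (xhigh) setting, its score is 6% higher than Opus 4.7 xhigh. However, GPT-5.5 xhigh still leads on this test by a clear margin and at lower cost. The author is impressed by OpenAI's recent model releases and looks forward to GPT-5.6, while also starting to appreciate Opus 4.8, and sees this as a moment in which both frontier labs keep shipping genuinely impressive models.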

Rohan Paul @rohanpaul_ai · May 31 · 60

atomic[.]chat (a desktop app that runs LLMs locally) ran a very revealing comparison for local AI agents, on a MacBook Pro M5 Max, 64GB. Liquid’s much smaller LFM2.5-8B-A1B beat gpt-oss-20b by finishing every required tool call, cutting runtime by more than half, and using 4.8GB RAM instead of 11GB. The task was not normal chat, because the model had to plan a trip by calling outside tools for 3 weather checks, 2 currency conversions, 1 email, and 1 reminder. The striking part is that LFM2.5-8B-A1B is much smaller in active compute, yet it hit every required call at 266tok/s, while gpt-oss-20b used 11GB RAM, made only 3/7 calls, and ran at 146tok/s. Now, tool calling is a control problem before it is a language problem. The model has to preserve a checklist across context, decide when language should stop and action should begin, and resist the temptation to answer as if partial completion were enough. A smaller mixture-of-experts model with only a fraction of its parameters active can win if its training shaped those control habits more sharply than a larger model’s general fluency did.

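Summary: In a local test on a MacBook Pro M5 Max with 64GB, Liquid's LFM2.5-8B-A1B clearly outperformed OpenAI's gpt-oss-20b on a trip-planning task requiring 7 tool calls. LFM2.5-8B-A1B used only 4.8GB of memory and completed all 7/7 tool calls at 266tok/s in 6.9 seconds. By comparison, gpt-oss-20b consumed 11GB of memory, completed only 3/7 tool calls, and ran at 146tok/s, taking 15 seconds. This suggests that an MoE model with a smaller active parameter count (1B), given more precise training, can beat a larger model with roughly 2.5 times its active parameters on the agentic task of tool calling.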
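
A completion figure such as 7/7 or 3/7 can be computed by comparing the tool calls a model actually made with the required checklist (3 weather checks, 2 currency conversions, 1 email, 1 reminder). The sketch below is one plausible way to score that; the tool names and both call traces are hypothetical, since the post does not say which calls gpt-oss-20b skipped or how atomic.chat scored the run.

```python
from collections import Counter
from typing import Iterable

# Required calls for the trip-planning task; tool names are illustrative.
REQUIRED_CALLS = Counter(
    {"get_weather": 3, "convert_currency": 2, "send_email": 1, "set_reminder": 1}
)


def completed_calls(required: Counter, made: Iterable[str]) -> int:
    """Count required calls that were made, capping each tool at its required number."""
    made_counts = Counter(made)
    return sum(min(count, made_counts[tool]) for tool, count in required.items())


total_required = sum(REQUIRED_CALLS.values())

# Hypothetical traces: one complete run and one that stops early.
full_trace = ["get_weather"] * 3 + ["convert_currency"] * 2 + ["send_email", "set_reminder"]
partial_trace = ["get_weather", "get_weather", "convert_currency"]

full_score = completed_calls(REQUIRED_CALLS, full_trace)
partial_score = completed_calls(REQUIRED_CALLS, partial_trace)
print(f"{full_score}/{total_required}", f"{partial_score}/{total_required}")
```

Capping each tool at its required count means that repeating one call many times cannot make up for skipping another, which matches the checklist framing in the post.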

elvis @omarsar0 · May 31 · 55

The efficiency frontier! Where do you think GPT-5.6 will land?

Summary: The efficiency frontier! Where do you expect GPT-5.6 to land?