I think Epoch does a great job benchmarking, but I continue to believe that open weights models are much more fragile, especially out-of-distribution, than their benchmarks indicate. Vibe-wise, I don’t think they were only 3 months behind last year or only 4 months behind today.

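Summary: Using its composite Epoch Capabilities Index, Epoch AI finds that open-weight models trail closed models by about three months on average. The author of the main post is skeptical: he argues that open-weight LLMs are more fragile in practice than their benchmark scores suggest, especially on out-of-distribution tasks, and that the gap as actually experienced may be well over three or four months.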

Tibo @thsottiaux · May 30 · 24

Do you still trust benchmarks or do you just listen to your friends? What makes you try a new model?

Chubby♨️ @kimmonismus · May 30 · 56

According to research by EpochAI, open-weight models lag behind frontier closed-source models by four months. Four months. That's very little. And impressive at the same time.

MiniMax (official) @MiniMax_AI · May 30 · 43

MiniMax M2.7 + CyOps = the scorecard speaks for itself 💪

Berryxia.AI @berryxia · May 29 · 42

Opus 4.7 vs. Opus 4.8: going by feel, there is no dramatic difference between the two.

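StepFun @StepFun_ai · May 29 · 72

Excited to see Step 3.7 Flash available on @ModelScope2022 🚀 Can’t wait to see what builders create with it!

Summary: StepFun's multimodal model Step 3.7 Flash is now available on ModelScope. It uses a MoE architecture with 198B total parameters and 11B activated per token, reaches inference speeds of up to 400 tok/s, supports a 256K context window, and offers low, medium, and high reasoning levels to trade off speed against quality. It ranks first on ClawEval-1.1 (67.1) and second on SWE-bench Pro (56.3). The model is natively multimodal, pairing a language backbone with a vision encoder, and can parse dense UIs, charts, and financial reports. It is released under the Apache 2.0 license and is compatible with vLLM and other inference frameworks.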

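Berryxia.AI @berryxia · May 29 · 72

Folks! You can now try Claude Opus 4.8 for free on ZenMux! The first thing I did was run that hardcore "Three.js plane from pure primitives" test from a Hugging Face heavyweight (M): use only built-in geometries (Box, Cylinder, Cone, Sphere, ...), no model loaders of any kind, and hand-build a highly detailed Boeing 747-400. (See the video; the prompt is in the comments.)
Opus 4.8 went from the prompt to a complete, runnable HTML page (wings swept back ~35°, four engines mounted precisely on their pylons, the hump of the upper-deck cabin, an animated retractable landing gear, winglets, strobing navigation lights) in a single pass! The overall result is stunning: the proportions are absurdly rigorous, it is unmistakably a 747 from the front, side, top, and three-quarter views, and even the angle of the engine pylons is right!
Regulars know that ZenMux launches every new model on day one with ZeroDelay and a limited-time free quota. Anthropic's flagship has only just been released, and you can already call it through the API! The platform also bills itself as a "production-grade AI Gateway with compensation guarantees": unified access + routing + availability + compensation guarantees, making it the first choice for trying new models quickly.
The complex spatial reasoning and one-shot engineering code are beyond reproach, with almost no rework needed. Built for agents and long-horizon coding, it takes first place outright on SWE-bench, Terminal-Bench, Agentic Coding, and other leaderboards! Code and multimodal understanding comprehensively surpass the previous generation, and it nails complex 3D structure, physical proportions, and animation timing. It is fully compatible with mainstream API formats, so existing toolchains need no changes. Pay-as-you-go billing and a Builder plan are supported.
👇 Try it directly with the prompt in the comments:

Summary: Anthropic's flagship model Claude Opus 4.8 is now available to try for free on ZenMux. In a hands-on test, the model generated a runnable HTML page from the prompt in one pass, building a highly detailed Boeing 747-400 from Three.js built-in geometries alone, including swept wings, four engines, and retractable landing gear, with accurate proportions and impressive results. The post claims the model ranks first on SWE-bench, Terminal-Bench, Agentic Coding, and other leaderboards, with code and multimodal understanding markedly improved over the previous generation. ZenMux launches new models with ZeroDelay and offers a limited-time free quota.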

Ethan Mollick @emollick · May 29 · 56

Interesting that the GPT-5 Pro series models have consistently been the best models for single-shot attempts at the hardest problems since last summer. There has been no real competition in all that time.

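karminski-牙医 @karminski3 · May 29 · 62

Claude-Opus-4.8 hands-on! Is medium not good enough? Claude-Opus-4.8 was just released, so here is a quick hands-on test! This time I used a freshly polished test set: ray-trace a 3D scene with multiple light sources and multiple materials. As you can see, once the frame settles and denoising starts, the render looks quite good. One thing to note, though: the rolling light source should hit the wall vertically rather than horizontally, so I suspect this generation of Opus may have regressed in spatial understanding. The demo video uses xhigh. With medium, the test cannot be completed at all: the shader it writes is broken and crashes immediately. Detailed tests coming later, stay tuned! (I feel like I have a big backlog already; everything is being tested, and I'll try not to flake...) #claudeopus48 #opus48 #claude

Summary: Claude-Opus-4.8 has just been released, and the author tested it with a new test set that ray-traces a 3D scene with multiple lights and materials. At the xhigh setting, the denoised render looked good, but a light source that should have hit the wall vertically moved horizontally instead, suggesting a possible regression in spatial understanding. At the medium setting, the generated shader was faulty and the test failed outright. A detailed test report will follow.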

OpenRouter @OpenRouter · May 29 · 68

Don't rely on benchmarks; look at the full picture! Try our new Compare page, which also lets you visualize model performance: https://openrouter.ai/compare/openai/gpt-5.5/anthropic/claude-opus-4.7/anthropic/claude-opus-4.8

Rohan Paul @rohanpaul_ai · May 29 · 56

WallStreetPrep did a very practical AI benchmarking exercise for real-world finance. It tested financial modeling agents on a real analyst assignment, not a toy prompt with a neat answer key. The task was a serious analyst job: build Apple’s historical and forecast financial statements, cite sources, link assumptions, add schedules, and make the workbook auditable. Primer, an AI financial modeling tool, came out ahead in this test, but the more useful point is why: its output looked less like a spreadsheet patched together cell by cell and more like a connected financial system that could be audited. Primer treats Excel as the final output format, not the agent’s working language, so the AI can build a stronger 3-statement financial model first and then convert it into an auditable spreadsheet. Primer represents the workbook as structured records such as revenue, cost of sales, cash, debt, assumptions, formulas, source links, comments, and dependency checks. That means the AI can query and validate the finance logic directly, for example “show me every formula feeding cash flow” or “find balance sheet plugs,” instead of visually navigating Excel and editing fragile cell references one by one. This is what I am seeing in many areas, that professional AI agents will be judged less by chat quality and more by whether their artifacts survive audit

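Summary: The test evaluated AI financial-modeling agents on a real analyst assignment: building Apple's historical and forecast financial statements. Primer stood out mainly because it produced a connected, auditable financial system rather than a spreadsheet patched together cell by cell. Primer treats Excel as the final output format: it represents the workbook as structured records (revenue, costs, assumptions, formulas, source links, and so on), builds the full 3-statement model on them, and only then converts it into an auditable spreadsheet, so the AI can query and validate the finance logic directly. The takeaway is that professional AI agents will be judged increasingly by whether their artifacts survive audit.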
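
To make the "query the finance logic directly" idea concrete, here is a minimal Python sketch of structured records with explicit dependencies. The `Record` type, the `feeding` helper, the field names, and the line items are all hypothetical illustrations; the post does not describe Primer's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One line item in a hypothetical structured financial model."""
    name: str
    formula: str = ""  # human-readable formula; empty for raw inputs
    inputs: list[str] = field(default_factory=list)  # records this one depends on


def feeding(records: dict[str, Record], target: str) -> list[str]:
    """Return every record that directly or indirectly feeds `target`."""
    seen: set[str] = set()
    stack = list(records[target].inputs)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Names missing from the model are treated as raw inputs.
        if name in records:
            stack.extend(records[name].inputs)
    return sorted(seen)


# Hypothetical toy model: every name and formula here is made up for illustration.
model = {
    "revenue": Record("revenue"),
    "cost_of_sales": Record("cost_of_sales"),
    "depreciation": Record("depreciation"),
    "gross_profit": Record(
        "gross_profit", "revenue - cost_of_sales", ["revenue", "cost_of_sales"]
    ),
    "net_income": Record(
        "net_income", "gross_profit - depreciation", ["gross_profit", "depreciation"]
    ),
    "operating_cash_flow": Record(
        "operating_cash_flow", "net_income + depreciation", ["net_income", "depreciation"]
    ),
}

# "Show me every formula feeding cash flow" becomes a graph traversal.
print(feeding(model, "operating_cash_flow"))
```

With the dependencies stored as data, the audit question is answered by walking the graph rather than by visually tracing cell references in a sheet.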

🚨 AI News | TestingCatalog @testingcatalog · May 29 · 69

ANTHROPIC 🔥: Claude Opus 4.8 achieves 69.2% score on SWE Bench Pro against 64.3% for Opus 4.7. Benchmarks 👀

Artificial Analysis @ArtificialAnlys · May 29 · 80

Anthropic just launched Claude Opus 4.8, and it is the new leader on our GDPval-AA benchmark for agentic real-world work tasks.
Opus 4.8 scored 1890 on GDPval-AA at launch with its 'max' effort setting, +137 points from Opus 4.7 and +121 points ahead of the next-best model, GPT-5.5 xhigh. Compared head-to-head on the GDPval task set, this implies a ~67% win rate against GPT-5.5 xhigh.
@AnthropicAI shared access with us ahead of the public release to benchmark this model and we’re glad to see our benchmarks referenced in today’s launch.
The rest of the Artificial Analysis Intelligence Index is in progress - we’ll share final results soon!

Summary: Anthropic has officially released Claude Opus 4.8. On Artificial Analysis's GDPval-AA benchmark, which focuses on agentic real-world work tasks, the model scored 1890 at the 'max' effort setting. That is 137 points above its predecessor Opus 4.7 and 121 points ahead of the next-best model, GPT-5.5 xhigh, which implies a win rate of about 67% for Opus 4.8 in a head-to-head comparison. Anthropic gave Artificial Analysis early access for evaluation before the public release.
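
The step from a 121-point lead to a ~67% win rate can be checked with the standard Elo expected-score formula. The sketch below assumes GDPval-AA uses the conventional logistic curve with a 400-point scale; the post does not state this, although the quoted figure is consistent with it.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win rate of the higher-rated side under the standard Elo curve.

    The 400-point scale is the usual Elo convention and is an assumption here.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gap quoted in the post: Opus 4.8 (max) is +121 points ahead of GPT-5.5 xhigh.
win_rate = elo_expected_score(121)
print(f"{win_rate:.1%}")  # prints 66.7%
```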

Artificial Analysis @ArtificialAnlys · May 28 · 70

Announcing AA-WER Streaming, our new benchmark measuring streaming Speech to Text models on accuracy and latency for voice agent use cases. Pareto optimal models on this new benchmark include those from Cartesia, ElevenLabs, and Deepgram.
Streaming Speech to Text (STT) powers real-time transcription in voice agents and live captioning, where models must balance accuracy against speed. Fast transcripts are especially important for keeping responses feeling natural, and they leave more of the response-time budget for reasoning and tool calls. Accuracy also matters since transcription errors compound in downstream reasoning and speech generation. Streaming STT models transcribe audio as it is fed in, sharing outputs continuously, unlike offline (batch) models that process the entire file at once and are typically slower.
What we measure: AA-WER Streaming reports Word Error Rate and latency together, measured from the moment end of speech is detected, with a Pareto line of increasing accuracy as time to transcript received increases. For direct comparability to offline models on accuracy, we test these streaming models on the same ~8 hours of audio as our offline benchmark, AA-WER v2.0: AA-AgentTalk, Earnings22-Cleaned-AA, VoxPopuli-Cleaned-AA.
We measure WER and latency as paired metrics at two points after Silero VAD-detected end of speech:
➤ First Final Transcription: WER is measured on the first final-denoted transcript returned after end of speech is detected. Latency is the time in seconds from end of speech to that final-denoted transcript. This is more useful for understanding performance as a standalone streaming transcription model, and for higher accuracy.
➤ First Partial Transcription: WER is measured on the first transcript-bearing event (partial or final) returned after end of speech is detected. Latency is the time in seconds from end of speech to that first transcript event. This is more useful for near instantaneous transcription for lower-accuracy tasks like responding to "yes" or "no" questions, or for speculative decoding.
Key results:
➤ Highest accuracy on Final after End of Speech: @Cartesia Ink-2 (semantic endpoints) at 3.59% WER, 0.21s latency, followed by ElevenLabs Scribe v2 Realtime (3.64%, 0.14s) and Cartesia Ink-2 (external endpoints) (3.66%, 0.09s)
➤ Highest accuracy on First Partial after End of Speech: @ElevenLabs Scribe v2 Realtime at 3.65% WER, 0.13s latency, followed by Cartesia Ink-2 (external endpoints) (4.33%, 0.07s) and @AssemblyAI U3 Realtime Pro (4.46%, 0.47s)
➤ Fastest transcription: @DeepgramAI Flux leads both Final and Partial at 0.020s and 0.019s respectively (both 7.36% WER). On Final, it's followed by @soniox_ai Realtime and Deepgram Nova-3 Realtime (both 0.06s); on First Partial, it’s followed by @NVIDIA Nemotron 3 ASR 80ms (0.04s) and Soniox Realtime (0.05s)
Charts below include a Pareto frontier of accuracy vs. speed, so you can shortlist the models that best fit your latency constraints while still achieving high accuracy. See below for further detail ⬇️

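Summary: AA-WER Streaming is a new benchmark that measures the accuracy and latency of streaming speech-to-text models in voice-agent scenarios. It is based on about 8 hours of audio and reports Word Error Rate together with latency. Key results: Cartesia Ink-2 (semantic endpoints) has the highest accuracy on final transcription (3.59% WER, 0.21 s latency); ElevenLabs Scribe v2 Realtime has the highest accuracy on first partial transcription (3.65% WER, 0.13 s latency); and Deepgram Flux leads on speed, with latencies of 0.020 s and 0.019 s for final and first partial transcription respectively.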
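
As a rough illustration of the two ideas in this post, the sketch below computes Word Error Rate as a word-level edit distance and then picks the accuracy/latency Pareto frontier from a list of results. It is a minimal sketch, not Artificial Analysis's pipeline: text normalization is omitted, the helper names are made up, only the four "Final" results quoted above are included, and the last entry is a hypothetical model added to show a dominated point.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance (sub + del + ins) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def pareto_frontier(models):
    """Keep models that no other model beats on both latency and WER.

    `models` is a list of (name, latency_seconds, wer_percent) tuples.
    """
    frontier, best_wer = [], float("inf")
    for name, latency_s, wer in sorted(models, key=lambda m: (m[1], m[2])):
        if wer < best_wer:  # strictly more accurate than every faster model
            frontier.append(name)
            best_wer = wer
    return frontier


final_results = [
    ("Cartesia Ink-2 (semantic endpoints)", 0.21, 3.59),
    ("ElevenLabs Scribe v2 Realtime", 0.14, 3.64),
    ("Cartesia Ink-2 (external endpoints)", 0.09, 3.66),
    ("Deepgram Flux", 0.020, 7.36),
    ("hypothetical-model", 0.30, 5.00),  # made up: slower and less accurate
]

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(pareto_frontier(final_results))
```

Among these five entries, the hypothetical model is the only one left off the frontier, because Cartesia Ink-2 (semantic endpoints) is both faster and more accurate.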

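Berryxia.AI @berryxia · May 28 · 73

Qwen's newly released Qwen-Image-Bench moves T2I evaluation straight from "generation" to "creation": with 56 fine-grained facets plus Q-Judger, a judge with ρ=0.92 human alignment, OpenAI, Gemini, Grok, and Flux all get re-ranked! While everyone else is still grinding on prompt alignment, Qwen shows that real-world fidelity and creative generation are where the real gaps lie. The new benchmark has 1,000 prompts and 56 rubrics, offers interpretable diagnostics, and makes the gaps between existing SOTA models plain to see. So what practical value does it have for us, and how do you actually use it? (Worth bookmarking.)
1. Developers/researchers: run your own T2I pipeline (whether Qwen's own models, GPT-4o image generation, Gemini's Imagen series, Grok's Flux integration, or open-source SD3) through the benchmark. Focus on the scores for the two pillars, Real-world Fidelity and Creative Generation, to see where the real gaps are.
2. Prompt engineers: when writing complex creative prompts, use Q-Judger to self-check how the outputs do on the 56 facets and iterate quickly instead of judging by eye.
3. Enterprises/product teams: when choosing a T2I vendor or building image generation in-house, treat Qwen-Image-Bench as the new yardstick. Stop looking only at basic scores like "prompt alignment" and look directly at creativity and fidelity scores, which are closer to real commercial scenarios.
4. Comparison experiments: the paper has already shown that it separates leading models far better than older benchmarks. Want to verify whether your model has improved? Run a before/after comparison on it and let the data speak.
Qwen's play here is clear: it is not only competing on models but also pushing the evaluation standard a big step forward. Just as people only knew how to scale parameters after the scaling laws came out, Qwen-Image-Bench establishes the "from generation to creation" evaluation framework.

Summary: Alibaba's Qwen team has released Qwen-Image-Bench, a new text-to-image (T2I) benchmark. It comprises 56 fine-grained evaluation facets and comes with Q-Judger, a judge model whose agreement with humans reaches ρ=0.92. Its core idea is to move T2I evaluation beyond basic "prompt alignment" toward two pillars, "real-world fidelity" and "creative generation", and its 1,000 test prompts separate existing SOTA models more clearly. The benchmark gives developers, prompt engineers, and enterprises an evaluation framework closer to real creative needs.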

Alibaba Cloud @alibaba_cloud · May 28 · 62

📢Qwen3.7-Max just hit #3 on ITbench-AA — a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style. 🔧Agentic era, go with Qwen.🏃🏃 API: https://int.alibabacloud.com/m/1000413314/

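Summary: The Qwen team announced that its Qwen3.7-Max model ranks third on the new ITBench-AA benchmark. Launched jointly by Artificial Analysis and IBM Research, the benchmark evaluates how well models solve real enterprise IT tasks and currently focuses on Site Reliability Engineering (SRE). It contains 59 Kubernetes fault-diagnosis tasks. Claude Opus 4.7 ranks first with 47%, GPT-5.5 (xhigh) follows closely with 46%, and Qwen3.7-Max is third with 42%. All frontier models score below 50%, which shows how challenging the benchmark is.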

Artificial Analysis @ArtificialAnlys · May 28 · 62

Overview of our recent launch of Coding Agent benchmarks on Artificial Analysis and our first YouTube video! We walk through the performance, cost, token usage and speed differences across different coding agents. This includes looking at the leading performance of Opus 4.7 in Claude Code and Composer 2.5's strong positioning on the Coding Agent Index / Cost Pareto frontier. We have also launched our YouTube channel! Come say hi and subscribe: https://www.youtube.com/@ArtificialAnalysisAI

Alibaba Cloud @alibaba_cloud · May 28 · 59

📢Qwen3.7-Max just hit #3 on ITbench-AA — a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style. 🔧Agentic era, go with Qwen.🏃🏃

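Summary: ITBench-AA, launched jointly by Artificial Analysis and IBM Research, is the first in a new series of benchmarks evaluating how models handle real enterprise IT tasks, and it focuses on Site Reliability Engineering (SRE) tasks. Qwen3.7-Max ranks third with 42%. All frontier models score below 50%: Claude Opus 4.7 leads with 47%, followed closely by GPT-5.5 (xhigh) with 46%. Among open-weight models, GLM-5.1 (Reasoning) leads with 40%. The benchmark will later expand to tasks such as Financial Operations (FinOps).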

Qwen @Alibaba_Qwen · May 28 · 60

📢Qwen3.7-Max just hit #3 on ITbench-AA — a fresh benchmark testing how well models handle real-world enterprise IT tasks, agentic-style. 🔧Agentic era, go with Qwen.🏃🏃

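Summary: Artificial Analysis and IBM Research have jointly launched ITBench-AA, a benchmark that evaluates AI agents on enterprise IT operations tasks. The first release focuses on Site Reliability Engineering (SRE) and contains 59 Kubernetes incident-response tasks. Within a limited number of turns, the model must diagnose the root-cause entities behind each incident by analyzing logs, tracing dependencies, and so on. The benchmark runs on the Stirrup harness and scores with "average precision at full recall". Key findings: Claude Opus 4.7 leads with 47%, GPT-5.5 scores 46%, and Qwen3.7 Max is third with 42%. All frontier models score below 50%, which shows the benchmark is highly challenging. Among open-weight models, GLM-5.1 (Reasoning) leads with 40%.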

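Tibo @thsottiaux · May 28 · 63

Excited to see more independent benchmarks like that which are not contaminated (trained on by major models).

Summary: Results from the newly released independent benchmark DeepSWE track developers' day-to-day experience more closely. On its coding tasks, GPT-5.5 scores 70% while Claude Sonnet scores 32%, a marked gap. DeepSWE focuses on a core ability of AI agents in real workflows: given only a short prompt, can the agent find the right place in the codebase and make the change cleanly, without the user listing specific files? The original post says this confirms what many developers have long observed, and it criticizes SWE-Bench for often failing to reflect real capability because of dataset contamination and weak verification.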

Artificial Analysis @ArtificialAnlys · May 28 · 37

We're excited to work with Harvey to launch the full leaderboard for Legal Agent Benchmark - coming soon to Artificial Analysis!

Rohan Paul @rohanpaul_ai · May 28 · 60

Datacurve launches DeepSWE, a tougher coding benchmark made to show where leading models truly separate. GPT-5.5 hits 70%, while GPT-5.4 reaches 56% and Claude Opus 4.7 reaches 54%, making a gap that older benchmarks largely hid. It's a long-horizon software engineering benchmark.
- DeepSWE differs from older coding benchmarks in the source of the exam: older tests often reuse public GitHub issues and PRs, while DeepSWE uses original tasks, so models are less likely to have seen the answer during training.
- The work is also bigger even when the prompt is shorter, because older tests often tell the model what area to touch, while DeepSWE makes the agent search the repo, understand the design, edit multiple files, and avoid breaking old behavior. On DeepSWE, prompts are half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
- The grading is different too, because many older benchmarks reuse tests from one merged PR, while DeepSWE checks whether the requested behavior actually works, even if the model solves it in a different valid way.

Summary: Datacurve has released DeepSWE, a new coding benchmark designed to expose real capability gaps between models on long-horizon software engineering tasks. GPT-5.5 scores 70%, against 56% for GPT-5.4 and 54% for Claude Opus 4.7, highlighting marked differences between models. Unlike older benchmarks, DeepSWE uses original tasks and requires the agent to search the repository on its own, understand the design, and edit multiple files. Its solutions require 5.5x more code and about 2x more output tokens than SWE-bench Pro's, reflecting the challenges of developers' everyday work.

SemiAnalysis @SemiAnalysis_ · May 28 · 36

there's a really important lesson here, but some of yall aren't ready for that conversation yet

Artificial Analysis @ArtificialAnlys · May 28 · 71

Artificial Analysis and IBM Research are launching ITBench-AA, the first in a new series of benchmarks evaluating models on agentic enterprise IT tasks, starting with Site Reliability Engineering tasks where frontier models score below 50%.
ITBench-AA’s SRE tasks benchmark model performance on Kubernetes incident response, where models must diagnose live systems by reading logs, tracing dependencies, and identifying root-cause entities across complex infrastructure. The underlying ITBench dataset has been developed by @IBM's Software Innovation Lab, leveraging IBM’s deep expertise in enterprise IT operations.
Artificial Analysis has worked closely with IBM over the last 6 months to develop an implementation of the dataset for frontier AI evaluation, beginning with Site Reliability Engineering (SRE) and expanding to Financial Operations (FinOps) and Chief Information Security Officer (CISO) tasks over time.
ITBench-AA SRE overview:
➤ 59 SRE tasks in total: 40 public tasks and 19 brand new, held-out tasks
➤ Each task provides a Kubernetes incident snapshot containing alerts, events, traces, metrics, logs, and application topology. The model must identify the minimal set of independent root-cause Kubernetes entities responsible for the incident
➤ Faults span typical SRE failure modes including infrastructure, service, application, and chaos-injected incidents, such as resource quota exhaustion, rollout failures, connection pool exhaustion, and network partitions
Methodology details:
➤ Agentic harness: each task is solved by the model running in our open-source Stirrup reference harness, with shell access to a sandboxed file system containing the relevant logs and snapshots. 100-turn cap per task, 3 repeats per task
➤ Models submit a list of root-cause entities (Kubernetes Deployments, Services, Pods, etc.) they believe caused the incident. Each submission is compared against a ground-truth set of root causes provided by IBM Research
➤ Scoring uses average precision at full recall: if a model misses any of the ground-truth root causes, it scores 0.0 for that repeat. If it identifies all of them, it is awarded a score equal to its precision - the share of its submitted entities that are actual root causes, i.e. true positives / (true positives + false positives). The headline score is the average across 59 tasks × 3 repeats.
➤ The harness (Stirrup) is held constant across all evaluated models, allowing an apples-to-apples comparison between models.
Key findings:
➤ Claude Opus 4.7 (Adaptive Reasoning, Max Effort) leads at 47%, followed by GPT-5.5 (xhigh) at 46% and Qwen3.7 Max at 42%
➤ All frontier models score below 50%, making ITBench-AA SRE one of the least saturated agentic benchmarks in our suite. For context, frontier models score considerably higher on Terminal-Bench
➤ Turn counts vary nearly 3x and longer trajectories do not translate to higher accuracy. GPT-5.5 (xhigh) averages 31 turns per task at 46%, while Gemini 3.1 Pro Preview averages 83 turns at 30%. Models that over-investigate tend to surface upstream fault-injection mechanisms or co-occurring symptoms as false positives
➤ GLM-5.1 (Reasoning) leads open weights models at 40%, effectively tied with Gemini 3.5 Flash (high). DeepSeek V4 Pro (Reasoning, Max Effort) follows at 38%, with Gemma 4 31B (Reasoning) at 37%, ahead of Gemini 3.1 Pro Preview at 30%

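Summary: Artificial Analysis and IBM Research have jointly launched ITBench-AA, the first in a series of benchmarks evaluating AI agents on enterprise IT tasks, starting with Site Reliability Engineering (SRE). It contains 59 Kubernetes incident-response tasks, and every frontier model scores below 50%. Claude Opus 4.7 leads with 47%, GPT-5.5 scores 46%, and Qwen3.7 Max scores 42%. Among open-weight models, Zhipu's GLM-5.1 (Reasoning) scores 40%, effectively tied with Gemini 3.5 Flash, and DeepSeek V4 Pro scores 38%. The analysis also finds that turn counts vary nearly 3x across models, but longer trajectories do not translate to higher accuracy.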
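
The "average precision at full recall" rule described in the post can be sketched in a few lines of Python. This is an illustration of the rule as stated, not the Stirrup harness's code; the function names and the Kubernetes entity names are hypothetical.

```python
def score_repeat(submitted: set[str], ground_truth: set[str]) -> float:
    """Score one repeat: 0.0 unless every true root cause was submitted,
    otherwise precision = true positives / (true positives + false positives)."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one root cause")
    if not ground_truth <= submitted:  # any missed root cause zeroes the repeat
        return 0.0
    true_positives = len(submitted & ground_truth)
    return true_positives / len(submitted)


def headline_score(repeats: list[tuple[set[str], set[str]]]) -> float:
    """Average the per-repeat scores (the post averages over 59 tasks x 3 repeats)."""
    return sum(score_repeat(sub, truth) for sub, truth in repeats) / len(repeats)


# Hypothetical entities, made up for illustration.
truth = {"deployment/checkout", "service/payments"}

exact = score_repeat({"deployment/checkout", "service/payments"}, truth)
over_investigated = score_repeat(
    {"deployment/checkout", "service/payments", "pod/cart-7f9", "service/frontend"},
    truth,
)
missed_one = score_repeat({"deployment/checkout", "pod/cart-7f9"}, truth)

print(exact, over_investigated, missed_one)  # 1.0 0.5 0.0
```

The middle case shows why over-investigating hurts under this rule: listing extra entities keeps recall at 100% but halves precision.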

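Berryxia.AI @berryxia · May 27 · 61

Tencent has a new benchmark called Chronicles-OCR. Built by Tencent's HY lab together with four other institutions, it specifically tests how well AI recognizes 3,000 years of ancient Chinese scripts. It has 2,800 expert-annotated images covering seven categories: oracle bone script, bronze inscriptions, seal script, clerical script, regular script, running script, and cursive script. The result: all 28 frontier multimodal models were wiped out. The strongest VLM reached only 14% accuracy on oracle bone script, and the best H-mean for end-to-end detection was just 16.5%. GPT-5 and Gemini 2.5 Pro scored close to 0. More counterintuitively, turning on reasoning mode made performance worse: when perception fails, chain-of-thought amplifies hallucination. The models are not actually reading the characters at all; they are recognizing the medium. Script classification accuracy reaches 96.7%, but that comes from seeing carriers such as turtle shells and bronze vessels, not from understanding the characters on them. When it comes to the value held in this intangible cultural heritage, AI has barely scratched the surface.

Summary: Tencent's HY lab and four institutions have released Chronicles-OCR, a benchmark that tests AI recognition of ancient Chinese scripts. It contains 2,800 expert-annotated images covering seven categories, including oracle bone script and bronze inscriptions. All 28 frontier multimodal models tested performed poorly: the strongest VLM reached only 14% accuracy on oracle bone script, and GPT-5 and Gemini 2.5 Pro scored near zero. Notably, enabling reasoning mode hurt performance, because the models are really recognizing carriers such as turtle shells and bronze vessels (96.7% accuracy) rather than the characters themselves.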
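
For readers unfamiliar with the H-mean figure: in text-detection evaluation it conventionally denotes the harmonic mean of precision and recall (the same quantity as F1). The sketch below assumes that convention; the post does not give the benchmark's matching rules, and the input values are hypothetical.

```python
def h_mean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, as commonly used in text detection."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical inputs: a detector that is right 30% of the time it fires
# but finds only 10% of the characters.
print(f"{h_mean(0.30, 0.10):.1%}")  # prints 15.0%
```

Because the harmonic mean is pulled toward the smaller of the two values, a low H-mean indicates that precision, recall, or both are poor.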

Chubby♨️ @kimmonismus · May 27 · 58

Phoronix just published one of the first public benchmarks of NVIDIA's Vera CPU. I went through the full 11-page review this morning and the results are genuinely impressive. For those who don't follow server hardware: Vera is NVIDIA's new ARM-based data center processor with 88 custom-designed Olympus cores. The idea is straightforward. Agentic AI doesn't just need powerful GPUs. It needs CPUs that can keep up with code execution, tool calls, orchestration and data pipelines, all running concurrently at scale. The numbers are strong. Vera compiled a default Linux kernel in 20 seconds, the fastest result in Phoronix's tested field. Across all tested workloads, it delivered about 1.55x the performance of Intel's Xeon 6980P. Against AMD's EPYC 9575F, it came out about 10% ahead on a geometric mean basis. The memory story might be even more interesting. Vera uses LPDDR5X with up to 1.2 TB/s of bandwidth and delivers more than 4x the memory bandwidth per core compared to traditional x86 server CPUs. In the STREAM TRIAD benchmark, it sustained 90% of its rated peak bandwidth, the highest ratio Phoronix has measured on any CPU. If you're running agentic workloads with dozens of parallel processes and concurrent data queries, that kind of consistent memory performance matters more than core count on a spec sheet. Compared to NVIDIA's own Grace CPU, Vera is 1.63x faster in the geometric mean. That is an unusually large generation-over-generation jump for a CPU. Michael Larabel, who founded Phoronix and has been benchmarking Linux hardware for over two decades, said he's never seen any ARM processor compete with Intel and AMD at this level. I was at GTC in March when Jensen announced Vera. The thesis that agentic AI creates entirely new CPU demand made sense to me then. These benchmarks are the first real numbers behind that thesis. And they deliver. Vera ships to partners in H2 2026. The server CPU market just got a whole lot more interesting. 
Full 11-page review on Phoronix. Worth your time, all sources below.

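Summary: Phoronix has published one of the first public benchmarks of NVIDIA's Vera CPU. This ARM-based data center processor has 88 Olympus cores and is designed for the code execution, tool calls, and data pipelines that agentic AI requires. In the tests, Vera compiled the Linux kernel in 20 seconds, the fastest result in the tested field. Overall it delivered about 1.55x the performance of Intel's Xeon 6980P and came out about 10% ahead of AMD's EPYC 9575F on a geometric mean basis. On memory, Vera uses LPDDR5X with up to 1.2 TB/s of bandwidth, offers more than 4x the memory bandwidth per core of traditional x86 CPUs, and sustained 90% of its rated peak bandwidth in the STREAM TRIAD benchmark. Compared with the previous-generation Grace CPU, Vera is 1.63x faster in the geometric mean. The processor is expected to ship to partners in H2 2026.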
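
Several of the figures above are geometric means of per-benchmark speedups. The minimal sketch below shows how such a figure is computed and why it is preferred over an arithmetic mean for ratios. The per-benchmark ratios are hypothetical and are not taken from the Phoronix review.

```python
import math


def geometric_mean(ratios: list[float]) -> float:
    """Geometric mean of positive ratios: exp of the average log-ratio."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical speedups of chip A over chip B on three workloads:
# twice as fast, half as fast, and equal.
speedups = [2.0, 0.5, 1.0]

print(round(geometric_mean(speedups), 3))        # 1.0: the win and the loss cancel
print(round(sum(speedups) / len(speedups), 3))   # 1.167: arithmetic mean overstates A
```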

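AYi @AYi_AInotes · May 27 · 62

Damn, the new DeepSWE benchmark made one thing click for me: earlier top models may not have been as strong as we thought 🤔 I also get the feeling that AI coding evaluation just got something truly brutal, and that the old benchmarks may have been measuring the wrong thing all along. On SWE-Bench, the top models' scores were all bunched between 54% and 64%, looking roughly even. DeepSWE is different: it doesn't just test whether you can change one line of code, it makes you do real work: find the files, reproduce the bug, verify the fix, and handle edge cases. @theo says it is the first benchmark that has matched his day-to-day coding experience. Measured this way, the gaps blow wide open: GPT-5.5 is at 70%, Claude Opus at 54%, and the rest are cut roughly in half. The most brutal part is not even the score gap: they ran it with a very simple mini-swe-agent, and the results were about the same as with the official tools that the big labs spent ages tuning. That means a lot of good scores came from prompt engineering rather than from model strength. DeepSWE gives you no prep time and just starts, so the gaps show up immediately. Before, everyone stood in a row with their makeup on; now someone has pulled back the curtain and walked straight into the bathroom 🤣
So my own takeaways are:
1. To judge a model's real coding ability from now on, pay more attention to long-task benchmarks like this one and less to quick score-chasing leaderboards.
2. When choosing a dev tool, ignore the score on its homepage; throw a real bug at it and count it only if the fix actually runs.
Now that this new benchmark is holding up a mirror that shows what is really there, the score-chasers are going to lose some sleep, haha.

Summary: The new DeepSWE benchmark simulates real long-horizon programming tasks such as locating files, reproducing bugs, and verifying fixes, challenging the limitations of older benchmarks. Scores that were bunched together for top models on SWE-Bench are pulled apart by the new benchmark: GPT-5.5 reaches 70%, while Claude Opus is at 54%. A simple mini-swe-agent achieved results comparable to complex custom tooling, suggesting that many high scores may stem from prompt engineering. @theo commented that it is the first benchmark to match his real coding experience.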

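karminski-牙医 @karminski3 · May 27 · 56

Qwen3.7-max's coding ability is quite good this time: in Code Arena (an LMArena evaluation), it scores just behind a few Anthropic models, so I rushed to test it. I had Qwen3.7-max write a disk recovery tool in Rust. It worked very well in practice: from start to finish I never hit the kind of stuck-at-compile problems I used to see, and it uses Rust's syntax and features fluently. I designed the tool with three layers. The first layer scans directly for deleted files and can reach a 100% recovery rate. The second layer is a quick-format carve mode: if only a quick format was performed, there is still a chance of getting the files back quickly. The third layer does a full-disk scan and rebuilds the index; when file names are lost, it uses Qwen3.7-max to reconstruct them from the contents, and it even attempts AI reconstruction of file contents (marked as AI-reconstructed). It currently runs very smoothly, and the video demo uses this very tool written by Qwen3.7-max. A full performance test of Qwen3.7-max is coming later, stay tuned! #qwen #阿里千问 #qwen37max #AIAgent

Summary: In testing, Qwen3.7-max's coding score on Code Arena is second only to Anthropic's models. The author used it to build a disk recovery tool in Rust, which ran smoothly in practice. The tool has three recovery layers and uses the model to reconstruct file names and contents.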

歸藏(guizang.ai) @op7418 · May 27 · 65

Qwen 3.7 Max ranks fourth on Arena Coding Agent.

Qwen @Alibaba_Qwen · May 27 · 68

🚀🚀 Qwen3.7-Max just hit #4 on Code Arena, on par with Claude Opus 4.6, top-ranked Chinese lab on the board! @arena More to ship. Stay tuned. 🕶️

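Berryxia.AI @berryxia · May 27 · 72

On iPhone you can just download it from the App Store: 👉 Bonsai Studio, PrismML's official iOS app. It is free to install and the model runs locally on the phone. I think it works fine for teachers preparing display materials or for kindergarten teaching. There are no extra token costs, and it supports quite a few styles. Chinese text still comes out garbled, but it quickly grasps the mood you are after (image 2).
Technical background: Bonsai Image 4B is based on FLUX.2 Klein, with the model weights compressed to 1-bit/3-bit, shrinking the size from 7.75GB to 0.93GB. On iPhone, generating a 512×512 image takes about 1.5GB of memory and 1024×1024 about 2GB, so an iPhone 15 Pro or later is fine. Inference is fully local and runs without a network connection! There is no official Android app yet, so Android has to use the WebGPU web version. I tested it on an iPhone 17 Pro Max, and a 512×512 image came out in under a few tens of seconds. In the browser you need to download a model of around 1.8GB to play with it. Link in the comments 👇🏻

Summary: PrismML has released its official iOS app, Bonsai Studio, which is free to download and runs the Bonsai Image 4B diffusion model locally and offline on iPhone. The model is based on FLUX.2 Klein; its 1-bit compressed version is only 0.93GB, 8.3x smaller than the full-precision version. On iPhone 15 Pro and later, generating a 512×512 image takes a few tens of seconds and about 1.5GB of memory. The app supports multiple styles, but generated Chinese text is currently garbled. Android users can try the WebGPU web version.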

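meng shao @shao__meng · May 27 · 63

After two straight months of using Codex and Claude Code side by side for several hours a day, @AlexFinn decided to switch to Codex. Why? In Alex's judgment, the key variable is no longer model intelligence or code generation speed; the self-testing loop matters more. Codex verifies every change in its built-in browser, forming an automated "change → test → fix" loop. With Codex's self-testing loop, the share of changes that ship with a bug on first delivery went from 40% to ≤3%, a clear gain in reliability that makes it easier to get into a flow state. My addition: besides the built-in browser, Codex can also be paired with Computer Use and a Chrome extension for automated website verification testing.

Summary: After two straight months of using Codex and Claude Code in parallel for several hours a day, developer AlexFinn decided to switch to Codex. The main reason is Codex's strong self-testing loop: after each code change it verifies the result automatically in its built-in browser, forming an automated "change → test → fix" loop. This cut the share of changes delivered with a bug on the first try from about 40% to ≤3%, greatly improving reliability and making it easier for developers to stay in flow. He advises developers not to be loyal to any company and to always use the best tool available at the moment.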

Artificial Analysis @ArtificialAnlys · May 27 · 60

Gemini 3.5 Flash is a step forward for Google on speed and agentic capabilities but comes at a trade-off of being higher cost than prior models.
We have measured up to ~280 output tokens/sec, placing it on the speed/intelligence Pareto frontier and well ahead of Gemini 3 Flash. It also shows a major uplift on agentic tasks, reaching ~1650 ELO on GDPVal-AA. The trade-off: cost is up ~5x versus Gemini 3 Flash, driven by higher token prices (3x higher than Gemini 3 Flash) and higher token usage.
In this video, Declan Jackson, Member of Technical Staff at Artificial Analysis, breaks it down.

Summary: Gemini 3.5 Flash is a step forward in speed and agentic capabilities: measured output speed reaches about 280 output tokens/sec, and its ELO on GDPVal-AA agentic tasks rises to about 1650, a marked improvement over Gemini 3 Flash. The trade-off is a cost increase of about 5x, driven mainly by higher token prices (3x those of Gemini 3 Flash) and higher token usage.
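
The two cost drivers quoted above can be separated with simple arithmetic. The sketch below assumes that total cost is simply price per token times tokens used, ignoring any difference between input and output pricing; under that assumption, a ~5x cost increase with 3x token prices implies roughly 1.7x token usage. The function name is made up for illustration.

```python
def implied_usage_multiplier(cost_multiplier: float, price_multiplier: float) -> float:
    """If cost = price x tokens, then the tokens multiplier = cost / price."""
    return cost_multiplier / price_multiplier


# Figures quoted in the post: cost up ~5x, token prices 3x higher.
usage = implied_usage_multiplier(cost_multiplier=5.0, price_multiplier=3.0)
print(f"~{usage:.2f}x token usage")  # prints ~1.67x token usage
```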

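meng shao @shao__meng · May 26 · 53

I've uninstalled Marvis. It turned out that, apart from its fun little animations, its agent capabilities and output are poor. What's scarier... during initialization after install it kept asking for all sorts of permissions, and since I didn't know whether refusing would break the agent, I approved them all. By the end I discovered that the thing had actually obtained my app list and a full inventory of my files (and had even, damn it, thoughtfully categorized them for me). Was Marvis built on Tencent PC Manager's code repository? Or did that team simply move over? Exposing all my apps and files to Tencent is frightening to think about, so I uninstalled it right away. However capable it gets, I won't dare touch it.

Summary: A user has uninstalled Marvis, Tencent's AI agent product. The main problems: 1) high privacy risk, as it requests excessive permissions during initialization and obtained the user's full app list and file inventory (and categorized them); 2) poor agent capabilities and output in practice. Although the interface is creative (for example, the office-simulation animation in which idle agents slack off), the core execution results were disappointing, and the user abandoned it over data-privacy concerns.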

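karminski-牙医 @karminski3 · May 26 · 67

What is it like when an LLM writes code faster than you can talk? Zhipu just released a GLM-5.1-highspeed version, so I quickly got beta access to build some fun apps for everyone. I tested the model's response speed: for writing code, human typing can't even keep up with it, so I simply hooked up a speech-to-text service and drove it to write code by voice, my word becoming code. As you can see, it finishes the edit about 3s after I stop speaking. In that window the following all happen: speech-to-text (a third-party service), the model deciding whether the tasks can run concurrently, model prefill, the model using tool calls to modify code segments, and the iframe re-rendering. All of that happens within just 3s. The experience is maxed out. With this model, a quantitative change has become a qualitative one: interactions that were unthinkable before can now be built. So if you want to use it to build highly competitive projects, consider applying for access; the model is currently being offered to some enterprise users. #GLM #GLM51highspeed #智谱AI

Summary: Zhipu has released GLM-5.1-highspeed, a version with extremely fast inference. The tester found that it generates code faster than a human can type and therefore built a speech-to-text programming interaction. From the end of the spoken instruction to the code being modified and the page re-rendered, the whole chain (speech recognition, the model's concurrency check and prefill, and tool calls that modify the code) takes about 3 seconds. A speedup of this magnitude opens up new real-time interaction possibilities. The model is currently in beta with some enterprise users.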

Ethan Mollick @emollick · May 26 · 56

It's very limiting that a big set of very hard problems that we have just lying around are Erdős problems. Don’t get me wrong, they are quite cool, but we really need hard-problem repositories for many fields, including areas that have less specified answers & require judges. Yes, math is the easiest field in which to do verified work, but it is also an area where direct implications of increasing AI ability on everyday life are less clear. We need more types of problems (complex engineering problems, large data sets in economics, physics, biology) for people to turn AI loose on, including specifications of how to evaluate them.

Summary: The post argues that the hard problems currently used to push AI capabilities are too concentrated in mathematics (such as the Erdős problems). Math is easy to verify, but the direct impact of its results on everyday life is less clear. The author calls for hard-problem repositories in more fields, including engineering, economics, physics, and biology, together with matching evaluation methods, so that AI can be turned loose on more complex tasks with less well-specified answers.

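karminski-牙医 @karminski3 · May 25 · 58

Can digital-human models run locally now? Meituan just released a digital-human model, LongCat-Video-avatar-1.5: give it an image and audio, and it generates a talking-head video. I recorded a hands-on test. The demo on HuggingFace Space can currently generate only 5s of video, so I recorded two 480p clips and spliced them together. I deliberately picked a hard case: as you can see, the character's mouth is partly occluded. Judging by the results, there is still a gap to SOTA-level models, mainly in lip sync and in the 720p maximum output resolution. The 720p limit is fairly easy to work around, though: the clarity in my demo video is fine because I simply used AI upscaling to redraw it at 4K. As a local deployment option the model is decent, especially since it also generalizes to anime characters. The model is also on the large side: even with int8 quantization it is 16G, so you need a decent GPU. #longcat #数字人模型 #数字人

Summary: Meituan has released LongCat-Video-avatar-1.5, a digital-human model that generates talking-head video from an image and audio. The demo supports only 5-second 480p clips. In a test case with an occluded mouth, results fell short of SOTA, mainly in lip sync. Maximum resolution is 720p, though AI upscaling to 4K is possible. Local deployment is feasible and the model generalizes to anime characters, but it is large: int8 quantization needs 16G of VRAM.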

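meng shao @shao__meng · May 24 · 45

Tencent's Marvis: I got an itch today and actually tried it. How should I put it? The itch was real, and this hand deserves to be chopped off 😂 In the whole app, the one thing that is somewhat fun is the little dashboard animation of agents simulating an office. Marvis, the project manager, receives a task, jogs over to the agent that needs to be called, and whispers to it; once that agent starts working, Marvis goes back to his desk and watches progress while pretending to be very busy. Even funnier, agents with nothing to do slack off and play games 😄 Did Tencent's product and R&D department model this on how their own department works? Only a few people are ever really busy, the project manager is forever pretending to be busy and chasing progress, everyone has their own way of slacking off, and what the company cares about most is always token (headcount) cost... As for the agent's actual results, never mind, it's hard to put into words.

Summary: Tencent's Marvis features a dashboard animation of AI agents simulating an office. In the animation, the project manager Marvis jogs over to talk to the agent that needs to be called after receiving a task, then returns to his desk to monitor progress once the agent starts working; idle agents "slack off" by playing games. The design pokes fun at familiar workplace dynamics. As for the agent's actual task results, however, the author reports a poor experience that is "hard to put into words".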

Ethan Mollick @emollick · May 24 · 44

GPT-5.5 Pro is a very solid fact checker. I can throw entire chapters at it and it will hunt down every key reference accurately. The only real annoyance is that it loves nuance, so it returns a lot of “the general idea is right, but you are not taking into account tiny detail X”

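Alibaba Cloud @alibaba_cloud · May 23 · 61

The velocity of the Qwen3.7-Max development is unreal. This is what relentless innovation looks like. #AlibabaCloud #Qwen

Summary: The Qwen3.7-Max model newly released by Alibaba Cloud's Qwen team has made striking progress in multimodal generation in a very short time (under a month). Independent testing shows the model going from lagging behind to matching Gemini 3.5 Flash on a specific test and surpassing GPT-5.5 and Claude Opus 4.7. Its rendered images (such as a soccer player with a ball) stand out in proportion and realism, showing excellent spatial reasoning.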