AIHOT
Tag: Evaluation / Benchmarks
Epoch AI @EpochAIResearch · May 6 · 49

The recipe for “classic” reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines. What next? In this week’s Gradient Update, @GregHBurnham argues it’s as easy as dropping one of these four ingredients.

Rohan Paul @rohanpaul_ai · May 6 · 68

GPT-5.5 & Opus 4.7 score <1% on ARC-AGI-3

Artificial Analysis @ArtificialAnlys · May 6 · 58

MiniMax-M2.7 is now available across six inference providers on Artificial Analysis, with significant differentiation in speed and price. @SambaNovaAI leads on speed at 435 output tokens/s, >3x faster than any other provider. @FireworksAI_HQ, @novita_labs, @togethercompute, and @GMI_cloud have all matched @MiniMax_AI's first-party API pricing, while SambaNova is 2x higher.
Key takeaways:
➤ Fireworks and SambaNova are on the Pareto frontier for Speed vs. Price. At 127 output tokens/s and ~$0.22 per 1M tokens blended, Fireworks is ~2.2x faster than MiniMax's first-party API at the same blended price, whereas SambaNova delivers 435 output tokens/s but at ~2-3.5x the blended price of the other providers (depending on cache usage)
➤ SambaNova is the fastest provider at 435 output tokens/s, ~3.4x the next fastest provider (Fireworks at 127 output tokens/s). The remaining providers run substantially slower: MiniMax’s first-party API at 57 output tokens/s, Novita at 54, GMI at 41, and Together AI at 29
➤ Cache discounts vary across providers. Fireworks, MiniMax, Novita, and Together AI offer 80% cache hit discounts, while GMI and SambaNova do not offer a discount. For cache-heavy workloads, this can materially increase the relative pricing for GMI and SambaNova
➤ Optimal provider choice depends on workload. SambaNova may be more suited to latency-sensitive deployments, albeit at a higher cost, while Fireworks may be more suitable for high-volume workloads that are not as latency-sensitive

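Summary: MiniMax-M2.7 is now live on six inference providers, with clear differences in speed and price. SambaNova leads at 435 output tokens/s, more than 3x faster than any other provider, but at roughly 2x the price. Fireworks, Novita, Together AI, and GMI match MiniMax's first-party API pricing. Fireworks and SambaNova sit on the Pareto frontier for speed versus price: the former offers strong value, the latter trades a higher price for maximum speed. Cache discount policies also differ across providers, which materially affects cost for cache-heavy workloads. The best choice therefore depends on how sensitive the workload is to latency and cost.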
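A minimal sketch of how the "Pareto frontier for Speed vs. Price" claim can be checked. The speeds are the figures quoted in the post. The prices are assumptions for illustration: ~$0.22 per 1M blended tokens for every provider said to match first-party pricing, and 2x that for SambaNova (the post gives a range of ~2-3.5x depending on cache usage). The `Provider` type and function names are hypothetical.

```python
from typing import NamedTuple


class Provider(NamedTuple):
    name: str
    tokens_per_s: float  # output speed quoted in the post
    usd_per_m: float     # blended price per 1M tokens (ASSUMED, see lead-in)


PROVIDERS = [
    Provider("SambaNova", 435, 0.44),
    Provider("Fireworks", 127, 0.22),
    Provider("MiniMax", 57, 0.22),
    Provider("Novita", 54, 0.22),
    Provider("GMI", 41, 0.22),
    Provider("Together AI", 29, 0.22),
]


def dominates(a: Provider, b: Provider) -> bool:
    """True if a is at least as fast and as cheap as b, and strictly better on one axis."""
    no_worse = a.tokens_per_s >= b.tokens_per_s and a.usd_per_m <= b.usd_per_m
    strictly_better = a.tokens_per_s > b.tokens_per_s or a.usd_per_m < b.usd_per_m
    return no_worse and strictly_better


def pareto_frontier(providers: list[Provider]) -> list[str]:
    """Names of the providers that no other provider dominates."""
    return [p.name for p in providers
            if not any(dominates(q, p) for q in providers)]


print(pareto_frontier(PROVIDERS))
```

Under these assumed prices only SambaNova and Fireworks remain undominated, which is consistent with the post. The other four providers charge the same as Fireworks but run slower.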

Luma @LumaLabsAI · May 5 · 71

Multimodal at the frontier. Built around your business.

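Summary: Luma Labs' UNI-1.1-Max and UNI-1.1 multimodal models rank third in Image Arena's combined text-to-image and image-editing ranking, without using agentic search. In the text-to-image arena the two models rank sixth and seventh respectively; in the multi-image and single-image editing arenas both place in the top eleven, with UNI-1.1-Max ranking seventh in single-image editing. The result marks solid progress for Luma Labs at the multimodal frontier.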

Deedy @deedydas · May 5 · 62

The creators of SWE-Bench just dropped a really simple new benchmark every LLM gets 0% on. ProgramBench asks: can models recreate real executable programs (ffmpeg, SQLite, ripgrep) from scratch with no internet? We are far from saturated on model quality.

OpenRouter @OpenRouter · May 5 · 65

We analyzed GPT 5.5 vs GPT 5.4 and found that costs increased between 49-92%. The 2x price hike of GPT 5.5 is mitigated by the model generating 19-34% fewer completion tokens for longer prompts. More analysis here: https://openrouter.ai/announcements/gpt55-cost-analysis
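A rough sketch of why a 2x price increase can produce a net cost increase below 100% when the model emits fewer completion tokens. It assumes input and output prices both double and that prompt token counts stay the same; the output share of the old bill is a hypothetical parameter, not a figure from the analysis.

```python
def cost_ratio(price_multiplier: float, output_share: float, token_reduction: float) -> float:
    """Return new cost divided by old cost.

    output_share: fraction of the OLD bill spent on completion tokens (hypothetical).
    token_reduction: fractional drop in completion tokens (0.19 to 0.34 in the post).
    Assumes one price multiplier for input and output, and unchanged prompt tokens.
    """
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be in [0, 1]")
    input_part = 1.0 - output_share
    output_part = output_share * (1.0 - token_reduction)
    return price_multiplier * (input_part + output_part)


for share in (0.25, 0.50, 0.75):
    for reduction in (0.19, 0.34):
        increase = (cost_ratio(2.0, share, reduction) - 1.0) * 100
        print(f"output share {share:.0%}, {reduction:.0%} fewer tokens: +{increase:.0f}% cost")
```

The larger the share of spend that goes to completion tokens, the more the token savings offset the price increase.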

Berryxia.AI @berryxia · May 5 · 58

Gotta say, AI tools that aren't especially hyped are great, because they don't get dumbed down! Grok 4.3 has been pretty good lately.

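Summary: Grok 4.3 recently showed leading reasoning ability in law and finance on Vals AI's private benchmarks. On CaseLaw (v2), which is built on real Canadian court cases, it scored 79.31% accuracy, ahead of GPT-5.1; on CorpFin (v2), based on complex multi-page credit agreements, it reached 68.53%. These tests target hard real-world tasks such as deep legal reasoning and financial contract comprehension. The summary presents the results as evidence of strong performance in high-stakes real-world domains and of xAI's aim to build a world-class reasoning engine.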

SemiAnalysis @SemiAnalysis_ · May 5 · 71

MINECRAFT STEVE ALERT: GB300 ultra NVL72 is already 2.7x faster 🚀 than GB200 NVL72 on @vllm_project, one of the industry-standard inference engines. On paper, GB300 has only ~1.5x the NVFP4 FLOPs, 1.5x the HBM capacity, and the same HBM bandwidth as GB200, but thanks to full-stack optimization with compounding gains, GB300 is up to 2.7x faster in the middle of the curve, where most providers serve. End-to-end performance is the gold standard, not theoretical on-paper FLOPs. Thanks to the 10x engineers at NVIDIA, @inferact, and @coreweave for this temporary GB300 for open source projects!

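Summary: Tests on vLLM, an industry-standard inference engine, show NVIDIA GB300 NVL72 delivering 2.7x the measured end-to-end performance of GB200 NVL72. On paper it offers only about 1.5x the NVFP4 compute, 1.5x the HBM capacity, and the same bandwidth, but in the mid-load range where most providers actually operate, compounding gains from full-stack optimization push performance well beyond the theoretical compute increase. The tests ran on a temporary GB300 system provided for open source projects by NVIDIA, Inferact, and CoreWeave. The result supports the view that measured end-to-end performance, not on-paper compute, is the gold standard for hardware.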

swyx 🇸🇬 @swyx · May 5 · 61

seeing lot of people saying that Opus 4.7 is a net regression vs 4.6, but it seems quite anecdotal. offline and online evals point towards a clean step up. what's not being captured? "personality"?

Artificial Analysis @ArtificialAnlys · May 5 · 69

A new anonymous model debuts at #8 in the Artificial Analysis Text to Image Arena! Peanut’s weights are expected to be released soon, which would make it the leading Text to Image Open Weights Model. Peanut is positioned to be the new leading open weights Text to Image model, surpassing Z-Image Turbo, Qwen-Image, and FLUX.2 [dev]. Further details (and weights) coming soon. See example generations from Peanut in the Artificial Analysis Image Arena below 🧵

Elon Musk @elonmusk · May 5 · 41

Try Grok

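Summary: On Vals AI's private benchmarks, Grok 4.3 showed leading performance in law and finance. It ranked first on CaseLaw (v2) with 79.31% accuracy, ahead of GPT-5.1; the test is built on real Canadian court cases and evaluates deep legal reasoning and understanding of precedent. It also ranked first on CorpFin (v2) with 68.53% accuracy; that test targets complex long-term credit agreements and evaluates understanding of terms and risks in multi-page financial contracts. The summary takes these high-stakes tests as evidence of strong reasoning on the hardest tasks. xAI says it is building the reasoning engine the world needs.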

Epoch AI @EpochAIResearch · May 5 · 46

Are AI benchmarks doomed? @GregHBurnham and @tmkadamcz join @ansonwhho to push back on benchmark pessimism and dig into what the next generation of AI benchmarks could look like.
(0:00:00) - Preview
(0:00:36) - Intro: Are AI benchmarks doomed?
(0:03:13) - The costs and benefits of benchmark development
(0:11:48) - MirrorCode and scalable benchmarks
(0:20:57) - AI speed-up in benchmark development
(0:23:28) - The benchmark-reality gap
(0:38:26) - Can an AGI benchmark exist?
(0:43:18) - Beyond automated scoring
(1:00:45) - How AI changes benchmark building in practice

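Summary: The speakers push back on the pessimistic view that AI benchmarks are doomed and explore what the next generation of benchmarks could look like. Topics include the costs and benefits of benchmark development, building scalable benchmarks such as MirrorCode, how AI itself speeds up benchmark development, and the gap between benchmark results and real-world capability. The conversation also covers whether an AGI benchmark can exist and looks at evaluation beyond automated scoring.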

Chubby♨️ @kimmonismus · May 4 · 62

A little-known startup just landed on the @ArtificialAnlys AI Video leaderboard, now ranked among the top 6 in the world. Very cool @video_rebirth

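Summary: Bach-1.0 Preview, a text-to-video model from startup Video Rebirth, debuted at sixth place on Artificial Analysis's global AI video leaderboard. Its performance is comparable to well-known models such as Vidu Q3 Pro, Kling 3.0 Omni 1080p (Pro), and grok-imagine-video. A broad release is planned for late May.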

Ethan Mollick @emollick · May 3 · 57

It's getting hard to benchmark frontier agent performance on longer tasks. Repeated measurement is very expensive and there are differences between using models in harnesses versus via APIs. I suspect benchmarks understate progress; they are built for models, not harnessed agents.

Eric @ericmitchellai · May 3 · 50

What has ChatGPT helped you learn? How does it fall short as a learning or teaching tool?

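Summary: Comparing GPT-5.4 and GPT-5.5 as teaching tools, a user points out a key difference in how they explain concepts. GPT-5.4 tends to describe the concept first and then make the learner connect it back to its label, which adds cognitive load. GPT-5.5 gives the label first (for example, "derivative") and immediately follows with the explanation (for example, "describes the rate of change"). This label-first structure keeps explanations flowing without forcing the learner to backtrack and reassemble information, which helps hold attention over long teaching conversations.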

Chubby♨️ @kimmonismus · May 2 · 51

Nice! Google is preparing for I/O. New models soon.

Elon Musk @elonmusk · May 2 · 39

Grok Voice is used by Starlink right now

Quoted post (@XFreeze): Grok Voice dominates the τ-voice benchmark. Grok scores 67.3%, versus 43.8% for Gemini and 35.3% for GPT Realtime, a wide lead over the competition. The best real-time reasoning voice assistant available today.

TestingCatalog News 🗞 @testingcatalog · May 2 · 66

GOOGLE 🚨: A new Gemini Flash model has been spotted on LM Arena. Besides that, Vertex AI customers who still use Gemini Flash 2 received an email that it will be distributed soon. > Transition to Gemini 3.1 Flash Lite - Generally Available soon! Soon 🔜 h/t @hishtadlut

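Summary: A new Google Gemini Flash model has appeared on LM Arena, and Vertex AI customers have received an email saying Gemini 3.1 Flash Lite will be generally available soon. A quoted post notes that although the model still appears as "Gemini 3 Flash" in the arena, its output quality has jumped two tiers and is closer to the current Gemini 3.1 Pro, a major upgrade; the actual version may be 3.1, 3.2, or 3.5 Flash.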

François Chollet @fchollet · May 2 · 37

If you want to help the world make sense of AGI and accelerate its arrival, consider joining the ARC Prize foundation. Two roles currently open: Game Platform Engineering Lead, and Model Testing & Analysis Lead https://arcprize.org/jobs

François Chollet @fchollet · May 2 · 56

The latest crop of models remains below 1% on ARC-AGI-3 -- for now. Where will the scores be by the end of the year?

François Chollet @fchollet · May 2 · 70

RL is a bit of a double edged sword: in known territory performance increases, but in unknown territory the model tends to hallucinate that it is performing a completely different task it was trained on

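Summary: Reinforcement learning improves model performance in known territory, but in unknown territory it can lead the model to hallucinate that it is performing a different task it was trained on. This shows up in ARC AGI 3 results for large models such as GPT-5.5, which scored just 0.43%, similar to Claude 4.6 and Gemini 3.1. The analysis attributes GPT-5.5's failures mainly to three causes: correct local effects but a wrong world model, abstractions drawn from training data at the wrong level, and solving the problem without reinforcing the reward mechanism. Studying such failure cases helps clarify the capability limits of large models in specific modalities and where to improve them.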

PixVerse @PixVerse_ · May 1 · 49

Thanks @TomLikesRobots ! Wishing you a happy and cozy weekend!

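Summary: The post thanks @TomLikesRobots for sharing a comparison of text-to-video models. The comparison pits SeeDance 2.0 against HappyHorse 1.0, using the same prompt to generate video with a lo-fi, cozy, cel-style anime aesthetic. HappyHorse is offered by @PixVerse_ and is currently free for members. Because the built-in audio from both models was poor, the creator used @Suno to generate the background track.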

TestingCatalog News 🗞 @testingcatalog · May 1 · 55

Grok 4.3 got to the 7th spot on the Artificial Analysis Index, surpassing Muse Spark from Meta.

Rohan Paul @rohanpaul_ai · May 1 · 43

The LongCat team just released LARYBench, a benchmark built to test whether an AI model truly learns action from video, instead of only looking good when attached to a robot policy later. It evaluates latent actions, meaning the hidden motion signals a model extracts from video, across 1.2M+ clips, 620K+ image pairs, 595K trajectories, 151 action classes, and 11 robot platforms. A latent action representation tries to store the change between frames as something like reach, pick, place, move left, or close gripper, rather than memorizing raw pixels. The key point is that robot training data is scarce, while human and robot videos are abundant, so the whole field wants a way to turn cheap video into useful action knowledge. The paper argues that older evaluations mixed too many things together, because a robot succeeding on a task depends on the policy, training recipe, environment, and controller, so you could not tell whether the action representation itself was actually good. LARYBench splits the problem into 2 cleaner tests, where one asks whether the representation knows what happened and the other asks whether it preserves enough detail for how to move. The biggest result is that general self-supervised vision models beat specialized embodied LAMs, with V-JEPA 2 reaching 76.62% average action classification accuracy, while DINOv3 gives the best overall control regression score at 0.19 MSE, far ahead of embodied models clustered around 0.87 to 0.97. The deeper point is that strong visual representations already contain a surprising amount of action knowledge, and the paper also shows that latent feature spaces map to robot control better than pixel reconstruction spaces, which helps explain why some robotics systems may be building on the wrong intermediate representation. 🧵 1.

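Summary: The LongCat team has released LARYBench, a benchmark that tests whether an AI model truly learns action from video rather than only performing well once attached to a downstream robot policy. It focuses on the latent action representations a model extracts from video and, using data that includes more than 1.2 million clips, splits evaluation into two cleaner tests: action classification and control regression. The key finding is that general self-supervised vision models such as V-JEPA 2 and DINOv3 outperform specialized embodied models, suggesting that strong visual representations already contain substantial action knowledge and that latent feature spaces map to robot control better than pixel reconstruction. This points to a way of using abundant video to address the scarcity of robot training data.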

Artificial Analysis @ArtificialAnlys · May 1 · 57

All three leading open weights models were released last week. Progress continues for open weights models alongside proprietary ones, with the gap to GPT-5.5, the leading proprietary model, sitting at 6 points on the Artificial Analysis Intelligence Index @Kimi_Moonshot’s Kimi K2.6 (Reasoning) and @Xiaomi's MiMo V2.5 Pro (Reasoning) tie as the leading open weights models on the Artificial Analysis Intelligence Index at 54, with @deepseek_ai's DeepSeek V4 Pro (Reasoning, Max Effort) at 52. This places the best open weights models within 3-6 points of the leading proprietary models: @OpenAI's GPT-5.5 (xhigh) at 60, and @Google's Gemini 3.1 Pro Preview and @AnthropicAI's Claude Opus 4.7 (Adaptive Reasoning, Max Effort) at 57. For context: just one year ago the highest-scoring open weights model was DeepSeek V3 0324 which achieved 22 on the Intelligence Index, and was ~13 points below the highest-scoring proprietary model, Claude 3.7 Sonnet (Reasoning) at 35. Key takeaways: ➤ The top three most intelligent open weights models are trillion-plus-parameter MoE architectures with permissive licenses. Kimi K2.6 (Reasoning) has 1T total / 32B active parameters with 256K context window, MiMo V2.5 Pro (Reasoning) has 1T total / 42B active with 1M context window, and DeepSeek V4 Pro (Reasoning, Max Effort) has 1.6T total / 49B active with 1M context window. ➤ The gap to proprietary remains wide on the hardest reasoning and agentic coding evaluations. On HLE (Humanity's Last Exam) the three top open weights models score 34-36%, vs 44% for GPT-5.5 (xhigh) and 45% for Gemini 3.1 Pro Preview. On CritPt (Research-level Physics) they score 4-12%, vs 27% for GPT-5.5 (xhigh). On TerminalBench Hard (Agentic Coding & Terminal Use) they score 43-46%, vs 61% for GPT-5.5 (xhigh) and 54% for Gemini 3.1 Pro Preview. 
➤ Omniscience (knowledge + hallucination) shows a large gap to proprietary models, with DeepSeek V4 Pro (Reasoning, Max Effort) hallucinating significantly more than its open weights peers. DeepSeek V4 Pro (Reasoning, Max Effort) scores -10, MiMo V2.5 Pro (Reasoning) +4, and Kimi K2.6 (Reasoning) +6. By comparison, GPT-5.5 (xhigh) scores +20, Claude Opus 4.7 (Adaptive Reasoning, Max Effort) +26, and Gemini 3.1 Pro Preview +33.

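Summary: Three leading open weights models, Kimi K2.6, MiMo V2.5 Pro, and DeepSeek V4 Pro, were released last week and score 52-54 on the Artificial Analysis Intelligence Index. The best of them sit 6 points behind the top proprietary model, GPT-5.5 at 60, a large improvement over a year ago, when the leading open weights model scored 22. All three are trillion-parameter-scale MoE architectures. Open weights models still trail proprietary ones clearly on hard reasoning, agentic coding, and knowledge accuracy: they score well below on HLE, CritPt, and TerminalBench Hard, and on Omniscience DeepSeek V4 Pro's hallucination problem stands out.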

elvis @omarsar0 · May 1 · 58

I have been testing DeepSeek-V4-Pro with the Pi coding agent. I am mindblown by how well it works out of the box. A few notes: I spent a few hours building an LLM wiki with an agent powered entirely by DeepSeek-V4-Pro on @FireworksAI_HQ inference. This is the first time I feel like there is an open-weight model that can reason at the level of Claude and Codex. And it does this in a cost-effective way with support for 1M context length. To be clear, I am using DeepSeek-V4-Pro inside of Pi without any special configuration. It works out of the box. It's exciting that there is a model that can just be plugged into a basic harness like Pi, and it just works. I've never seen that before. Most models require lots of configuration and setup. @deepseek_ai's DeepSeek-V4-Pro is clearly good at agentic coding (probably the best from the open-weight models), but the model is also great on knowledge-intensive tasks where reasoning matters. The agent pulled agentic engineering best practices from different company docs (Anthropic, OpenAI, Google, Stripe, Meta, Modal, DeepSeek, Mistral, Cohere), searched and digested Reddit and HN threads, summarized arxiv papers, and surfaced trending GitHub repos. Then it distilled everything into actionable tips across categories. I love the Wiki it built. The quality is really good. Here is a snapshot of what the wiki looks like: https://github.com/dair-ai/dair-workshops/tree/main/agentic-engineering-wiki DeepSeek-V4-Pro handled the task without breaking stride. Multi-step research queries, code generation for scaffolding, context-heavy reasoning across disparate sources. For coding specifically, this is the first open-weight model that genuinely feels like a Codex or Claude Code experience. It compares in capability and actual multi-turn agentic work. What made the loop feel so responsive was Fireworks' inference speed (the fastest in the market) and the fact that they actually validate models at the systems level before shipping. 
No corrupted reasoning traces. Just fast, reliable iteration. The hybrid CSA and HCA attention design cuts KV cache to just 10% and inference FLOPs by nearly 4x at 1M-token context. This is what makes the agent loop actually fast and cheap enough to run in practice. For devs who've been watching open-weight models close the gap but haven't found one that actually delivers in practice, this is the closest I've seen. Try it here: https://app.fireworks.ai/models/fireworks/deepseek-v4-pro

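Summary: The tester used DeepSeek-V4-Pro inside the Pi coding agent to build an LLM wiki and was impressed by how well it works out of the box. In the tester's view it is the first open-weight model to reason at the level of Claude and Codex, and it does so cost-effectively with support for a 1M-token context. The model runs in a basic harness without special configuration and handles agentic coding and knowledge-intensive reasoning, including multi-step research across company docs, forums, papers, and code repositories, code generation, and context-heavy reasoning. The post credits the responsiveness to Fireworks' inference speed, which it describes as the fastest in the market, and to the hybrid attention design, which cuts KV cache to 10% and inference FLOPs by nearly 4x.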

Ethan Mollick @emollick · May 1 · 61

The new Grok comes in below the latest Chinese open weights models; Grok 4 was at the frontier when released. (& Artificial Analysis: please stop using GDPval-AA, which is not a useful test of anything except a model’s ability to impress Gemini as a judge)

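Summary: xAI has released Grok 4.3, which scores 53 on the Artificial Analysis Intelligence Index, ahead of models such as Grok 4.20 and Muse Spark. The main improvement is cost efficiency: input and output prices are down roughly 40% and 60% respectively from the previous version, and the cost of running the benchmark suite has fallen. It shows marked gains on real-world agentic tasks such as GDPval-AA, with strong instruction following and customer service performance. The post notes, however, that it still trails the latest Chinese open weights models, and criticizes GDPval-AA as a test of limited value.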

OpenRouter @OpenRouter · May 1 · 68

The new Grok-4.3 from @xai is live on OpenRouter! Grok-4.3 releases at a lower price than Grok-4.2, while seeing a large jump in agentic performance: a 321 point increase to 1500 ELO on @ArtificialAnlys GDPval-AA, surpassing other top models despite the lower price.

Rohan Paul @rohanpaul_ai · May 1 · 58

Frontier AI can now autonomously chain complex, expert-level cyber attacks end-to-end, at superhuman speed and near-zero marginal cost. GPT-5.5 essentially tied with Mythos Preview - within the margin of error — both far ahead of earlier models (GPT-4o, Claude Opus 4.x, etc.). - GPT-5.5: 71.4% (±8.0%) - Mythos Preview: 68.6% (±8.7%) AISI has been running controlled, realistic cybersecurity evaluations on the latest AI models. These include: - Narrow CTF-style tasks (expert-level challenges like exploiting memory corruptions, breaking crypto, reverse-engineering stripped binaries, etc.). - Multi-step “cyber range” simulations — a full 32-step corporate network attack chain (recon → initial access → lateral movement → privilege escalation → full network takeover). A human expert needs ~20 hours for this. They previously tested Mythos Preview, and now OpenAI’s GPT-5.5. One hard reverse-engineering task (custom virtual machine) takes a human expert ~12 hours with professional tools. GPT-5.5 solved it in under 11 minutes at a cost of $1.73.

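Summary: Frontier AI can now autonomously carry out complex, expert-level cyber attack chains end-to-end, at superhuman speed and near-zero marginal cost. In AISI's cybersecurity evaluations, GPT-5.5 and Mythos Preview performed on par, both far ahead of earlier models such as GPT-4o. The evaluations include a 32-step corporate network attack simulation that takes a human expert about 20 hours. In one reverse-engineering task that takes a human expert about 12 hours, GPT-5.5 finished in under 11 minutes at a cost of $1.73.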
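A small sketch of the arithmetic behind "within the margin of error" and the reverse-engineering speed-up, using only the numbers quoted in the post. Treating each score as a symmetric interval and checking for overlap is a simplifying assumption; it is a rough screen, not a formal significance test. The helper names are hypothetical.

```python
def interval(mean: float, margin: float) -> tuple[float, float]:
    """Symmetric interval (low, high) around a reported score."""
    return (mean - margin, mean + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


gpt_55 = interval(71.4, 8.0)   # GPT-5.5: 71.4% (±8.0%)
mythos = interval(68.6, 8.7)   # Mythos Preview: 68.6% (±8.7%)
print("intervals overlap:", overlaps(gpt_55, mythos))

# Human expert: ~12 hours. GPT-5.5: under 11 minutes, so this is a lower bound.
speedup = (12 * 60) / 11
print(f"speed-up of at least ~{speedup:.0f}x")
```

The 2.8-point difference between the two scores is much smaller than either margin, which is why the post describes the models as essentially tied.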

Chubby♨️ @kimmonismus · May 1 · 60

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside @atomic_chat_hq (prompt is below) Device: MacBook Pro M5 Max, 64GB RAM Results: Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens So what is more important: tokens per second, or the quality of the final answer? Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time. In this one-shot Pac-Man gamedev contest, Gemma 4 31B was the clear winner. Its game logic was stronger: click reactions were smoother, and it handled interactions with elements like walls, ghosts, and particle effects better. But this was only one test. Maybe Qwen 3.6 27B can show better results with better settings. Open the comments, try our prompt, and share your result below.

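Summary: In a local LLM gamedev contest inside @atomic_chat_hq, Gemma 4 31B faced Qwen 3.6 27B on a MacBook Pro M5 Max. Qwen generated faster (32 tokens/s) and gave a more creative response, but Gemma needed only 3m 51s and 6,209 tokens to produce a shorter, clearer, more logical answer. In the Pac-Man game logic itself, Gemma did better on click reactions, interactions with walls and ghosts, and particle effects. The author stresses that this was a single test, notes that Qwen might do better with different settings, and invites the community to try the prompt.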
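The trade-off in this post comes down to total time being roughly tokens divided by throughput, so a slower model that writes far fewer tokens can still finish first. A minimal sketch using the figures from the post; the helper names are hypothetical, and the estimate ignores prompt processing and any other overhead.

```python
def generation_seconds(tokens: int, tokens_per_s: float) -> float:
    """Estimated decode time, ignoring prompt processing and other overhead."""
    return tokens / tokens_per_s


def fmt(seconds: float) -> str:
    """Format a duration as minutes and seconds, rounded to the nearest second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


runs = {
    "Qwen 3.6 27B": (33_946, 32),  # tokens generated, tokens/sec (from the post)
    "Gemma 4 31B": (6_209, 27),
}
for model, (tokens, speed) in runs.items():
    print(model, fmt(generation_seconds(tokens, speed)))
```

The estimates land close to the reported wall times of 18m 04s and 3m 51s; the small gaps are presumably time spent outside token generation.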

Artificial Analysis @ArtificialAnlys · May 1 · 65

Ant Group has just released Ling 2.6 1T, an open weights, non-reasoning model with high cost efficiency and a reasonable intelligence tradeoff. Ling 2.6 1T scores 34 on the Artificial Analysis Intelligence Index, a 15-point jump from Ling-1T Ling 2.6 1T is the latest model from Ant Group’s @TheInclusionAI lab. Ant Group recently released Ling 2.6 Flash, a 104B total parameter non-reasoning model. Ling 2.6 1T’s weights have been publicly released on Hugging Face. Key takeaways: ➤ Comparable intelligence to similarly sized non-reasoning models: At 1T total parameters, Ling 2.6 1T sits near DeepSeek V3.2 (non-reasoning, 32) and Kimi K2.5 (non-reasoning, 37) in intelligence. This is a marked improvement from Ling-1T, which scores 19 on the Intelligence Index. However, there remains a ~10-point gap to frontier non-reasoning open weights models such as GLM-5.1 (non-reasoning, 44) and Kimi K2.6 (non-reasoning, 43). ➤ Strong performance in scientific reasoning and knowledge: Ling 2.6 1T scores 75% on GPQA and 8% on Humanity’s Last Exam (HLE), indicating solid performance on graduate-level reasoning and knowledge recall tasks. This is comparable to DeepSeek V3.2 (non-reasoning), which achieves 75% on GPQA and 11% on HLE. ➤ Efficient token usage: Ling 2.6 1T uses ~16M output tokens to run the Artificial Analysis Intelligence Index, making it more efficient than MiMo V2 Flash (non-reasoning, ~17M), and significantly more efficient than GLM-5.1 (non-reasoning, ~75M) and Kimi K2.6 (non-reasoning, ~27M) ➤ Strong cost-to-intelligence positioning: At $0.30 per million input tokens and $2.50 per million output tokens on InclusionAI’s first-party API, Ling 2.6 1T costs only ~$95 to run the full Artificial Analysis Intelligence Index. This positions it competitively for large-scale workloads relative to models in a similar intelligence tier. ➤ Relatively weak factual reliability: Ling 2.6 1T scores -51 on AA-Omniscience, our benchmark for factual accuracy and hallucination. 
This is primarily driven by a high hallucination rate (92%), which is similar to GPT-5.5 (non-reasoning, 91%). However, its 21% accuracy is broadly in line with comparable non-reasoning models. Additional model details: ➤ Size: 1T total parameters ➤ Pricing: $0.30 / $2.50 per 1M input/output tokens (via Novita API) ➤ License: Weights not yet released ➤ Availability: First-party API through InclusionAI

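Summary: Ant Group's InclusionAI lab has released Ling 2.6 1T, an open weights non-reasoning model. It has 1 trillion parameters and scores 34 on the Artificial Analysis Intelligence Index, up 15 points from Ling-1T and close to peers such as DeepSeek V3.2. It performs solidly on scientific reasoning and knowledge tasks, scoring 75% on GPQA. It is token-efficient, using about 16M output tokens to run the index, and cost-effective, at about $95 to run the full index through the first-party API. Factual reliability is weak: it scores -51 on AA-Omniscience, driven mainly by a 92% hallucination rate. The model weights are publicly available on Hugging Face.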

Artificial Analysis @ArtificialAnlys · May 1 · 46

GPT-5.5 Pro achieves a small bump on GPT-5.4 Pro with 60% lower cost and token use in our frontier science eval, CritPt. CritPt tests models on graduate-level physics research problems contributed by 60+ researchers from 30+ institutions globally. When CritPt was released in November 2025, the highest score was 9% (Gemini 3 Pro Preview). ~4 months later, GPT-5.4 Pro (xhigh) tripled this score with 30%. Now, GPT-5.5 Pro (xhigh) has surpassed this result by half a percentage point at 60% lower cost. The model is priced identically per token, but used fewer tokens to complete the evaluation. According to OpenAI, GPT-5.5 Pro “uses more compute to think harder and provide consistently better answers” than GPT-5.5. Congratulations @OpenAI and @sama on this result

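Summary: On the frontier science eval CritPt, GPT-5.5 Pro (xhigh) improved on GPT-5.4 Pro (xhigh) by 0.5 percentage points, to 30.5%, with 60% lower cost and token use. CritPt consists of graduate-level physics research problems contributed by 60+ researchers from 30+ institutions worldwide. Since its release in November 2025, the top score has risen from 9% (Gemini 3 Pro Preview) to 30% (GPT-5.4 Pro). According to OpenAI, GPT-5.5 Pro "uses more compute to think harder and provide consistently better answers" than GPT-5.5. Per-token pricing is unchanged, but the model used fewer tokens to complete the evaluation.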

Chubby♨️ @kimmonismus · May 1 · 46

GPT-5.5 on par with Claude Mythos on multi-step cyber-attack simulations? OpenAI: comeback of the year.

Sam Altman @sama · May 1 · 43

lisan say more mean things about us you're being too nice

Quoted post (@scaling01): GPT-5.5 is on par with Claude Mythos - GPT-5.5: 71.4% (±8.0%) average pass rate - Mythos Preview: 68.6% (±8.7%) - GPT-5.5 completed a task that takes a human expert about 12 hours within 11 minutes at a cost of $1.73

Artificial Analysis @ArtificialAnlys · May 1 · 64

Alibaba's Qwen3.6 27B is the new open weights leader under 150B parameters scoring 46 on the Artificial Analysis Intelligence Index, but uses ~3.7x the output tokens and costs ~21x more than Gemma 4 31B (39) to run the full Intelligence Index @Alibaba_Qwen has released two open weights models in the Qwen3.6 family: Qwen3.6 27B (Dense, 46 on the Intelligence Index) and Qwen3.6 35B A3B (MoE, 43). The MoE variant has 36B total parameters but only activates 3B per forward pass. Both are Apache 2.0 licensed, support 262K context, include native multimodal input, and use the unified thinking/non-thinking hybrid architecture. Unlike Qwen3.5, Alibaba has not released larger Qwen3.6 models as open weights - Qwen3.6 Plus and Qwen3.6 Max Preview remain proprietary, so the Qwen3.6 open weights family is currently all under 50B models. All scores below are for reasoning mode. The Intelligence Index is our synthesis metric incorporating 10 evaluations covering agentic tasks, coding, and scientific reasoning. Key takeaways: ➤ Qwen3.6 27B is the most intelligent open weights model under 150B parameters. At 46 on the Intelligence Index, Qwen3.6 27B is ahead of Qwen3.6 35B A3B (43), Qwen3.5 27B (42), and Gemma 4 31B (39). It is also ahead of larger open weights models including NVIDIA Nemotron 3 Super 120B A12B (Reasoning, 36), Qwen3.5 122B A10B (42) and gpt-oss-120b (high, 33). In native BF16 precision, the 27B takes ~56GB to store the weights, fitting on a single H100, and in 4-bit quantization the weights fit on consumer hardware with 16GB+ of RAM ➤ Qwen3.6 35B A3B is the most intelligent open weights model with ~3B active parameters, 6 points ahead of Qwen3.5 35B A3B (37) and 13 points ahead of GLM-4.7-Flash (30). Other ~3B active peers include Gemma 4 26B A4B (31), Qwen3 Coder Next (80B total, 28), and NVIDIA Nemotron Cascade 2 30B A3B (28) ➤ AA-Omniscience improvement is driven entirely by abstention rather than accuracy. 
Qwen3.6 27B's hallucination rate falls from 80% to 48% versus Qwen3.5 27B, while accuracy is roughly flat - consistent with our finding that AA-Omniscience accuracy typically correlates with total parameter count and Qwen3.6 27B retains the same 27B parameter count as its predecessor. The 35B A3B shows the same pattern whereby hallucination drops from 84% to 50% while accuracy remains equivalent ➤ Token usage is up across both models versus Qwen3.5 and significantly higher than Gemma 4 31B. Qwen3.6 27B used ~144M output tokens to run the Intelligence Index (~1.5x Qwen3.5 27B at 98M, ~3.7x Gemma 4 31B at 39M). Qwen3.6 35B A3B used ~143M (~1.4x Qwen3.5 35B A3B at 100M, ~3.7x Gemma 4 31B) ➤ The 27B got materially more expensive while the 35B A3B is roughly flat versus predecessor. Per-token pricing on Alibaba Cloud moved differently, with the 27B going from $0.30/$2.40 to $0.60/$3.60 while the 35B A3B (Reasoning) remains nearly flat at $0.248/$1.485 (vs $0.25/$2.00 for Qwen3.5 35B A3B). Qwen3.6 27B costs ~$659 to run the Intelligence Index, ~2.2x Qwen3.5 27B (~$299) and ~21x Gemma 4 31B (~$31 at median third-party pricing of $0.14/$0.40 per 1M input/output tokens). Qwen3.6 35B A3B costs ~$280, roughly tied with Qwen3.5 35B A3B (~$302) and ~9x Gemma 4 31B ➤ Qwen3.6 27B is competitive with leading models on agentic real-world work tasks despite its size. At 1414 Elo on GDPval-AA, Qwen3.6 27B is ahead of recent open weights peers Qwen3.6 35B A3B (1297), Qwen3.5 27B (1157) and Gemma 4 31B (1115), but trails larger open weights leaders including DeepSeek V4 Pro (Reasoning, Max Effort, 1554) and GLM-5.1 (Reasoning, 1535). It matches DeepSeek V4 Flash (Reasoning, High Effort, 1414) at 284B total parameters, and sits roughly in line with GPT-5.4 mini (xhigh, 1436) and Muse Spark (1421). ➤ Non-reasoning variants remain equivalent versus Qwen3.5. 
Qwen3.6 27B (Non-reasoning, 37) is effectively tied with Qwen3.5 27B (Non-reasoning, 37); Qwen3.6 35B A3B (Non-reasoning, 32) is equivalent to Qwen3.5 35B A3B (Non-reasoning, 31). The Qwen3.6 generation gains are concentrated in reasoning mode Other information: ➤ Context window: 262K tokens (equivalent to Qwen3.5) ➤ License: Apache 2.0 ➤ Multimodality: Native vision input (text and image), text output ➤ API pricing (Alibaba Cloud): Qwen3.6 27B: $0.60/$3.60, Qwen3.6 35B A3B (Reasoning): $0.248/$1.485 ➤ Availability: Available on Alibaba Cloud first-party API. Qwen3.6 35B A3B is available on several third-party APIs such as @DeepInfra, @parasail_io, @clarifai and @novita_labs

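Summary: Alibaba has released two open weights models in the Qwen3.6 family: a 27B dense model and a 35B A3B mixture-of-experts model. Qwen3.6 27B scores 46 on the Artificial Analysis Intelligence Index, making it the most intelligent open weights model under 150B parameters, ahead of Gemma 4 31B and others. However, it uses about 3.7x the output tokens of Gemma 4 31B to run the full index and costs about 21x more. Both models are Apache 2.0 licensed, support 262K context, and accept multimodal input. Hallucination rates fell sharply from the previous generation while accuracy stayed roughly flat. The larger Plus and Max Preview versions have not been released as open weights.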
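A back-of-the-envelope sketch of two calculations in this post: the memory needed to store the weights at a given precision, and the token and cost ratios against Gemma 4 31B. The memory function assumes a nominal 27B parameter count and counts weights only (no KV cache or activations); the function names are hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Storage for model weights in GB (10^9 bytes); excludes KV cache and activations."""
    return n_params * bits_per_param / 8 / 1e9


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


print(f"BF16:  {weight_memory_gb(27e9, 16):.1f} GB")
print(f"4-bit: {weight_memory_gb(27e9, 4):.1f} GB")

# Output tokens (millions) and cost (USD) to run the Intelligence Index, from the post.
print(f"tokens vs Gemma 4 31B: ~{ratio(144, 39):.1f}x")
print(f"cost vs Gemma 4 31B:   ~{ratio(659, 31):.0f}x")
```

The nominal BF16 figure is slightly below the ~56GB quoted in the post, which may reflect an actual parameter count somewhat above 27B. The 4-bit figure is consistent with the post's note that quantized weights fit on hardware with 16GB+ of RAM.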

Artificial Analysis @ArtificialAnlys · Apr 30 · 56

Tencent has released Hy3-preview, an open weights reasoning model scoring 42 on the Artificial Analysis Intelligence Index, trailing recent open weights peers Hy3-preview is the latest model from @TencentHunyuan. It is a 295B total / 21B active parameter Mixture-of-Experts model, smaller than its December 2025 predecessor Tencent HY 2.0 (406B total / 32B active). Recent leading open weights reasoning models include Qwen3.6 27B (Reasoning, 46), DeepSeek V4 Flash (Reasoning, Max Effort, 47, 284B / 13B) and GLM-5.1 (Reasoning, 51, 744B / 40B). The Intelligence Index is the Artificial Analysis synthesis metric incorporating 10 evaluations covering agentic tasks, coding and scientific reasoning. Key takeaways: ➤ Hy3-preview trails recent open weights peers on GDPval-AA. Hy3-preview scores an Elo of 1235 on GDPval-AA, our agentic real-world work tasks benchmark, behind Qwen3.6 27B (Reasoning, 1414), DeepSeek V4 Flash (Reasoning, Max Effort, 1388) and GLM-5.1 (Reasoning, 1535). GDPval-AA tests models on real-world tasks across 44 occupations and 9 major industries. ➤ Hy3-preview ties GLM-5.1 (Reasoning) on CritPt despite scoring nearly 10 Intelligence Index points lower. Hy3-preview scores 4.6% on CritPt (research-level physics), matching GLM-5.1 (Reasoning, 51 on the Intelligence Index) and ahead of Qwen3.6 27B (Reasoning, 1.1%) but behind DeepSeek V4 Flash (Reasoning, Max Effort, 7.1%). It trails the open weights leaders, including DeepSeek V4 Pro (Reasoning, Max Effort, 12.9%) and Kimi K2.6 (8.0%). ➤ Hy3-preview used ~125M output tokens to run the Intelligence Index. This is ~12% more than GLM-5.1 (Reasoning, 112M) and less than Qwen3.6 27B (Reasoning, 144M) and DeepSeek V4 Flash (Reasoning, Max Effort, 241M). ➤ AA-Omniscience is a relative weakness compared to peers. Hy3-preview scores -35 on the Artificial Analysis Omniscience Index with 28% accuracy and an 87% hallucination rate. 
This trails DeepSeek V4 Flash (Reasoning, Max Effort, -23), Qwen3.6 27B (Reasoning, -20) and GLM-5.1 (Reasoning, 2). Other information: ➤ Size: 295B total parameters, 21B active parameters ➤ Context window: 256K tokens ➤ License: Tencent HY Community License Agreement, with restricted commercial use ➤ Availability: Weights are available on @huggingface Face and the model is also available on @SiliconFlowAI at $0/$0 per 1M input/output tokens

译腾讯发布开源混合专家模型Hy3-preview,总参数量2950亿,激活参数量210亿。其在Artificial Analysis综合智能指数上得分42,落后于近期开源的GLM-5.1、DeepSeek V4 Flash及Qwen3.6 27B等推理模型。具体评测表现不均衡:在真实世界任务基准GDPval-AA上落后于主要竞品,但在研究级物理评测CritPt上与高分模型GLM-5.1持平;其相对弱项在于AA-Omniscience指数,幻觉率较高。模型采用Tencent HY社区许可协议,商业使用受限,已在Hugging Face和SiliconFlowAI平台提供。
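
The token-usage comparison in the post above can be checked with a little arithmetic. The following is a minimal Python sketch that assumes only the rounded figures quoted in the post; the dictionary and helper names are illustrative and are not part of any Artificial Analysis tooling.

```python
# Output tokens (in millions) used to run the Intelligence Index,
# copied from the rounded figures quoted in the post.
OUTPUT_TOKENS_M = {
    "Hy3-preview": 125,
    "GLM-5.1 (Reasoning)": 112,
    "Qwen3.6 27B (Reasoning)": 144,
    "DeepSeek V4 Flash (Reasoning, Max Effort)": 241,
}


def percent_more(a: float, b: float) -> float:
    """Return how much larger a is than b, in percent (negative if smaller)."""
    return (a / b - 1.0) * 100.0


def rank_by_tokens(usage: dict) -> list:
    """Return model names ordered from fewest to most output tokens."""
    return sorted(usage, key=usage.get)


if __name__ == "__main__":
    hy3 = OUTPUT_TOKENS_M["Hy3-preview"]
    for name in rank_by_tokens(OUTPUT_TOKENS_M):
        if name == "Hy3-preview":
            continue
        diff = percent_more(hy3, OUTPUT_TOKENS_M[name])
        print(f"Hy3-preview vs {name}: {diff:+.1f}% output tokens")
```

Because the inputs are rounded, the percentages are only approximate; the GLM-5.1 comparison comes out close to the "~12% more" stated in the post.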

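阿绎 AYi@AYi_AInotes · April 30 · 64

Honestly, I really did not expect to see Baidu in first place, hahaha.

Translated summary: The LMArena text leaderboard shows Baidu's ERNIE 5.1 Preview scoring 1476, ranking first among Chinese models and making it the only Chinese model in the global top 15, ahead of GPT-5.5 and others. Although current AI attention centers on agents, multimodality and similar areas, DeepSeek V4 and ERNIE 5.1 Preview still focus on text. The post argues that text ability is the foundation of large models: capabilities such as coding, reasoning and multimodality all "grow" out of it, so gaps in text ability directly determine higher-level capability, and text remains the key dividing line for measuring the gap between models.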

SemiAnalysis@SemiAnalysis_ · April 30 · 53

GB300 NVL72 Rack Scale Dynamo SGLang disaggregation has up to 6.5x better performance than B200 on DeepSeekv4 Pro 1.6T 🚀 The high throughput configuration uses @deepseek_ai's MegaMoe kernels, which fully fuse and overlap EP dispatch, EP combine and the GEMMs into a single kernel. This performance was achieved by the 10x engineers @BanghuaZ, Tom & the rest of the team at @radixark, @lmsysorg & @NVIDIAAI, who rapidly enabled it! Big shoutout to @CoreWeave for contributing temporary GB300 NVL72 racks towards the open source performance optimization for all to benefit!

Translated summary: On the DeepSeek-V4 Pro 1.6T model, a GB300 NVL72 system using rack-scale disaggregation delivers up to 6.5x the performance of B200. The high-throughput configuration relies on DeepSeek's MegaMoe kernels, which fully fuse and overlap expert dispatch, expert combine and the GEMM operations in a single kernel. The result was enabled quickly by engineers at Radixark, LMSYS and NVIDIA AI. CoreWeave contributed temporary GB300 NVL72 racks to this open source performance optimization so that the whole community benefits.

Ethan Mollick@emollick · April 30 · 56

Gemini now can create documents, and it is a nice start, but not up to the frontier yet, as you can see from my "LBO of Hogwarts" test. PowerPoints are substantially worse than NotebookLM, spreadsheets are primitive, still no thinking trace, it doesn't think hard enough, either.


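向阳乔木@vista8 · April 29 · 37

Tested it, and it really is not good. It feels like a separately trained image model? It is incredibly fast, with no thinking process, pure System 1 output straight from gut feeling, hahaha.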

All AI Updates
Full feed of AI-related news
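May 6
04:31
Epoch AI@EpochAIResearch
49
The recipe for "classic" reasoning benchmarks is simple: text-only, several-hour time horizons, easy to grade, with expert human baselines. What next? In this week's Gradient Update, @GregHBurnham argues it's as easy as dropping one of these four ingredients.
Trends · Evals/Benchmarks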
03:57
Rohan Paul@rohanpaul_ai
68
GPT-5.5 and Opus 4.7 score below 1% on ARC-AGI-3
Anthropic · OpenAI · Reasoning/Inference · Evals/Benchmarks
02:57
Artificial Analysis@ArtificialAnlys
58
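MiniMax-M2.7 launches on six inference providers, with significant differences in speed and price

MiniMax-M2.7 is now available from six inference providers, with clear differences in speed and price. SambaNova leads on speed at 435 output tokens/s, more than 3x faster than any other provider, but its price is also about 2x higher. Fireworks, Novita Labs, Together AI and GMI Cloud match MiniMax's first-party API pricing. The analysis notes that Fireworks and SambaNova sit on the Pareto frontier of speed versus price: the former offers strong value, while the latter trades a higher price for maximum speed. Cache discount policies also differ by provider, which materially affects costs for cache-heavy workloads. The best choice therefore depends on how sensitive a given workload is to latency and cost.

Reasoning/Inference · Evals/Benchmarks · Deployment/Engineering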
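May 5
23:56
Luma@LumaLabsAI
71
Luma Labs' multimodal models UNI-1.1-Max and UNI-1.1 place the lab third in the Image Arena across text-to-image and image editing combined, without using agentic search. Specifically, in the Text-to-Image Arena the two models rank sixth and seventh respectively. In the multi-image and single-image editing arenas both place in the top eleven, with UNI-1.1-Max ranking seventh in single-image editing. The result marks solid progress by Luma Labs at the multimodal frontier.

Arena.ai: Exciting news: UNI-1.1-Max and UNI-1.1 debuts making @LumaLabsAI the #3 lab in the Image Arena across both Text-to-Image...

Image Generation · Model Release · Evals/Benchmarks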
23:25
Deedy@deedydas
62
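The creators of SWE-Bench just released a very simple new benchmark on which every LLM scores 0. ProgramBench asks: can models rebuild real executable programs (ffmpeg, SQLite, ripgrep) from scratch without internet access? We are nowhere near saturation on model quality.
Reasoning/Inference · Coding · Evals/Benchmarks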
11:25
OpenRouter@OpenRouter
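Featured · 65
We analyzed GPT 5.5 versus GPT 5.4 and found that costs increased by 49-92%. The impact of GPT 5.5's doubled price is partly offset because the model generates 19-34% fewer completion tokens on long prompts. More analysis: https://openrouter.ai/announcements/gpt55-cost-analysis
OpenAI · Evals/Benchmarks

Why it's recommended: OpenRouter breaks down the real cost of GPT 5.5. The price doubling is partly offset by fewer output tokens, leaving a net cost increase of 49-92%, a calculation anyone using the API needs to run.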
08:14
Berryxia.AI@berryxia
58
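Grok 4.3 recently showed leading reasoning ability in law and finance on Vals AI's private benchmarks. On CaseLaw (v2), which is built on real Canadian court cases, it reached 79.31% accuracy, surpassing GPT-5.1. On CorpFin (v2), based on complex multi-page credit agreements, it reached 68.53% accuracy. These tests target difficult real-world tasks such as deep legal reasoning and understanding of financial contracts. The post presents the results as evidence of Grok 4.3's strength in high-stakes real-world domains and of xAI's goal of building a world-class reasoning engine.

X Freeze: Grok 4.3 just became the smartest AI in the world at law and money It took #1 on TWO brutal private tests no other model...

OpenAI · xAI · Reasoning/Inference · Evals/Benchmarks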
05:25
SemiAnalysis@SemiAnalysis_
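Featured · 71
GB300 NVL72 measures 2.7x the performance of GB200, underscoring the value of end-to-end testing

In tests on vLLM, an industry-standard inference engine, NVIDIA GB300 NVL72 delivered 2.7x the measured end-to-end performance of GB200 NVL72. On paper it offers only about 1.5x the NVFP4 compute, 1.5x the HBM capacity and the same bandwidth. Yet in the mid-range load regime where most providers actually operate, compounding gains from full-stack optimization produce a speedup well beyond the theoretical compute increase. The tests ran on temporary GB300 systems provided to the open source project by NVIDIA, Inferact and CoreWeave. The results support the view that measured end-to-end performance, rather than paper compute figures alone, is the gold standard for judging hardware.

Reasoning/Inference · Evals/Benchmarks · Deployment/Engineering

Why it's recommended: GB300 has only 50% more FP4 compute on paper, yet real inference runs 2.7x faster. The compounding gains from full-stack optimization look far better than the spec sheet, so inference providers should recalculate their TCO.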
04:57
swyx 🇸🇬@swyx
61
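I see a lot of people saying Opus 4.7 is a net regression from 4.6, but these seem to be isolated cases. Both offline and online evals point to clear improvement. So what is not being captured? "Personality"?
Anthropic · Expert Opinion · Evals/Benchmarks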
03:18
Artificial Analysis@ArtificialAnlys
69
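A new anonymous model has debuted at #8 in the Artificial Analysis Text to Image Arena! Peanut's weights are expected to be released soon, which would make it the leading open weights text-to-image model, ahead of Z-Image Turbo, Qwen-Image and FLUX.2 [dev]. More details (and weights) are coming soon. See the 🧵 below for sample Peanut generations from the Artificial Analysis Image Arena.
Image Generation · Open Source/Repos · Model Release · Evals/Benchmarks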
00:45
Elon Musk@elonmusk
41
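On private benchmarks from "Vals AI", Grok 4.3 showed leading performance in law and finance. It ranked first on CaseLaw (v2) with 79.31% accuracy, outperforming GPT-5.1. That test is based on real Canadian court cases and evaluates deep legal reasoning and understanding of precedent. It also took first place on CorpFin (v2) with 68.53% accuracy. That test targets complex long-term credit agreements and evaluates understanding of the terms and risks in multi-page financial contracts. The post says these tests, which simulate high-stakes real-world challenges, show that Grok 4.3 has outstanding reasoning ability on the hardest tasks, and that xAI is working to build the reasoning engine the world needs.

X Freeze: Grok 4.3 just became the smartest AI in the world at law and money It took #1 on TWO brutal private tests no other model...

xAI · Reasoning/Inference · Evals/Benchmarks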
00:26
Epoch AI@EpochAIResearch
46
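The predicament and future direction of AI benchmarks

Responding to the pessimistic claim that "AI benchmarks are broken", the discussants push back and explore what the next generation of AI benchmarks might look like. Core topics include the costs and benefits of benchmark development, building scalable benchmarks (such as MirrorCode), how AI itself accelerates benchmark development, and the gap between current benchmarks and real-world capability. The conversation also touches on the feasibility of building benchmarks for artificial general intelligence (AGI) and looks ahead to more comprehensive evaluation methods beyond automated grading.

Data/Training · Evals/Benchmarks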
May 4
23:48
Chubby♨️@kimmonismus
62
Bach-1.0 Preview, a text-to-video model from the startup Video Rebirth, debuts at #6 on the Artificial Analysis Text to Video Leaderboard. Its performance is comparable to well-known models such as Vidu Q3 Pro, Kling 3.0 Omni 1080p (Pro) and grok-imagine-video. The model is planned for broad release in late May.

Artificial Analysis: Bach-1.0 Preview from Video Rebirth debuts at #6 on the Artificial Analysis Text to Video Leaderboard (No Audio)! Bach-1...

Model Release · Video · Evals/Benchmarks
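May 3
19:21
Ethan Mollick@emollick
57
Benchmarking frontier agents on longer tasks is getting increasingly hard. Repeated measurements are very expensive, and there is a difference between using a model inside a harness and using it through the API. I suspect benchmarks underestimate progress: they were designed for models, not for harnessed agents.
Agents · Expert Opinion · Trends · Evals/Benchmarks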
06:17
Eric@ericmitchellai
50
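Comparing how GPT-5.4 and GPT-5.5 teach, a user points out a key difference in how the two explain concepts. GPT-5.4 tends to describe the concept first and then make the learner go back and attach the label, which adds cognitive load. GPT-5.5 is clearer: it gives the explicit label first (e.g. "derivative") and immediately follows with the explanation (e.g. "describes the rate of change"). This label-first structure keeps the explanation fluent, with no need to backtrack and reassemble information, and so holds the learner's attention better over long teaching conversations.

Chris: This helped me appreciate GPT-5.5 vs 5.4 even more. "Explain, calculus, short and sweet" I've been testing educational p...

OpenAI · Evals/Benchmarks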
May 2
15:44
Chubby♨️@kimmonismus
51
Nice! Google is gearing up for I/O. New models are coming soon.

can: 🚨 Google updated Gemini 3 Flash in arena It still has the same name "Gemini 3 Flash". However, output quality is two ti...

Google · Model Release · Evals/Benchmarks
15:41
Elon Musk@elonmusk
39
Grok Voice is now being used by Starlink [Quoting @XFreeze]: Grok Voice dominates the τ-voice benchmark. Grok scores 67.3%, while Gemini sits at 43.8% and GPT Realtime at 35.3%. That is far ahead of the competition, by a wide margin. The best real-time reasoning voice assistant right now.

X Freeze: Grok Voice brutally dominates the top of the τ-voice Bench Grok scores 67.3%, while Gemini sits at 43.8% and GPT Realtim...

xAI · Evals/Benchmarks · Voice
13:49
TestingCatalog News 🗞@testingcatalog
66
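Google's new Gemini Flash model has appeared on LM Arena. Meanwhile, Vertex AI customers received an email saying Gemini 3.1 Flash Lite will soon be generally available. The quoted post notes that although the model still shows as "Gemini 3 Flash" in the arena, its output quality has jumped two tiers and is now closer to the current Gemini 3.1 Pro, a major upgrade. The actual version may be 3.1, 3.2 or 3.5 Flash.

can: 🚨 Google updated Gemini 3 Flash in arena It still has the same name "Gemini 3 Flash". However, output quality is two ti...

Google · Model Release · Evals/Benchmarks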
06:47
François Chollet@fchollet
37
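If you want to help the world understand AGI and accelerate its arrival, consider joining the ARC Prize Foundation. Two roles are currently open: Game Platform Engineering Lead, and Model Testing & Analysis Lead https://arcprize.org/jobs
Industry News · Evals/Benchmarks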
05:47
François Chollet@fchollet
56
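The latest batch of models still scores below 1% on ARC-AGI-3. What will scores reach by the end of this year?

ARC Prize: GPT-5.5 & Opus 4.7 on ARC-AGI-3 - GPT-5.5: 0.43% - Opus 4.7: 0.18% We found 3 failure modes: - True local effect, false ...

Anthropic · OpenAI · Reasoning/Inference · Evals/Benchmarks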
03:47
François Chollet@fchollet
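Featured · 70
Reinforcement learning improves model performance in known domains, but in unfamiliar domains it can lead a model to hallucinate that it is performing some other task it was trained on. This shows up in ARC AGI 3 results for large models such as GPT-5.5, which scored just 0.43%, close to Claude 4.6, Gemini 3.1 and others. The analysis identifies GPT-5.5's main failure modes as: correct local effects but a false world model, abstractions drawn from training data at the wrong level, and solving a problem without reinforcing the reward mechanism. Studying such failures in depth helps build a full picture of large models' capability limits in specific modalities and of where they can improve.

Chris: GPT-5.5 Scores .43% on ARC AGI 3! - GPT-5.5: 0.43% - Opus 4.7: 0.18% - GPT-5.4: 0.20% - Claude 4.6: 0.45% - Gemini 3.1: ...

OpenAI · Expert Opinion · Reasoning/Inference · Evals/Benchmarks

Why it's recommended: Chollet uses the cold numbers from ARC AGI 3 to expose the limits of RL. GPT-5.5's 0.43% score shows that in unfamiliar domains models can do something entirely unrelated, which hits harder than any safety paper.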
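May 1
19:15
PixVerse@PixVerse_
49
The main post thanks user @TomLikesRobots for sharing a comparison of text-to-video models. The comparison is between SeeDance 2.0 and HappyHorse 1.0, using one shared prompt to generate video with a lo-fi, cozy, cel-style anime aesthetic. HappyHorse is provided by @PixVerse_ and is currently free for members. Because both models' built-in audio was poor, the creator ended up using @Suno to generate the background track.

TomLikesRobots🤖: SeeDance 2.0 vs HappyHorse 1.0 Very quick text-to_video comparison. Which do you prefer? Universal Prompt: "Aesthetic: l...

Multimodal · Evals/Benchmarks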
15:47
TestingCatalog News 🗞@testingcatalog
55
Grok 4.3 rises to #7 on the Artificial Analysis Index, overtaking Meta's Muse Spark.

Artificial Analysis: This release shows increased cost efficiency to run the Artificial Analysis Intelligence Index, with Grok 4.3 sitting co...

xAI · Reasoning/Inference · Evals/Benchmarks
14:40
Rohan Paul@rohanpaul_ai
43
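LongCat team releases LARYBench, a benchmark for whether AI models truly learn actions from video

The LongCat team has introduced LARYBench, a benchmark designed to evaluate whether AI models truly learn actions from video rather than merely performing well in a downstream robot policy. It focuses on the latent action representations that models extract from video and, using data that includes more than 1.2 million video clips, splits evaluation into two clear tests: action classification and control regression. The key finding is that general self-supervised vision models (such as V-JEPA 2 and DINOv3) outperform specialized embodied models. This suggests that strong visual representations already encode rich action knowledge and that latent feature spaces map to robot control better than pixel reconstruction does. The work points to a new way of using abundant video data to address the scarcity of robot training data.

Embodied AI · Papers/Research · Evals/Benchmarks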
14:14
Artificial Analysis@ArtificialAnlys
57
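Three leading open weights models launched last week, narrowing the gap to the top proprietary model to within 6 points

Last week saw the release of three leading open weights models, Kimi K2.6, MiMo V2.5 Pro and DeepSeek V4 Pro, which score 52-54 on the Artificial Analysis Intelligence Index. That brings the top of the range to within 6 points of the leading proprietary model, GPT-5.5, at 60, a marked improvement over open weights models a year ago, which scored 22. All three are trillion-parameter-scale MoE architectures. However, open weights models still clearly lag proprietary ones in complex reasoning, agentic coding and knowledge accuracy. For example, they score far lower on specialized evaluations such as HLE, CritPt and TerminalBench Hard, and on the Omniscience evaluation DeepSeek V4 Pro's hallucination problem is especially pronounced.

DeepSeek · OpenAI · Open-Source Ecosystem · Reasoning/Inference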
08:44
elvis@omarsar0
58
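DeepSeek-V4-Pro impresses on agentic coding tasks

A tester used DeepSeek-V4-Pro in the Pi coding agent to build an LLM knowledge base and was struck by its out-of-the-box performance. The post calls it the first open weights model to rival Claude and Codex in reasoning, while being cost-effective and supporting a 1M-token context. It runs directly in a basic harness without complex configuration and is strong at agentic coding and knowledge-intensive reasoning, handling multi-step research, code generation and contextual reasoning across company docs, forums, papers and codebases. Its efficiency is attributed to Fireworks' inference, described as the fastest on the market, and to a hybrid attention design that cuts the KV cache to 10% and reduces inference compute by nearly 4x, enabling fast, low-cost deployment in practice.

Agents · DeepSeek · Open-Source Ecosystem · Reasoning/Inference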
08:16
Ethan Mollick@emollick
61
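xAI has launched Grok 4.3, which scores 53 on the Artificial Analysis Intelligence Index, outperforming Grok 4.20, Muse Spark and others. The core improvement is value for money: input and output prices are about 40% and 60% lower than the previous version, and the cost to run the benchmark suite has fallen. The release improves markedly on real-world agentic tasks such as GDPval-AA, with strong instruction following and customer-service performance. The post notes, however, that it still trails the latest Chinese open weights models, and criticizes GDPval-AA itself as being of limited value.

Artificial Analysis: xAI has launched Grok 4.3, achieving 53 on the Artificial Analysis Intelligence Index with improved agentic performance,...

Expert Opinion · Industry News · Evals/Benchmarks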
07:45
OpenRouter@OpenRouter
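Featured · 68
@xai's new model Grok-4.3 is now live on OpenRouter! Grok-4.3 launches at a lower price than Grok-4.2 while making a big jump in agentic performance: its Elo on @ArtificialAnlys's GDPval-AA benchmark rose 321 points to 1500, surpassing other top models despite the lower price.
Agents · xAI · Model Release · Evals/Benchmarks

Why it's recommended: Grok-4.3 cuts prices while improving performance, with its agentic score reaching 1500. If you skipped Grok before because of cost, this is a good time to try it.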
04:39
Rohan Paul@rohanpaul_ai
58
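Frontier AI can autonomously carry out complex end-to-end cyberattacks at superhuman speed

Frontier AI can now autonomously complete complex, expert-level, end-to-end cyberattack chains at superhuman speed and near-zero marginal cost. In AISI's cybersecurity evaluations, GPT-5.5 performed on par with Mythos Preview, and both far outperformed earlier models such as GPT-4o. GPT-5.5 completed a 32-step corporate network attack simulation end-to-end, a task that takes human experts about 20 hours. On a reverse engineering task that takes a human expert 12 hours, GPT-5.5 finished in just 11 minutes at a cost of $1.73.

AI Security Institute: OpenAI's GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

OpenAI · Safety/Alignment · Evals/Benchmarks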
04:12
Chubby♨️@kimmonismus
60
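Local LLM game-dev showdown: Gemma 4 31B beats Qwen 3.6 27B on efficiency and logic

In a local LLM game development contest on @atomic_chat_hq, Gemma 4 31B faced Qwen 3.6 27B on a MacBook Pro M5 Max. Although Qwen generated faster (32 tokens/s) and gave more creative answers, Gemma produced a shorter, clearer and more logical answer using only 3 minutes 51 seconds and 6,209 tokens. In the Pac-Man game logic specifically, Gemma did better on click response, wall and ghost interactions, and particle effects. The author stresses that this was a single test, notes that Qwen might do better with different settings, and invites the community to verify.

Open-Source Ecosystem · Reasoning/Inference · Evals/Benchmarks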
03:14
Artificial Analysis@ArtificialAnlys
65
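Ant Group open-sources Ling 2.6 1T, balancing cost-effectiveness and intelligence

Ant Group's InclusionAI lab has released Ling 2.6 1T, an open weights non-reasoning model. It has 1 trillion parameters and scores 34 on the Artificial Analysis Intelligence Index, 15 points above its predecessor Ling-1T and close to peers such as DeepSeek V3.2. It is solid on scientific reasoning and knowledge tasks, scoring 75% on GPQA. It is also efficient, using only about 16M output tokens to run the Index, and running the full Index through the first-party API costs about $95. Its factual reliability is weaker, however: it scores -51 on AA-Omniscience, driven mainly by a 92% hallucination rate. The weights are publicly available on Hugging Face.

Open-Source Ecosystem · Evals/Benchmarks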
03:14
Artificial Analysis@ArtificialAnlys
46
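GPT-5.5 Pro gains slightly at lower cost and leads a frontier science evaluation

On CritPt, a frontier science evaluation, GPT-5.5 Pro (xhigh) improved by 0.5 percentage points to 30.5% at 60% lower cost and token usage than its predecessor GPT-5.4 Pro (xhigh). CritPt consists of graduate-level physics problems contributed by more than 60 researchers from over 30 institutions worldwide. Since its release in November 2025, the top score has risen from 9% for Gemini 3 Pro Preview to 30% for GPT-5.4 Pro. According to OpenAI, GPT-5.5 Pro uses more compute than GPT-5.5 to think more deeply and deliver more consistently high-quality answers. The model's per-token price is unchanged, but it completed the evaluation using fewer tokens.

OpenAI · Reasoning/Inference · Evals/Benchmarks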
02:42
Chubby♨️@kimmonismus
46
Is GPT-5.5 on par with Claude Mythos on multi-step cyberattack simulations? OpenAI: comeback of the year.

AI Security Institute: OpenAI's GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-end 🧵

Anthropic · OpenAI · Safety/Alignment · Evals/Benchmarks
01:44
Sam Altman@sama
43
lisan, say more bad things about us, you're being too nice [Quoting @scaling01]: GPT-5.5 is on par with Claude Mythos - GPT-5.5 average pass rate 71.4% (±8.0%) - Mythos Preview 68.6% (±8.7%) - GPT-5.5 completed a task that takes a human expert about 12 hours in 11 minutes at a cost of $1.73

Lisan al Gaib: GPT-5.5 is on par with Claude Mythos - GPT-5.5 average pass rate of 71.4% (±8.0%) - Mythos Preview 68.6% (±8.7%) - GPT-5...

OpenAI · Expert Opinion · Evals/Benchmarks
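
One rough way to read the "on par" claim in the quoted post is to check whether the two ± ranges overlap, and to work out the speedup implied by "12 hours versus 11 minutes". The Python sketch below is illustrative only. It assumes the ± figures are symmetric half-widths of an interval (the post does not say what they represent), and the class and function names are hypothetical. Interval overlap is a quick heuristic, not a significance test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassRate:
    mean: float        # average pass rate, in percent
    half_width: float  # the "±" figure quoted in the post (assumed symmetric)

    def interval(self) -> tuple:
        """Return the (low, high) bounds implied by mean ± half_width."""
        return (self.mean - self.half_width, self.mean + self.half_width)


def intervals_overlap(a: PassRate, b: PassRate) -> bool:
    """True if the two ± ranges share at least one point."""
    a_lo, a_hi = a.interval()
    b_lo, b_hi = b.interval()
    return a_lo <= b_hi and b_lo <= a_hi


def speedup(human_hours: float, model_minutes: float) -> float:
    """Ratio of human time to model time, with both converted to minutes."""
    return human_hours * 60.0 / model_minutes


# Figures as quoted in the post.
GPT_5_5 = PassRate(mean=71.4, half_width=8.0)
MYTHOS_PREVIEW = PassRate(mean=68.6, half_width=8.7)

if __name__ == "__main__":
    print("Ranges overlap:", intervals_overlap(GPT_5_5, MYTHOS_PREVIEW))
    print(f"Implied speedup: about {speedup(12, 11):.0f}x")
```

With the quoted figures the two ranges overlap substantially, which is consistent with describing the models as roughly on par rather than one clearly ahead.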
00:13
Artificial Analysis@ArtificialAnlys
64
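Alibaba releases open weights Qwen3.6 models; the 27B version becomes the strongest open weights model under 150B parameters

Alibaba has open-sourced two Qwen3.6 models: a 27B dense model and a 35B A3B Mixture-of-Experts model. Qwen3.6 27B scores 46 on the Artificial Analysis Intelligence Index, making it the most intelligent open weights model under 150B parameters, ahead of Gemma 4 31B and others. However, it used about 3.7x as many output tokens as Gemma 4 31B to run the full evaluation and cost about 21x more. Both models are released under Apache 2.0, support a 262K context window, and are multimodal. Notably, the hallucination rate fell sharply from the previous generation while accuracy stayed roughly flat. The larger Plus and Max Preview versions are not open-sourced.

Multimodal · Open-Source Ecosystem · Reasoning/Inference · Evals/Benchmarks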
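April 30
22:11
Artificial Analysis@ArtificialAnlys
56
Tencent releases open weights reasoning model Hy3-preview; its overall score of 42 trails recent peers

Tencent has released Hy3-preview, an open weights Mixture-of-Experts model with 295B total and 21B active parameters. It scores 42 on the Artificial Analysis Intelligence Index, trailing recent open weights reasoning models such as GLM-5.1, DeepSeek V4 Flash and Qwen3.6 27B. Results are uneven: it trails its main peers on the real-world task benchmark GDPval-AA but ties the higher-scoring GLM-5.1 on the research-level physics evaluation CritPt. Its relative weakness is the AA-Omniscience Index, where its hallucination rate is high. The model is released under the Tencent HY Community License Agreement with restricted commercial use, and is available on Hugging Face and SiliconFlowAI.

Open Source/Repos · Reasoning/Inference · Model Release · Evals/Benchmarks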
19:10
阿绎 AYi@AYi_AInotes
64
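The LMArena text leaderboard shows Baidu's ERNIE 5.1 Preview scoring 1476, ranking first among Chinese models and making it the only Chinese model in the global top 15, ahead of GPT-5.5 and others. Although current AI attention centers on agents, multimodality and similar areas, DeepSeek V4 and ERNIE 5.1 Preview still focus on text. The post argues that text ability is the foundation of large models: capabilities such as coding, reasoning and multimodality all "grow" out of it, so gaps in text ability directly determine higher-level capability, and text remains the key dividing line for measuring the gap between models.

Berryxia.AI: Saw a piece of news today that is easy to scroll past, but the more I think about it the more interesting it gets. In the latest LMArena text leaderboard update, ERNIE 5.1 Preview scored 1476: first in China, the only Chinese model in the global top 15, and ahead of GPT-5.5 and DeepSeek-V4-Pro. This in itself...

DeepSeek · Evals/Benchmarks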
16:09
SemiAnalysis@SemiAnalysis_
53
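GB300 NVL72 delivers up to 6.5x the performance of B200 on DeepSeek-V4 Pro

On the DeepSeek-V4 Pro 1.6T model, a GB300 NVL72 system using rack-scale disaggregation delivers up to 6.5x the performance of B200. The high-throughput configuration relies on DeepSeek's MegaMoe kernels, which fully fuse and overlap expert dispatch, expert combine and the GEMM operations in a single kernel. The result was enabled quickly by engineers at Radixark, LMSYS and NVIDIA AI. CoreWeave contributed temporary GB300 NVL72 racks to this open source performance optimization so that the whole community benefits.

DeepSeek · Reasoning/Inference · Evals/Benchmarks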
06:08
Ethan Mollick@emollick
56
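Gemini now can create documents, and it is a nice start, but not up to the frontier yet, as you can see from my "LBO of Hogwarts" test. PowerPoints are substantially worse than NotebookLM, spreadsheets are primitive, still no thinking trace, it doesn't think hard enough, either.
Google · Evals/Benchmarks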
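April 29
22:45
向阳乔木@vista8
37
Tested it, and it really is not good. It feels like a separately trained image model? It is incredibly fast, with no thinking process, pure System 1 output straight from gut feeling, hahaha.

Elaina: Tested using this image.

Image Generation · Evals/Benchmarks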