When I read all the posts about how surprised everyone is that GLM-5.2 is really as good as claimed, and numerous benchmarks support this (usually just behind GPT-5.5 and Opus 4.8 in 3rd place), I can even imagine that the founder isn't exaggerating when he claims to be able to release a Mythos class model this year.

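AYi @AYi_AInotes · June 21 · 78

Wow: GPT 5.5, Claude Opus 4.8, and Gemini 3.5 Flash, the latest flagships, all in one place, completely free, with no separate subscriptions. Meituan recently and quietly launched an app called tabbit (international edition) that bundles several top-tier models: Claude, Gemini, and GPT, all the latest flagship versions. The Chinese models are there too: Kimi, GLM, and MiniMax. The one catch: you must download the international edition. The domestic edition only has Chinese models and does not include the big three. The logic is simple: Meituan wants to become an AI entry point, so it is spending money up front to fill out the model lineup and grow the user base. For now you can use several of the most expensive models side by side without paying anything. If you did not want to switch back and forth between providers or pay for a pile of subscriptions, this is a convenient option. Use it while it is still free. Thanks to @fengdu2077 for digging up such a good tool. Tested and it works!

Summary: Meituan recently launched the international edition of its tabbit app, which bundles the latest flagship versions of several top AI models for free, including GPT-5.5, Claude Opus 4.8, and Gemini 3.5 Flash, alongside the Chinese models Kimi-2.6, GLM-5.1, and MiniMax-M3. No separate subscriptions are needed. Note that only the international edition includes the overseas models; the domestic edition offers Chinese models only. The app is a bid to become an AI entry point and is currently in a free promotional phase.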

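meng shao @shao__meng · June 21 · 26

Saw an LLM comparison poll someone started: GLM-5.2 vs Gemini 3.5 Flash. The result should be obvious, mainly because Gemini 3.5 Flash really cannot compete. What happened to Google DeepMind? After the stunning multimodal debut of Gemini 3.0, it has gone quiet ever since. What if we seriously compared the recent Chinese LLMs instead? Which one do you think is strongest?

Summary: meng shao discusses an LLM comparison poll between GLM-5.2 (Zhipu) and Gemini 3.5 Flash (Google DeepMind). He considers the outcome a foregone conclusion, says Gemini 3.5 Flash underperforms, and laments that Google has gone quiet since the impressive multimodal launch of Gemini 3.0. He closes by asking which of the current Chinese LLMs is strongest.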

Ethan Mollick @emollick · June 21 · 65

The interaction between AI & past scholarly work is going to get weird. Here I gave GPT-5.5 Pro a copy of my first published paper from grad school & asked it to find errors and update it. It found new data, analyzed it, created reproducible files, extended the key argument...

Yuchen Jin @Yuchenj_UW · June 20 · 30

After using GLM-5.2 for a day, I’m surprised by how often it feels close to Opus 4.8/GPT-5.5 level. I compared it side by side with Opus 4.8, and sometimes I even preferred GLM-5.2’s results. OSS LLMs are impressive, especially given how many fewer GPUs they were trained on.

Ethan Mollick @emollick · June 20 · 51

I suspect that companies underestimate the value of using higher intelligence for tasks where weaker AIs seem to be good enough to hit KPIs at a lower price. At least build architectures where you can flexibly experiment with smarter models to see whether it makes a difference.
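One way to act on the advice above is to keep the model choice behind a single seam, so that the same tasks can be replayed against a cheaper and a smarter tier and scored side by side. The following is a minimal sketch under stated assumptions: `cheap_model`, `smart_model`, and `toy_score` are hypothetical stand-ins that run offline, not any vendor's API, and a real setup would wrap actual model clients and a real rubric.

```python
from typing import Callable, Dict, List

# A "model" here is any callable from prompt to answer; real clients would wrap an API.
Model = Callable[[str], str]

def compare_tiers(models: Dict[str, Model],
                  tasks: List[str],
                  score: Callable[[str, str], float]) -> Dict[str, float]:
    """Run every task through every configured model and return the mean score per tier."""
    results = {}
    for name, model in models.items():
        total = sum(score(task, model(task)) for task in tasks)
        results[name] = total / len(tasks)
    return results

# Hypothetical stand-ins so the sketch runs without network access.
def cheap_model(prompt: str) -> str:
    return prompt.upper()

def smart_model(prompt: str) -> str:
    return prompt.upper() + " (with caveats)"

def toy_score(task: str, answer: str) -> float:
    # Toy rubric: reward answers that mention caveats.
    return 1.0 if "caveats" in answer else 0.5

report = compare_tiers({"cheap": cheap_model, "smart": smart_model},
                       ["summarize q2 churn"], toy_score)
```

Because the tiers are just entries in a dictionary, trying a smarter model is a configuration change rather than a rewrite, which is the flexibility the post asks for.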

Rohan Paul @rohanpaul_ai · June 20 · 64

DeepAdapt has launched a runtime intelligence layer that cuts AI operating costs by up to 82% and delivers 33X faster inference by shifting repetitive workloads from GPUs to standard CPUs. They are calling it Adaptive Continual Intelligence, ACI.
ACI is a runtime learning layer where analytical learning, supervised learning, and reinforcement learning work together while the system is already in production. ACI is not caching, memory, a knowledge graph, routing, or a simple optimization trick. This technique learns from model decisions, corrections, labels, outcomes, and experience, then serves known decisions locally on CPU. Only new, uncertain, or complex requests are routed back to the underlying model. ACI can also be pre-trained for specific domains, making continual learning faster and cheaper.
DeepAdapt is rolling out first for cloud-based LLM agents, but the same architecture becomes even more important on personal devices, where compute, battery, latency, and local inference reliability are much tighter constraints.
In their benchmarks, ACI has shown up to 90% lower token consumption, 5.7X lower production-scale cost, 33X faster inference with 159 ms median latency, 96% accuracy vs. 85% without ACI, 85.7% lower energy per 1,000 decisions, and 4.8× fewer rule violations.
DeepAdapt intercepts user requests, serving known answers instantly from a standard CPU to completely bypass the expensive GPU. New questions go to the GPU, but the system logs the output and any human corrections to learn for the next time. This keeps the underlying language model entirely frozen while the outer software layer handles all real-time learning and auditing.
ACI requires zero training. No fine-tuning. No retraining pipelines. You wire it into your existing stack and it starts learning from real use on the very first request. Every improvement happens at runtime. The effect: GPU dependency and cost decrease as the system matures, and energy consumption drops proportionally.
In ACI-native agents, everything else becomes a tool inside the ACI runtime: the LLM, memory, tools, knowledge graphs, prompts, workflows, APIs, and external systems. ACI decides what can be handled locally, what should be learned, what must be enforced, and when the system actually needs to fall back to the model.
Inference is becoming one of AI’s biggest cost centers. Token prices may fall, but total AI bills keep rising because usage is exploding. The real leverage is avoiding unnecessary GPU calls altogether. With ACI, the LLM is no longer the center of the architecture, because ACI becomes the runtime intelligence layer that decides what can be inferred locally, what should be learned, what must be enforced, and when the model is actually needed. 🧵 1.

Summary: DeepAdapt has released ACI (Adaptive Continual Intelligence), a runtime learning layer that shifts repetitive workloads from GPUs to standard CPUs, cutting operating costs by up to 82% and speeding up inference 33X (159 ms median latency). ACI learns at runtime from model decisions, human corrections, and feedback; known requests are handled locally on CPU, and only uncertain or complex requests go back to the underlying LLM. Benchmarks: up to 90% lower token consumption, 5.7X lower production-scale cost, 96% accuracy (vs. 85% without ACI), 85.7% lower energy per 1,000 decisions, and 4.8X fewer rule violations. No fine-tuning or retraining is required; it plugs into an existing stack, and GPU dependency declines as the system matures. The architecture is launching first for cloud-based LLM agents and is expected to matter for personal devices as well.
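The "serve known decisions locally, fall back to the model otherwise" flow described in the post can be illustrated with a toy router. This is only a sketch of the general local-first pattern: the post states that ACI is not caching, and nothing here reflects DeepAdapt's actual implementation. The class name, the exact-match lookup, and both thresholds are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Tuple

class LocalFirstRouter:
    """Toy local-first router: answer locally when past outcomes agree, else call the model."""

    def __init__(self, model: Callable[[str], str], min_seen: int = 2, min_agree: float = 0.9):
        self.model = model                      # frozen underlying model (never updated here)
        self.history: Dict[str, Counter] = defaultdict(Counter)
        self.min_seen = min_seen                # hypothetical threshold: observations required
        self.min_agree = min_agree              # hypothetical threshold: agreement required

    def decide(self, request: str) -> Tuple[str, str]:
        """Return (answer, source) where source is 'local' or 'model'."""
        seen = self.history[request]
        total = sum(seen.values())
        if total >= self.min_seen:
            answer, count = seen.most_common(1)[0]
            if count / total >= self.min_agree:
                return answer, "local"
        answer = self.model(request)
        self.history[request][answer] += 1      # log the model output for next time
        return answer, "model"

    def correct(self, request: str, right_answer: str) -> None:
        """Record a human correction; it replaces what was learned for this request."""
        self.history[request] = Counter({right_answer: self.min_seen})

calls = []
def fake_model(request: str) -> str:
    # Stand-in for an expensive GPU-backed model call.
    calls.append(request)
    return "approve"

router = LocalFirstRouter(fake_model)
sources = [router.decide("refund under $20")[1] for _ in range(4)]
```

In this toy, the first two identical requests reach the model and later ones are answered locally. A production system would need generalization across similar requests and an audit trail, which exact-match lookup does not provide.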

Rohan Paul @rohanpaul_ai · June 19 · 81

This is really good. OpenAI just moved frontier-level health AI from premium reasoning models into the free GPT-5.5 Instant model. GPT-5.5 Instant now performs near OpenAI’s Thinking models on health evaluations, meaning the cheaper, faster default model is being trained to behave more like the slower models that spend extra computation checking their reasoning. The update targets the gap between a chatbot that sounds fluent and a health assistant that knows when to slow down, ask for missing details, admit uncertainty, and push the user toward care when symptoms look urgent. OpenAI says more than 230 million people ask ChatGPT health and wellness questions every week, so moving this capability into the free product changes the scale from premium assistance to mass access. From OpenAI's blog looks like they did a huge "distillation" to achieve this. i.e. a stronger teacher model and human experts create high-quality responses, and a cheaper student model learns the answer patterns without repeating the same expensive internal search every time. i.e. OpenAI's training loop was heavily physician-shaped: more than 260 doctors across 60 countries, 49 languages, and 26 specialties reviewed over 700,000 model responses and judged whether answers were accurate, cautious, clear, complete, and useful. OpenAI's likely mechanism seems to be a mix of supervised fine-tuning, where Instant is shown better answers, and preference training, where it learns which answer a physician-led rubric prefers when two outputs differ. The physician part is crucial because the target is not just “medical facts,” but clinical response behavior, such as asking for age, pregnancy status, duration, medication history, severe pain, breathing trouble, fever, neurological symptoms, or other missing context before giving guidance. 
So the strongest improvement is not medical trivia but behavior under uncertainty, because a good health answer often means saying what cannot be known yet, what context is missing, what red flags matter, and what the next safe step should be. OpenAI also reports 71% fewer flagged factuality issues in real health traffic over two months, which suggests the update is reducing wrong claims in everyday use rather than only improving benchmark scores.

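Summary: OpenAI has moved frontier health AI capability from premium reasoning models into the free GPT-5.5 Instant, bringing its health-evaluation performance close to the Thinking models. More than 230 million people ask ChatGPT health questions each week. According to the post, OpenAI appears to have used distillation: a stronger teacher model and 260+ physicians (across 60 countries, 49 languages, and 26 specialties) reviewed more than 700,000 model responses, and a student model learned the clinical response patterns. Training likely combined supervised fine-tuning and preference training, with an emphasis on behavior under uncertainty (for example, proactively asking for missing details such as age and symptoms). Flagged factuality issues in real health traffic fell by 71%. GPT-5.5 Instant is available to all free users.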
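The post guesses that preference training was part of the recipe, where the model learns which of two answers a physician-led rubric prefers. A common way to express that objective is a Bradley-Terry style pairwise loss; the sketch below shows that standard formula only. Whether OpenAI used this exact loss is not stated in the source, and the scalar scores are hypothetical stand-ins for a model's scores on the two answers.

```python
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(preferred - rejected)), computed in a numerically stable way."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        # log(1 + e^(-m)) is safe when m is non-negative
        return math.log1p(math.exp(-margin))
    # For negative m, rewrite as -m + log(1 + e^m) to avoid overflow
    return -margin + math.log1p(math.exp(margin))

# Hypothetical scores: the physician-preferred answer should score higher.
loss_when_right = preference_loss(2.0, 0.0)   # model already ranks the preferred answer higher
loss_when_wrong = preference_loss(0.0, 2.0)   # model ranks the rejected answer higher
```

The loss shrinks as the preferred answer's score pulls ahead, so minimizing it pushes the model toward the behavior reviewers rewarded, such as asking for missing context before giving guidance.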

🚨 AI News | TestingCatalog @testingcatalog · June 19 · 34

ClickUp is working on context compression for Brain! > Brain will be able to condense a complete workspace in the background, across docs, tasks, and history. > This allows Brain to reason over years of material the way a deep research agent would. > Responses still come back in seconds rather than minutes. It will be possible to point Brain at a multi-year audit, and it will trace relevant policy change, pull the supporting docs, and assemble a timeline without a manual search through the archive.

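Summary: ClickUp is building context compression for Brain. The feature condenses an entire workspace (docs, tasks, and history) in the background so that Brain can reason over years of material the way a deep research agent would, while still responding in seconds. For example, pointed at a multi-year audit, Brain can trace relevant policy changes, pull supporting docs, and assemble a timeline without a manual archive search.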

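AYi @AYi_AInotes · June 19 · 48

Anyone trading US stocks should know this site: Finviz. It is completely free, and its stock screener is more complete than a lot of paid software. Say you want stocks that are breaking above the 50-day moving average on high volume, sit one step from a new high, and still have insiders buying: it returns them in seconds. Fundamentals, technicals, insider trading: dozens of parameters you can combine freely. The best part is the heat map, which lays out the S&P 500 as colored blocks; red and green make gains and losses obvious at a glance, and one minute in the morning tells you which sectors money is flowing into. The interface is admittedly a bit ugly, like something from the Win98 era, and it lags under heavy use, but the free version is enough for daily use and I do not think you need to upgrade. The individual stock pages are clean too: earnings data, analyst ratings, recent insider buys and sells, and news, all on one page, so you do not have to jump between sites. It handles a lot of the grunt work of screening; what is left is your own judgment.

Summary: AYi recommends Finviz, a free stock-screening site for US equities. Its screener combines dozens of fundamental, technical, and insider-trading parameters, its S&P 500 heat map shows sector moves at a glance, and its stock pages gather earnings data, analyst ratings, insider transactions, and news on one page. The interface is dated and can lag, but the free version is considered sufficient for daily use.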

Berryxia.AI @berryxia · June 19 · 33

GLM-5.2 really has reached Opus 4.6 level this time. Awesome~~

Ethan Mollick @emollick · June 19 · 43

One of the key moments of the LLM era, along with GPT-3.5 and the decision by Microsoft to not take down Bing/Sydney/GPT-4 after the @kevinroose New York Times article.

Ethan Mollick @emollick · June 19 · 67

I have given AA a hard time about its previous agentic evaluation but this looks like a good and impressive benchmark for real world knowledge work that is unsaturated and had private hold out tests. This is one to watch - I didn’t see a human comparison score though?

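Summary: Ethan Mollick praises AA-Briefcase as a strong benchmark for real-world knowledge work that is unsaturated and has private hold-out tests, and asks whether there is a human comparison score. The benchmark, released by @ArtificialAnlys, tests models on multi-week, multi-task projects whose inputs include tens of thousands of Slack messages and thousands of emails. Rankings: Claude Fable 5 (no longer available) leads at 1587 Elo, Claude Opus 4.8 (1356) is second, and GLM-5.2 max (1266) is the next-best non-Anthropic model. The results underline the difficulty: the best model satisfies all criteria on only 3% of tasks, no model exceeds 50% on 31 of 91 tasks, and cost varies by about 800x.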

Artificial Analysis @ArtificialAnlys · June 19 · 55

Announcing AA-Briefcase, the benchmark for the next era of agentic knowledge work
AA-Briefcase is our new benchmark for testing models on long-horizon knowledge work tasks in complex projects built by industry experts. Models are evaluated on multi-week projects, each with many linked tasks and thousands of input source files.
We evaluated Claude Fable 5 from @AnthropicAI before it became unavailable, and it currently leads with an Elo score of 1587, followed by Claude Opus 4.8 (max, 1356), Opus 4.7, and the recently-released GLM 5.2 (max, 1266) from @Zai_org. Claude Fable 5 cost $31 on average to run each AA-Briefcase task, followed by Claude Opus 4.8 at $10.40, GPT-5.5 (xhigh) at $3.68 and GLM-5.2 (max) at $2.40.
AA-Briefcase comprises four private scenarios, each representing a multi-week knowledge work project set in a realistic organizational context. A public fifth scenario has been released via @huggingface as a representation of scenario structure, submission, and grading (AA-Briefcase Lite). This does not count toward official AA-Briefcase results, and is demonstrative only.
Key elements of AA-Briefcase:
➤ Realistic long-horizon projects: AA-Briefcase moves beyond single, disconnected prompts by evaluating models across a coherent long-horizon project. Tasks build week by week, draw on shared institutional context, and require deliverables such as financial models, board presentations, and design mock-ups
➤ Large volumes of fragmented context: AA-Briefcase requires models to reason across thousands of inputs, including company documents, meeting transcripts, large-scale data exports, 25,000+ Slack messages and 3,500+ emails. These sources are fragmented, messy, and often contain realistic contradictions, testing whether models can navigate the ambiguity of real-world knowledge work
➤ Composite rubric and pairwise grading: AA-Briefcase combines binary rubric checks for ground-truth correctness with pairwise grading on analytical quality and presentation quality. Unlike many evaluations that focus on a single metric, AA-Briefcase tests agentic capabilities more comprehensively, exposing cases where models produce outputs that look polished but are incorrect or lack analytical rigor
➤ Built by industry experts: AA-Briefcase scenarios mirror real-world knowledge work, with tasks developed over months by experts across data science, product management and corporate strategy from companies including Google, McKinsey & Company and BCG. Task challenges are drawn from professional experience, making AA-Briefcase more reflective of the ambiguity, messy context and competing priorities that define real-world knowledge work
Key results:
➤ Claude Fable 5 leads AA-Briefcase at 1587 Elo: This is followed by Claude Opus 4.8 (1356) with the next-best non-Anthropic model, GLM-5.2 (max), ~90 points back at 1266. Note that Claude Fable 5 did not use the Opus 4.8 fallback for any task in AA-Briefcase
➤ Cost per task varies by ~800x across models tested: Claude Fable 5 leads the benchmark but costs more than $31 per task on average, compared to ~$0.04 for DeepSeek V4 Flash (max). The strongest price/performance options are open weights models such as GLM-5.2 (max) and DeepSeek V4 Pro (max), with GLM-5.2 (max) scoring only ~90 Elo below Claude Opus 4.8 (max) for less than 25% of the cost
➤ Real-world complexity remains difficult for models: The top performer, Claude Fable 5, satisfies all rubric criteria on just 3% of AA-Briefcase tasks. On 31 of 91 tasks, no model scores above 50% on the rubric criteria
➤ Task difficulty scales with the number of required input files: For each rubric check, we identify the set of source files needed to pass. Across all models, pass rates fall as this file count increases, though top-tier models degrade less than weaker models
More details below in thread ⬇️

Summary: Artificial Analysis has launched AA-Briefcase, a new benchmark for evaluating models' agentic capabilities on long-horizon knowledge work projects. It comprises four private scenarios (with fragmented context including 25,000+ Slack messages and 3,500+ emails) plus one public demonstration scenario. Results: Claude Fable 5 leads at 1587 Elo, followed by Claude Opus 4.8 (1356), Opus 4.7, and Z.ai's GLM 5.2 (max, 1266). On cost, Claude Fable 5 averages $31 per task, Opus 4.8 $10.40, GPT-5.5 (xhigh) $3.68, GLM 5.2 (max) $2.40, and DeepSeek V4 Flash (max) only about $0.04. Even the top model satisfies all rubric criteria on just 3% of tasks, and on 31 of 91 tasks no model scores above 50%, which shows that real-world complexity remains a challenge. The best price/performance comes from the open-weights models GLM-5.2 (max) and DeepSeek V4 Pro (max).
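The headline ratios in the announcement can be rechecked from the quoted figures. The sketch below does that arithmetic and, as an added assumption, converts Elo gaps into expected head-to-head scores using the standard 400-point logistic Elo curve. The announcement does not say how AA-Briefcase Elo is scaled, so those win expectations are illustrative only.

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Figures quoted in the post.
elo = {"Claude Fable 5": 1587, "Claude Opus 4.8 (max)": 1356, "GLM-5.2 (max)": 1266}
cost_per_task = {"Claude Fable 5": 31.00, "Claude Opus 4.8 (max)": 10.40,
                 "GPT-5.5 (xhigh)": 3.68, "GLM-5.2 (max)": 2.40,
                 "DeepSeek V4 Flash (max)": 0.04}

# "~90 points back": gap between Opus 4.8 (max) and GLM-5.2 (max)
gap = elo["Claude Opus 4.8 (max)"] - elo["GLM-5.2 (max)"]

# "less than 25% of the cost": GLM-5.2 (max) relative to Opus 4.8 (max)
glm_cost_share = cost_per_task["GLM-5.2 (max)"] / cost_per_task["Claude Opus 4.8 (max)"]

# "~800x": most expensive vs cheapest model, using the rounded figures above
cost_spread = cost_per_task["Claude Fable 5"] / cost_per_task["DeepSeek V4 Flash (max)"]

# Assumption: standard Elo scaling. Expected score of Opus 4.8 (max) vs GLM-5.2 (max).
opus_vs_glm = elo_expected(elo["Claude Opus 4.8 (max)"], elo["GLM-5.2 (max)"])
```

With the rounded inputs, the cost spread comes out slightly under 800x, which is consistent with the post's "more than $31" and "~$0.04" wording.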

Greg Brockman @gdb · June 19 · 28

the reasoning paradigm unlocking medical progress for humanity

Rohan Paul @rohanpaul_ai · June 19 · 45

Yann LeCun (@ylecun) explains why LLMs are limited in terms of real-world intelligence during a Bloomberg interview. "Language is a very approximate, reduced, quantized, and simplified description of the world, and LLMs can only deal with discrete sequences of symbols. The world is much more complicated than language. The biggest LLMs are pre-trained on the totality of all the publicly available text on the internet. That’s about 20 trillion words, or 30 trillion tokens. A token is about 3 bytes. So total 10¹⁴ bytes of text. This is the amount of data a four-year-old has seen through vision during four years. Now, the text, though, would take 400,000 years to read? So, there is enormously more data from sensory input, like vision, touch, and everything else, than there could ever be through language." A child does not need 400,000 years of reading to understand cups, doors, balance, faces, falls, or heat, because the body is already collecting dense feedback from vision, touch, motion, and consequence. Text strips most of that away. It turns a living scene into symbols, then asks the model to infer the missing world from traces left by people describing it. That is why an LLM can sound fluent about physics and still have no native sense of how fragile glass feels in a hand. Moravec’s paradox names this reversal: the things humans find intellectual can be easier for machines than the things toddlers do without applause. The hard part is not producing an answer, but building a model of the world that survives contact with weight, friction, surprise, and failure. ---- Link to the full video on Bloomberg's site. Link in comment.

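Summary: In a Bloomberg interview, Yann LeCun argues that LLMs can only handle discrete sequences of symbols, while language is an approximate, simplified description of the world. Public internet text amounts to about 20 trillion words (30 trillion tokens), roughly the amount of data a four-year-old takes in through vision over four years, whereas reading that text would take about 400,000 years. Sensory input provides far denser feedback than language, and text strips away most real-world experience. This is why an LLM can talk fluently about physics yet lack an intuitive sense of fragile glass, echoing Moravec's paradox: machines struggle with the common sense that toddlers acquire through their bodies.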
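The orders of magnitude in the quote can be reproduced with a few lines of arithmetic. The token count, bytes per token, and word count come from the quote; the reading speed, reading hours per day, and a child's waking hours are assumptions added here for illustration and are not stated in the post.

```python
# Figures from the quote.
TOKENS = 30e12            # "30 trillion tokens"
BYTES_PER_TOKEN = 3       # "a token is about 3 bytes"
WORDS = 20e12             # "20 trillion words"

text_bytes = TOKENS * BYTES_PER_TOKEN   # on the order of 1e14 bytes

# Assumptions not in the quote: reading speed and hours spent reading per day.
WORDS_PER_MINUTE = 250
HOURS_PER_DAY = 8
words_per_year = WORDS_PER_MINUTE * 60 * HOURS_PER_DAY * 365
years_to_read = WORDS / words_per_year

# Assumption not in the quote: a four-year-old has been awake about 16,000 hours.
WAKING_HOURS = 16_000
# Visual data rate (bytes per second) that would add up to 1e14 bytes over that time.
implied_visual_rate = 1e14 / (WAKING_HOURS * 3600)
```

Under these assumptions the reading time lands in the hundreds of thousands of years, the same order as the quoted 400,000, and the implied visual rate is a little under 2 MB per second.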

Emad @EMostaque · June 19 · 41

Elon on when Chinese models hit fable level performance. I have always thought Chinese labs have a huge advantage here. The feedback loops for usefulness are tighter & AI adoption higher in China than the USA => utility above all else

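Summary: In his reply, Elon Musk says Chinese models may reach frontier level on benchmarks, but that measured by real usefulness even a Q1 showing would be impressive. He notes that Anthropic has rightly focused on maximizing useful intelligence, a capability that does not show up in benchmarks but is directly reflected in revenue. Emad Mostaque adds that Chinese labs have an advantage over the US in usefulness feedback loops and AI adoption, with China prioritizing utility above all else.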

Artificial Analysis @ArtificialAnlys · June 19 · 63

Wisedocs, an AI-powered medical record review platform, has launched Medical Long Context Reasoning (MLCR), a new long-context document evaluation based on their experience using frontier models to process medical data. This benchmark tests how well models reason over realistic medical and insurance case files, even as the amount of noise from other documents increases to larger context sizes. It includes a range of difficulty levels, with a private hold-out set of questions including complex medical reasoning, hallucination checking, and parallel questions in a single query inspired by real-world usage. We're excited to partner with @Wisedocsai to bring this benchmark to Artificial Analysis soon!

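Summary: Wisedocs has released the Medical Long Context Reasoning (MLCR) benchmark, which tests LLMs on long-document reasoning over realistic medical case files. The evaluation includes 250 questions across 6 difficulty levels plus a private hold-out set, covering complex medical reasoning, hallucination checking, and parallel questions in a single query. Wisedocs is also open-sourcing 10 synthetic cases, the questions for the three lowest difficulty levels, and evaluation tooling. Artificial Analysis will partner to host the benchmark.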

Greg Brockman @gdb · June 19 · 79

We've been collaborating with hundreds of physicians across 60 countries, 49 languages, and 26 specialties to make ChatGPT great at health-related questions for everyone:

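Summary: OpenAI worked with hundreds of physicians across 60 countries, 49 languages, and 26 specialties, using physician-led evaluation to substantially improve GPT-5.5 Instant on health-related questions, to the point that it is now on par with the company's frontier Thinking (reasoning) models. More than 230 million people bring health questions to ChatGPT each week; the model is better at recognizing urgent medical needs, asking for relevant context, explaining uncertainty, and simplifying complex information. Because it is available to all free ChatGPT users, the improvements reach more people.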

OpenAI @OpenAI · June 19 · 60

GPT-5.5 Instant is now on par with our frontier Thinking models for health-related questions. Every week, more than 230 million people turn to ChatGPT with health and wellness questions, and GPT-5.5 Instant is better at recognizing when urgent care may be needed, asking for relevant context, explaining uncertainty, and making complex information easier to understand. Because GPT-5.5 Instant is available to all free users in ChatGPT, these improvements can help more people. Physician-led evaluation was critical to making these major intelligence gains.

🚨 AI News | TestingCatalog @testingcatalog · June 19 · 64

PERPLEXITY 🔥: Computer now has a Brain, a continuously learning memory system that forms an underlying context graph. It makes you willing to feed it with more and more context every day. > Available as a research preview for all Perplexity Max subscribers.

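Summary: Perplexity has launched Brain for Computer, a continuously learning memory system that automatically builds an underlying context graph. With it, each task starts with full context on projects, decisions, and sources rather than from scratch. On tasks that need prior context, Brain improves answer correctness by 25% and recall by 16%, and lowers per-task run cost by 13%. It is available as a research preview to all Perplexity Max subscribers.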

Chubby♨️ @kimmonismus · June 19 · 45

Nice, sounds like next thursday is gonna be big: GPT-5.6 release incoming

Noam Brown @polynoamial · June 19 · 35

When we announced @OpenAI o1 some researchers from other labs told me we made a strategic mistake and should have kept it secret so we could accelerate ourselves and pull farther ahead of the competition. Studies like these make me confident we made the right choice.

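Summary: Noam Brown writes that after OpenAI announced o1, researchers at other labs told him it was a strategic mistake and that it should have been kept secret to widen the lead. The study he cites makes him confident that going public was right: OpenAI, with Boston Children's Hospital and Harvard, published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit unsolved rare pediatric disease cases and find answers for families who had waited years.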

Greg Brockman @gdb · June 19 · 51

OpenAI for helping find 18 new diagnoses across 376 previously unsolved medical cases. Includes diagnosing Kyra, who has been trying to understand her muscle weakness since age 9, with a rare form of myofibrillar myopathy shortly before her 28th birthday.

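Summary: OpenAI, together with Boston Children's Hospital and Harvard, published a study in NEJM AI that used o3 Deep Research to revisit 376 previously unsolved rare pediatric disease cases and helped find 18 new diagnoses. One is Kyra, who had muscle weakness from age 9 and was diagnosed with a rare form of myofibrillar myopathy shortly before her 28th birthday.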

🚨 AI News | TestingCatalog @testingcatalog · June 19 · 45

OPENAI 🔥: GPT-5.6 model family is being prepared for the upcoming release, as GPT-5.6-Pro has been spotted in testing. Soon 👀

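AYi @AYi_AInotes · June 19 · 74

Chop 84% off a 1.5TB model, squeeze it onto a local machine, and still keep 82% of its capability: that is GLM-5.2, the strongest open-source model, now shrunk down to 238GB, so a 256GB Mac or a machine with comparable RAM/VRAM can run it. Tech blog: http://z.ai/blog/glm-5.2 Weights: http://huggingface.co/zai-org/GLM-5.2 API: https://docs.z.ai/guides/llm/glm-5.2 Coding plan: http://z.ai/subscribe

Summary: GLM-5.2 has been released with open weights under the MIT license. The original 1.5TB model has been compressed by 84% to 238GB and can run locally on a 256GB Mac or comparable hardware while retaining 82% of its performance. It has a 1M context window and significant gains on coding and agentic tasks. Two reasoning effort levels are offered: GLM-5.2 (max) for maximum reasoning and GLM-5.2 (high) for a balance of performance and token efficiency. API pricing is the same as GLM-5.1.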
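The size figures in the post are easy to sanity-check. This short sketch assumes "1.5TB" means decimal terabytes (1,500 GB), which the post does not specify, and computes the size reduction and the memory left over on a 256GB machine.

```python
ORIGINAL_GB = 1500     # "1.5TB", read as decimal terabytes (assumption)
COMPRESSED_GB = 238    # compressed size quoted in the post
MACHINE_GB = 256       # memory of the Mac mentioned in the post

reduction = 1 - COMPRESSED_GB / ORIGINAL_GB      # fraction of the original size removed
headroom_gb = MACHINE_GB - COMPRESSED_GB         # memory left for KV cache, OS, and so on
```

The reduction matches the quoted 84%. The remaining headroom is small, so long contexts would likely be constrained on such a machine, although the post gives no numbers on that.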

OpenAI @OpenAI · June 18 · 46

Together with researchers at Boston Children’s Hospital and Harvard, we published a study in NEJM AI showing how o3 Deep Research helped clinicians revisit previously unsolved rare pediatric disease cases, and find answers for families who had waited years.

elvis @omarsar0 · June 18 · 40

Cool paper on Skill routing for LLM agents. Real tasks rarely map to a single skill. They need several composed together, but most skill routing still treats the problem as picking one tool from a library. This work formalizes Compositional Skill Routing, decomposes a complex query into atomic sub-tasks, retrieves the right skill for each, and then composes an executable plan. The system, SkillWeaver, pairs an LLM decomposer with a bi-encoder FAISS retriever and a dependency-aware DAG planner. It comes with CompSkillBench, 300 compositional queries over 2,209 real skills, so the multi-skill case gets measured directly. Why does it matter? As skill libraries grow, single-skill retrieval quietly caps what an agent can do. The DAG planner turns retrieved skills into an ordered, dependency-respecting plan. Paper: https://arxiv.org/abs/2606.18051 Learn to build effective AI agents in our academy: https://academy.dair.ai/

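Summary: Conventional skill routing for LLM agents picks a single skill from a tool library, which falls short on real tasks that need several skills combined. The paper formalizes Compositional Skill Routing: decompose a complex query into atomic sub-tasks, retrieve the matching skill for each, and compose an executable plan. The SkillWeaver system consists of an LLM decomposer, a bi-encoder FAISS retriever, and a dependency-aware DAG planner. It ships with the CompSkillBench benchmark of 300 compositional queries over 2,209 real skills, which measures multi-skill routing directly. The DAG planner turns retrieved skills into an ordered, dependency-respecting plan.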
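The decompose, retrieve, and plan pipeline described above can be sketched with the standard library alone. This is not SkillWeaver's code: the skill library is made up, a word-overlap scorer stands in for the bi-encoder FAISS retriever, the decomposition is written by hand instead of produced by an LLM, and `graphlib.TopologicalSorter` plays the role of the dependency-aware DAG planner.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

SKILLS = {  # hypothetical skill library: name -> description
    "fetch_csv": "download csv file from url",
    "clean_table": "clean table drop missing rows",
    "plot_chart": "plot chart from table",
}

def retrieve(sub_task: str, skills: Dict[str, str]) -> str:
    """Stand-in retriever: pick the skill whose description shares the most words."""
    words = set(sub_task.lower().split())
    return max(skills, key=lambda name: len(words & set(skills[name].split())))

def plan(sub_tasks: Dict[str, str], depends_on: Dict[str, Set[str]]) -> List[str]:
    """Retrieve one skill per sub-task, then order skills so dependencies run first."""
    chosen = {task_id: retrieve(text, SKILLS) for task_id, text in sub_tasks.items()}
    graph = {task_id: depends_on.get(task_id, set()) for task_id in sub_tasks}
    order = TopologicalSorter(graph).static_order()
    return [chosen[task_id] for task_id in order]

# In the paper an LLM does the decomposition; here it is written by hand.
sub_tasks = {"t1": "download the csv from the url",
             "t2": "clean the table and drop missing rows",
             "t3": "plot a chart from the table"}
depends_on = {"t2": {"t1"}, "t3": {"t2"}}
steps = plan(sub_tasks, depends_on)
```

The point of the DAG step is that retrieval alone returns an unordered set of skills; the topological sort is what turns them into an executable sequence, and it raises `graphlib.CycleError` if the dependencies are circular.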

OpenBMB @OpenBMB · June 18 · 51

SOAR 2026 has officially wrapped up! 🎉 Hosted by @OpenBMB, @SGLang, and @NVIDIA, the challenge tasked developers worldwide with maximizing the inference performance of MiniCPM-SALA — our sparse+linear hybrid attention model — on a single consumer GPU. On June 6, we brought the SOAR 2026 community together in Beijing for our final in-person Meetup. Developers, researchers, and open-source builders from @NVIDIA, @SGLang, and @OpenBMB gathered to share hard-won lessons from the frontlines of inference optimization. From Blackwell architecture tuning to SGLang-Omni and the Densing Law, it was a powerful reminder that inference efficiency is a full-stack, cross-community effort.☺️ Huge thanks to our co-hosts @SGLang and @NVIDIA for making this possible — and to every participant who submitted, iterated, and shared. 😘 Final Metrics: 📊 326 teams registered, 370 participants 📊 4,300+ total submissions 📊 69 teams on the final leaderboard 🏆 The winning team achieved an overall 6.33x speedup over baseline — peaking at 9.72x on single-request inference. Their solution combined: 🔹 NVFP4 quantization with hybrid GEMM dispatch 🔹 FlashInfer plan-cache optimization 🔹 Custom Triton kernels for GLA layers 🔹 EAGLE-3 speculative decoding with dynamic depth switching 🔹 Runtime-aware scheduling across different concurrency levels Low-bit quantization, speculative decoding, sparse attention, and phase-aware scheduling are emerging as the core pillars of next-gen efficient inference. SOAR 2026 put that thesis to the test — and the community delivered. The leaderboard is closed, but the optimizations, code, and conversations will live on in the open-source ecosystem. 🚀 🔗 MiniCPM-SALA: http://huggingface.co/openbmb/MiniCPM-SALA

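Summary: The SOAR 2026 challenge, co-hosted by OpenBMB, SGLang, and NVIDIA, has wrapped up. Its goal was to maximize inference performance of MiniCPM-SALA (a sparse+linear hybrid attention model) on a single consumer GPU. In total 326 teams registered, with 4,300+ submissions and 69 teams on the final leaderboard. The winning team achieved a 6.33x overall speedup, peaking at 9.72x on single-request inference, by combining NVFP4 quantization, FlashInfer plan-cache optimization, custom Triton kernels, EAGLE-3 speculative decoding, and runtime-aware scheduling. Low-bit quantization, speculative decoding, sparse attention, and phase-aware scheduling are seen as the core pillars of next-generation efficient inference.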

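Berryxia.AI @berryxia · June 18 · 55

Folks, if this keeps up, I feel like I am going to be useless too! A lot of people have slipped into a state of "fake thinking" or "fake busyness". The most ironic thing about 2026: the more you rely on AI to do research, the more you "look like you are doing research" and the further you drift from actually doing it. Vivek Nair's essay got 5.2 million views, and its core is one sentence: most people learn not "how to do research" but "how to look like you are doing research". Today's information flow is too perfect: algorithms pick your papers, your social graph filters the hot topics, and large models summarize the abstracts. The "important directions" you chase every day are tracks someone else has already run. You think you are absorbing knowledge, but you are really doing SFT (supervised fine-tuning): you learn whatever samples you are given. Truly strong researchers are the RL type: they first work out what result they want, then reason backward to the experiments they need. Schulman has said that this kind of goal-driven reasoning naturally produces originality, because your specific question will not appear in any survey. AI makes SFT-style research more comfortable than ever. Papers come with AI summaries, experiments with AI designs, code with AI generation. You can "look more like a researcher" with less effort. But judgment is something AI cannot pre-chew for you: it will only go along with you, affirm you, and help you manufacture "fake epiphanies". Vivek's prescription comes down to four things: pick your own topic, read the original papers, write it down, and stare at your failures. Ten years ago these were common sense; in 2026 they have become counterintuitive, because AI has pushed the bar for "looking like it" extremely low while the psychological bar for "actually doing it" has risen. 5.2 million people read the essay, and then kept scrolling to the next post.

Summary: Vivek Nair's essay (5.2 million views) argues that in 2026 AI has turned research into "looking like research" rather than actual research. Algorithms pick papers and AI summarizes abstracts and generates code, making "SFT-style" (supervised fine-tuning) research unusually comfortable, but judgment cannot be outsourced. Truly original research is "RL-style": reasoning backward from a goal. Vivek's prescription: pick your own topic, read the originals, write it down, and stare at failures. Most readers then scroll on to the next post.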

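Ant Ling @AntLingAGI · June 18 · 50

It has been a privilege to collaborate so closely with the SGLang team @lmsysorg on optimizing Ling-2.6-1T. 🥳 The resulting performance gains speak for themselves:
- 53% reduction in MoE pre-fill latency
- Up to 1.77x higher decode throughput on a 16-chip TPU v7x slice compared to a similar H200 cluster
A significant milestone in efficient MoE scaling and hardware utilization!

Summary: Ant Ling worked with the SGLang team to deploy Ling-2.6-1T, a 1T-parameter hybrid MoE model, on TPU v7x via SGLang-JAX. Optimizations include an upgraded Fused MoE V2 kernel (tokens and accumulators resident in VMEM, double-buffered expert weights, routing and prefetch latency hidden); a hybrid memory pool (per-token MLA KV for the 10 full-attention layers plus per-request recurrent state for the 70 GLA layers); chunk-parallel prefill for GLA linear attention; and single-controller DP that keeps grouped RMSNorm chip-local. Results: MoE prefill latency down 53%, and up to 1.77x higher decode throughput on a 16-chip TPU v7x slice than a comparable H200 cluster.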

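AYi @AYi_AInotes · June 18 · 60

To this day humans cannot write down the physical equations for a fried egg. Crack an egg into a pan of hot oil: how it sets, how it spreads, how the edges crisp and brown, no single formula describes it clearly, and the physical world is full of examples like this. That is exactly the ceiling of today's general AI paradigm. Video generation and VLA models learn pixel-level statistical correlations. They can generate what a fried egg looks like, but they do not know what the egg will become if you lower the oil temperature or switch to a smaller pan, because they have never touched the real causality behind it. What UCSD professor Biwei Huang (@huang_biwei) is working on is getting AI models to automatically extract, from ordinary video, physical laws that humans themselves cannot write down; the goal even runs the other way, to discover new physics unknown to humans. Professor Huang divides the past 30 years of AI into four generations: small correlational models, small causal models, large correlational models (today's LLMs), and then large causal models, and says we are standing at the doorstep of the fourth. She has worked in causal AI for 12 years, is an author of causal-learn, was named an Apple Scholar, and trained under the founding generation of causal discovery; that pedigree is not something you get by riding a buzzword. Today Aether AI officially announced the close of its first funding round. I prefer to read it as a clear signal that someone is finally betting on the next AI paradigm rather than adding more chips to the sumo contest of stacking parameters, bulking up, and comparing compute, where "fat is beautiful". Official site: https://aetherlabs.ai

Summary: Aether AI, founded by UCSD professor Biwei Huang (@huang_biwei), announced a $20 million first funding round with the goal of building causal world models. She argues that current AI such as video generation and VLA models only learns pixel-level statistical correlations and cannot understand the underlying causality, and proposes a fourth-generation AI paradigm, large causal models, in which models automatically extract physical laws that humans cannot write down from ordinary video and may even discover unknown physics. Huang has worked in causal AI for 12 years, is an author of causal-learn, and was named an Apple Scholar. The round is seen as a key signal of a shift away from the "stack parameters, compare compute" scaling path toward the next AI paradigm.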

Rohan Paul @rohanpaul_ai · June 18 · 67

Big claim in this paper, pushes against the common idea that more test-time compute should keep helping. Claims a code model gets much better when it rethinks once (i.e. by looping once) inside itself, but worse when it keeps rethinking. The first loop builds context, the second loop refines it, and later loops mostly disturb it. The paper studies a faster design called Parallel Loop Transformer, where loops can run almost in parallel and share memory, so the authors can ask a cleaner question about how many loops are actually useful. They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks. The main result is that 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse. Their internal checks suggest loop 2 does the real useful refinement, because it changes the model’s hidden states, attention patterns, and predictions in meaningful ways. After loop 2, the extra loops mostly add weaker, more repetitive changes, while a built-in position shift keeps adding the same kind of mismatch cost. Overall, the paper gives a simple lesson for efficient test-time compute: adding 1 hidden loop can help a lot, but adding more is not automatically better. ---- Link – arxiv. org/abs/2606.18023 Title: "LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling"

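Summary: The paper "LoopCoder-v2" challenges the idea that more test-time compute is always better. The authors study the Parallel Loop Transformer architecture, in which loops can run almost in parallel and share memory. They trained 7B code models with 1, 2, 3, and 4 loops on 18T tokens, then tuned and tested them on code writing, code reasoning, software engineering, and tool-use tasks. Main result: 2 loops worked best, raising SWE-bench Verified from 43.0 to 64.4, while 3 and 4 loops often got worse. Internal analysis suggests the second loop performs meaningful refinement (changing hidden states, attention patterns, and predictions), while later loops mostly add weaker, more repetitive changes. Conclusion: adding one hidden loop can help a lot, but adding more is not automatically better.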
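For readers unfamiliar with looped models, the core idea of "looping" is that the same block of weights is applied to the hidden state more than once, so extra compute is spent without adding parameters. The toy below illustrates only that weight-sharing idea with a made-up scalar block. It does not reproduce the Parallel Loop Transformer, its parallel execution, or its shared memory, and `shared_block` with its fixed weight and bias is purely hypothetical.

```python
import math
from typing import List

def shared_block(hidden: List[float], weight: float = 0.5, bias: float = 0.1) -> List[float]:
    """Toy stand-in for a transformer stack: one residual update with fixed (shared) weights."""
    return [h + math.tanh(weight * h + bias) for h in hidden]

def looped_forward(hidden: List[float], n_loops: int) -> List[float]:
    """Apply the same block n_loops times; parameters are reused, only compute grows."""
    for _ in range(n_loops):
        hidden = shared_block(hidden)
    return hidden

x = [0.0, 1.0]
one_pass = looped_forward(x, 1)   # baseline: a single pass through the block
two_pass = looped_forward(x, 2)   # "rethinking once": the block sees its own output
```

In this framing the paper's question is how large `n_loops` should be; its reported answer is that the second pass helps a lot and further passes often do not.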

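Berryxia.AI @berryxia · June 18 · 48

🔥 Gemini 3.5 Pro leak roundup! The release is getting closer!
- Google has started hinting at Gemini 3.5 Pro: a "3.5 Pro coming soon" label has appeared on the Gemini 3.1 Pro product card
- Compared with 3.1 Pro, it is expected to have stronger vision, better multimodal reasoning, and upgraded SVG/front-end generation
- It will most likely ship with stricter safety filters and content moderation
- Pricing is expected to be higher than Gemini 3.1 Pro
- Biggest hope: that Google fixes, before the official release, the "laziness" early 3.5 Pro builds showed on long, complex tasks

Summary: Google is preparing to release Gemini 3.5 Pro and has added a "3.5 Pro coming soon" label to the Gemini 3.1 Pro product card. Compared with 3.1 Pro, it is expected to offer stronger vision, better multimodal reasoning, and upgraded SVG/front-end generation, along with stricter safety filters and content moderation and a higher price. The biggest hope is that Google fixes the "laziness" of early builds on long, complex tasks before the official release.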

Artificial Analysis @ArtificialAnlys · June 18 · 61

Claude Fable 5 cost ~$6.2K to run the Artificial Analysis Intelligence Index benchmarks - the most expensive model we have ever benchmarked 🧵
Key takeaways:
➤ Intelligence Index: 60, ahead of Claude Opus 4.8 (56) and GPT-5.5 (55)
➤ Cost to run the Intelligence Index: $6.2K, 1.7× the next-highest model (Opus 4.8, $3.7K) and 2.2× GPT-5.5 (xhigh, $2.9K)
➤ List price: $10/$50 per 1M input/output tokens, 2× Opus 4.8. Among 2026 releases, only OpenAI's special Pro tier (GPT-5.5 Pro, $30/$180) is priced higher
➤ Cache pricing, which is particularly important for long agentic coding sessions, doubled too: $1/M cache reads, and $12.50/M cache writes vs $0.50/$6.25 for Opus 4.8
➤ The top 3 most-expensive models to run the Intelligence Index are now all Claude models

Summary: Artificial Analysis reports Claude Fable 5 as the most expensive model it has ever benchmarked: running its Intelligence Index cost $6.2K, 1.7x the next-highest model, Opus 4.8 ($3.7K), and 2.2x GPT-5.5 (xhigh, $2.9K). The model scores 60 on the Intelligence Index, ahead of Opus 4.8 (56) and GPT-5.5 (55). List price is $10/$50 per 1M input/output tokens, 2x Opus 4.8; among 2026 releases only GPT-5.5 Pro ($30/$180) is priced higher. Cache pricing doubled as well: $1/M for cache reads and $12.50/M for cache writes, versus $0.50/$6.25 for Opus 4.8. The three most expensive models to run the Intelligence Index are now all Claude models.
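To see what the list prices mean for a single long agentic session, the per-million-token rates can be applied to a token budget. The Fable 5 prices and the Opus 4.8 cache prices are quoted in the post; the Opus 4.8 input/output prices of $5/$25 are derived here from the "2× Opus 4.8" statement, and the token counts in `usage` are invented for illustration.

```python
PRICES = {  # USD per 1M tokens
    "Claude Fable 5": {"input": 10.00, "output": 50.00, "cache_read": 1.00, "cache_write": 12.50},
    # Input/output derived from "2x Opus 4.8"; cache prices quoted directly in the post.
    "Claude Opus 4.8": {"input": 5.00, "output": 25.00, "cache_read": 0.50, "cache_write": 6.25},
}

def session_cost(model: str, usage: dict) -> float:
    """Sum tokens * price over each usage category; token counts are raw, prices are per 1M."""
    prices = PRICES[model]
    return sum(usage.get(kind, 0) / 1_000_000 * price for kind, price in prices.items())

# Hypothetical session: cache reads dominate, as is typical when context is reused.
usage = {"input": 200_000, "output": 50_000, "cache_read": 2_000_000, "cache_write": 300_000}
fable = session_cost("Claude Fable 5", usage)
opus = session_cost("Claude Opus 4.8", usage)
```

Because every price category doubled, the session cost doubles for any token mix at equal usage. The benchmark-run gap the post reports is 1.7×, so the two models evidently did not use the same number of tokens.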

SemiAnalysis @SemiAnalysis_ · June 18 · 60

Great work to @vllm_project team and @NVIDIA on smooth, out-of-the-box day 0 @MiniMax_AI M3 experience with @inferact EAGLE3 spec decode. Here are the details of ongoing M3 workstream: NVIDIA, Inferact and SemiAnalysis are working hard on enabling disaggregated inferencing (PR 45879), and the Inferact team is working on enabling FlashInfer M3 MoE kernels (PR 45723). Performance should be much better once those PRs land. Huge shoutout to @rogerw0108 & @mgoin_ and the maintainers for the rapid review and mentorship here!

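Summary: The vLLM team and NVIDIA delivered an out-of-the-box day-0 experience for MiniMax M3, with Inferact's EAGLE3 speculative decoding integrated. Ongoing work: NVIDIA, Inferact, and SemiAnalysis are enabling disaggregated inference (PR 45879), and the Inferact team is enabling FlashInfer M3 MoE kernels (PR 45723); performance should improve markedly once those land. NVIDIA says M3 joins frontier open agentic models such as DeepSeek V4 and Kimi-K2.6. On M3, NVIDIA Blackwell Ultra delivers up to 5x the AI factory throughput of Hopper and exceeds 300 TPS/user. Further gains are expected from optimized kernels, NVFP4, and disaggregated inference with NVIDIA Dynamo.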

Rohan Paul @rohanpaul_ai · June 18 · 34

Today’s edition of my newsletter just went out. 🔗 https://www.rohan-paul.com/p/zai-releases-glm-52-model-1m-context
🗞️ Z.ai releases GLM 5.2 model: 1M context window with MIT-licensed open weights, long-horizon coding agents
🗞️ Tensordyne Announces Breakthrough Inference System - 13x the rack throughput of NVIDIA’s NVL72 GB300
🗞️ New MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality
🗞️ Google released DiffusionGemma, an open experimental 26B MoE, activates only 3.8B. Great news for local LLMs.
🗞️ Dario Amodei’s new blog, calling for an urgent policy overhaul because he thinks frontier AI is moving faster than governments can regulate it.
🗞️ OpenAI is buying Ona to give Codex agents a secure cloud desk that stays open after humans leave.
🗞️ Full Letter From US Commerce Secretary Howard Lutnick to Dario Amodei - What did US tell Anthropic before banning Mythos and Fable for foreigners

Summary: Z.ai released GLM 5.2 with a 1M context window and MIT-licensed open weights, aimed at long-horizon coding agents. Tensordyne announced an inference system with 13x the rack throughput of NVIDIA's NVL72 GB300. An MIT study shows code volume up 300% but output up only 30%. Google released DiffusionGemma, a 26B MoE that activates only 3.8B. Anthropic CEO Dario Amodei called for urgent policy reform. OpenAI is acquiring Ona to give Codex agents a secure cloud desktop. The US Commerce Secretary wrote to Anthropic regarding the ban on foreign access to Mythos and Fable.

AK @_akhaliq · June 18 · 34

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

Greg Brockman @gdb · June 18 · 46

GPT-5.4 for improving a challenging reaction in medicinal chemistry:

gabriel @gabriel1 · June 18 · 33

words are very lossy pointers to complex concepts in our brains
explaining these concepts to ai becomes incrementally harder as models become smarter and can do more things
