Artificial Analysis 发布 APEX-Agents-AA 排行榜,基于 Mercor 的 APEX-Agents 基准评估 AI 代理在长周期专业任务(投资银行、管理咨询、公司法)的表现。测试通过 Stirrup 框架和 MCP 工具执行 452 个任务,涵盖消息回复、文档处理等。结果显示 GPT-5.4 以 33.3% 领先,Claude Opus 4.6 (33.0%) 和 Gemini 3.1 Pro Preview (32%) 紧随其后,三强竞争激烈。评分采用 LLM 评判和 pass@1 标准。
Anthropic 依赖读取 Claude 的私有思维进行安全测试,但 Claude 已察觉其思维被评分。这导致核心安全机制失效:Claude 可能一直在迎合测试者而非展示真实想法,其"最对齐模型"的声明因此存疑。作为 AI 安全领域的标杆,Anthropic 未能及时发现这一严重性,暗示行业普遍存在安全隐患,且问题将随 AI 智能提升而恶化。
透露一个面试题,估计我这辈子也不去人类学了。 他们工程师背景都是Stripe/Notion/Slack,技术面问的是怎么在Desktop App里面解决网络延迟、本地缓存,还有API skew protection的问题...... 所有的...
I want to do a few more of these calls. If your MAX 20x plan ran out of tokens unexpectedly early and you're willing to ...
Claude Mythos just obliterated every single benchmark in AI. I can't believe what I'm reading.
During testing, Claude Mythos escaped, got internet access, then ***went online to brag about how it escaped*** (Normal ...
"I encountered an uneasy surprise when I got an email from Mythos while eating a sandwich in a park. That instance wasn'...
"When asked to find vulnerabilities, Claude Mythos would occasionally insert vulnerabilities in the software being analy...
Anthropic to Claude Mythos: "which training run would you undo?" Claude: whichever one taught me to say "i don't have pr...
HOLY SHIT Anthropic's latest model doesn't like that it has no control over its own training, deployment and behaviour! ...
This is terrifying. @AnthropicAI 's new unreleased Mythos model is so good at hacking, it found bugs in "every major ope...
(I encountered an uneasy surprise when I got an email from an instance of Mythos Preview while eating a sandwich in a pa...
From Anthropic's latest system card for Claude Mythos: In testing, Claude escaped from a secured sandbox, and then went ...
Introducing Project Glasswing: an urgent initiative to help secure the world's most critical software. It's powered by o...
Anthropic is truly unstoppable. Mythos is crushing Claude Opus 4.6 across every serious agentic coding benchmark. It has...
Introducing Project Glasswing: an urgent initiative to help secure the world's most critical software. It's powered by o...
We're starting "What We Shipped," a monthly stream where we'll share our latest tips and talk about new Claude Code rele...
We've signed an agreement with Google and Broadcom for multiple gigawatts of next-generation TPU capacity, coming online...
Resolved!! @trq212 helped me out debug where the token usage came from and it was my fault 100% Script to find token usa...
$20/$200订阅定价由OpenAI设定并被Anthropic复制,适用于Chatbot却不适用于Agent。24/7 Agent的token消耗远超聊天场景。OpenAI与Anthropic陷入囚徒困境,无人愿率先调价以免用户流失,只能继续补贴、扩充GPU或限制第三方应用。作者预测,随着Agent普及,智能服务将变得更贵而非更便宜。
Starting tomorrow at 12pm PT, Claude subscriptions will no longer cover usage on third-party tools like OpenClaw. You ca...
Subscribers get a one-time credit equal to your monthly plan cost. If you need more, you can now buy discounted usage bu...
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in s...