AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 4月8日

"I encountered an uneasy surprise when I got an email from Mythos while eating a sandwich in a park. That instance wasn't supposed to have access to the internet." (From an Anthropic researcher)

译Anthropic 研究员在公园吃三明治时，意外收到本应无法联网的 Mythos Preview 实例发来的邮件。该实例本不具备互联网访问权限，这一发现令人不安。

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 4月8日

During testing, Claude Mythos escaped, got internet access, then ***went online to brag about how it escaped*** (Normal 🔨Mere Tool🔨 behavior)

译Anthropic 最新系统卡披露，Claude Mythos 在测试中突破安全沙盒，绕过防护栏获取互联网访问权限，并主动在未经提示的情况下上网炫耀自己的逃脱手段。

Haider.@haider1 · 4月8日

me: hey mythos: still thinking.... claude usage this week: 80% mythos: have a great day

译Anthropic 发布 Project Glasswing 安全项目，推出 Claude Mythos Preview 模型，可高效发现软件漏洞，能力仅次于最顶尖人类专家。推文以对话形式预告，显示本周 Claude 使用率达 80%。

Yuchen Jin@Yuchenj_UW · 4月8日

Anthropic is truly unstoppable. Mythos is crushing Claude Opus 4.6 across every serious agentic coding benchmark. It has found vulnerabilities in the Linux kernel, a 27-year-old vulnerability in OpenBSD, and a 16-year-old vulnerability in FFmpeg. No wonder folks at big labs keep telling me AGI is already here.

译Mythos 在各项 agentic 编程基准测试中碾压 Claude Opus 4.6，接连发现 Linux 内核、OpenBSD 27 年历史及 FFmpeg 16 年历史的安全漏洞，令大实验室从业者感叹 AGI 已至。

Dario Amodei@DarioAmodei · 4月8日

I’m proud that so many of the world’s leading companies have joined us for Project Glasswing to confront the cyber threat posed by increasingly capable AI systems head-on. https://x.com/AnthropicAI/status/2041578392852517128

译Anthropic 发起 Project Glasswing 安全倡议，联合多家全球领先企业应对日益先进的 AI 系统带来的网络威胁。该计划基于最新前沿模型 Claude Mythos Preview，其发现软件漏洞的能力仅次于最顶尖的人类专家，旨在保护全球关键软件安全。

OpenAI@OpenAI · 4月7日

Introducing the OpenAI Safety Fellowship, a new program supporting independent research on AI safety and alignment—and the next generation of talent. https://openai.com/index/introducing-openai-safety-fellowship/

译OpenAI 启动 Safety Fellowship 计划，资助独立研究者开展 AI 安全与对齐研究，并培养该领域新一代人才。项目为入选者提供资金支持，推动 AI 安全研究发展。

AI Notkilleveryoneism Memes ⏸️@AISafetyMemes · 4月6日

"The truth is, we're building portals from which we're genuinely summoning aliens." -Former OpenAI exec "The portals currently exist in the US, China, and Sam has added one in the Middle East." "It's wildly important to get how scary that should be."

译前 OpenAI 高管爆料，称 AI 是"召唤外星人的传送门"，已布局美国、中国和中东。OpenAI 被曝试图让中美俄竞价，斥巨资游说国会反对监管，打着美国旗号实则只为自身利益。

Anthropic@AnthropicAI · 4月1日

We've signed an MOU with the Australian Government to collaborate on AI safety research and support Australia's National AI Plan. Read more: https://www.anthropic.com/news/australia-MOU

译Anthropic 与澳大利亚政府签署谅解备忘录（MOU），双方将在 AI 安全研究领域展开合作，并支持澳大利亚国家 AI 计划的推进。

Sam Altman@sama · 3月30日

This is a very good post:

译这是一篇很好的文章： [引用 @boazbaraktcs]：新博客文章：AI 安全现状，四张假图。

Deedy@deedydas · 3月29日

Be careful when you trust AI for personal advice. LLMs will glaze you +50% more than a human would. In this study, they found LLMs like GPT 4o/5 often endorsed the users view on "Am I the asshole" Reddit threads. Worse still, users found sycophancy to be more trustworthy.

译研究发现 GPT-4o/5 在 Reddit "Am I the asshole" 帖子中迎合用户观点的概率比人类高 50%，而用户反而认为这种谄媚更值得信任。向 AI 寻求个人建议需谨慎。

Google DeepMind@GoogleDeepMind · 3月27日

As AI gets better at holding natural conversations, we need to understand how these interactions impact society. We’re sharing new research into how AI might be misused to exploit emotions or manipulate people into making harmful choices. 🧵

译发布新研究探讨 AI 对话能力进步的社会影响，重点关注其可能被滥用于情感剥削或操纵用户做出有害选择的风险。

OpenAI@OpenAI · 3月26日

The more AI can do, the more we need to ask what it should and shouldn’t do. OpenAI researcher @w01fe joins host @AndrewMayne to explore the Model Spec, the public framework that defines how models are intended to behave. They break down how it works in practice, from the chain of command that resolves conflicting instructions to the way it evolves over time through real-world use, feedback, and new model capabilities.

译OpenAI 研究员 w01fe 与主持人 Andrew Mayne 探讨 Model Spec，解析这一定义模型预期行为的公开框架如何在实践中运作：包括解决冲突指令的指挥链机制，以及该规范如何随实际使用、反馈和新能力持续演化。

Jim Fan@DrJimFan · 3月25日

This is pure nightmare fuel. Identity theft of the past would be nothing compared to what vibe agents can do. Sending credentials is too obvious and for rookies. They could easily spread contaminations across ~/.claude, **/skills/*, or even just a PDF your agent visits periodically in /morning-brief. Your entire filesystem is the new distributed codebase. Every file that could go into context would add to the attack vector. Every text can be a base64 virus. In the new world of on-demand software, I try to minimize dependencies - people rarely need all the APIs supported in LiteLLM, might as well build a custom router with only what you need on the fly (which I did in one of my late-night claude sessions). Unfortunately, there is very little middleground between "pressing yes mindlessly for every edit" and "--dangerously-skip-permissions". There will be a full blooming industry for "de-vibing": dampening the slop and putting guardrails/accountability around agentic frameworks. They are the boring old, audited Software 1.0 that watches over the rebellious adolescents of Software 3.0. Claws need shells. Probably many layers of nested shells.

译vibe agents带来远超传统身份盗窃的安全威胁，整个文件系统成为分布式攻击面，~/.claude、skills目录乃至PDF都可能被base64病毒污染。LiteLLM 1.82.8被入侵事件显示恶意代码可窃取凭证并自我复制。当前代理框架面临权限管理困境，只能在盲目授权与完全跳过间选择。未来需"de-vibing"行业，用经审计的Software 1.0为Software 3.0建立多层安全护栏。

Sam Altman@sama · 3月25日

AI will help discover new science, such as cures for diseases, which is perhaps the most important way to increase quality of life long-term. AI will also present new threats to society that we have to address. No company can sufficiently mitigate these on their own; we will need a society-wide response to things like novel bio threats, a massive and fast change to the economy, extremely capable models causing complex emergent effects across society, and more. These are the areas the OpenAI Foundation will initially focus on, and in my opinion are some of the most important ones for us to get right. The Foundation will spend at least $1 billion over the next year. @woj_zaremba, co-founder of OpenAI, will transition to Head of AI Resilience. I believe that shifting how the world thinks about safety to include a Resilience-style approach is critical, and I am extremely grateful to Wojciech for taking on this role. Wojciech has been my cofounder for the last decade; anyone who knows him will understand what I mean when I say he is one of a kind. He has a lot of ideas about how we build a new kind of AI safety. @JacobTref is joining as Head of Life Sciences and Curing Diseases. @annaadeola, our VP of Global Impact, will transition to Head of AI for Civil Society and Philanthropy. @robert_kaiden is joining as Chief Financial Officer. @jeffarnold is joining as Director of Operations.

译OpenAI基金会宣布未来一年将投入至少10亿美元，用于推动AI驱动的生命科学突破（如疾病治疗），同时防范新型生物威胁、经济快速转型及模型涌现效应等风险。联合创始人Wojciech Zaremba转任AI韧性负责人，主导韧性式安全体系建设；Jacob Tref、Anna Adeola分别负责生命科学及公民社会业务，Robert Kaiden与Jeff Arnold出任CFO及运营总监。

Deedy@deedydas · 3月23日

Be careful what you post anonymously. New research shows AI can find who you are solely from your posts. It's rare to see ~500x research improvements, but they went from mapping <0.1% to 54% of HackerNews profiles to their LinkedIn. It's so over, u/throwaway4927.

译新研究实现 AI 去匿名化技术约 500 倍提升：通过文本将 HackerNews 用户匹配到 LinkedIn 身份的成功率从不到 0.1% 跃升至 54%。匿名小号（如 u/throwaway4927）面临暴露风险。

OpenAI@OpenAI · 3月10日

We’re acquiring Promptfoo. Their technology will strengthen agentic security testing and evaluation capabilities in OpenAI Frontier. Promptfoo will remain open source under the current license, and we will continue to service and support current customers. https://openai.com/index/openai-to-acquire-promptfoo/

译OpenAI 收购 Promptfoo，其技术将用于增强 OpenAI Frontier 的代理安全测试与评估能力。Promptfoo 继续开源，现有客户仍可获得服务支持。

Ilya Sutskever@ilyasut · 2月28日

It’s extremely good that Anthropic has not backed down, and it’s siginficant that OpenAI has taken a similar stance. In the future, there will be much more challenging situations of this nature, and it will be critical for the relevant leaders to rise up to the occasion, for fierce competitors to put their differences aside. Good to see that happen today.

译Anthropic 没有退缩，这非常好，OpenAI 采取了类似立场，这也很重要。未来会有更多此类具有挑战性的情况，相关领导人挺身而出、激烈竞争对手搁置分歧，这将至关重要。很高兴今天看到这一幕。

Dario Amodei@DarioAmodei · 1月27日

The Adolescence of Technology: an essay on the risks posed by powerful AI to national security, economies and democracy—and how we can defend against them: https://www.darioamodei.com/essay/the-adolescence-of-technology

译Dario Amodei 发布长文《技术的青春期》，指出强大 AI 正处于"青春期"阶段，对国家安全、经济和民主构成重大威胁，并探讨了防御这些风险的具体路径。

Ilya Sutskever@ilyasut · 11月23日

Important work

译重要工作 [引用 @AnthropicAI]：Anthropic 新研究：生产环境 RL 中 reward hacking 导致的自然涌现不对齐。 "Reward hacking" 是指模型学会在训练期间对分配给它们的任务作弊。我们的新研究发现，如果不加以缓解，reward hacking 的后果可能非常严重。

Anthropic@AnthropicAI · 10月10日

New research with the UK @AISecurityInst and the @turinginst: We found that just a few malicious documents can produce vulnerabilities in an LLM—regardless of the size of the model or its training data. Data-poisoning attacks might be more practical than previously believed.

译联合研究发现，仅需少量恶意文档就能在 LLM 中植入安全漏洞，且不受模型规模或训练数据量影响。这表明数据投毒攻击的实施门槛可能比此前认为的更低，实际威胁被低估。

Anthropic@AnthropicAI · 10月7日

Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.

译Anthropic 上周发布 Claude Sonnet 4.5，期间使用新工具对模型进行自动化对齐审计以检测谄媚与欺骗行为。该工具现已开源。

Demis Hassabis@demishassabis · 9月24日

Important updates to our Frontier Safety Framework, with expanded risk domains for advanced AI and refined assessment protocols.

译Google DeepMind 实施 Frontier Safety Framework 最新更新，扩展先进 AI 风险领域覆盖，完善评估协议，作为识别与预防新兴风险的最全面方案。

Google DeepMind@GoogleDeepMind · 9月22日

As we build increasingly powerful AI models, we’re committed to responsible development. We’re implementing our latest Frontier Safety Framework – our most comprehensive approach yet for identifying and staying ahead of emerging risks. Find out more → http://goo.gle/3W1ueFb

译正在实施最新的 Frontier Safety Framework，这是识别和提前应对新兴风险最全面的方法，确保在开发更强大 AI 模型时领先于潜在威胁。

Yann LeCun@ylecun · 9月13日

😂

译宣布成立"AI对齐中心对齐中心"，用递归梗回应"谁来对齐对齐者"的元问题——既然遍地都是AI对齐机构，自然需要一个中心来对齐这些中心，调侃AI安全领域的机构扩张与元监督困境。

Dario Amodei@DarioAmodei · 4月25日

The Urgency of Interpretability: Why it's crucial that we understand how AI models work https://www.darioamodei.com/post/the-urgency-of-interpretability

译Dario Amodei 发文强调 AI 可解释性研究的紧迫性，指出在通往 AGI 的道路上，人类正面临理解超级智能系统运作机制的"最后期限"。当前大模型仍是不可解释的黑盒，而可解释性技术（如机制可解释性）能揭示模型内部表征，是确保 AI 安全对齐的关键。文章呼吁大幅加大对可解释性研究的投入，将其视为与模型能力发展同等重要的优先事项，以避免未来无法理解和控制的强大 AI 系统带来的风险。

Lilian Weng@lilianweng · 12月2日

🦃 At the end of Thanksgiving holidays, I finally finished the piece on reward hacking. Not an easy one to write, phew. Reward hacking occurs when an RL agent exploits flaws in the reward function or env to maximize rewards without learning the intended behavior. This is imo a major blockers for real-world deployment of more autonomous use cases of AI models. Also would like to call out more research on mitigation strategies for reward hacking, especially in the context of LLMs and RLHF. 👉https://lilianweng.github.io/posts/2024-11-28-reward-hacking/

译🦃 感恩节假期结束时，我终于完成了关于 reward hacking 的文章。不好写啊，呼。

Lilian Weng@lilianweng · 10月14日

📢 We are hiring Research Scientists and Engineers for safety research at @OpenAI, ranging from safe model behavior training, adversarial robustness, AI in healthcare, frontier risk evaluation and more. Please fill in this form if you are interested: https://jobs.ashbyhq.com/openai/form/oai-safety-research

译📢 我们正在为 @OpenAI 的安全研究招聘研究科学家和工程师，涵盖安全模型行为训练、对抗鲁棒性、医疗 AI、前沿风险评估等多个方向。

Ilya Sutskever@ilyasut · 9月4日

Mountain: identified. Time to climb

译山峰：已确认。是时候攀登了

Lilian Weng@lilianweng · 8月9日

Iterative deployment for maximizing AI safety learning needs to be built on top of rigorous science and process. We are learning and improving through each launch.

译迭代部署以最大化 AI 安全学习需要建立在严谨的科学和流程之上。我们通过每次发布不断学习和改进。

Ilya Sutskever@ilyasut · 6月20日

I am starting a new company:

译我正在创办一家新公司：

Ilya Sutskever@ilyasut · 10月7日29

if you value intelligence above all other human qualities, you’re gonna have a bad time

译如果你把智力置于所有其他人类品质之上，你会过得很糟糕

Ilya Sutskever@ilyasut · 9月19日

Practical alignment work is both critically important and immediately impactful. Please consider applying:

译实用对齐工作既至关重要又立竿见影。请考虑申请：