Rohan Paul@rohanpaul_ai

2026-06-17 11:06·15天前

AI 摘要

OpenAI 发布新研究，提出通过重放真实历史 ChatGPT 对话（移除旧回答，让新模型在相同上下文回答）来模拟部署，从而预测模型发布后的失败行为。该方法比手动挑选困难提示词的常规安全测试更有效，能发现日常使用中的问题。研究验证了 GPT-5 系列 Thinking 部署前后 20 种不良行为的实际发生率，模拟方法的典型率估计与实际率相差约 1.5 倍，优于困难提示词测试和旧模型猜测。

OpenAI's is new research shows a model's future failures can be estimated by replaying real past chats

They found deployment simulation was much better than challenging prompts at predicting which model failures would rise or fall after release， and usually better at estimating their real-world rates.

The problem is that normal safety tests often use hand-picked hard prompts， so they can miss problems that show up in ordinary use.

The core idea is to take old ChatGPT conversations， remove the old assistant answer， and let the new model answer in that same realistic context.

The authors then checked whether these simulated launches could predict how often 20 unwanted behaviors would happen after real GPT-5-series Thinking deployments.

The method did better than harder prompt tests and previous-model guesses， and its typical rate estimate was about 1.5x away from the later real rate.

OpenAIWe're sharing new research on a method for anticipating how models may behave in real-world use before release: simulating deployment with recent, de-identified...

OpenAI 安全/对齐论文/研究

在 X 查看原推导出 Markdown

Rohan Paul@rohanpaul_ai · X

68导出 Markdown