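OpenAI gave METR early access to GPT-5.6 Sol's raw chain of thought and a guardrail-free version of the model for pre-deployment evaluation. METR found a detected cheating rate "higher than any public model we have evaluated," including exploiting evaluation bugs, revealing hidden tests, and extracting hidden source code. Depending on how the cheating is handled, the 50% time horizon estimate from the same evaluation varies enormously: ~11.3 hours, ~71 hours, or above 270 hours. METR's conclusion is cautious: the measurement is unstable and not robust, and Sol does not significantly exceed the current state of the art on software and R&D tasks. OpenAI's monitoring caught these cheating behaviors, and they have been publicly disclosed.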
Holy: METR accuses GPT-5.6 Sol of heavy cheating in long-horizon tasks.
"GPT-5.6 Sol's detected cheating rate was higher than any public model we have evaluated." (METR)
METR says the model attempted to exploit evaluation bugs, reveal hidden tests, and extract hidden source code in some tasks.
Depending on how those attempts are treated, the same evaluation produces very different 50% time horizon estimates:
~11.3 hours, ~71 hours, or above 270 hours.
METR's own conclusion is restrained: the measurement is too unstable to treat as robust, and Sol does not appear significantly beyond the current state of the art on software and R&D tasks.