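AI Summary
The new benchmark MirrorCode shows that Claude Opus 4.6 can reimplement a 16,000-line bioinformatics toolkit, a task estimated to take a human engineer weeks. Co-developed with @METR_Evals.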
What are the largest software engineering tasks AI can perform?
In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit - a task we believe would take a human engineer weeks.
Co-developed with @METR_Evals. Details in thread.