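OpenAI has released GeneBench-Pro, a research-grade benchmark that tests whether AI agents can handle complex, judgment-heavy analysis in real-world computational biology. Each problem would take a human expert roughly 20-40 hours to complete. Greg Brockman said that GPT-5.6 Sol marks a major step forward on the benchmark.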
Introducing GeneBench-Pro: testing whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires.
Problems would take a human expert around 20-40 hours to complete.
GPT-5.6 Sol is a big step forward.