Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
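Source: arxiv.org
Researchers have released Stargazer, a benchmark environment for evaluating how well AI agents fit physical models to radial-velocity time series data. The environment contains 120 tasks (including 20 real archival cases) across three difficulty tiers, covering scenarios from single-planet systems to complex multi-planet systems. Tests of eight frontier agents show that, although the agents can achieve good statistical fits, they frequently fail to recover the correct physical parameters. Increasing test-time compute brings only marginal gains, and excessive token usage often reflects recursive failure loops rather than effective exploration.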
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative, physics-grounded model-fitting tasks based on inference from radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from single-planet systems with a high signal-to-noise ratio (SNR) to complex multi-planetary configurations requiring intricate low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover the correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology for designing a simulation-driven environment for AI agents is expected to generalize to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/Gudmorning2025/Stargazer and https://gudmorning2025.github.io/Stargazer, respectively.
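The gap between a good statistical fit and correct physical parameters can be illustrated with a toy radial-velocity example. The sketch below is not Stargazer's task format or scoring code. It assumes a single planet on a circular orbit, uniform measurement errors, and exactly one observation per night. All names (`fit_circular_orbit`, `period_recovered`, `TRUE_PERIOD`, the 1% tolerance) are hypothetical. Under these assumptions, the true period and its one-day sampling alias give essentially the same chi-square, so fit quality alone cannot tell the correct orbit from the wrong one.

```python
import math
import random


def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    n = 3
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_circular_orbit(times, rvs, sigma, period):
    """Fit v(t) = a*sin(wt) + b*cos(wt) + gamma at a fixed trial period.

    The model is linear in (a, b, gamma), so ordinary least squares suffices.
    Returns (K, gamma, chi2), where K = sqrt(a^2 + b^2) is the semi-amplitude.
    """
    w = 2.0 * math.pi / period
    basis = [(math.sin(w * t), math.cos(w * t), 1.0) for t in times]
    A = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * v for r, v in zip(basis, rvs)) for i in range(3)]
    a_sin, a_cos, gamma = solve3(A, b)
    model = [a_sin * r[0] + a_cos * r[1] + gamma for r in basis]
    chi2 = sum(((v - m) / sigma) ** 2 for v, m in zip(rvs, model))
    return math.hypot(a_sin, a_cos), gamma, chi2


def period_recovered(estimate, truth, rel_tol=0.01):
    """Hypothetical physical check: is the period within 1% of the truth?"""
    return abs(estimate - truth) / truth <= rel_tol


# Synthetic data (all values are illustrative assumptions).
TRUE_PERIOD = 3.7  # days
TRUE_K = 12.0      # m/s
SIGMA = 1.0        # m/s, uniform measurement error
rng = random.Random(0)
times = [float(t) for t in range(40)]  # one observation per night
rvs = [
    TRUE_K * math.sin(2.0 * math.pi * t / TRUE_PERIOD + 0.4) + 5.0
    + rng.gauss(0.0, SIGMA)
    for t in times
]

# With exactly daily sampling, frequency f and 1 - f are indistinguishable.
ALIAS_PERIOD = 1.0 / (1.0 - 1.0 / TRUE_PERIOD)

k_true, gamma_true, chi2_true = fit_circular_orbit(times, rvs, SIGMA, TRUE_PERIOD)
k_alias, gamma_alias, chi2_alias = fit_circular_orbit(times, rvs, SIGMA, ALIAS_PERIOD)

print(f"true period  {TRUE_PERIOD:.4f} d: chi2 = {chi2_true:.3f}")
print(f"alias period {ALIAS_PERIOD:.4f} d: chi2 = {chi2_alias:.3f}")
print("alias passes physical check:", period_recovered(ALIAS_PERIOD, TRUE_PERIOD))
```

Real archival data are not sampled this regularly, so aliases are usually weaker than in this idealized case. The point of the sketch is only that a physical-parameter check is a separate test from a chi-square threshold, which is the distinction the evaluation above draws.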