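swyx cites OpenAI researcher Noam Brown's view that any eval report should hold the inference budget constant. Because open models yield far more tokens per dollar than closed model APIs, anyone releasing an open model should report thinking levels by dollar cost on popular inference providers rather than by token count. The point comes from @saranormous's podcast with Noam Brown, where they discussed the consequences of large-scale test-time compute, such as a model given a $10 million budget for a single task, and explored benchmark breakdown, scaling compute budgets, capability growing with spend, and safety.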
An interesting way to take Noam at his word about always keeping a constant inference budget for any eval reporting: open models get a lot more tokens per dollar than closed model APIs. So anyone launching an open model today, or otherwise situationally incentivized toward open models, should obviously report thinking levels with dollars of inference on popular inference providers on the x axis, instead of number of tokens.
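A minimal sketch of what re-basing the x axis from tokens to dollars could look like. Everything here is an assumption for illustration: the `Pricing` type, both helper functions, and especially the per-million-token prices, which are placeholders rather than quotes from any real provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (placeholders, not real quotes)."""
    input_per_mtok: float
    output_per_mtok: float


def dollars_spent(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Convert a token count into inference cost, i.e. the proposed x-axis value."""
    cost = input_tokens * pricing.input_per_mtok + output_tokens * pricing.output_per_mtok
    return cost / 1_000_000


def output_tokens_for_budget(pricing: Pricing, budget_usd: float, input_tokens: int) -> int:
    """How many output (thinking) tokens a fixed dollar budget buys after paying for the prompt."""
    remaining = budget_usd - input_tokens * pricing.input_per_mtok / 1_000_000
    if remaining <= 0:
        return 0
    return int(remaining * 1_000_000 / pricing.output_per_mtok)


# Assumed prices: the open model served by a third-party provider is 10x cheaper per token.
open_model = Pricing(input_per_mtok=0.5, output_per_mtok=2.0)
closed_model = Pricing(input_per_mtok=5.0, output_per_mtok=20.0)

if __name__ == "__main__":
    budget = 1.0  # constant inference budget per task, in USD
    for name, pricing in [("open", open_model), ("closed", closed_model)]:
        tokens = output_tokens_for_budget(pricing, budget, input_tokens=0)
        print(f"{name}: ${budget:.2f} buys {tokens} thinking tokens")
```

Under these made-up prices, the same dollar budget buys the open model ten times as many thinking tokens, which is the gap a token-count x axis would hide.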