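Using its composite metric, the Epoch Capabilities Index, Epoch AI finds that open-weight models trail closed models in capability by about three months on average. The author of the main post is skeptical, however: he argues that open-weight large language models are more fragile in practice, especially on out-of-distribution tasks, than their benchmark scores suggest, and that the gap as actually experienced may be considerably larger than three or four months.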
I think Epoch does a great job benchmarking, but I continue to believe that open weights models are much more fragile, especially out-of-distribution, than their benchmarks indicate. Vibe-wise, I don't think they were only 3 months behind last year or only 4 months behind today.