If your benchmark relies on a static dataset or sampling from a static distribution densely known at training time, then it is fundamentally measuring memorization/retrieval. Which might be fine if you're looking for a retrieval benchmark! But don't confuse it with intelligence.
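
To make the point concrete, here is a minimal sketch, not taken from the original post. It assumes a hypothetical toy task (adding two integers) and a hypothetical "model" that is nothing more than a lookup table over the benchmark items. The names `make_task`, `memorizer`, and `accuracy` are illustrative only. The input space is deliberately far too large for 100 memorized items to cover densely, so the freshly drawn tasks are effectively unseen.

```python
import random

def make_task(rng):
    """Hypothetical task: add two integers drawn from a large range."""
    a, b = rng.randrange(10**6), rng.randrange(10**6)
    return (a, b), a + b

# A fixed benchmark, assumed to be fully visible at training time.
static_rng = random.Random(0)
static_benchmark = [make_task(static_rng) for _ in range(100)]

# A "model" that only memorizes: a lookup table over the benchmark items.
memory = {x: y for x, y in static_benchmark}

def memorizer(x):
    # Returns the stored answer, or None for any input it has not seen.
    return memory.get(x)

def accuracy(model, tasks):
    # Fraction of tasks where the model's output equals the true answer.
    return sum(model(x) == y for x, y in tasks) / len(tasks)

# Fresh tasks generated after "training", with a different seed.
fresh_rng = random.Random(1)
fresh_tasks = [make_task(fresh_rng) for _ in range(100)]

static_score = accuracy(memorizer, static_benchmark)
fresh_score = accuracy(memorizer, fresh_tasks)

print(f"static benchmark accuracy: {static_score:.2f}")
print(f"fresh task accuracy:       {fresh_score:.2f}")
```

Under these assumptions the lookup table scores perfectly on the static benchmark while having learned nothing about addition, which is why the static score alone says little about anything beyond retrieval.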