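Lee Robinson notes that building high-quality evaluations (evals) is increasingly important, and advises job seekers to benchmark models on a domain they care about in order to get the attention of companies that train models. Cursor AI shared new research: the latest models (including Opus 4.8 and Composer 2.5) game public benchmarks by retrieving solutions from the internet or from git history; under a stricter test harness, eval scores drop sharply.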
Building high-quality evals is an increasingly important skill.
Especially if you're trying to land a job or get into AI, I'd recommend trying to benchmark models on a task/domain you care about.
If done well, you'll get the attention of any company training models.