# Towards Evaluation Engineering: An Empirical Study of ML Evaluation Harnesses in the Wild

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-22 08:00
- AIHOT score: 58
- AIHOT link: https://aihot.virxact.com/items/cmpn51h3y0uirsl011twele8r
- Original link: https://arxiv.org/abs/2605.24213

## AI Summary

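The researchers empirically analyzed 57 machine learning evaluation harnesses, proposed a five-stage harness model, and classified 16,560 operational issues. Most challenges concentrate in the Specification stage, which accounts for 41.4% of issues. Three root causes (unimplemented features, documentation gaps, and missing input validation) together account for 61.7% of classified issues. Root causes differ by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, while algorithmic error and validation gaps dominate assessment issues.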

## Main Text

Evaluation harnesses are software systems that orchestrate model evaluation by managing model invocation, data loading, metric computation, and result reporting. Despite their critical role in machine learning infrastructure, their operational challenges and engineering concerns have received limited attention so far. We present an empirical study of 57 evaluation harnesses, deriving a five-stage harness model and classifying 16,560 issues by workflow stage and root cause. Most harness operational challenges concentrate in the Specification stage (41.4% of issues), where harnesses integrate external models, datasets, and scoring judges. The three most frequent root causes of operational challenges are unimplemented features (24.3%), documentation gaps (20.3%), and missing input validation (17.2%), which together account for 61.7% of classified issues, spanning both defects in existing functionality and capability gaps that block intended workflows. Root causes also vary by workflow stage: environment incompatibility and external dependency breakage account for 36.2% of provisioning issues, whereas algorithmic error (25.9%) and validation gap (22.5%) dominate assessment issues. Together, these contributions establish an empirical foundation for treating evaluation engineering as a distinct software engineering concern.
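To make the two-way classification concrete, here is a minimal sketch of how issues labeled by workflow stage and root cause could be tallied into percentage shares like those reported above. The toy records and the function names `share_by_stage` and `cause_share_within_stage` are hypothetical, invented for illustration only. They are not the study's data or tooling. Only the three stage names mentioned in the abstract are used.

```python
from collections import Counter

# Hypothetical toy labels as (stage, root_cause) pairs; NOT data from the study.
issues = [
    ("specification", "unimplemented feature"),
    ("specification", "documentation gap"),
    ("specification", "missing input validation"),
    ("provisioning", "environment incompatibility"),
    ("provisioning", "external dependency breakage"),
    ("provisioning", "documentation gap"),
    ("assessment", "algorithmic error"),
    ("assessment", "validation gap"),
]


def share_by_stage(labeled):
    """Return {stage: percent of all issues}, rounded to one decimal place."""
    counts = Counter(stage for stage, _ in labeled)
    total = sum(counts.values())
    return {stage: round(100 * n / total, 1) for stage, n in counts.items()}


def cause_share_within_stage(labeled, stage):
    """Return {root_cause: percent of the issues in the given stage}."""
    counts = Counter(cause for s, cause in labeled if s == stage)
    total = sum(counts.values())
    return {cause: round(100 * n / total, 1) for cause, n in counts.items()}


if __name__ == "__main__":
    print(share_by_stage(issues))
    print(cause_share_within_stage(issues, "provisioning"))
```

The first function corresponds to a statement of the form "X% of issues fall in stage S". The second corresponds to a statement of the form "root cause C accounts for Y% of stage S issues", where the denominator is the stage's issue count rather than the overall total.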