# RAMP: Runtime Assessment Infrastructure for Agentic Models in Production Systems

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpz4oprn008sslkpuh19o2w5
- Original link: https://arxiv.org/abs/2605.27492

## AI Summary

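RAMP is a production-grade runtime assessment infrastructure, built on the YatCC platform, for evaluating long-horizon software engineering agents. It provides a unified assessment architecture through standardized interfaces, introduces compiler-construction workloads with serial dependencies and complex toolchain interactions, uses a staged recovery mechanism to analyze execution behavior under partial failure, and applies utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. An evaluation of 15 mainstream models reveals capability degradation that conventional static benchmarks cannot detect: in serial workflows the task completion rate falls from 100% in the initial stage to 20% in the final stage, and no model completes the entire pipeline. Computational costs differ by up to three orders of magnitude among comparable models. RAMP pushes evaluation toward continuous, runtime-observable, production-oriented assessment.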

## Full Text

LLM agents are rapidly evolving from coding assistants into autonomous software engineering systems. However, existing evaluation methodologies remain largely centered on static, isolated, and short-horizon benchmarks that fail to capture the dynamic complexity of real-world production workflows. As a result, benchmark performance may poorly reflect practical capability under realistic runtime environments involving long execution chains, tool interactions, dependency management, and iterative feedback loops. We thus present RAMP, a production-grounded infrastructure for assessing long-horizon software engineering agents. Built upon the YatCC integrated platform, RAMP provides a unified runtime assessment architecture through standardized orchestration and execution interfaces. RAMP introduces realistic compiler-construction workloads with serial dependencies and complex toolchain interactions, together with a staged recovery mechanism for analyzing execution behavior under partial workflow failure. The framework further incorporates utility-oriented multi-dimensional metrics that jointly evaluate outcome quality and process efficiency. We conduct runtime assessments across 15 mainstream models and observe substantial capability degradation that remains largely invisible to conventional isolated benchmarks. Task completion rates progressively collapse across serial workflows, dropping from 100% in the initial stage to only 20% in the final stage, while none of the evaluated models successfully completes the entire pipeline. Runtime analysis reveals systematic failure propagation and significant resource inefficiencies, with computational costs differing by up to three orders of magnitude among comparable models. These findings suggest RAMP advances agentic model evaluation toward continuous, runtime-observable, and production-grounded assessment.
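The per-stage completion rates and the staged recovery idea described above can be illustrated with a minimal Python sketch. It is not RAMP's implementation: the stage names, the `StageResult` record, the `run_pipeline` and `completion_rates` helpers, and the toy outcomes are all hypothetical, and the sketch assumes that "staged recovery" means restoring a reference checkpoint after a failed stage so that later stages can still be assessed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stage names for a serial compiler-construction workload.
STAGES = ["lexer", "parser", "codegen"]


@dataclass
class StageResult:
    stage: str
    passed: bool
    # True when the stage started from a reference checkpoint because the
    # previous stage failed (our assumed reading of "staged recovery").
    recovered: bool


def run_pipeline(stages: List[str], attempt: Callable[[str], bool]) -> List[StageResult]:
    """Run stages in order; after a failure, the next stage is marked as recovered."""
    results: List[StageResult] = []
    previous_passed = True
    for stage in stages:
        recovered = not previous_passed
        passed = attempt(stage)
        results.append(StageResult(stage, passed, recovered))
        previous_passed = passed
    return results


def completion_rates(runs: Dict[str, List[StageResult]], stages: List[str]) -> Dict[str, float]:
    """Fraction of models that pass each stage."""
    rates: Dict[str, float] = {}
    for index, stage in enumerate(stages):
        passed = sum(1 for results in runs.values() if results[index].passed)
        rates[stage] = passed / len(runs)
    return rates


def full_pipeline_successes(runs: Dict[str, List[StageResult]]) -> List[str]:
    """Models that pass every stage without relying on recovery."""
    return [
        name
        for name, results in runs.items()
        if all(r.passed and not r.recovered for r in results)
    ]


# Toy outcomes for illustration only; these are not results from the paper.
TOY_OUTCOMES = {
    "model_a": {"lexer": True, "parser": True, "codegen": False},
    "model_b": {"lexer": True, "parser": False, "codegen": False},
    "model_c": {"lexer": True, "parser": True, "codegen": False},
    "model_d": {"lexer": True, "parser": False, "codegen": True},
}

runs = {
    name: run_pipeline(STAGES, outcomes.__getitem__)
    for name, outcomes in TOY_OUTCOMES.items()
}
rates = completion_rates(runs, STAGES)
finishers = full_pipeline_successes(runs)
```

In this toy data, `model_d` passes the last stage only after recovery from a failed parser stage, so it contributes to the final-stage completion rate without counting as an end-to-end success. That distinction is one plausible way for a nonzero final-stage rate to coexist with no model completing the whole pipeline.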
