# OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-27 08:00
- AIHOT score: 54
- AIHOT link: https://aihot.virxact.com/items/cmpq8vkk102teslnobxgyfj0x
- Original link: https://arxiv.org/abs/2605.28158

## AI Summary

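OR-Space is a full-lifecycle workspace benchmark for industrial optimization agents. It evaluates whether agents can optimize reliably in persistent multi-artifact workspaces and across multi-stage tasks. The benchmark defines three task modes: Build, which constructs solver models from heterogeneous artifacts; Revise, which modifies existing models to meet new requirements; and Explain, which answers questions about solutions using evidence from the workspace. By combining persistent workspaces with lifecycle-oriented tasks, it assesses whether agents can perform reliable optimization work beyond end-to-end text generation.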

## Main Text

Large language model (LLM) agents are increasingly used to assist with operations research (OR) modeling, yet existing OR-oriented benchmarks often reduce evaluation to one-shot translation from a self-contained problem statement into a mathematical formulation or solver program. Such settings abstract away two characteristics of real industrial OR workflows: persistent multi-artifact workspaces and multi-stage task lifecycles. We introduce OR-Space, a full-lifecycle workspace benchmark for evaluating industrial optimization agents across model construction, model revision, and grounded explanation. Each instance is an executable workspace containing business documents, structured data, optional code artifacts, solver outputs, and task-specific evaluators distributed across interdependent files. OR-Space defines three task modes: Build, where agents construct solver-ready optimization models from heterogeneous artifacts; Revise, where agents modify existing models under changing requirements or solver feedback while preserving valid prior logic; and Explain, where agents answer grounded questions about solutions, constraints, and business implications using evidence spread across workspace artifacts. By combining persistent workspaces with lifecycle-oriented tasks, OR-Space evaluates whether agents can perform reliable optimization work beyond end-to-end text generation. We describe the benchmark design, evaluation protocol, and quality-control pipeline, and position OR-Space as a benchmark for studying the reliability, failure modes, and practical readiness of LLM agents in industrial OR workflows.
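To make the workspace-plus-task-mode structure described above more concrete, here is a minimal Python sketch of how one instance might be represented. The abstract does not specify a schema, so every name here (`TaskMode`, `ArtifactKind`, `WorkspaceInstance`, `ASSUMED_REQUIRED`, `missing_kinds`) and every file path is hypothetical. The mapping from task mode to prerequisite artifacts is an illustrative assumption and not the benchmark's actual rule.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskMode(Enum):
    """The three task modes named in the abstract."""
    BUILD = "build"
    REVISE = "revise"
    EXPLAIN = "explain"


class ArtifactKind(Enum):
    """Artifact categories listed in the abstract (the enum itself is hypothetical)."""
    BUSINESS_DOC = "business_doc"
    STRUCTURED_DATA = "structured_data"
    CODE = "code"  # described as optional in the abstract
    SOLVER_OUTPUT = "solver_output"
    EVALUATOR = "evaluator"


@dataclass
class WorkspaceInstance:
    """Hypothetical record for one benchmark instance: a task mode plus its files."""
    instance_id: str
    mode: TaskMode
    artifacts: dict[str, ArtifactKind] = field(default_factory=dict)

    def paths_of(self, kind: ArtifactKind) -> list[str]:
        """Return the sorted relative paths of all artifacts of the given kind."""
        return sorted(p for p, k in self.artifacts.items() if k is kind)


# Assumption for illustration only: which artifact kinds each mode needs as input.
ASSUMED_REQUIRED = {
    TaskMode.BUILD: {ArtifactKind.BUSINESS_DOC, ArtifactKind.STRUCTURED_DATA},
    TaskMode.REVISE: {ArtifactKind.CODE},
    TaskMode.EXPLAIN: {ArtifactKind.SOLVER_OUTPUT},
}


def missing_kinds(ws: WorkspaceInstance) -> set[ArtifactKind]:
    """Return the assumed-required artifact kinds that are absent from the workspace."""
    present = set(ws.artifacts.values())
    return ASSUMED_REQUIRED[ws.mode] - present


# Hypothetical Revise instance: an existing model file plus its supporting inputs.
revise_ws = WorkspaceInstance(
    instance_id="demo-001",
    mode=TaskMode.REVISE,
    artifacts={
        "docs/requirements.md": ArtifactKind.BUSINESS_DOC,
        "data/demand.csv": ArtifactKind.STRUCTURED_DATA,
        "model/plan.py": ArtifactKind.CODE,
    },
)

# Hypothetical Explain instance with no solver output recorded yet.
explain_ws = WorkspaceInstance(
    instance_id="demo-002",
    mode=TaskMode.EXPLAIN,
    artifacts={"docs/requirements.md": ArtifactKind.BUSINESS_DOC},
)
```

Under these assumptions, `missing_kinds(revise_ws)` is empty, while `missing_kinds(explain_ws)` flags the absent solver output. Keeping the task mode separate from the artifact set reflects the abstract's point that the same persistent workspace can serve different lifecycle stages.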