RedAct: Redacting Agent Capability Traces to Protect Procedural Skills
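Source: arxiv.org
Users rely on execution traces to observe AI agent behavior and ensure accountability, but trace details can leak private procedural skills (formulas, thresholds, strategies). To address this, the study builds the CapTraceBench benchmark (75 long-horizon tasks, 154 cross-domain skills) to quantify the risk, and introduces the RedAct protection framework. The framework localizes key information, rewrites traces while preserving verifier evidence, and embeds behavioral watermarks for provenance. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7–67.1% on raw traces to below the no-skill baseline while preserving audit evidence. Its behavioral watermarks reach a true positive rate of 93.6–100% with a false alarm rate of at most 1.9%. The results indicate that selective redaction can reduce procedural capability leakage without removing audit evidence.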
Users rely on execution traces to observe agent behavior, diagnose failures, and ensure accountability. These traces contain rich procedural detail, including tool invocations, intermediate decisions, and error-recovery logic. Yet this detail can expose private procedural skills, allowing downstream methods to recover key formulas, thresholds, and strategies without access to model weights or skill files. To quantify this risk and evaluate protection, we construct CapTraceBench, a benchmark of 75 specialized long-horizon tasks and 154 curated skills across seven domains. We also introduce RedAct (https://github.com/XuShuwenn/RedAct), a protected trace release framework that localizes protected key information, rewrites traces while preserving verifier-critical evidence, and embeds behavioral watermarks for downstream provenance analysis. Across representative trace reuse methods, RedAct reduces normalized skill transfer (NST) from 44.7–67.1% on raw traces to below the no-skill baseline, while preserving audit evidence. Its standalone behavioral watermarks reach 93.6–100.0% true detection with a false alarm rate of at most 1.9%. These results frame public agent traces as security interfaces and show that selective redaction can reduce procedural capability leakage without removing audit evidence.
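The abstract does not define normalized skill transfer (NST), so the following is only a sketch under an assumed definition: the share of the gap between a no-skill baseline and a skill-equipped agent that a downstream method recovers from released traces. Under that assumption, a negative value corresponds to "below the no-skill baseline". The function names, the formula, and the example numbers are hypothetical and are not taken from the paper or the RedAct repository. The second helper shows how true detection and false alarm rates for a watermark detector are conventionally computed from per-trace decisions.

```python
from typing import Sequence, Tuple


def normalized_skill_transfer(
    score_with_trace: float,
    score_no_skill: float,
    score_with_skill: float,
) -> float:
    """Assumed NST: fraction of the skill gain recovered from traces.

    0.0 means no better than the no-skill baseline, 1.0 means matching the
    skill-equipped agent, and negative values fall below the baseline.
    This definition is an assumption, not the paper's stated formula.
    """
    gap = score_with_skill - score_no_skill
    if gap <= 0:
        raise ValueError("skill-equipped score must exceed the no-skill baseline")
    return (score_with_trace - score_no_skill) / gap


def detection_rates(
    flags_on_watermarked: Sequence[bool],
    flags_on_clean: Sequence[bool],
) -> Tuple[float, float]:
    """Return (true detection rate, false alarm rate) from detector decisions."""
    if not flags_on_watermarked or not flags_on_clean:
        raise ValueError("both decision lists must be non-empty")
    true_detection = sum(flags_on_watermarked) / len(flags_on_watermarked)
    false_alarm = sum(flags_on_clean) / len(flags_on_clean)
    return true_detection, false_alarm


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    print(normalized_skill_transfer(0.6, 0.4, 0.8))  # raw-trace scenario
    print(normalized_skill_transfer(0.3, 0.4, 0.8))  # redacted-trace scenario
    print(detection_rates([True, True, True, False], [False] * 4))
```

Normalizing by the skill gap makes values comparable across tasks with different baseline difficulty, which is one plausible reason to report transfer this way.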