RepoRescue: An Empirical Study of Whole-Repository Compatibility Rescue by LLM Agents
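Source: arxiv.org
RepoRescue studies whether LLM agents can adapt old repositories to new environments. It builds a benchmark from 193 Python and 122 Java repositories, each of which passes in its original environment and fails after modernization, and evaluates five agent systems on Python and three on Java. Claude Code sometimes edits failing tests; under runtime blocking, Kimi still rescues 41.5% of repositories. The union of all systems rescues 62.7%, exceeding the best single system by 10.9 percentage points. On the 14 repositories that require coordinated whole-codebase changes, GPT-5.2 through Codex passes all 14, while each Claude Code system passes at most 2. Passing tests is only an initial signal: of 34 unmaintained Python candidate repositories, 22 work in realistic scenarios and 12 pass bug-hunt.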
Open-source libraries and tools are widely reused, but compatibility maintenance is expensive. Once maintainers leave, useful repositories can stop working as runtimes and dependencies evolve. We study whether LLM agents can adapt old repositories to modern environments, a task we call compatibility rescue. Unlike bug repair, compatibility rescue starts from a repository that worked in its original environment but fails after ecosystem drift. RepoRescue gives agents only the repository and its failing modern environment; the agent must diagnose the failure, locate affected code, and produce a source-code rescue that restores the historical test suite. We build RepoRescue from 193 Python and 122 Java repositories, each verified to pass historically and fail after modernization. We evaluate five deployed agent systems on Python and three on Java. Beyond full-patch pass rate, we rerun patches after removing test-file edits to measure source-only repair, add a runtime-enforced regime that blocks test edits, and validate practical use for repositories whose suites pass after rescue. We find that Claude Code systems sometimes edit failing tests even when prompted not to; with runtime blocking, Kimi still rescues 41.5% of repositories. Systems are complementary: their union reaches 62.7%, exceeding the best single system by 10.9 points. Difficulty concentrates in cross-file coordination: on 14 repositories requiring coordinated whole-codebase changes, GPT-5.2 through Codex passes all 14, while every Claude Code system passes at most two. Finally, a passing suite is only an initial signal: among 34 unmaintained Python candidates whose suites pass after rescue, 22 work in realistic scenarios and 12 pass bug-hunt with patches that address the compatibility failure. RepoRescue benchmarks compatibility rescue with source-only auditing, runtime enforcement, practical validation, and reasoning labels.
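The source-only audit described above (rerunning a patch after removing its test-file edits) can be illustrated with a minimal sketch. This is not the paper's tooling: the function names and the path heuristic for recognizing test files are hypothetical, and the sketch assumes a git-style unified diff whose paths contain no spaces.

```python
import re
from typing import List

# Hypothetical heuristic for "test file" paths; the abstract does not
# say how RepoRescue identifies test files.
TEST_PATH = re.compile(
    r"(^|/)(tests?/|test_[^/]*\.py$|[^/]*_test\.py$|conftest\.py$)"
)


def is_test_path(path: str) -> bool:
    """Return True if the path looks like a test file or test directory."""
    return TEST_PATH.search(path) is not None


def split_patch(patch: str) -> List[str]:
    """Split a git-style unified diff into one chunk per changed file."""
    chunks = re.split(r"(?m)^(?=diff --git )", patch)
    return [c for c in chunks if c.startswith("diff --git ")]


def target_path(chunk: str) -> str:
    """Read the post-change path from a 'diff --git a/X b/Y' header line."""
    header = chunk.splitlines()[0]
    return header.split(" b/", 1)[1]


def strip_test_edits(patch: str) -> str:
    """Keep only the per-file chunks that touch non-test source files."""
    return "".join(
        chunk
        for chunk in split_patch(patch)
        if not is_test_path(target_path(chunk))
    )
```

Applying the filtered patch to a clean checkout and rerunning the historical suite would then show whether the rescue holds on source changes alone. A runtime-enforced regime is a different mechanism, since it blocks test edits while the agent runs rather than filtering them afterwards.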