FAPO:多步LLM管道的全自主提示优化框架
阅读原文· arxiv.orgFAPO是一个让Claude Code在标准化代码库内自动优化多步LLM管道的框架。它评估管道、检查中间步骤、诊断失败、提出范围性更改并反复验证,优先尝试提示编辑,仅当提示优化不足且归因识别出结构瓶颈时才调整链结构。在6个基准和3个任务模型上,FAPO在18个模型-基准比较中15次击败基线GEPA,平均增益+14.1pp;其中11次比较中均值±标准差范围不重叠。在HoVer和IFBench上,提示优先搜索升级为结构变化的6次比较中FAPO全胜,平均增益+33.8pp。安全任务上,仅提示版FAPO在CTIBench-RCM上将GPT-5测试准确率提升+4.0pp,Foundation-Sec-8B-Instruct提升+7.1pp,Foundation-Sec-8B-Reasoning提升+2.0pp。
Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It first tries prompt edits and, only when prompt optimization appears insufficient, changes chain structure within the permitted scope when attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean pm trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.