Bayesian-Agent:基于后验引导的技能演化框架
阅读原文· arxiv.orgBayesian-Agent是一个原生跨框架,将可复用的技能和SOP视为关于冻结LLM在特定提示、上下文和环境下能否成功的后验假设。它记录已验证的轨迹证据,维护基于特征条件的分类后验,并将后验状态映射为补丁、拆分、压缩、退役和探索等可检查操作。使用deepseek-v4-flash,该方法使SOP-Bench从80%提升至95%,Lifelong AgentBench从90%提升至100%,RealFin-Bench从45%提升至65%。评估覆盖原生后端及GenericAgent、mini-swe-agent、Claude Code等可选后端,结果包含正、负、饱和及案例研究。源代码已开源。
LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80\% to 95\%, Lifelong AgentBench from 90\% to 100\%, and RealFin-Bench from 45\% to 65\%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.