# EEVEE：面向真实世界的测试时提示学习框架

- 来源：HuggingFace Daily Papers（社区热门论文）
- 发布时间：2026-06-10 01:57
- AIHOT 分数：60
- AIHOT 链接：https://aihot.virxact.com/items/cmq7jdn1l03ocsl5wkcs7363a
- 原文链接：https://arxiv.org/abs/2606.11182

## AI 摘要

EEVEE是首个面向LLM智能体的多数据集测试时提示学习框架，用于在真实任务流下自改进。为解决跨数据集干扰，它引入路由器将异构输入流划分到任务簇并分配适配提示配置，并通过路由器‑提示协同进化策略（交替执行路由器和提示学习阶段）优化二者依赖。实验表明，EEVEE在保持单基准学习能力与效率的同时，提升异构数据流鲁棒性：平均多基准得分比Qwen3-4B-Instruct高10.38分，比DeepSeek-V3.2高24.32分，超越SOTA方法GEPA和ACE最高达37.2%和48.2%。

## 正文

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. This design is optimized via a router-prompt co-evolution strategy, which employs interleaved router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, surpassing SOTA methods GEPA and ACE by up to 37.2% and 48.2%.