# Task-Oriented Memorization Policy Learning for Multimodal Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmpunpmki05jnslag8exvrrqd
- Original link: https://arxiv.org/abs/2605.31075

## AI Summary

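Long-term memory for multimodal agents faces a core challenge: deciding what to memorize. To address it, the researchers propose TaskMem, a reinforcement-learning-based framework for learning a memorization policy. It uses a two-phase training paradigm: Phase One learns how to memorize so that memory quality is ensured, and Phase Two, after deployment, learns what to memorize based on the concrete tasks encountered. The method is built on Qwen3-VL-30B-A3B. With VideoMME, EgoLife, and EgoTempo reformulated as streaming benchmarks, it improves VQA accuracy by 6.3%, 7.0%, and 5.3%, respectively.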

## Main Text

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.
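The streaming, memory-only evaluation setting described above can be illustrated with a toy sketch. This is not TaskMem's implementation: the class `StreamingMemoryAgent`, the function `run_stream`, the fixed memory `capacity`, the hand-written policies, and the keyword lookup standing in for VQA are all assumptions made for illustration. It only shows why the choice of what to memorize matters when observations stream in and questions arrive online.

```python
from collections import deque
from typing import Callable, Optional

# A policy maps an observation to a memory entry, or None to skip it.
# Here it is a hand-written rule; in the paper the policy is learned.
Policy = Callable[[str], Optional[str]]


class StreamingMemoryAgent:
    """Toy agent with a bounded memory (hypothetical, for illustration)."""

    def __init__(self, policy: Policy, capacity: int) -> None:
        self.policy = policy
        # Oldest entries are evicted once the capacity is reached.
        self.memory = deque(maxlen=capacity)

    def observe(self, observation: str) -> None:
        entry = self.policy(observation)
        if entry is not None:
            self.memory.append(entry)

    def answer(self, keyword: str) -> bool:
        # Answers come from memory only; raw observations are never re-read.
        return any(keyword in entry for entry in self.memory)


def run_stream(agent: StreamingMemoryAgent, events) -> float:
    """Feed interleaved observations and tasks; return task accuracy."""
    correct = total = 0
    for kind, payload in events:
        if kind == "obs":
            agent.observe(payload)
        else:  # "task": payload is (keyword, expected_answer)
            keyword, expected = payload
            correct += agent.answer(keyword) == expected
            total += 1
    return correct / total if total else 0.0


def keep_all(observation: str) -> Optional[str]:
    return observation


def kitchen_only(observation: str) -> Optional[str]:
    # Stand-in for a task-focused policy: keep only task-relevant content.
    return observation if observation.startswith("kitchen") else None


EVENTS = [
    ("obs", "kitchen: keys on counter"),
    ("obs", "hallway: poster on wall"),
    ("obs", "street: red car passes"),
    ("obs", "street: bus passes"),
    ("task", ("keys", True)),
    ("obs", "kitchen: kettle boiling"),
    ("task", ("kettle", True)),
]

if __name__ == "__main__":
    print(run_stream(StreamingMemoryAgent(keep_all, capacity=2), EVENTS))
    print(run_stream(StreamingMemoryAgent(kitchen_only, capacity=2), EVENTS))
```

With a memory budget of two entries, the keep-everything policy has already evicted the keys observation by the time the first question arrives, while the selective policy still holds it. A learned policy would replace the hard-coded `kitchen_only` rule with one shaped by rewards derived from recent tasks.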
