# MementoGUI: Learned Multimodal Memory Control for Long-Horizon GUI Agents

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-18 08:00
- AIHOT score: 60
- AIHOT link: https://aihot.virxact.com/items/cmpcl0r060383slaemh1rw8sz
- Original paper: https://arxiv.org/abs/2605.18652

## AI Summary

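Existing GUI agents are brittle on long-horizon tasks because their memory mechanisms are inadequate. To address this, the paper proposes MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller that performs online memory selection, compression, and retrieval without finetuning the backbone model. The framework formulates long-horizon interaction as an online memory-control problem: working memory preserves textual summaries and visual evidence, while episodic memory retrieves reusable past trajectories. MementoCore modularizes memory control into four specialized operators, and the authors also develop a corresponding data curation pipeline and an evaluation benchmark. Experiments show that the framework consistently improves agent performance across multiple benchmarks.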

## Main Text

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level visual evidence, while episodic memory retrieves reusable past trajectories through learned relevance selection. MementoCore modularizes memory control into specialized operators for step processing, memory compression, episodic writing, and episodic selection, enabling plug-in memory augmentation without finetuning the GUI agent backbone. We further develop a scalable data curation pipeline that converts computer-use trajectories into memory-controller training data, introduce MementoGUI-Bench for evaluating long-horizon decision-making in GUI agents, and design MLLM-based metrics for semantic action matching, task progress, and memory consistency. Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI consistently improves GUI agents over no-history, history-replay, and text-only memory baselines, with larger MementoCore backbones further strengthening memory-augmented GUI control.
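
The four operators named in the abstract (step processing, memory compression, episodic writing, episodic selection) can be pictured with a toy controller. The sketch below is illustrative only and is not the paper's implementation: the class `ToyMemoryController`, its method names, the `capacity` parameter, and the word-overlap score are all hypothetical stand-ins, and the learned MementoCore decisions are replaced by simple heuristics.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical ROI format: (x0, y0, x1, y1)


@dataclass
class MemoryEntry:
    step: int
    summary: str                 # textual summary of an interface event
    roi: Optional[Box] = None    # ROI-level visual evidence, if any


@dataclass
class Episode:
    task: str
    summaries: List[str]


def word_overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a toy stand-in for learned relevance selection."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


class ToyMemoryController:
    """Hypothetical sketch of the four operators; heuristics replace learning."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        self.working: List[MemoryEntry] = []
        self.episodic: List[Episode] = []

    def process_step(self, step: int, summary: str,
                     roi: Optional[Box] = None, relevant: bool = True) -> None:
        """Step processing: keep only events flagged as task-relevant."""
        if not relevant:
            return
        self.working.append(MemoryEntry(step, summary, roi))
        if len(self.working) > self.capacity:
            self.compress()

    def compress(self) -> None:
        """Memory compression: merge the two oldest entries into one."""
        first, second = self.working[0], self.working[1]
        merged = MemoryEntry(
            step=second.step,
            summary=f"{first.summary}; {second.summary}",
            roi=second.roi or first.roi,  # keep the more recent ROI if present
        )
        self.working = [merged] + self.working[2:]

    def write_episode(self, task: str) -> None:
        """Episodic writing: store the finished trajectory, reset working memory."""
        self.episodic.append(Episode(task, [e.summary for e in self.working]))
        self.working = []

    def select_episodes(self, task: str, k: int = 1) -> List[Episode]:
        """Episodic selection: return up to k past episodes with nonzero overlap."""
        scored = [(word_overlap(task, ep.task), ep) for ep in self.episodic]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ep for _, ep in scored[:k]]
```

In this sketch an agent loop would call `process_step` after each action, `write_episode` when a task ends, and `select_episodes` at the start of a new task to fetch reusable trajectories. The backbone GUI agent itself is never modified, which mirrors the plug-in design described above.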
