EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies
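Source: arxiv.org
EventVLA is an end-to-end robotic manipulation framework built around sparse visual evidence memory, which comprises foundational visual anchors and a dynamic Keyframe Evidence Memory (KEM) module. KEM predicts future keyframe probabilities directly from the VLA's latent embeddings and autonomously captures and stores task-critical visual events, addressing the failures of standard VLA models in long-horizon manipulation caused by occlusion or unobservability. The work also proposes RoboTwin-MeM, a diagnostic benchmark. Across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA's average success rate is 40% higher than that of state-of-the-art memory-augmented VLAs.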
Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods use historical context, they suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancy. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA improves the average success rate by 40% over state-of-the-art memory-augmented VLAs.
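To make the event-driven storage idea concrete, the following is a minimal sketch of how a keyframe probability could gate what enters a sparse memory. It is not the authors' implementation: the linear-plus-sigmoid head, the fixed threshold, the capacity limit, and every name (`keyframe_probability`, `SparseEvidenceMemory`, `observe`, `context`) are assumptions made for illustration. The abstract states only that KEM predicts keyframe probabilities from the VLA's latent embeddings and that visual anchors retain initial and short-term context.

```python
import math
from collections import deque


def keyframe_probability(embedding, weights, bias):
    """Hypothetical scoring head: a linear layer followed by a sigmoid.

    Stands in for whatever predictor maps a latent embedding to a
    keyframe probability; the real architecture is not given here.
    """
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


class SparseEvidenceMemory:
    """Illustrative memory: one initial anchor plus a bounded keyframe buffer."""

    def __init__(self, threshold=0.5, capacity=4):
        self.threshold = threshold               # assumed decision threshold
        self.initial_frame = None                # anchor: first observation
        self.keyframes = deque(maxlen=capacity)  # oldest keyframe evicted first

    def observe(self, step, frame, prob):
        """Record the first frame as an anchor; store frames scored as keyframes."""
        if self.initial_frame is None:
            self.initial_frame = frame
        if prob >= self.threshold:
            self.keyframes.append((step, frame))
            return True
        return False

    def context(self, current_frame):
        """Frames a policy would see: anchor, stored keyframes, current frame."""
        return [self.initial_frame, *(f for _, f in self.keyframes), current_frame]
```

In this sketch a frame is stored only when its predicted probability clears the threshold, so the buffer stays small even over long episodes, which is the contrast with unselective buffers drawn above. A learned system would presumably train the scoring head jointly with the policy; here the weights are just inputs.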