GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning
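Source: arxiv.org. GeneralVLA-2 targets two bottlenecks in generalist vision-language-action systems. First, it introduces GeoFuse-MV3D, a geometry-prior-guided multi-view reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support and axis-wise refinement, and fuses only geometry while preserving appearance, which mitigates the pose and unseen-geometry hallucinations of monocular SAM3D-style reconstruction. Second, it upgrades the original KnowledgeBank into a governed long-term memory system that explicitly manages quality, confidence, lifecycle, verifier, and conflict metadata, paired with precision-oriented retrieval. On GSO-30, GeoFuse-MV3D reduces CD by 2.20% and LPIPS by 2.02% relative to the MV-SAM3D baseline, and increases PSNR by 2.36% and SSIM by 1.03%. On Terminal-Bench 2.0 and SWE-Bench Verified, KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively.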
Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, whereas manipulation benefits from a stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified. GeoFuse-MV3D improves over the MV-SAM3D baseline, reducing CD and LPIPS by 2.20% and 2.02% and increasing PSNR and SSIM by 2.36% and 1.03%. KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65% on the two benchmarks, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.
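
To make the idea of a governed memory with precision-oriented retrieval more concrete, the following is a minimal sketch and not the authors' implementation. All names (MemoryEntry, Lifecycle, retrieve), the scoring rule (cosine similarity multiplied by quality and confidence), and the threshold are assumptions chosen for illustration. The abstract only states that entries carry quality, confidence, lifecycle, verifier, and conflict metadata.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from enum import Enum


class Lifecycle(Enum):
    # Hypothetical lifecycle states; the abstract does not enumerate them.
    CANDIDATE = "candidate"
    ACTIVE = "active"
    DEPRECATED = "deprecated"


@dataclass
class MemoryEntry:
    # Hypothetical record layout mirroring the metadata named in the abstract.
    text: str
    embedding: list[float]
    quality: float                      # assumed to lie in [0, 1]
    confidence: float                   # assumed to lie in [0, 1]
    lifecycle: Lifecycle = Lifecycle.CANDIDATE
    verified: bool = False              # stands in for a verifier verdict
    conflicts_with: set[str] = field(default_factory=set)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(bank: dict[str, MemoryEntry], query: list[float],
             k: int = 2, min_score: float = 0.5) -> list[str]:
    """Return up to k entry keys, preferring precision over recall.

    Assumed policy: keep only ACTIVE, verified entries, weight similarity
    by quality and confidence, drop anything below min_score, and never
    return two entries that are marked as conflicting with each other.
    """
    scored = []
    for key, entry in bank.items():
        if entry.lifecycle is not Lifecycle.ACTIVE or not entry.verified:
            continue
        score = cosine(query, entry.embedding) * entry.quality * entry.confidence
        if score >= min_score:
            scored.append((score, key))
    scored.sort(reverse=True)  # highest score first

    selected: list[str] = []
    for _, key in scored:
        clashes = any(
            key in bank[s].conflicts_with or s in bank[key].conflicts_with
            for s in selected
        )
        if clashes:
            continue
        selected.append(key)
        if len(selected) == k:
            break
    return selected
```

Under this assumed policy, a semantically similar but unverified or conflicting entry is skipped rather than appended to the context, which is one plausible reading of "precision-oriented" in contrast to plain similarity-based retrieval.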