# Stateful Visual Encoder: Introducing Stateful Visual Encoders into Vision-Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-03 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpytyvum03ausli3fetv8hy2
- Original link: https://arxiv.org/abs/2606.04433

## AI Summary

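In multi-image, multi-turn agentic settings, the visual encoder in existing open-weight vision-language models (VLMs) is stateless: each image is encoded independently, with no access to prior visual context, so small but task-critical changes can be attenuated. The paper proposes a Stateful Visual Encoder that conditions each visual representation on prior visual features. With supervised finetuning, VLMs equipped with this encoder improve consistently on cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning, and the gains hold across input resolutions, language model sizes, and VLM backbones. On real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, stateful encoders consistently improve on generalist VLM baselines and match or surpass specialized models in selected domains.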

## Main Text

Vision-language models (VLMs) are increasingly used in multi-image, multi-turn agentic settings where decisions depend on visual changes. However, in existing open-weight VLMs, visual comparisons happen only inside the language model, while the visual encoder itself remains stateless: each image is encoded independently, without access to the prior visual context. As a result, small but task-critical changes may be attenuated before the language model has a chance to compare them, especially when those changes do not affect the high-level semantics of the scene. We introduce a Stateful Visual Encoder, which conditions each visual representation on prior visual features. Under supervised finetuning, VLMs equipped with stateful encoders achieve consistent improvements on controlled tasks involving cross-image spatial aggregation, multi-object visual differencing, and visual trajectory behavior cloning. These improvements are consistent across input resolutions, language model sizes, and VLM backbones. Finally, we validate our model on real-world tasks, including longitudinal radiology, fine-grained image comparison, and remote sensing, where stateful encoders consistently improve generalist VLM baselines and can match or surpass specialized models in selected domains. Project page: https://statefulvisualencoders.github.io/
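
To make the stateless versus stateful distinction concrete, here is a toy sketch in plain Python. It is not the paper's architecture: the abstract does not say how conditioning on prior visual features is implemented, so `stateless_encode`, `ToyStatefulEncoder`, and the choice to append a feature delta are all hypothetical stand-ins chosen only to show the interface difference.

```python
from typing import List, Optional

Vector = List[float]


def stateless_encode(image: Vector) -> Vector:
    """Toy stateless encoder (hypothetical): average each pair of adjacent pixels.

    Every image is encoded on its own, with no access to earlier images.
    """
    return [(image[i] + image[i + 1]) / 2.0 for i in range(0, len(image) - 1, 2)]


class ToyStatefulEncoder:
    """Toy stateful encoder (hypothetical, not the paper's method).

    It keeps the features of the previous image and conditions the current
    output on them by appending the element-wise change in features.
    """

    def __init__(self) -> None:
        self.prev_features: Optional[Vector] = None

    def encode(self, image: Vector) -> Vector:
        features = stateless_encode(image)
        if self.prev_features is None:
            # First image in the sequence: there is no prior visual context.
            delta = [0.0] * len(features)
        else:
            delta = [f - p for f, p in zip(features, self.prev_features)]
        self.prev_features = features
        # Output is the current features followed by the change since the last image.
        return features + delta


# Two "images" that differ in a single pixel.
frame_a = [1.0, 1.0, 2.0, 2.0]
frame_b = [1.0, 1.0, 2.0, 3.0]

encoder = ToyStatefulEncoder()
print(stateless_encode(frame_a))  # [1.0, 2.0]
print(stateless_encode(frame_b))  # [1.0, 2.5]
print(encoder.encode(frame_a))    # [1.0, 2.0, 0.0, 0.0]
print(encoder.encode(frame_b))    # [1.0, 2.5, 0.0, 0.5]
```

In the stateless case, any comparison between the two frames has to be made downstream from two independently produced feature vectors. In the toy stateful case, the change is already explicit in the encoder output, which is the general idea the abstract describes; a real model would presumably use a learned conditioning mechanism rather than a hand-written subtraction.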
