# VisualThink-VLA: A Visual Intermediate-Reasoning Framework for Efficient, Low-Latency Vision-Language-Action Policies

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-28 08:00
- AIHOT score: 55
- AIHOT link: https://aihot.virxact.com/items/cmpv97kha04rmsl0z8y04wc54
- Original paper: https://arxiv.org/abs/2605.30011

## AI Summary

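This paper proposes VisualThink-VLA, a visual intermediate-reasoning framework for VLA policies. It targets a problem with textual chain-of-thought in embodied control: information interference and high decoding latency make real-time execution difficult. The framework guides action prediction through a compact visual-evidence interface, preserving spatial precision while avoiding decoding overhead. It uses a selective routing mechanism to learn visual evidence tokens, which enables low-latency inference. The work also introduces VisualEvidence-Kit, which includes a visual-evidence agent that constructs a set of 754.7k VLA instructions. Across multiple benchmarks and real-robot evaluation, the framework achieves the highest success rate on most benchmarks and reduces the multi-second latency of reasoning-augmented baselines to the sub-second range. For example, on BridgeData V2 it cuts step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.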

## Main Text

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VisualThink-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VisualThink-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VisualThink-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VisualThink-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8× speedup.
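The reported speedup can be checked directly from the two step latencies quoted above. The short Python sketch below does that arithmetic. The helper names `speedup` and `control_rate_hz` are hypothetical and not from the paper. The control-rate calculation additionally assumes that each control step waits for exactly one inference call, which the abstract does not state.

```python
# Step latencies on BridgeData V2 as quoted in the abstract (seconds).
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367


def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster the new latency is than the baseline."""
    return baseline_s / new_s


def control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on the closed-loop rate.

    Assumption (not from the paper): one inference call per control step,
    with no action chunking or pipelining.
    """
    return 1.0 / step_latency_s


if __name__ == "__main__":
    ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
    print(f"speedup: {ratio:.1f}x")  # prints "speedup: 22.8x"
    print(f"ECoT rate bound: {control_rate_hz(ECOT_LATENCY_S):.2f} Hz")
    print(f"VisualThink-VLA rate bound: {control_rate_hz(VISUALTHINK_LATENCY_S):.2f} Hz")
```

Under that one-inference-per-step assumption, the latencies correspond to roughly 0.12 Hz for ECoT and about 2.7 Hz for VisualThink-VLA, which illustrates why sub-second latency matters for closed-loop execution.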
