# Linearly Scaling Video Language Models for Long-Video Understanding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmpulkg6204v8slag8wzbuwt7
- Original link: https://arxiv.org/abs/2605.31598

## AI Summary

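This paper proposes StateKV, an inference-time method that brings video prefill in pretrained long-video VLMs to linear time complexity. Its core idea is to carry cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Experiments on three long-video benchmarks and multiple models show that StateKV performs close to full self-attention and consistently outperforms dominant streaming approximations such as sliding windows, without fine-tuning or architectural changes. The method reduces the FLOPs cost of prefill, allowing larger models to be used at a fixed compute budget for higher accuracy, and offers a practical approach to scalable long-video understanding.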

## Main Text

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
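To make the linear-versus-quadratic claim concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes each token already carries a scalar importance score (the abstract does not say how importance is computed), that the state is updated by keeping the top-scoring tokens, and that counting query-key pairs is a reasonable proxy for attention FLOPs. The names `update_state`, `prefill`, and `full_attention_pairs` are hypothetical.

```python
import heapq
from typing import List, Tuple

# Hypothetical token representation: (importance score, token id).
Token = Tuple[float, str]


def update_state(state: List[Token], frame_tokens: List[Token], capacity: int) -> List[Token]:
    """Merge the carried state with the new frame and keep the top `capacity` tokens by score.

    Assumption: top-k by a precomputed score stands in for the importance rule.
    """
    return heapq.nlargest(capacity, state + frame_tokens, key=lambda tok: tok[0])


def prefill(frames: List[List[Token]], capacity: int):
    """Process frames one at a time with a bounded recurrent state.

    Returns the final state, a full per-frame cache (a stand-in for the
    second cache used at decoding time), and the number of query-key pairs.
    """
    state: List[Token] = []
    per_frame_cache: List[List[Token]] = []
    pairs = 0
    for frame_tokens in frames:
        # Each token of the current frame attends to the state plus its own frame.
        context_len = len(state) + len(frame_tokens)
        pairs += len(frame_tokens) * context_len
        per_frame_cache.append(list(frame_tokens))
        state = update_state(state, frame_tokens, capacity)
    return state, per_frame_cache, pairs


def full_attention_pairs(frames: List[List[Token]]) -> int:
    """Query-key pairs when every frame attends to all tokens seen so far (assumed causal across frames)."""
    seen = 0
    pairs = 0
    for frame_tokens in frames:
        seen += len(frame_tokens)
        pairs += len(frame_tokens) * seen
    return pairs


def make_frames(num_frames: int, tokens_per_frame: int) -> List[List[Token]]:
    """Build toy frames with arbitrary, made-up importance scores."""
    return [
        [(float((f * tokens_per_frame + i) % 7), f"f{f}t{i}") for i in range(tokens_per_frame)]
        for f in range(num_frames)
    ]
```

Under these assumptions, once the state is full, every frame costs the same number of pairs (tokens per frame times capacity plus tokens per frame), so total prefill cost grows linearly with the number of frames, while the full-attention count grows quadratically. The per-frame cache still grows with video length; in this sketch it is only stored, not attended to during prefill.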