PEEK:通过高效知识蒸馏选择关键帧
阅读原文· arxiv.org视频语言模型处理帧数有限,帧选择是视频描述的效率瓶颈。现有自适应方法计算成本高。本文提出PEEK,一种高效的动态帧采样方法,通过知识蒸馏将依赖描述信息的帧排序能力从教师模型压缩到仅依赖视觉内容的轻量级时序模型中。实验表明,在ActivityNet Captions和MSR-VTT数据集上,PEEK在所有测试的视觉语言模型上均优于现有方法,尤其在仅选1-2帧时表现最佳。在ActivityNet Captions的16种配置中,PEEK在14种中胜出。该方法仅增加5.2%的描述生成时间,远低于CSTA(65.4%)和MaxInfo(211.9%)。
Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.