A Systematic Solution for One-to-Many Temporal Grounding (OMTG)

2026-06-04 08:00

AI Summary

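One-to-Many Temporal Grounding (OMTG) aims to localize the multiple disjoint video segments that correspond to a single text query. Existing state-of-the-art multimodal large language models (MLLMs) score close to zero on this task because they lack event cardinality perception. To address this, the researchers establish the first comprehensive OMTG benchmark, with Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics; build a high-quality OMTG dataset of 56k samples; and develop temporal and caption reward functions tailored to OMTG, where the caption reward uses Chain-of-Thought reasoning over dense video captions to guide policy optimization. The method reaches 43.65% EtF1 on OMTG Bench, exceeding Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.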

Original abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
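The abstract does not define C-Acc or EtF1, so the following is only an illustrative sketch of the one-to-many setting, not the paper's metrics. It assumes a segment is a `(start, end)` pair in seconds, treats a count match as "predicted and ground-truth segment counts are equal", and uses a generic IoU-thresholded F1 with greedy one-to-one matching. The names `temporal_iou`, `count_match`, and `segment_f1`, and the 0.5 threshold, are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); an assumed representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Standard temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed stand-in for a count check: same number of segments."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_thresh: float = 0.5) -> float:
    """Generic segment-level F1 (NOT the paper's EtF1).

    Each prediction is greedily matched to the unmatched ground-truth
    segment with the highest IoU, and counts as a true positive if that
    IoU reaches iou_thresh.
    """
    if not pred or not gt:
        return 0.0
    unmatched = list(gt)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda g: temporal_iou(p, g), default=None)
        if best is not None and temporal_iou(p, best) >= iou_thresh:
            unmatched.remove(best)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gt)
    return 2 * precision * recall / (precision + recall)


# One query, three disjoint ground-truth events.
gt = [(0.0, 10.0), (30.0, 40.0), (60.0, 70.0)]
multi = [(1.0, 10.0), (31.0, 41.0)]   # finds two of the three events
single = [(0.0, 70.0)]                # one-to-one style: a single long span

print(count_match(multi, gt), segment_f1(multi, gt))
print(count_match(single, gt), segment_f1(single, gt))
```

Under these assumptions, the single long span overlaps every event but matches none of them at the IoU threshold, which is one way a model tuned for single-segment output can end up with a near-zero score on a multi-segment query.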

HuggingFace Daily Papers (trending community papers)


Read the original: arxiv.org
