# OmniCap-IF: An Instruction-Following Benchmark and Model Improvements for Omni-Modal Video Captioning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-07 08:00
- AIHOT score: 61
- AIHOT link: https://aihot.virxact.com/items/cmq6ixr0408utsl5i1vf1w70n
- Original link: https://arxiv.org/abs/2606.08572

## AI Summary

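OmniCap-IF is the first benchmark for evaluating how well omni-modal large language models (OLLMs) follow instructions when captioning video. It covers 50 constraint types across three modalities (visual-only, audio-only, and audio-visual) and introduces temporal grounding to assess spatio-temporal precision. Evaluation on 1,920 high-quality samples shows significant performance gaps between models and reveals a "format-content tradeoff": increasing format complexity harms models' omni-modal reasoning. The team also built a 54K instruction-tuning dataset, OmniCap-IF-54K, and released the OmniCaptioner-IF model, which shows clear improvements in both complex instruction following and general omni-modal captioning.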

## Full Text

While Omni-modal Large Language Models (OLLMs) have demonstrated impressive capabilities in jointly processing audio and visual streams, their ability to strictly adhere to complex, multi-faceted user instructions remains largely unexplored. Existing benchmarks primarily focus on holistic video understanding or text-only instruction following, failing to capture the intricate interplay between modalities and user constraints. To bridge this gap, we introduce OmniCap-IF, the first comprehensive benchmark specifically designed to evaluate instruction-following capabilities in omni-modal captioning. OmniCap-IF incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our benchmark encompasses 50 distinct constraint types across pure visual, pure audio, and audio-visual modalities, while integrating temporal grounding to assess spatio-temporal precision. Extensive evaluations of prominent models on 1,920 high-quality samples reveal significant performance disparities. Furthermore, our analysis uncovers a critical "format-content tradeoff", demonstrating that increasing formatting complexity directly degrades models' omni-modal reasoning abilities. Finally, to advance the field, we curate a 54K instruction-tuning dataset, OmniCap-IF-54K, and present OmniCaptioner-IF, which achieves notable improvements in both complex instruction adherence and general omni-modal captioning performance.
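
To make the format dimension and the temporal-grounding check more concrete, here is a minimal Python sketch. It is not the benchmark's actual scoring code. The abstract does not list the 50 constraint types or name a grounding metric, so the two constraints (`max_words`, `bullet_count`), the averaging in `format_score`, and the use of temporal IoU are all illustrative assumptions. Content correctness would probably need a model-based or human judge and is not sketched here.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class FormatConstraint:
    """A hypothetical rule-checkable format constraint (not from the paper)."""
    name: str
    check: Callable[[str], bool]


def max_words(limit: int) -> FormatConstraint:
    # Assumed constraint: caption has at most `limit` whitespace-separated tokens.
    return FormatConstraint(f"max_words_{limit}", lambda c: len(c.split()) <= limit)


def bullet_count(n: int) -> FormatConstraint:
    # Assumed constraint: caption has exactly `n` lines starting with "-" or "*".
    def check(caption: str) -> bool:
        bullets = [ln for ln in caption.splitlines() if re.match(r"^\s*[-*]\s+\S", ln)]
        return len(bullets) == n
    return FormatConstraint(f"bullet_count_{n}", check)


def format_score(caption: str, constraints: list[FormatConstraint]) -> float:
    """Fraction of format constraints the caption satisfies (assumed aggregation)."""
    if not constraints:
        return 1.0
    return sum(c.check(caption) for c in constraints) / len(constraints)


def temporal_iou(pred: tuple[float, float], gold: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds.

    Temporal IoU is a common grounding metric; using it here is an assumption.
    """
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    caption = "- A dog barks at the door.\n- A doorbell rings twice."
    print(format_score(caption, [max_words(20), bullet_count(2)]))
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

Scoring format with deterministic rules, separately from content, is one way a "format-content tradeoff" could be measured: the two scores can then be compared as the number of format constraints grows.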
