# OmniInteract: A Real-World Streaming Interaction Benchmark for Real-Time Omnimodal Assistants

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 56
- AIHOT link: https://aihot.virxact.com/items/cmpqd5x5k03xzslnozd4smojb
- Original link: https://arxiv.org/abs/2605.26485

## AI Summary

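OmniInteract is a streaming interaction benchmark for evaluating real-time omnimodal large language models. It contains 250 videos and defines 1,430 slots in which the model must respond online: 1,062 one-question, one-answer (1Q1A) slots covering real-time, proactive, and nested scenarios, and 368 one-question, multiple-answer (1QnA) slots. Models must process the raw audio-visual stream and have no access to future content. Evaluation relies on metrics such as the Interaction-Aware Quality-Timeliness F1 (IA-QTF1) score. Experiments show that current models remain weak at streaming interaction: the best overall IA-QTF1 is only 0.368.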

## Full Text

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
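The abstract says each slot carries a trigger, a response window, and a target answer, but it does not give the data schema or the IA-QTF1 formula. The following is a minimal Python sketch of how such a slot and a window-only timing check might look. The class `ResponseSlot`, its field names, and the helper `timely_fraction` are hypothetical illustrations, not the benchmark's actual format or metric; only the slot counts come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ResponseSlot:
    """Hypothetical slot record; field names are assumptions, not the dataset schema."""
    trigger_time: float   # seconds into the stream at which the trigger occurs
    window_start: float   # earliest acceptable response time, in seconds
    window_end: float     # latest acceptable response time, in seconds
    target_answer: str    # reference answer for this slot

    def is_timely(self, response_time: float) -> bool:
        """Return True if the response falls inside the response window."""
        return self.window_start <= response_time <= self.window_end


# Slot counts as reported in the abstract.
SLOT_COUNTS = {"1Q1A": 1062, "1QnA": 368}
TOTAL_SLOTS = sum(SLOT_COUNTS.values())


def timely_fraction(
    slots: Sequence[ResponseSlot],
    response_times: Sequence[Optional[float]],
) -> float:
    """Fraction of slots answered inside their window.

    A response time of None means the model stayed silent for that slot.
    This covers timing only; it is NOT IA-QTF1, which also scores answer quality.
    """
    if len(slots) != len(response_times):
        raise ValueError("slots and response_times must have the same length")
    if not slots:
        return 0.0
    hits = sum(
        1
        for slot, t in zip(slots, response_times)
        if t is not None and slot.is_timely(t)
    )
    return hits / len(slots)
```

The two reported slot types sum to the stated total of 1,430 slots. A real scorer would additionally need to compare each response against `target_answer` and penalize invalid or out-of-window outputs, as the abstract describes.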