# SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-04-14 08:00
- AIHOT link: https://aihot.virxact.com/items/cmnzt9y9v00owslwz5h3f3bsq
- Original link: https://arxiv.org/abs/2604.13023

## AI Summary

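The research team has released SpotSound, an audio language model that targets the problem of precisely localizing events in long audio and proposes a new training objective that suppresses hallucinated timestamps. The team also introduces SpotSound-Bench, a benchmark in which target events occupy less than 10% of each audio clip, simulating a demanding real-world "needle-in-a-haystack" setting. Experiments show that the model achieves SOTA results on temporal grounding benchmarks while maintaining robust performance on general audio-language tasks. The code, models, and dataset have all been open-sourced.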

## Main Text

Large Audio-Language Models (ALMs) have recently demonstrated remarkable capabilities in holistic audio understanding, yet they remain unreliable for temporal grounding, i.e., the task of pinpointing exactly when an event occurs within long-form audio. This limitation stems from two factors: training data dominated by clip-level supervision that lacks precise timestamps, and benchmarks that fail to simulate real-world scenarios in which short events are obscured by dense background sounds. In this paper, we introduce SpotSound, an audio language model designed for grounding audio events. SpotSound incorporates a novel training objective designed to suppress hallucinated timestamps for events absent from the input. Additionally, we present SpotSound-Bench, a challenging temporal grounding benchmark where target events occupy less than roughly 10% of each clip, creating a rigorous "needle-in-a-haystack" evaluation. Experiments demonstrate that SpotSound achieves state-of-the-art results on temporal grounding benchmarks while maintaining robust performance across general downstream audio-language tasks. Code, models, and the benchmark are available at https://loiesun.github.io/spotsound/
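To make the "needle-in-a-haystack" criterion concrete, here is a minimal sketch of how the fraction of a clip occupied by target events could be computed and checked against a 10% threshold. It is an illustration only and is not taken from the SpotSound code release: the function names (`merge_intervals`, `event_coverage`, `is_needle_in_haystack`), the `(start_sec, end_sec)` annotation format, and the handling of overlapping events are all assumptions.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_sec, end_sec) for each target event.
Interval = Tuple[float, float]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no time is counted twice."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def event_coverage(intervals: List[Interval], clip_duration: float) -> float:
    """Return the fraction of the clip covered by target events."""
    if clip_duration <= 0:
        raise ValueError("clip_duration must be positive")
    # Clip each interval to the clip boundaries and drop empty ones.
    clipped = [(max(0.0, s), min(clip_duration, e)) for s, e in intervals]
    clipped = [(s, e) for s, e in clipped if e > s]
    covered = sum(e - s for s, e in merge_intervals(clipped))
    return covered / clip_duration


def is_needle_in_haystack(
    intervals: List[Interval],
    clip_duration: float,
    max_ratio: float = 0.10,  # assumed threshold, based on the ~10% in the abstract
) -> bool:
    """True if target events occupy less than max_ratio of the clip."""
    return event_coverage(intervals, clip_duration) < max_ratio
```

For example, two overlapping events spanning 10-12 s and 11-14 s in a 60 s clip cover 4 s in total, so the clip would pass this check, whereas a single 30 s event in the same clip would not.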
