# PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models

- Source: HuggingFace Daily Papers (trending community papers)
- Published: 2026-06-17 08:00
- AIHOT score: 49
- AIHOT link: https://aihot.virxact.com/items/cmqomzs1j02khslx657qtoi0i
- Original paper: https://arxiv.org/abs/2606.19534

## AI Summary

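Existing multimodal large language models generate autoregressively, which makes multi-region perception inefficient. To address this, the paper proposes PerceptionDLM, a multimodal diffusion language model. The architecture exploits the parallel decoding of diffusion language models and, through efficient prompting and structured attention masking, perceives multiple masked regions simultaneously at both the sequence and token levels, significantly improving inference efficiency. To systematically evaluate the parallelism of diffusion language models, the authors build the ParaDLC-Bench benchmark. Experiments show that PerceptionDLM remains competitive at region captioning while substantially speeding up multi-region perception tasks. The authors describe this as the first use of diffusion language models for parallel region captioning and perception.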

## Main Text

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks. However, most existing MLLMs rely on autoregressive generation, which limits their efficiency for perception tasks that require captioning multiple regions. In this work, we propose PerceptionDLM, a multimodal diffusion language model (DLM) optimized for efficient parallel region perception. Built upon PerceptionDLM-Base, a strong foundational baseline that achieves state-of-the-art performance among open-source diffusion MLLMs, our architecture fully leverages the parallel decoding nature of DLMs. Specifically, we introduce efficient prompting and structured attention masking to enable simultaneous perception of multiple masked regions, allowing the model to generate region descriptions in parallel at both the sequence and token levels. This design significantly improves inference efficiency compared with existing approaches that process regions sequentially. To systematically evaluate how well the visual perception of DLMs parallelizes, we construct a new Parallel Detailed Localized Captioning Benchmark (ParaDLC-Bench) by extending DLC-Bench to include multiple region masks per image, enabling joint evaluation of caption quality and inference efficiency. Experiments demonstrate that PerceptionDLM maintains competitive performance in region captioning while achieving substantial speed improvements for multi-region perception tasks. Our results highlight the potential of multimodal diffusion language models for efficient, parallel visual perception. To the best of our knowledge, we are the first to achieve parallel region captioning and perception by leveraging the advantages of diffusion language models. Code, models, and datasets are released.
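
The abstract does not spell out how the structured attention mask is laid out, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a shared prefix (image tokens plus the common prompt) followed by one segment per region, where each region segment attends bidirectionally to the prefix and to itself but never to another region's segment. The function name `build_region_mask` and its parameters `prefix_len` and `region_lens` are hypothetical.

```python
from typing import List


def build_region_mask(prefix_len: int, region_lens: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key position j.

    Assumed layout (an illustration, not taken from the paper):
      [shared prefix | region 1 tokens | region 2 tokens | ...]
    """
    # Segment id per position: 0 for the shared prefix, 1..K for region segments.
    seg = [0] * prefix_len
    for k, n in enumerate(region_lens, start=1):
        seg.extend([k] * n)

    total = len(seg)
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            # Every position sees the shared prefix; a region token also sees
            # its own segment. Regions never see each other, so their
            # captions can be denoised independently in a single pass.
            mask[i][j] = seg[j] == 0 or seg[i] == seg[j]
    return mask


if __name__ == "__main__":
    # 3 prefix tokens followed by two regions of 2 caption slots each.
    for row in build_region_mask(3, [2, 2]):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption the region segments form a block-diagonal pattern behind a fully visible prefix, which is what would let several captions share one forward pass instead of one pass per region.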
