# TV-Edit: A Text-Vision Co-Instructed Image Editing Framework

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-15 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmqhthxfk02isslf0zqfznzlm
- Original paper: https://arxiv.org/abs/2606.16767

## AI Summary

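TV-Edit combines textual and visual instructions, treating text as semantic intent and sparse visual instructions (drag/point) as spatial guidance, to achieve precise, intent-faithful image editing. The authors build a paired textual-visual instruction dataset of more than 23K samples and fuse visual instructions with image-text semantics into semantic-aware control representations that are fed to pretrained editing backbones. Compared with text-only or drag-only methods, TV-Edit offers more precise spatial control, less instruction ambiguity, and stronger structural consistency. TV-Edit-Bench evaluates semantic faithfulness, spatial alignment, and visual consistency, and TV-Edit consistently outperforms SOTA baselines across multiple editing backbones.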

## Main Text

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but offer only coarse-grained spatial control over the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To combine the strengths of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework that contextualizes drag or point-based visual instructions with image-text semantics and lifts them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark that evaluates semantic faithfulness, spatial alignment, and visual consistency using ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.
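
To make the idea of a text-vision co-instruction concrete, here is a minimal Python sketch of how one such sample might be represented, along with a toy spatial alignment score. This is not the paper's implementation or the TV-Edit-Bench metric: the class names `DragInstruction` and `CoInstruction`, the function `mean_drag_error`, and the choice of mean Euclidean distance are all assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class DragInstruction:
    """Sparse visual instruction: move the content at `source` to `target`."""
    source: Point
    target: Point


@dataclass
class CoInstruction:
    """Hypothetical container pairing semantic intent (text) with
    spatial guidance (drags and/or points)."""
    text: str
    drags: List[DragInstruction] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)


def mean_drag_error(instruction: CoInstruction, achieved: List[Point]) -> float:
    """Mean Euclidean distance between each drag target and the position
    where the dragged content actually ended up. Lower is better.
    Illustrative only; the benchmark's real metrics are not specified here."""
    if len(achieved) != len(instruction.drags):
        raise ValueError("need exactly one achieved point per drag")
    if not instruction.drags:
        return 0.0
    total = sum(
        hypot(a[0] - d.target[0], a[1] - d.target[1])
        for d, a in zip(instruction.drags, achieved)
    )
    return total / len(instruction.drags)


# Example: the text states the intent, the drags say where things should go.
sample = CoInstruction(
    text="raise the dog's left paw",
    drags=[
        DragInstruction(source=(40.0, 80.0), target=(10.0, 10.0)),
        DragInstruction(source=(5.0, 5.0), target=(0.0, 0.0)),
    ],
)
error = mean_drag_error(sample, achieved=[(13.0, 14.0), (0.0, 0.0)])
```

In this sketch the text field removes the semantic ambiguity a bare drag would have (the same drag could mean "move", "stretch", or "rotate"), while the drag endpoints give the spatial precision that text alone lacks.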
