# DOMINO: Domain Data Synthesis for Large Language Models via Minimal Sufficient Representation Learning

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-29 08:00
- AIHOT score: 52
- AIHOT link: https://aihot.virxact.com/items/cmpy8j6nn02r4slaxq2hx9xvh
- Original link: https://arxiv.org/abs/2605.30039

## AI Summary

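High-quality data is hard to obtain for LLM fine-tuning, and existing synthesis methods rely on natural language descriptions, which makes them unsuitable for domains that are hard to articulate. This paper proposes DOMINO, which defines the target domain using reference examples alone and learns a minimal sufficient representation to guide the generation of domain-aligned data. DOMINO combines prompt tuning with a contrastive disentanglement objective to separate domain patterns from sample noise. On coding benchmarks with implicit domain definitions, fine-tuning on DOMINO-synthesized data improves Pass@1 accuracy by up to 4.63% over strong instruction-tuned baselines, enabling automated domain adaptation without manual prompts or natural language specifications.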
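
For readers unfamiliar with the Pass@1 metric cited above, the sketch below shows the commonly used unbiased pass@k estimator for code benchmarks. The summary does not say how Pass@1 was computed in this work, so treating it as this estimator is an assumption; the function names `pass_at_k` and `mean_pass_at_k` are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass the tests.

    Assumption: this is the widely used unbiased estimator
    1 - C(n - c, k) / C(n, k); the source does not state its exact protocol.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, averaged over the benchmark.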

## Main Text

Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.
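
The abstract does not spell out the contrastive disentanglement objective, so the following is only a generic InfoNCE-style loss, swapped in to illustrate how a contrastive term can pull a domain-level vector toward in-domain samples and away from others. It is not DOMINO's actual objective. The names `cosine` and `info_nce`, the plain-list embeddings, and the temperature value are all assumptions made for this sketch.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature: float = 0.1) -> float:
    """Generic InfoNCE loss (illustrative, not the paper's formulation).

    anchor:    hypothetical domain-level representation
    positive:  embedding that should share the domain pattern
    negatives: embeddings that should not
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]
```

The loss is small when the anchor is closer to the positive than to any negative, and grows as a negative becomes the closer match. In a prompt-tuning setup the anchor would presumably be derived from learned soft-prompt parameters, but that wiring is not described in the abstract and is left out here.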
