Diverse Dictionary Learning
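Source: arxiv.org
To address the ill-posed problem of recovering latent variables from observational data, the researchers propose the diverse dictionary learning framework. They show that, even in general settings without linearity assumptions or auxiliary supervision, the intersections, complements, and symmetric differences of latent variables, together with the dependency structure, remain identifiable. Composing these results with set algebra yields a structured view of the hidden world. When the data has sufficient structural diversity, all latent variables can be fully identified. The approach requires only a simple inductive bias that can be integrated into existing models, and it is validated on both synthetic and real-world data.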
Given only observational data X = g(Z), where both the latent variables Z and the generating process g are unknown, recovering Z is ill-posed without additional assumptions. Existing methods often assume linearity or rely on auxiliary supervision and functional constraints. However, such assumptions are rarely verifiable in practice, and most theoretical guarantees break down under even mild violations, leaving uncertainty about how to reliably understand the hidden world. To make identifiability actionable in real-world scenarios, we take a complementary view: in general settings where full identifiability is unattainable, what can still be recovered with guarantees, and what biases could be universally adopted? We introduce the problem of diverse dictionary learning to formalize this view. Specifically, we show that intersections, complements, and symmetric differences of latent variables linked to arbitrary observations, along with the latent-to-observed dependency structure, are still identifiable up to appropriate indeterminacies even without strong assumptions. These set-theoretic results can be composed using set algebra to construct structured and essential views of the hidden world, such as genus-differentia definitions. When sufficient structural diversity is present, they further imply full identifiability of all latent variables. Notably, all identifiability benefits follow from a simple inductive bias during estimation that can be readily integrated into most models. We validate the theory and demonstrate the benefits of the bias on both synthetic and real-world data.
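The set-theoretic quantities named in the abstract can be made concrete with a toy example. The sketch below is only an illustration, not the paper's estimation method: the dependency structure `DEPENDENCY`, the helper `latents_of`, and the reading of "complement" as a relative complement (latents linked to one observation but not another) are all assumptions introduced here. It shows how, once the latents linked to each observed variable are known, Python's built-in set operators give the intersection, complement, and symmetric difference, and how these can be composed.

```python
# Hypothetical latent-to-observed dependency structure (made-up values):
# each observed variable X_i maps to the indices of the latent variables Z_j
# it depends on. In practice this structure would have to be estimated.
DEPENDENCY = {
    "X_0": {0, 1},   # X_0 depends on Z_0, Z_1
    "X_1": {1, 2},   # X_1 depends on Z_1, Z_2
    "X_2": {2, 3},   # X_2 depends on Z_2, Z_3
}


def latents_of(observed: str) -> frozenset:
    """Return the set of latent indices linked to one observed variable."""
    return frozenset(DEPENDENCY[observed])


a, b, c = latents_of("X_0"), latents_of("X_1"), latents_of("X_2")

# Latents shared by X_0 and X_1.
intersection = a & b            # frozenset({1})

# Latents linked to X_0 but not to X_1 (relative complement, an assumption
# about what "complement" means in the abstract).
complement = a - b              # frozenset({0})

# Latents linked to exactly one of X_0 and X_1.
symmetric_difference = a ^ b    # frozenset({0, 2})

# Composition via set algebra: latents shared by X_0 and X_1 but absent
# from X_2, loosely in the spirit of a genus-differentia description.
shared_but_not_in_c = (a & b) - c   # frozenset({1})

if __name__ == "__main__":
    print(sorted(intersection), sorted(complement),
          sorted(symmetric_difference), sorted(shared_but_not_in_c))
```

Frozen sets are used so the per-observation latent sets are hashable and cannot be mutated by accident while they are being composed.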