FashionLens: A Versatile Fashion Image Retrieval Framework Based on Task-Adaptive Learning
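Source: arxiv.org
Existing fashion image retrieval methods struggle to support diverse queries and search intents. To address this, the study proposes FashionLens, a unified framework. It first builds U-FIRE, a comprehensive benchmark that consolidates and augments existing data to support cross-scenario evaluation and generalization testing. On top of this benchmark, and building on multimodal large language models, it introduces two core modules: a query calibrator that dynamically maps queries into task-aligned spaces via adaptive spherical interpolation, and an adaptive sampling strategy that automatically adjusts task weights according to learning difficulty and data scale. Experiments show that the method achieves state-of-the-art performance on U-FIRE and generalizes robustly to unseen tasks. The code and data are open-sourced.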
Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desirable in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. In this work, we therefore aim to develop a unified framework that handles diverse, realistic fashion retrieval scenarios and achieves truly versatile fashion image retrieval. To establish a data foundation, we first introduce U-FIRE, a comprehensive benchmark that consolidates fragmented fashion datasets into a unified collection, supplemented by two manually curated datasets for testing generalization. Building upon this, we propose FashionLens, a unified framework based on Multimodal Large Language Models. To handle divergent matching objectives, we design a Proposal-Guided Spherical Query Calibrator that dynamically shifts query representations into task-aligned metric spaces via adaptive spherical linear interpolation. Additionally, to mitigate the optimization imbalance caused by varying task complexities and data scales, we develop a Gradient-Guided Adaptive Sampling strategy that automatically re-weights tasks based on real-time learning difficulty and a data-scale prior. Experiments on U-FIRE show that FashionLens achieves state-of-the-art performance across diverse retrieval scenarios and generalizes robustly to unseen tasks. The data and code are publicly released at https://github.com/haokunwen/FashionLens.
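As a concrete reference for the spherical linear interpolation mentioned above, the following minimal sketch shows the standard slerp formula applied to two embedding vectors. It is not the authors' calibrator: the abstract does not say how the target direction or the interpolation coefficient are produced, so `query`, `target`, and `t` are plain, hypothetical arguments here, and the function names are illustrative only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def slerp(query, target, t, eps=1e-7):
    """Standard spherical linear interpolation on the unit sphere.

    t = 0 returns the normalized query, t = 1 the normalized target.
    In an adaptive variant, t would be predicted per query; here it is
    simply passed in (an assumption made for this sketch).
    """
    q, p = normalize(query), normalize(target)
    cos_omega = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, p))))
    omega = math.acos(cos_omega)      # angle between the two unit vectors
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel vectors: fall back to normalized linear
        # interpolation. Exactly opposite vectors are not handled.
        return normalize([(1 - t) * a + t * b for a, b in zip(q, p)])
    w_q = math.sin((1 - t) * omega) / sin_omega
    w_p = math.sin(t * omega) / sin_omega
    return [w_q * a + w_p * b for a, b in zip(q, p)]


if __name__ == "__main__":
    midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
    print(midpoint)
```

Unlike plain linear interpolation, the result stays on the unit sphere for every `t`, which matters when retrieval scores are computed with cosine similarity.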
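To illustrate the general idea of re-weighting tasks by learning difficulty and data scale, here is a hedged sketch of one possible sampling rule. The abstract gives no formula, so everything below is an assumption made for illustration: the score `size ** alpha * difficulty`, the exponent `alpha`, the use of a smoothed gradient norm as the difficulty signal, and all function names.

```python
import random


def update_difficulty(previous, grad_norm, momentum=0.9):
    """Exponential moving average of a task's gradient norm.

    Using the gradient norm as a difficulty proxy is an assumption of
    this sketch, not a detail taken from the abstract.
    """
    return momentum * previous + (1 - momentum) * grad_norm


def task_sampling_probs(sizes, difficulties, alpha=0.5):
    """Combine a data-scale prior with a difficulty signal.

    sizes:        number of training examples per task
    difficulties: current difficulty estimate per task (positive)
    alpha:        hypothetical exponent that tempers the size prior
    """
    scores = [(n ** alpha) * d for n, d in zip(sizes, difficulties)]
    total = sum(scores)
    return [s / total for s in scores]


def sample_task(probs, rng=random):
    """Draw one task index according to the sampling probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    sizes = [100, 400]
    difficulties = [2.0, 1.0]  # the smaller task is currently harder
    probs = task_sampling_probs(sizes, difficulties)
    print(probs, sample_task(probs))
```

With `alpha` below 1, large tasks are sampled more often than small ones but less than proportionally, and a task whose difficulty estimate rises gains sampling probability at the next update.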