Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
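Source: arxiv.org
Lens is a 3.8B-parameter text-to-image model whose performance is competitive with, and in some cases surpasses, models with more than 6B parameters, while using only about 19.3% of the training compute of Z-Image. Its training efficiency comes from two strategies: first, maximizing the information density of each training batch with the Lens-800M dataset, whose dense captions are generated by GPT-4.1 and average about 109 words; second, architectural choices that speed up convergence, such as a semantic VAE and a strong language encoder. After pre-training, the model is refined with RL training and a reasoner module, and distillation enables 4-step inference. It supports arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2. On a single NVIDIA H100 GPU, Lens generates a 1024^2 image in 3.15 seconds, and its distilled version completes 4-step generation in 0.84 seconds.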
We introduce Lens, a 3.8B-parameter T2I model that achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6B parameters across various benchmarks, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The training efficiency of Lens stems from two key strategies beyond its compact model size. First, we maximize data information density per training batch by (i) training on Lens-800M, a dataset of 800M densely captioned image-text pairs whose captions are generated by GPT-4.1 and contain approximately 109 words on average, providing richer semantic supervision than conventional short captions, and (ii) constructing each batch from images with multiple resolutions and diverse aspect ratios, thereby enlarging the effective visual coverage of each optimization step. Second, we improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. After pre-training, we apply RL with taxonomy-driven prompts (Lens-RL-8K) and structured reward rubrics to suppress artifacts and improve visual quality, a reasoner module with training-free system prompt search to better align user requests with the model, and distillation-based acceleration for 4-step inference. Through efficient training and systematic optimization, Lens generalizes to arbitrary aspect ratios from 1:2 to 2:1 and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds.
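The abstract does not say how images of different resolutions and aspect ratios are prepared for a batch. One common approach, shown here purely as an illustrative sketch and not as the Lens pipeline, is to snap each image to the nearest entry in a fixed set of resolution buckets whose aspect ratios span 1:2 to 2:1. The function names, the 64-pixel grid, and the 1024^2 target area are all assumptions made for this example.

```python
import math


def make_buckets(base=1024, step=64, min_ratio=0.5, max_ratio=2.0):
    """Enumerate (width, height) pairs whose area is close to base*base and
    whose width/height ratio lies in [min_ratio, max_ratio].

    Hypothetical helper: the grid step and target area are assumptions,
    not values taken from the paper.
    """
    buckets = set()
    width = step
    while width <= base * 2:
        # Height that keeps the pixel count near base*base, snapped to the grid.
        height = round(base * base / width / step) * step
        if height > 0 and min_ratio <= width / height <= max_ratio:
            buckets.add((width, height))
        width += step
    return sorted(buckets)


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))


if __name__ == "__main__":
    buckets = make_buckets()
    print(len(buckets), "buckets")
    print(nearest_bucket(1920, 1080, buckets))
```

Comparing ratios in log space treats a 2:1 image and a 1:2 image symmetrically. How the resulting buckets are then mixed within a single optimization step is a separate design choice that this sketch leaves open.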