Pre-training Is Not "Bitter" Enough
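Original article: blog.ml.cmu.edu
Richard Sutton's "Bitter Lesson" is usually read as a warning against encoding too much human knowledge into AI systems: the methods that ultimately win are general ones that can absorb more compute and data. Modern foundation model pre-training is, on the surface, a victory for this lesson: a general architecture, massive data, and a simple self-supervised objective (language models predict the next token, vision models reconstruct masked patches, and so on). The catch is that the training objective is still chosen by humans outside the training loop: complete one large pre-training run, evaluate downstream performance, then adjust the recipe and run again. This control loop is very coarse. The paper explores whether the loop can be made more efficient.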
Richard Sutton’s “Bitter Lesson” is usually read as a warning against building too much human knowledge into AI systems. Over the long run, the methods that win are not the ones that encode our clever intuitions most directly, but the ones that scale: search, learning, and other general methods that can absorb more compute and data. Modern foundation model pre-training looks, at first glance, like a triumph of that lesson. We take a general architecture, expose it to massive data, and train it with a simple self-supervised objective. Language models predict the next token. Vision models reconstruct masked patches, align views, or match teacher representations. The recipe is simple and scalable.

But there is a catch. Pre-training may follow the Bitter Lesson in how it trains models, but not in how it chooses what the model is trained on. The objective is still chosen outside the training loop. We conduct a large pre-training run, evaluate downstream performance, adjust the recipe, and run again. The learner optimizes a single self-supervised objective, but downstream feedback arrives only after the entire training run has finished. This is a very coarse control loop. This paper asks whether that loop can be made more […]