When Is Your LLM Steerable? A Study on Predicting Activation Steering Efficacy
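Source: arxiv.org
Activation steering is a lightweight method for controlling the behavior of large language models at inference time, but its success rate depends heavily on the prompt, concept, model, and steering configuration. To predict steering efficacy, the researchers built ASTEER, a testbed of 1.4 million steered generations covering 150 concepts, and extracted hidden-state features across layers and initial decoding steps. A gradient boosting decision tree (GBDT) classifier trained on these features can judge whether an intervention under-steers, succeeds, or over-steers without completing the full autoregressive generation, reaching about 0.7 macro-F1 on unseen concepts. Using the predictor to guide the search over steering strength then approaches optimal performance at a small fraction of the decoding cost.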
Activation steering offers a lightweight way to control language models' behavior at inference time, but whether it succeeds or fails depends heavily on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve the steering success rate. To this end, we first introduce ASTEER, a testbed of 1.4M steered generations spanning 150 concepts, each labeled for steering success or failure. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how the effects of steering propagate along layers and token positions, which provides key information for steerability prediction. We then train a Gradient Boosting Decision Tree (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring a full rollout. Our predictor achieves around 0.7 macro-F1 on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further use this steerability predictor to guide the search over steering strength, achieving near-optimal performance with a small fraction of the decoding cost.
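To make the pipeline in the abstract more concrete, here is a minimal sketch of its two ends: turning early hidden states into before/after features, and using a three-way predictor to search over steering strength. Everything specific in it is an assumption for illustration and not taken from the paper: the feature choices (cosine similarity and relative shift norm per layer and step), the names `delta_features`, `search_strength`, and `toy_predict`, and the premise that the predicted label moves monotonically from under-steer to success to over-steer as strength grows. The toy predictor stands in for the trained GBDT classifier.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def _norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def _cosine(a: Vector, b: Vector) -> float:
    denom = _norm(a) * _norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0


def delta_features(base, steered) -> List[float]:
    """Compare hidden states before and after steering.

    base[layer][step] and steered[layer][step] are hidden-state vectors for
    the first few decoding steps. For every (layer, step) pair this emits two
    illustrative features (assumed, not the paper's feature set):
    cosine similarity, and the norm of the shift relative to the base norm.
    """
    feats: List[float] = []
    for base_layer, steered_layer in zip(base, steered):
        for h0, h1 in zip(base_layer, steered_layer):
            shift = [y - x for x, y in zip(h0, h1)]
            n0 = _norm(h0)
            feats.append(_cosine(h0, h1))
            feats.append(_norm(shift) / n0 if n0 > 0 else 0.0)
    return feats


def search_strength(
    predict: Callable[[float], str],
    strengths: Sequence[float],
) -> Tuple[Optional[float], int]:
    """Binary search over sorted candidate strengths.

    predict(strength) returns "under", "success", or "over". Assumes the
    label is monotone in strength, which is a simplification. Returns the
    first strength predicted as "success" (or None) and the number of
    predictor calls used.
    """
    lo, hi = 0, len(strengths) - 1
    calls = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        label = predict(strengths[mid])
        calls += 1
        if label == "success":
            return strengths[mid], calls
        if label == "under":
            lo = mid + 1  # too weak: try stronger steering
        else:
            hi = mid - 1  # too strong: try weaker steering
    return None, calls


def toy_predict(strength: float) -> str:
    """Hypothetical stand-in for the classifier. In a real setup this would
    decode a few tokens at the given strength, build delta_features from the
    hidden states, and feed them to the trained model."""
    if strength < 2.0:
        return "under"
    if strength > 3.0:
        return "over"
    return "success"


candidates = [float(s) for s in range(1, 17)]
best, n_calls = search_strength(toy_predict, candidates)
```

On a 16-point grid, the search needs only a handful of cheap predictor calls, where an exhaustive sweep would need 16 full rollouts. This is the kind of saving the abstract refers to, though the paper's actual search procedure may differ.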