UniSteer: A Text-Guided Activation-Space Flow Matching Model for General-Purpose Steering of Large Language Model Behavior
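Source: arxiv.org
UniSteer is a text-guided activation-space flow matching model designed to provide unified control over the internal behavior of frozen large language models at inference time. Rather than relying on fixed directions, it learns a universal conditional velocity field over the distribution of residual-stream activations, conditioned on natural language. At inference time, it uses flow inversion to partially transport a source activation toward a latent state, regenerates it under the target text condition, and injects the result back into the model. The same unified model also supports activation-space classification by selecting the text label with the lowest reconstruction energy. Experiments show that UniSteer offers a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.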
Activation-based control steers large language models (LLMs) by intervening on their internal representations during inference, and has emerged as an effective paradigm for controlling behaviors such as persona and style. However, existing methods often rely on fixed steering directions or task-specific intervention modules, making them difficult to adapt to fine-grained concepts and compositional constraints. We propose UniSteer, a text-guided activation flow matching model that learns a conditional distribution over residual-stream activations from natural-language conditions. Instead of fitting a separate intervention for each target behavior, UniSteer learns a universal conditional velocity field in activation space. At inference time, UniSteer performs flow inversion by partially transporting a source activation toward a latent state and regenerating it under a target textual condition before injecting it back into the frozen LLM. The same conditional model supports activation-space classification by selecting the textual label with the lowest reconstruction energy. Experiments on three target LLMs show that UniSteer provides a unified interface across behavioral control, truthfulness steering, fine-grained concept steering, multi-constraint instruction following, and activation-space classification.
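The two inference-time mechanisms described in the abstract, flow inversion for steering and lowest-energy label selection for classification, can be illustrated with a toy sketch. Nothing below comes from the paper's implementation. It assumes each text condition can be replaced by a cluster mean `mu` in an 8-dimensional activation space, and it uses the closed-form flow-matching velocity for a Gaussian path in place of a learned, text-conditioned network. It also assumes that inversion runs under the source condition and that "reconstruction energy" means a Monte Carlo flow-matching error; the abstract specifies neither. All function names, the labels `formal` and `casual`, and the `strength` parameter are hypothetical.

```python
import numpy as np

SIGMA = 1.0  # assumed spread of each toy activation cluster


def velocity(x, t, mu):
    """Exact flow-matching velocity for the path N(0, I) -> N(mu, SIGMA^2 I).

    `mu` stands in for the text condition; a real model would use a learned
    network conditioned on a text embedding instead of this closed form.
    """
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    gain = (t * SIGMA ** 2 - (1.0 - t)) / var_t
    return mu + gain * (x - t * mu)


def integrate(x, t_start, t_end, mu, steps=200):
    """Heun (second-order) integration of dx/dt = velocity, in either direction."""
    x = np.asarray(x, dtype=float).copy()
    h = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = velocity(x, t, mu)
        k2 = velocity(x + h * k1, t + h, mu)
        x = x + 0.5 * h * (k1 + k2)
        t += h
    return x


def steer(x_src, mu_src, mu_tgt, strength=0.5):
    """Partially invert under the source condition, regenerate under the target."""
    t0 = 1.0 - strength  # strength=0 leaves x_src unchanged; 1 inverts fully
    latent = integrate(x_src, 1.0, t0, mu_src)  # t=1 (activation) back to t0
    return integrate(latent, t0, 1.0, mu_tgt)   # t0 forward to t=1


def energy(x, mu, n_samples=64, seed=0):
    """Monte Carlo flow-matching error of x under condition mu (assumed energy)."""
    rng = np.random.default_rng(seed)  # same noise draws for every label
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.0, 1.0)
        eps = rng.standard_normal(x.shape)
        x_t = (1.0 - t) * eps + t * x  # noised point on the straight path
        total += np.sum((velocity(x_t, t, mu) - (x - eps)) ** 2)
    return total / n_samples


def classify(x, label_means):
    """Return the label whose condition gives the lowest energy."""
    return min(label_means, key=lambda name: energy(x, label_means[name]))


mu_formal = np.full(8, 2.0)          # hypothetical "formal" condition
mu_casual = np.full(8, -2.0)         # hypothetical "casual" condition
x = mu_formal + 0.1 * np.arange(8)   # a source activation near "formal"
steered = steer(x, mu_formal, mu_casual, strength=1.0)
labels = {"formal": mu_formal, "casual": mu_casual}
print(classify(x, labels), classify(steered, labels))  # formal casual
```

In this toy setting, `strength` controls how far the activation is transported toward the latent state before regeneration: 0 returns the source activation unchanged, and 1 shifts it by the full difference between the two cluster means. In an actual system the steered vector would be written back into the residual stream of the frozen LLM, a step this sketch omits.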