On the Limits of Large Language Model Adaptability: The Impact of Model-Internalized Priors on Annotation Task Performance
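Source: arxiv.org
In zero-shot annotation and LLM-as-a-judge tasks, an LLM's internalized priors interact with user instructions. Toxicity detection experiments on social media, gaming, news, and forum datasets find that nearly two-thirds of zero-shot errors cannot be corrected through prompting: the overall rescue rate is only 34.8%, and high-confidence errors are especially stubborn. When given an incorrect task definition, LLMs follow the definition while their confidence stays unchanged. The newly proposed Definition-Specific Familiarity (DSF) measures how well a model's internal concept aligns with the task definition. After controlling for dataset-level confounds, DSF is positively associated with performance (partial r = +0.41), whereas none of three memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) shows a positive association. This indicates that prompt-based correction has fundamental limits and that definition alignment matters more than text memorization.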
Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.
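The two quantities the abstract reports can be made concrete with a minimal sketch. The abstract defines rescue rate as the fraction of initial errors corrected by prompting; it does not say how the partial correlation is computed, so the version below assumes one common approach: remove per-dataset means from both variables and correlate the residuals. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def rescue_rate(gold, zero_shot, prompted):
    """Fraction of zero-shot errors whose label is correct after re-prompting."""
    errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not errors:
        return float("nan")  # no initial errors, so the rate is undefined
    rescued = sum(1 for i in errors if prompted[i] == gold[i])
    return rescued / len(errors)


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def demean_by_group(values, groups):
    """Subtract each group's mean, i.e. residualize on group membership."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def partial_r_by_dataset(x, y, datasets):
    """Correlation of x and y after removing dataset-level means (assumed method)."""
    return pearson(demean_by_group(x, datasets), demean_by_group(y, datasets))


# Toy labels: three zero-shot errors (indices 0, 2, 4), one fixed by re-prompting.
gold = [1, 0, 1, 1, 0]
zero_shot = [0, 0, 0, 1, 1]
prompted = [1, 0, 0, 1, 1]
rate = rescue_rate(gold, zero_shot, prompted)

# Toy scores: a familiarity-like score and accuracy for two datasets.
familiarity = [1.0, 2.0, 11.0, 12.0]
accuracy = [2.0, 4.0, 1.0, 3.0]
datasets = ["a", "a", "b", "b"]
raw_r = pearson(familiarity, accuracy)
partial_r = partial_r_by_dataset(familiarity, accuracy, datasets)
```

In the toy data the raw correlation is negative while the within-dataset correlation is positive, which illustrates why dataset-level differences may need to be controlled for before a familiarity measure is compared with performance.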