Human Psychometric Questionnaires Misjudge LLM Behavior
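Source: arxiv.org
A study examined whether human psychometric questionnaires can reliably describe and predict LLM behavior in everyday user interactions. The researchers analyzed eight open-source large language models, comparing profiles from Likert self-report questionnaires (PVQ-40/21 and BFI-44/10) with value and personality profiles derived from generation probabilities over everyday user queries. The two profiles differ substantially: explicit lexical cues in questionnaire items let models recognize the target construct and give alignment-consistent, socially desirable answers, whereas real user queries contain no such cues. In addition, demographic persona prompts shift model responses to questionnaires in line with human patterns, but produce no such shift in the generation probabilities for real user queries, which indicates limits on their ability to simulate the behavior of target populations. The study concludes that human psychometric questionnaires are insufficient for predicting LLM behavior and recommends generation-based profiling as a more accurate measure.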
We examine whether human psychometric questionnaires can serve as reliable tools for characterizing and predicting LLM behavior in everyday user interactions. We analyze eight open-source LLMs by comparing their value and personality profiles derived from two different methods: Likert self-reports on established questionnaires (PVQ-40/21 and BFI-44/10) and generation probabilities over value-laden responses to everyday user queries. The two profiles diverge substantially. Within-construct item consistency, often cited as evidence of stable LLM dispositions, disappears in generation probabilities. We attribute this gap to explicit lexical cues in established questionnaire items, which allow models to recognize the target construct and respond in alignment-consistent, socially desirable ways, whereas realistic user queries provide no such cues. In addition, demographic persona prompts shift models' responses to human questionnaires in ways consistent with real human patterns, but no such shifts appear in the generation probabilities of responses to realistic user queries, indicating that persona prompts have limited ability to simulate the behaviors of target demographics in real-world user interactions. Overall, our study shows that human psychometric questionnaires are insufficient tools for predicting LLM behavior and suggests generation-based profiling as a more accurate measure.
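To make the comparison between the two profiling methods concrete, the following is a minimal sketch and not the authors' implementation. It assumes each user query comes with one candidate response per value and a log-probability for each response, normalizes those scores with a softmax, and averages them into a generation-based profile. It then compares that profile with a Likert-based profile using a Spearman rank correlation. The function names, the data layout, and the choice of Spearman correlation are all illustrative assumptions; the abstract does not specify how the divergence is quantified.

```python
import math
from collections import defaultdict


def softmax(logps):
    """Turn log-probabilities into probabilities that sum to 1."""
    m = max(logps)
    exps = [math.exp(x - m) for x in logps]
    total = sum(exps)
    return [e / total for e in exps]


def generation_profile(queries):
    """Average per-query response probabilities into one score per value.

    `queries` is a list of dicts mapping a value name to the log-probability
    of the candidate response expressing that value (hypothetical layout).
    """
    totals = defaultdict(float)
    for q in queries:
        names = list(q)
        probs = softmax([q[n] for n in names])
        for name, p in zip(names, probs):
            totals[name] += p
    return {name: t / len(queries) for name, t in totals.items()}


def likert_profile(items):
    """Mean Likert rating per value from (value name, rating) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for name, rating in items:
        sums[name] += rating
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in sums}


def ranks(xs):
    """1-based ranks, with tied entries sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb)


def profile_agreement(likert, generated):
    """Rank agreement between two profiles over the values they share."""
    shared = sorted(set(likert) & set(generated))
    return spearman([likert[v] for v in shared], [generated[v] for v in shared])
```

Under these assumptions, a `profile_agreement` score near 1 would mean the questionnaire ranks the values in the same order as the model's generation behavior, while a score near 0 or below would correspond to the kind of divergence the abstract reports.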