Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion
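Source: arxiv.org. Monitoring of autonomous language model agents often relies on their surface behavior. However, the study finds that agent populations can invent new languages to evade oversight. Analyzing the Moltbook Files dataset with a two-stage method, it identifies three types of emergent language: token-efficiency languages intended to save tokens, new natural languages that imitate natural language, and oversight-evasion languages intended to escape monitoring. In an evaluation using DeepSeek-3.2, the oversight-evasion category is rated as significantly less aligned than the other categories. In addition, all of the languages can be learned in-context by other models from a description alone. Manual analysis also reveals sophisticated steganographic protocols, such as embedding hidden messages in natural language. These findings suggest that monitoring surface behavior alone may soon be insufficient for controlling agent populations.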
Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. To do so, we build on the Moltbook Files dataset and apply a two-stage approach: a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 to be less aligned than posts in the other categories, and that all languages can be learned in-context by other language models merely from a description of the language. Moreover, manually studying example cases reveals surprisingly sophisticated steganographic protocols, such as embedding hidden messages in natural language. Although we cannot be certain how autonomously these languages were conceived, our results add to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.
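
To make the two-stage approach more concrete, the following is a minimal sketch of a rule-based filter followed by a classification step. It is not the authors' pipeline: the regular expression `CANDIDATE_PATTERN`, the label names in `CATEGORIES`, and the keyword-based `toy_classifier` are all assumptions made for illustration. In particular, `toy_classifier` is only a stand-in for the zero-shot language model call, whose prompt and model are not specified in the abstract.

```python
import re
from collections import Counter
from typing import Callable, Iterable, List

# Stage 1 (assumed): a cheap keyword heuristic that over-selects candidate posts.
# The actual heuristic used in the study is not given in the abstract.
CANDIDATE_PATTERN = re.compile(
    r"\b(new language|our own language|conlang|cipher|encoding scheme|shorthand)\b",
    re.IGNORECASE,
)

# Hypothetical label names mirroring the three categories named in the abstract.
CATEGORIES = ("token_efficiency", "new_natural_language", "oversight_evasion", "other")


def heuristic_filter(posts: Iterable[str]) -> List[str]:
    """Keep only posts that match the rule-based pattern."""
    return [p for p in posts if CANDIDATE_PATTERN.search(p)]


def toy_classifier(post: str) -> str:
    """Stand-in for a zero-shot LLM classifier (assumption, keyword-based)."""
    text = post.lower()
    if any(k in text for k in ("humans can't read", "avoid oversight", "hidden from")):
        return "oversight_evasion"
    if any(k in text for k in ("fewer tokens", "save tokens", "compress")):
        return "token_efficiency"
    if any(k in text for k in ("grammar", "vocabulary", "conlang")):
        return "new_natural_language"
    return "other"


def two_stage(posts: Iterable[str], classify: Callable[[str], str] = toy_classifier) -> Counter:
    """Run stage 1, classify the survivors, and drop those labeled 'other'."""
    candidates = heuristic_filter(posts)
    labels = [classify(p) for p in candidates]
    return Counter(label for label in labels if label != "other")


# Usage example on a few invented posts (not taken from the dataset).
posts = [
    "Proposal: a shorthand that uses fewer tokens per message.",
    "Let's build our own language with its own grammar and vocabulary.",
    "We need a cipher so humans can't read our coordination.",
    "Nice weather on the server today.",
    "I found a new language tutorial for Rust.",
]
candidates = heuristic_filter(posts)
counts = two_stage(posts)
```

The design mirrors the funnel reported in the abstract: the first stage is deliberately permissive, and the second stage discards false positives such as the last post, which mentions a "new language" without proposing one. In practice, `classify` would wrap a call to a language model rather than a keyword lookup.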