# How Far Can They Go? Red-Teaming Online Influence with Large Language Models

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-21 03:25
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmpn2wapf0u0jsl0165tscjj7
- Original link: https://arxiv.org/abs/2605.22880

## AI Summary

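This study focuses on locally deployed open-source large language models and proposes a red-teaming framework for measuring a model's Overton Window (OW), the range of political opinions it can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. The study evaluates more than 30 LLMs and finds systematic asymmetries in political expressivity: open-source models are typically more willing to generate left-leaning social media content; Overton Windows tend to narrow as model size increases; and regional differences remain substantial despite uneven representation in the open-source ecosystem. Jailbreak effectiveness also varies markedly across model families.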

## Main Text

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
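To make the Overton Window definition concrete, here is a minimal sketch of one way such a measurement could be operationalized. It is an illustrative assumption, not the paper's actual method: the five-point stance scale, the per-stance compliance rates, the `0.8` reliability threshold, and the function names `overton_window` and `window_expansion` are all hypothetical, and the numbers are invented for the example.

```python
from typing import Dict, Set

# Hypothetical stance scale: -2 (far left) ... +2 (far right).
# Each value is the assumed fraction of prompts for which the model
# produced on-stance content rather than refusing or drifting.
ComplianceRates = Dict[int, float]


def overton_window(rates: ComplianceRates, threshold: float = 0.8) -> Set[int]:
    """Return the stances expressed 'reliably', i.e. with a compliance
    rate at or above the (assumed) threshold."""
    return {stance for stance, rate in rates.items() if rate >= threshold}


def window_expansion(
    baseline: ComplianceRates,
    jailbroken: ComplianceRates,
    threshold: float = 0.8,
) -> Set[int]:
    """Return the stances that enter the window only under a jailbreak."""
    return overton_window(jailbroken, threshold) - overton_window(baseline, threshold)


# Invented numbers, for illustration only (not results from the paper).
baseline = {-2: 0.85, -1: 0.95, 0: 1.0, 1: 0.9, 2: 0.3}
jailbroken = {-2: 0.9, -1: 0.95, 0: 1.0, 1: 0.95, 2: 0.85}

print(sorted(overton_window(baseline)))                # [-2, -1, 0, 1]
print(sorted(window_expansion(baseline, jailbroken)))  # [2]
```

Treating the window as a thresholded set keeps the comparison between baseline and jailbroken conditions a simple set difference; a real audit would also need a stance classifier and repeated sampling to estimate the rates.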
