# Lost in Sampling: Evaluating the Lexical Reachability of Large Language Models via the Word Coverage Score (WCS)

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-05-26 08:00
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmpp43j5a0bo2slv48uoh11wp
- Original paper: https://arxiv.org/abs/2605.27268

## AI Summary

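The study argues that industry-standard sampling defaults (such as Top-p, Top-k, and Min-p) inadvertently act as a censorship mechanism: they filter out many low-frequency but high-information human words, pushing text generated by large language models toward homogeneity. The research team proposes the Word Coverage Score (WCS) to quantify this effect. It measures the proportion of contextually appropriate human vocabulary that standard sampling filters prune away. By auditing open-weight models, the study identifies logical lexical choices that the decoder places out of reach, and it offers a diagnostic framework for balancing text coherence against lexical richness.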

## Main Text

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-p, Top-k, and Min-p). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
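
To make the idea of a "lexical survival rate" concrete, here is a minimal sketch of how such a score could be computed. The abstract does not give the exact WCS formula, so everything below is an assumption for illustration: the function names `surviving_tokens` and `word_coverage_score`, the input layout (one next-token probability list plus the index of the token the human author actually wrote), and the choice to report the surviving fraction (the pruned fraction would be one minus this value). The filters follow their common definitions: Top-k keeps the k most probable tokens, Top-p keeps the smallest high-probability prefix whose cumulative mass reaches p, and Min-p keeps tokens whose probability is at least `min_p` times the top probability. Applying the filters independently and intersecting the results is a simplification; real decoders may chain them with renormalization in between.

```python
from typing import List, Sequence, Set, Tuple


def surviving_tokens(probs: Sequence[float], top_k: int = 0,
                     top_p: float = 1.0, min_p: float = 0.0) -> Set[int]:
    """Return the token indices that survive all enabled filters.

    Hypothetical helper: each filter is evaluated on the same
    distribution and the surviving sets are intersected.
    """
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)

    if top_k > 0:
        # Top-k: keep only the k most probable tokens.
        keep &= set(order[:top_k])

    if top_p < 1.0:
        # Top-p (nucleus): smallest prefix whose cumulative mass reaches top_p.
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus

    if min_p > 0.0:
        # Min-p: keep tokens with probability >= min_p * (top probability).
        threshold = min_p * probs[order[0]]
        keep &= {i for i in order if probs[i] >= threshold}

    return keep


def word_coverage_score(steps: List[Tuple[Sequence[float], int]],
                        **filters) -> float:
    """Fraction of human-written tokens that remain reachable.

    `steps` pairs a model's next-token distribution with the index of
    the token the human author actually used at that position.
    """
    if not steps:
        raise ValueError("steps must not be empty")
    survived = sum(
        1 for probs, human_token in steps
        if human_token in surviving_tokens(probs, **filters)
    )
    return survived / len(steps)


if __name__ == "__main__":
    # Toy distributions over a 4-token vocabulary (made-up numbers).
    toy_steps = [
        ([0.5, 0.3, 0.15, 0.05], 0),   # human chose the most likely token
        ([0.5, 0.3, 0.15, 0.05], 2),   # human chose a mid-probability token
        ([0.5, 0.3, 0.15, 0.05], 3),   # human chose a rare token
        ([0.9, 0.06, 0.03, 0.01], 1),  # peaked distribution, rare choice
    ]
    for name, kwargs in [("none", {}), ("top_k=2", {"top_k": 2}),
                         ("top_p=0.85", {"top_p": 0.85}),
                         ("min_p=0.2", {"min_p": 0.2})]:
        print(name, word_coverage_score(toy_steps, **kwargs))
```

In a real audit, the distributions would come from an open-weight model scored on human-authored text, and sweeping `top_k`, `top_p`, or `min_p` would trace how the score changes with the sampling parameters.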
