# A Study of Narrative Features in Web-Scale LLM Pretraining Corpora: Based on Dolma and NarraBERT

- Source: HuggingFace Daily Papers (community trending papers)
- Published: 2026-06-17 08:00
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmqpppj7702r2slp5ewxixcok
- Original link: https://arxiv.org/abs/2606.19468

## AI Summary

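This is the first fine-grained study of narrative features in web-scale LLM pretraining corpora. Taking Dolma, an open 3-trillion-token corpus, as its subject, the study draws on narrative theory to design a framework of 11 interpretable dimensions covering three core elements: agency, setting, and events. After sampling and annotating 400 passages, the authors fine-tune and validate NarraBERT, a RoBERTa-based model. Applying NarraBERT to 3 million passages yields a new dataset, NarraDolma. The study finds that narrative structure can be measured across massive heterogeneous data, that web text exhibits a continuous, multidimensional narrative structure, and that narrative qualities are unevenly distributed across pretraining sources and topics. NarraDolma and NarraBERT are publicly released.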

## Main Text

The narrative composition of web-scale LLM pretraining corpora remains largely unexplored, even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events), operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) web text has a continuous, multidimensional underlying narrative structure, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
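
To make finding (iii) concrete, the following is a minimal sketch of how per-passage narrative scores might be averaged by pretraining source. It is not the authors' code: the record layout, the source names, the scores, and the helper `mean_scores_by_source` are all hypothetical, and only the three element names (agency, setting, events) come from the abstract. The 11 dimensions are not listed in the abstract, so the sketch works at the element level.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: one per passage, with a source label and one score
# per narrative element. Element names come from the abstract; the source
# names and score values are invented purely for illustration.
passages = [
    {"source": "source_a", "scores": {"agency": 0.8, "setting": 0.6, "events": 0.7}},
    {"source": "source_a", "scores": {"agency": 0.6, "setting": 0.4, "events": 0.5}},
    {"source": "source_b", "scores": {"agency": 0.1, "setting": 0.2, "events": 0.0}},
]


def mean_scores_by_source(records):
    """Return {source: {element: mean score}} for a list of passage records."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for element, score in record["scores"].items():
            grouped[record["source"]][element].append(score)
    return {
        source: {element: mean(values) for element, values in elements.items()}
        for source, elements in grouped.items()
    }


summary = mean_scores_by_source(passages)
for source, elements in sorted(summary.items()):
    print(source, {element: round(value, 2) for element, value in elements.items()})
```

Comparing the per-source means side by side is one simple way to see whether narrative qualities are spread unevenly; the same grouping could be keyed on topic instead of source.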
