# SEFD: A New Method for Turning SEC Filings into LLM Training Data

- Source: Rohan Paul (@rohanpaul_ai)
- Published: 2026-06-17 19:16
- AIHOT score: 50
- AIHOT link: https://aihot.virxact.com/items/cmqi07gmf04dmslf0u004kqy6
- Original link: https://x.com/rohanpaul_ai/status/2067204635442778429

## AI Summary

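Researchers from Stanford, the University of California, and Nanjing University have released the SEFD dataset and method, which converts SEC EDGAR filings into layout-faithful MultiMarkdown. The conversion preserves merged headers, indentation, signs, spans, and table hierarchy while compressing redundant presentation scaffolding, so that the structure and accounting logic of financial tables can be used directly by LLMs. A 152B-token snapshot is public, and the full archive is estimated at about 550B tokens of long documents. The dataset overlaps less than 0.1% with Common Crawl-derived corpora.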

## Body

This was long needed for AI in finance.

Making SEC filings readable for machines without flattening the accounting logic.

Researchers from Stanford, the University of California, and Nanjing University have just released a dataset and methods for a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.

They release a 152B-token public snapshot and estimate that the full archive could reach about 550B tokens of long financial documents.

The dataset has less than 0.1% overlap with Common Crawl-derived corpora.

The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.

The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens.
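To make "layout-faithful" concrete, here is a minimal Python sketch of how a small financial table might be written as a MultiMarkdown table. It is not the authors' pipeline: the `HeaderCell`, `Row`, and `render_table` names, the non-breaking-space indentation, and the sample figures are all assumptions made for illustration. The one convention borrowed from MultiMarkdown itself is that a cell spanning several columns is closed with one pipe per spanned column.

```python
from dataclasses import dataclass


@dataclass
class HeaderCell:
    text: str
    span: int = 1  # number of columns this header cell covers


@dataclass
class Row:
    label: str
    values: list
    indent: int = 0  # nesting depth of the line item (hypothetical field)


def render_header(cells):
    # MultiMarkdown column span: close the cell with one pipe per spanned column.
    parts = [f" {cell.text} " + "|" * cell.span for cell in cells]
    return "|" + "".join(parts)


def render_row(row):
    # Assumption: hierarchy is kept with non-breaking spaces, since ordinary
    # leading spaces inside a table cell are usually trimmed.
    label = "\u00a0\u00a0" * row.indent + row.label
    return "| " + " | ".join([label] + list(row.values)) + " |"


def render_table(header_rows, rows):
    width = sum(cell.span for cell in header_rows[0])
    for header in header_rows:
        if sum(cell.span for cell in header) != width:
            raise ValueError("every header row must cover the same number of columns")
    # Left-align the label column, right-align the numeric columns.
    separator = "|" + "|".join([":--"] + ["--:"] * (width - 1)) + "|"
    lines = [render_header(header) for header in header_rows]
    lines.append(separator)
    lines.extend(render_row(row) for row in rows)
    return "\n".join(lines)


# Illustrative figures only, not taken from any filing.
header_rows = [
    [HeaderCell(""), HeaderCell("Year ended December 31", span=2)],
    [HeaderCell(""), HeaderCell("2024"), HeaderCell("2023")],
]
rows = [
    Row("Revenue", ["1,200", "1,050"]),
    Row("Operating expenses", ["", ""]),
    Row("Research and development", ["(300)", "(280)"], indent=1),
    Row("Net income", ["450", "(20)"]),
]
table = render_table(header_rows, rows)
print(table)
```

The merged "Year ended December 31" header stays a single spanning cell, the indented line item stays nested under its parent, and the accounting-style parentheses for negative values pass through untouched, which is the kind of information a flattened text dump would lose.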

----

Link: arxiv.org/abs/2606.18192v1