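Researchers from Stanford, the University of California, and Nanjing University have released the SEFD dataset and method, which converts SEC EDGAR filings into layout-faithful MultiMarkdown. It preserves merged headers, indentation, symbols, spans, and table hierarchy while compressing redundant presentation templates, so that LLMs can use the structure and accounting logic of financial tables directly. A 152B-token snapshot is public, and the full archive is estimated at about 550B tokens of long documents. The dataset overlaps less than 0.1% with Common Crawl-derived corpora.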
This has long been needed for AI in finance.
Making SEC filings readable for machines without flattening the accounting logic.
Researchers from Stanford, the University of California, and Nanjing University have just released a dataset and methods for turning SEC filings into useful LLM training data without losing the meaning inside financial tables.
They release a 152B-token public snapshot and estimate that the full archive could reach about 550B tokens of long financial documents.
The dataset has less than 0.1% overlap with Common Crawl-derived corpora.
The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.