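A Stanford researcher has released the SEFD dataset and processing method, which turn SEC EDGAR filings into structured data suitable for LLM training while preserving table structure, indentation, merged headers, signs, spans, and hierarchy. The public snapshot contains 152B tokens, and the full archive is estimated at about 550B tokens. The data overlaps with Common Crawl corpora by less than 0.1%. It uses a layout-faithful MultiMarkdown format that heavily compresses the original presentation scaffolding, preserving financial meaning while reducing wasted tokens.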
This was long needed for AI in finance.
Making SEC filings readable for machines without flattening the accounting logic.
A Stanford researcher has just released a dataset and methods that turn SEC filings into useful LLM training data without losing the meaning inside financial tables.
It includes a 152B-token public snapshot, and the full archive is estimated at about 550B tokens of long financial documents.
The data has less than 0.1% overlap with Common Crawl-derived corpora.
The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.
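To make the idea concrete, here is a minimal sketch of how dropping presentation markup while keeping row indentation can shrink a financial table. It is not the SEFD pipeline: the function names (`to_html`, `to_pipe_table`), the row layout, the use of leading spaces for indentation, and the figures are all hypothetical, and it compares character counts rather than real tokenizer counts.

```python
from html import escape

# Hypothetical rows: (label, indent_level, values). The figures are made up.
HEADER = ["", "2024", "2023"]
ROWS = [
    ("Revenue", 0, ["100", "90"]),
    ("Product", 1, ["60", "55"]),
    ("Services", 1, ["40", "35"]),
    ("Net loss", 0, ["(5)", "(8)"]),  # parentheses mark a negative value
]

def to_html(header, rows):
    """Render the table with inline styles, as presentation-heavy HTML might."""
    parts = ["<table>"]
    head = "".join(f'<td style="font-weight:bold">{escape(h)}</td>' for h in header)
    parts.append("<tr>" + head + "</tr>")
    for label, indent, values in rows:
        cells = [f'<td style="padding-left:{indent * 12}pt">{escape(label)}</td>']
        cells += [f'<td style="text-align:right">{escape(v)}</td>' for v in values]
        parts.append("<tr>" + "".join(cells) + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)

def to_pipe_table(header, rows, indent_unit="  "):
    """Render a pipe table; hierarchy is kept as leading spaces in the label cell."""
    lines = ["| " + " | ".join(header) + " |"]
    # First column default-aligned, numeric columns right-aligned.
    lines.append("|" + "|".join([" --- "] + [" ---: "] * (len(header) - 1)) + "|")
    for label, indent, values in rows:
        lines.append("| " + " | ".join([indent_unit * indent + label] + values) + " |")
    return "\n".join(lines)

if __name__ == "__main__":
    html_text = to_html(HEADER, ROWS)
    md_text = to_pipe_table(HEADER, ROWS)
    print(md_text)
    print(f"HTML characters: {len(html_text)}, pipe-table characters: {len(md_text)}")
```

In this toy case the sub-items "Product" and "Services" stay visibly nested under "Revenue" and the parenthesized negatives survive, while the inline styling disappears. How the real format encodes indentation, merged headers, and spans is not specified here.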