A Hands-On Tutorial on Streaming, Filtering, Deduplicating, Tokenizing, and Analyzing Large-Scale Web Corpora with FineWeb

2026-06-15 04:45 · 18 days ago · Sana Hassan

AI Summary

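This tutorial shows how to load 3,000 documents from the FineWeb sample-10BT subset through Hugging Face's load_dataset streaming interface, without downloading the full multi-terabyte corpus. It inspects the schema and metadata fields such as url, language, language_score, and token_count, reproduces FineWeb's quality-filtering pipeline (Gopher, C4, and FineWeb custom rules), applies MinHash for near-duplicate detection, verifies token counts with the GPT-2 tokenizer, and finally produces statistical charts of domains, language scores, document lengths, and tokenizer efficiency.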

Original text · Untranslated

In this tutorial, we explore the FineWeb dataset through an advanced hands-on workflow. We stream a manageable sample of the dataset without downloading the full multi-terabyte corpus, inspect its schema and metadata, and analyze key fields such as URL, language, language score, and token count. We also reproduce simplified versions of FineWeb’s quality-filtering pipeline, apply MinHash-based near-duplicate detection, verify token counts with the GPT-2 tokenizer, and generate useful analytics on domains, language scores, document lengths, and tokenizer efficiency.
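To make the near-duplicate detection step concrete, here is a minimal sketch of the MinHash idea using only the standard library. It is not the tutorial's own code, which installs the `datasketch` package for this purpose. The helper names (`shingles`, `minhash_signature`, `estimated_jaccard`), the 3-word shingle size, the 64 hash functions, and the sample sentences are all assumptions chosen for illustration.

```python
import hashlib
import re

NUM_PERM = 64  # number of seeded hash functions (illustrative choice)


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, item):
    """Seeded 64-bit hash of a string, built on BLAKE2b."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=NUM_PERM):
    """Keep, for each seeded hash function, the minimum hash over the set."""
    if not items:
        return [0] * num_perm
    return [min(_hash(seed, it) for it in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical documents: doc_b differs from doc_a only in its last word.
doc_a = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the quiet river bank yesterday"
doc_c = "streaming datasets lets us inspect large corpora without downloading every shard"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
est_ab = estimated_jaccard(sig_a, sig_b)
est_ac = estimated_jaccard(sig_a, sig_c)
print(f"a vs b: {est_ab:.2f}   a vs c: {est_ac:.2f}")
```

Pairs whose estimated similarity exceeds a chosen threshold would be flagged as near duplicates. A library such as `datasketch` adds locality-sensitive hashing on top of these signatures so that candidate pairs can be found without comparing every document against every other.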

import subprocess, sys

# Install the required packages quietly with the current interpreter's pip.
def pip(*pkgs):
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", *pkgs], check=True)

pip("datasets>=2.19", "datasketch", "tiktoken", "pandas", "matplotlib", "tqdm")

import re, math, random, collections
from urllib.parse import urlparse
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
from datasets import load_dataset

# Fix the random seeds and widen column display for easier inspection.
random.seed(0); np.random.seed(0)
pd.set_option("display.max_colwidth", 90)

We begin by installing all required libraries for streaming, analysis, deduplication, tokenization, and visualization. We import the core Python packages needed to process FineWeb documents and work with tabular data. We also set random seeds and display options so that our results remain consistent and easier to inspect.

N_DOCS = 3000
print(f"Streaming {N_DOCS} docs from FineWeb sample-10BT ...")

# streaming=True iterates over records lazily instead of downloading the corpus.
stream = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Collect the first N_DOCS records, then stop reading the stream.
docs = []
for i, doc in enumerate(tqdm(stream, total=N_DOCS)):
    docs.append(doc)
    if i + 1 >= N_DOCS:
        break

df = pd.DataFrame(docs)
print("\nColumns:", list(df.columns))
print(df[["url", "language", "language_score", "token_count"]].head(5))

# Print every field of the first record, truncating long strings to 120 characters.
ex = docs[0]
print("\n--- Example record (fields) ---")
for k, v in ex.items():
    preview = (v[:120] + "…") if isinstance(v, str) and len(v) > 120 else v
    print(f"{k:>16}: {preview}")
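Once the records are in memory, the domain analytics mentioned in the introduction can start from the `url` field. The following is a minimal sketch under stated assumptions rather than the tutorial's own analysis code: `domain_of`, `top_domains`, and the `sample_records` list are hypothetical names and data, chosen so that the example runs without network access.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""  # hostname drops any port and user info
    return host[4:] if host.startswith("www.") else host


def top_domains(records, n=5):
    """Count how many records come from each domain and return the n most common."""
    counts = Counter(domain_of(r["url"]) for r in records if r.get("url"))
    return counts.most_common(n)


# Hypothetical records that mimic the fields shown above.
sample_records = [
    {"url": "https://www.example.com/a", "language_score": 0.97, "token_count": 120},
    {"url": "http://example.com/b?x=1", "language_score": 0.91, "token_count": 340},
    {"url": "https://blog.example.org:8080/post", "language_score": 0.88, "token_count": 75},
]

for domain, count in top_domains(sample_records):
    print(f"{domain:>20}: {count}")
```

With the streamed data, the same function could be called as `top_domains(docs, n=10)`, since each element of `docs` is a dictionary with a `url` key.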

MarkTechPost (RSS)


Read the original: marktechpost.com
