# ModelBest and OpenBMB Jointly Release FactNet: A Billion-Scale Open-Source Multilingual Knowledge Graph

- Source: OpenBMB (@OpenBMB)
- Published: 2026-06-15 21:30
- AIHOT score: 43
- AIHOT link: https://aihot.virxact.com/items/cmqf9kprd03vpslwa2e40csf8
- Original link: https://x.com/OpenBMB/status/2066513555806171537

## AI Summary

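ModelBest and OpenBMB, together with Tsinghua NLP, the Technical University of Munich, and others, have released FactNet, a billion-scale open-source multilingual knowledge graph. It unifies 1.7B atomic assertions into 1.55B FactSynsets, backed by 3.01B byte-level traceable evidence spans from Wikipedia editions in 316 languages (page ID, revision ID, Unicode offsets), with 99.63% exact re-localization. A human audit of 4,200 items gives a design-weighted precision of 92.1% (88.5% for low-resource languages). FactNet-Bench comprises three tasks (KGC, MKQA, MFC) that explicitly penalize information leakage, providing a structured factual foundation for verifiable AI.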

## Full Text

LLMs keep getting more fluent, but can you actually verify what they say? Structured KBs like Wikidata lack text grounding. Annotation-based datasets like FEVER are too small and monolingual. Synthetic expansion just produces hallucinations at scale. The trilemma between authenticity, scale, and structure has gone unsolved. ❓
Today, we dive into FactNet, a landmark contribution by @TsinghuaNLP (OpenBMB member) alongside researchers from TU Munich, Modelbest Inc., and Minzu University of China. FactNet constructs a billion-scale, open-source multilingual knowledge graph that unifies structured Wikidata assertions with auditable, byte-level evidence pointers from 316 native Wikipedia editions.
🤗 Paper: https://huggingface.co/papers/2602.03417
📄 arXiv: https://arxiv.org/abs/2602.03417
💻 Code & Data: https://github.com/yl-shen/factnet

Why it matters:
1️⃣ Billion-Scale & Truly Multilingual: FactNet unifies 1.7B atomic assertions into 1.55B FactSynsets, backed by 3.01B grounded evidence spans across 316 languages. Even the bottom-200 languages hold 2.7% of all evidence, a scale no prior resource has achieved with native, auditable text grounding.
2️⃣ Byte-Level Provenance, Zero Stochastic Inference: Unlike synthetic datasets that sever the connection to authentic sources, FactNet is built through a fully deterministic three-stage pipeline. Every FactSense carries a recoverable pointer (page ID, revision ID, Unicode character offsets), achieving 99.63% exact re-localization on a 1M-sample test.
3️⃣ 92.1% Grounding Precision Across 316 Languages: Human audit of 4,200 items confirms design-weighted precision of 0.921 (95% CI [0.913, 0.929]). WIKILINK_ENTITY and INFOBOX_FIELD matchers cover 55% of evidence at precision above 0.94. Low-resource languages still achieve 0.885, validating deterministic segmentation for tail languages.
4️⃣ FactNet-Bench Sets a New Evaluation Standard: Three tasks (KGC, MKQA, MFC) explicitly penalize leakage: removing predicate masking alone anomalously inflates KGC MRR from 0.298 to 0.351. Grammar-guided decoding boosts valid parse rate from 88.5% to 95.2% on MKQA. MFC Top-5 aggregation reaches 0.73 accuracy and 0.54 Span F1.
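To make point 2 concrete, here is a minimal sketch of what an exact re-localization check could look like. It is not FactNet's actual schema or code: the `EvidencePointer` fields, the half-open offset convention, the stored `surface` string, and the `fetch_text` lookup are all assumptions made for illustration. The only facts taken from the post are that a pointer carries a page ID, a revision ID, and Unicode character offsets.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvidencePointer:
    """Hypothetical pointer layout; FactNet's real schema may differ."""
    page_id: int
    revision_id: int
    start: int    # code point offset, inclusive (assumed convention)
    end: int      # code point offset, exclusive (assumed convention)
    surface: str  # evidence text recorded when the pointer was built


def relocalize(pointer: EvidencePointer, revision_text: str) -> bool:
    """Return True if the stored offsets recover exactly the recorded span."""
    # Python str slicing counts Unicode code points, not bytes.
    return revision_text[pointer.start:pointer.end] == pointer.surface


def exact_relocalization_rate(
    pointers: Iterable[EvidencePointer],
    fetch_text: Callable[[int, int], str],
) -> float:
    """Fraction of pointers whose span is recovered exactly."""
    results = [
        relocalize(p, fetch_text(p.page_id, p.revision_id)) for p in pointers
    ]
    return sum(results) / len(results) if results else 0.0


# Toy in-memory store standing in for pinned Wikipedia revisions.
REVISIONS = {(1, 10): "北京是中华人民共和国的首都。"}


def fetch_text(page_id: int, revision_id: int) -> str:
    """Hypothetical lookup of revision text by (page ID, revision ID)."""
    return REVISIONS[(page_id, revision_id)]


good = EvidencePointer(1, 10, 0, 2, "北京")   # offsets match the span
stale = EvidencePointer(1, 10, 1, 3, "北京")  # offsets shifted by one
rate = exact_relocalization_rate([good, stale], fetch_text)
```

Pinning the revision ID is what makes such a check deterministic in this sketch: the same pointer is always compared against the same text, so a mismatch reflects a bad offset rather than a later page edit.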
FactNet resolves the authenticity-scale-structure trilemma and builds the foundation for AI systems that are not just knowledgeable, but structurally grounded and inherently verifiable.
#AI #THUNLP #OpenBMB #KnowledgeGraph #FactChecking #NLP #LLM #MultilingualAI
