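Microsoft's MAI models reportedly trained on unlicensed web data, contradicting its promise of "enterprise grade, clean and commercially licensed data"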

2026-06-05 20:10 · Matthias Bastian

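AI summary

When pitching its MAI models to enterprise customers, Microsoft claimed they were trained only on "clean and commercially licensed data," but the models in fact rely in part on unlicensed web data such as Common Crawl. Like other AI companies, Microsoft is likely relying on fair use, and it shifts the burden of blocking its crawler onto site owners.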

Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"

Microsoft partly trained its new MAI models on unlicensed web data. The technical paper shows Microsoft used Common Crawl, among other sources, as Simon Willison noted. Microsoft had previously claimed the MAI models were trained only on "enterprise grade, clean and commercially licensed data."

Like other AI companies scraping the web, Microsoft is likely relying on fair use. The paper describes the data as a "mixture of publicly available and licensed human-generated data." For web data, Microsoft says it uses "a proprietary crawler that respects the Robots Exclusion Protocol (robots.txt) and related meta-tag and HTML controls, enabling site owners to manage how content on their sites is accessed and used."
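As a rough illustration of the opt-out mechanism the paper refers to, the sketch below uses Python's standard `urllib.robotparser` to check whether a crawler may fetch a URL under a given robots.txt. The robots.txt content, the `is_allowed` helper, and the user-agent token `ExampleAIBot` are all hypothetical; the article does not name Microsoft's crawler, and this is not a description of how it actually works.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish. "ExampleAIBot" is an
# invented user-agent token, not the name of Microsoft's crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", page))  # blocked by its own rule
    print(is_allowed(ROBOTS_TXT, "OtherBot", page))      # falls back to the "*" rule
```

The check only blocks a crawler that the site owner has explicitly named or covered with a wildcard rule. A site that publishes no such rule is treated as crawlable by default.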

That puts the burden of protecting content on site owners, like assuming anyone who doesn't lock their door consents to a break-in. Fair use remains contested, and courts are still sorting it out. In short, Microsoft does what every other AI company does, yet sells its training data as especially "clean." It isn't.

The Decoder: AI News (RSS)