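Microsoft's MAI models reportedly used unlicensed web data, contradicting its promise of "enterprise grade, clean and commercially licensed data"
Original article: the-decoder.com
In pitching its MAI models to enterprise customers, Microsoft claimed they were trained only on "clean and commercially licensed data," but the models in fact rely partly on unlicensed web data such as Common Crawl. Like other AI companies, Microsoft invokes fair use and shifts the responsibility for blocking its crawler onto site owners.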
Microsoft trained its MAI models on unlicensed web data despite promising "enterprise grade, clean and commercially licensed data"
Microsoft partly trained its new MAI models on unlicensed web data. The technical paper shows Microsoft used Common Crawl, among other sources, as Simon Willison noted. Microsoft had previously claimed the MAI models were trained only on "enterprise grade, clean and commercially licensed data."
Like other AI companies scraping the web, Microsoft is likely relying on fair use. The paper describes the data as a "mixture of publicly available and licensed human-generated data." For web data, Microsoft says it uses "a proprietary crawler that respects the Robots Exclusion Protocol (robots.txt) and related meta-tag and HTML controls, enabling site owners to manage how content on their sites is accessed and used."
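To illustrate what "respecting robots.txt" puts on the site owner, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAICrawler`, the robots.txt contents, and the URLs are all hypothetical; the article does not name Microsoft's crawler or its user-agent string, and the meta-tag and HTML controls it mentions are not covered here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to opt out of one crawler.
# "ExampleAICrawler" is a made-up name, not Microsoft's actual user agent.
ROBOTS_TXT = """\
User-agent: ExampleAICrawler
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAICrawler", "OtherBot"):
        for url in ("https://example.com/articles/1", "https://example.com/private/x"):
            print(agent, url, is_allowed(agent, url))
```

The sketch shows that the opt-out only works if the site owner knows the crawler's user-agent name and lists it explicitly. Without a matching rule, a compliant crawler treats the content as available.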