为应对互联网被AI生成内容污染的问题,研究者提出“低背景标记”设想,计划训练仅使用历史文本的复古模型。团队集结了包括GPT-1/2开发者在内的专家,通过训练复古OCR模型处理旧书籍、报纸等资料,并利用礼仪手册、词典等结构化历史文本合成RLHF数据。为确保数据纯净,他们开发了基于文档n-gram的时代错位分类器,精心筛选了数千亿1931年前的公共领域标记进行训练。最终发布了130亿参数的Talkie模型,旨在探索语言模型的泛化能力。然而,该模型在发布后表现出强烈的种族偏见倾向,引发了新的伦理担忧。
be me "the internet is polluted by ai slop, we need low-background tokens" "wouldnt it be cool if we could time travel and see what our ancestors 100 years ago would say to us" all the existing vintage models are like <4B we need a chat tuned 13B vintage model assemble avengers of ML incl the GPT-1/2 guy need vintage tokens train new vintage OCR model for old books, newspapers, periodicals, scientific journals, patents, and case law need vintage RLHF but cant use chat synthesize RLHF pairs from historical texts with regular structure eg etiquette manuals, letter-writing manuals, cookbooks, dictionaries, encyclopedias, and poetry and fable collections, shove it into ChatML train it future knowledge still got in somehow dammit.jpg train new SOTA document-level n-gram-based anachronism classifier meticulously curate hundreds of billions of pre-1931 tokens (public domain) train it ok! it checks out vs our FineWeb baseline! release it it's the most confidently racist model ever released by humankind mfw