# Talkie复古语言模型：基于1931年前文本的训练与伦理挑战

- 来源：swyx 🇸🇬 (@swyx)
- 发布时间：2026-04-30 08:51
- AIHOT 分数：51
- AIHOT 链接：https://aihot.virxact.com/items/cmoksi5610319sljezh40x4z2
- 原文链接：https://x.com/swyx/status/2049652947408372187

## AI 摘要

为应对互联网被AI生成内容污染的问题，研究者提出“低背景标记”设想，计划训练仅使用历史文本的复古模型。团队集结了包括GPT-1/2开发者在内的专家，通过训练复古OCR模型处理旧书籍、报纸等资料，并利用礼仪手册、词典等结构化历史文本合成RLHF数据。为确保数据纯净，他们开发了基于文档n-gram的时代错位分类器，精心筛选了数千亿1931年前的公共领域标记进行训练。最终发布了130亿参数的Talkie模型，旨在探索语言模型的泛化能力。然而，该模型在发布后表现出强烈的种族偏见倾向，引发了新的伦理担忧。

## 正文

> be me
> "the internet is polluted by ai slop， we need low-background tokens"
> "wouldnt it be cool if we could time travel and see what our ancestors 100 years ago would say to us"
> all the existing vintage models are like <4B
> we need a chat tuned 13B vintage model
> assemble avengers of ML incl the GPT-1/2 guy
> need vintage tokens
> train new vintage OCR model for old books， newspapers， periodicals， scientific journals， patents， and case law
> need vintage RLHF but cant use chat
> synthesize RLHF pairs from historical texts with regular structure eg etiquette manuals， letter-writing manuals， cookbooks， dictionaries， encyclopedias， and poetry and fable collections， shove it into ChatML
> train it
> future knowledge still got in somehow
> dammit.jpg
> train new SOTA document-level n-gram-based anachronism classifier
> meticulously curate hundreds of billions of pre-1931 tokens （public domain）
> train it
> ok！ it checks out vs our FineWeb baseline！
> release it
> it's the most confidently racist model ever released by humankind
> mfw

### 引用推文

> Nick Levine：New work with @AlecRad and @DavidDuvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 t...
