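Train Your Own Large Language Model From Scratch

2026-05-05 20:25 · 58 days ago · kristianpaul

AI Summary

The open-source GitHub project "llm-from-scratch" offers a complete guide to training a large language model from scratch. It details the core components needed to build a modern LLM, including the tokenizer, the Transformer architecture, and the pre-training and fine-tuning process. The guide emphasizes understanding the model's internals through hands-on practice rather than calling existing APIs directly. The project drew wide attention on Hacker News, earning 293 points, which reflects developers' strong demand for a deep grasp of the technology underlying LLMs.

Original text · untranslated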

Train Your Own LLM From Scratch

A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.

Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.

This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it to a ~10M param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.
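To get a feel for what "~10M params" means, here is a rough back-of-the-envelope count for a GPT-style decoder. The configuration below (6 layers, embedding width 384, a 65-symbol character vocabulary, context length 256) is an assumption chosen for illustration and is not taken from the workshop; the function name `estimate_gpt_params` is likewise hypothetical. Biases and LayerNorm weights are ignored, and the output head is assumed to share weights with the token embedding.

```python
def estimate_gpt_params(n_layer: int, n_embd: int, vocab_size: int, block_size: int) -> int:
    """Rough parameter count for a GPT-style decoder (no biases, no LayerNorm)."""
    attention = 4 * n_embd * n_embd      # Q, K, V projections plus the output projection
    feed_forward = 8 * n_embd * n_embd   # two linear layers with a 4x hidden expansion
    per_block = attention + feed_forward
    token_embedding = vocab_size * n_embd       # assumed tied with the output head
    position_embedding = block_size * n_embd    # learned positional embeddings
    return n_layer * per_block + token_embedding + position_embedding


# Hypothetical configuration, not the workshop's stated hyperparameters.
total = estimate_gpt_params(n_layer=6, n_embd=384, vocab_size=65, block_size=256)
print(f"{total / 1e6:.2f}M parameters")  # prints "10.74M parameters"
```

Almost all of the count comes from the transformer blocks (12 × n_embd² per layer). With a large subword vocabulary the embedding table would dominate instead, so the vocabulary size matters a lot at this scale.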

What You'll Build

A working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You'll write:

Tokenizer — turning text into numbers the model can process

Model architecture — the transformer: embeddings, attention, feed-forward layers

Training loop — forward pass, loss, backprop, optimizer, learning rate scheduling

Text generation — sampling from your trained model
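Three of the pieces listed above can be sketched in plain Python without any ML library: a tokenizer, a learning rate schedule, and sampling. This is a minimal sketch under assumptions, not the workshop's code: it assumes a character-level tokenizer, a linear-warmup plus cosine-decay schedule, and temperature sampling, and every name and default value (`CharTokenizer`, `lr_at`, `sample_next`, `max_lr=3e-4`, and so on) is hypothetical. The model itself and the full training loop need PyTorch and are left out.

```python
import math
import random


class CharTokenizer:
    """Character-level tokenizer: one integer id per distinct character."""

    def __init__(self, text: str):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, s: str) -> list[int]:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


def lr_at(step: int, max_lr: float = 3e-4, warmup: int = 100,
          total: int = 1000, min_lr: float = 3e-5) -> float:
    """Linear warmup followed by cosine decay (illustrative defaults)."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step >= total:
        return min_lr
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


def sample_next(logits: list[float], temperature: float = 1.0, rng=random) -> int:
    """Sample a token id from raw scores with a temperature-scaled softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


tok = CharTokenizer("To be, or not to be")
ids = tok.encode("to be")
print(ids, "->", tok.decode(ids))
```

Lower temperatures sharpen the distribution toward the highest-scoring token, while higher ones make the output more varied. In a real generation loop, the logits would come from the trained model's forward pass rather than a hand-written list.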

Prerequisites

Any laptop or desktop (Mac, Linux, or Windows)

Python 3.12+

Comfort reading Python code (you don't need ML experience)

Training uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU automatically. Also works on Google Colab — upload the files and run with !python train.py.
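Automatic device selection of this kind is commonly written as a short fallback chain in PyTorch. The sketch below is one way to do it, not necessarily how the workshop's train.py does it; the helper name `pick_device` is hypothetical, and the `ImportError` fallback exists only so the snippet still runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch, report plain CPU.
        return "cpu"
    if torch.cuda.is_available():           # NVIDIA GPU
        return "cuda"
    if torch.backends.mps.is_available():   # Apple Silicon GPU
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `model.to(device)` and to tensor constructors so the same script runs unchanged on a MacBook, a CUDA machine, or Colab.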

Getting Started

Local (recommended)

Install uv if you don't have it:

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the project:

uv sync
mkdir scratchpad && cd scratchpad

Google Colab

If you don't have a local setup, upload the repo to Colab and install dependencies:

!pip install torch numpy tqdm tiktoken

Upload data/shakespeare.txt to your Colab files, then write your code in notebook cells or upload .py files and run them with !python train.py.

Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working model.py, train.py, and generate.py that you wrote yourself.
