A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.
Representative figures from frontier models circa 2024 — exact numbers shift with every release. The scale is the point, not the precision.
Downloadingthe Internet
The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.
The goal: large quantity of high quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly 10 consumer hard drives worth of text — representing ~15 trillion tokens.
Click any stage to read more detail
Tokenization
Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.
GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.
BPE in Action
Training theNeural Network
The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.
Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.
The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.
Transformer Architecture
Select a training stage to see model output quality
Model Output at This Stage
Inference &Token Sampling
Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.
This process is stochastic — the same prompt generates different outputs every time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.
Temperature controls randomness. Low temperature (0.1) → model always picks the top token. High temperature (2.0) → uniform chaos. 0.7–1.0 is the sweet spot for coherent-but-creative text.
A complete walkthrough of how large language models like ChatGPT are built — from raw internet text to a conversational assistant. Based on Andrej Karpathy's technical deep dive.
Representative figures from frontier models circa 2024 — exact numbers shift with every release. The scale is the point, not the precision.
Downloadingthe Internet
The first step is collecting an enormous amount of text. Organizations like Common Crawl have been crawling the web since 2007 — indexing 2.7 billion pages by 2024. This raw data is then filtered into a high-quality dataset like FineWeb.
The goal: large quantity of high quality, diverse documents. After aggressive filtering, you end up with about 44 terabytes — roughly 10 consumer hard drives worth of text — representing ~15 trillion tokens.
Token Sampling Demo
Watch the model choose the next word. Each bar shows the probability of a candidate token.
The InternetSimulator
After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.
Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.
The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.
Base Model Behavior
Building the Assistant
The base model is a token simulator. To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns conversations.
Supervised Fine-Tuning (SFT)
Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.
Modern SFT datasets (like UltraChat) have millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.
Conversation Token Format
Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:
Then RLHF refines the assistant's behavior further:
RLHF — Reinforcement Learningfrom Human Feedback
Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.
Cognitive Quirksof Language Models
Understanding why LLMs behave the way they do requires thinking about their psychology — the emergent properties of being trained to statistically imitate human text.
Retrieval-AugmentedGeneration
LLMs have a knowledge cutoff and a finite context window. RAG solves this by embedding your documents into a vector store, retrieving the most semantically relevant chunks at query time, and injecting them into the context — shifting the model's prediction distribution toward grounded, up-to-date facts rather than memorized training data.
Every document is converted to a dense vector (~1,536 numbers) by an embedding model. Semantically similar texts land near each other in this high-dimensional space — no keyword matching needed.
The user's question is embedded the same way. Cosine similarity finds the nearest document vectors — the chunks most semantically related to the query — typically the top 2–5.
Retrieved chunks are prepended to the prompt before the LLM sees the question. The model generates from injected facts rather than relying on memorized training data — dramatically reducing hallucination on knowledge-intensive tasks.
Effect on Predictions
Security Challengesin LLM Systems
The same properties that make LLMs powerful — following instructions, completing patterns, acting on context — also create new attack surfaces. A cat-and-mouse game between attacks and defenses is now playing out in this new computing paradigm.
From Text toAssistant
The complete journey from raw web crawl to the ChatGPT you interact with — across two major stages, months of compute, and billions of parameters.
Think of an LLM as an Operating System
An LLM isn't just a chatbot — it's the kernel process of an emerging OS. It coordinates memory, compute, and tools via natural language.
Disk = Internet / Files — browsed on demand or retrieved via RAG
RAM = Context Window — finite working memory; the model pages info in and out
CPU = GPU Inference — the forward pass generating each token
System 2 Thinking — converting time into accuracy; "take 30 minutes, don't rush"
Self-Improvement — the AlphaGo question: can LLMs surpass human-level answers once a reward signal exists?
Customization — an app store of specialized LLM experts for narrow tasks
Multimodality — text, images, audio, and video unified in one model
Built from Andrej Karpathy's "Intro to Large Language Models" lecture — all facts, figures, and framings traced back to that source. Interactive visualizations built with AI assistance. The most important takeaway: every word generated is a probabilistic sample — a biased coin flip, at 100K-way scale, billions of times.
This was posted to Hacker News and drew heated debate about it being LLM-generated. That's a fair observation — the implementation was AI-assisted. But the content isn't the AI's: every claim, figure, and framing in this guide comes directly from Karpathy's lecture, not from a model hallucinating about LLMs.
HN discussion · GitHub · Full lecture transcript · HN update note · LLM council report · v1 (original) · Part 2: How to Use LLMs →
Click any stage to read more detail
Tokenization
Neural networks can't process raw text — they need numbers. The solution is tokenization: breaking text into "tokens" (sub-word chunks) and assigning each an ID.
GPT-4 uses a vocabulary of 100,277 tokens, built via the Byte Pair Encoding (BPE) algorithm. BPE starts with individual bytes (256 symbols), then iteratively merges the most frequent adjacent pairs — compressing the sequence length while expanding the vocabulary.
BPE in Action
Training theNeural Network
The Transformer neural network is initialized with random parameters — billions of "knobs". Training adjusts these knobs so the network gets better at predicting the next token in any sequence.
Every training step: sample a window of tokens → feed to network → compare prediction to actual next token → nudge all parameters slightly in the right direction. Repeat billions of times.
The loss — a single number measuring prediction error — falls steadily as the model learns the statistical patterns of human language.
Transformer Architecture
Select a training stage to see model output quality
Model Output at This Stage
Inference &Token Sampling
Once trained, the network generates text autoregressively: feed a sequence of tokens → get a probability distribution over all 100K possible next tokens → sample one → append → repeat.
This process is stochastic — the same prompt generates different outputs every time because we're flipping a biased coin. Higher-probability tokens are more likely but not guaranteed to be chosen.
Temperature controls randomness. Low temperature (0.1) → model always picks the top token. High temperature (2.0) → uniform chaos. 0.7–1.0 is the sweet spot for coherent-but-creative text.
Token Sampling Demo
Watch the model choose the next word. Each bar shows the probability of a candidate token.
The InternetSimulator
After pre-training, you have a base model — a sophisticated autocomplete engine. It's not an assistant. It doesn't answer questions. It continues token sequences based on what it saw on the internet.
Give it a Wikipedia sentence and it'll complete it from memory. Ask it "What is 2+2?" and it might give you a math textbook page, a quiz answer key, or go off on a tangent — whatever was statistically common in its training data.
The base model's knowledge lives in its 405 billion parameters — a lossy compression of the internet, like a zip file that approximates rather than perfectly stores information.
Base Model Behavior
Building the Assistant
The base model is a token simulator. To turn it into a helpful assistant, we need post-training — a much cheaper but equally critical stage. This is where the model learns conversations.
Supervised Fine-Tuning (SFT)
Human labelers create a dataset of ideal conversations, following detailed labeling instructions: be helpful, be truthful, be harmless. The model is then trained on these conversations — not from scratch, but by continuing to adjust the pre-trained weights on this new data.
Modern SFT datasets (like UltraChat) have millions of conversations — mostly synthetic (LLM-generated), with human review. The model learns by imitation: it adopts the persona of the ideal assistant reflected in the data.
Conversation Token Format
Every conversation must be encoded as a flat token sequence. Special tokens mark the structure:
Then RLHF refines the assistant's behavior further:
RLHF — Reinforcement Learningfrom Human Feedback
Human raters rank multiple model responses. A reward model learns to predict human preferences. The language model is then trained via reinforcement learning to generate responses the reward model scores highly.
Cognitive Quirksof Language Models
Understanding why LLMs behave the way they do requires thinking about their psychology — the emergent properties of being trained to statistically imitate human text.
Retrieval-AugmentedGeneration
LLMs have a knowledge cutoff and a finite context window. RAG solves this by embedding your documents into a vector store, retrieving the most semantically relevant chunks at query time, and injecting them into the context — shifting the model's prediction distribution toward grounded, up-to-date facts rather than memorized training data.
Every document is converted to a dense vector (~1,536 numbers) by an embedding model. Semantically similar texts land near each other in this high-dimensional space — no keyword matching needed.
The user's question is embedded the same way. Cosine similarity finds the nearest document vectors — the chunks most semantically related to the query — typically the top 2–5.
Retrieved chunks are prepended to the prompt before the LLM sees the question. The model generates from injected facts rather than relying on memorized training data — dramatically reducing hallucination on knowledge-intensive tasks.
Effect on Predictions
Security Challengesin LLM Systems
The same properties that make LLMs powerful — following instructions, completing patterns, acting on context — also create new attack surfaces. A cat-and-mouse game between attacks and defenses is now playing out in this new computing paradigm.
From Text toAssistant
The complete journey from raw web crawl to the ChatGPT you interact with — across two major stages, months of compute, and billions of parameters.
Think of an LLM as an Operating System
An LLM isn't just a chatbot — it's the kernel process of an emerging OS. It coordinates memory, compute, and tools via natural language.
Disk = Internet / Files — browsed on demand or retrieved via RAG
RAM = Context Window — finite working memory; the model pages info in and out
CPU = GPU Inference — the forward pass generating each token
System 2 Thinking — converting time into accuracy; "take 30 minutes, don't rush"
Self-Improvement — the AlphaGo question: can LLMs surpass human-level answers once a reward signal exists?
Customization — an app store of specialized LLM experts for narrow tasks
Multimodality — text, images, audio, and video unified in one model
Built from Andrej Karpathy's "Intro to Large Language Models" lecture — all facts, figures, and framings traced back to that source. Interactive visualizations built with AI assistance. The most important takeaway: every word generated is a probabilistic sample — a biased coin flip, at 100K-way scale, billions of times.
This was posted to Hacker News and drew heated debate about it being LLM-generated. That's a fair observation — the implementation was AI-assisted. But the content isn't the AI's: every claim, figure, and framing in this guide comes directly from Karpathy's lecture, not from a model hallucinating about LLMs.
HN discussion · GitHub · Full lecture transcript · HN update note · LLM council report · v1 (original) · Part 2: How to Use LLMs →