Soul Player C64 is an AI chatbot. Outputs are generated by a transformer language model, not a human.
A real transformer running on a 1 MHz Commodore 64.
And apparently on the Amstrad CPC, too! -> https://github.com/G1D30N/soulplayer-cpc
.-------. | O O | | V | |..|---|..| # SOUL PLAYER C64 25K PARAMETERS. 2 LAYERS. REAL TRANSFORMER. LOADED OFF A FLOPPY DISK. YOU> hey C64> HELLO! RE SOUNDS ME. MEFUL!
.-------. | O O | | V | |..|---|..| # SOUL PLAYER C64 25K PARAMETERS. 2 LAYERS. REAL TRANSFORMER. LOADED OFF A FLOPPY DISK. YOU> hey C64> HELLO! RE SOUNDS ME. MEFUL!
A 2-layer decoder-only transformer - the same architecture behind ChatGPT, Claude, and Gemini - implemented in hand-written 6502/6510 assembly and running on an unmodified Commodore 64. ~25,000 int8 parameters. Real multi-head causal self-attention, real softmax, real RMSNorm. About 60 seconds per token. The whole thing fits on a floppy disk with room to spare.
Architecture
2 layers, 4 attention heads × 8 dims, 32-dimensional embeddings, 64 FFN hidden units. ~25,000 parameters quantized to int8 with per-tensor shift scaling. The key breakthrough was fixing the softmax score normalization - shifting attention scores by 14 bits instead of 17 gives the 128-entry exp lookup table enough dynamic range to produce meaningful attention weights. Without this fix, the integer attention was essentially uniform across all positions, making the model blind regardless of architecture or training.
Quick start - run the pre-built soul
Grab disk/soulplayer.d64 and load it in any C64 emulator (VICE recommended):
disk/soulplayer.d64
LOAD"SOULPLAYER",8,1 RUN
LOAD"SOULPLAYER",8,1 RUN
Type a short message in lowercase, press RETURN, wait. The border flashes while it thinks. Each token gets a SID blip. A full response takes a few minutes. Type q to quit.
Soul Player C64 is an AI chatbot. Outputs are generated by a transformer language model, not a human.
A real transformer running on a 1 MHz Commodore 64.
And apparently on the Amstrad CPC, too! -> https://github.com/G1D30N/soulplayer-cpc
.-------. | O O | | V | |..|---|..| # SOUL PLAYER C64 25K PARAMETERS. 2 LAYERS. REAL TRANSFORMER. LOADED OFF A FLOPPY DISK. YOU> hey C64> HELLO! RE SOUNDS ME. MEFUL!
.-------. | O O | | V | |..|---|..| # SOUL PLAYER C64 25K PARAMETERS. 2 LAYERS. REAL TRANSFORMER. LOADED OFF A FLOPPY DISK. YOU> hey C64> HELLO! RE SOUNDS ME. MEFUL!
A 2-layer decoder-only transformer - the same architecture behind ChatGPT, Claude, and Gemini - implemented in hand-written 6502/6510 assembly and running on an unmodified Commodore 64. ~25,000 int8 parameters. Real multi-head causal self-attention, real softmax, real RMSNorm. About 60 seconds per token. The whole thing fits on a floppy disk with room to spare.
q
Tip: The model understands lowercase letters, spaces, and punctuation (. , ! ? ' : ; -). Capital letters become unknown tokens.
Tip: The model understands lowercase letters, spaces, and punctuation (. , ! ? ' : ; -). Capital letters become unknown tokens.
. , ! ? ' : ; -
Train your own soul
This is the fun part. Write a corpus, train a model, build a floppy.
Install dependencies
pip install numpy torch
Write a corpus
Create a text file with one exchange per line in inputresponse format:
inputresponse
hellohey! nice to see you! i'm sadi hear you. i care about you. tell me a jokewhy did the bit flip? it was tired!
hellohey! nice to see you! i'm sadi hear you. i care about you. tell me a jokewhy did the bit flip? it was tired!
Keep exchanges short - the model has a 20-token context window. See data/example_corpus.txt for a starter.
data/example_corpus.txt
Train
python train.py data/example_corpus.txt
This trains a BPE tokenizer (128 tokens), trains the QAT transformer, exports models/soul.bin and models/tokenizer.json. Takes a few minutes on GPU.
models/soul.bin
models/tokenizer.json
Every 500 epochs, you'll see both the float and int8 inference output side by side - what the model learned vs what the C64 will actually produce. The best checkpoint is saved based on int8 quality, not float loss. All checkpoints are saved to models/checkpoints/ for cherry-picking.
models/checkpoints/
Options:
python train.py data/my_corpus.txt --epochs 30000 --output models/ python train.py # uses built-in emotional support corpus
Training resumes automatically if checkpoints exist from a previous run.
Build the C64 binary
python build.py
This assembles all 6502/6510 routines, embeds your trained weights, and writes disk/soulplayer.prg and disk/soulplayer.d64.
disk/soulplayer.prg
disk/soulplayer.d64
Run it
x64 disk/soulplayer.d64 # VICE emulator
Or flash the .d64 to a real 1541 floppy for hardware.
All activations are Q8.8 fixed-point (int16). Weights are int8 with per-tensor power-of-2 shifts. Biases are int16 pre-scaled to the matmul accumulator. Softmax uses a 128-entry exp lookup table with >>14 score normalization. The 6502 has no multiply instruction - everything is shift-and-add.
The model uses quantization-aware training (QAT). During training, weights pass through FakeQuantI8 - fake-quantized with continuous float scaling and straight-through gradient estimation. The deliberate mismatch between training's continuous scale and export's power-of-2 shift grid acts as implicit noise, forcing the model to learn weights with wider logit margins that survive the quantization gap. Biases are fake-quantized with simple fq(). Every matmul gets a × 0.5 post-shift simulating the 6502's >> 1.
FakeQuantI8
fq()
× 0.5
> 1
Label smoothing (0.15) prevents the model from sharpening logit distributions beyond what int8 arithmetic can reliably distinguish. The training loop evaluates the actual integer forward pass (numerics.forward()) every 500 epochs and saves the best checkpoint by int8 argmax accuracy, not float loss.
numerics.forward()
The training output shows float and int8 inference side by side - what the model learned vs what the C64 will produce.
Caveats
It's not smart. 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.
It's slow contemplative. About 60 seconds per token on real hardware. A full response takes several minutes.
Capitals become . Stick to lowercase.
Small vocabulary. 128 tokens and 20-token context - keep training exchanges short.
Credits
Code, training: gizmo64k
Debugging, unit tests, rubber duck: Claude (Opus 4.6) by Anthropic
Lucky soul: The Commodore 64 by Commodore Business Machines, 1982
License
GNU General Public License v3. See LICENSE.
The future came back for the past. And now it has a soul.
2 layers, 4 attention heads × 8 dims, 32-dimensional embeddings, 64 FFN hidden units. ~25,000 parameters quantized to int8 with per-tensor shift scaling. The key breakthrough was fixing the softmax score normalization - shifting attention scores by 14 bits instead of 17 gives the 128-entry exp lookup table enough dynamic range to produce meaningful attention weights. Without this fix, the integer attention was essentially uniform across all positions, making the model blind regardless of architecture or training.
Quick start - run the pre-built soul
Grab disk/soulplayer.d64 and load it in any C64 emulator (VICE recommended):
disk/soulplayer.d64
LOAD"SOULPLAYER",8,1 RUN
LOAD"SOULPLAYER",8,1 RUN
Type a short message in lowercase, press RETURN, wait. The border flashes while it thinks. Each token gets a SID blip. A full response takes a few minutes. Type q to quit.
q
Tip: The model understands lowercase letters, spaces, and punctuation (. , ! ? ' : ; -). Capital letters become unknown tokens.
Tip: The model understands lowercase letters, spaces, and punctuation (. , ! ? ' : ; -). Capital letters become unknown tokens.
. , ! ? ' : ; -
Train your own soul
This is the fun part. Write a corpus, train a model, build a floppy.
Install dependencies
pip install numpy torch
Write a corpus
Create a text file with one exchange per line in inputresponse format:
inputresponse
hellohey! nice to see you! i'm sadi hear you. i care about you. tell me a jokewhy did the bit flip? it was tired!
hellohey! nice to see you! i'm sadi hear you. i care about you. tell me a jokewhy did the bit flip? it was tired!
Keep exchanges short - the model has a 20-token context window. See data/example_corpus.txt for a starter.
data/example_corpus.txt
Train
python train.py data/example_corpus.txt
This trains a BPE tokenizer (128 tokens), trains the QAT transformer, exports models/soul.bin and models/tokenizer.json. Takes a few minutes on GPU.
models/soul.bin
models/tokenizer.json
Every 500 epochs, you'll see both the float and int8 inference output side by side - what the model learned vs what the C64 will actually produce. The best checkpoint is saved based on int8 quality, not float loss. All checkpoints are saved to models/checkpoints/ for cherry-picking.
models/checkpoints/
Options:
python train.py data/my_corpus.txt --epochs 30000 --output models/ python train.py # uses built-in emotional support corpus
Training resumes automatically if checkpoints exist from a previous run.
Build the C64 binary
python build.py
This assembles all 6502/6510 routines, embeds your trained weights, and writes disk/soulplayer.prg and disk/soulplayer.d64.
disk/soulplayer.prg
disk/soulplayer.d64
Run it
x64 disk/soulplayer.d64 # VICE emulator
Or flash the .d64 to a real 1541 floppy for hardware.
All activations are Q8.8 fixed-point (int16). Weights are int8 with per-tensor power-of-2 shifts. Biases are int16 pre-scaled to the matmul accumulator. Softmax uses a 128-entry exp lookup table with >>14 score normalization. The 6502 has no multiply instruction - everything is shift-and-add.
The model uses quantization-aware training (QAT). During training, weights pass through FakeQuantI8 - fake-quantized with continuous float scaling and straight-through gradient estimation. The deliberate mismatch between training's continuous scale and export's power-of-2 shift grid acts as implicit noise, forcing the model to learn weights with wider logit margins that survive the quantization gap. Biases are fake-quantized with simple fq(). Every matmul gets a × 0.5 post-shift simulating the 6502's >> 1.
FakeQuantI8
fq()
× 0.5
> 1
Label smoothing (0.15) prevents the model from sharpening logit distributions beyond what int8 arithmetic can reliably distinguish. The training loop evaluates the actual integer forward pass (numerics.forward()) every 500 epochs and saves the best checkpoint by int8 argmax accuracy, not float loss.
numerics.forward()
The training output shows float and int8 inference side by side - what the model learned vs what the C64 will produce.
Caveats
It's not smart. 25K parameters is about 70 million times smaller than GPT-4. It will produce broken sentences. That's the point - the architecture works at this scale.
It's slow contemplative. About 60 seconds per token on real hardware. A full response takes several minutes.
Capitals become . Stick to lowercase.
Small vocabulary. 128 tokens and 20-token context - keep training exchanges short.
Credits
Code, training: gizmo64k
Debugging, unit tests, rubber duck: Claude (Opus 4.6) by Anthropic
Lucky soul: The Commodore 64 by Commodore Business Machines, 1982
License
GNU General Public License v3. See LICENSE.
The future came back for the past. And now it has a soul.