# Training Large Language Models (LLMs) in Swift, Part 1: Taking Matrix Multiplication from Gflop/s to Tflop/s

- Source: Hacker News trending (buzzing.cc Chinese translation)
- Author: zdw
- Published: 2026-05-12 00:55
- AIHOT score: 47
- AIHOT link: https://aihot.virxact.com/items/cmp1h0yo60y2esllhx0fjh0pt
- Original link: https://www.cocoawithlove.com/blog/matrix-multiplications-swift.html

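## AI Summary

The article explores how, when training a large language model in Swift, matrix multiplication performance can be raised from billions of floating-point operations per second (Gflop/s) to trillions (Tflop/s). It is the first part of a series and focuses on optimization techniques that deliver an order-of-magnitude jump in compute performance, aiming to show Swift’s potential for efficiently running core machine learning operations.

## Article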

In this article, I try to get my own handwritten matrix multiplication code running as fast as possible for training a Large Language Model (LLM) in Swift. The aim is to give some insight into the key steps for optimizing mathematics code in Swift. I also hope that these examples will offer a sense of scale about the capabilities of the different units on Apple Silicon – CPU, SIMD, AMX and GPU.

This will be the first in a series where I look at training neural networks in Swift on Apple Silicon. Future articles will look at the maybe-too-many frameworks Apple offer for machine learning on the Mac. Those established frameworks are what you should really use for matrix multiplication and machine learning (they’ve spent a few more years optimizing matrix kernels than I have).

But until then, I’m having fun writing everything for myself in a “no frameworks, no libraries” plain code approach.

And I’m not just writing matrix multiplication kernels. The sample app will use these kernels as part of a full LLM implementation and the numbers I’ll quote will be for entire forward and backward training iterations. The reference implementation for this series will be Andrej Karpathy’s llm.c (a plain C implementation of a GPT2-compatible model). It’s a fairly basic model but it does contain all the necessary components and is representative of real-world workloads.

That means it’s time for my favorite game: optimize Swift until it’s faster than C.

### Backstory

About two years ago, I dug up my engineering thesis from the early 2000s. It’s an image recognizer written in C++ that uses a neural network for classifying images. I wanted to get my old code running again but I hadn’t worked on ML code in a long time. It got annoying and I gave up.

For all the discussion around LLMs in early 2024, it felt like no one was training neural networks on the Mac. At least, not in languages like Swift. I played with some Python libraries like PyTorch and TensorFlow but Python never does the calculations itself – it operates more like an orchestrator of another computational engine under the hood – and the separation left me feeling like I wasn’t in control.

A month later, Andrej Karpathy released llm.c. This reached me in a way that other machine learning content didn’t because nothing is hidden. It is around 1000 lines of plain C and (although it’s filled with some pretty cryptic variable names) it’s relatively readable.

So naturally, I immediately rewrote it in Swift. And it was a lot of fun to play with.

Of course, playing with the code required some work to make it run fast. Some foreshadowing, here: the initial Swift implementation was really super slow. But optimization is a constant process: there’s always something more you can try.

Which finally brings me to this article: I’m going to walk through the different explorations I wrote then (and a couple I’ve added in the last week) to make an LLM train fairly quickly without resorting to using a library. Most of the code will be in Swift (although I’ll show a Metal implementation at the end).

By the way, I will not be explaining how a neural network or an LLM works. If you’re interested, Karpathy’s video “Let’s build GPT: from scratch, in code, spelled out.” is practically the definitive guide to how GPT-like LLMs work, and his earlier 5-video series, starting with “The spelled-out intro to language modeling: building makemore”, covers the introductory concepts if you want a gentler start. Of course, both are in Python, so please come back here when you’re ready to see how we can do things in Swift.

### llm.c

Machine learning is essentially the application of model weights to input data (called the forward pass, a.k.a. inference), then the calculation of error gradients and an update to those weights (the backward pass).

We typically package these calculations together and try to make them run as fast as possible. These packages of operations might be called: “linear tensor projection”, “matrix multiplication”, or even a series of “vector dot products” (depending on how big or small you slice the units of work). It’s ultimately a loop that performs z += x * y a lot of times.
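As a rough illustration of that loop, and of where the “flop/s” figures in the title come from, here is a minimal Swift sketch. The names `dot` and `matmulFlops` are hypothetical helpers for this example, not part of llm.c or the author’s code, and the count assumes one multiply plus one add per `z += x * y` step.

```swift
/// Dot product: the `z += x * y` accumulation at the heart of every matmul.
func dot(_ x: [Float], _ y: [Float]) -> Float {
    precondition(x.count == y.count, "vectors must have equal length")
    var z: Float = 0
    for i in 0..<x.count {
        z += x[i] * y[i]   // one multiply + one add = 2 floating-point operations
    }
    return z
}

/// Floating-point operations for multiplying a (B*T x C) input by an (OC x C)
/// weight matrix, counting 2 flops per multiply-accumulate step (bias ignored).
func matmulFlops(B: Int, T: Int, C: Int, OC: Int) -> Int {
    return 2 * B * T * C * OC
}
```

Dividing the result of `matmulFlops` by the measured wall-clock seconds gives flop/s; 10^9 of those is one Gflop/s and 10^12 is one Tflop/s.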

Since these matrix multiplications represent so much of the work in machine learning, I’m going to focus on the code that does this. I will be updating the rest of the implementation as I go, but only using the same improvements I’m showing to matrix multiplication.

Let’s start by looking at the matmul_forward from llm.c which is the core matrix multiplication used on the forward pass. It iterates over the input (inp), multiplies by model weights (weight), and adds the result to the running total (val).
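The description above can be sketched in Swift as four nested loops. This is an illustrative port under assumptions, not the author’s “Basic Swift” code: the function name `matmulForwardNaive`, the use of plain `[Float]` arrays, and the optional `bias` are choices made for this example, with flat row-major buffers following the llm.c layout.

```swift
/// Naive forward matmul: out[b,t,o] = bias[o] + sum over i of inp[b,t,i] * weight[o,i].
/// All buffers are flat, row-major arrays.
func matmulForwardNaive(
    out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?,
    B: Int, T: Int, C: Int, OC: Int
) {
    precondition(out.count == B * T * OC && inp.count == B * T * C)
    precondition(weight.count == OC * C && (bias == nil || bias!.count == OC))
    for b in 0..<B {
        for t in 0..<T {
            let inpBase = (b * T + t) * C     // start of this token's input row
            let outBase = (b * T + t) * OC    // start of this token's output row
            for o in 0..<OC {
                var val = bias?[o] ?? 0       // running total starts at the bias
                let wBase = o * C             // start of weight row `o`
                for i in 0..<C {
                    val += inp[inpBase + i] * weight[wBase + i]
                }
                out[outBase + o] = val
            }
        }
    }
}
```

Bounds-checked array subscripts keep the sketch safe and readable, at the cost of speed.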

```c
void matmul_forward(float* out, const float* inp, const float* weight, const float* bias,
                    int B, int T, int C, int OC) {
    for (int b = 0; b /* … */
```

[…] in the loop was simply too high a cost. In my 2024 implementation, all I could do was manually unroll the loop 8 times (something that’s fairly hideous to read).
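To give a feel for what manual unrolling looks like, here is a hedged sketch that unrolls 4 times rather than 8 to stay short. It assumes the unrolling runs over the combined batch/time dimension `BT` (suggested by the `stride(from: 0, to: BT, by: LOOP_UNROLL)` loop in the later listing), so each weight value is loaded once and reused for several token rows. The function name `matmulForwardUnrolled4` and the `[Float]` buffers are hypothetical, not the author’s 2024 code.

```swift
/// Forward matmul with the token loop manually unrolled 4 times.
/// `BT` is B*T; this sketch assumes BT is a multiple of 4.
func matmulForwardUnrolled4(
    out: inout [Float], inp: [Float], weight: [Float], bias: [Float]?,
    BT: Int, C: Int, OC: Int
) {
    precondition(BT % 4 == 0, "sketch handles only BT divisible by 4")
    for obt in stride(from: 0, to: BT, by: 4) {
        for o in 0..<OC {
            // Four independent accumulators: one per token row in this group.
            let b0 = bias?[o] ?? 0
            var r0 = b0, r1 = b0, r2 = b0, r3 = b0
            let wBase = o * C
            for i in 0..<C {
                let w = weight[wBase + i]          // loaded once, used four times
                r0 += inp[(obt + 0) * C + i] * w
                r1 += inp[(obt + 1) * C + i] * w
                r2 += inp[(obt + 2) * C + i] * w
                r3 += inp[(obt + 3) * C + i] * w
            }
            out[(obt + 0) * OC + o] = r0
            out[(obt + 1) * OC + o] = r1
            out[(obt + 2) * OC + o] = r2
            out[(obt + 3) * OC + o] = r3
        }
    }
}
```

Separate scalar accumulators avoid allocating an `Array` inside the loop, but every extra unroll step means another copy of each line, which is why the 8-way version gets hard to read.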

However, Swift 6.2 has given us another useful feature here: InlineArray which finally matches C stack allocated arrays.

```swift
for obt in stride(from: 0, to: BT, by: LOOP_UNROLL) {
    for o in 0.. /* … */ (repeating: bias?[o] ?? 0)
        let bt = inp.span.extracting(droppingFirst: obt * C)
        let w = weight.span.extracting(droppingFirst: o * C)
        for i in 0.. /* … */
```

[…]

```swift
for obt in stride(from: 0, to: BT, by: LOOP_UNROLL) {
    for o in 0.. /* … */ (repeating: bias?[o] ?? 0)
        let bt = inp.extracting(droppingFirst: obt * C)
        let w = weight.extracting(droppingFirst: o * C)
        for i in 0.. /* … */
```

[…]

```swift
private static func amxF32_16x64(
    outTiles: UnsafeMutablePointer /* … */,
    lhsPanel: UnsafePointer /* … */,
    rhsPanels: UnsafePointer /* … */,
    innerCount: Int
) {
    zeroTileRow.withUnsafeBufferPointer { zeroBuffer in
        guard let zeroBase = zeroBuffer.baseAddress else { return }
        for tile in 0.. /* … */ (tileBase + (row * tileRows))
        amx_stz(rowBase.amxZOperand(row: UInt32(tile + (row * accumulatorCount))))
        } } } }
```

Once again, it’s not that different in shape to the LOOP_UNROLL implementations that I called “Fast Swift”.

| Model | Tokens/s | Training iterations/s | Training versus llm.c |
| --- | --- | --- | --- |
| llm.c | 0.926 | 0.175 | 100% |
| Multithreaded Swift | 4.356 | 1.014 | 558.5% |
| AMX | 5.884 | 1.678 | 958.8% |

Another 1.67 times faster on training. Tiling requires a lot of packing and scattering of data to get our row-major matrices into the required tile shape and I’m pretty sure I could be doing a better job, there. If I was more efficient with this, it would be easily over 2 times faster.

### Maximum power draw

The Metal code in this implementation is derived from an implementation by James Thompson in llm.metal. However, they used a library for matrix multiplication, so I’ve written my own Metal matrix multiplication code to keep the framework-free approach going.

In the previous section, I was careful to say “the fastest CPU instruction on Apple Silicon for matrix multiplication” because, of course, we also have the GPU.

What does Metal code look like for matrix multiplication? Unlike the C and Swift code, it’s in two parts: the inner kernel (which we write in Metal/C++) and the outer invocation machinery (which stays on the Swift side).

First, the inner kernel:

```metal
kernel void matmul_forward_kernel(
    device float* out [[buffer(0)]],
    const device float* inp [[buffer(1)]],
    const device float* weight [[buffer(2)]],
    const device float* bias [[buffer(3)]],
    constant uint& BT [[buffer(4)]],
    constant uint& C [[buffer(5)]],
    constant uint& OC [[buffer(6)]],
    uint2 gid [[thread_position_in_grid]]
) {
    // One thread per output element: x selects the output channel, y the token row.
    uint oc = gid.x;
    uint bt = gid.y;
    if (bt >= BT || oc >= OC) { return; }
    float sum = bias[oc];
    for (uint i = 0; i < C; i++) {
        sum += inp[bt * C + i] * weight[oc * C + i];
    }
    out[bt * OC + oc] = sum;
}
```

Skipping over the somewhat heavy parameter block, you can see that all this really contains is the innermost loop of the four loops in the C and Basic Swift matmul_forward. Instead of loops over B, T, OC and C, we just have the loop over C here.
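To make the division of labor concrete, here is a CPU-only Swift sketch that mimics how a grid of threads replaces the outer loops. It is an emulation for illustration under stated assumptions, not Metal dispatch code: `matmulThread` and `dispatchOnCPU` are hypothetical names, the grid is walked sequentially rather than in parallel, and the combined `BT` dimension stands in for the separate `B` and `T` loops.

```swift
/// Body of one GPU "thread": computes a single output element, mirroring
/// the Metal kernel's bounds check and loop over C.
func matmulThread(
    gidX: Int, gidY: Int,
    out: inout [Float], inp: [Float], weight: [Float], bias: [Float],
    BT: Int, C: Int, OC: Int
) {
    let oc = gidX, bt = gidY
    if bt >= BT || oc >= OC { return }   // threads outside the matrix do nothing
    var sum = bias[oc]
    for i in 0..<C {
        sum += inp[bt * C + i] * weight[oc * C + i]
    }
    out[bt * OC + oc] = sum
}

/// CPU stand-in for the GPU dispatch: the grid supplies the loops over
/// output channels (x) and batch*time positions (y) that the kernel omits.
func dispatchOnCPU(
    gridWidth: Int, gridHeight: Int,
    out: inout [Float], inp: [Float], weight: [Float], bias: [Float],
    BT: Int, C: Int, OC: Int
) {
    for y in 0..<gridHeight {
        for x in 0..<gridWidth {
            matmulThread(gidX: x, gidY: y, out: &out, inp: inp, weight: weight,
                         bias: bias, BT: BT, C: C, OC: OC)
        }
    }
}
```

The bounds check matters because a grid may be rounded up past the matrix dimensions; any extra threads simply return without writing.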

