# Accelerating Large Language Model Training with Unsloth and NVIDIA

- Source: Hacker News trending (buzzing.cc Chinese translation)
- Author: segmenta
- Published: 2026-05-08 00:38
- AIHOT score: 69
- AIHOT link: https://aihot.virxact.com/items/cmovqndvi00dcslotvzrkmisz
- Original link: https://unsloth.ai/blog/nvidia-collab

## AI Summary

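Unsloth and NVIDIA collaborated on optimizations that make large language model training about 25% faster, on top of Unsloth's existing 2-5x speedup, with no loss in accuracy. The gains come from caching packed-sequence metadata, double-buffered asynchronous gradient checkpointing, and faster MoE routing for gpt-oss. The work aims to lower the compute cost and time needed to train large models, helping developers iterate more efficiently.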

## Body

May 6, 2026

We collaborated with NVIDIA to make LLM training ~25% faster, and in this blog/guide we'll break down exactly how we did it. These optimizations cause no loss in accuracy and come on top of Unsloth's existing 2-5x speedup. The new algorithms are enabled automatically on RTX laptops, data center GPUs and DGX Spark machines, so just update Unsloth to get the latest improvements. By working with NVIDIA, we show how:

- Caching packed-sequence metadata makes training 14.3% faster.
- Double-buffered async gradient checkpointing gives an 8% speedup.
- gpt-oss training is 15% faster by using `argsort` and `bincount` during MoE routing.

### 1. Caching Packed-Sequence Metadata

Suppose we have several short examples. Instead of padding all of them to the same length and wasting compute on padding tokens, we concatenate them into one longer packed sequence.

The model still needs to know where each original sequence starts and ends. So, alongside the packed tokens, we carry sequence metadata such as:

- sequence lengths
- cumulative sequence offsets (`cu_seqlens`)
- the maximum sequence length
- attention structure derived from the three items above
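To make the metadata concrete, here is a minimal sketch in plain Python of packing a few short examples and deriving the boundary information listed above. The function name `pack_sequences` and the token ids are made up for illustration; this is not Unsloth's implementation.

```python
from itertools import accumulate


def pack_sequences(examples):
    """Concatenate token lists and return the packed tokens plus boundary metadata."""
    lengths = [len(ex) for ex in examples]
    # cu_seqlens[i] is the offset where sequence i starts;
    # the final entry is the total number of packed tokens.
    cu_seqlens = [0, *accumulate(lengths)]
    packed = [tok for ex in examples for tok in ex]
    metadata = {
        "lengths": lengths,
        "cu_seqlens": cu_seqlens,
        "max_seqlen": max(lengths),
    }
    return packed, metadata


# Three short examples with made-up token ids (illustrative only).
examples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
packed, meta = pack_sequences(examples)
print(packed)
print(meta)
```

No padding tokens are produced: the packed sequence has exactly as many tokens as the examples combined, and `cu_seqlens` is enough to recover where each example starts and ends.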

This is the key point: for a fixed packed batch, that metadata is the same for every layer.

If we write the boundary information for a packed batch as:

B = { lengths, cu_seqlens, max_seqlen, mask structure }

then every transformer layer in that forward pass consumes the same B.

If the model has L layers, rebuilding or re-synchronizing on B once per layer is not new work. It is the same information being reconstructed again and again.

In other words, the useful work is:

build B once, use it L times.

The wasteful version is:

build B + build B + ⋯ + build B (L times)

The overhead here is not primarily extra FLOPs. Some of these paths can force device-to-host synchronization, effectively creating a GPU-CPU sync point. Once that happens inside a per-layer path, the overhead recurs at every layer.

That is what the packed-sequence caching change reduces. Instead of repeatedly reconstructing packed sequence info, SDPA packed masks, and xFormers block masks, it caches the reusable metadata and the attention-side structures derived from it, per device, for the current packed batch. Those cached structures are then reused across layers.
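The "build once, use L times" idea can be sketched as a small per-device cache that is invalidated when a new packed batch arrives. Everything here is hypothetical: the class `PackedMetadataCache`, the `batch_key` argument, and the pure-Python block-diagonal causal mask are assumptions chosen to keep the example self-contained, not the structures Unsloth actually caches.

```python
class PackedMetadataCache:
    """Hypothetical cache: one derived structure per device for the current packed batch."""

    def __init__(self, build_fn):
        self._build_fn = build_fn
        self._batch_key = None
        self._per_device = {}
        self.builds = 0  # counts how often the expensive build actually runs

    def get(self, batch_key, device, cu_seqlens):
        if batch_key != self._batch_key:
            # A new packed batch invalidates everything derived from the old one.
            self._batch_key = batch_key
            self._per_device.clear()
        if device not in self._per_device:
            self._per_device[device] = self._build_fn(cu_seqlens)
            self.builds += 1
        return self._per_device[device]


def build_block_mask(cu_seqlens):
    """Assumed mask: token i may attend to token j only within its own sequence and j <= i."""
    total = cu_seqlens[-1]
    seq_id = [0] * total
    for s, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for t in range(start, end):
            seq_id[t] = s
    return [[seq_id[i] == seq_id[j] and j <= i for j in range(total)]
            for i in range(total)]


cache = PackedMetadataCache(build_block_mask)
cu_seqlens = [0, 3, 5, 9]
num_layers = 28
for _ in range(num_layers):
    # Every layer asks for the same structure; only the first call builds it.
    mask = cache.get(batch_key=0, device="cuda:0", cu_seqlens=cu_seqlens)
print(cache.builds)
```

The design choice worth noting is the invalidation rule: the cache is scoped to one packed batch, so it never serves stale boundaries to the next batch.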

#### Why this helps

Packed training already improves utilization by eliminating padding waste. But if the metadata path keeps forcing synchronization, some of that gain is lost to overhead that has nothing to do with the model's actual learning.

Caching helps because it removes repeated coordination work from the hot path. The forward pass benefits the most because that is where the same packed metadata is consumed repeatedly across many layers.

#### Benchmarks

On Qwen3-14B QLoRA SFT:

- forward: +43.3%
- backward: +5.8%
- per batch: +14.3%

The forward pass sees the biggest benefit because repeated metadata and mask preparation show up most directly there. Backward also improves, but the effect is smaller. The time saved is similar, but the backward pass, especially with gradient checkpointing, takes longer, so the relative gains appear smaller.

Now that we know the measured gain, we can ask a simpler question: does that scale make sense?

#### A quick sanity check

If we assume each layer is roughly similar, we can model the packed-attention path as:

T_uncached ≈ L · (A + s)

where:

L is the number of layers,

A is the useful attention-side work per layer,

s is the repeated metadata and mask-preparation overhead per layer.

With caching, that repeated overhead is paid once for the batch instead of once per layer:

T_cached ≈ L · A + s

So the saved time is approximately:

T_saved ≈ (L − 1) · s

For the packed SDPA path, our microbenchmark on NVIDIA Blackwell GPUs showed that the low-level, host-visible metadata calls were real but small, at about 0.2 ms each. The dominant repeated cost was the packed SDPA mask-construction path itself, which measured about 13.7 ms for a synthetic packed batch with 2048 total packed tokens.

For the SDPA backend, a better mental model is:

small stream fence + mask rebuild ≈ mask rebuild

That lets us do a cleaner consistency check. If one packed-mask rebuild costs m milliseconds, then under a uniform-layer model:

T_saved ≈ (L − 1) · m

With m ≈ 13.7 ms, that predicts:

- 16 layers: (16 − 1) · 13.7 ≈ 206 ms
- 28 layers: (28 − 1) · 13.7 ≈ 370 ms

Smaller packed-sequence runs showed the same pattern:

- Llama-3.2-1B, 16 layers: about 199 ms saved per step, which is about 11.5% lower end-to-end step time
- Qwen3-0.6B, 28 layers: about 319 ms saved per step, which is about 14.8% lower end-to-end step time

Those percentages are relative to full training step time, so they still include work outside the packed-attention path, such as embeddings, the MLP, the LM head, the loss, and framework overhead. This estimate is intentionally only about the packed-attention side of the block, not the whole transformer layer. It is there only to check that the measured gains are in the right range for the packed SDPA path.
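The consistency check can be reproduced in a few lines of Python. The layer counts, the 13.7 ms rebuild cost, and the measured savings are the figures quoted above; the function name `predicted_saving_ms` is just an illustrative helper.

```python
MASK_REBUILD_MS = 13.7  # packed SDPA mask-construction cost quoted in the text


def predicted_saving_ms(num_layers, rebuild_ms=MASK_REBUILD_MS):
    """Uniform-layer model: caching removes all but one rebuild per step."""
    return (num_layers - 1) * rebuild_ms


runs = {
    "Llama-3.2-1B": {"layers": 16, "measured_ms": 199},
    "Qwen3-0.6B": {"layers": 28, "measured_ms": 319},
}

for name, run in runs.items():
    predicted = predicted_saving_ms(run["layers"])
    ratio = run["measured_ms"] / predicted
    print(f"{name}: predicted {predicted:.1f} ms, "
          f"measured {run['measured_ms']} ms, measured/predicted {ratio:.2f}")
```

The measured savings land somewhat below the prediction in both cases, which is what one would expect from a rough model built on a synthetic microbenchmark.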

### 2. Hiding Latency With Double-Buffered Checkpoint Reloads

Activation checkpointing is a standard technique for training large models. The idea is to save memory by not keeping every intermediate activation alive through the backward pass. In exchange, we pay for some extra work during backward.

That trade-off is usually worth it, especially for larger models.

But it raises another systems question: if an activation has been offloaded, how does it get back to the GPU for backward?

In Unsloth's smart checkpointing path, activations can be staged in pinned CPU memory and copied back when needed. That saves VRAM, but it can introduce a bottleneck:

1. Copy the activation from CPU to GPU.
2. Wait for the copy to complete.
3. Run backward compute on that activation.
4. Start the next copy.

That is a serialization pattern. If one buffer is reused for both copy and compute, the copy stream and the compute stream keep taking turns.

Let T_copy be the activation reload time and T_compute be the backward compute time for the current layer.

With a single buffer, this part of the step is roughly limited by:

T_single ≈ T_copy + T_compute

That is the serialized case. We pay for both almost entirely, one after the other.

A cleaner way to handle this is to use two buffers.

While the backward pass is running on buffer A, the copy stream can preload the next activation into buffer B. Then the roles swap. That creates pipeline overlap, though not perfect overlap.

Double buffering does not reduce the amount of math. It hides copy latency behind useful compute.
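The difference between one and two buffers can be illustrated with a tiny timeline simulation. This is a sketch under simplifying assumptions (one copy stream, one compute stream, fixed per-layer times in milliseconds, made-up numbers); the function `simulate_reload_pipeline` is hypothetical and does not touch a GPU.

```python
def simulate_reload_pipeline(copy_ms, compute_ms, num_buffers):
    """Return the total time to reload and process every layer with a fixed buffer pool."""
    copy_free = 0.0    # time at which the copy stream is next idle
    compute_done = []  # finish time of each layer's backward compute
    for i, (c, g) in enumerate(zip(copy_ms, compute_ms)):
        # The buffer for layer i was last used by layer i - num_buffers,
        # so the copy cannot start until that layer's compute has finished.
        buffer_free = compute_done[i - num_buffers] if i >= num_buffers else 0.0
        copy_end = max(copy_free, buffer_free) + c
        copy_free = copy_end
        # Compute needs its own activation and the previous compute to be done.
        prev_compute = compute_done[-1] if compute_done else 0.0
        compute_done.append(max(copy_end, prev_compute) + g)
    return compute_done[-1] if compute_done else 0.0


layers = 32
copy_ms = [5.0] * layers     # illustrative reload time per layer
compute_ms = [8.0] * layers  # illustrative backward compute time per layer

single = simulate_reload_pipeline(copy_ms, compute_ms, num_buffers=1)
double = simulate_reload_pipeline(copy_ms, compute_ms, num_buffers=2)
print(f"single buffer: {single:.0f} ms, double buffer: {double:.0f} ms")
```

With one buffer every copy waits for the previous compute to release the buffer, so the two streams alternate. With two buffers only the first copy and the last compute stay exposed.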

#### Why this helps

This kind of optimization tends to get stronger once the model is large enough that backward compute is substantial, but not so dominant that all copy overhead disappears into noise. For larger models, higher hidden dimensions mean more data movement, so hiding that movement has a larger impact. Larger models also tend to have more layers, which creates more opportunities to hide copies behind computation.

That is why larger dense models are a good fit for this improvement. The GPU has enough real work going on that the copy can overlap with it, and the extra VRAM needed for the second buffer stays modest.

The implementation also keeps practical guardrails in place:

- use extra buffers only when enough VRAM is available
- fall back cleanly when the memory budget is tight
- keep correctness unchanged

#### Benchmarks

On the larger dense-model runs, benchmarked with NVIDIA B200 Blackwell GPUs:

- 8B: 0.3739 -> 0.4053 steps/s, +8.40%
- 14B: 0.2245 -> 0.2395 steps/s, +6.70%
- 32B: 0.1979 -> 0.2070 steps/s, +4.61%

Memory overhead stayed modest:

- +0.37 GB at 8B
- +0.47 GB at 14B
- +0.23 GB at 32B

In these runs, final losses were effectively unchanged.

The speedup is consistent across larger dense models, and the extra VRAM cost stays relatively small.

Once we know the measured gain, the natural follow-up is: does the scale make sense?

#### A quick sanity check

If we assume there are L checkpointed layers and each layer is roughly similar:

- each reload takes time c
- each backward compute chunk takes time g

This also scales with batch size, sequence length, and other factors that affect data movement and computation. We omit those terms for brevity.

With one buffer:

T_single ≈ L · (c + g)

With two buffers, the first layer still has to wait for its activation to arrive, and the last layer still has to finish computing. So a better approximation is:

T_double ≈ c + (L − 1) · max(c, g) + g

So the saved time is approximately:

T_saved ≈ (L − 1) · min(c, g)
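For readers who want the intermediate step, the saved-time expression follows from subtracting the two approximations, using the identity c + g − max(c, g) = min(c, g). The derivation below only restates the formulas already given, in LaTeX notation:

```latex
\begin{aligned}
T_{\text{saved}} &\approx T_{\text{single}} - T_{\text{double}} \\
&= L\,(c + g) - \bigl[\,c + (L - 1)\max(c, g) + g\,\bigr] \\
&= (L - 1)\,(c + g) - (L - 1)\max(c, g) \\
&= (L - 1)\,\min(c, g)
\end{aligned}
```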

This is the useful reading of the result:

- the first copy is still exposed
- the last compute is still exposed
- but for the middle of the pipeline, copy and compute can overlap

If the overlap is good, the per-layer cost in the middle gets much closer to:

T_middle ≈ max(T_copy, T_compute)

From the measured larger-model results, the saved time per training step is roughly:

- 8B: about 207 ms
- 14B: about 279 ms
- 32B: about 222 ms
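These per-step savings can be recomputed from the throughput figures in the benchmark section, since step time is the reciprocal of steps per second. A small check, with `saved_ms_per_step` as an illustrative helper name:

```python
# (baseline steps/s, double-buffered steps/s) from the benchmark section
throughput_steps_per_s = {
    "8B": (0.3739, 0.4053),
    "14B": (0.2245, 0.2395),
    "32B": (0.1979, 0.2070),
}


def saved_ms_per_step(before, after):
    """Difference in step time (1 / throughput), converted to milliseconds."""
    return (1.0 / before - 1.0 / after) * 1000.0


saved = {name: saved_ms_per_step(b, a)
         for name, (b, a) in throughput_steps_per_s.items()}
for name, ms in saved.items():
    print(f"{name}: about {ms:.0f} ms saved per step")
```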

These host buffers are pinned allocations, so the relevant bandwidth is measured pinned-memory host-to-device bandwidth, not pageable-memory bandwidth. On our NVIDIA B200 Blackwell-based system, that bandwidth was about 55.7 GB/s, with 64 GB/s as a useful PCIe ceiling for comparison.

If we use the extra buffer size as a rough proxy for one activation reload, then each reload is naturally on the order of only a few milliseconds:

- 8B, 0.37 GB: about 6.6 ms at 55.7 GB/s, or 5.8 ms at the 64 GB/s ceiling
- 14B, 0.47 GB: about 8.4 ms at 55.7 GB/s, or 7.3 ms at the 64 GB/s ceiling
- 32B, 0.23 GB: about 4.1 ms at 55.7 GB/s, or 3.6 ms at the 64 GB/s ceiling

To explain the observed saved time per step, we would need to hide roughly a few dozen such reloads:

- 8B: about 31 reloads at 55.7 GB/s, or 36 at 64 GB/s
- 14B: about 33 reloads at 55.7 GB/s, or 38 at 64 GB/s
- 32B: about 54 reloads at 55.7 GB/s, or 62 at 64 GB/s
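The reload-time and reload-count estimates are simple divisions, so they are easy to reproduce. The sketch below uses the buffer sizes, bandwidths, and saved times quoted in the text; the helper names `reload_ms` and `reloads_needed` are illustrative, and treating the extra buffer size as one reload is the same rough proxy the text uses.

```python
H2D_MEASURED_GBPS = 55.7  # measured pinned host-to-device bandwidth from the text
PCIE_CEILING_GBPS = 64.0  # PCIe ceiling used for comparison in the text


def reload_ms(buffer_gb, bandwidth_gbps):
    """Time to copy one buffer at the given bandwidth, in milliseconds."""
    return buffer_gb / bandwidth_gbps * 1000.0


def reloads_needed(saved_ms, buffer_gb, bandwidth_gbps):
    """How many hidden reloads would account for the saved step time."""
    return saved_ms / reload_ms(buffer_gb, bandwidth_gbps)


# model -> (extra buffer size in GB, saved time per step in ms), both from the text
runs = {"8B": (0.37, 207), "14B": (0.47, 279), "32B": (0.23, 222)}

for name, (buffer_gb, saved_ms) in runs.items():
    for bw in (H2D_MEASURED_GBPS, PCIE_CEILING_GBPS):
        print(f"{name} @ {bw} GB/s: reload {reload_ms(buffer_gb, bw):.1f} ms, "
              f"about {reloads_needed(saved_ms, buffer_gb, bw):.0f} reloads")
```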