Nathan Lambert (@natolambert)

2026-07-02 04:24


I'm doing Q&A videos as I roll through my course. Here's the next one, covering subtle fixes to the on-policy distillation and reward model derivations, common notation traps when doing this math, and more added resources to go deeper (e.g. @johnschulman2's KL estimation blog).

Q&A 2 is here!

00:00 Derivation fixes
06:10 Code examples & additional resources
08:08 Extra RL notation and notes

Keep sending questions on YouTube, GitHub, and Discord. Phoebe and I are loving them.
