I'm doing Q&A videos as I roll through my course. Here's the next one, covering subtle fixes to the on-policy distillation and reward model derivations, common notation traps when doing this math, and additional resources for going deeper (e.g. @johnschulman2's KL estimation blog).
Q&A 2 is here!
00:00 Derivation fixes
06:10 Code examples & additional resources
08:08 Extra RL notation and notes
Keep sending questions on YouTube, GitHub, and Discord. Phoebe and I are loving them.