Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding."
We've repeatedly observed that diffusion gradients can negatively impact the backbone representation. This effect shows up in simpler settings too; for example, we explored this issue to some extent in REPA-E (https://end2end-diffusion.github.io/).
I believe the same principle applies to VLA. Fundamentally, the problem seems to be that diffusion gradients care too much about high-frequency details, whether in pixels or action policies, and that tends to conflict with representation learning and understanding.
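To make the "decoupled" idea concrete, here is a toy sketch of what cutting the diffusion gradient off from the backbone looks like. Everything in it is a made-up assumption for illustration (a scalar `backbone`, a scalar `head`, squared-error stand-ins for both losses); it is not the REPA-E code or any real VLA training loop. The stop-gradient is mimicked by letting the head read a frozen copy of the representation.

```python
def backbone(w, x):
    # Toy "representation": a single scalar feature (hypothetical).
    return w * x

def understanding_loss(z, y):
    # Stand-in for a representation / understanding objective.
    return (z - y) ** 2

def diffusion_loss(z, head, target):
    # Stand-in for a denoising objective on a head that reads z.
    return (head * z - target) ** 2

def total_loss(w, w_frozen, x, y, head, target, decouple):
    z = backbone(w, x)
    # With decouple=True the head reads a frozen copy of the representation,
    # which mimics a stop-gradient (detach) between head and backbone.
    z_for_head = backbone(w_frozen, x) if decouple else z
    return understanding_loss(z, y) + diffusion_loss(z_for_head, head, target)

def backbone_grad(w, x, y, head, target, decouple, eps=1e-6):
    # Central finite difference in w only; the frozen copy stays fixed at w.
    hi = total_loss(w + eps, w, x, y, head, target, decouple)
    lo = total_loss(w - eps, w, x, y, head, target, decouple)
    return (hi - lo) / (2 * eps)

x, y, head, target = 1.0, 2.0, 3.0, -5.0
w = 2.0  # backbone is already optimal for the understanding loss (z == y)

g_coupled = backbone_grad(w, x, y, head, target, decouple=False)
g_decoupled = backbone_grad(w, x, y, head, target, decouple=True)
print(g_coupled, g_decoupled)
```

In this toy setup the coupled gradient is about 66 even though the representation is already perfect for the understanding loss, so the diffusion term alone would push the backbone away; with the head decoupled, the backbone gradient is about 0.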
btw, @ylecun has always been right about this -- long before any of these empirical findings.