Saining Xie (@sainingxie)

2025-05-29 05:34

Indeed. For text-to-image, @xichen_pan had a great summary supporting this decoupled design philosophy: "Render unto diffusion what is generative, and unto LLMs what is understanding."

We've repeatedly observed that diffusion gradients can negatively impact the backbone representation. This effect shows up in simpler settings; for example, we explored this issue to some extent in REPA-E (https://end2end-diffusion.github.io/).

I believe the same principle applies to VLA. Fundamentally, the problem seems to be that diffusion gradients care too much about high-frequency details (whether in pixels or action policies), which tends to conflict with representation learning and understanding.
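
One common way to decouple the two is to stop the generative loss from back-propagating into the backbone (a stop-gradient, or simply a frozen backbone). Below is a minimal toy sketch of that idea, assuming scalar weights and a squared-error loss as a stand-in for a diffusion objective. Every name and number here is hypothetical and for illustration only; it is not the REPA-E method or any specific model's training code.

```python
# Toy illustration only: all names and values are hypothetical.
# A scalar "backbone" weight w produces a representation h = w * x.
# A scalar "generative head" weight v predicts v * h, which is regressed
# onto a target. The squared error stands in for a diffusion loss.

def generative_loss_grads(w, v, x, target, stop_gradient):
    """Return (dL/dw, dL/dv) for L = (v * h - target)**2 with h = w * x."""
    h = w * x
    residual = v * h - target
    grad_v = 2.0 * residual * h   # the head always receives its gradient
    grad_h = 2.0 * residual * v   # gradient arriving at the representation
    # With stop_gradient, h is treated as a constant, so nothing reaches w.
    grad_w = 0.0 if stop_gradient else grad_h * x
    return grad_w, grad_v

# Same inputs, with and without the gradient path into the backbone.
coupled = generative_loss_grads(w=0.5, v=2.0, x=1.0, target=3.0, stop_gradient=False)
decoupled = generative_loss_grads(w=0.5, v=2.0, x=1.0, target=3.0, stop_gradient=True)
```

In the decoupled case the head still trains on the generative loss, while the backbone weight is updated only by whatever understanding objective it has of its own.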

btw, @ylecun has always been right about this -- long before any of these empirical findings.

You Jiacheng: As expected, this matches findings in unified multimodal understanding and generation models by @sainingxie: a frozen VLM might help you. https://xichenpan.com/me...

Perspective from a leading figure in image generation and multimodal models
