Ethan Mollick@emollick

2026-05-02 22:22·61天前

AI 摘要

谷歌研究团队在论文《Attention Is All You Need》中提出全新的Transformer模型，完全摒弃了RNN和LSTM等传统循环与卷积结构，仅依赖自注意力机制并行处理整个句子。该模型在机器翻译任务上取得突破性性能：英德翻译达到28.4 BLEU分，以超过2分的优势超越先前最佳模型；英法翻译达41.8 BLEU分，且训练成本极低。仅用8块GPU在12小时内即可完成训练，其多注意力头机制能同时学习数据中的不同关系。这一成果标志着NLP领域的根本性范式转变。

（Sorry， after seeing so many of these， could not resist）：

🚨 BREAKING： Google just dropped a NEW paper that completely deletes RNNs from existence.

No recurrence. No convolutions. Nothing. Just one mechanism. And it's destroying every translation benchmark on the planet.

The title alone is a flex： "Attention Is All You Need"

Vaswani. Shazeer. Parmar. Uszkoreit. Jones. Gomez. Kaiser. Polosukhin.

8 researchers. 1 architecture. The entire field of NLP will never be the same.

Here's why this is INSANE → LSTMs took DAYS to train. This thing trains in 12 hours on 8 GPUs. 🤯 → 28.4 BLEU on English-to-German. That's not an improvement. That's a MASSACRE. They beat the previous SOTA by over 2 points. → English-to-French？ 41.8 BLEU. At a FRACTION of the training cost of every model that came before it. → They called it the "Transformer." The name alone tells you they knew.

But here's the part nobody is talking about

👇 They threw out sequential processing ENTIRELY.

Every other model on Earth processes words one at a time. This thing looks at the ENTIRE sentence simultaneously and figures out what matters.

It's called "self-attention" and it's basically the model asking itself： "which words should I care about right now？"

Every. Single. Token. In parallel.

Do you understand what this means？

Training that used to take WEEKS now takes HOURS.

Models that couldn't scale past a few layers？ This thing stacks 6 encoders and 6 decoders like it's nothing.

And the multi-head attention？ 8 attention heads running at once， each learning DIFFERENT relationships in the data.

I'm not being dramatic when I say this paper just rewrote the rulebook.

Ethan Mollick@emollick · X

29导出 Markdown