>The whole "mystery" of transformer is that instead of a linear sequence of stat...

ActorNightly · on Jan 4, 2024

>What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them.

They do that in theory. In practice, it's all just matrix multiplication. You could easily structure a transformer as a bunch of fully connected deep layers and it would be mathematically equivalent, just computationally inefficient.
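
To make the "matrix multiplication" point concrete, here is a rough sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration, not anyone's production code: the helper names (`matmul`, `softmax_rows`, `self_attention`, `random_matrix`), the sizes, and the random weights are all assumptions made up for the example, and masking, multiple heads, and the output projection are left out.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no masking."""
    # Three matrix products project the input into queries, keys, values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    # One more matrix product scores every token against every other token.
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # Row i of weights: how much token i draws from each token in the sequence.
    weights = softmax_rows(scores)
    # A final matrix product mixes the values using those weights.
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical helper: a rows x cols matrix of uniform values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, d_model = 4, 8  # illustrative sizes only
x = random_matrix(seq_len, d_model, rng)
w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Everything here except the softmax is a matrix product. The mixing matrix `weights` is computed from `x` itself rather than stored as a learned parameter, which is the "each element decides what is important" part of the quote above.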