>The whole "mystery" of transformer is that instead of a linear sequence of static weights times values in each layer, you now have 3 different matrices that are obtained from the same input through multiplication of learned weights, and then you just multiply the matrices together. I.e more parallelism which works out nice, but very restrictive since the attention formula is static.
That's not it at all. What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers), is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it in time sequences). RNNs and "linear transformers" on the other hand compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN internal state was really really big so it didn't need to discard any information).
>What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them.
They do that in theory. In practice, its just all matrix multiplication. You could easily structure a transformer as a bunch of fully connected deep layers and it would be mathematically equivalent, just computationally inefficient.
That's not it at all. What's special about transformers is they allow each element in a sequence to decide which parts of data are most important to it from each other element in the sequence, then extract those out and compute on them. The big theoretical advantage over RNNs (which were used for sequences prior to transformers), is that transformers support this in a lossless way, as each element has full access to all the information in every other element in the sequence (or at least all the ones that occurred before it in time sequences). RNNs and "linear transformers" on the other hand compress past values, so generally the last element of a long sequence will not have access to all the information in the first element of the sequence (unless the RNN internal state was really really big so it didn't need to discard any information).