Scaled Dot-Product Attention

Vaswani et al.: Attention(Q, K, V) = softmax(QK^T / √d_k) V. The softmax term alone is the matrix of attention weights; multiplying by V gives the output. This is the single-head primitive underlying multi-head attention.
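
A minimal NumPy sketch of the formula above, assuming row-wise queries/keys/values; the function name, the boolean-mask convention, and the shapes are illustrative, not from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    # similarity scores, scaled to keep softmax gradients well-behaved
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        # assumed convention: True = attend, False = block
        scores = np.where(mask, scores, -1e9)
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

# usage: 4 queries attending over 6 key/value pairs, d_k = 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

The 1/√d_k scaling keeps the dot products from growing with dimension, which would otherwise push the softmax into regions of near-zero gradient.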
