Scaled Dot-Product Attention

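Vaswani et al.: Attention(Q, K, V) = softmax(QK^T / √d_k) V. The attention weights are the softmax(QK^T / √d_k) term alone; the output is those weights applied to V. Dividing by √d_k keeps the dot products from growing with the key dimension d_k, which would otherwise push the softmax into regions with very small gradients. This is the single-head primitive underlying multi-head attention.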
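A minimal NumPy sketch of the formula, not a reference implementation. It assumes Q has shape (n_queries, d_k), K has shape (n_keys, d_k), and V has shape (n_keys, d_v), with no masking or dropout. The function name and the choice to return the weights alongside the output are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for single-head attention without masking."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    # Subtract the row max before exponentiating for numerical stability;
    # this does not change the softmax result.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the rows of V.
    return weights @ V, weights

# Example with assumed shapes: 3 queries, 5 keys, d_k = 4, d_v = 2.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors. Multi-head attention applies this same function to several learned projections of Q, K, and V in parallel and concatenates the results.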
