Multi-Head Attention

Multi-head attention runs several attention operations in parallel, each on its own learned projection of the queries, keys, and values, and then concatenates the per-head outputs. This lets a Transformer layer attend to multiple relational patterns simultaneously, since each head can specialize in a different kind of dependency.
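
A minimal sketch of the idea in PyTorch, assuming `d_model` is divisible by `num_heads`; the class and variable names here are illustrative, not a reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One learned projection per role (Q, K, V); each output is later
        # split into num_heads chunks of size d_head.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape

        def split(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split(self.q_proj(x))
        k = split(self.k_proj(x))
        v = split(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        weights = F.softmax(scores, dim=-1)
        heads = weights @ v  # (batch, heads, seq, d_head)

        # Concatenate the heads back into d_model and mix with a final projection.
        concat = heads.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(concat)

# Example usage: 8 heads over a 512-dimensional model, batch of 2, sequence of 10.
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(torch.randn(2, 10, 512))  # out has shape (2, 10, 512)
```

The key point is that the parallel heads come almost for free: the projections produce one `d_model`-wide tensor that is reshaped into `num_heads` slices, so the cost is comparable to a single full-width attention while the model gains several independent attention patterns.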
