WARNING: GENERATED BY AI AND NEEDS CLEANUP -- DO NOT TRUST
^ Here is the basic math behind the white paper, which I am still learning how to read. This is just a copy-paste of what the LLM says (which might be very wrong).
Here is a table summarizing the mathematical concepts and formulas from the Attention Is All You Need paper, along with explanations of how to read each formula:
| Mathematical Concept | Formula / Notation | How to Read / Meaning |
|---|---|---|
| Scaled Dot-Product Attention | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | Compute attention scores as the softmax of scaled dot products between queries $Q$ and keys $K$, then weight the values $V$ by these scores. |
| Multi-Head Attention | $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ | Apply multiple attention heads in parallel, each with its own linear projections, then concatenate and project the results. |
| Position-wise Feed-Forward Network | $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ | Apply two linear transformations with a ReLU activation in between, independently to each position. |
| Positional Encoding | $PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$, $PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$ | Encode position information using sine and cosine functions of different frequencies for each dimension $i$. |
| Layer Normalization with Residual Connection | $\text{LayerNorm}(x + \text{Sublayer}(x))$ | Apply a sub-layer (e.g., attention or FFN), add the input via a residual connection, then apply layer normalization. |
| Learning Rate Schedule | $\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min(\text{step\_num}^{-0.5}, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5})$ | Adjust the learning rate: increase it linearly during warmup, then decay it proportionally to the inverse square root of the step number. |
| Complexity of Self-Attention | $O(n^2 \cdot d)$ | Computational complexity of a self-attention layer, where $n$ is sequence length and $d$ is model dimension. |
| Complexity of Recurrent Layer | $O(n \cdot d^2)$ | Computational complexity of a recurrent layer. |
| Complexity of Convolutional Layer | $O(k \cdot n \cdot d^2)$ | Computational complexity of a convolutional layer with kernel size $k$. |
| Label Smoothing | $\epsilon_{ls} = 0.1$ | Use a smoothing factor in the loss function to prevent overconfidence in the training labels. |
| Residual Dropout | $P_{drop} = 0.1$ | Apply dropout to sub-layer outputs and embedding sums with a rate of 0.1. |
This table captures the core mathematical components of the Transformer model as presented in the paper, along with brief explanations of how to interpret each formula or notation. The NumPy sketches below illustrate a few of these components in code.
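To make the attention rows concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. It skips masking, batching, biases, and dropout, and the function names, random weights, and tiny dimensions are my own illustration, not anything prescribed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    return softmax(scores, axis=-1) @ V    # weighted average of the values

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
    W_q, W_k, W_v are lists of per-head projection matrices; W_o mixes the heads."""
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Tiny self-attention example: 5 tokens, d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 5, 8, 2, 4
x = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)
print(out.shape)   # (5, 8)
```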
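The position-wise feed-forward network and the residual-plus-LayerNorm wrapper can be sketched the same way. This omits the learned LayerNorm gain and bias and all dropout; the helper names and reduced sizes are assumptions for illustration only.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean and unit variance
    (the learned gain/bias parameters of real LayerNorm are omitted here)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization."""
    return layer_norm(x + fn(x))

# Example with the paper's sizes scaled down: d_model = 8, d_ff = 32, 5 positions.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(5, d_model))
y = sublayer(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(y.shape)   # (5, 8)
```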
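A sketch of the sinusoidal positional encoding, assuming an even `d_model` (the function name and defaults are my own choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Returns an array of shape (max_len, d_model) that gets added to the token embeddings."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=8)
print(pe.shape)   # (50, 8)
```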
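The learning rate schedule is just a scalar function of the step number; here is a sketch using the paper's published values `d_model = 512` and `warmup_steps = 4000`:

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5).
    Linear warmup for the first warmup_steps steps, then inverse-square-root decay;
    the peak is reached exactly at step_num == warmup_steps."""
    step_num = max(step_num, 1)   # guard against step 0 (0^-0.5 is undefined)
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 10000, 100000):
    print(step, transformer_lrate(step))
```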
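For label smoothing with $\epsilon_{ls} = 0.1$, one common formulation spreads the smoothing mass uniformly over the vocabulary; implementations differ in the exact details, so treat this as a sketch rather than the paper's exact recipe.

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps_ls=0.1):
    """Replace one-hot targets with (1 - eps_ls) on the true class and
    eps_ls spread uniformly over all vocab_size classes; rows still sum to 1."""
    one_hot = np.eye(vocab_size)[targets]              # (batch, vocab_size)
    return (1.0 - eps_ls) * one_hot + eps_ls / vocab_size

q = smooth_labels(np.array([2, 0]), vocab_size=5)
print(q)
```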