WARNING: GENERATED BY AI AND NEEDS CLEANUP -- DO NOT TRUST
^ Here is the basic math behind the white paper, which I am still learning how to read. This is just a copy-paste of what the LLM says (which might be very wrong).
Here is a table summarizing the mathematical concepts and formulas from the Attention Is All You Need paper, along with explanations of how to read each formula:
| Mathematical Concept | Formula / Notation | How to Read / Meaning |
|---|---|---|
| Scaled Dot-Product Attention | $\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$ | Compute attention scores as the softmax of scaled dot products between queries $Q$ and keys $K$, then weight the values $V$ by these scores. |
| Multi-Head Attention | $\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)W^O$, where $\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$ | Apply multiple attention heads in parallel, each with its own linear projections, then concatenate and project the results. |
| Position-wise Feed-Forward Network | $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ | Apply two linear transformations with a ReLU activation in between, independently to each position. |
| Positional Encoding | $PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$, $PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$ | Encode position information using sine and cosine functions of different frequencies for each dimension $i$. |
| Layer Normalization with Residual Connection | $\text{LayerNorm}(x + \text{Sublayer}(x))$ | Apply a sub-layer (e.g., attention or FFN), add the input via a residual connection, then apply layer normalization. |
| Learning Rate Schedule | $\text{lrate} = d_{\text{model}}^{-0.5} \cdot \min(\text{step\_num}^{-0.5}, \text{step\_num} \cdot \text{warmup\_steps}^{-1.5})$ | Adjust the learning rate: increase it linearly during warmup, then decay it proportionally to the inverse square root of the step number. |
| Complexity of Self-Attention | $O(n^2 \cdot d)$ | Computational complexity of a self-attention layer, where $n$ is sequence length and $d$ is model dimension. |
| Complexity of Recurrent Layer | $O(n \cdot d^2)$ | Computational complexity of a recurrent layer. |
| Complexity of Convolutional Layer | $O(k \cdot n \cdot d^2)$ | Computational complexity of a convolutional layer with kernel size $k$. |
| Label Smoothing | $\epsilon_{ls} = 0.1$ | Use a smoothing factor in the loss function to prevent overconfidence in the training labels. |
| Residual Dropout | $P_{drop} = 0.1$ | Apply dropout to sub-layer outputs and embedding sums with a rate of 0.1. |
This table captures the core mathematical components of the Transformer model as presented in the paper, along with brief explanations of how to interpret each formula or notation. The NumPy sketches below illustrate a few of these components in code.
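To make the attention rows concrete, here is a minimal NumPy sketch of scaled dot-product attention and multi-head attention. It skips masking, batching, biases, and dropout, and the function names, random weights, and tiny dimensions are my own illustration, not anything prescribed by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
    Shapes: Q (n_q, d_k), K (n_k, d_k), V (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # similarity of every query to every key
    return softmax(scores, axis=-1) @ V    # weighted average of the values

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).
    W_q, W_k, W_v are lists of per-head projection matrices; W_o mixes the heads."""
    heads = [scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Tiny self-attention example: 5 tokens, d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 5, 8, 2, 4
x = rng.normal(size=(n, d_model))
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)
print(out.shape)   # (5, 8)
```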
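The position-wise feed-forward network and the residual-plus-LayerNorm wrapper can be sketched the same way. This omits the learned LayerNorm gain and bias and all dropout; the helper names and reduced sizes are assumptions for illustration only.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean and unit variance
    (the learned gain/bias parameters of real LayerNorm are omitted here)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer(x, fn):
    """LayerNorm(x + Sublayer(x)): residual connection followed by layer normalization."""
    return layer_norm(x + fn(x))

# Example with the paper's sizes scaled down: d_model = 8, d_ff = 32, 5 positions.
rng = np.random.default_rng(0)
d_model, d_ff = 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(5, d_model))
y = sublayer(x, lambda t: feed_forward(t, W1, b1, W2, b2))
print(y.shape)   # (5, 8)
```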
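A sketch of the sinusoidal positional encoding, assuming an even `d_model` (the function name and defaults are my own choices):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Returns an array of shape (max_len, d_model) that gets added to the token embeddings."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=8)
print(pe.shape)   # (50, 8)
```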
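The learning rate schedule is just a scalar function of the step number; here is a sketch using the paper's published values `d_model = 512` and `warmup_steps = 4000`:

```python
def transformer_lrate(step_num, d_model=512, warmup_steps=4000):
    """lrate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5).
    Linear warmup for the first warmup_steps steps, then inverse-square-root decay;
    the peak is reached exactly at step_num == warmup_steps."""
    step_num = max(step_num, 1)   # guard against step 0 (0^-0.5 is undefined)
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 10000, 100000):
    print(step, transformer_lrate(step))
```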
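For label smoothing with $\epsilon_{ls} = 0.1$, one common formulation spreads the smoothing mass uniformly over the vocabulary; implementations differ in the exact details, so treat this as a sketch rather than the paper's exact recipe.

```python
import numpy as np

def smooth_labels(targets, vocab_size, eps_ls=0.1):
    """Replace one-hot targets with (1 - eps_ls) on the true class and
    eps_ls spread uniformly over all vocab_size classes; rows still sum to 1."""
    one_hot = np.eye(vocab_size)[targets]              # (batch, vocab_size)
    return (1.0 - eps_ls) * one_hot + eps_ls / vocab_size

q = smooth_labels(np.array([2, 0]), vocab_size=5)
print(q)
```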