Running Example (used throughout):
Translation from English to Arabic:
Input: "I love learning"
Output: "أنا أحب التعلم"
First step: each word is converted to a vector (list of numbers). We'll assume embedding dimension = 4 (in practice it's 512).
"I" → [0.2, 0.5, 0.1, 0.8]
"love" → [0.9, 0.3, 0.7, 0.2]
"learning" → [0.4, 0.6, 0.5, 0.9]
These numbers are learned during training; words with similar meanings end up with similar vectors.
Transformers process all words in parallel (not one by one), so they don't know that "I" is the 1st word and "learning" is the 3rd.
Add a unique vector to each position using sin and cos functions:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where: pos = word position (0, 1, 2, ...) | i = dimension index | d = embedding size (= 4)
This gives every position a unique fingerprint.
Position 0 ("I"):
PE(0,0) = sin(0 / 10000^(0/4)) = sin(0) = 0.00
PE(0,1) = cos(0 / 10000^(0/4)) = cos(0) = 1.00
PE(0,2) = sin(0 / 10000^(2/4)) = sin(0) = 0.00
PE(0,3) = cos(0 / 10000^(2/4)) = cos(0) = 1.00
→ PE₀ = [0.00, 1.00, 0.00, 1.00]
Position 1 ("love"):
PE(1,0) = sin(1) = 0.84
PE(1,1) = cos(1) = 0.54
PE(1,2) = sin(0.01) = 0.01
PE(1,3) = cos(0.01) = 1.00
→ PE₁ = [0.84, 0.54, 0.01, 1.00]
Position 2 ("learning"):
PE(2,0) = sin(2) = 0.91
PE(2,1) = cos(2) = -0.42
PE(2,2) = sin(0.02) = 0.02
PE(2,3) = cos(0.02) = 1.00
→ PE₂ = [0.91, -0.42, 0.02, 1.00]
"I" = [0.2, 0.5, 0.1, 0.8] + [0.00, 1.00, 0.00, 1.00] = [0.20, 1.50, 0.10, 1.80]
"love" = [0.9, 0.3, 0.7, 0.2] + [0.84, 0.54, 0.01, 1.00] = [1.74, 0.84, 0.71, 1.20]
"learning" = [0.4, 0.6, 0.5, 0.9] + [0.91, -0.42, 0.02, 1.00] = [1.31, 0.18, 0.52, 1.90]
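The positional-encoding arithmetic above can be checked with a few lines of plain Python (a minimal sketch; `positional_encoding` is a helper name chosen here, not a standard API):

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal positional encoding for one position with embedding size d."""
    pe = []
    for i in range(0, d, 2):
        angle = pos / (10000 ** (i / d))
        pe.append(math.sin(angle))  # even dimension: sin
        pe.append(math.cos(angle))  # odd dimension: cos
    return pe

embeddings = {
    "I":        [0.2, 0.5, 0.1, 0.8],
    "love":     [0.9, 0.3, 0.7, 0.2],
    "learning": [0.4, 0.6, 0.5, 0.9],
}

# Add each position's encoding to the word's embedding vector
for pos, (word, emb) in enumerate(embeddings.items()):
    combined = [round(e + p, 2) for e, p in zip(emb, positional_encoding(pos, 4))]
    print(word, combined)
```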
Each word asks: "Which other words are important for understanding my meaning?"
For each word, multiply by three learned weight matrices:
Q = X × Wq (Query → "What am I looking for?")
K = X × Wk (Key → "What do I contain?")
V = X × Wv (Value → "What information do I give?")
After multiplication (simplified to size 3):

| | Q | K | V |
|---|---|---|---|
| "I" | [1, 0, 1] | [0, 1, 1] | [1, 0, 0] |
| "love" | [0, 1, 0] | [1, 1, 0] | [0, 1, 0] |
| "learn" | [1, 1, 0] | [0, 0, 1] | [0, 0, 1] |
a) Multiply Q × Kᵀ (every query with every key):
Example: scores for word "I" with all words:
"I" with "I": Q_I · K_I = [1,0,1]·[0,1,1] = 0+0+1 = 1
"I" with "love": Q_I · K_love = [1,0,1]·[1,1,0] = 1+0+0 = 1
"I" with "learn": Q_I · K_learn = [1,0,1]·[0,0,1] = 0+0+1 = 1
All scores:
"I" "love" "learn"
"I" → [ 1, 1, 1 ]
"love" → [ 1, 1, 0 ]
"learn" → [ 1, 2, 0 ]
b) Divide by √d_k (scaling):
d_k = 3 → √3 ≈ 1.73
"I" "love" "learn"
"I" → [0.58, 0.58, 0.58]
"love" → [0.58, 0.58, 0.00]
"learn" → [0.58, 1.15, 0.00]
c) Apply Softmax (convert to probabilities):
Each row becomes probabilities that sum to 1:
"I" "love" "learn"
"I" → [0.33, 0.33, 0.33] ← attends equally to all words
"love" → [0.39, 0.39, 0.22] ← attends more to "I" and itself
"learn" → [0.30, 0.53, 0.17] ← attends most to "love"!
d) Multiply by V (get weighted output):
output_I = 0.33×[1,0,0] + 0.33×[0,1,0] + 0.33×[0,0,1] = [0.33, 0.33, 0.33]
output_love = 0.39×[1,0,0] + 0.39×[0,1,0] + 0.22×[0,0,1] = [0.39, 0.39, 0.22]
output_learn = 0.30×[1,0,0] + 0.53×[0,1,0] + 0.17×[0,0,1] = [0.30, 0.53, 0.17]
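Steps (a) through (d) can be reproduced end to end with a small sketch (plain Python, full precision internally, rounded only for display):

```python
import math

def softmax(xs):
    """Convert a list of scores into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # (a) dot product with every key, (b) scale by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        # (c) softmax, (d) weighted sum of the value vectors
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]   # "I", "love", "learn"
K = [[0, 1, 1], [1, 1, 0], [0, 0, 1]]
V = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

for word, row in zip(["I", "love", "learn"], attention(Q, K, V)):
    print(word, [round(x, 2) for x in row])
```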
Instead of one attention, run 8 in parallel; each one is called a head.
Each head learns a different type of relationship:
1. For each head, run Self-Attention independently (with different weights):
head₁ = Attention(Q×W₁^Q, K×W₁^K, V×W₁^V)
head₂ = Attention(Q×W₂^Q, K×W₂^K, V×W₂^V)
...
head₈ = Attention(Q×W₈^Q, K×W₈^K, V×W₈^V)
2. Concatenate all results:
MultiHead = Concat(head₁, head₂, ..., head₈) × Wᴼ
Head 1 output for "learn": [0.30, 0.53, 0.17] (learned semantic relations)
Head 2 output for "learn": [0.45, 0.10, 0.45] (learned positional relations)
Concat: [0.30, 0.53, 0.17, 0.45, 0.10, 0.45] × Wᴼ → [0.30, 0.40, 0.30] (final combined output)
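The concatenate-and-project step can be sketched as follows. The two head outputs are taken from the example; the `W_O` matrix below is made up purely for illustration (a real model learns Wᴼ during training, so its actual output values would differ):

```python
# Hypothetical combine step for two heads of size 3.
head1 = [0.30, 0.53, 0.17]
head2 = [0.45, 0.10, 0.45]
concat = head1 + head2                      # length 6 after concatenation

W_O = [[0.4, 0.0, 0.0],                     # illustrative values only
       [0.0, 0.4, 0.0],
       [0.0, 0.0, 0.4],
       [0.3, 0.0, 0.0],
       [0.0, 0.3, 0.0],
       [0.0, 0.0, 0.3]]

# out[j] = sum_i concat[i] * W_O[i][j]  (a 1x6 vector times a 6x3 matrix)
out = [sum(c * row[j] for c, row in zip(concat, W_O)) for j in range(3)]
print([round(v, 2) for v in out])
```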
The original input is added back to the sublayer's output.
LayerNorm(x) = γ × (x - μ) / √(σ² + ε) + β
Where: μ = mean | σ² = variance | γ, β = learnable parameters
x (original input for "learn"): [1.31, 0.18, 0.52]
attention output: [0.30, 0.53, 0.17]
After Residual: [1.31+0.30, 0.18+0.53, 0.52+0.17] = [1.61, 0.71, 0.69]
μ = (1.61 + 0.71 + 0.69) / 3 = 1.00
σ² = ((0.61)² + (-0.29)² + (-0.31)²) / 3 = 0.18
√(σ² + ε) ≈ 0.43
LayerNorm = [(1.61-1.00)/0.43, (0.71-1.00)/0.43, (0.69-1.00)/0.43]
= [1.42, -0.67, -0.72]
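The residual-plus-LayerNorm computation can be checked in Python (a sketch with γ = 1 and β = 0; it keeps full precision throughout, so the last digit can differ slightly from hand-rounded intermediate values):

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one vector to zero mean / unit variance, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [gamma * (v - mu) / math.sqrt(var + eps) + beta for v in x]

x = [1.31, 0.18, 0.52]        # original input for "learn"
attn = [0.30, 0.53, 0.17]     # self-attention output for "learn"
residual = [a + b for a, b in zip(x, attn)]
print([round(v, 2) for v in layer_norm(residual)])
```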
After attention, each position passes through a small neural network:
input: [1.42, -0.67, -0.72]
After W₁ + b₁: [2.1, -0.3, 1.5, -1.2]
After ReLU (max 0): [2.1, 0.0, 1.5, 0.0] ← negatives become 0
After W₂ + b₂: [0.8, 0.5, 0.3] ← back to original size
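The same expand / ReLU / project pattern can be sketched as below. The weight matrices here are made-up small numbers just to show the shapes (a real model learns W₁ of size 512×2048 and W₂ of size 2048×512):

```python
def relu(v):
    """ReLU: clamp negative components to zero."""
    return [max(0.0, x) for x in v]

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand with W1, apply ReLU, project back with W2."""
    h = [sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j]
         for j in range(len(b1))]
    h = relu(h)
    return [sum(hi * W2[i][j] for i, hi in enumerate(h)) + b2[j]
            for j in range(len(b2))]

# Made-up 3 -> 4 -> 3 weights, mirroring the shapes in the example above
W1 = [[1.0, 0.0, 0.5, -1.0],
      [0.0, 1.0, -0.5, 0.5],
      [-0.5, 0.5, 1.0, 0.0]]
b1 = [0.1, -0.1, 0.0, 0.2]
W2 = [[0.5, 0.0, 0.0],
      [0.0, 0.5, 0.0],
      [0.0, 0.0, 0.5],
      [0.25, 0.25, 0.25]]
b2 = [0.0, 0.0, 0.0]

print(feed_forward([1.42, -0.67, -0.72], W1, b1, W2, b2))
```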
The Decoder works one word at a time. Assume we're generating the 3rd word and have:
Decoder inputs so far: ["أنا", "أحب", ???]
Same as Self-Attention, but with a mask:
Scores before mask:
 "أنا" "أحب" ???
"أنا" → [0.8, 0.5, 0.3]
"أحب" → [0.6, 0.9, 0.4]
??? → [0.3, 0.7, 0.8]
Apply Mask (lower triangle only):
 "أنا" "أحب" ???
"أنا" → [0.8, -∞, -∞ ] ← can only see itself
"أحب" → [0.6, 0.9, -∞ ] ← can see "أنا" and itself
??? → [0.3, 0.7, 0.8] ← can see everything before it
After Softmax:
"أنا" → [1.00, 0.00, 0.00]
"أحب" → [0.43, 0.57, 0.00]
??? → [0.24, 0.36, 0.40]
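Masking and softmax can be verified together (a sketch; `math.exp(-inf)` evaluates to 0.0, which is exactly why the -∞ trick makes future positions get zero weight):

```python
import math

NEG_INF = float("-inf")

def softmax(xs):
    exps = [math.exp(x) for x in xs]    # exp(-inf) == 0.0
    s = sum(exps)
    return [e / s for e in exps]

def causal_mask(scores):
    """Replace scores above the diagonal with -inf: position i sees only j <= i."""
    n = len(scores)
    return [[scores[i][j] if j <= i else NEG_INF for j in range(n)]
            for i in range(n)]

scores = [[0.8, 0.5, 0.3],   # "أنا"
          [0.6, 0.9, 0.4],   # "أحب"
          [0.3, 0.7, 0.8]]   # ???
weights = [softmax(row) for row in causal_mask(scores)]
for row in weights:
    print([round(w, 2) for w in row])
```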
Here the Decoder looks at the Encoder's output:
Q = from Decoder (the word we're generating)
K = from Encoder ("I", "love", "learning")
V = from Encoder ("I", "love", "learning")
Q for ???: [0.5, 0.8, 0.3]
Scores with each English word:
??? with "I": 0.5×0.2 + 0.8×0.5 + 0.3×0.1 = 0.53
??? with "love": 0.5×0.9 + 0.8×0.3 + 0.3×0.7 = 0.90
??? with "learning": 0.5×0.4 + 0.8×0.6 + 0.3×0.5 = 0.83
After softmax: [0.26, 0.38, 0.36]
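The cross-attention scores can be checked the same way (a sketch; following the worked numbers above, the softmax here is applied to the raw scores without √d_k scaling):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

q = [0.5, 0.8, 0.3]                       # decoder Query for the 3rd word
encoder_K = {                             # encoder Keys, one per source word
    "I":        [0.2, 0.5, 0.1],
    "love":     [0.9, 0.3, 0.7],
    "learning": [0.4, 0.6, 0.5],
}
scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in encoder_K.values()]
weights = softmax(scores)
print([round(s, 2) for s in scores])
print([round(w, 2) for w in weights])
```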
Final step: the Decoder output passes through:
1. Linear Layer: projects the vector to vocabulary size (e.g., 50,000 words)
2. Softmax: converts the scores to probabilities
Decoder output for ???: [0.25, 0.60, 0.15]
After Linear + Softmax (simplified to 5 words):
["أنا": 0.02, "أحب": 0.05, "التعلم": 0.85, "كتاب": 0.03, "بيت": 0.05]
Highest probability → "التعلم" (0.85) ✓
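This last step can be sketched with a toy 5-word vocabulary. The logits below are invented so that the softmax lands on the example's probabilities; a real model produces ~50,000 logits from a learned linear layer:

```python
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary; the logits are made-up linear-layer outputs
vocab = ["أنا", "أحب", "التعلم", "كتاب", "بيت"]
logits = [-3.91, -3.00, -0.16, -3.51, -3.00]
probs = softmax(logits)
best = vocab[probs.index(max(probs))]
print(best, [round(p, 2) for p in probs])
```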
| Parameter | Value |
|---|---|
| Encoder layers (original paper) | 6 |
| Decoder layers | 6 |
| Embedding size (d_model) | 512 |
| Number of heads | 8 |
| Size per head (d_k) | 512 / 8 = 64 |
| Feed-Forward inner dimension | 2048 |
| Original paper | "Attention Is All You Need" (2017) |
| Authors | Vaswani et al. (Google) |
| Optimizer | Adam (with warmup + decay schedule) |
| Training loss | Cross-Entropy |
| Decoder training trick | Teacher Forcing (feed correct previous token) |
| Type | Examples | Best For | How It Works |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DistilBERT | Classification, NER, understanding | Bidirectional โ sees all words at once |
| Decoder-only | GPT, GPT-2, GPT-3, GPT-4 | Text generation | Autoregressive โ predicts next token left-to-right |
| Encoder-Decoder | BART, T5, Original Transformer | Translation, summarization | Full sequence-to-sequence |
Prepared by Dr. Abdulkarim Albanna (AI Applications Course)