โ† Back to Transformer Overview   |   ๐Ÿ‡ฏ๐Ÿ‡ด ุงู„ู†ุณุฎุฉ ุงู„ุนุฑุจูŠุฉ

Transformer – Complete Step-by-Step Guide

Running Example (used throughout):
Translation from English to Arabic:
Input: "I love learning"
Output: "أنا أحب التعلم"

1. Input Embedding (Words → Numbers)

First step: each word is converted to a vector (list of numbers). We'll assume embedding dimension = 4 (in practice it's 512).

"I"        โ†’ [0.2,  0.5,  0.1,  0.8]
"love"     โ†’ [0.9,  0.3,  0.7,  0.2]
"learning" โ†’ [0.4,  0.6,  0.5,  0.9]

These numbers are learned during training; words with similar meanings end up with similar vectors.
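The lookup can be sketched in a few lines of NumPy (a toy sketch: the dictionary below holds the example's made-up vectors, not a real trained table):

```python
import numpy as np

# Toy lookup table with the example's made-up 4-dim vectors
# (real models learn a (vocab_size, 512) matrix during training).
embedding = {
    "I":        np.array([0.2, 0.5, 0.1, 0.8]),
    "love":     np.array([0.9, 0.3, 0.7, 0.2]),
    "learning": np.array([0.4, 0.6, 0.5, 0.9]),
}

def embed(tokens):
    """One vector per token -> (seq_len, d_model) matrix."""
    return np.stack([embedding[t] for t in tokens])

X = embed(["I", "love", "learning"])
print(X.shape)  # (3, 4)
```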


2. Positional Encoding (Adding Word Order)

The Problem:

Transformers process all words in parallel (not one by one), so they don't know that "I" is the 1st word and "learning" is the 3rd.

The Solution:

Add a unique vector to each position using sin and cos functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where: pos = word position (0, 1, 2, ...)  |  i = dimension index  |  d = embedding size (= 4)

Think of it like a clock: a fast second hand and a slow hour hand, read together, pin down one exact moment. The sin/cos pairs work the same way, combining fast- and slow-oscillating dimensions.

This gives every position a unique fingerprint.

Calculation:

Position 0 ("I"):

PE(0,0) = sin(0 / 10000^(0/4)) = sin(0) = 0.00
PE(0,1) = cos(0 / 10000^(0/4)) = cos(0) = 1.00
PE(0,2) = sin(0 / 10000^(2/4)) = sin(0) = 0.00
PE(0,3) = cos(0 / 10000^(2/4)) = cos(0) = 1.00
→ PE₀ = [0.00, 1.00, 0.00, 1.00]

Position 1 ("love"):

PE(1,0) = sin(1) = 0.84       PE(1,1) = cos(1) = 0.54
PE(1,2) = sin(0.01) = 0.01    PE(1,3) = cos(0.01) = 1.00
→ PE₁ = [0.84, 0.54, 0.01, 1.00]

Position 2 ("learning"):

PE(2,0) = sin(2) = 0.91       PE(2,1) = cos(2) = -0.42
PE(2,2) = sin(0.02) = 0.02    PE(2,3) = cos(0.02) = 1.00
→ PE₂ = [0.91, -0.42, 0.02, 1.00]

Add them together (Embedding + Position):

"I"        = [0.2, 0.5, 0.1, 0.8] + [0.00, 1.00, 0.00, 1.00] = [0.20, 1.50, 0.10, 1.80]
"love"     = [0.9, 0.3, 0.7, 0.2] + [0.84, 0.54, 0.01, 1.00] = [1.74, 0.84, 0.71, 1.20]
"learning" = [0.4, 0.6, 0.5, 0.9] + [0.91,-0.42, 0.02, 1.00] = [1.31, 0.18, 0.52, 1.90]

Now each word knows what it is (embedding) and where it is (position). ✅
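The sin/cos formulas above can be computed directly; a minimal NumPy sketch (the function name is mine) that reproduces PE₀, PE₁, PE₂:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]      # positions 0, 1, 2, ...
    i = np.arange(0, d_model, 2)           # even dimension indices 0, 2, ...
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = np.sin(angle)            # even dims get sin
    pe[:, 1::2] = np.cos(angle)            # odd dims get cos
    return pe

pe = positional_encoding(seq_len=3, d_model=4)
print(np.round(pe, 2))   # rows match PE₀, PE₁, PE₂ computed above
```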

3. Self-Attention (The Core Mechanism)

The Idea:

Each word asks: "Which other words are important for understanding my meaning?"

Step 1: Create Q, K, V

For each word, multiply by three learned weight matrices:

Q = X × Wq    (Query  – "What am I looking for?")
K = X × Wk    (Key    – "What do I contain?")
V = X × Wv    (Value  – "What information do I give?")

After multiplication (simplified to size 3):

            Q              K              V
"I"      [1, 0, 1]     [0, 1, 1]     [1, 0, 0]
"love"   [0, 1, 0]     [1, 1, 0]     [0, 1, 0]
"learn"  [1, 1, 0]     [0, 0, 1]     [0, 0, 1]

Step 2: Compute Attention Scores

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

a) Multiply Q × Kᵀ (every query with every key):

Example: scores for word "I" with all words:

"I" with "I":     Q_I · K_I     = [1,0,1]·[0,1,1] = 0+0+1 = 1
"I" with "love":  Q_I · K_love  = [1,0,1]·[1,1,0] = 1+0+0 = 1
"I" with "learn": Q_I · K_learn = [1,0,1]·[0,0,1] = 0+0+1 = 1

All scores:

              "I"    "love"   "learn"
"I"      →  [ 1,      1,       1    ]
"love"   →  [ 1,      1,       0    ]
"learn"  →  [ 1,      2,       0    ]

b) Divide by √d_k (scaling):

d_k = 3  →  √3 ≈ 1.73

              "I"    "love"   "learn"
"I"      →  [0.58,   0.58,    0.58]
"love"   →  [0.58,   0.58,    0.00]
"learn"  →  [0.58,   1.15,    0.00]

Why divide? Without scaling, large dot products push softmax into its saturated region, where outputs are nearly pure 0s and 1s and gradients vanish, so the model can't learn effectively.

c) Apply Softmax (convert to probabilities):

Each row becomes probabilities that sum to 1:

              "I"    "love"   "learn"
"I"      →  [0.33,   0.33,    0.33]   ← attends equally to all words
"love"   →  [0.39,   0.39,    0.22]   ← attends more to "I" and itself
"learn"  →  [0.30,   0.53,    0.17]   ← attends most to "love"!

Notice: "learning" attends most to "love", which makes sense: "love learning" are semantically connected.

d) Multiply by V (get weighted output):

output_I     = 0.33×[1,0,0] + 0.33×[0,1,0] + 0.33×[0,0,1] = [0.33, 0.33, 0.33]
output_love  = 0.39×[1,0,0] + 0.39×[0,1,0] + 0.22×[0,0,1] = [0.39, 0.39, 0.22]
output_learn = 0.30×[1,0,0] + 0.53×[0,1,0] + 0.17×[0,0,1] = [0.30, 0.53, 0.17]

Now each word carries information from the words that matter to it. ✅
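The whole softmax(Q × Kᵀ / √d_k) × V pipeline fits in a few lines. A NumPy sketch using the toy Q, K, V from this section (note that V happens to be the identity matrix, so the output rows equal the attention weights):

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every query against every key, scaled
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

# The toy Q, K, V rows for "I", "love", "learn".
Q = np.array([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.]])
K = np.array([[0., 1., 1.], [1., 1., 0.], [0., 0., 1.]])
V = np.eye(3)                          # rows [1,0,0], [0,1,0], [0,0,1]

out, w = attention(Q, K, V)
print(np.round(w, 2))   # "learn" row ≈ [0.30, 0.53, 0.17]
```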

4. Multi-Head Attention (Multiple Perspectives)

The Idea:

Instead of one attention, run 8 in parallel; each one is called a head.

Why?

Each head learns a different type of relationship, for example grammatical structure, word order, or semantic similarity.

Steps:

1. For each head: run Self-Attention independently (with different weights)
   head₁ = Attention(Q×W₁Q, K×W₁K, V×W₁V)
   head₂ = Attention(Q×W₂Q, K×W₂K, V×W₂V)
   ...
   head₈ = Attention(Q×W₈Q, K×W₈K, V×W₈V)

2. Concatenate all results:
   MultiHead = Concat(head₁, head₂, ..., head₈) × Wᴼ

In our example (simplified to 2 heads):

Head 1 output for "learn": [0.30, 0.53, 0.17]  (learned semantic relations)
Head 2 output for "learn": [0.45, 0.10, 0.45]  (learned positional relations)

Concat: [0.30, 0.53, 0.17, 0.45, 0.10, 0.45]
× Wᴼ → [0.30, 0.40, 0.30]  (final combined output)
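The concat-and-mix step can be sketched with random matrices standing in for the learned weights (shapes are toy-sized; the paper uses 8 heads of size 64 with d_model = 512):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """One attention pass per head, concatenated, then mixed by Wo."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

n_heads, d_model, d_k = 2, 4, 2            # paper: 8 heads, 512, 64
Wq = rng.normal(size=(n_heads, d_model, d_k))
Wk = rng.normal(size=(n_heads, d_model, d_k))
Wv = rng.normal(size=(n_heads, d_model, d_k))
Wo = rng.normal(size=(n_heads * d_k, d_model))

X = rng.normal(size=(3, d_model))          # 3 words, toy embeddings
Y = multi_head(X, Wq, Wk, Wv, Wo)
print(Y.shape)                             # (3, 4): same shape as the input
```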

5. Residual Connection + Layer Normalization

Residual (Skip) Connection:

output = x + Sublayer(x)

The original input is added back to the sublayer's output.

Why? Prevents vanishing gradients in deep networks and ensures original information isn't lost.

Layer Normalization:

LayerNorm(x) = γ × (x - μ) / √(σ² + ε) + β

Where: μ = mean  |  σ² = variance  |  γ, β = learnable parameters

In our example:

x (original input for "learn"):    [1.31, 0.18, 0.52]
attention output:                  [0.30, 0.53, 0.17]

After Residual: [1.31+0.30, 0.18+0.53, 0.52+0.17] = [1.61, 0.71, 0.69]

μ = (1.61 + 0.71 + 0.69) / 3 ≈ 1.00
σ² = ((0.61)² + (-0.29)² + (-0.31)²) / 3 ≈ 0.18
√(σ² + ε) ≈ 0.43

LayerNorm = [(1.61-1.00)/0.43, (0.71-1.00)/0.43, (0.69-1.00)/0.43]
          = [1.42, -0.67, -0.72]   (taking γ = 1, β = 0)
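Both steps can be sketched together in NumPy (the `sub` vector is an illustrative attention output; γ and β default to 1 and 0):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """(x - mean) / sqrt(variance + eps), then scale and shift."""
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)   # residual first, then normalize

x   = np.array([1.31, 0.18, 0.52])   # "learn" after embedding + position
sub = np.array([0.30, 0.53, 0.17])   # an attention output (illustrative)
y = add_and_norm(x, sub)
print(np.round(y, 2))                # zero mean, unit variance
```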

6. Feed-Forward Network

After attention, each position passes through a small neural network:

FFN(x) = max(0, x×W₁ + b₁) × W₂ + b₂

In our example:

input: [1.42, -0.67, -0.72]

After W₁ + b₁:       [2.1, -0.3, 1.5, -1.2]
After ReLU (max 0):  [2.1,  0.0, 1.5,  0.0]    ← negatives become 0
After W₂ + b₂:       [0.8,  0.5, 0.3]          ← back to original size

+ Another Residual + LayerNorm → Encoder output is ready! ✅
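A sketch with random matrices standing in for the learned W₁, W₂ (the paper expands 512 → 2048 → 512; here the toy sizes are 3 → 4 → 3):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 3, 4                 # paper: 512 and 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """max(0, x W1 + b1) W2 + b2 -- expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(3, d_model))    # one row per position
out = ffn(X)
print(out.shape)                     # (3, 3): same shape in, same shape out
```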

7. Decoder โ€” Generating the Translation

The Decoder works one word at a time. Assume we're generating the 3rd word and have:

Decoder inputs so far: ["أنا", "أحب", ???]

7.1 Masked Self-Attention

Same as Self-Attention, but with a mask:

Scores before mask:
           "أنا"   "أحب"    ???
"أنا"  →  [0.8,    0.5,    0.3]
"أحب"  →  [0.6,    0.9,    0.4]
 ???   →  [0.3,    0.7,    0.8]

Apply Mask (lower triangle only):
           "أنا"   "أحب"    ???
"أنا"  →  [0.8,     -∞,     -∞ ]   ← can only see itself
"أحب"  →  [0.6,    0.9,     -∞ ]   ← can see "أنا" and itself
 ???   →  [0.3,    0.7,    0.8]   ← can see everything up to itself

After Softmax:
"أنا"  →  [1.00,   0.00,   0.00]
"أحب"  →  [0.43,   0.57,   0.00]
 ???   →  [0.24,   0.36,   0.40]

Why the Mask? During generation, the model must NOT cheat by looking at future words it hasn't generated yet!
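The mask is just an upper-triangular matrix of -∞ added to the scores before softmax. A NumPy sketch on the score table above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Raw decoder self-attention scores from the example above.
scores = np.array([[0.8, 0.5, 0.3],
                   [0.6, 0.9, 0.4],
                   [0.3, 0.7, 0.8]])

# Causal mask: any position j > i gets -inf, so softmax gives it weight 0.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(mask, -np.inf, scores)

weights = softmax(masked)
print(np.round(weights, 2))   # upper triangle is exactly 0
```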

7.2 Cross-Attention (Encoder-Decoder Attention)

Here the Decoder looks at the Encoder's output:

Q = from Decoder (the word we're generating)
K = from Encoder ("I", "love", "learning")
V = from Encoder ("I", "love", "learning")

Q for ???:  [0.5, 0.8, 0.3]

Scores with each English word:
  ??? with "I":        0.5×0.2 + 0.8×0.5 + 0.3×0.1 = 0.53
  ??? with "love":     0.5×0.9 + 0.8×0.3 + 0.3×0.7 = 0.90
  ??? with "learning": 0.5×0.4 + 0.8×0.6 + 0.3×0.5 = 0.83

After softmax: [0.26, 0.38, 0.36]

The 3rd Arabic word attends most to "love" and "learning", which makes sense since "التعلم" translates "learning"! ✅
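This is the same dot-product machinery as before, only the query comes from the Decoder while the keys come from the Encoder (toy values from the example):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# K from the Encoder: first 3 dims of the toy embeddings
# for "I", "love", "learning".
K_enc = np.array([[0.2, 0.5, 0.1],
                  [0.9, 0.3, 0.7],
                  [0.4, 0.6, 0.5]])
q_dec = np.array([0.5, 0.8, 0.3])    # Decoder query for the word being generated

scores = K_enc @ q_dec               # one score per English word
weights = softmax(scores)
print(np.round(weights, 2))          # ≈ [0.26, 0.38, 0.36]
```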

8. Linear + Softmax (Predicting the Word)

Final step โ€” the Decoder output passes through:

1. Linear Layer: projects vector to vocabulary size (e.g., 50,000 words)
2. Softmax: converts to probabilities

Decoder output for ???: [0.25, 0.60, 0.15]

After Linear + Softmax (simplified to a 5-word vocabulary):
["أنا": 0.02, "أحب": 0.05, "التعلم": 0.85, "كتاب": 0.03, "بيت": 0.05]

Highest probability → "التعلم" (0.85) ✅

Result: "أنا أحب التعلم" 🎉
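A sketch of the final projection step (a random matrix stands in for the learned output layer, so the winning word here is arbitrary, unlike the trained model above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["أنا", "أحب", "التعلم", "كتاب", "بيت"]   # toy 5-word vocabulary
d_model = 3
W_out = rng.normal(size=(d_model, len(vocab)))    # the Linear layer's weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.array([0.25, 0.60, 0.15])     # Decoder output for the 3rd position
probs = softmax(h @ W_out)           # one probability per vocabulary word
print(vocab[int(np.argmax(probs))])  # pick the most probable word
```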

9. Full Pipeline Overview

ENCODER (× 6 layers)

  "I love learning"
        ↓
  [Input Embedding]              → vector for each word
        ↓
  [+ Positional Encoding]        → add position information
        ↓
  [Multi-Head Self-Attention]    → each word attends to all others
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Feed-Forward Network]         → neural net per position
        ↓
  [+ Residual + LayerNorm]
        ↓
  ══► Encoder Output (sent to Decoder)

DECODER (× 6 layers)

  "أنا أحب" (words generated so far)
        ↓
  [Output Embedding + Positional Encoding]
        ↓
  [Masked Multi-Head Self-Attention]   → can't see future
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Cross-Attention]                    → Q from here, K+V from Encoder
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Feed-Forward Network]
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Linear → Softmax]                   → probability over vocabulary
        ↓
  ══► "التعلم"  ← highest probability word

10. Key Facts (Exam Reference)

Parameter                        | Value
---------------------------------|-----------------------------------------------
Encoder layers (original paper)  | 6
Decoder layers                   | 6
Embedding size (d_model)         | 512
Number of heads                  | 8
Size per head (d_k)              | 512 / 8 = 64
Feed-Forward inner dimension     | 2048
Original paper                   | "Attention Is All You Need" (2017)
Authors                          | Vaswani et al. (Google)
Optimizer                        | Adam (with warmup + decay schedule)
Training loss                    | Cross-Entropy
Decoder training trick           | Teacher Forcing (feed correct previous token)

11. Three Types of Transformer Models

Type            | Examples                       | Best For                           | How It Works
----------------|--------------------------------|------------------------------------|--------------------------------------------------
Encoder-only    | BERT, RoBERTa, DistilBERT      | Classification, NER, understanding | Bidirectional: sees all words at once
Decoder-only    | GPT, GPT-2, GPT-3, GPT-4       | Text generation                    | Autoregressive: predicts next token left-to-right
Encoder-Decoder | BART, T5, Original Transformer | Translation, summarization         | Full sequence-to-sequence

Prepared by Dr. Abdulkarim Albanna – AI Applications Course