โ† Back to Transformer Overview   |   ๐Ÿ‡ฏ๐Ÿ‡ด ุงู„ู†ุณุฎุฉ ุงู„ุนุฑุจูŠุฉ

Transformer – Complete Step-by-Step Guide

Running Example (used throughout):
Translation from English to Arabic:
Input: "I love learning"
Output: "أنا أحب التعلم"

1. Input Embedding (Words → Numbers)

First step: each word is converted to a vector (list of numbers). We'll assume embedding dimension = 4 (in practice it's 512).

"I"        โ†’ [0.2,  0.5,  0.1,  0.8]
"love"     โ†’ [0.9,  0.3,  0.7,  0.2]
"learning" โ†’ [0.4,  0.6,  0.5,  0.9]

These numbers are learned during training; words with similar meanings end up with similar vectors.
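The lookup can be sketched in a few lines of NumPy (a toy sketch: the dictionary below holds the example's made-up vectors, not a real trained table):

```python
import numpy as np

# Toy lookup table with the example's made-up 4-dim vectors
# (real models learn a (vocab_size, 512) matrix during training).
embedding = {
    "I":        np.array([0.2, 0.5, 0.1, 0.8]),
    "love":     np.array([0.9, 0.3, 0.7, 0.2]),
    "learning": np.array([0.4, 0.6, 0.5, 0.9]),
}

def embed(tokens):
    """One vector per token -> (seq_len, d_model) matrix."""
    return np.stack([embedding[t] for t in tokens])

X = embed(["I", "love", "learning"])
print(X.shape)  # (3, 4)
```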


2. Positional Encoding (Adding Word Order)

The Problem:

Transformers process all words in parallel (not one by one), so they don't know that "I" is the 1st word and "learning" is the 3rd.

The Solution:

Add a unique vector to each position using sin and cos functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where: pos = word position (0, 1, 2, ...)  |  i = dimension index  |  d = embedding size (= 4)

Think of it like a clock: a fast second hand and a slow hour hand, read together, pin down one exact moment. The sin/cos pairs work the same way, combining fast- and slow-oscillating dimensions.

This gives every position a unique fingerprint.

Calculation:

Position 0 ("I"):

PE(0,0) = sin(0 / 10000^(0/4)) = sin(0) = 0.00
PE(0,1) = cos(0 / 10000^(0/4)) = cos(0) = 1.00
PE(0,2) = sin(0 / 10000^(2/4)) = sin(0) = 0.00
PE(0,3) = cos(0 / 10000^(2/4)) = cos(0) = 1.00
→ PE₀ = [0.00, 1.00, 0.00, 1.00]

Position 1 ("love"):

PE(1,0) = sin(1) = 0.84       PE(1,1) = cos(1) = 0.54
PE(1,2) = sin(0.01) = 0.01    PE(1,3) = cos(0.01) = 1.00
→ PE₁ = [0.84, 0.54, 0.01, 1.00]

Position 2 ("learning"):

PE(2,0) = sin(2) = 0.91       PE(2,1) = cos(2) = -0.42
PE(2,2) = sin(0.02) = 0.02    PE(2,3) = cos(0.02) = 1.00
→ PE₂ = [0.91, -0.42, 0.02, 1.00]

Add them together (Embedding + Position):

"I"        = [0.2, 0.5, 0.1, 0.8] + [0.00, 1.00, 0.00, 1.00] = [0.20, 1.50, 0.10, 1.80]
"love"     = [0.9, 0.3, 0.7, 0.2] + [0.84, 0.54, 0.01, 1.00] = [1.74, 0.84, 0.71, 1.20]
"learning" = [0.4, 0.6, 0.5, 0.9] + [0.91,-0.42, 0.02, 1.00] = [1.31, 0.18, 0.52, 1.90]

Now each word knows what it is (embedding) and where it is (position). ✅
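The sin/cos formulas above can be computed directly; a minimal NumPy sketch (the function name is mine) that reproduces PE₀, PE₁, PE₂:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(same angle)."""
    pe = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]      # positions 0, 1, 2, ...
    i = np.arange(0, d_model, 2)           # even dimension indices 0, 2, ...
    angle = pos / (10000 ** (i / d_model))
    pe[:, 0::2] = np.sin(angle)            # even dims get sin
    pe[:, 1::2] = np.cos(angle)            # odd dims get cos
    return pe

pe = positional_encoding(seq_len=3, d_model=4)
print(np.round(pe, 2))   # rows match PE₀, PE₁, PE₂ computed above
```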

3. Self-Attention (The Core Mechanism)

The Idea:

Each word asks: "Which other words are important for understanding my meaning?"

Step 1: Create Q, K, V

For each word, multiply by three learned weight matrices:

Q = X × Wq    (Query  – "What am I looking for?")
K = X × Wk    (Key    – "What do I contain?")
V = X × Wv    (Value  – "What information do I give?")

After multiplication (simplified to size 3):

            Q              K              V
"I"      [1, 0, 1]     [0, 1, 1]     [1, 0, 0]
"love"   [0, 1, 0]     [1, 1, 0]     [0, 1, 0]
"learn"  [1, 1, 0]     [0, 0, 1]     [0, 0, 1]

Step 2: Compute Attention Scores

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

a) Multiply Q × Kᵀ (every query with every key):

Example: scores for word "I" with all words:

"I" with "I":     Q_I · K_I     = [1,0,1]·[0,1,1] = 0+0+1 = 1
"I" with "love":  Q_I · K_love  = [1,0,1]·[1,1,0] = 1+0+0 = 1
"I" with "learn": Q_I · K_learn = [1,0,1]·[0,0,1] = 0+0+1 = 1

All scores:

              "I"    "love"   "learn"
"I"      →  [ 1,      1,       1    ]
"love"   →  [ 1,      1,       0    ]
"learn"  →  [ 1,      2,       0    ]

b) Divide by √d_k (scaling):

d_k = 3  →  √3 ≈ 1.73

              "I"    "love"   "learn"
"I"      →  [0.58,   0.58,    0.58]
"love"   →  [0.58,   0.58,    0.00]
"learn"  →  [0.58,   1.15,    0.00]

Why divide? Without scaling, large dot products push softmax into its saturated region, where outputs are nearly pure 0s and 1s and gradients vanish, so the model can't learn effectively.

c) Apply Softmax (convert to probabilities):

Each row becomes probabilities that sum to 1:

              "I"    "love"   "learn"
"I"      →  [0.33,   0.33,    0.33]   ← attends equally to all words
"love"   →  [0.39,   0.39,    0.22]   ← attends more to "I" and itself
"learn"  →  [0.30,   0.53,    0.17]   ← attends most to "love"!

Notice: "learning" attends most to "love", which makes sense: "love learning" are semantically connected.

d) Multiply by V (get weighted output):

output_I     = 0.33×[1,0,0] + 0.33×[0,1,0] + 0.33×[0,0,1] = [0.33, 0.33, 0.33]
output_love  = 0.39×[1,0,0] + 0.39×[0,1,0] + 0.22×[0,0,1] = [0.39, 0.39, 0.22]
output_learn = 0.30×[1,0,0] + 0.53×[0,1,0] + 0.17×[0,0,1] = [0.30, 0.53, 0.17]

Now each word carries information from the words that matter to it. ✅
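The whole softmax(Q × Kᵀ / √d_k) × V pipeline fits in a few lines. A NumPy sketch using the toy Q, K, V from this section (note that V happens to be the identity matrix, so the output rows equal the attention weights):

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # every query against every key, scaled
    weights = softmax(scores)         # each row sums to 1
    return weights @ V, weights

# The toy Q, K, V rows for "I", "love", "learn".
Q = np.array([[1., 0., 1.], [0., 1., 0.], [1., 1., 0.]])
K = np.array([[0., 1., 1.], [1., 1., 0.], [0., 0., 1.]])
V = np.eye(3)                          # rows [1,0,0], [0,1,0], [0,0,1]

out, w = attention(Q, K, V)
print(np.round(w, 2))   # "learn" row ≈ [0.30, 0.53, 0.17]
```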

4. Multi-Head Attention (Multiple Perspectives)

The Idea:

Instead of one attention, run 8 in parallel; each one is called a head.

Why?

Each head learns a different type of relationship, for example grammatical structure, word order, or semantic similarity.

Steps:

1. For each head: run Self-Attention independently (with different weights)
   head₁ = Attention(Q×W₁Q, K×W₁K, V×W₁V)
   head₂ = Attention(Q×W₂Q, K×W₂K, V×W₂V)
   ...
   head₈ = Attention(Q×W₈Q, K×W₈K, V×W₈V)

2. Concatenate all results:
   MultiHead = Concat(head₁, head₂, ..., head₈) × Wᴼ

In our example (simplified to 2 heads):

Head 1 output for "learn": [0.30, 0.53, 0.17]  (learned semantic relations)
Head 2 output for "learn": [0.45, 0.10, 0.45]  (learned positional relations)

Concat: [0.30, 0.53, 0.17, 0.45, 0.10, 0.45]
× Wᴼ → [0.30, 0.40, 0.30]  (final combined output)
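The concat-and-mix step can be sketched with random matrices standing in for the learned weights (shapes are toy-sized; the paper uses 8 heads of size 64 with d_model = 512):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(X, Wq, Wk, Wv, Wo):
    """One attention pass per head, concatenated, then mixed by Wo."""
    heads = [attention(X @ wq, X @ wk, X @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

n_heads, d_model, d_k = 2, 4, 2            # paper: 8 heads, 512, 64
Wq = rng.normal(size=(n_heads, d_model, d_k))
Wk = rng.normal(size=(n_heads, d_model, d_k))
Wv = rng.normal(size=(n_heads, d_model, d_k))
Wo = rng.normal(size=(n_heads * d_k, d_model))

X = rng.normal(size=(3, d_model))          # 3 words, toy embeddings
Y = multi_head(X, Wq, Wk, Wv, Wo)
print(Y.shape)                             # (3, 4): same shape as the input
```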

5. Residual Connection + Layer Normalization

Residual (Skip) Connection:

output = x + Sublayer(x)

The original input is added back to the sublayer's output.

Why? Prevents vanishing gradients in deep networks and ensures original information isn't lost.

Layer Normalization:

LayerNorm(x) = γ × (x - μ) / √(σ² + ε) + β

Where: μ = mean  |  σ² = variance  |  γ, β = learnable parameters

In our example:

x (original input for "learn"):    [1.31, 0.18, 0.52]
attention output:                  [0.30, 0.53, 0.17]

After Residual: [1.31+0.30, 0.18+0.53, 0.52+0.17] = [1.61, 0.71, 0.69]

μ = (1.61 + 0.71 + 0.69) / 3 ≈ 1.00
σ² = ((0.61)² + (-0.29)² + (-0.31)²) / 3 ≈ 0.18
√(σ² + ε) ≈ 0.43

LayerNorm = [(1.61-1.00)/0.43, (0.71-1.00)/0.43, (0.69-1.00)/0.43]
          = [1.42, -0.67, -0.72]   (taking γ = 1, β = 0)
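Both steps can be sketched together in NumPy (the `sub` vector is an illustrative attention output; γ and β default to 1 and 0):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """(x - mean) / sqrt(variance + eps), then scale and shift."""
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta

def add_and_norm(x, sublayer_out):
    return layer_norm(x + sublayer_out)   # residual first, then normalize

x   = np.array([1.31, 0.18, 0.52])   # "learn" after embedding + position
sub = np.array([0.30, 0.53, 0.17])   # an attention output (illustrative)
y = add_and_norm(x, sub)
print(np.round(y, 2))                # zero mean, unit variance
```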

6. Feed-Forward Network

After attention, each position passes through a small neural network:

FFN(x) = max(0, x×W₁ + b₁) × W₂ + b₂

In our example:

input: [1.42, -0.67, -0.72]

After W₁ + b₁:       [2.1, -0.3, 1.5, -1.2]
After ReLU (max 0):  [2.1,  0.0, 1.5,  0.0]    ← negatives become 0
After W₂ + b₂:       [0.8,  0.5, 0.3]          ← back to original size

+ Another Residual + LayerNorm → Encoder output is ready! ✅
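A sketch with random matrices standing in for the learned W₁, W₂ (the paper expands 512 → 2048 → 512; here the toy sizes are 3 → 4 → 3):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 3, 4                 # paper: 512 and 2048
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """max(0, x W1 + b1) W2 + b2 -- expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

X = rng.normal(size=(3, d_model))    # one row per position
out = ffn(X)
print(out.shape)                     # (3, 3): same shape in, same shape out
```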

7. Decoder โ€” Generating the Translation

The Decoder works one word at a time. Assume we're generating the 3rd word and have:

Decoder inputs so far: ["أنا", "أحب", ???]

7.1 Masked Self-Attention

Same as Self-Attention, but with a mask:

Scores before mask:
           "أنا"   "أحب"    ???
"أنا"  →  [0.8,    0.5,    0.3]
"أحب"  →  [0.6,    0.9,    0.4]
 ???   →  [0.3,    0.7,    0.8]

Apply Mask (lower triangle only):
           "أنا"   "أحب"    ???
"أنا"  →  [0.8,     -∞,     -∞ ]   ← can only see itself
"أحب"  →  [0.6,    0.9,     -∞ ]   ← can see "أنا" and itself
 ???   →  [0.3,    0.7,    0.8]   ← can see everything up to itself

After Softmax:
"أنا"  →  [1.00,   0.00,   0.00]
"أحب"  →  [0.43,   0.57,   0.00]
 ???   →  [0.24,   0.36,   0.40]

Why the Mask? During generation, the model must NOT cheat by looking at future words it hasn't generated yet!
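The mask is just an upper-triangular matrix of -∞ added to the scores before softmax. A NumPy sketch on the score table above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Raw decoder self-attention scores from the example above.
scores = np.array([[0.8, 0.5, 0.3],
                   [0.6, 0.9, 0.4],
                   [0.3, 0.7, 0.8]])

# Causal mask: any position j > i gets -inf, so softmax gives it weight 0.
mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
masked = np.where(mask, -np.inf, scores)

weights = softmax(masked)
print(np.round(weights, 2))   # upper triangle is exactly 0
```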

7.2 Cross-Attention (Encoder-Decoder Attention)

Here the Decoder looks at the Encoder's output:

Q = from Decoder (the word we're generating)
K = from Encoder ("I", "love", "learning")
V = from Encoder ("I", "love", "learning")

Q for ???:  [0.5, 0.8, 0.3]

Scores with each English word:
  ??? with "I":        0.5×0.2 + 0.8×0.5 + 0.3×0.1 = 0.53
  ??? with "love":     0.5×0.9 + 0.8×0.3 + 0.3×0.7 = 0.90
  ??? with "learning": 0.5×0.4 + 0.8×0.6 + 0.3×0.5 = 0.83

After softmax: [0.26, 0.38, 0.36]

The 3rd Arabic word attends most to "love" and "learning", which makes sense since "التعلم" translates "learning"! ✅
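This is the same dot-product machinery as before, only the query comes from the Decoder while the keys come from the Encoder (toy values from the example):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# K from the Encoder: first 3 dims of the toy embeddings
# for "I", "love", "learning".
K_enc = np.array([[0.2, 0.5, 0.1],
                  [0.9, 0.3, 0.7],
                  [0.4, 0.6, 0.5]])
q_dec = np.array([0.5, 0.8, 0.3])    # Decoder query for the word being generated

scores = K_enc @ q_dec               # one score per English word
weights = softmax(scores)
print(np.round(weights, 2))          # ≈ [0.26, 0.38, 0.36]
```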

8. Linear + Softmax (Predicting the Word)

Final step โ€” the Decoder output passes through:

1. Linear Layer: projects vector to vocabulary size (e.g., 50,000 words)
2. Softmax: converts to probabilities

Decoder output for ???: [0.25, 0.60, 0.15]

After Linear + Softmax (simplified to a 5-word vocabulary):
["أنا": 0.02, "أحب": 0.05, "التعلم": 0.85, "كتاب": 0.03, "بيت": 0.05]

Highest probability → "التعلم" (0.85) ✅

Result: "أنا أحب التعلم" 🎉
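A sketch of the final projection step (a random matrix stands in for the learned output layer, so the winning word here is arbitrary, unlike the trained model above):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["أنا", "أحب", "التعلم", "كتاب", "بيت"]   # toy 5-word vocabulary
d_model = 3
W_out = rng.normal(size=(d_model, len(vocab)))    # the Linear layer's weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.array([0.25, 0.60, 0.15])     # Decoder output for the 3rd position
probs = softmax(h @ W_out)           # one probability per vocabulary word
print(vocab[int(np.argmax(probs))])  # pick the most probable word
```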

9. Full Pipeline Overview

ENCODER (× 6 layers)

  "I love learning"
        ↓
  [Input Embedding]              → vector for each word
        ↓
  [+ Positional Encoding]        → add position information
        ↓
  [Multi-Head Self-Attention]    → each word attends to all others
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Feed-Forward Network]         → neural net per position
        ↓
  [+ Residual + LayerNorm]
        ↓
  ══► Encoder Output (sent to Decoder)

DECODER (× 6 layers)

  "أنا أحب" (words generated so far)
        ↓
  [Output Embedding + Positional Encoding]
        ↓
  [Masked Multi-Head Self-Attention]   → can't see future
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Cross-Attention]                    → Q from here, K+V from Encoder
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Feed-Forward Network]
        ↓
  [+ Residual + LayerNorm]
        ↓
  [Linear → Softmax]                   → probability over vocabulary
        ↓
  ══► "التعلم"  ← highest probability word

10. Key Facts (Exam Reference)

Parameter                        | Value
---------------------------------|-----------------------------------------------
Encoder layers (original paper)  | 6
Decoder layers                   | 6
Embedding size (d_model)         | 512
Number of heads                  | 8
Size per head (d_k)              | 512 / 8 = 64
Feed-Forward inner dimension     | 2048
Original paper                   | "Attention Is All You Need" (2017)
Authors                          | Vaswani et al. (Google)
Optimizer                        | Adam (with warmup + decay schedule)
Training loss                    | Cross-Entropy
Decoder training trick           | Teacher Forcing (feed correct previous token)

11. Three Types of Transformer Models

Type            | Examples                       | Best For                           | How It Works
----------------|--------------------------------|------------------------------------|--------------------------------------------------
Encoder-only    | BERT, RoBERTa, DistilBERT      | Classification, NER, understanding | Bidirectional: sees all words at once
Decoder-only    | GPT, GPT-2, GPT-3, GPT-4       | Text generation                    | Autoregressive: predicts next token left-to-right
Encoder-Decoder | BART, T5, Original Transformer | Translation, summarization         | Full sequence-to-sequence

Prepared by Dr. Abdulkarim Albanna – AI Applications Course