21 questions covering every component of the Transformer architecture
A Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. (Google).
It processes entire sequences in parallel using a mechanism called self-attention — no recurrence (RNN) or convolution (CNN) needed.
It consists of an Encoder (understands the input) and a Decoder (generates the output).
| Problem with RNN/LSTM | Transformer Solution |
|---|---|
| Sequential processing (slow) | Parallel processing (fast) |
| Vanishing gradients | Residual connections + LayerNorm |
| Hard to capture long-range dependencies | Self-attention sees ALL words at once |
| Can't scale well | Scales efficiently with GPU parallelism |
| # | Component | Purpose |
|---|---|---|
| 1 | Input Embedding | Converts words to vectors |
| 2 | Positional Encoding | Adds position information (sin/cos) |
| 3 | Multi-Head Self-Attention | Each word attends to all others |
| 4 | Feed-Forward Network | Two-layer neural net per position |
| 5 | Residual + LayerNorm | Stabilize training |
| 6 | Linear + Softmax | Predict next word (Decoder output) |
Self-Attention allows each word to look at all other words in the sentence to understand its context.
Example: In "The cat sat on the mat because it was soft", self-attention lets "it" attend strongly to "mat" (the soft thing), resolving the pronoun from context.
Each word is projected into three vectors:
| Vector | Name | Meaning |
|---|---|---|
| Q | Query | "What am I looking for?" |
| K | Key | "What do I contain?" |
| V | Value | "What information do I give?" |
Why three? Separating them allows the model to learn different representations for "what to search for" vs "what to match against" vs "what to output". Using one vector for all three would be much less expressive.
The full formula: Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

| Part | What it does |
|---|---|
| Q × Kᵀ | Dot product → compatibility scores between queries and keys |
| ÷ √d_k | Scaling (d_k = dimension of keys) keeps dot products from growing so large that softmax saturates |
| softmax | Converts scores to probabilities (each row sums to 1) |
| × V | Weighted sum of values based on attention weights |
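The formula in the table above can be sketched in NumPy. Toy sizes and random inputs; the function name `scaled_dot_product_attention` is just a label for this sketch, not a library call:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) × V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = 4
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-mixed vector per token
```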
Multi-Head Attention runs multiple self-attention operations in parallel, each with different learned weight matrices.
Why multiple heads? Each head can learn a different kind of relationship: one may track syntax, another coreference, another positional patterns.
Original paper: 8 heads, each with dimension 64 (512/8).
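A minimal NumPy sketch of the 8-head split with the paper's dimensions; the projection matrices here are random (untrained), just to show the shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, h = 512, 8
d_k = d_model // h                      # 64 per head, as in the paper
rng = np.random.default_rng(1)
x = rng.normal(size=(10, d_model))      # 10 tokens

heads = []
for _ in range(h):  # each head has its own learned W_q, W_k, W_v (random here)
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k), scale=0.02) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # (10, 64) per head

W_o = rng.normal(size=(d_model, d_model), scale=0.02)
out = np.concatenate(heads, axis=-1) @ W_o  # concat back to d_model, then project
print(out.shape)  # (10, 512)
```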
Transformers process all words simultaneously — they have no inherent sense of word order.
Without positional encoding:
"The cat ate the fish" = "The fish ate the cat"
Same words, same embeddings — but completely different meanings!
Positional Encoding adds a unique vector to each position so the model knows word order.
Why sin and cos?
| Reason | Explanation |
|---|---|
| Unique fingerprint | Every position gets a different combination of values |
| Relative positions | PE(pos+k) is a linear function of PE(pos) (angle-addition identities) → model can learn relative distances |
| No length limit | Works for any sequence length (unlike learned embeddings) |
| Both sin AND cos | sin alone repeats (sin(0) = sin(π) = 0), cos distinguishes them |
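The scheme follows the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a NumPy sketch (the helper name is ours):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # one angle per (pos, i)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions:  cos
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512): one unique 512-dim fingerprint per position
```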
Same as Self-Attention, but future positions are masked out (set to -∞ before softmax).
When generating "love" in "I love learning":
Can see: "I" ✅
Can see: "love" ✅ (itself)
Cannot see: "learning" ❌ (future — not generated yet)
Why? During generation, the model produces words one at a time. If it could see future words, it would be cheating — copying the answer instead of learning to predict.
The mask sets future positions to -∞ → after softmax they become 0 → ignored.
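The -∞ trick can be shown in a few lines of NumPy, using toy equal scores for the 3 positions of "I love learning":

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 3                                  # "I love learning" → 3 positions
scores = np.ones((n, n))               # toy scores (equal before masking)
mask = np.triu(np.ones((n, n)), k=1)   # 1s above the diagonal = future positions
scores[mask == 1] = -np.inf            # future scores → -inf
weights = softmax(scores)              # -inf becomes exactly 0 after softmax
print(np.round(weights, 2))
# row 0 ("I")       : [1.   0.   0.  ]  — sees only itself
# row 1 ("love")    : [0.5  0.5  0.  ]  — sees "I" and itself, not "learning"
# row 2 ("learning"): [0.33 0.33 0.33]
```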
| | Self-Attention | Cross-Attention |
|---|---|---|
| Q from | Same sequence | Decoder |
| K, V from | Same sequence | Encoder output |
| Purpose | Understand within a sequence | Connect Decoder to Encoder |
| Used in | Both Encoder and Decoder | Decoder only |
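A shape-level sketch of cross-attention (projection matrices omitted for brevity; sizes are toy):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 64
rng = np.random.default_rng(2)
enc_out = rng.normal(size=(3, d))  # Encoder output: "I love learning" (3 tokens)
dec_x   = rng.normal(size=(2, d))  # Decoder states for 2 generated tokens

Q = dec_x          # queries come from the Decoder
K = V = enc_out    # keys and values come from the Encoder output
out = softmax(Q @ K.T / np.sqrt(d)) @ V
print(out.shape)   # (2, 64): one source-aware vector per Decoder position
```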
Example (English → Arabic): when generating "التعلم" ("learning"), the Decoder's query attends over the Encoder's representations of "I love learning" to pull in the relevant source context.
The sublayer's input is added directly to its output: output = x + Sublayer(x).
Why? The identity path lets gradients flow straight through deep stacks (combats vanishing gradients), and each sublayer only has to learn a correction to its input.
The formula: LayerNorm(x) = γ × (x − μ) / √(σ² + ε) + β

| Symbol | Meaning |
|---|---|
| μ | Mean of the vector |
| σ² | Variance |
| γ, β | Learnable parameters (scale and shift) |
| ε | Small constant to prevent division by zero |
Purpose: Normalizes values to mean ≈ 0, variance ≈ 1, which stabilizes training, allows higher learning rates, and reduces sensitivity to initialization.
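A NumPy sketch matching the symbols in the table (γ = 1 and β = 0 here, i.e. no learned scale/shift):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # μ: mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)   # σ²: variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta  # ε avoids division by 0

d = 512
x = np.random.default_rng(3).normal(loc=5.0, scale=3.0, size=(10, d))
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(float(y.mean()), float(y.var()))  # ≈ 0 and ≈ 1
```

In the original (post-norm) arrangement this combines with the residual connection as `x = layer_norm(x + sublayer(x), ...)`.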
Why needed? Self-attention only computes weighted averages (linear). The FFN adds non-linear transformation, allowing the model to learn complex patterns.
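The FFN is just a ReLU between two linear maps, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, applied identically at every position; a sketch with the paper's dimensions and random untrained weights:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand to d_ff, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                  # the paper's sizes
rng = np.random.default_rng(4)
x = rng.normal(size=(10, d_model))         # 10 positions
W1 = rng.normal(size=(d_model, d_ff), scale=0.02); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model), scale=0.02); b2 = np.zeros(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512): same shape in and out
```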
| Type | Examples | Best For | Direction |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, NER, understanding | Bidirectional |
| Decoder-only | GPT, GPT-4 | Text generation | Left → Right |
| Encoder-Decoder | BART, T5 | Translation, summarization | Full seq-to-seq |
| Parameter | Value |
|---|---|
| Encoder layers | 6 |
| Decoder layers | 6 |
| d_model (embedding size) | 512 |
| Number of heads (h) | 8 |
| d_k (per head) | 512/8 = 64 |
| FFN inner dimension | 2048 |
| Optimizer | Adam (with warmup schedule) |
| Training technique | Teacher Forcing |
During training, instead of feeding the model its own (possibly wrong) predictions, we feed it the correct previous token.
Example: Target = "أنا أحب التعلم" ("I love learning")
Step 1: Input = [START] → predict "أنا"
Step 2: Input = [START, أنا] → predict "أحب" (correct "أنا", not model's guess)
Step 3: Input = [START, أنا, أحب] → predict "التعلم"
Why? Speeds up training and prevents error accumulation (one wrong prediction causing all subsequent ones to be wrong).
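The shift-by-one mechanics of teacher forcing can be shown in plain Python, using the English gloss "I love learning" of the example target:

```python
# Teacher forcing: decoder inputs are the *gold* target shifted right by one,
# so all positions can be trained in parallel under a causal mask.
target = ["<s>", "I", "love", "learning", "</s>"]

decoder_input  = target[:-1]   # ['<s>', 'I', 'love', 'learning']
decoder_labels = target[1:]    # ['I', 'love', 'learning', '</s>']

for inp, lbl in zip(decoder_input, decoder_labels):
    print(f"sees up to {inp!r:12} -> must predict {lbl!r}")
```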
Translating "I love learning" → "أنا أحب التعلم":
Encoder: processes all three English words in parallel, producing one context-rich vector per word.
Decoder (one word at a time): starting from [START], each step uses masked self-attention over what it has generated so far and cross-attention over the Encoder's vectors, then predicts the next Arabic word.
| Encoder | Decoder | |
|---|---|---|
| Self-Attention | Unmasked — sees all words | Masked — can't see future |
| Cross-Attention | None | Yes — Q from Decoder, K/V from Encoder |
| Sub-layers per layer | 2 (Attention + FFN) | 3 (Masked Attn + Cross Attn + FFN) |
| Aspect | CNN | Transformer |
|---|---|---|
| Context window | Limited (kernel size) | Global (entire sequence) |
| Long-range dependencies | Needs many layers | Single attention layer |
| Parallelization | Good | Excellent |
| Position awareness | Built-in (local) | Needs Positional Encoding |
| Dominant in NLP? | No | Yes (since 2018) |
Training: Parallel ✅ — we have the full target sentence, so we use teacher forcing and process all positions at once (with masking).
Inference (generation): Sequential ❌ — we don't know the next word until we predict it, so we must generate one token at a time.
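The sequential inference loop, sketched with a hypothetical stand-in predictor (`fake_model` is not a real API; it just returns a fixed next token so the feedback loop is visible):

```python
answer = ["I", "love", "learning", "</s>"]

def fake_model(prefix):
    # Stand-in for a trained Decoder: "predict" the next token of a fixed sentence.
    return answer[len(prefix) - 1]

tokens = ["<s>"]
while tokens[-1] != "</s>" and len(tokens) < 10:
    tokens.append(fake_model(tokens))   # each step feeds back its own output
print(tokens)  # ['<s>', 'I', 'love', 'learning', '</s>']
```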