21 questions covering every component of the Transformer architecture
A Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. (Google).
It processes entire sequences in parallel using a mechanism called self-attention — no recurrence (RNN) or convolution (CNN) needed.
It consists of an Encoder (understands the input) and a Decoder (generates the output).
| Problem with RNN/LSTM | Transformer Solution |
|---|---|
| Sequential processing (slow) | Parallel processing (fast) |
| Vanishing gradients | Residual connections + LayerNorm |
| Hard to capture long-range dependencies | Self-attention sees ALL words at once |
| Can't scale well | Scales efficiently with GPU parallelism |
| # | Component | Purpose |
|---|---|---|
| 1 | Input Embedding | Converts words to vectors |
| 2 | Positional Encoding | Adds position information (sin/cos) |
| 3 | Multi-Head Self-Attention | Each word attends to all others |
| 4 | Feed-Forward Network | Two-layer neural net per position |
| 5 | Residual + LayerNorm | Stabilize training |
| 6 | Linear + Softmax | Predict next word (Decoder output) |
Self-Attention allows each word to look at all other words in the sentence to understand its context.
Example: In "The cat sat on the mat because it was soft", self-attention lets "it" attend strongly to "mat" (the soft thing), resolving the pronoun from context.
Each word is projected into three vectors:
| Vector | Name | Meaning |
|---|---|---|
| Q | Query | "What am I looking for?" |
| K | Key | "What do I contain?" |
| V | Value | "What information do I give?" |
Why three? Separating them allows the model to learn different representations for "what to search for" vs "what to match against" vs "what to output". Using one vector for all three would be much less expressive.
The full formula: Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

| Part | What it does |
|---|---|
| Q × Kᵀ | Dot product → compatibility scores between queries and keys |
| ÷ √d_k | Scaling (d_k = dimension of keys) keeps dot products from growing so large that softmax saturates |
| softmax | Converts scores to probabilities (each row sums to 1) |
| × V | Weighted sum of values based on attention weights |
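The formula in the table above can be sketched in NumPy. Toy sizes and random inputs; the function name `scaled_dot_product_attention` is just a label for this sketch, not a library call:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) × V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # compatibility scores
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 tokens, d_k = 4
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one context-mixed vector per token
```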
Multi-Head Attention runs multiple self-attention operations in parallel, each with different learned weight matrices.
Why multiple heads? Each head can learn a different kind of relationship: one may track syntax, another coreference, another positional patterns.
Original paper: 8 heads, each with dimension 64 (512/8).
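A minimal NumPy sketch of the 8-head split with the paper's dimensions; the projection matrices here are random (untrained), just to show the shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, h = 512, 8
d_k = d_model // h                      # 64 per head, as in the paper
rng = np.random.default_rng(1)
x = rng.normal(size=(10, d_model))      # 10 tokens

heads = []
for _ in range(h):  # each head has its own learned W_q, W_k, W_v (random here)
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k), scale=0.02) for _ in range(3))
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # (10, 64) per head

W_o = rng.normal(size=(d_model, d_model), scale=0.02)
out = np.concatenate(heads, axis=-1) @ W_o  # concat back to d_model, then project
print(out.shape)  # (10, 512)
```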
Transformers process all words simultaneously — they have no inherent sense of word order.
Without positional encoding:
"The cat ate the fish" = "The fish ate the cat"
Same words, same embeddings — but completely different meanings!
Positional Encoding adds a unique vector to each position so the model knows word order.
Why sin and cos?
| Reason | Explanation |
|---|---|
| Unique fingerprint | Every position gets a different combination of values |
| Relative positions | PE(pos+k) is a linear function of PE(pos) (angle-addition identities) → model can learn relative distances |
| No length limit | Works for any sequence length (unlike learned embeddings) |
| Both sin AND cos | sin alone repeats (sin(0) = sin(π) = 0), cos distinguishes them |
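The scheme follows the paper's formula PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); a NumPy sketch (the helper name is ours):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # one angle per (pos, i)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions:  cos
    return pe

pe = positional_encoding(50, 512)
print(pe.shape)  # (50, 512): one unique 512-dim fingerprint per position
```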
Same as Self-Attention, but future positions are masked out (set to -∞ before softmax).
When generating "love" in "I love learning":
Can see: "I" ✅
Can see: "love" ✅ (itself)
Cannot see: "learning" ❌ (future — not generated yet)
Why? During generation, the model produces words one at a time. If it could see future words, it would be cheating — copying the answer instead of learning to predict.
The mask sets future positions to -∞ → after softmax they become 0 → ignored.
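The -∞ trick can be shown in a few lines of NumPy, using toy equal scores for the 3 positions of "I love learning":

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n = 3                                  # "I love learning" → 3 positions
scores = np.ones((n, n))               # toy scores (equal before masking)
mask = np.triu(np.ones((n, n)), k=1)   # 1s above the diagonal = future positions
scores[mask == 1] = -np.inf            # future scores → -inf
weights = softmax(scores)              # -inf becomes exactly 0 after softmax
print(np.round(weights, 2))
# row 0 ("I")       : [1.   0.   0.  ]  — sees only itself
# row 1 ("love")    : [0.5  0.5  0.  ]  — sees "I" and itself, not "learning"
# row 2 ("learning"): [0.33 0.33 0.33]
```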
| | Self-Attention | Cross-Attention |
|---|---|---|
| Q from | Same sequence | Decoder |
| K, V from | Same sequence | Encoder output |
| Purpose | Understand within a sequence | Connect Decoder to Encoder |
| Used in | Both Encoder and Decoder | Decoder only |
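A shape-level sketch of cross-attention (projection matrices omitted for brevity; sizes are toy):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d = 64
rng = np.random.default_rng(2)
enc_out = rng.normal(size=(3, d))  # Encoder output: "I love learning" (3 tokens)
dec_x   = rng.normal(size=(2, d))  # Decoder states for 2 generated tokens

Q = dec_x          # queries come from the Decoder
K = V = enc_out    # keys and values come from the Encoder output
out = softmax(Q @ K.T / np.sqrt(d)) @ V
print(out.shape)   # (2, 64): one source-aware vector per Decoder position
```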
Example (English → Arabic): when generating "التعلم" ("learning"), the Decoder's query attends over the Encoder's representations of "I love learning" to pull in the relevant source context.
The sublayer's input is added directly to its output: output = x + Sublayer(x).
Why? The identity path lets gradients flow straight through deep stacks (combats vanishing gradients), and each sublayer only has to learn a correction to its input.
The formula: LayerNorm(x) = γ × (x − μ) / √(σ² + ε) + β

| Symbol | Meaning |
|---|---|
| μ | Mean of the vector |
| σ² | Variance |
| γ, β | Learnable parameters (scale and shift) |
| ε | Small constant to prevent division by zero |
Purpose: Normalizes values to mean ≈ 0, variance ≈ 1, which stabilizes training, allows higher learning rates, and reduces sensitivity to initialization.
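A NumPy sketch matching the symbols in the table (γ = 1 and β = 0 here, i.e. no learned scale/shift):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)   # μ: mean over the feature dimension
    var = x.var(axis=-1, keepdims=True)   # σ²: variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta  # ε avoids division by 0

d = 512
x = np.random.default_rng(3).normal(loc=5.0, scale=3.0, size=(10, d))
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(float(y.mean()), float(y.var()))  # ≈ 0 and ≈ 1
```

In the original (post-norm) arrangement this combines with the residual connection as `x = layer_norm(x + sublayer(x), ...)`.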
Why needed? Self-attention only computes weighted averages (linear). The FFN adds non-linear transformation, allowing the model to learn complex patterns.
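The FFN is just a ReLU between two linear maps, FFN(x) = max(0, xW₁ + b₁)W₂ + b₂, applied identically at every position; a sketch with the paper's dimensions and random untrained weights:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: expand to d_ff, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                  # the paper's sizes
rng = np.random.default_rng(4)
x = rng.normal(size=(10, d_model))         # 10 positions
W1 = rng.normal(size=(d_model, d_ff), scale=0.02); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model), scale=0.02); b2 = np.zeros(d_model)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (10, 512): same shape in and out
```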
| Type | Examples | Best For | Direction |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, NER, understanding | Bidirectional |
| Decoder-only | GPT, GPT-4 | Text generation | Left → Right |
| Encoder-Decoder | BART, T5 | Translation, summarization | Full seq-to-seq |
| Parameter | Value |
|---|---|
| Encoder layers | 6 |
| Decoder layers | 6 |
| d_model (embedding size) | 512 |
| Number of heads (h) | 8 |
| d_k (per head) | 512/8 = 64 |
| FFN inner dimension | 2048 |
| Optimizer | Adam (with warmup schedule) |
| Training technique | Teacher Forcing |
During training, instead of feeding the model its own (possibly wrong) predictions, we feed it the correct previous token.
Example: Target = "أنا أحب التعلم" ("I love learning")
Step 1: Input = [START] → predict "أنا"
Step 2: Input = [START, أنا] → predict "أحب" (correct "أنا", not model's guess)
Step 3: Input = [START, أنا, أحب] → predict "التعلم"
Why? Speeds up training and prevents error accumulation (one wrong prediction causing all subsequent ones to be wrong).
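The shift-by-one mechanics of teacher forcing can be shown in plain Python, using the English gloss "I love learning" of the example target:

```python
# Teacher forcing: decoder inputs are the *gold* target shifted right by one,
# so all positions can be trained in parallel under a causal mask.
target = ["<s>", "I", "love", "learning", "</s>"]

decoder_input  = target[:-1]   # ['<s>', 'I', 'love', 'learning']
decoder_labels = target[1:]    # ['I', 'love', 'learning', '</s>']

for inp, lbl in zip(decoder_input, decoder_labels):
    print(f"sees up to {inp!r:12} -> must predict {lbl!r}")
```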
Translating "I love learning" → "أنا أحب التعلم":
Encoder: processes all three English words in parallel, producing one context-rich vector per word.
Decoder (one word at a time): starting from [START], each step uses masked self-attention over what it has generated so far and cross-attention over the Encoder's vectors, then predicts the next Arabic word.
| Encoder | Decoder | |
|---|---|---|
| Self-Attention | Unmasked — sees all words | Masked — can't see future |
| Cross-Attention | None | Yes — Q from Decoder, K/V from Encoder |
| Sub-layers per layer | 2 (Attention + FFN) | 3 (Masked Attn + Cross Attn + FFN) |
| Aspect | CNN | Transformer |
|---|---|---|
| Context window | Limited (kernel size) | Global (entire sequence) |
| Long-range dependencies | Needs many layers | Single attention layer |
| Parallelization | Good | Excellent |
| Position awareness | Built-in (local) | Needs Positional Encoding |
| Dominant in NLP? | No | Yes (since 2018) |
Training: Parallel ✅ — we have the full target sentence, so we use teacher forcing and process all positions at once (with masking).
Inference (generation): Sequential ❌ — we don't know the next word until we predict it, so we must generate one token at a time.
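The sequential inference loop, sketched with a hypothetical stand-in predictor (`fake_model` is not a real API; it just returns a fixed next token so the feedback loop is visible):

```python
answer = ["I", "love", "learning", "</s>"]

def fake_model(prefix):
    # Stand-in for a trained Decoder: "predict" the next token of a fixed sentence.
    return answer[len(prefix) - 1]

tokens = ["<s>"]
while tokens[-1] != "</s>" and len(tokens) < 10:
    tokens.append(fake_model(tokens))   # each step feeds back its own output
print(tokens)  # ['<s>', 'I', 'love', 'learning', '</s>']
```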