🧠 Transformer — Questions & Answers

21 questions covering every component of the Transformer architecture

📝 21 Questions 📊 Covers All Components 🎯 Exam Ready

📋 Table of Contents

Q1. What is a Transformer?
Q2. Transformer vs RNN/LSTM
Q3. Main components
Q4. Self-Attention explained
Q5. What are Q, K, V?
Q6. Self-Attention formula
Q7. Multi-Head Attention
Q8. Why Positional Encoding?
Q9. PE formula (sin/cos)
Q10. Masked Attention
Q11. Cross-Attention
Q12. Residual Connections
Q13. Layer Normalization
Q14. Feed-Forward Network
Q15. Three model types
Q16. Key hyperparameters
Q17. Teacher Forcing
Q18. Full translation pipeline
Q19. Encoder vs Decoder attention
Q20. Transformer vs CNN
Bonus. Why is generation sequential?
Q1. What is a Transformer?

A Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. (Google).

It processes entire sequences in parallel using a mechanism called self-attention — no recurrence (RNN) or convolution (CNN) needed.

It consists of an Encoder (understands the input) and a Decoder (generates the output).

Key insight: "Attention Is All You Need" — the entire architecture is built around the attention mechanism.
Q2. What problems does the Transformer solve compared to RNNs/LSTMs?

| Problem with RNN/LSTM | Transformer Solution |
|---|---|
| Sequential processing (slow) | Parallel processing (fast) |
| Vanishing gradients | Residual connections + LayerNorm |
| Hard to capture long-range dependencies | Self-attention sees ALL words at once |
| Can't scale well | Scales efficiently with GPU parallelism |
Q3. What are the main components of the Transformer?

| # | Component | Purpose |
|---|---|---|
| 1 | Input Embedding | Converts words to vectors |
| 2 | Positional Encoding | Adds position information (sin/cos) |
| 3 | Multi-Head Self-Attention | Each word attends to all others |
| 4 | Feed-Forward Network | Two-layer neural net per position |
| 5 | Residual + LayerNorm | Stabilize training |
| 6 | Linear + Softmax | Predict next word (Decoder output) |
Q4. What is Self-Attention? Explain with an example.

Self-Attention allows each word to look at all other words in the sentence to understand its context.

Example: In "The cat sat on the mat because it was soft", self-attention lets "it" attend strongly to "mat", resolving what "it" refers to.

Steps:

  1. Create Q (Query), K (Key), V (Value) vectors for each word
  2. Score = Q × Kᵀ (how relevant is each word to each other word?)
  3. Scale by √d_k
  4. Apply Softmax → probabilities
  5. Multiply by V → weighted output
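The five steps above can be sketched in NumPy. This is a minimal illustration with random, untrained weight matrices, not a real model:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # step 1: project to Q, K, V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # steps 2-3: score, then scale
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # step 4: softmax over each row
    return weights @ V                             # step 5: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                        # 5 tokens, d_model = 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): one contextualized vector per token
```

Each output row is a mixture of all value vectors, which is exactly how every word gets to "see" every other word.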
Q5. What are Q, K, and V? Why three separate vectors?

| Vector | Name | Meaning |
|---|---|---|
| Q | Query | "What am I looking for?" |
| K | Key | "What do I contain?" |
| V | Value | "What information do I give?" |

Why three? Separating them allows the model to learn different representations for "what to search for" vs "what to match against" vs "what to output". Using one vector for all three would be much less expressive.

Analogy — Library: Q = your search query | K = book titles (index) | V = book content (what you actually read)
Q6. Write the Self-Attention formula and explain each part.

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

| Part | What it does |
|---|---|
| Q × Kᵀ | Dot product → compatibility scores between queries and keys |
| √d_k | Scaling factor (d_k = dimension of keys) to prevent large values |
| softmax | Converts scores to probabilities (sum = 1) |
| × V | Weighted sum of values based on attention weights |
Why divide by √d_k? Without scaling, large d_k → large dot products → softmax outputs only values near 0 and 1 → gradients vanish → the model can't learn.
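The saturation effect is easy to demonstrate with toy numbers, assuming standard-normal queries and keys. With d_k = 512, the unscaled dot products have a standard deviation of about √512 ≈ 22, so softmax collapses toward one-hot; the scaled scores stay moderate:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_k = 512
q = rng.normal(size=(d_k,))
k = rng.normal(size=(4, d_k))        # 4 keys to attend over
raw = k @ q                          # unscaled: values spread over roughly ±2*sqrt(d_k)
scaled = raw / np.sqrt(d_k)          # scaled: values of order 1
print(softmax(raw))                  # typically close to one-hot
print(softmax(scaled))               # noticeably smoother distribution
```

A flatter attention distribution keeps useful gradient flowing to every position during training.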
Q7. What is Multi-Head Attention? Why not single attention?

Multi-Head Attention runs multiple self-attention operations in parallel, each with different learned weight matrices.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) × W^O
where headᵢ = Attention(Q × Wᵢ^Q, K × Wᵢ^K, V × Wᵢ^V)

Why multiple heads? Each head learns different relationships — for example, one head may focus on syntactic structure while another tracks references like "it" → "the cat".

Original paper: 8 heads, each with dimension 64 (512/8).
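The head-splitting idea can be sketched as follows. This toy version loops over heads with random, untrained weights; real implementations project once and reshape instead of looping:

```python
import numpy as np

def multi_head_attention(X, heads, d_model):
    """Run `heads` independent attentions of width d_model//heads, concat, mix with W_O."""
    rng = np.random.default_rng(42)
    d_k = d_model // heads                          # e.g. 512 / 8 = 64 in the paper
    outputs = []
    for _ in range(heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k), scale=0.1) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_k)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V)                       # each head: (seq_len, d_k)
    W_O = rng.normal(size=(d_model, d_model), scale=0.1)
    return np.concatenate(outputs, axis=-1) @ W_O   # back to (seq_len, d_model)

X = np.random.default_rng(0).normal(size=(5, 64))
out = multi_head_attention(X, heads=8, d_model=64)
print(out.shape)  # (5, 64)
```

Note the total cost is similar to one full-width attention: each head works in a d_model/h slice of the space.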

Q8. Why do we need Positional Encoding?

Transformers process all words simultaneously — they have no inherent sense of word order.

Without positional encoding:

"The cat ate the fish"  =  "The fish ate the cat"

Same words, same embeddings — but completely different meanings!

Positional Encoding adds a unique vector to each position so the model knows word order.

Q9. Write the Positional Encoding formula. Why sin/cos?
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Why sin and cos?

| Reason | Explanation |
|---|---|
| Unique fingerprint | Every position gets a different combination of values |
| Relative positions | sin(A+B) is a linear function of sin(A), cos(A) → model learns relative distances |
| No length limit | Works for any sequence length (unlike learned embeddings) |
| Both sin AND cos | sin alone repeats (sin(0) = sin(π) = 0); cos distinguishes those positions |
Why 10000? Controls frequency range — early dimensions = fast (like seconds hand), later dimensions = slow (like hours hand). This covers both nearby and distant position differences.
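The formula translates directly into a few lines of NumPy (a sketch; d_model is assumed even so the sin/cos halves line up):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(same angle)."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices 2i
    angle = pos / 10000 ** (i / d_model)          # slow frequencies as i grows
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # even dims get sin
    pe[:, 1::2] = np.cos(angle)                   # odd dims get cos
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0])      # position 0: all sin terms are 0, all cos terms are 1
```

This matrix is simply added to the word embeddings before the first layer.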
Q10. What is Masked Self-Attention? Why is it needed?

Same as Self-Attention, but future positions are masked out (set to -∞ before softmax).

When generating "love" in "I love learning":
  Can see: "I"        ✅
  Can see: "love"     ✅ (itself)
  Cannot see: "learning" ❌ (future — not generated yet)

Why? During generation, the model produces words one at a time. If it could see future words, it would be cheating — copying the answer instead of learning to predict.

The mask sets future positions to -∞ → after softmax they become 0 → ignored.
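A minimal sketch of the causal mask: with equal raw scores, each position simply averages over itself and the past, and future positions get exactly zero weight:

```python
import numpy as np

def masked_softmax(scores):
    """Set strictly-future positions to -inf before softmax so they get weight 0."""
    seq_len = scores.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # above the diagonal
    scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))       # equal raw scores for "I love learning"
w = masked_softmax(scores)
print(w)
# row 0 attends only to "I"; row 1 splits over "I"/"love"; row 2 sees all three
```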

Q11. What is Cross-Attention? How is it different from Self-Attention?

| | Self-Attention | Cross-Attention |
|---|---|---|
| Q from | Same sequence | Decoder |
| K, V from | Same sequence | Encoder output |
| Purpose | Understand within a sequence | Connect Decoder to Encoder |
| Used in | Both Encoder and Decoder | Decoder only |

Example (English → Arabic): when generating "التعلم" ("learning"), the Decoder's Query attends to the Encoder's representations of "I love learning", pulling the meaning of "learning" into the generated word.

Q12. What are Residual (Skip) Connections?
output = x + Sublayer(x)

The original input is added directly to the sublayer's output.

Why?

  1. Prevents vanishing gradients — gradients flow directly through the skip connection
  2. Preserves information — original input is never lost
  3. Enables deeper networks — without them, 6+ layers would be very hard to train
Analogy: Taking notes in class. Even after processing (Sublayer), you keep your original notes (x) and add new insights on top.
Q13. What is Layer Normalization?

LayerNorm(x) = γ × (x − μ) / √(σ² + ε) + β

| Symbol | Meaning |
|---|---|
| μ | Mean of the vector |
| σ² | Variance |
| γ, β | Learnable parameters (scale and shift) |
| ε | Small constant to prevent division by zero |

Purpose: Normalizes values to mean ≈ 0, variance ≈ 1, which stabilizes training, allows higher learning rates, and reduces sensitivity to initialization.
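The formula maps directly onto a few lines of NumPy (a sketch, not a framework implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector (last axis) to mean 0, variance 1, then scale/shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.var())  # approximately 0.0 and 1.0
```

Unlike BatchNorm, the statistics are computed per token vector, so they do not depend on the batch or the sequence length.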

Q14. What is the Feed-Forward Network?
FFN(x) = max(0, x·W₁ + b₁) · W₂ + b₂

Why needed? Self-attention only computes weighted averages (linear). The FFN adds non-linear transformation, allowing the model to learn complex patterns.
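A sketch with the paper's dimensions (512 → 2048 → 512), using random untrained weights. Note the same two matrices are applied to every position independently:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, ReLU, project back."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1, b1 = rng.normal(size=(d_model, d_ff), scale=0.02), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model), scale=0.02), np.zeros(d_model)

x = rng.normal(size=(5, d_model))     # 5 positions, each transformed independently
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (5, 512)
```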

Q15. What are the three types of Transformer models?

| Type | Examples | Best For | Direction |
|---|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, NER, understanding | Bidirectional |
| Decoder-only | GPT, GPT-4 | Text generation | Left → Right |
| Encoder-Decoder | BART, T5 | Translation, summarization | Full seq-to-seq |
Q16. What are the key hyperparameters of the original Transformer?

| Parameter | Value |
|---|---|
| Encoder layers | 6 |
| Decoder layers | 6 |
| d_model (embedding size) | 512 |
| Number of heads (h) | 8 |
| d_k (per head) | 512 / 8 = 64 |
| FFN inner dimension | 2048 |
| Optimizer | Adam (with warmup schedule) |
| Training technique | Teacher Forcing |
Q17. What is Teacher Forcing?

During training, instead of feeding the model its own (possibly wrong) predictions, we feed it the correct previous token.

Example: Target = "أنا أحب التعلم"

Step 1: Input = [START]              → predict "أنا"
Step 2: Input = [START, أنا]         → predict "أحب"     (correct "أنا", not model's guess)
Step 3: Input = [START, أنا, أحب]    → predict "التعلم"

Why? Speeds up training and prevents error accumulation (one wrong prediction causing all subsequent ones to be wrong).
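The stepping pattern above can be written out directly. This is a toy illustration of how (input prefix, target token) pairs are formed; real training runs all steps in one batched, masked forward pass:

```python
# Teacher forcing: at every step the decoder is fed the GOLD prefix from the
# training data, never its own earlier guesses.
target = ["[START]", "أنا", "أحب", "التعلم", "[END]"]

steps = []
for t in range(1, len(target)):
    prefix = target[:t]                 # correct tokens up to this step
    steps.append((prefix, target[t]))   # (what the model sees, what it must predict)

for prefix, nxt in steps:
    print(prefix, "->", nxt)
```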

Q18. How does the complete pipeline work for translation?

Translating "I love learning" → "أنا أحب التعلم":

Encoder:

  1. Embed each English word → vectors
  2. Add Positional Encoding
  3. Pass through 6 encoder layers (Self-Attention → FFN)
  4. Output: contextualized representations

Decoder (one word at a time):

  1. Start with [START] token
  2. Embed + Positional Encoding
  3. Masked Self-Attention (can't see future)
  4. Cross-Attention (attend to Encoder output)
  5. FFN → Linear → Softmax → predict "أنا"
  6. Repeat with [START, أنا] → predict "أحب"
  7. Repeat → predict "التعلم"
  8. Repeat → predict [END] → stop
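The decoder loop above can be sketched as follows. Here `encode` and `decode_step` are hypothetical stand-ins for a trained encoder and one decoder forward pass; canned outputs make the loop runnable:

```python
def translate(source_tokens, encode, decode_step, max_len=20):
    """Greedy decoding sketch: encode once, then generate one token per iteration."""
    memory = encode(source_tokens)                # run the encoder a single time
    output = ["[START]"]
    for _ in range(max_len):
        next_token = decode_step(output, memory)  # masked self-attn + cross-attn + softmax
        if next_token == "[END]":
            break                                 # step 8: stop on end-of-sequence
        output.append(next_token)
    return output[1:]                             # drop the [START] token

# Toy stand-ins (NOT a real model) so the sketch runs end to end:
canned = iter(["أنا", "أحب", "التعلم", "[END]"])
result = translate(["I", "love", "learning"],
                   encode=lambda src: src,
                   decode_step=lambda out, mem: next(canned))
print(result)  # ['أنا', 'أحب', 'التعلم']
```

Note the encoder runs once, while `decode_step` runs once per generated token, which is exactly the asymmetry discussed in the bonus question below.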
Q19. Encoder attention vs Decoder attention?

| | Encoder | Decoder |
|---|---|---|
| Self-Attention | Unmasked — sees all words | Masked — can't see future |
| Cross-Attention | None | Yes — Q from Decoder, K/V from Encoder |
| Sub-layers per layer | 2 (Attention + FFN) | 3 (Masked Attn + Cross-Attn + FFN) |
Q20. Why is the Transformer better than CNN for NLP?

| Aspect | CNN | Transformer |
|---|---|---|
| Context window | Limited (kernel size) | Global (entire sequence) |
| Long-range dependencies | Needs many layers | Single attention layer |
| Parallelization | Good | Excellent |
| Position awareness | Built-in (local) | Needs Positional Encoding |
| Dominant in NLP? | No | Yes (since 2018) |
Bonus: If Transformers are parallel, why is generation sequential?

Training: Parallel ✅ — we have the full target sentence, so we use teacher forcing and process all positions at once (with masking).

Inference (generation): Sequential ❌ — we don't know the next word until we predict it, so we must generate one token at a time.

This is why large language models like GPT-4 still output words one by one, even though internally each step processes all previous tokens in parallel.


Prepared by Dr. Abdulkarim Albanna — AI Applications Course