Pre-norm, causal MHA + MLP (4x, GELU), residual
Hard · Architecture

Implement a full GPT-2 style Transformer block, combining everything you've learned.
• Inherit from nn.Module
• self.ln1, self.ln2: nn.LayerNorm(d_model)
• self.W_q, self.W_k, self.W_v, self.W_o: nn.Linear for attention
• self.mlp: nn.Sequential(Linear(d, 4d), GELU(), Linear(4d, d))
• Attention must be causal (mask future positions)
• Pre-norm architecture (LayerNorm *before* attention and MLP)
• Residual connections around both attention and MLP
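The requirements above can be sketched as follows. This is a minimal reference sketch, not the graded solution: the class name `GPT2Block`, the `n_heads` constructor argument, and the `(batch, seq, d_model)` shape convention are assumptions beyond the spec.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPT2Block(nn.Module):
    """Pre-norm GPT-2 style block: causal multi-head attention + 4x GELU MLP,
    with residual connections around both sublayers. (Sketch; names beyond
    the spec's ln1/ln2/W_q/W_k/W_v/W_o/mlp are illustrative.)"""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand to 4x
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        # Pre-norm: LayerNorm BEFORE attention; residual added afterwards.
        h = self.ln1(x)
        q = self.W_q(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.W_k(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.W_v(h).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product scores: (B, heads, T, T)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_head)
        # Causal mask: positions strictly above the diagonal (the future)
        # are set to -inf so softmax gives them zero weight.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        att = att.masked_fill(mask, float("-inf"))
        att = F.softmax(att, dim=-1)
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.W_o(out)          # residual around attention
        x = x + self.mlp(self.ln2(x))  # pre-norm MLP with residual
        return x
```

A quick sanity check: perturbing the last position must leave earlier outputs unchanged if the mask is truly causal.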
Implement the module below. Use only basic PyTorch operations.
Use this code to debug before submitting.
Try solving it yourself first before revealing the solution.
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then use check("gpt2_block").