
Causal Self-Attention

Autoregressive masking with -inf, GPT-style

Difficulty: Hard · Tag: Attention

Problem Description

Implement causal (masked) self-attention, the attention mechanism used in GPT-style decoders.

Same as softmax attention, but each position can only attend to itself and earlier positions (no peeking at future tokens).

$$\text{scores}_{ij} = \begin{cases} \frac{Q_i \cdot K_j}{\sqrt{d_k}} & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
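Concretely, the $j > i$ entries form the strictly upper-triangular part of the score matrix. A minimal sketch of what the mask looks like for seq = 4, built with torch.triu (which the rules permit):

```python
import torch

S = 4  # example sequence length
# True where j > i, i.e. the future positions to fill with -inf
mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)
print(mask.int())
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 0]], dtype=torch.int32)
```

Row i of the mask marks exactly the positions that row i of the softmax must ignore.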

Signature

```python
def causal_attention(Q, K, V):
    # Q, K: (batch, seq, d_k); V: (batch, seq, d_v) -> output: (batch, seq, d_v)
```

Rules

• Do NOT use F.scaled_dot_product_attention

• Position i can only attend to positions j ≤ i

• You may use torch.softmax, torch.bmm, torch.triu

Template

Implement the function below. Use only basic PyTorch operations.

# โœ๏ธ YOUR IMPLEMENTATION HERE def causal_attention(Q, K, V): pass # Replace this

Test Your Implementation

Use this code to debug before submitting.

```python
# 🧪 Debug
import torch

torch.manual_seed(0)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)

out = causal_attention(Q, K, V)
print("Output shape:", out.shape)  # (1, 4, 8)
# Position 0 can only attend to itself, so its output must equal V[:, 0]
print("Pos 0 == V[0]?", torch.allclose(out[:, 0], V[:, 0], atol=1e-5))  # should be True
```

Reference Solution

Try solving it yourself first! Click below to reveal the solution.

```python
# ✅ SOLUTION
import math
import torch

def causal_attention(Q, K, V):
    d_k = K.size(-1)
    # Scaled dot-product scores: (batch, seq, seq)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    S = scores.size(-1)
    # True above the diagonal (j > i): the future positions to hide
    mask = torch.triu(torch.ones(S, S, device=scores.device, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask.unsqueeze(0), float('-inf'))
    weights = torch.softmax(scores, dim=-1)  # -inf entries get zero weight
    return torch.bmm(weights, V)
```
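As an extra sanity check (for testing only, since the rules forbid it inside your solution), the same computation can be compared against PyTorch's built-in F.scaled_dot_product_attention with is_causal=True, assuming PyTorch ≥ 2.0:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q, K, V = torch.randn(3, 1, 4, 8).unbind(0)  # three (1, 4, 8) tensors

# Manual causal attention, same steps as the solution above
scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(K.size(-1))
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
out = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1) @ V

# PyTorch's fused reference with a built-in causal mask
expected = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(torch.allclose(out, expected, atol=1e-5))  # True
```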

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:
`pip install torch-judge`, then use `check("causal_attention")`

Key Concepts

Autoregressive masking with -inf, GPT-style

