
Causal Self-Attention

Autoregressive masking with -inf, GPT-style

Difficulty: Hard · Tag: Attention

Problem Description

Implement causal (masked) self-attention, the attention mechanism used in GPT-style decoders.

Same as softmax attention, but each position can only attend to itself and earlier positions (no peeking at future tokens).

$$\text{scores}_{ij} = \begin{cases} \frac{Q_i \cdot K_j}{\sqrt{d_k}} & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
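Concretely, the $j > i$ entries form the strictly upper-triangular part of the score matrix. A minimal sketch of what the mask looks like for seq = 4, built with torch.triu (which the rules permit):

```python
import torch

S = 4  # example sequence length
# True where j > i, i.e. the future positions to fill with -inf
mask = torch.triu(torch.ones(S, S, dtype=torch.bool), diagonal=1)
print(mask.int())
# tensor([[0, 1, 1, 1],
#         [0, 0, 1, 1],
#         [0, 0, 0, 1],
#         [0, 0, 0, 0]], dtype=torch.int32)
```

Row i of the mask marks exactly the positions that row i of the softmax must ignore.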

Signature

```python
def causal_attention(Q, K, V):
    # Q, K: (batch, seq, d_k); V: (batch, seq, d_v) -> output: (batch, seq, d_v)
```

Rules

• Do NOT use F.scaled_dot_product_attention

• Position i can only attend to positions j ≤ i

• You may use torch.softmax, torch.bmm, torch.triu

Template

Implement the function below. Use only basic PyTorch operations.

# โœ๏ธ YOUR IMPLEMENTATION HERE def causal_attention(Q, K, V): pass # Replace this

Test Your Implementation

Use this code to debug before submitting.

```python
# 🧪 Debug
import torch

torch.manual_seed(0)
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)

out = causal_attention(Q, K, V)
print("Output shape:", out.shape)  # (1, 4, 8)
# Position 0 can only attend to itself, so its output must equal V[:, 0]
print("Pos 0 == V[0]?", torch.allclose(out[:, 0], V[:, 0], atol=1e-5))  # should be True
```

Reference Solution

Try solving it yourself first! Click below to reveal the solution.

```python
# ✅ SOLUTION
import math
import torch

def causal_attention(Q, K, V):
    d_k = K.size(-1)
    # Scaled dot-product scores: (batch, seq, seq)
    scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(d_k)
    S = scores.size(-1)
    # True above the diagonal (j > i): the future positions to hide
    mask = torch.triu(torch.ones(S, S, device=scores.device, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask.unsqueeze(0), float('-inf'))
    weights = torch.softmax(scores, dim=-1)  # -inf entries get zero weight
    return torch.bmm(weights, V)
```
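As an extra sanity check (for testing only, since the rules forbid it inside your solution), the same computation can be compared against PyTorch's built-in F.scaled_dot_product_attention with is_causal=True, assuming PyTorch ≥ 2.0:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q, K, V = torch.randn(3, 1, 4, 8).unbind(0)  # three (1, 4, 8) tensors

# Manual causal attention, same steps as the solution above
scores = torch.bmm(Q, K.transpose(1, 2)) / math.sqrt(K.size(-1))
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)
out = torch.softmax(scores.masked_fill(mask, float('-inf')), dim=-1) @ V

# PyTorch's fused reference with a built-in causal mask
expected = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
print(torch.allclose(out, expected, atol=1e-5))  # True
```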

Tips

Run Locally

For interactive practice with auto-grading, run TorchCode locally:
`pip install torch-judge`, then use `check("causal_attention")`

Key Concepts

Autoregressive masking with -inf, GPT-style

