Mistral-style local attention, O(n*w) complexity
Hard | Attention
Implement Sliding Window Attention — used in Longformer, Mistral, etc. for efficient long-context processing.
Each position i can only attend to positions j where |i - j| ≤ w (the window size).
• Do NOT use sparse attention libraries
• Mask positions outside the window with -inf
• window_size=0: only self — output should equal V
• window_size >= seq_len: equivalent to full attention
Implement the function below. Use only basic PyTorch operations.
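The function stub itself is not reproduced here. A minimal sketch using only basic PyTorch operations might look like the following — the name sliding_window_attention and the (..., seq_len, d) tensor layout are assumptions, not part of the original statement. Note that this straightforward masked version still materializes the full n×n score matrix, so it gives the sliding-window semantics but not the O(n*w) memory savings:

```python
import torch

def sliding_window_attention(Q, K, V, window_size):
    # Q, K, V: (..., seq_len, d). Position i attends to j with |i - j| <= window_size.
    n, d = Q.shape[-2], Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5        # (..., n, n) scaled dot products

    # Band mask: True where |i - j| <= window_size
    idx = torch.arange(n, device=Q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window_size

    # Positions outside the window get -inf, so softmax assigns them zero weight
    scores = scores.masked_fill(~band, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```

With window_size=0 the band is just the diagonal, so each row's softmax is 1 over a single score and the output reduces to V; with window_size >= seq_len the mask keeps everything and the result matches full attention.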
Use this code to debug before submitting.
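The original debug snippet is not included in this copy. One self-contained way to check both edge cases is to compare against a brute-force per-position loop — naive_sliding_window_attention is a hypothetical name for this reference, not part of the exercise:

```python
import torch

def naive_sliding_window_attention(Q, K, V, window_size):
    # Brute-force reference: loop over positions, attend only inside the band.
    n, d = Q.shape[-2], Q.shape[-1]
    out = torch.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window_size), min(n, i + window_size + 1)
        s = Q[..., i:i + 1, :] @ K[..., lo:hi, :].transpose(-2, -1) / d ** 0.5
        out[..., i, :] = (torch.softmax(s, dim=-1) @ V[..., lo:hi, :]).squeeze(-2)
    return out

torch.manual_seed(0)
Q, K, V = (torch.randn(2, 6, 8) for _ in range(3))

# window_size=0: softmax over a single score is 1, so the output equals V
assert torch.allclose(naive_sliding_window_attention(Q, K, V, 0), V, atol=1e-5)

# window_size >= seq_len: must match full attention
full = torch.softmax(Q @ K.transpose(-2, -1) / 8 ** 0.5, dim=-1) @ V
assert torch.allclose(naive_sliding_window_attention(Q, K, V, 6), full, atol=1e-5)
```

Running your vectorized implementation against this loop on a few random window sizes is a quick sanity check before submitting.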
Try solving it yourself before revealing the solution.
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then call check("sliding_window").