Mistral-style local attention, O(n*w) complexity
Hard | Attention
Implement Sliding Window Attention — used in Longformer, Mistral, etc. for efficient long-context processing.
Each position i can only attend to positions j where |i - j| ≤ w (the window size).
• Do NOT use sparse attention libraries
• Mask positions outside the window with -inf
• window_size=0: only self — output should equal V
• window_size >= seq_len: equivalent to full attention
Implement the function below. Use only basic PyTorch operations.
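The function stub itself is not reproduced here. A minimal sketch using only basic PyTorch operations might look like the following — the name sliding_window_attention and the (..., seq_len, d) tensor layout are assumptions, not part of the original statement. Note that this straightforward masked version still materializes the full n×n score matrix, so it gives the sliding-window semantics but not the O(n*w) memory savings:

```python
import torch

def sliding_window_attention(Q, K, V, window_size):
    # Q, K, V: (..., seq_len, d). Position i attends to j with |i - j| <= window_size.
    n, d = Q.shape[-2], Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5        # (..., n, n) scaled dot products

    # Band mask: True where |i - j| <= window_size
    idx = torch.arange(n, device=Q.device)
    band = (idx[None, :] - idx[:, None]).abs() <= window_size

    # Positions outside the window get -inf, so softmax assigns them zero weight
    scores = scores.masked_fill(~band, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V
```

With window_size=0 the band is just the diagonal, so each row's softmax is 1 over a single score and the output reduces to V; with window_size >= seq_len the mask keeps everything and the result matches full attention.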
Use this code to debug before submitting.
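The original debug snippet is not included in this copy. One self-contained way to check both edge cases is to compare against a brute-force per-position loop — naive_sliding_window_attention is a hypothetical name for this reference, not part of the exercise:

```python
import torch

def naive_sliding_window_attention(Q, K, V, window_size):
    # Brute-force reference: loop over positions, attend only inside the band.
    n, d = Q.shape[-2], Q.shape[-1]
    out = torch.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window_size), min(n, i + window_size + 1)
        s = Q[..., i:i + 1, :] @ K[..., lo:hi, :].transpose(-2, -1) / d ** 0.5
        out[..., i, :] = (torch.softmax(s, dim=-1) @ V[..., lo:hi, :]).squeeze(-2)
    return out

torch.manual_seed(0)
Q, K, V = (torch.randn(2, 6, 8) for _ in range(3))

# window_size=0: softmax over a single score is 1, so the output equals V
assert torch.allclose(naive_sliding_window_attention(Q, K, V, 0), V, atol=1e-5)

# window_size >= seq_len: must match full attention
full = torch.softmax(Q @ K.transpose(-2, -1) / 8 ** 0.5, dim=-1) @ V
assert torch.allclose(naive_sliding_window_attention(Q, K, V, 6), full, atol=1e-5)
```

Running your vectorized implementation against this loop on a few random window sizes is a quick sanity check before submitting.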
Try solving it yourself before revealing the solution.
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then call check("sliding_window").