softmax(QK^T/sqrt(d_k))V, the foundation
Attention (Hard)
Implement the core attention mechanism used in Transformers.
• Do NOT use F.scaled_dot_product_attention
• You may use torch.softmax and torch.bmm
• Must support autograd
• Must handle cross-attention (seq_q ≠ seq_k)
Implement the function using only basic PyTorch operations, and test it locally before submitting.
Try solving it yourself before looking at a solution.
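One way the constraints above can be satisfied is a direct translation of softmax(QK^T/sqrt(d_k))V into batched matrix operations. The sketch below is a minimal reference, not the official solution: the function name `attention` and the (batch, seq, dim) tensor layout are assumptions. It uses only torch.bmm and torch.softmax, supports cross-attention (seq_q ≠ seq_k, sharing only d_k between Q and K), and supports autograd automatically because every operation is differentiable.

```python
import torch

def attention(Q, K, V):
    # Q: (batch, seq_q, d_k); K: (batch, seq_k, d_k); V: (batch, seq_k, d_v)
    d_k = Q.size(-1)
    # Scaled dot-product scores: (batch, seq_q, seq_k)
    scores = torch.bmm(Q, K.transpose(1, 2)) / d_k ** 0.5
    # Normalize over the key dimension so each query's weights sum to 1
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of values: (batch, seq_q, d_v)
    return torch.bmm(weights, V)
```

Note that the softmax is taken over the last dimension (the keys), and scaling by sqrt(d_k) happens before the softmax; both are common sources of subtle bugs.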
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then use check("attention").