Gated FFN, SiLU(gate) * up, LLaMA/Mistral-style
Medium · Fundamentals

Implement the SwiGLU MLP (feed-forward network) used in modern LLMs like LLaMA.
\text{SwiGLU}(x) = \text{down\_proj}\big(\text{SiLU}(\text{gate\_proj}(x)) \odot \text{up\_proj}(x)\big), \quad \text{where } \text{SiLU}(x) = x \cdot \sigma(x)
• Inherit from nn.Module
• self.gate_proj: nn.Linear(d_model, d_ff)
• self.up_proj: nn.Linear(d_model, d_ff)
• self.down_proj: nn.Linear(d_ff, d_model)
• Activation: SiLU (a.k.a. Swish) — F.silu or implement as x * torch.sigmoid(x)
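The requirements above can be sketched as a module like this (a minimal sketch: the class name `SwiGLUMLP` and the bias-free projections are assumptions, not stated in the prompt — LLaMA-style MLPs typically omit biases):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """Gated FFN: down_proj(SiLU(gate_proj(x)) * up_proj(x))."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # bias=False follows the LLaMA convention (an assumption here).
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gate controls information flow; the up projection carries content.
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check: output keeps the model dimension.
mlp = SwiGLUMLP(d_model=64, d_ff=172)
x = torch.randn(2, 10, 64)
print(mlp(x).shape)  # torch.Size([2, 10, 64])
```

Note that the activation applies only to the gate branch; the up branch passes through linearly before the elementwise product.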
Unlike the classic Linear → ReLU/GELU → Linear FFN, SwiGLU uses a gating mechanism:
the gate projection controls information flow, while the up projection provides the content.
This consistently outperforms standard FFNs in practice (PaLM, LLaMA, and Mistral all use it). Because SwiGLU needs three weight matrices instead of two, implementations typically shrink d_ff (by roughly 2/3) to keep the parameter count comparable.
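To make the contrast with a classic FFN concrete, here is the hidden-state computation for both, using raw weight matrices for illustration (the tensors `W_gate` and `W_up` are hypothetical, not part of the exercise):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)
W_gate, W_up = torch.randn(8, 16), torch.randn(8, 16)

# Classic FFN hidden state: one projection pushed through a nonlinearity.
h_classic = F.gelu(x @ W_up)

# SwiGLU hidden state: SiLU(gate) elementwise-scales the up projection.
gate, up = x @ W_gate, x @ W_up
h_swiglu = F.silu(gate) * up

# SiLU(x) is exactly x * sigmoid(x).
assert torch.allclose(F.silu(gate), gate * torch.sigmoid(gate))
```

The gating lets the network suppress or amplify each hidden channel based on the input, rather than applying a fixed pointwise nonlinearity.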
Implement the function below. Use only basic PyTorch operations.
For interactive practice with auto-grading, run TorchCode locally: pip install torch-judge, then use check("mlp").