
Mixture of Experts in LLMs

Exploring the architecture that powers today's most efficient large language models — from Mixtral and DeepSeek-V3 to Grok-1 and beyond.

What is Mixture of Experts (MoE)?

Mixture of Experts (MoE) is a neural network architecture where instead of a single dense model, the network consists of multiple sub-models called "experts." Each expert specializes in processing different types of input.

In Large Language Models, MoE replaces dense feedforward layers with a set of expert networks, allowing a model to have a much larger total parameter count while keeping per-token computation costs manageable.

The key insight: for any given input token, only a small subset of parameters needs to activate. A "router" or "gating network" decides which expert(s) process each token — enabling massive capacity with efficient inference.

"MoE enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size without a proportional increase in training costs." — Hugging Face Blog

At a glance: 8–64 experts per layer · top-2 routing is typical · roughly 20% of parameters active per token.

[Figure: Mixture of Experts diagram — visualization of the MoE architecture: multiple specialized expert networks with a routing mechanism that selects the best experts for each input token.]

How MoE Works in LLMs

[Figure: MoE working process — token routing through expert networks in an MoE layer.]

Step-by-Step Process

  1. Input Processing: The token embedding enters the MoE layer inside the transformer block.
  2. Router Evaluation: The gate network computes a score for each expert via a learned linear projection + softmax.
  3. Expert Selection: Top-k experts (typically k=2) are selected per token.
  4. Parallel Processing: Selected experts process the token independently and in parallel.
  5. Output Aggregation: Expert outputs are combined as a weighted sum using router probabilities.
  6. Forward Propagation: Combined output continues through the next transformer layer.

This allows models to have billions of sparse parameters while activating only a fraction per token — efficient inference at massive scale.
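In equation form (one common formulation, following the Switch Transformer and Mixtral papers; some implementations apply the softmax only to the selected top-k logits):

\[
G(x) = \mathrm{softmax}(x W_g), \qquad
y \;=\; \sum_{i \,\in\, \mathrm{TopK}(G(x),\,k)} \frac{G(x)_i}{\sum_{j \,\in\, \mathrm{TopK}(G(x),\,k)} G(x)_j}\; E_i(x)
\]

where \(W_g\) is the router's weight matrix, \(E_i\) is the \(i\)-th expert FFN, and the selected router weights are renormalized so they sum to 1.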

Key Components of MoE

Experts

Individual FFN sub-networks. Each expert learns to specialize in particular input patterns, enabling both breadth and depth of knowledge across diverse topics.

Experts in MoE are analogous to specialists: some may excel at mathematical reasoning, others at code, creative writing, or multilingual tasks.

In most implementations, experts have identical architecture but differ in learned weights. Specialization emerges naturally during training through routing decisions.

Router / Gate Network

Selects which expert(s) process each token by computing a probability distribution over all experts and choosing the top-k most relevant ones.

The router is a simple linear layer: nn.Linear(d_model, num_experts). It maps the token embedding to a score per expert, then a softmax yields probabilities.

Routing strategies: "top-1" (single expert, Switch Transformer), "top-2" (Mixtral), or "expert choice" where experts pick tokens instead of tokens picking experts.
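A minimal sketch of the difference between the two routing families (illustrative shapes only, not any specific model's code): token-choice takes the top-k over the expert dimension, while expert-choice takes the top-c over the token dimension.

import torch

scores = torch.randn(16, 8)   # router scores: 16 tokens × 8 experts

# Token-choice (Mixtral-style): each token selects its top-2 experts
token_choice = scores.topk(2, dim=-1).indices          # shape [16, 2]

# Expert-choice: each expert selects its top-c tokens (c = per-expert capacity)
capacity = 4
expert_choice = scores.topk(capacity, dim=0).indices   # shape [4, 8]

print(token_choice.shape, expert_choice.shape)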

Load Balancing

Ensures tokens distribute evenly across experts, preventing "expert collapse" where the router favors only 1–2 experts, wasting the rest.

Common techniques:

  • Auxiliary loss: penalizes uneven routing distributions during training
  • Expert capacity: caps tokens per expert per batch
  • Router z-loss: prevents extreme routing logits (introduced in ST-MoE)

Without balancing, most tokens flow to 1–2 experts, effectively reducing the model capacity.
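The most common balancing objective (the Switch Transformer auxiliary loss, implemented in code in Step 4 below) can be written as:

\[
\mathcal{L}_{\text{aux}} = \alpha \cdot E \cdot \sum_{i=1}^{E} f_i \, P_i
\]

where \(E\) is the number of experts, \(f_i\) is the fraction of tokens routed to expert \(i\), \(P_i\) is the mean router probability assigned to expert \(i\), and \(\alpha\) is a small coefficient (around 0.01). The unscaled term equals 1 when routing is perfectly uniform and grows as routing concentrates on fewer experts.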

[Figure: Detailed view of an MoE layer — the router selects experts per token and combines their outputs with learned weights.]

Advantages of MoE

Improved Scalability

MoE models can reach hundreds of billions of parameters while activating only a fraction per token — enabling scale that would be prohibitively expensive with dense models.

Computational Efficiency

Selective routing eliminates redundant computation. DeepSeek-V3 (671B params) runs at the cost of a ~37B dense model per token — 18× parameter efficiency.

Specialized Knowledge

Each expert naturally develops specialization for particular domains — math, code, language — allowing depth and breadth simultaneously.

Training Efficiency

Mixtral 8x7B matches Llama 2 70B quality while activating roughly 5× fewer parameters per token. DeepSeek-V3 was trained in about 2.788M H800 GPU-hours — a fraction of the compute used by comparable dense models.

Flexible Resource Allocation

The architecture dynamically focuses compute where it's most needed for each specific token, unlike dense models that apply uniform computation everywhere.

"Mixtral outperforms Llama 2 70B on most benchmarks while using 5× fewer active parameters per token."

— Mistral AI, 2023

Dense vs. MoE: Side-by-Side

Metric | Dense Model | MoE Model
Active params per token | 100% of total | ~10–25% of total
Memory at inference | Proportional to size | High (all experts loaded)
Inference FLOPs per token | High for large models | Low relative to param count
Training compute | Higher for same quality | Lower for same quality
Knowledge specialization | Uniform across layers | Expert-specific niches
Fine-tuning stability | More stable | Requires careful tuning
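A rough back-of-the-envelope illustration of the first two rows, using Mixtral 8x7B's published parameter counts (the ~2 FLOPs per parameter per token figure is a common approximation, not an exact cost model):

dense_params = 70e9      # e.g. Llama 2 70B — every parameter is active per token
moe_total    = 46.7e9    # Mixtral 8x7B total parameters (all must fit in memory)
moe_active   = 12.9e9    # Mixtral 8x7B active parameters per token (top-2 of 8 experts)

flops_per_token = lambda p: 2 * p   # very rough forward-pass estimate
print(f"Dense 70B : {flops_per_token(dense_params):.1e} FLOPs/token")
print(f"Mixtral   : {flops_per_token(moe_active):.1e} FLOPs/token, "
      f"while holding {moe_total / 1e9:.1f}B params in memory")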

Real-world Implementations

MoE has been adopted across the most powerful open and closed models of 2023–2025:


Mixtral 8x7B / 8x22B

Mistral AI's open-source MoE model. 8 experts per layer, top-2 routing. The 8x7B (46.7B total) uses only 12.9B active params per token, outperforming Llama 2 70B.

  • 8x22B: 141B total, ~39B active — rivals GPT-4 class models
  • Fully open weights, Apache 2.0 license
  • Token-level routing, 32K context window

Mixtral replaces every FFN layer in a Mistral-7B-style transformer with 8 experts. For each token, the router picks top-2 experts and combines their outputs with normalized softmax weights.

The 8x22B variant (April 2024) keeps the same top-2 routing with ~39B active params and a 64K context window, posting strong results on coding and math benchmarks.


DeepSeek-V3 (2024)

DeepSeek's open-source flagship MoE. 671B total parameters, 37B active per token with 256 experts per layer and top-8 routing. Tops most open-source benchmarks.

  • Multi-Head Latent Attention (MLA) for KV cache efficiency
  • Trained on 14.8T tokens for an estimated ~$5.5M in GPU compute — remarkably cheap
  • Competitive with GPT-4o on coding benchmarks such as HumanEval

DeepSeek-V3 introduces an auxiliary-loss-free load balancing strategy via learned bias terms added to routing logits, avoiding the trade-off between balance and performance.
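A minimal sketch of the idea (illustrative only, not DeepSeek's actual code; the update rate gamma is a made-up value): a per-expert bias is added to the routing scores only when selecting experts, while the original scores still determine the mixing weights; after each batch the bias is nudged down for overloaded experts and up for underloaded ones.

import torch

num_experts, top_k, gamma = 8, 2, 0.001     # gamma = bias update speed (hypothetical)
expert_bias = torch.zeros(num_experts)      # not trained by backprop — updated heuristically

def route(scores: torch.Tensor):
    # scores: [num_tokens, num_experts] router affinities
    # Selection uses biased scores; mixing weights use the original scores
    idx = (scores + expert_bias).topk(top_k, dim=-1).indices
    weights = torch.softmax(scores.gather(-1, idx), dim=-1)
    return idx, weights

def update_bias(idx: torch.Tensor):
    # Count tokens per expert in this batch, then push the bias toward balance
    load = torch.bincount(idx.flatten(), minlength=num_experts).float()
    overloaded = load > load.mean()
    expert_bias[overloaded] -= gamma        # discourage overloaded experts
    expert_bias[~overloaded] += gamma       # encourage underloaded experts

scores = torch.randn(32, num_experts)
idx, w = route(scores)
update_bias(idx)
print(expert_bias)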

Also uses FP8 mixed-precision training and a novel multi-token prediction auxiliary task that boosts benchmark performance at no extra inference cost.


Switch Transformer (Google)

Google Research's pioneering MoE transformer (2021) that proved the viability of trillion-parameter sparse models using simple top-1 routing.

  • Top-1 routing: one expert per token, maximum simplicity
  • Scales to 1.6 trillion parameters
  • Up to 7× pretraining speedup over T5-Base at the same compute budget

Switch Transformer simplified MoE by routing each token to exactly one expert, reducing communication overhead. Key innovations: expert capacity factor, selective precision, and auxiliary load balancing loss.

This paper established the modern MoE recipe and proved that routing instability could be solved at scale.


MoE Model Timeline (2020–2025)

Model | Organization | Architecture | Year
GShard | Google | Top-2, alternating MoE layers, expert capacity | 2020
Switch Transformer | Google | Top-1, 1.6T params, load balancing loss | 2021
GLaM | Google | 1.2T params, 64 experts/layer, ~97B active per token | 2021
NLLB-MoE | Meta AI | Translation MoE, 54.5B total, 200 languages | 2022
Mixtral 8x7B | Mistral AI | 8 experts, top-2, 46.7B total / 12.9B active | 2023
Grok-1 | xAI | 314B total, 8 experts, top-2, fully open weights | 2024
Mixtral 8x22B | Mistral AI | 8 experts, top-2, 141B total / 39B active, 64K ctx | 2024
Qwen1.5-MoE | Alibaba | 14.3B total, 2.7B active, 1/4 training cost of 7B dense | 2024
DeepSeek-V2 | DeepSeek | 236B total, 21B active, MLA + fine-grained MoE | 2024
DeepSeek-V3 | DeepSeek | 671B total, 37B active, 256 experts/layer, top-8 | 2024

Code Examples — Build MoE Step by Step

Follow these five steps to build a complete Mixture of Experts layer in PyTorch from scratch, then see how to use a production MoE model via HuggingFace.

1
The Expert Network — a feedforward sub-model

Each "expert" is just a two-layer feedforward network, identical in structure to the FFN inside a standard transformer block. They only differ in their learned weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

# A single Expert = a 2-layer feedforward network (same as a transformer FFN)
class Expert(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)   # expand
        self.fc2 = nn.Linear(d_ff, d_model)   # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        return self.fc2(F.relu(self.fc1(x)))

# Try it out
expert = Expert(d_model=512, d_ff=2048)
tokens = torch.randn(4, 512)          # batch of 4 tokens
output = expert(tokens)
print(f"Input:  {tokens.shape}")      # [4, 512]
print(f"Output: {output.shape}")      # [4, 512] — same shape!

In a model with 8 experts you'd create 8 of these with nn.ModuleList — each will learn different specializations during training.

2
The Router — deciding which experts process each token

The router is a single learned linear layer that maps each token embedding to a score for every expert. Top-k scores select which experts run.

class TopKRouter(nn.Module):
    """Routes each token to the top-k most relevant experts."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        # Learned: maps token embedding → score per expert
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, d_model]
        # Step A: raw routing scores for every expert
        logits = self.gate(x)                 # [num_tokens, num_experts]
        # Step B: softmax → probability distribution
        probs = F.softmax(logits, dim=-1)     # [num_tokens, num_experts]
        # Step C: pick the top-k experts per token
        top_k_probs, top_k_idx = torch.topk(probs, self.top_k, dim=-1)
        # top_k_probs: [num_tokens, k] — how much weight per expert
        # top_k_idx:   [num_tokens, k] — which expert indices
        # Step D: renormalize so the k weights sum to 1
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)
        return top_k_probs, top_k_idx, probs  # probs used for load-balancing loss

# Example: 4 tokens, 8 experts, select top-2
router = TopKRouter(d_model=512, num_experts=8, top_k=2)
tokens = torch.randn(4, 512)
weights, indices, all_probs = router(tokens)

print("Expert indices per token:", indices)
# e.g. tensor([[3, 6], [1, 7], [0, 2], [5, 3]])
print("Expert weights per token: ", weights.round(decimals=2))
# e.g. tensor([[0.62, 0.38], [0.71, 0.29], ...])
3
The MoE Layer — combining router + experts

Now put it all together. The MoE layer takes a sequence of tokens, routes each one to its top-k experts, runs them in parallel (grouped by expert for efficiency), and returns the weighted sum of outputs.

class MoELayer(nn.Module):
    """
    Sparse Mixture of Experts Layer.

    Replaces a single FFN in a Transformer with N expert FFNs.
    Only k out of N experts activate per token (k << N).
    """
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.num_experts = num_experts
        # N independent expert networks
        self.experts = nn.ModuleList([
            Expert(d_model, d_ff) for _ in range(num_experts)
        ])
        # Router selects which experts process each token
        self.router = TopKRouter(d_model, num_experts, top_k)

    def forward(self, x: torch.Tensor):
        # x: [batch, seq_len, d_model]
        B, S, D = x.shape

        # Flatten batch and sequence dims — treat all tokens equally
        x_flat = x.reshape(-1, D)    # [B*S, d_model]
        N = x_flat.shape[0]          # total tokens

        # Get routing decisions
        weights, indices, all_probs = self.router(x_flat)
        # weights: [N, top_k] — how much each expert contributes
        # indices: [N, top_k] — which expert to use

        output = torch.zeros_like(x_flat)     # [N, d_model]

        # For each expert slot (0 to top_k-1)
        for k in range(self.top_k):
            expert_ids = indices[:, k]        # [N] which expert for slot k
            slot_weight = weights[:, k:k+1]   # [N, 1] contribution weight

            # Group tokens by their assigned expert and run each expert once
            for e in range(self.num_experts):
                mask = (expert_ids == e)      # which tokens go to expert e
                if not mask.any():
                    continue                  # skip unused experts (saves compute!)
                expert_out = self.experts[e](x_flat[mask])      # [n_e, D]
                output[mask] += slot_weight[mask] * expert_out  # weighted add

        return output.reshape(B, S, D), all_probs   # restore original shape


# Test the full MoE layer
moe = MoELayer(d_model=512, d_ff=2048, num_experts=8, top_k=2)
x = torch.randn(2, 10, 512)    # batch=2, seq_len=10
out, probs = moe(x)
print(f"Input:  {x.shape}")    # torch.Size([2, 10, 512])
print(f"Output: {out.shape}")  # torch.Size([2, 10, 512]) ✓

# Count parameters: 8 experts × (512×2048 + 2048×512) ≈ 16.8M expert weights
# But only 2/8 experts activate per token = 25% active compute
total = sum(p.numel() for p in moe.parameters())
active = total * (moe.top_k / moe.num_experts)
print(f"Total params: {total:,} | Active per token: {int(active):,}")
4
Load Balancing Loss — preventing expert collapse

Without a load balancing term, the router quickly learns to send all tokens to 1–2 experts, wasting the rest. This auxiliary loss penalizes uneven expert utilization and is added to the main training loss with a small coefficient (typically 0.01).

def auxiliary_load_balancing_loss(
    router_probs: torch.Tensor,    # [N, num_experts] — softmax output
    expert_indices: torch.Tensor,  # [N, top_k] — selected expert IDs
    num_experts: int
) -> torch.Tensor:
    """
    Switch Transformer / Mixtral style auxiliary load-balancing loss.
    Encourages all experts to receive roughly equal token counts.

    Loss = num_experts * sum_i(f_i * P_i)
      f_i = fraction of tokens routed to expert i
      P_i = mean router probability for expert i
    Both terms should be ~1/num_experts for uniform distribution.
    """
    N = router_probs.shape[0]

    # f_i: fraction of tokens dispatched to each expert
    one_hot = F.one_hot(expert_indices, num_experts).float()   # [N, k, E]
    tokens_per_expert = one_hot.sum(dim=[0, 1]) / (N * expert_indices.shape[1])
    # Shape: [num_experts] — ideally all equal to 1/num_experts

    # P_i: mean router probability assigned to each expert
    mean_probs = router_probs.mean(dim=0)
    # Shape: [num_experts] — ideally all equal to 1/num_experts

    # Dot product × num_experts: 1.0 when perfectly balanced
    loss = num_experts * (tokens_per_expert * mean_probs).sum()
    return loss


# ------ How to use it in your training loop ------
moe = MoELayer(d_model=512, d_ff=2048, num_experts=8, top_k=2)
x = torch.randn(2, 10, 512)

out, all_probs = moe(x)
_, indices, _ = moe.router(x.reshape(-1, 512))

# Main task loss (e.g. cross-entropy for language modelling)
task_loss = torch.tensor(2.5)   # placeholder

# Add a small load-balancing penalty (coefficient α = 0.01)
balance_loss = auxiliary_load_balancing_loss(all_probs, indices, num_experts=8)
total_loss = task_loss + 0.01 * balance_loss

print(f"Task loss:    {task_loss.item():.4f}")
print(f"Balance loss: {balance_loss.item():.4f}")   # ideally close to 1.0
print(f"Total loss:   {total_loss.item():.4f}")
5
Using Mixtral / DeepSeek via HuggingFace Transformers

You don't need to build MoE from scratch to use it. HuggingFace Transformers ships Mixtral and other MoE models out of the box. Here's how to load and run them:

# Install: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# ── Option A: Mixtral 8x7B (open weights, Apache-2.0) ──────────────────────
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load with 4-bit quantization: reduces VRAM from ~90 GB → ~24 GB
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",          # spread across available GPUs/CPU
    torch_dtype=torch.float16,
    load_in_4bit=True,          # requires bitsandbytes
)

# Prepare a chat-formatted prompt
messages = [
    {"role": "user", "content": "Explain Mixture of Experts in three sentences."}
]
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Generate a response
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
    )

response = tokenizer.decode(
    output_ids[0][input_ids.shape[1]:], skip_special_tokens=True
)
print(response)

# ── Option B: Inspect which experts were activated ─────────────────────────
# Access the router logits for interpretability
outputs = model(input_ids, output_router_logits=True)
router_logits = outputs.router_logits   # tuple of tensors per MoE layer

for layer_idx, logits in enumerate(router_logits):
    top2 = logits.topk(2, dim=-1).indices   # [seq, 2]
    print(f"Layer {layer_idx}: top-2 experts used → {top2[0].tolist()}")

Tip — Small MoE model for experimentation

Use mistralai/Mixtral-8x7B-v0.1 if you want the base (non-instruct) model, or deepseek-ai/deepseek-moe-16b-chat for a smaller DeepSeek variant that fits on a single 24 GB GPU.

Technical Challenges

Training & Inference

High Memory Requirements

Although MoE activates only a subset of parameters per token, all experts must reside in VRAM. Mixtral 8x7B needs ~90 GB FP16 — requiring 4-bit quantization for most hardware.
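The arithmetic behind that figure, as a quick sanity check (weights only — activation memory and the KV cache come on top):

params = 46.7e9   # Mixtral 8x7B total parameters — all experts must be resident
for name, bytes_per_param in [("FP16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name:>5}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# FP16: ~93 GB, 8-bit: ~47 GB, 4-bit: ~23 GB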

Training Instabilities

The discrete routing decisions (non-differentiable top-k) can cause gradient instability. Auxiliary losses, router z-loss, and careful initialization are required.

Fine-tuning Challenges

MoE models overfit faster than dense models during supervised fine-tuning. Techniques like LoRA applied only to the router or non-expert weights help stabilize training.
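One common pattern, sketched here with the Hugging Face peft library (module names follow Mixtral's attention projections; the hyperparameters are illustrative):

from peft import LoraConfig, get_peft_model

# Assumes `model` is a loaded MoE checkpoint, e.g. Mixtral (see the HuggingFace example above)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only — experts and router stay frozen
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total params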

Architecture & Deployment

Load Balancing

Without active balancing, routers collapse to using 1–2 experts for everything, wasting model capacity. Auxiliary losses and capacity factors are essential.

Communication Overhead

In distributed training, tokens must be sent to experts on different devices (all-to-all communication). This latency can bottleneck training throughput by 30–50%.

Expert Capacity Planning

Token drops occur when an expert receives more tokens than its capacity budget. Overflow tokens are dropped, potentially hurting quality. Capacity factor tuning is non-trivial.
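A minimal sketch of how a capacity budget translates into dropped tokens (Switch Transformer-style top-1 routing; the skewed distribution and capacity factor are illustrative):

import math
import torch

num_tokens, num_experts, capacity_factor = 1024, 8, 1.25
capacity = math.ceil(capacity_factor * num_tokens / num_experts)   # max tokens per expert

# Simulate a skewed routing distribution (expert 0 is heavily favored)
probs = torch.tensor([0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05])
assignments = torch.multinomial(probs, num_tokens, replacement=True)
load = torch.bincount(assignments, minlength=num_experts)

dropped = (load - capacity).clamp(min=0).sum()
print(f"Capacity per expert: {capacity} tokens")
print(f"Overflow tokens dropped this batch: {dropped.item()}")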

Solutions & Mitigations

Training Stability

  • Auxiliary load-balancing loss (Switch Transformer)
  • Router z-loss to prevent extreme logits (ST-MoE)
  • Bias-based load balancing without auxiliary loss (DeepSeek-V3)
  • Selective precision: FP32 for router, FP16/BF16 for experts

Deployment Efficiency

  • 4-bit / 8-bit quantization (bitsandbytes, GPTQ, AWQ)
  • Expert parallelism: each GPU holds a subset of experts
  • Flash Attention + MLA (Multi-Head Latent Attention) for memory
  • vLLM and TensorRT-LLM for optimized MoE inference serving

Future Directions

Hardware Co-design

Next-gen accelerators will optimize for sparse expert dispatch patterns — reducing the all-to-all communication overhead that currently limits MoE training throughput.

  • Sparse tensor cores for expert computation
  • NVLink bandwidth optimized for expert parallelism
  • On-demand expert paging from CPU ↔ GPU

Advanced Routing

Beyond simple top-k: smarter routing that understands semantic context, task type, and language, enabling better expert specialization and lower token drop rates.

  • Expert-choice routing (experts select tokens)
  • Soft/differentiable routing mechanisms
  • Auxiliary-loss-free load balancing (DeepSeek-V3 style)

Multimodal MoE

MoE architectures naturally extend to multimodal models — vision, audio, and text can each have dedicated experts, with cross-modal experts for integration.

  • Modality-specific expert clusters
  • Combining MoE + RAG for grounded generation
  • Dynamic expert merging / pruning post-training

Emerging Research Trends (2025)

Expert Merging & Model Soup

MoE experts can be merged post-training to create compact dense models. Model-merging techniques such as DARE and TIES-merging are being explored for collapsing MoE experts into dense models without full retraining.

Long-context MoE

Combining MoE with sliding window attention (Mixtral) and Multi-Head Latent Attention (DeepSeek) enables 128K+ context windows at manageable memory cost.

MoE for Reasoning Models

Chain-of-thought and RLHF training (GRPO, DPO) combined with MoE architectures. DeepSeek-R1 uses MoE + reinforcement learning for state-of-the-art mathematical reasoning.

Efficient On-device MoE

Small MoE models (Qwen-MoE, MobileMoE) target on-device deployment — using sparse activation to run large-capacity models on mobile GPUs and Apple Neural Engine.

Explore MoE in Your Projects

Sparse Mixture of Experts is the dominant scaling paradigm for 2024–2025. Whether you're training from scratch or fine-tuning open models, MoE gives you more capability per FLOP than any dense alternative.