## Summary

Prior mixture-of-experts (MoE) models typically rely on selecting the top _K_ experts (often 1 or 2) out of _N_ possible experts for each token in a sequence. While this approach does reduce computational load—since only a small fraction of experts are activated—it also forces those few activated experts to capture _all_ aspects of the token, including common linguistic structure that is often duplicated across experts. Consequently, an enormous portion of each expert's capacity is spent memorizing redundant information, leaving less room for true specialization.

DeepSeekMoE solves this redundancy problem by:

1. **Using a larger number of smaller experts (Fine-Grained Expert Segmentation)**
   Instead of a few large experts, DeepSeek splits capacity into many more experts, each of which is smaller in dimensionality. The model then increases the number of selected experts by the same factor, creating a dramatically larger space of potential expert combinations. Despite this combinatorial explosion, the overall parameter count and per-token activated parameters remain _exactly the same_ as in a conventional MoE setup—meaning we gain richer representational capacity without paying extra in total parameter count or computational cost.

2. **Separating Experts into Shared and Routing Experts**
   DeepSeek also partitions its experts into two sets. The shared experts, which are _always activated_ for every token, learn the broad "common knowledge" required by all inputs (e.g., syntax, high-level semantics). The routing experts, by contrast, are only activated if they are relevant to a specific token, allowing them to focus on niche or domain-specific information. This further decreases redundancy and promotes parameter efficiency: shared experts handle language "fundamentals," while routing experts handle specialization.

3. **Load Balancing Through Additional Loss Terms**
   Finally, DeepSeek addresses load balancing in two senses. It enforces roughly equal usage of each routing expert across tokens—ensuring no single expert is under- or over-utilized—and distributes the experts themselves across multiple GPUs to avoid hardware bottlenecks. Both of these aims are achieved by incorporating new balancing terms into the training objective.

![[Pasted image 20250129220131.png]]

Taken together, these modifications produce a model that is both parameter-efficient and highly flexible. By boosting expert variety, removing needless duplication, and balancing the workload across experts and devices, DeepSeekMoE provides a substantially more effective way to leverage MoE architectures—achieving greater specialization and capacity without increasing the overall parameter footprint. Let's dive deeper into these three optimizations now and see how they alter the standard MoE transformer architecture.

## Standard Mixture of Expert Models

In a standard MoE architecture, expert layers typically replace the feed-forward layer that occurs after self-attention. Experts can be thought of as a set of *N* feed-forward layers that are structurally identical to the original feed-forward layer. Only a subset of these *N* possible feed-forward networks will be activated for any individual token, with many prior MoE architectures selecting 1 or 2 of these *N* possible networks for a given token. Whether or not a network is activated is determined by taking the dot product of the output of the attention layer for that token (i.e. the hidden vector for token *i*) with the centroid of the current expert.
We then take the softmax of this value (computed across all experts) to force it into the range of 0 to 1. You can think of this like an attention score computed over the experts instead of the tokens - we want to see which expert aligns most closely with the current token under consideration. These scores are computed for each expert, and then the experts are ranked according to this score. The top *K* (usually 1 or 2) experts are selected based on this ranking, and the token embeddings are then passed to those feed-forward expert networks. The outputs of these experts are added together alongside the initial hidden state for the token (i.e. the token vector prior to the application of the experts). This produces the final output for the given layer.

The major obstacle with this approach is the following: since most prior MoE models only selected the top 1 or 2 experts for each token, the selected expert(s) must capture *everything* about a given token, including redundant information such as language structure. This wastes a large amount of the model's capacity to learn useful information, forcing the weights of each expert to memorize redundant information that is already captured by the other experts.

## Fine-Grained Expert Segmentation

One of DeepSeek's solutions to the redundancy problem is to **make experts smaller but more numerous**. That is, the DeepSeekMoE approach reduces the dimensionality of each individual expert's feed-forward network (and therefore its computational cost and representational capacity) by a factor of 1/m compared to the standard feed-forward layer. Correspondingly, it increases the number of total experts by a factor of m *and* the number of selected experts by the same factor of m. This results in the same number of parameters for the model on net, but allows for substantially more variety when selecting the experts to use for a specific token.

We can see this increased variety when examining the combinatorics of the expert space. Suppose our standard feed-forward network has hidden dimension 4096, and our standard mixture of experts model uses 8 of these experts in total, with 2 selected for any given token. This results in the following number of possible expert combinations for each token in the standard mixture of experts model:

${8 \choose 2} = 28 \; \text{possible expert combinations}$

Now, using the DeepSeekMoE architecture, suppose we have m = 8. That is, we are going to increase our number of experts by a factor of 8 (and reduce the hidden dimension by a factor of 1/8). This gives us a hidden dimension of 512 per expert, with 64 total experts and 16 experts selected for any given token. This results in the following number of possible expert combinations for each token in the DeepSeekMoE version of the model:

${64 \choose 16} \approx 489,000,000,000,000 \; \text{possible expert combinations}$

That is, we go from 28 possible expert combinations to nearly 489 trillion possible expert combinations! This allows for *significantly* more specialization across experts and much more variety in knowledge application on a token-by-token basis. Astonishingly, even with this huge increase in variety, the number of parameters stays exactly the same!
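As a quick sanity check, these combination counts can be reproduced with Python's standard-library `math.comb`; the expert counts and top-k values below are just the illustrative ones from this example, not DeepSeek's actual configuration.

```python
from math import comb

# Standard MoE in the example: 8 experts, 2 selected per token.
standard_combos = comb(8, 2)        # 28

# Fine-grained MoE with m = 8: 64 experts, 16 selected per token.
fine_grained_combos = comb(64, 16)  # 488,526,937,079,580 (~489 trillion)

print(standard_combos, fine_grained_combos)
```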
The total expert capacity in each model (measured here in hidden units, which scale directly with parameter count) is given by:

$\begin{align*} \text{Original MoE model} &= 8 \text{ experts} \times 4096 \text{ hidden units per expert} = 32{,}768 \text{ hidden units}\\ \text{DeepSeekMoE model} &= 64 \text{ experts} \times 512 \text{ hidden units per expert} = 32{,}768 \text{ hidden units} \end{align*}$

Similarly, the capacity activated for any given token is exactly the same:

$\begin{align*} \text{Original MoE activated per token} &= 2 \text{ activated experts} \times 4096 \text{ hidden units per expert} = 8192 \text{ hidden units}\\ \text{DeepSeekMoE activated per token} &= 16 \text{ activated experts} \times 512 \text{ hidden units per expert} = 8192 \text{ hidden units} \end{align*}$

Hence, we get basically a free lunch here - significantly higher representational capacity in our model with the same number of parameters used!

## Shared Experts

Another approach DeepSeek took to avoid capturing redundancy in its experts is to segment the expert population into two groups: shared experts and routing experts. Shared experts are **always activated**, regardless of the input token. This incentivizes these expert modules to capture common knowledge relevant to all queries (e.g. language semantics). By contrast, routing experts are only activated if the token is relevant to the expert, as described in the "Standard Mixture of Expert Models" section.

That is, the *mN* fine-grained experts are split into two groups: *K_s* shared experts and *K_r = mN - K_s* routing experts. *All* of the *K_s* shared experts are activated for all tokens, while a subset of the *K_r* routing experts is selected for each token. Mathematically, this looks like the following:

$\begin{align} \mathbf{h}^l_t &= \sum_{i=1}^{K_s} \text{FFN}_i(\mathbf{u}^l_t) + \sum_{i=K_s+1}^{mN} \left( g_{i,t} \cdot \text{FFN}_i(\mathbf{u}_t^l) \right) + \mathbf{u}_t^l, \\[8pt] \text{where} \quad \mathbf{h}_t^l &\text{ is the hidden vector output for the } t\text{-th token at the } l\text{-th layer,} \\[5pt] \text{FFN}_i &\text{ is the feed-forward network representing the } i\text{-th expert,} \\[5pt] K_s &\text{ is the number of shared experts,} \\[5pt] mN &\text{ is the total number of experts,} \\[5pt] \mathbf{u}_t^l &\text{ is the output of the attention mechanism for token } t \text{ at layer } l, \\[5pt] g_{i,t} &\text{ is the gating factor for expert } i \text{ and token } t, \text{ given by:}\\ g_{i,t} &= \begin{cases} s_{i,t}, & s_{i,t} \in \text{Top}_k\left( \{s_{j,t} \mid K_s + 1 \leq j \leq mN\}, \; mK - K_s \right) \\[5pt] 0, & \text{otherwise} \end{cases} \\[8pt] s_{i,t} &= \text{Softmax}_i \left( {\mathbf{u}_t^l}^\top \mathbf{e}_i^l \right), \end{align}$

where $\mathbf{e}_i^l$ is the centroid of expert $i$ at layer $l$.

Hence, we can see that the hidden vector output of token $t$ at layer $l$ *always* uses all of the shared experts (denoted by the first summation in the equation) and *always* includes the residual (denoted by the last term). The middle term, representing the routing experts, includes a gating factor that controls which experts are turned on for any specific token. In particular, the gating factor is the softmax score $s_{i,t}$ if that score ranks in the top $mK - K_s$ among the routing experts (so that, together with the $K_s$ shared experts, $mK$ experts are active per token). Otherwise, it is 0. As a result, not only do we eliminate most of the possible experts (thereby greatly reducing the number of active parameters), we also weight the final output based on how *close* each chosen routing expert is to the token. In other words, the more a chosen routing expert "knows" about a topic, the more heavily we weight its opinion.
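To make the shared/routed split and the gating concrete, below is a minimal PyTorch-style sketch of a layer implementing the equation above. This is an illustrative reconstruction, not DeepSeek's actual code: the class name, constructor arguments, and centroid initialization are assumptions made for the example, and the dense loop over routed experts is written for readability rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRoutedMoELayer(nn.Module):
    """Illustrative MoE layer with shared + routed experts (not DeepSeek's actual code)."""

    def __init__(self, d_model: int, d_expert_hidden: int,
                 n_shared: int, n_routed: int, top_k_routed: int):
        super().__init__()

        def make_expert() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(d_model, d_expert_hidden),
                nn.GELU(),
                nn.Linear(d_expert_hidden, d_model),
            )

        self.shared_experts = nn.ModuleList([make_expert() for _ in range(n_shared)])
        self.routed_experts = nn.ModuleList([make_expert() for _ in range(n_routed)])
        # One learned centroid e_i per routed expert; affinity is the dot product u_t . e_i.
        self.centroids = nn.Parameter(torch.randn(n_routed, d_model) * 0.02)
        self.top_k_routed = top_k_routed  # corresponds to mK - K_s in the notation above

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u has shape (batch, seq_len, d_model): the attention output for each token.
        out = torch.zeros_like(u)

        # Shared experts: always active for every token.
        for expert in self.shared_experts:
            out = out + expert(u)

        # Routed experts: softmax affinity scores s_{i,t}, keep only the top-k as gates.
        scores = F.softmax(u @ self.centroids.T, dim=-1)        # (batch, seq, n_routed)
        top_vals, top_idx = scores.topk(self.top_k_routed, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, top_idx, top_vals)

        # Dense loop over experts for clarity; efficient implementations dispatch
        # only the tokens actually routed to each expert.
        for i, expert in enumerate(self.routed_experts):
            out = out + gates[..., i:i + 1] * expert(u)

        # Residual connection (the final u_t^l term in the equation).
        return out + u
```

With, say, `d_model=4096`, `d_expert_hidden=512`, and 16 experts active per token (shared plus routed), this mirrors the fine-grained configuration from the earlier example.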
This setup allows the routing experts to ignore the redundant information captured by the shared experts and instead focus on learning concepts and information relevant to their areas of specialization. This promotes parameter efficiency in the model, as each marginal parameter added to the routing experts will be encouraged through the learning process to acquire information that is distinct from the existing parameters.

## Load Balancing

Now that we have a better-designed MoE network with fine-grained experts and expert sharing, there still remains one major challenge to ensure the parameters are used maximally - we need to load balance tokens across the available experts. Essentially, our goal is for tokens to be routed to each of the routing experts roughly equally often, so that no expert is under- or over-utilized. This makes certain that every routing expert receives enough training signal to contribute meaningfully to the output, maximizing the utilization of the MoE architecture.

In addition to load balancing across experts, we would like to load balance across devices. Experts are typically stored on many separate GPUs, since these models are too large to fit in the memory of a single GPU. Given this fact, we would like the chosen experts for a token to be evenly spread across devices, thus preventing overloading of any single GPU.

DeepSeekMoE achieves these two goals by introducing two new terms to the loss function: an expert-level balance loss and a device-level balance loss (a sketch of the expert-level term appears at the end of this post).

## Results and Key Takeaways

With the above optimizations, DeepSeek was able to mitigate many of the most challenging problems facing MoE models. Together, fine-grained segmentation, shared experts, and load balancing work to maximize the amount of unique, useful information stored in a given set of parameters. As a result, DeepSeekMoE is able to outperform comparable models while using *fewer* active parameters.

Below, we can see that DeepSeekMoE outperformed LLaMA2 7B (a dense model that does *not* use any experts) across a number of benchmarks with fewer than half of the active parameters.

![[Pasted image 20250129215446.png]]

When compared to another mixture of experts model, GShard, we see that DeepSeekMoE again outperforms it with the same total parameters and only half of the activated parameters.

![[Pasted image 20250129215133.png]]

In sum, DeepSeek's optimizations for the MoE architecture served to substantially expand the possibilities for local and edge inference. Since only a small percentage of the model's total parameters are activated for any given token, the model's compute requirements during inference are much closer to those of a small, weak model, while its output quality matches that of a large, well-trained dense LLM. This innovation was critical for laying the groundwork towards DeepSeek-R1, ensuring that state-of-the-art base LLM performance would be possible for smaller models.
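As a reference for the load-balancing discussion above, here is a minimal sketch of what an expert-level balance loss can look like: a coefficient times the sum, over routed experts, of the fraction of tokens routed to each expert multiplied by its average gate probability. The function and variable names are my own, the coefficient value is arbitrary, and this is an illustration of the general idea rather than DeepSeek's exact implementation.

```python
import torch

def expert_level_balance_loss(scores: torch.Tensor,
                              top_idx: torch.Tensor,
                              alpha: float = 0.01) -> torch.Tensor:
    """Sketch of an expert-level balance loss for routed experts.

    scores:  (num_tokens, n_routed) softmax affinity scores s_{i,t}
    top_idx: (num_tokens, k) indices of the experts selected for each token
    alpha:   balance coefficient (hyperparameter; value here is arbitrary)
    """
    num_tokens, n_routed = scores.shape
    k = top_idx.shape[-1]

    # f_i: (scaled) fraction of tokens that selected expert i.
    selected = torch.zeros_like(scores).scatter(-1, top_idx, 1.0)
    f = selected.sum(dim=0) * n_routed / (k * num_tokens)

    # P_i: average affinity score assigned to expert i across tokens.
    p = scores.mean(dim=0)

    # The loss is smallest when tokens (and probability mass) are spread evenly.
    return alpha * (f * p).sum()
```

A device-level variant applies the same idea at a coarser granularity, balancing load across the groups of experts that live on the same GPU.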