DeepSeek-V3 is a large Mixture-of-Experts (MoE) **model with 671B parameters**, of which **37B are activated for each token**.

Key innovations:
1. Auxiliary-loss-free strategy for load balancing across experts (sketched below)
2. Multi-token prediction training objective
3. DualPipe algorithm for efficient pipeline parallelism
4. Mixed-precision FP8 training for faster computation and a lower memory footprint
5. Optimized memory footprint, making it possible to train DeepSeek-V3 without costly tensor parallelism
6. Distillation of chain-of-thought reasoning from DeepSeek-R1 into the model during post-training

Features retained from DeepSeek-V2:
1. Multi-head Latent Attention (MLA) for efficient inference
2. DeepSeekMoE for cost-effective training
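
The auxiliary-loss-free load balancing works by adding a per-expert bias to the routing scores, but only when *selecting* the top-k experts; the gating weights themselves stay unbiased, and the bias is nudged up or down after each step depending on expert load. Below is a minimal sketch of that idea, assuming sigmoid token-expert affinities as described in the report; the class and parameter names (`AuxLossFreeRouter`, `bias_update_speed`) are illustrative, not the official implementation.

```python
import torch

class AuxLossFreeRouter(torch.nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int,
                 bias_update_speed: float = 1e-3):
        super().__init__()
        # Expert centroids used to compute token-to-expert affinity scores.
        self.centroids = torch.nn.Parameter(torch.randn(num_experts, hidden_dim) * 0.02)
        # Routing bias: a buffer, not a trainable parameter; it is adjusted
        # outside backprop based on observed expert load.
        self.register_buffer("bias", torch.zeros(num_experts))
        self.top_k = top_k
        self.bias_update_speed = bias_update_speed

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_dim]
        scores = torch.sigmoid(x @ self.centroids.t())            # token-expert affinities
        # Bias affects only which experts are chosen...
        _, expert_idx = torch.topk(scores + self.bias, self.top_k, dim=-1)
        # ...while gating weights are taken from the unbiased scores.
        gate = torch.gather(scores, -1, expert_idx)
        gate = gate / gate.sum(dim=-1, keepdim=True)               # normalize over chosen experts
        return expert_idx, gate

    @torch.no_grad()
    def update_bias(self, expert_idx: torch.Tensor):
        # After a training step: lower the bias of overloaded experts and
        # raise it for underloaded ones, by a fixed speed (γ in the paper).
        load = torch.bincount(expert_idx.flatten(), minlength=self.bias.numel()).float()
        mean_load = load.mean()
        self.bias += self.bias_update_speed * torch.sign(mean_load - load)
```

Because the bias never enters the gating weights, load balancing is enforced without an auxiliary loss term distorting the gradients, which is the point of the "loss-free" design.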