Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Memory as the bottleneck

Transformer models grow faster than GPU memory. Shoeybi et al. split the weight matrices inside each layer across devices so that no single GPU holds the full parameter set. The approach is intra-layer model parallelism: partition the weight matrices, insert a few all-reduce operations, and stay in native PyTorch with no new compiler or library changes.
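To make the partitioning concrete, here is a minimal sketch of one transformer MLP block, Y = GeLU(X A) B, computed two ways: on a single device, and with A split by columns and B split by rows across simulated workers. It uses plain Python lists rather than PyTorch to stay dependency-free, and the all-reduce is simulated by an elementwise sum. All function names are hypothetical, and this is an illustration of the idea under those assumptions, not the authors' implementation.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gelu(m):
    """Elementwise GeLU (tanh approximation)."""
    c = math.sqrt(2.0 / math.pi)
    return [[0.5 * v * (1.0 + math.tanh(c * (v + 0.044715 * v ** 3))) for v in row]
            for row in m]

def split_columns(m, parts):
    """Split a matrix into `parts` equal-width column blocks."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split a matrix into `parts` equal-height row blocks."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

def all_reduce_sum(partials):
    """Stand-in for an all-reduce: elementwise sum of every worker's partial output."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

def mlp_reference(x, a, b):
    """Single-device MLP block: GeLU(X A) B."""
    return matmul(gelu(matmul(x, a)), b)

def mlp_tensor_parallel(x, a, b, workers):
    """Each simulated worker holds one column shard of A and one row shard of B."""
    a_shards = split_columns(a, workers)
    b_shards = split_rows(b, workers)
    # GeLU is elementwise, so each worker applies it to its own column block
    # without communicating; only the final partial products need summing.
    partials = [matmul(gelu(matmul(x, a_i)), b_i)
                for a_i, b_i in zip(a_shards, b_shards)]
    return all_reduce_sum(partials)

def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
x = rand_matrix(2, 4)   # batch of 2 tokens, hidden size 4
a = rand_matrix(4, 8)   # first MLP weight, hidden -> 2 * hidden here
b = rand_matrix(8, 4)   # second MLP weight, back to hidden

ref = mlp_reference(x, a, b)
par = mlp_tensor_parallel(x, a, b, workers=2)
max_diff = max(abs(r - p) for rr, pr in zip(ref, par) for r, p in zip(rr, pr))
print(f"max difference between sharded and reference output: {max_diff:.2e}")
```

Splitting A by columns is what lets the nonlinearity run locally: a row split would require summing partial products before GeLU, which means an extra synchronization point inside the block.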

Engineering over novelty

The paper's strength is systems work. The authors train transformer models of up to 8.3B parameters on 512 GPUs and report 76% scaling efficiency relative to a strong single-GPU baseline. They also show that the placement of layer normalization matters as BERT-like models grow. Details that theory papers skip become the difference between convergence and failure.
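The scaling-efficiency figure in the paper's abstract can be checked with simple arithmetic. The sketch below uses the throughput numbers reported there (39 TFLOPs sustained on one GPU, 15.1 PFLOPs sustained across 512 GPUs); the variable names are illustrative.

```python
# Figures as reported in the paper's abstract.
single_gpu_tflops = 39.0   # sustained throughput of the single-GPU baseline
num_gpus = 512
sustained_pflops = 15.1    # sustained throughput of the full 512-GPU run

# Perfect linear scaling would multiply the baseline by the GPU count.
ideal_pflops = single_gpu_tflops * num_gpus / 1000.0
efficiency = sustained_pflops / ideal_pflops

print(f"ideal: {ideal_pflops:.2f} PFLOPs, efficiency: {efficiency:.0%}")
# prints: ideal: 19.97 PFLOPs, efficiency: 76%
```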

Lineage

Megatron-LM underpins much of NVIDIA's large-model stack and influenced how the industry thinks about training infrastructure. "Attention Is All You Need" gave the architecture; papers like this made it trainable at scale.