NVIDIA Clocks World’s Fastest BERT Training Time and Largest Transformer Based Model


This is an impressive feat:

The NVIDIA DGX SuperPOD with 92 DGX-2H nodes set a new record by training BERT-Large in just 53 minutes. This record was set using 1,472 V100 SXM3-32GB 450W GPUs and 8 Mellanox InfiniBand compute adapters per node, running PyTorch(...)

The post reads like the press release it is, but it's interesting nonetheless. My primary takeaway was the level of scalability the team achieved: 76% scaling efficiency vs. baseline in a 512-GPU setup. Impressive.
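For context on what that figure means: scaling efficiency is the observed speedup divided by the ideal (linear) speedup you'd expect from adding GPUs. A minimal sketch, where the baseline size and the training times are illustrative numbers I've made up rather than figures from the post:

```python
def scaling_efficiency(t_base, n_base, t_scaled, n_scaled):
    """Fraction of ideal (linear) speedup achieved when scaling out.

    t_base/t_scaled: training time at the baseline and scaled GPU counts.
    n_base/n_scaled: number of GPUs in each setup.
    """
    speedup = t_base / t_scaled   # observed speedup from scaling out
    ideal = n_scaled / n_base     # what linear scaling would deliver
    return speedup / ideal

# Hypothetical numbers: if 16 GPUs take 48 hours and 512 GPUs take
# 1.97 hours, efficiency = (48 / 1.97) / (512 / 16) ≈ 0.76
eff = scaling_efficiency(48.0, 16, 1.97, 512)
print(f"{eff:.0%}")  # → 76%
```

Anything below 100% reflects the communication and synchronization overhead of distributed training, so holding 76% at 512 GPUs is a strong result.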
