High-speed RDMA networks are being rapidly adopted in industry for...
Scaling model parameters usually improves model quality, but at the pric...
Transformer is the cornerstone model of Natural Language Processing (NLP...
Gradient compression (GC) is a promising approach to addressing the comm...
Distributed training using multiple devices (e.g., GPUs) has been widely...
Companies build separate training and inference GPU clusters for deep le...
Graph neural networks (GNNs) have extended the success of deep neural ne...
Today's auto-tuners (e.g., AutoTVM, Ansor) generate efficient tensor pro...
Multi-Instance GPU (MIG) is a new feature introduced by NVIDIA A100 GPUs...
Increasingly complicated and diverse applications have distinct netw...
The learning rate (LR) schedule is one of the most important hyper-param...
Network failures continue to plague datacenter operators as their sympto...