A Computational-Graph Partitioning Method for Training Memory-Constrained DNNs
We propose ParDNN, an automatic, generic, and non-intrusive partitioning strategy for large DNN models that do not fit into a single device's memory. ParDNN decides the placement of the operations in a DNN's underlying computational graph across multiple devices so that the devices' memory constraints are met and the training time is minimized. ParDNN is completely independent of the deep learning aspects of a DNN and requires no modification at either the model level or the systems-level implementation of operation kernels. It partitions DNNs having billions of parameters and hundreds of thousands of operations in seconds to a few minutes. Our experiments with TensorFlow on 16 GPUs demonstrate efficient training of five very large models while achieving super-linear scaling for both the batch size and training throughput. In comparison to related work (Mesh-TensorFlow and gradient checkpointing), ParDNN either outperforms or qualitatively improves upon them.
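To make the placement problem concrete, the following is a minimal sketch of a greedy, memory-aware list scheduler over a small operation graph. It is not ParDNN's algorithm, which the abstract does not describe. The function name `place_ops`, the per-operation cost and memory figures, the communication costs, and the static per-device memory cap are all illustrative assumptions. Memory use during real training varies over time, which this sketch ignores.

```python
from collections import deque


def place_ops(ops, edges, num_devices, mem_cap):
    """Greedily assign each operation of a DAG to a device.

    ops:   dict mapping op name -> (compute_time, memory_footprint)
    edges: list of (src, dst, comm_time); comm_time is paid only when
           src and dst end up on different devices
    Returns (placement, makespan). All inputs are hypothetical estimates.
    """
    preds = {name: [] for name in ops}
    succs = {name: [] for name in ops}
    indegree = {name: 0 for name in ops}
    for src, dst, comm in edges:
        preds[dst].append((src, comm))
        succs[src].append(dst)
        indegree[dst] += 1

    # Operations whose inputs are all scheduled, visited in topological order.
    ready = deque(sorted(n for n in ops if indegree[n] == 0))
    device_free_at = [0.0] * num_devices  # time each device becomes idle
    device_mem = [0.0] * num_devices      # memory already committed per device
    placement, finish = {}, {}

    while ready:
        op = ready.popleft()
        cost, mem = ops[op]
        best = None
        for dev in range(num_devices):
            # Hard constraint: skip devices that would exceed the memory cap.
            if device_mem[dev] + mem > mem_cap:
                continue
            # Inputs from another device arrive after a communication delay.
            data_ready = max(
                (finish[p] + (0.0 if placement[p] == dev else comm)
                 for p, comm in preds[op]),
                default=0.0,
            )
            end = max(device_free_at[dev], data_ready) + cost
            if best is None or end < best[0]:
                best = (end, dev)
        if best is None:
            raise ValueError(f"no device has enough memory for {op!r}")

        end, dev = best
        placement[op] = dev
        finish[op] = end
        device_free_at[dev] = end
        device_mem[dev] += mem
        for nxt in succs[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    return placement, max(finish.values())


# A diamond-shaped graph: a feeds b and c, which both feed d.
ops = {"a": (1.0, 4), "b": (2.0, 4), "c": (2.0, 4), "d": (1.0, 4)}
edges = [("a", "b", 0.5), ("a", "c", 0.5), ("b", "d", 0.5), ("c", "d", 0.5)]
placement, makespan = place_ops(ops, edges, num_devices=2, mem_cap=8)
print(placement, makespan)
```

With these made-up numbers, all four operations need 16 units of memory in total, so no single device with a cap of 8 can hold them. The sketch splits the graph across two devices and trades a little communication delay for feasibility. A greedy heuristic like this gives no optimality guarantee; it only illustrates the two competing objectives named above, meeting the memory constraints and keeping the schedule short.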