Summary

International Conference on Emerging Technologies for Communications

2020

Session Number: E2

Number: E2-4

A Novel Distributed Deep Learning Training Scheme Based on Distributed Skip Mesh List

Masaya Suzuki, Kimihiro Mizutani


Publication Date: 2020/12/2

Online ISSN: 2188-5079

DOI: 10.34385/proc.63.E2-4


Summary:
Complex regression and classification problems can be solved using large, complex neural networks. The scale of neural networks will continue to grow, and demand for schemes to construct them will increase. To train such a network, its model data are distributed across multiple computers (nodes), and each node computes the learning error (i.e., the gradient) of the model on its own training data. In the aggregation process, each node sends its gradient to the master node (the topmost node of the topology), and the master node refines its model with the aggregated gradients. Previous construction schemes for large-scale neural networks focus on how the nodes cooperate; for example, conventional schemes connect the nodes in a tree topology, on which model aggregation and distribution are executed iteratively. This cooperation on a tree topology prevents learning-data traffic from concentrating on the master node. While it reduces traffic, node management is a manual operation: operators must handle node insertion and deletion themselves. The tree topology also has a weak point in that each node has only one connection to a parent node; hence, the aggregation process is not robust against parent-node failure. To solve these two problems, we propose herein a novel distributed training management scheme based on the Distributed Skip Mesh List (DSML). In DSML, each node has connections to multiple parent nodes; therefore, aggregation stability is higher than in the conventional tree-topology scheme. In addition, the cost of a node insertion or deletion (i.e., the number of transmissions needed to handle it) is only O(log N), where N is the number of nodes. In the evaluation, we confirmed that the node insertion/deletion cost does not increase significantly, and that the elapsed time for aggregation is shorter than with the conventional scheme in environments where node deletions occur frequently.
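The abstract's core contrast, that a single parent connection loses gradients when the parent fails, while multiple parent connections let a node fall back, can be illustrated with a minimal sketch. This is not the paper's DSML algorithm; the `Node` class, the `aggregate` helper, and the scalar gradients are hypothetical simplifications showing only the fault-tolerance idea.

```python
class Node:
    """Hypothetical aggregation node that may have failed."""
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

def aggregate(gradients, parent_lists):
    """gradients[i] is leaf i's local gradient (a scalar here);
    parent_lists[i] is its ordered list of candidate parent nodes.
    A gradient is delivered if any candidate parent is alive.
    Returns (mean of delivered gradients, number of lost gradients)."""
    delivered, lost = [], 0
    for grad, parents in zip(gradients, parent_lists):
        if any(p.alive for p in parents):
            delivered.append(grad)
        else:
            lost += 1
    mean = sum(delivered) / len(delivered) if delivered else 0.0
    return mean, lost

p1 = Node("agg-1", alive=False)  # a failed parent node
p2 = Node("agg-2", alive=True)   # a surviving parent node
grads = [1.0, 2.0, 3.0, 4.0]

# Tree topology: each leaf has exactly one parent; two leaves used agg-1.
tree_mean, tree_lost = aggregate(grads, [[p1], [p1], [p2], [p2]])
# Multi-parent (DSML-like) redundancy: every leaf keeps both candidates.
dsml_mean, dsml_lost = aggregate(grads, [[p1, p2]] * 4)

print(tree_lost, dsml_lost)  # tree loses 2 gradients; multi-parent loses none
```

With a single parent per leaf, the failure of `agg-1` discards half the gradients and biases the aggregated mean; with two candidate parents per leaf, every gradient still reaches the master. The paper's O(log N) insertion/deletion cost concerns how such redundant links are maintained, which this sketch does not model.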