Presentation | 2016-08-09 A simulation study on fault tolerancy of parallel machine learning systems with parameter servers Mingxi Li, Yusuke Tanimura, Hidemoto Nakada |
---|---|
PDF Download Page | PDF download Page Link |
Abstract(in Japanese) | (See Japanese page) |
Abstract(in English) | Parallel computation is essential for making machine learning systems faster. There are two techniques for building parallel machine learning systems: the data-parallel method and the model-parallel method. In this paper, we discuss only the data-parallel method, in which a large number of parameter servers and computation servers communicate with each other to perform computation. Fault tolerance is a major problem in large-scale computation systems in general; however, there has been little discussion of the fault tolerance of parallel machine learning systems. We discuss the fault tolerance of parallel machine learning systems that use parameter servers. Parameter servers give the system extra redundancy and could double as checkpoint servers. We also quantitatively evaluate several fault tolerance methods using the parallel environment simulator SimGrid. |
Keyword(in Japanese) | (See Japanese page) |
Keyword(in English) | Fault Tolerancy / Parameter Server / Machine Learning / Simulations / Distributed Systems |
Paper # | CPSY2016-20,DC2016-17 |
Date of Issue | 2016-08-01 (CPSY), 2016-08-02 (DC) |
Conference Information | |
Committee | CPSY / DC / IPSJ-ARC |
---|---|
Conference Date | 2016/8/8(3days) |
Place (in Japanese) | (See Japanese page) |
Place (in English) | Kissei-Bunka-Hall (Matsumoto) |
Topics (in Japanese) | (See Japanese page) |
Topics (in English) | Parallel, Distributed and Cooperative Processing |
Chair | Yasuhiko Nakashima(NAIST) / Michiko Inoue(NAIST) |
Vice Chair | Koji Nakano(Hiroshima Univ.) / Hidetsugu Irie(Univ. of Tokyo) / Satoshi Fukumoto(Tokyo Metropolitan Univ.) |
Secretary | Koji Nakano(Fujitsu Labs.) / Hidetsugu Irie(NII) / Satoshi Fukumoto(Kyoto Sangyo Univ.) / (Tokyo Inst. of Tech.) |
Assistant | Takeshi Ohkawa(Utsunomiya Univ.) / Shinya Takameda(NAIST) |
Paper Information | |
Registration To | Technical Committee on Computer Systems / Technical Committee on Dependable Computing / Special Interest Group on System Architecture |
---|---|
Language | JPN |
Title (in Japanese) | (See Japanese page) |
Sub Title (in Japanese) | (See Japanese page) |
Title (in English) | A simulation study on fault tolerancy of parallel machine learning systems with parameter servers |
Sub Title (in English) | |
Keyword(1) | Fault Tolerancy |
Keyword(2) | Parameter Server |
Keyword(3) | Machine Learning |
Keyword(4) | Simulations |
Keyword(5) | Distributed Systems |
1st Author's Name | Mingxi Li |
1st Author's Affiliation | University of Tsukuba(Univ. of Tsukuba) |
2nd Author's Name | Yusuke Tanimura |
2nd Author's Affiliation | National Institute of Advanced Industrial Science and Technology(AIST) |
3rd Author's Name | Hidemoto Nakada |
3rd Author's Affiliation | National Institute of Advanced Industrial Science and Technology(AIST) |
Date | 2016-08-09 |
Paper # | CPSY2016-20,DC2016-17 |
Volume (vol) | vol.116 |
Number (no) | CPSY-177,DC-178 |
Page | pp.125-130(CPSY), pp.1-6(DC) |
#Pages | 6 |
Date of Issue | 2016-08-01 (CPSY), 2016-08-02 (DC) |