Presentation 2016-08-09
A simulation study on fault tolerancy of parallel machine learning systems with parameter servers
Mingxi Li, Yusuke Tanimura, Hidemoto Nakada
Abstract(in Japanese) (See Japanese page)
Abstract(in English) Parallel computation is essential for making machine learning systems faster. There are two techniques for building parallel machine learning systems: the data parallel method and the model parallel method. In this paper, we discuss only the data parallel method, in which a large number of parameter servers and computation servers communicate with each other to perform computation. Fault tolerance is a major problem for large-scale computation systems in general; however, there has been little discussion of the fault tolerance of parallel machine learning systems. We therefore discuss the fault tolerance of parallel machine learning systems that use parameter servers. Parameter servers give extra redundancy to the system and can double as checkpoint servers. We also quantitatively evaluate several fault tolerance methods using the parallel environment simulator SimGrid.
Keyword(in Japanese) (See Japanese page)
Keyword(in English) Fault Tolerancy / Parameter Server / Machine Learning / Simulations / Distributed Systems
Paper # CPSY2016-20,DC2016-17
Date of Issue 2016-08-01 (CPSY), 2016-08-02 (DC)

Conference Information
Committee CPSY / DC / IPSJ-ARC
Conference Date 2016/8/8(3days)
Place (in Japanese) (See Japanese page)
Place (in English) Kissei-Bunka-Hall (Matsumoto)
Topics (in Japanese) (See Japanese page)
Topics (in English) Parallel, Distributed and Cooperative Processing
Chair Yasuhiko Nakashima(NAIST) / Michiko Inoue(NAIST)
Vice Chair Koji Nakano(Hiroshima Univ.) / Hidetsugu Irie(Univ. of Tokyo) / Satoshi Fukumoto(Tokyo Metropolitan Univ.)
Secretary Koji Nakano(Fujitsu Labs.) / Hidetsugu Irie(NII) / Satoshi Fukumoto(Kyoto Sangyo Univ.) / (Tokyo Inst. of Tech.)
Assistant Takeshi Ohkawa(Utsunomiya Univ.) / Shinya Takameda(NAIST)

Paper Information
Registration To Technical Committee on Computer Systems / Technical Committee on Dependable Computing / Special Interest Group on System Architecture
Language JPN
Title (in Japanese) (See Japanese page)
Sub Title (in Japanese) (See Japanese page)
Title (in English) A simulation study on fault tolerancy of parallel machine learning systems with parameter servers
Sub Title (in English)
Keyword(1) Fault Tolerancy
Keyword(2) Parameter Server
Keyword(3) Machine Learning
Keyword(4) Simulations
Keyword(5) Distributed Systems
1st Author's Name Mingxi Li
1st Author's Affiliation University of Tsukuba(Univ. of Tsukuba)
2nd Author's Name Yusuke Tanimura
2nd Author's Affiliation National Institute of Advanced Industrial Science and Technology(AIST)
3rd Author's Name Hidemoto Nakada
3rd Author's Affiliation National Institute of Advanced Industrial Science and Technology(AIST)
Date 2016-08-09
Paper # CPSY2016-20,DC2016-17
Volume (vol) vol.116
Number (no) CPSY-177,DC-178
Page pp.125-130(CPSY), pp.1-6(DC)
#Pages 6
Date of Issue 2016-08-01 (CPSY), 2016-08-02 (DC)