Summary

International Technical Conference on Circuits/Systems, Computers and Communications

2016

Session Number:P3

Number:P3-7

Spark vs. Virtualized Spark: A Performance Analysis

Wenjing Jin, Jae W. Lee

pp.1019-1020

Publication Date:2016/7/10

Online ISSN:2188-5079

DOI:10.34385/proc.61.P3-7

Summary:
Apache Spark is an open-source framework for scalable big data processing. OpenStack is a popular virtualization framework that provides Infrastructure as a Service (IaaS) in the cloud. Deploying Spark on OpenStack provides many benefits, such as on-demand resource scaling, greater availability, and flexibility. However, virtualized Spark is likely to have very different performance characteristics from native Spark. This paper aims to quantify the cost of virtualization on a Spark cluster. Our experiments demonstrate that (i) virtualized Spark with four nodes is about 1.58x slower than native Spark, and (ii) network, CPU, and garbage collection (GC) all contribute to this slowdown. Overall, network waiting time and CPU time contribute the most to the increased execution time, while GC time shows the highest rate of increase.
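
The summary separates two measures: which component adds the most absolute time, and which component grows the fastest relative to its native value. The sketch below illustrates that distinction. All timings in it are hypothetical placeholders chosen for illustration, not measurements from the paper, and the names `slowdown` and `breakdown_delta` are assumptions introduced here.

```python
def slowdown(native_s, virtual_s):
    """Slowdown factor: virtualized run time divided by native run time."""
    return virtual_s / native_s


def breakdown_delta(native, virtual):
    """For each component, report the added seconds and the growth ratio."""
    report = {}
    for name in native:
        report[name] = {
            "added_s": virtual[name] - native[name],
            "growth": virtual[name] / native[name],
        }
    return report


# Hypothetical per-component timings in seconds (NOT from the paper).
native = {"network_wait": 40.0, "cpu": 50.0, "gc": 2.0}
virtual = {"network_wait": 70.0, "cpu": 75.0, "gc": 8.0}

report = breakdown_delta(native, virtual)

# Component that adds the most absolute time to the run.
largest_added = max(report, key=lambda k: report[k]["added_s"])

# Component whose time grows the most relative to its native value.
fastest_growth = max(report, key=lambda k: report[k]["growth"])

print(largest_added, fastest_growth)
```

With these placeholder values, a small component such as GC can have the highest growth ratio while larger components such as network waiting and CPU still account for most of the added time.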