
H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems

Authors

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Surendra Byna, Mike F Ringenburg

Publication date

2016

Journal

Cray User Group

Description

The Spark framework has been tremendously powerful for performing Big Data analytics in distributed data centers. However, using Spark to analyze large-scale scientific data on HPC systems poses several challenges. For instance, parallel file systems are shared among all compute nodes, in contrast to shared-nothing architectures. Additionally, Spark does not natively support accessing data stored in commonly used scientific data formats such as HDF5 and netCDF. Our study focuses on improving the I/O performance of Spark on HPC systems when reading and writing scientific data stored in HDF5/netCDF. We select several scientific use cases to drive the design of H5Spark, an efficient parallel I/O API for Spark on HPC systems that optimizes I/O performance and takes Lustre file system striping into account. We evaluate the performance of H5Spark on Cori, a Cray XC40 system located at NERSC.

Total citations

Cited by 34

Citations per year: 2016: 5, 2017: 6, 2018: 6, 2019: 5, 2020: 4, 2021: 4, 2022: 2, 2023: 1
