Authors
Galen M Shipman, David A Dillow, Douglas Fuller, Raghul Gunasekaran, Jason Hill, Youngjae Kim, Sarp Oral, Doug Reitz, James Simmons, Feiyi Wang
Publication date
2012/5
Journal
Proceedings of Cray User Group Conference (CUG 2012)
Description
When deployed in 2008/2009, the Spider system at the Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) was the world’s largest-scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF’s diverse computational environment, Spider has since become a blueprint for shared Lustre environments deployed worldwide. Designed to support the parallel I/O requirements of the Jaguar XT5 system and other smaller-scale platforms at the OLCF, the upgrade to the Titan XK6 heterogeneous system will begin to push the limits of Spider’s original design by mid-2013. With a doubling in total system memory and a 10x increase in FLOPS, Titan will require both higher bandwidth and larger total capacity. Our goal is to provide a 4x increase in total I/O bandwidth, from over 240 GB/sec today to 1 TB/sec, and a doubling in total capacity. While aggregate bandwidth and total capacity remain important capabilities, an equally important goal in our efforts is dramatically increasing metadata performance, currently the Achilles heel of parallel file systems at leadership scale. We present in this paper an analysis of our current I/O workloads, our operational experiences with the Spider parallel file systems, the high-level design of our Spider upgrade, and our efforts in developing benchmarks that synthesize our performance requirements based on our workload characterization studies.
Total citations
14 (2012–2020)
Scholar articles
GM Shipman, DA Dillow, D Fuller, R Gunasekaran… - Proceedings of Cray User Group Conference (CUG …, 2012