Authors
Galen Shipman, David Dillow, Sarp Oral, Feiyi Wang
Publication date
2009/5
Journal
Proceedings, Cray User Group (CUG) Conference, Atlanta, GA
Description
The Leadership Computing Facility (LCF) at Oak Ridge National Laboratory (ORNL) has a diverse portfolio of computational resources, ranging from a petascale XT4/XT5 simulation system (Jaguar) to numerous other systems supporting development, visualization, and data analytics. To support the vastly different I/O needs of these systems, Spider, a Lustre-based center-wide file system, was designed and deployed to provide over 240 GB/s of aggregate throughput with over 10 petabytes of formatted capacity. A multi-stage InfiniBand network, dubbed the Scalable I/O Network (SION), with over 889 GB/s of bisection bandwidth, was deployed as part of Spider to provide connectivity to our simulation, development, visualization, and other platforms. To our knowledge, at the time of writing, Spider is the largest and fastest POSIX-compliant parallel file system in production. This paper details the overall architecture of the Spider system, the challenges in deploying and initially testing a file system of this scale, and novel solutions to these challenges, which offer key insights for future file system design.
Total citations
[Chart: citations per year, 2009 to 2022]
Scholar articles
G Shipman, D Dillow, S Oral, F Wang - Proceedings, Cray User Group (CUG) Conference …, 2009