Authors
Galen Shipman, David Dillow, Sarp Oral, Feiyi Wang, Douglas Fuller, Jason Hill, Zhe Zhang
Publication date
2010/5/24
Journal
The 52nd Cray user group conference
Description
The Spider system at the Oak Ridge National Laboratory’s Leadership Computing Facility (OLCF) is the world’s largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF’s diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF’s diverse computational platforms, the aggregate performance and storage capacity of Spider exceed that of our previously deployed systems by a factor of 6x-240 GB/sec, and 17x-10 Petabytes, respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing the file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service alongside with our efforts to address them …
Total citations
20112012201320142015201620172018201920201263331234
Scholar articles
G Shipman, D Dillow, S Oral, F Wang, D Fuller, J Hill… - The 52nd Cray user group conference, 2010