Authors
Ross Miller, Jason Hill, David A Dillow, Raghul Gunasekaran, Galen M Shipman, Don Maxwell
Publication date
2010/5
Journal
Proceedings of Cray User Group Conference (CUG 2010)
Description
Operating computing systems, file systems, and associated networks at unprecedented scale poses unique challenges for fault monitoring, performance monitoring, and problem diagnosis. Conventional system monitoring tools are insufficient to process the increasingly large and diverse volume of performance and status log data produced by the world’s largest systems. In addition to the large data volume, the wide variety of systems employed by the largest computing facilities presents diverse information from multiple sources, further complicating analysis efforts. At leadership scale, new tool development is required to acquire, condense, correlate, and present status and performance data to systems staff for timely evaluation.
This paper details a set of system monitoring tools developed by the authors and utilized by systems staff at Oak Ridge National Laboratory’s Leadership Computing Facility, which includes the Cray XT5 Jaguar. These tools include utilities to correlate I/O performance and event data with specific systems, resources, and jobs. Where possible, existing utilities are incorporated to reduce development effort and increase community participation. Future work may include additional integration among tools and implementation of fault-prediction tools.
Total citations
[Chart: citations per year, 2010 to 2023]