Monitoring tools for large scale systems

Authors

Ross Miller, Jason Hill, David A Dillow, Raghul Gunasekaran, Galen M Shipman, Don Maxwell

Publication date

2010/5

Journal

Proceedings of Cray User Group Conference (CUG 2010)

Description

Operating computing systems, file systems, and associated networks at unprecedented scale poses unique challenges for fault monitoring, performance monitoring, and problem diagnosis. Conventional system monitoring tools are insufficient to process the increasingly large and diverse volume of performance and status log data produced by the world’s largest systems. In addition to the large data volume, the wide variety of systems employed by the largest computing facilities presents diverse information from multiple sources, further complicating analysis efforts. At leadership scale, new tool development is required to acquire, condense, correlate, and present status and performance data to systems staff for timely evaluation.

This paper details a set of system monitoring tools developed by the authors and utilized by systems staff at Oak Ridge National Laboratory’s Leadership Computing Facility, which includes the Cray XT5 Jaguar. These tools include utilities to correlate I/O performance and event data with specific systems, resources, and jobs. Where possible, existing utilities are incorporated to reduce development effort and increase community participation. Future work may include additional integration among tools and implementation of fault-prediction tools.

Total citations

Cited by 37

Citations per year: 2010: 1, 2011: 1, 2012: 2, 2013: 2, 2014: 5, 2015: 4, 2016: 2, 2017: 7, 2018: 2, 2019: 4, 2020: 1, 2021: 2, 2022: 3, 2023: 1
