Authors
H Sarp Oral, Feiyi Wang, David A Dillow, Ross G Miller, Galen M Shipman, Don E Maxwell, Jeffrey L Becklehimer, Jeffrey M Larkin, David Henseler
Publication date
2010/1/1
Publisher
Oak Ridge National Lab.(ORNL), Oak Ridge, TN (United States). National Center for Computational Sciences (NCCS)
Description
Operating system (OS) noise is defined as interference generated by the OS that prevents a compute core from performing "useful" work. Compute node kernel daemons, network interfaces, and other OS-related services are major sources of such interference. This interference on individual compute cores can vary in duration and frequency, and can cause de-synchronization (jitter) in collective communication tasks and thus result in variable (degraded) overall parallel application performance. This behavior is more observable in large-scale applications using certain types of collective communication primitives, such as MPI_Allreduce. This paper presents our effort towards reducing the overall effect of OS noise on our large-scale parallel applications. Our tests were performed on the quad-core Jaguar, the Cray XT5 at the Oak Ridge National Laboratory Leadership Computing Facility (OLCF). At the time of these tests, Jaguar was a 1.4 PFLOPS supercomputer with 149,504 compute cores and 8 cores per node. We aggregated OS noise sources onto a single core for each node. The scientific application was then run on six of the remaining cores in each node. Our results show that we were able to improve the MPI_Allreduce performance by two orders of magnitude. We demonstrated up to a 30% boost in the performance of the Parallel Ocean Program (POP) using this technique.
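The scaling argument in the description (noise on any one core delays every participant in a collective) can be illustrated with a small sketch. This is a toy model, not the paper's method or measurement code: it assumes each rank is independently interrupted with some probability per step and that an interruption has a fixed duration. The function names and all numeric values other than the 149,504 core count are hypothetical.

```python
def prob_any_rank_delayed(p_noise: float, n_ranks: int) -> float:
    """Probability that at least one of n_ranks is interrupted during one step.

    Assumes (hypothetically) that interruptions are independent across ranks
    and that each rank is interrupted with probability p_noise per step.
    """
    return 1.0 - (1.0 - p_noise) ** n_ranks


def expected_step_time(base: float, noise: float, p_noise: float, n_ranks: int) -> float:
    """Expected duration of one synchronized step under the toy model.

    A collective such as an allreduce finishes only when the slowest rank
    arrives, so a single interrupted rank adds the full noise duration.
    """
    return base + noise * prob_any_rank_delayed(p_noise, n_ranks)


if __name__ == "__main__":
    # p_noise = 1e-4 is an illustrative value, not a measured one.
    for n in (8, 1024, 149_504):
        print(n, prob_any_rank_delayed(1e-4, n))
```

Under these assumptions, an interruption that is rare on a single core becomes nearly certain somewhere in the machine at full scale, which is consistent with the description's remark that the effect is more observable in large-scale applications.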
Total citations
[Chart: citations per year, 2009 to 2024]
Scholar articles
HS Oral, F Wang, DA Dillow, RG Miller, GM Shipman… - 2010