Authors
Onur Kayıran, Adwait Jog, Mahmut T Kandemir, Chita R Das
Publication date
2013/9/7
Conference
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques
Pages
157-166
Publisher
IEEE
Description
General-purpose graphics processing units (GPGPUs) are at their best in accelerating computation by exploiting abundant thread-level parallelism (TLP) offered by many classes of HPC applications. To facilitate such high TLP, emerging programming models like CUDA and OpenCL allow programmers to create work abstractions in terms of smaller work units, called cooperative thread arrays (CTAs). CTAs are groups of threads and can be executed in any order, thereby providing ample opportunities for TLP. State-of-the-art GPGPU schedulers allocate the maximum possible number of CTAs per core (limited by available on-chip resources) to enhance performance by exploiting TLP. However, we demonstrate in this paper that executing the maximum possible number of CTAs on a core is not always the optimal choice from the performance perspective. A high number of concurrently executing threads might cause more …
Total citations
[Bar chart: citations per year, 2012 to 2024]
Scholar articles
O Kayıran, A Jog, MT Kandemir, CR Das - Proceedings of the 22nd international conference on …, 2013