We continue our short series of reports from the SC11 conference, based on input from our colleague Will Sawyer. This posting is dedicated to advanced topics in heterogeneous programming with OpenCL.
Trying to find out whether OpenCL is the right horse to bet on, we attended Advanced Topics in Heterogeneous Programming with OpenCL, given by Tim Mattson (Intel), Ben Gaster (AMD), Ian Buck (NVIDIA), Peng Wang (NVIDIA), and Mike Houston (AMD).
Tim gave a quick summary of the OpenCL introduction (see the previous post) and covered the limited synchronization model: there are synchronization primitives within work groups, but between groups, synchronization happens only at the level of commands in the command queue. He also reminded us of the event model: commands return events and can obey wait lists of other events.
Ben discussed the mapping of hardware platforms to OpenCL, using the AMD 5870 and Cell BE as examples.
Ian discussed the Fermi (GTX 480) architecture. The mapping was not well presented, leading Tim to ask exactly what values the OpenCL device-query commands will return on the various architectures.
Mike answered that question for all the architectures; for example, the Fermi supports 15 work groups, with 32 work items per group. To fill these chips, you typically need tens of thousands of work items.
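As a rough back-of-the-envelope check of that "tens of thousands" figure, here is a sketch of the arithmetic. The per-SM resident-warp limit of 48 for Fermi is our assumption for illustration, not a number given in the session:

```python
# Rough occupancy arithmetic for a Fermi-class GPU (GTX 480).
# Values are illustrative assumptions: 15 compute units (SMs),
# 32 work items executing in lockstep, up to 48 resident warps per SM.
compute_units = 15          # what CL_DEVICE_MAX_COMPUTE_UNITS would report
warp_size = 32              # work items per hardware scheduling unit
resident_warps_per_cu = 48  # assumed Fermi per-SM limit

threads_to_fill_chip = compute_units * warp_size * resident_warps_per_cu
print(threads_to_fill_chip)  # 23040 -> indeed "tens of thousands"
```

Under these assumptions the chip keeps on the order of 23,000 work items in flight, which matches the rule of thumb quoted above.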
Peng presented his implementation of a conjugate gradient solver for sparse matrices, closely analogous to Nathan Bell’s and Michael Garland’s (both from NVIDIA) work on CUSP. The hybrid (ELLPACK+COO) format performs best for almost all matrices tested. Conversion from/to the typical CSR format is expensive, but it can be parallelized across multiple cores. Unlike the CUSP team, Peng has no plans to wrap this work into a library, which seems quite unfortunate.
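For readers unfamiliar with the hybrid format, here is a minimal pure-Python sketch (our own illustration, not Peng’s or CUSP’s code) of the idea: each row stores up to a fixed number of entries in a regular, GPU-friendly ELLPACK part, and any overflow entries spill into a COO list that is handled in a separate pass.

```python
# Minimal ELLPACK+COO ("hybrid") sparse matrix-vector product, y = A*x.
# Illustrative sketch only; function names are ours, not from CUSP.

def csr_to_hyb(indptr, indices, data, ell_width):
    """Split a CSR matrix into an ELL part (ell_width entries per row,
    padded with column index -1) and a COO overflow list."""
    n_rows = len(indptr) - 1
    ell_cols = [[-1] * ell_width for _ in range(n_rows)]
    ell_vals = [[0.0] * ell_width for _ in range(n_rows)]
    coo = []  # (row, col, val) triples for entries beyond ell_width
    for i in range(n_rows):
        for k, j in enumerate(range(indptr[i], indptr[i + 1])):
            if k < ell_width:
                ell_cols[i][k] = indices[j]
                ell_vals[i][k] = data[j]
            else:
                coo.append((i, indices[j], data[j]))
    return ell_cols, ell_vals, coo

def hyb_matvec(ell_cols, ell_vals, coo, x):
    """y = A*x: a regular ELL sweep plus a COO fix-up pass."""
    y = [0.0] * len(ell_cols)
    for i in range(len(ell_cols)):
        for c, v in zip(ell_cols[i], ell_vals[i]):
            if c >= 0:
                y[i] += v * x[c]
    for i, c, v in coo:
        y[i] += v * x[c]
    return y
```

For the 3x3 matrix [[1,0,2],[0,3,0],[4,5,6]] in CSR form (`indptr=[0,2,3,6]`, `indices=[0,2,1,0,1,2]`) with `ell_width=2`, only the last entry of the dense third row spills into the COO part; multiplying by x = [1, 1, 1] gives y = [3, 3, 15]. The point of the split is that the ELL sweep has a fixed trip count per row, which maps well to GPU work items, while the irregular rows do not force padding on the whole matrix.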
Mike Houston presented the discrete Fourier transform for GPUs, which illuminated many of the performance issues but showed very little actual OpenCL code. The discursive nature of this session symbolized the lack of maturity of OpenCL. Clearly, OpenCL has a lot of potential, but it is a work in progress. Other paradigms, such as CUDA, may be proprietary and less solid, but they are farther along.