Slidecast 1/3 – Course on Getting the Best Out of Multi-core

The Swiss National Supercomputing Centre (CSCS) in Lugano, Switzerland, held the course “Getting the best out of multi-core” on December 10-12, 2012.

Modern multi-core x86 processors have 100 times more peak performance than similar single-core processors from ten years ago, but most applications have not been able to take advantage of this power. The three-day hands-on course showed how to get the most out of Intel Sandy Bridge and AMD Interlagos processors by investigating the following topics:

  • Code vectorization
    • Understanding processor architecture and the potential speedup from vectorization
    • Using compiler feedback to understand where vectorization is and is not achieved
    • Using compiler feedback, compiler options and pragmas to improve vectorization
  • Tuning for the cache hierarchy
    • Understanding the cache and memory hierarchy on modern multi-core processors
    • Analysing performance reports to determine poor cache utilisation
    • Code changes and compiler options to improve cache utilisation
  • Multi-threading
    • An example of a threading model – OpenMP
    • Use of tools to help produce multi-threaded code
    • Understanding of threading pitfalls that affect code correctness
    • Understanding of threading performance issues on multi-socket multi-core nodes
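
As a rough illustration of the vectorization and multi-threading topics listed above, here is a minimal C sketch. It is not taken from the course material: the function names `scaled_add` and `dot` are hypothetical, and it assumes a C99 compiler with OpenMP support (for example `-fopenmp` with GCC). The first loop uses `restrict` to tell the compiler the arrays do not overlap, which removes a common obstacle to auto-vectorization. The second shows a classic correctness pitfall, a shared accumulator, handled with an OpenMP `reduction` clause.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n).
   restrict promises the compiler that x and y do not alias,
   so the loop body can be vectorized safely. */
void scaled_add(size_t n, double a,
                const double *restrict x, double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Iterations are independent, so they can be split across threads. */
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Dot product of x and y. Without the reduction clause every thread
   would update the shared variable sum concurrently (a data race).
   reduction(+:sum) gives each thread a private partial sum and
   combines the partial sums when the loop ends. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

If the code is compiled without OpenMP enabled, the pragmas are ignored and both functions run serially with the same results, which makes it easy to compare timings with and without threading.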

The course used powerful tools to help understand code performance and to introduce vectorization and threading: the Cray tools CrayPAT, Apprentice2 and Reveal on a Cray system, and Intel tools on a Sandy Bridge cluster. In particular, the Reveal tool was used to analyse compiler optimisations and performance reports, and its OpenMP directive insertion options were used to help introduce multi-threading into codes.

The course was rich in hands-on practical sessions that demonstrated these tools. It also let developers see the critical effects of poor resource utilization, methods to alleviate these problems, and best practices for implementing multi-process multi-threaded codes.

The course also included a demonstration of how these techniques can be applied to the Intel Xeon Phi (also known as MIC – Many Integrated Core) architecture.

Welcome – Introduction to Multi-core – Architecture of Modern Multi-Core Node

Neil Stringfellow, CSCS

Cache Hierarchy and TLB

Neil Stringfellow, CSCS

Vectorization

Neil Stringfellow, CSCS

Vectorizing with Intel Compilers

Benjamin Cumming, CSCS

Code Vectorization: Tools and Utilities that Can Help

Sadaf Alam, CSCS