What does it take to produce application code that performs as close as possible to a parallel architecture's peak compute or memory performance? Programmers of high-performance architectures contemplate this question regularly, since using such systems efficiently means solving problems faster, or solving larger and more complex problems.
This question fundamentally changes the approach to programming.
Programming is no longer simply a matter of specifying an algorithm correctly; it expands to understanding and exploiting features of the target architecture in every aspect of an application: the choice of algorithm, data structures and data layout, where to exploit parallelism, how to make the best use of the memory hierarchy, and how to avoid costly communication and synchronization between cooperating computations. Building applications while addressing these performance and scalability concerns is difficult, and it frequently leads to low-level software that exposes architectural details.