- Venue and location:
- The 27th International Conference on Supercomputing (ICS2013)
- University of Oregon, Eugene, Oregon, USA
- Duration and schedule:
- Half Day (Monday, June 10th 13:30 - 17:00)
- Speakers:
- Aurelien Bouteiller
- Piotr Luszczek

Today, a desktop computer with a multicore processor and a GPU or many-core accelerator can already deliver a teraflop of performance. This tremendous computational power can be fully utilized only with the appropriate software infrastructure. Most often, a major part of the computational effort in scientific and engineering computing goes toward solving linear algebra sub-problems. This tutorial presents the design and optimization techniques behind state-of-the-art numerical libraries for dense linear algebra.

The *main objective* of this tutorial is to show specific methods and their
implementations that deal with portability and scalability of high performance
codes. The use case of numerical linear algebra serves as a convenient example
of how these techniques achieve their main objective -- maximizing the
efficiency with respect to the metric of choice: peak floating-point
performance of the machine.

The tutorial consists of three parts. The first part focuses on the challenges
of multicore programming. We show some of the ways of dealing with the pervasive
need for parallelism, pitfalls of concurrency, aspects of affinity and locality,
varying task granularity, load imbalance, and separation of concerns. We
compare our scheduling approach based on DAGs (Directed Acyclic Graphs) against
commonly known standards, libraries, and languages such as OpenMP and its
tasks, the Cilk extensions to C, Intel's Threading Building Blocks for C++, and
Apple's Grand Central Dispatch. The concepts are illustrated by the actual
techniques applied within the **PLASMA** (Parallel Linear Algebra Software for
Multicore Architectures) and **QUARK** (QUeueing And Runtime for Kernels) projects.
The second part discusses GPU and coprocessor acceleration issues, including
software heterogeneity, the system-bus bottleneck, and the overlapping
techniques available in the various ports of the **MAGMA** (Matrix Algebra on GPU
and Multicore Architectures) project. Finally, the third part treats the
ongoing efforts in linear algebra software for distributed-memory machines
with heterogeneous nodes: the **PARSEC** (Parallel Runtime Scheduling and Execution
Controller) and **DPLASMA** projects. The key concepts covered in this part are
communication-computation overlap, modern techniques for flow control and data
distribution, and dependence discovery and tracking through both
compiler-oriented methods and runtime discovery.
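To give a flavor of the runtime dependence discovery discussed above, the sketch below shows how a superscalar-style task runtime can infer a DAG from a sequence of task submissions annotated with data-access modes. This is an illustrative toy in Python, not the actual QUARK, PARSEC, or PLASMA API: the `Runtime` class, `insert_task` interface, and tile names are all invented for the example.

```python
# Illustrative sketch of runtime dependence discovery (NOT the QUARK/PARSEC API):
# tasks are submitted in sequential order with (data, access-mode) annotations,
# and the runtime turns RAW, WAR, and WAW hazards into DAG edges.
from collections import defaultdict

READ, WRITE = "R", "W"

class Runtime:
    def __init__(self):
        self.tasks = []                   # (func, data keys, dependency set)
        self.last_writer = {}             # data key -> id of last writing task
        self.readers = defaultdict(list)  # data key -> readers since last write

    def insert_task(self, func, *accesses):
        """Submit a task; each access is a (data_key, mode) pair."""
        tid = len(self.tasks)
        deps = set()
        for key, mode in accesses:
            if key in self.last_writer:
                deps.add(self.last_writer[key])  # RAW or WAW edge
            if mode == WRITE:
                deps.update(self.readers[key])   # WAR edges
                self.last_writer[key] = tid
                self.readers[key] = []
            else:
                self.readers[key].append(tid)
        self.tasks.append((func, [k for k, _ in accesses], deps))
        return tid

    def run(self):
        """Execute tasks in an order respecting the discovered DAG.

        A real runtime dispatches ready tasks to worker threads and overlaps
        them; for clarity this is a sequential topological traversal.
        """
        done, order = set(), []
        while len(done) < len(self.tasks):
            for tid, (func, keys, deps) in enumerate(self.tasks):
                if tid not in done and deps <= done:
                    func(*keys)
                    done.add(tid)
                    order.append(tid)
        return order

# Hypothetical tile operations on invented tile names "A00" and "A10":
rt = Runtime()
rt.insert_task(lambda a: None, ("A00", WRITE))               # e.g., factor A00
rt.insert_task(lambda a, b: None, ("A00", READ), ("A10", WRITE))  # update A10
rt.run()  # the second task waits on the first (RAW edge on A00)
```

A production runtime differs mainly in what happens after the edges are known: ready tasks are dispatched to worker threads (or devices) as their predecessors complete, which is what enables the load balancing and communication-computation overlap the tutorial covers.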

The *target audience* consists mainly of users of parallel machines who are
interested in advanced optimization techniques for distributed-memory
heterogeneous architectures, as well as users of dense linear algebra libraries.

The *prerequisite knowledge* includes a basic understanding of modern hardware
and familiarity with parallel software for multicore processors and hardware
accelerators.

- Speaker: Luszczek, Duration: 1 hour
- Programming and optimizing for multicore systems with NUMA memory architecture with PLASMA
- Break: 15 min.
- Speaker: Luszczek, Duration: 1 hour
- Programming and optimizing for hardware accelerators with MAGMA
- Break: 15 min.
- Speaker: Bouteiller, Duration: 1 hour
- Distributed memory programming and optimization with PARSEC and DPLASMA