Minh Quan Ho - Optimization of data transfer on many-core processors, applied to dense linear algebra and stencil computations

Thursday, 5 Jul 2018, 11:45
Organized by: Minh Quan Ho
Speaker: Minh Quan Ho

Jury members:

  • Bernard Tourancheau - Professor, Université Grenoble Alpes - Thesis supervisor
  • Joël Falcou - Associate Professor, Université Paris-Sud - Reviewer
  • Francisco Daniel Igual Pena - Assistant Professor, Universidad Complutense de Madrid - Reviewer
  • Christian Obrecht - Associate Professor, INSA Lyon - Examiner
  • Benoît Dupont De Dinechin - Chief Technology Officer, Kalray S.A. - Examiner
  • Raphaël Couturier - Professor, Université Bourgogne Franche-Comté - Examiner

 

 

The upcoming exascale target in high-performance computing (HPC) and disruptive advances in artificial intelligence have given rise to alternative, non-conventional many-core architectures that offer the energy efficiency typical of embedded systems while providing the same software ecosystem as classic HPC platforms. A key enabler of energy-efficient computing on many-core architectures is the exploitation of data locality, specifically the use of scratchpad memories in combination with DMA engines in order to overlap computation and communication. Such a software paradigm poses considerable programming challenges for both the vendor and the application developer. In this thesis, we tackle the memory transfer and performance issues, as well as the programming challenges, of memory- and compute-intensive HPC applications on the Kalray MPPA many-core architecture.

With the first, memory-bound use case of the lattice Boltzmann method (LBM), we provide generic and fundamental techniques for decomposing three-dimensional iterative stencil problems onto clustered many-core processors fitted with scratchpad memories and DMA engines. The DMA-based streaming and overlapping algorithm we developed delivers a 33% performance gain over the default cache-based implementation. High-dimensional stencil computation suffers from a serious I/O bottleneck and from limited on-chip memory space. We therefore developed a new in-place LBM propagation algorithm, which halves the memory footprint and yields 1.5 times higher performance-per-byte efficiency than the state-of-the-art out-of-place algorithm.
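
The general idea of streaming tiles through a small local memory while overlapping transfers with computation can be illustrated with a minimal double-buffering sketch. This is not the algorithm developed in the thesis: it uses a one-dimensional 3-point averaging stencil instead of a 3D LBM kernel, and a Python worker thread stands in for the DMA engine. All names (`fetch_tile`, `smooth_tile`, `stream_stencil`) are hypothetical, and the sketch only shows the control structure, not real performance behavior.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_tile(grid, start, size, halo=1):
    """Stand-in for an asynchronous DMA 'get' (assumption, not a real API):
    copy one tile plus its halo cells into a small local buffer."""
    lo = max(start - halo, 0)
    hi = min(start + size + halo, len(grid))
    return lo, grid[lo:hi]  # slicing copies, like a transfer into scratchpad


def smooth_tile(grid_len, start, size, lo, local):
    """Apply a 3-point averaging stencil to the cells owned by one tile.
    Boundary cells reuse their own value in place of the missing neighbor."""
    out = []
    for i in range(start, min(start + size, grid_len)):
        center = local[i - lo]
        left = local[i - 1 - lo] if i > 0 else center
        right = local[i + 1 - lo] if i < grid_len - 1 else center
        out.append((left + center + right) / 3.0)
    return out


def stream_stencil(grid, tile_size):
    """Process the grid tile by tile, requesting tile k+1 before computing tile k."""
    if not grid:
        return []
    starts = list(range(0, len(grid), tile_size))
    result = []
    with ThreadPoolExecutor(max_workers=1) as dma:  # one worker plays the DMA engine
        pending = dma.submit(fetch_tile, grid, starts[0], tile_size)
        for k, start in enumerate(starts):
            lo, local = pending.result()  # wait for the current tile to arrive
            if k + 1 < len(starts):       # launch the next transfer early
                pending = dma.submit(fetch_tile, grid, starts[k + 1], tile_size)
            result.extend(smooth_tile(len(grid), start, tile_size, lo, local))
    return result
```

Because the next tile is requested before the current one is processed, two local buffers are alive at any time, which is the usual memory cost of hiding transfer latency behind computation.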

On the compute-intensive side, with dense linear algebra computations, we build an optimized matrix multiplication benchmark based on the exploitation of scratchpad memory and efficient asynchronous DMA communication. These techniques are then extended to a DMA module of the BLIS framework, which allows us to instantiate an optimized and portable level-3 BLAS numerical library on any DMA-based architecture in fewer than 100 lines of code. We achieve 75% of peak performance on the MPPA processor with the matrix multiplication operation (GEMM) from the standard BLAS library, without having to write thousands of lines of laboriously optimized code for the same result.
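
As a rough illustration of why matrix multiplication maps well onto scratchpad memory, the sketch below computes C = A·B block by block, copying each operand block into a local buffer before using it. It is a simplified assumption-laden example, not the thesis benchmark and not the BLIS API: `copy_block` and `blocked_gemm` are hypothetical names, the copies are synchronous, and plain Python lists stand in for on-chip buffers.

```python
def copy_block(M, r0, c0, rows, cols):
    """Stand-in for a DMA 'get' of one sub-matrix into local memory
    (hypothetical helper; blocks at the matrix edge come out smaller)."""
    return [row[c0:c0 + cols] for row in M[r0:r0 + rows]]


def blocked_gemm(A, B, block=2):
    """Return C = A * B, computed block by block on local copies of the operands.
    Assumes A is m x k and B is k x n with k >= 1."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            # The accumulator tile stays local for the whole loop over k.
            acc = [[0.0] * min(block, n - j0) for _ in range(min(block, m - i0))]
            for p0 in range(0, k, block):
                a = copy_block(A, i0, p0, block, block)  # block of A
                b = copy_block(B, p0, j0, block, block)  # block of B
                for i, a_row in enumerate(a):
                    for p, a_ip in enumerate(a_row):
                        for j, b_pj in enumerate(b[p]):
                            acc[i][j] += a_ip * b_pj
            # Stand-in for a DMA 'put': write the finished tile back once.
            for i, acc_row in enumerate(acc):
                C[i0 + i][j0:j0 + len(acc_row)] = acc_row
    return C
```

Each fetched block is reused for many multiply-add operations, so the ratio of arithmetic to data movement grows with the block size; in a real implementation the operand copies would be issued asynchronously so that they overlap with the inner loops.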