Performance Optimization and Parallelization of a Code for Solving Partial Differential Equations on Sparse Grids


Prof. Dr. Christoph Pflaum, Riccarda Scherner-Grießhammer
Chair for Computer Science 10 – System Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg

Project Summary

In the C++ finite element library entitled Expression Templates for Partial Differential Equations on Sparse Grids (ExPDESG), an algorithm for the matrix-vector multiplication of stiffness matrices is implemented for the solution of partial differential equations with variable coefficients on locally adaptive sparse grids.

With the support of KONWIHR funding, a hotspot analysis was carried out on the Fritz cluster using the Intel VTune Profiler. In this project, we decided to focus on optimizing and parallelizing the matrix-vector multiplication and the Jacobi preconditioning of the underlying discretization.
To optimize the performance of the matrix-vector multiplication, a hybrid parallelization strategy was employed that combines shared-memory (OpenMP) and distributed-memory (MPI) parallelization. Two new approaches were implemented, which significantly reduced the runtime. First, a new cell-based data structure was developed, which allows the local stiffness matrices to be calculated on cells. The calculation of the local stiffness matrices is then distributed via OpenMP. Second, an MPI parallelization of the 2^d independent cases of prolongations and restrictions was employed, where d is the dimension of the grid. However, a load imbalance among the MPI tasks was identified, and we therefore implemented dynamic MPI scheduling.
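
The cell-based idea can be illustrated with a minimal sketch. It is not ExPDESG code: it assumes a uniform 1D grid with linear elements instead of an adaptive sparse grid with prewavelets, and the names Cell, localStiffness, makeUniformCells and applyStiffness are hypothetical. Each cell computes its own small local stiffness matrix, and the loop over cells is distributed with OpenMP. Because neighbouring cells share nodes, the scatter into the result vector is protected with an atomic update.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical cell record: one 1D element and its two global node indices.
struct Cell {
    std::array<std::size_t, 2> nodes;
    double left;   // left end point of the cell
    double right;  // right end point of the cell
};

using Local2x2 = std::array<std::array<double, 2>, 2>;

// Local stiffness matrix of -(k u')' for linear elements; the variable
// coefficient k is sampled at the cell midpoint (an assumed quadrature rule).
Local2x2 localStiffness(const Cell& c, const std::function<double(double)>& k) {
    const double h = c.right - c.left;
    const double s = k(0.5 * (c.left + c.right)) / h;
    Local2x2 m{};
    m[0][0] = s;  m[0][1] = -s;
    m[1][0] = -s; m[1][1] = s;
    return m;
}

// Uniform partition of [0, 1] into n cells (stand-in for the adaptive grid).
std::vector<Cell> makeUniformCells(std::size_t n) {
    assert(n > 0);
    const double h = 1.0 / static_cast<double>(n);
    std::vector<Cell> cells(n);
    for (std::size_t i = 0; i < n; ++i) {
        cells[i] = Cell{{i, i + 1}, i * h, (i + 1) * h};
    }
    return cells;
}

// y = A x, evaluated cell by cell without assembling the global matrix A.
std::vector<double> applyStiffness(const std::vector<Cell>& cells,
                                   const std::function<double(double)>& k,
                                   const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    const long n = static_cast<long>(cells.size());
    #pragma omp parallel for schedule(static)
    for (long c = 0; c < n; ++c) {
        const Cell& cell = cells[static_cast<std::size_t>(c)];
        const Local2x2 a = localStiffness(cell, k);
        for (int i = 0; i < 2; ++i) {
            double contrib = 0.0;
            for (int j = 0; j < 2; ++j) {
                contrib += a[i][j] * x[cell.nodes[j]];
            }
            // Neighbouring cells write to the same node, so update atomically.
            #pragma omp atomic
            y[cell.nodes[i]] += contrib;
        }
    }
    return y;
}
```

Compiled without OpenMP support, the pragmas are ignored and the loop runs serially with the same result. In higher dimensions the local matrices are larger and more costly to compute, which is where distributing the cell loop is expected to pay off.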
To optimize the Jacobi preconditioning, we tested two parallelization strategies. The first distributes the calculation of the diagonal entries of the stiffness matrix via OpenMP. The second is a hybrid approach that distributes the diagonal entries via MPI and then uses OpenMP to parallelize the calculation of the 5^d independent evaluations of the bilinear form, which arise from the use of prewavelets.
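
The first strategy can be sketched as follows, again under simplifying assumptions that are not taken from ExPDESG: a uniform 1D grid on [0, 1] with nodal hat functions instead of prewavelets, midpoint sampling of the coefficient, and the hypothetical function names stiffnessDiagonal and applyJacobi. Each diagonal entry depends only on the cells adjacent to its node, so the entries are independent and the loop over them can be distributed with OpenMP without any synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Diagonal of the stiffness matrix of -(k u')' on a uniform grid with n cells.
// Entry i is a(phi_i, phi_i), the sum of k/h over the cells touching node i.
std::vector<double> stiffnessDiagonal(std::size_t n,
                                      const std::function<double(double)>& k) {
    assert(n > 0);
    const double h = 1.0 / static_cast<double>(n);
    const long m = static_cast<long>(n) + 1;  // number of nodes
    std::vector<double> diag(static_cast<std::size_t>(m), 0.0);
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < m; ++i) {
        double d = 0.0;
        if (i > 0) d += k((i - 0.5) * h) / h;      // contribution of left cell
        if (i < m - 1) d += k((i + 0.5) * h) / h;  // contribution of right cell
        diag[static_cast<std::size_t>(i)] = d;
    }
    return diag;
}

// Jacobi preconditioning step: z = D^{-1} r with D the diagonal of A.
std::vector<double> applyJacobi(const std::vector<double>& diag,
                                const std::vector<double>& r) {
    assert(diag.size() == r.size());
    std::vector<double> z(r.size(), 0.0);
    const long m = static_cast<long>(r.size());
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < m; ++i) {
        const std::size_t idx = static_cast<std::size_t>(i);
        z[idx] = r[idx] / diag[idx];
    }
    return z;
}
```

In the hybrid variant described above, the outer loop over diagonal entries would be split across MPI ranks and the inner evaluations of the bilinear form would take the place of the two cell contributions in this sketch.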

As a result of these performance optimizations and parallelizations, the runtime of a matrix-vector multiplication and of the Jacobi preconditioning on a 6-dimensional sparse grid with 10625 grid points was significantly reduced. These strategies can also be used when implementing numerical quadrature for the evaluation of the local stiffness matrix.