%0 Conference Paper
%B 52nd International Conference on Parallel Processing (ICPP 2023)
%D 2023
%T O(N) distributed direct factorization of structured dense matrices using runtime systems
%A Sameer Deshmukh
%A Rio Yokota
%A George Bosilca
%A Qinxiang Ma
%I ACM
%C Salt Lake City, Utah
%8 2023-08
%@ 9798400708435
%G eng
%U https://dl.acm.org/doi/proceedings/10.1145/3605573
%R 10.1145/3605573.3605606

%0 Journal Article
%J The International Journal of High Performance Computing Applications
%D 2019
%T Distributed-Memory Lattice H-Matrix Factorization
%A Ichitaro Yamazaki
%A Akihiro Ida
%A Rio Yokota
%A Jack Dongarra
%X We parallelize the LU factorization of a hierarchical low-rank matrix (ℋ-matrix) on a distributed-memory computer. This is much more difficult than the ℋ-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. The block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at the price of losing the near-linear complexity of the ℋ-matrix factorization. In this work, we propose to factorize the matrix using a “lattice ℋ-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonal and off-diagonal) in the ℋ-matrix format. These blocks stored in the ℋ-matrix format are referred to as lattices. Thus, the lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of ℋ-matrix factorization. We first compare factorization performance using the ℋ-matrix, BLR, and lattice ℋ-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the ℋ-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice ℋ-matrix factorizations on distributed-memory computers. Our performance results demonstrate that, compared with BLR, the lattice format, with its lower cost of factorization, may lead to faster factorization on distributed-memory computers.
%B The International Journal of High Performance Computing Applications
%V 33
%P 1046–1063
%8 2019-08
%G eng
%N 5
%R 10.1177/1094342019861139

%0 Conference Paper
%B IEEE International Parallel and Distributed Processing Symposium (IPDPS)
%D 2018
%T Analyzing Performance of BiCGStab with Hierarchical Matrix on GPU Clusters
%A Ichitaro Yamazaki
%A Ahmad Abdelfattah
%A Akihiro Ida
%A Satoshi Ohshima
%A Stanimire Tomov
%A Rio Yokota
%A Jack Dongarra
%X ppohBEM is an open-source software package implementing the boundary element method. One of its main software tasks is the solution of the dense linear system of equations, for which ppohBEM relies on another software package called HACApK. To reduce the cost of solving the linear system, HACApK hierarchically compresses the coefficient matrix using adaptive cross approximation. This hierarchical compression greatly reduces the storage and time complexities of the solver and enables the solution of large-scale boundary value problems. To extend the capability of ppohBEM, in this paper we carefully port HACApK's linear solver onto GPU clusters. Though the potential of GPUs has been widely accepted in high-performance computing, it is still a challenge to utilize the GPUs for a solver, like HACApK's, that requires fine-grained computation and global communication. First, to utilize the GPUs, we integrate the batched GPU kernel that was recently released in the MAGMA software package. We discuss several techniques to improve the performance of the batched kernel. We then study various techniques to address the inter-GPU communication and examine their effects on state-of-the-art GPU clusters. We believe that the techniques studied in this paper are of interest to a wide range of software packages running on GPUs, especially with the increasingly complex node architectures and the growing costs of communication. We also hope that our efforts to integrate the GPU kernel and to set up the inter-GPU communication will influence the design of future-generation batched kernels or the communication layer within a software stack.
%I IEEE
%C Vancouver, BC, Canada
%8 2018-05
%G eng