The Performance of Low-Synchronization Variants of Reorthogonalized Block Classical Gram--Schmidt
Abstract
Numerous applications, such as Krylov subspace solvers, make extensive use of the block classical Gram-Schmidt (BCGS) algorithm and its reorthogonalized variants for orthogonalizing a set of vectors. For large-scale problems in distributed memory settings, the communication cost, particularly the global synchronization cost, is a major performance bottleneck. In recent years, many low-synchronization BCGS variants have been proposed in an effort to reduce the number of synchronization points. The work [E. Carson, Y. Ma, arXiv preprint 2411.07077] recently proposed stable one-synchronization and two-synchronization variants of BCGS, i.e., BCGSI+P-1S and BCGSI+P-2S. In this work, we evaluate the performance of BCGSI+P-1S and BCGSI+P-2S on a distributed memory system compared to other well-known low-synchronization BCGS variants. In comparison to the classical reorthogonalized BCGS algorithm (BCGSI+), numerical experiments demonstrate that BCGSI+P-1S and BCGSI+P-2S can achieve up to 4 times and 2 times speedups, respectively, and perform similarly to other (less stable) one-synchronization and two-synchronization variants. BCGSI+P-1S and BCGSI+P-2S are therefore recommended as the best choice in practice for computing an economic QR factorization on distributed memory systems due to their superior stability when compared to other variants with the same synchronization cost.