SuperLU_DIST

This figure shows a heatmap of performance (TFLOP/s) for two matrices, K2D5pt4096 and nlpkkt80, under different configurations of the 3D process grid P_xy x P_z. P_xy doubles with each step along the x-axis and P_z doubles with each step along the y-axis, so the bottommost row (P_z = 1) is the baseline 2D algorithm. In the best case, the 3D algorithm achieved a 27.4x speedup for the matrix K2D5pt4096.
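The layout of the heatmap axes can be sketched in a few lines of Python. This is only an illustration of how the (P_xy, P_z) combinations are arranged. The function name, the starting value of P_xy, and the number of doublings are hypothetical, because the text does not state the actual ranges used in the figure.

```python
def grid_configurations(base_pxy, pxy_doublings, pz_doublings):
    """Enumerate heatmap cells as (P_xy, P_z, total processes).

    Rows correspond to P_z = 1, 2, 4, ... (row 0 is the 2D baseline);
    columns correspond to P_xy = base_pxy, 2*base_pxy, 4*base_pxy, ...
    All argument values are assumptions chosen for illustration.
    """
    rows = []
    for j in range(pz_doublings + 1):
        pz = 2 ** j                      # P_z doubles along the y-axis
        row = []
        for i in range(pxy_doublings + 1):
            pxy = base_pxy * 2 ** i      # P_xy doubles along the x-axis
            row.append((pxy, pz, pxy * pz))
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Hypothetical ranges: P_xy starts at 16; two doublings in each direction.
    for row in grid_configurations(base_pxy=16, pxy_doublings=2, pz_doublings=2):
        print(row)
```

Each cell holds P_xy * P_z processes in total, so moving one step up or one step right in the heatmap doubles the process count.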

We recently developed a communication-avoiding 3D sparse LU factorization in SuperLU_DIST that reduces latency by a factor of O(log n) for planar graphs and O(n^(1/3)) for non-planar graphs.

Our new 3D code achieves speedups of up to 27x for planar graphs and up to 2.5x for non-planar graphs over the baseline 2D SuperLU_DIST when run on 24,000 cores of Edison, a Cray XC30 at NERSC.

Contact: Sherry Li, xsli@lbl.gov