GraphBLAST Targets GPU Graph Analytics Performance Issues

October 6, 2022

By Kathy Kincade
Contact: cscomms@lbl.gov

Since its introduction into the public domain five years ago, GraphBLAS has been widely adopted across commercial, scientific, and computational research communities. Heralded for its contributions to the long-standing Basic Linear Algebra Subprograms (BLAS) library, the open-source GraphBLAS framework applies linear algebra to high-performance graph analytics to improve scalability for a broad range of applications, from artificial intelligence and cybersecurity to computational science and parallel computing. It’s also been used to develop graph and combinatorial algorithms for exascale systems through the U.S. Department of Energy’s Exascale Computing Program.

Aydın Buluç, Carl Yang, John Owens pose together

GraphBLAST developers are (left to right) Aydın Buluç, Carl Yang, and John Owens. (Credit: Aydın Buluç, Berkeley Lab)

Now a new iteration, “GraphBLAST” – developed by researchers from Lawrence Berkeley National Laboratory (Berkeley Lab) and the University of California, Davis (UC Davis) – further enhances the performance of this popular collection of graph algorithm building blocks by overcoming design and performance challenges specific to GPU processors.

“This is the first GPU-based graph framework that has the same performance as some of the traditional vertex-centric graph frameworks on the GPUs,” said Carl Yang, a software engineer at UC Davis and Berkeley Lab and lead GraphBLAST developer. He is also first author on a comprehensive paper describing GraphBLAST and its capabilities published in the journal ACM Transaction on Mathematical Software.

Overcoming a GPU Performance Bottleneck

Graphs are mathematical representations that emerge when solving problems in domains such as bioinformatics, social network analysis, molecular synthesis, and route planning, the authors explain in their paper, “GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU.” Graphs can contain billions of vertices and edges, “so parallelization has become a must,” they note.

But in recent years, a new challenge has emerged: a mismatch between the languages users and graph algorithm designers prefer and the programming languages used for parallel hardware. This gap is what fueled the development and growing adoption of GraphBLAS and related frameworks. But these software packages have faced their own speed bumps: namely, a lack of high-performance implementations for GPUs.

“GPUs in general have more bandwidth, meaning they can move data faster than most CPUs do,” said Aydın Buluç, a senior scientist in Berkeley Lab’s Applied Math and Computational Research Division and co-author on the ACM paper. “However, CPUs are optimized for latency, and a lot of the graph codes, historically, were framed more like latency-bound applications. So it is a natural fit that a linear algebraic programming model would unlock the bandwidth advantages of GPUs.”

This is where GraphBLAST - the first open-source, high-performance, linear-algebra-based graph framework for GPUs - comes in. It specifically optimizes the software for use on GPUs by addressing three key components: load balancing, memory management, and an array of potential functionalities.

“The biggest thing is the load balancing,” Yang said. “In GraphBLAST we use more low-level load balancing techniques than previous implementations of GraphBLAS on the GPU. It is important to get calculations to run quickly on the GPU to make sure that each of the parallel units on the GPU, or threads, have a similar amount of work to do. You can’t give one thread a lot of work while giving all the other threads very little work. So the number one thing we did was make sure the kernels have good load-balancing techniques, which is a big difference in terms of performance.”

GPU memory management is also important because to run things on a GPU requires being able to move memory from the CPU to the GPU and copy the output of the calculations that the GPU has done back to the CPU. This is something that GraphBLAST does well, Yang noted. “It handles automatically the GPU-to-CPU management, which means the user doesn’t have to spend a lot of time or code tinkering with things themselves.”

Another key component is the incorporation of direction optimization. There are two ways to do a graph traversal (the process of checking and/or updating each vertex in a graph), Yang explained: a push or a pull. “GraphBLAST chooses push or pull when it is beneficial to do so, and this is direction optimization,” he said. “Previous frameworks have required the user to write a lot of boilerplate code implementing the push or the pull themselves for a particular graph algorithm, and to have the user decide when it is better to use which. With GraphBLAST, this is unnecessary and the user gets all the benefits out of the box.”

“The language of matrices is really appreciated by people who develop and design compilers, for example,” Buluç added. “Some of the leading groups that write domain-specific languages and compilers were using the techniques presented in one of Carl’s previous papers that enable direction optimization on graphs using linear algebraic primitives on the background.”

Performance Comparison Outcomes

To demonstrate its performance, the team tested GraphBLAST on SuiteSparse GraphBLAS (a complete implementation of the GraphBLAS standard) and Ligra and Gunrock (state-of-the-art graph frameworks on CPUs and GPUs) on nine datasets including both real-world and synthetically generated graphs. The comparison was made using five commonly used graph algorithms: breadth-first-search, single-source shortest-path, PageRank, connected components, and triangle counting. Their results show that, on a single NVIDIA GPU, GraphBLAST has at least an order of magnitude speedup over previous implementations (SuiteSparse and GBTL, 43.51x and 31.83x, respectively), comparable performance to Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.

This work represents two separate large sets of contributions, noted co-author John Owens, a professor of electrical and computer engineering at UC Davis. “One is the contribution of the framework itself, which Carl built. It performs terrifically, and we can use it for all the same things that the CPU people can do. And second, it encapsulates some of the previous work the three of us have done that advances the state of the art in GPU implementations of matrix primitives, especially in direction optimization. Carl will be really modest and say ‘hey, this is just something out there that we discovered’ – but he’s the one who discovered it.”

“There are certain productivity advantages of doing graph algorithms in the language of matrices, and we include code comparisons at the end of the paper to say how much shorter it is to run it this way,” Buluç added. “We don’t have a distributed memory implementation yet, but things like GraphBLAST are a big step in that direction because now we can use GraphBLAS on machines like Perlmutter, and we are already using it on Summit.”

About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.