Performance and Algorithms Research


PROTEAS-TUNE is a multi-institutional ECP software technology project spanning the topics of compilers, code generation, auto-tuning, and profiling. Under PROTEAS-TUNE, LBL has formed a tight collaboration with the University of Utah focused on developing the Brick Library to effect scalable, performance-portable computations on structured grids.

Research Topics

The exploitation of data locality is essential to attaining high performance in many applications that operate on structured grids (stencils, matrices, tensors, FFTs). Traditionally, long cache lines transparently deliver spatial locality to computations that exhibit reuse in the unit-stride dimension. Unfortunately, many computations require the exploitation of data locality in multiple dimensions. Whereas traditional compiler techniques leverage loop tiling to effect data locality, the Brick Library transforms the data structure itself so that multi-dimensional data locality can be exploited via spatial locality. In essence, a 3D 256^3 array of doubles can be transformed into a 64^3 array of 4^3 "Bricks" of doubles. Each 4^3 Brick holds 64 doubles, or 512 bytes of contiguous data. Thus, within a brick, striding in the i-, j-, or k-dimension implies striding by 1, 4, or 16 doubles, respectively. Operations that require data from neighboring bricks must locate the relevant brick and extract the relevant data. Overall, this technique has produced a number of research opportunities, including:

  • Code generation technologies that hide the complexity of inter-brick accesses from users,
  • Autotuning to find the optimal brick dimensions and code generation techniques,
  • Assessing the performance portability of bricks across multiple GPU and CPU platforms,
  • Extending the bricks technology to a wide range of application domains,
  • Exploring the use of bricks to improve the strong-scaling performance of distributed memory applications, and
  • Exploring the use of bricks to effect model parallelism in AI training.
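
The brick layout described above (an array of doubles stored as small, contiguous 4^3 bricks) can be illustrated with plain index arithmetic. The following is a minimal sketch, not the Brick Library API: the function names (`brick_index`, `flat_index`, `to_bricks`, `load`) are hypothetical, and it assumes that i is the unit-stride dimension and that bricks are stored in row-major order of their brick coordinates. How the library itself orders bricks or locates neighboring bricks is not specified here.

```python
# Illustrative sketch only. All names and the brick ordering are assumptions
# made for this example; they are not taken from the Brick Library.

BRICK = 4  # brick edge length: 4^3 doubles = 64 doubles = 512 bytes


def brick_index(i, j, k, n, b=BRICK):
    """Map grid point (i, j, k) of an n^3 array to (brick id, offset in brick).

    Assumes i is the unit-stride dimension and that bricks are stored in
    row-major order of their (k, j, i) brick coordinates.
    """
    nb = n // b                          # bricks per dimension
    bi, bj, bk = i // b, j // b, k // b  # which brick holds the point
    li, lj, lk = i % b, j % b, k % b     # position inside that brick
    brick_id = (bk * nb + bj) * nb + bi
    offset = (lk * b + lj) * b + li      # strides of 1, b, b*b inside a brick
    return brick_id, offset


def flat_index(i, j, k, n, b=BRICK):
    """Position of (i, j, k) in the bricked 1D storage."""
    brick_id, offset = brick_index(i, j, k, n, b)
    return brick_id * b**3 + offset


def to_bricks(grid, n, b=BRICK):
    """Reorder a conventional flat n^3 array (i fastest) into brick order."""
    out = [0.0] * (n**3)
    for k in range(n):
        for j in range(n):
            for i in range(n):
                out[flat_index(i, j, k, n, b)] = grid[(k * n + j) * n + i]
    return out


def load(bricked, i, j, k, n, b=BRICK):
    """Read one value; a neighbor access may resolve to a different brick."""
    return bricked[flat_index(i, j, k, n, b)]


if __name__ == "__main__":
    n = 8  # small grid so the example runs instantly: 2^3 bricks of 4^3
    grid = [float(x) for x in range(n**3)]
    bricked = to_bricks(grid, n)
    # Stepping from i=3 to i=4 crosses from one brick into its neighbor.
    print(brick_index(3, 0, 0, n), brick_index(4, 0, 0, n))
    print(load(bricked, 4, 0, 0, n) == grid[4])
```

Within a brick, stepping by one in i, j, or k moves 1, 4, or 16 doubles in storage, matching the strides stated above. Stepping across a brick boundary changes the brick id instead, which is the inter-brick access that code generation is meant to hide from users.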


LBL Researchers




Oscar Antepara, Hans Johansen, Samuel Williams, Tuowen Zhao, Samantha Hirsch, Priya Goyal, Mary Hall, "Performance portability evaluation of blocked stencil computations on GPUs", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2023.

Benjamin Sepanski, Tuowen Zhao, Hans Johansen, Samuel Williams, "Maximizing Performance Through Memory Hierarchy-Driven Data Layout Transformations", MCHPC, November 2022.

Tuowen Zhao, Mary Hall, Hans Johansen, Samuel Williams, "Improving Communication by Optimizing On-Node Data Movement with Data Layout", PPoPP, February 2021.

Tuowen Zhao, Mary Hall, Samuel Williams, Hans Johansen, "Exploiting Reuse and Vectorization in Blocked Stencil Computations on CPUs and GPUs", Supercomputing (SC), November 2019.

Tuowen Zhao, Samuel Williams, Mary Hall, Hans Johansen, "Delivering Performance Portable Stencil Computations on CPUs and GPUs Using Bricks", International Workshop on Performance, Portability and Productivity in HPC (P3HPC), November 2018.

Tuowen Zhao, Mary Hall, Protonu Basu, Samuel Williams, Hans Johansen, "SIMD code generation for stencils on brick decompositions", Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), February 2018.