AMReX Co-Design Center Helps Five ECP Projects Hit Performance Goals

September 26, 2018

Contact: Mike Bernhardt, Bernhardtme@ornl.gov, +1 503.804.1714
This story was originally published on the Exascale Computing Project website.

A computational motif known as block-structured adaptive mesh refinement (AMR) provides a natural way to focus computing power on the most critical parts of a problem. The AMReX Co-Design Center provides a state-of-the-art AMR infrastructure that five ECP application projects and other AMR applications use to take effective advantage of current and future architectures.

The success of the AMReX Co-Design Center, led by John Bell of Lawrence Berkeley National Laboratory, can be measured by the success of the application teams that use AMReX to provide core computational infrastructure, allowing them to focus on the parts of their problems specific to their applications. For example, the WarpX accelerator modeling team can focus on highly innovative algorithms for accelerator modeling without worrying about developing and maintaining the hybrid parallelism provided by AMReX. The Pele combustion project can focus on the chemistry and transport associated with new fuels without having to implement new kernel-launching strategies for GPUs. The MFIX-Exa multiphase flow modeling team can focus on new fluid algorithms without having to develop new software to separately load balance the fluid and particle work.

Research Objectives and Context

Block-structured AMR provides the basis for discretizing, in both time and space, the equations that govern a physical system. AMR reduces the computational cost and memory footprint compared with a uniform mesh while preserving the essential local descriptions of different physical processes in complex multiphysics algorithms. Fundamental to block-structured AMR algorithms is a hierarchical representation of the solution at multiple levels of resolution. Beyond that spatial representation, AMReX gives algorithm developers unprecedented flexibility in using AMR software. Block-structured AMR was first developed in the 1980s for compressible hydrodynamics; the equations were hyperbolic, the methods were explicit in time, and the physics representation was simple by today’s standards. Today, AMR software can support simulations of flows at the length scale of microns, the equations governing structure formation in the Universe, and anything in between. One of the fundamental principles of the AMReX team is that the software should not dictate what physics is modeled or what algorithm is used. The goal of AMReX is to enable new science at the exascale by supporting advanced algorithms in a cutting-edge software framework.
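To make the cost argument concrete, here is a minimal Python sketch that compares the cell count of a uniform fine mesh with that of a two-level block-structured hierarchy. The `Box` class, the domain size, the refinement ratio, and the refined region are all invented for illustration; none of them are taken from AMReX or from any of the applications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Rectangular patch of cells in 2D with inclusive index bounds (hypothetical type)."""
    lo: tuple
    hi: tuple

    def num_cells(self) -> int:
        nx = self.hi[0] - self.lo[0] + 1
        ny = self.hi[1] - self.lo[1] + 1
        return nx * ny

    def refine(self, ratio: int) -> "Box":
        """Same physical region, indexed on a mesh `ratio` times finer."""
        lo = tuple(i * ratio for i in self.lo)
        hi = tuple((i + 1) * ratio - 1 for i in self.hi)
        return Box(lo, hi)


def hierarchy_cells(levels) -> int:
    """Total cells stored across all levels of the hierarchy."""
    return sum(box.num_cells() for boxes in levels for box in boxes)


RATIO = 4                                   # assumed refinement ratio
coarse_domain = Box((0, 0), (63, 63))       # 64 x 64 coarse cells
tagged = Box((24, 24), (39, 39))            # assumed 16 x 16 region needing refinement

# Level 0 covers the whole domain; level 1 covers only the tagged region.
levels = [[coarse_domain], [tagged.refine(RATIO)]]

amr_cells = hierarchy_cells(levels)
uniform_cells = coarse_domain.refine(RATIO).num_cells()

print(f"AMR hierarchy: {amr_cells} cells, uniform fine mesh: {uniform_cells} cells")
```

In this toy setup the hierarchy stores 8,192 cells where a uniformly fine mesh would need 65,536. The actual savings in a real simulation depend on how much of the domain needs fine resolution.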

To do this, AMReX supplies data containers and iterators that understand the underlying hierarchical parallelism for field variables on a mesh, particle data, and embedded boundary (cut cell) representations of complex geometries. Both particles and embedded boundary representations introduce additional irregularity and complexity in the way the data is stored and operated on, requiring special attention in the presence of the dynamically changing hierarchical mesh structure and AMR time-stepping approaches. The computational patterns associated with particles vary widely depending on what the particles represent, from dark matter in the Universe to sprays in a combustor to oxygen carriers in a chemical looping reactor.
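The idea of an iterator that understands hierarchical parallelism can be sketched in a few lines of Python: boxes are distributed across ranks, and each locally owned box is subdivided into logical tiles that threads could work on. This is only an illustration under assumed names (`tiles`, `local_tiles`, `owner`); it does not reproduce the AMReX iterator interface.

```python
from itertools import product


def tiles(lo, hi, tile_size):
    """Yield (tile_lo, tile_hi) inclusive index bounds that cover the box [lo, hi]."""
    per_dim = []
    for l, h, t in zip(lo, hi, tile_size):
        # Tiles at the upper edge are clipped to the box boundary.
        per_dim.append([(s, min(s + t - 1, h)) for s in range(l, h + 1, t)])
    for combo in product(*per_dim):
        yield tuple(c[0] for c in combo), tuple(c[1] for c in combo)


def local_tiles(boxes, owner, my_rank, tile_size):
    """Yield (box_id, tile_lo, tile_hi) for every tile of every box owned by `my_rank`."""
    for box_id, (lo, hi) in enumerate(boxes):
        if owner[box_id] != my_rank:
            continue  # another rank holds this box's data
        for tile_lo, tile_hi in tiles(lo, hi, tile_size):
            yield box_id, tile_lo, tile_hi


# Two hypothetical 16 x 16 boxes, one per rank, tiled into 8 x 8 pieces.
boxes = [((0, 0), (15, 15)), ((16, 0), (31, 15))]
owner = [0, 1]
rank0_tiles = list(local_tiles(boxes, owner, my_rank=0, tile_size=(8, 8)))

for box_id, tile_lo, tile_hi in rank0_tiles:
    print(box_id, tile_lo, tile_hi)
```

Because the tiles are logical index ranges rather than separate allocations, the tile size can be changed without touching how the data is stored.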

The different multiphysics applications have widely different performance characteristics as well. AMReX offers a rich set of tools to provide sufficient flexibility so that performance can be tuned for different situations. Dynamic scheduling of logical tiles, work estimates based on runtime measurement of computational costs, dual grids for mesh and particle data, and asynchronous coarse-grained (fork-join) task parallelism are just some of the ways in which AMReX provides application developers the software flexibility to optimize their applications. And unlike some software, AMReX does not impose specific language requirements on the kernels that represent the different physics components; while the AMReX infrastructure itself is written in C++, many AMReX users prefer to write kernels in modern Fortran, for example. In addition, the AMReX team provides in-house AMR-aware performance modeling of potential work-scheduling and data layout strategies on current and future architectures.
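As a rough illustration of work estimates combined with dual grids, the following Python sketch balances fluid work and particle work separately, using per-box costs that stand in for runtime measurements. The greedy heuristic is a generic stand-in chosen for brevity, not a description of the strategies AMReX implements, and the function names and cost values are hypothetical.

```python
import heapq


def balance_by_cost(costs, num_ranks):
    """Greedy assignment: give the most expensive remaining box to the least loaded rank.

    costs: dict mapping box id -> measured cost (arbitrary units).
    Returns a dict mapping rank -> list of box ids.
    """
    heap = [(0.0, rank) for rank in range(num_ranks)]   # (current load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    for box_id, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(box_id)
        heapq.heappush(heap, (load + cost, rank))
    return assignment


def rank_loads(assignment, costs):
    """Total cost carried by each rank under a given assignment."""
    return {rank: sum(costs[b] for b in boxes) for rank, boxes in assignment.items()}


# Assumed measurements: fluid work is uniform, particle work is clustered.
fluid_cost = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
particle_cost = {0: 8.0, 1: 1.0, 2: 1.0, 3: 6.0}

# Dual grids: each kind of work gets its own distribution over the same ranks.
fluid_map = balance_by_cost(fluid_cost, num_ranks=2)
particle_map = balance_by_cost(particle_cost, num_ranks=2)

print("fluid loads:   ", rank_loads(fluid_map, fluid_cost))
print("particle loads:", rank_loads(particle_map, particle_cost))
```

With these made-up costs, a single shared distribution could not balance both kinds of work at once, whereas two independent distributions balance each one evenly.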

Recent Success

AMReX began by developing fast, efficient implementations of all the basic operations needed for AMR application codes. As an early example of the benefits of improved software design, almost exactly one year ago, AMReX reported orders of magnitude speed-up in the embedded boundary capability over previous instantiations. This capability will be used to model the burning in realistic combustor geometries in the Pele project and the particle-laden flow in the cyclone region of the chemical looping reactor in the MFIX-Exa project.

More recently, AMReX has delivered a pair of programming model techniques that can be used to mitigate the impact of communication overheads, which are expected to account for a growing share of total runtime. First, asynchronous iteration targets fine-grain tasks by overlapping their communication with computation via an asynchronous mesh iterator interface. To enable asynchronous mesh iteration, the AMReX team developed a capability to automatically construct task dependency graphs from AMReX metadata. The team also implemented a hierarchical thread system and computational task scheduler to efficiently execute the task graph. In a scaling study of a proxy application on Cori, asynchronous iteration yielded as much as a 26% reduction in runtime.
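The following Python sketch shows, in miniature, how a task dependency graph can be derived from mesh metadata and why it exposes overlap between communication and computation. The task names (`pack`, `halo`, `interior`, `boundary`) and the neighbor table are assumptions made for this example; the scheduler uses the standard-library `graphlib` module and is not the AMReX thread system or task scheduler.

```python
from graphlib import TopologicalSorter


def build_task_graph(neighbors):
    """Map each task to the set of tasks it depends on.

    neighbors: dict mapping box id -> list of adjacent box ids (the "metadata").
    """
    graph = {}
    for box, nbrs in neighbors.items():
        graph[("pack", box)] = set()                           # stage data to send
        graph[("interior", box)] = set()                       # needs no remote data
        graph[("halo", box)] = {("pack", n) for n in nbrs}     # receive from neighbors
        graph[("boundary", box)] = {("halo", box)}             # needs filled ghost cells
    return graph


def run_in_waves(graph):
    """Group tasks into waves; every task in a wave has all of its dependencies satisfied."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Three boxes in a row: box 1 touches both of the others.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
waves = run_in_waves(build_task_graph(neighbors))

for i, wave in enumerate(waves):
    print(f"wave {i}: {wave}")
```

The first wave contains every `interior` task alongside every `pack` task, which is the opportunity an asynchronous iterator can exploit: interior computation proceeds while ghost-cell data is still in flight.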

Second, fork-join parallelism targets coarse-grain tasks by splitting the job’s processes into subgroups that compute tasks independently of one another, thus localizing communication within each subgroup. To enable fork-join task parallelism, the team designed a simple programming interface for developers to express their coarse-grain tasks and the data those tasks require, then implemented the runtime mechanisms for parallel task forking, data migration, and task output redirection.
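As a sketch of the forking step, the Python function below partitions a job's ranks into contiguous subgroups, one per coarse-grain task. The function name and the contiguous, near-equal split are assumptions made for illustration; the article does not say how AMReX chooses subgroup sizes.

```python
def split_ranks(num_ranks, num_tasks):
    """Partition ranks 0..num_ranks-1 into `num_tasks` contiguous subgroups.

    Subgroup sizes differ by at most one rank.
    """
    if not 1 <= num_tasks <= num_ranks:
        raise ValueError("need between 1 and num_ranks tasks")
    base, extra = divmod(num_ranks, num_tasks)
    groups, start = [], 0
    for task in range(num_tasks):
        size = base + (1 if task < extra else 0)   # first `extra` groups get one more rank
        groups.append(list(range(start, start + size)))
        start += size
    return groups


groups = split_ranks(num_ranks=8, num_tasks=3)
print(groups)  # [[0, 1, 2], [3, 4, 5], [6, 7]]
```

In an MPI code, a partition like this could be expressed by giving each rank its subgroup index as the color argument to `MPI_Comm_split`, so that collective operations inside a task involve only that task's ranks.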

The core data structure in AMReX is the MultiFab, a distributed data structure that holds numerical values over the spatial coordinates within a set of grids. The accompanying figure shows example ways to pass MultiFab data between parent and child tasks: the user specifies the strategy (duplicate, split, or single) and the intent (in, out, or inout) for how the subtasks access the data.
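One plausible reading of the strategy and intent options can be sketched in Python as follows. The semantics here are assumptions inferred from the option names: `duplicate` gives every subtask all components, `split` divides the components among subtasks, `single` gives the data to one subtask only, and the intent decides whether data is copied in before the fork, copied out after the join, or both. The function names are hypothetical and this is not the AMReX interface.

```python
from enum import Enum


class Strategy(Enum):
    SINGLE = "single"
    DUPLICATE = "duplicate"
    SPLIT = "split"


class Intent(Enum):
    IN = "in"
    OUT = "out"
    INOUT = "inout"


def components_for_task(strategy, num_comp, num_tasks, task, owner=0):
    """Component indices visible to subtask `task` under the assumed semantics."""
    if strategy is Strategy.SINGLE:
        return list(range(num_comp)) if task == owner else []
    if strategy is Strategy.DUPLICATE:
        return list(range(num_comp))
    # SPLIT: contiguous, near-equal chunks of components.
    base, extra = divmod(num_comp, num_tasks)
    start = task * base + min(task, extra)
    size = base + (1 if task < extra else 0)
    return list(range(start, start + size))


def copies_in(intent):
    """Is parent data copied to the subtasks before the fork?"""
    return intent in (Intent.IN, Intent.INOUT)


def copies_out(intent):
    """Is subtask data copied back to the parent after the join?"""
    return intent in (Intent.OUT, Intent.INOUT)


# Nine hypothetical components shared among three subtasks.
for task in range(3):
    print(task, components_for_task(Strategy.SPLIT, num_comp=9, num_tasks=3, task=task))
```

Under this reading, `split` suits work that decomposes by component, while `duplicate` with intent `in` suits read-only data that every subtask needs.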

In keeping with the close ties between AMReX and the application projects it supports, the Pele project is already taking advantage of this new capability and has refactored part of its low Mach number combustion algorithm to use fork-join parallelism for the multiple parabolic solves it must perform to diffuse all of the chemical species.

Co-Design Collaborations

ECP Applications

Five ECP application projects, in the areas of accelerator design (WarpX), astrophysics (ExaStar), combustion (Pele), cosmology (ExaSky), and multiphase flow (MFIX-Exa), include codes based on AMReX. All of these codes use the basic mesh data structures and iterators along with additional capabilities.

In addition, the AMReX team has had regular communications with the CoPA Co-Design Center concerning best practices for particle data layout and operations.

ECP Software Technology (ST)

The AMReX team has interacted with many of the ECP ST teams. The most significant of those interactions have been with the SUNDIALS, HDF5, ALPINE, and Pagoda projects. The AMReX team had a shared milestone with the SUNDIALS project: the SUNDIALS team would develop and release a vectorized version of CVODE, and the AMReX team would provide an interface for using it in AMReX. This milestone has been successfully completed, and the vectorized version is in use in the Nyx code. Plans are underway to develop a new version of CVODE that will work effectively on GPUs.

In addition, the AMReX team has regular interactions with the HDF5 project, the ALPINE project, and the Pagoda project to pursue various objectives. Finally, the AMReX team has been working with the xSDK team to ensure xSDK compatibility and interoperability; AMReX will be a part of the October 2018 xSDK release.

ECP Hardware and Integration (HI)

Several members of the AMReX team regularly read the relevant PathForward milestones, reports, and presentations from vendors and attend the PathForward review meetings. The team has also had more-focused face-to-face discussions with vendor architects, compiler developers, and performance modeling teams. These interactions provided a characterization of AMReX’s computational needs and actionable information to the vendors. They also served to keep the AMReX team abreast of vendor architecture directions and advancements to help understand the computational challenges for relevant numerical methods on future exascale architectures. With this information, the AMReX team can help applications prioritize which components should be tweaked or re-implemented, and which components need algorithmic innovation to deliver performance on exascale systems. In addition, the AMReX team has used LIKWID (a performance counter tool) to characterize the distributed-memory AMReX-based applications in terms of bandwidth (L1, L2, L3, DRAM), vectorization rates, and instructions per cycle.

Impact and Next Steps

To date, AMReX has helped five of the application projects move forward on the path to exascale through improved performance on multicore machines by supplying highly efficient core infrastructure and support for strategies such as hybrid parallelism, effective load balancing, and asynchronous task parallelism. The next thrust area for AMReX is to provide these same applications with support for running on hybrid CPU/GPU systems such as Summit and, eventually, Frontier. Because the applications each have slightly different needs, a key tradeoff in this effort is to maintain flexibility of use and portability across architectures while optimizing performance. AMReX will provide applications with a mechanism to use GPUs effectively without needing to extensively rewrite code. As always, however, this does not preclude developers from implementing custom tuning strategies for a particular application.

Researchers

Ann Almgren, John Bell, Marc Day, Andrew Myers, Andy Nonaka, Steven Reeves, Weiqun Zhang, Cy Chan, Tan Nguyen, Sam Williams, Brian Friesen, Kevin Gott, Anshu Dubey, Saurabh Chawdhary, Jared O’Neal, Klaus Weide, Ray Grout, and Shashank Yellapantula.

About Berkeley Lab

Founded in 1931 on the belief that the biggest scientific challenges are best addressed by teams, Lawrence Berkeley National Laboratory and its scientists have been recognized with 16 Nobel Prizes. Today, Berkeley Lab researchers develop sustainable energy and environmental solutions, create useful new materials, advance the frontiers of computing, and probe the mysteries of life, matter, and the universe. Scientists from around the world rely on the Lab’s facilities for their own discovery science. Berkeley Lab is a multiprogram national laboratory, managed by the University of California for the U.S. Department of Energy’s Office of Science.

DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit energy.gov/science.