# Performance Analysis of AI Hardware and Software

The performance characteristics of AI training and inference can be quite distinct from those of HPC applications, despite sharing similar computational methods (large/small matrix multiplications, stencils, gather/scatter, etc.), albeit at reduced precision (single, half, BFLOAT16). Where possible, vendors are attempting to create architectures specialized for the subset of computations used in AI training and inference. Understanding the interplay between science, AI method, framework, and architecture is essential not only in quantifying the computational potential of current and future architectures running AI models, but also in identifying the bottlenecks and the ultimate limits of today's models.
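As a small, self-contained illustration of the reduced-precision tradeoff described above (a sketch only, not drawn from the publications below; NumPy's float16 stands in for hardware half/BFLOAT16 formats, which NumPy does not provide), one can compare a single-precision matrix product against the same product with its operands rounded to half precision:

```python
import numpy as np

# Illustrative sketch: measure the error introduced when the operands of a
# matrix multiplication are rounded to half precision, as happens on
# reduced-precision AI accelerators.
rng = np.random.default_rng(seed=42)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

full = a @ b  # single-precision (float32) reference product
half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

# Relative error introduced by the reduced-precision operands.
rel_err = np.abs(full - half).max() / np.abs(full).max()
print(f"max relative error at half precision: {rel_err:.2e}")
```

The error is small but nonzero, which is exactly the tradeoff specialized AI hardware exploits: reduced precision buys bandwidth and throughput at a modest accuracy cost that training is often tolerant of.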

## Researchers

- Samuel Williams
- Nick Wright
- Khaled Ibrahim
- Hai Ah Nam
- Leonid Oliker
- Tan Nguyen
- Nan Ding
- Steve Farrell
- Wahid Bhimji

## Publications

### Dan Bonachea, Paul H. Hargrove, "An Introduction to GASNet-EX for Chapel Users", 9th Annual Chapel Implementers and Users Workshop (CHIUW 2022), June 10, 2022

Have you ever typed "export CHPL_COMM=gasnet"? If you’ve used Chapel with multi-locale support on a system without "Cray" in the model name, then you’ve probably used GASNet. Did you ever wonder what GASNet is? What GASNet should mean to you? This talk aims to answer those questions and more. Chapel has system-specific implementations of multi-locale communication for Cray-branded systems including the Cray XC and HPE Cray EX lines. On other systems, Chapel communication uses the GASNet communication library embedded in third-party/gasnet. In this talk, that third-party will introduce itself to you in the first person.


### Paul H. Hargrove, Dan Bonachea, Amir Kamil, Colin A. MacLean, Damian Rouson, Daniel Waters, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes (ECP'22)", Poster at Exascale Computing Project (ECP) Annual Meeting 2022, May 5, 2022

We present UPC++ and GASNet-EX, distributed libraries which together enable one-sided, lightweight communication such as arises in irregular applications, libraries and frameworks running on exascale systems.

UPC++ is a C++ PGAS library, featuring APIs for Remote Procedure Call (RPC) and for Remote Memory Access (RMA) to host and GPU memories.  The combination of these two features yields performant, scalable solutions to problems of interest within ECP.

GASNet-EX is PGAS communication middleware, providing the foundation for UPC++ and Legion, plus numerous non-ECP clients.  GASNet-EX RMA interfaces match or exceed the performance of MPI-RMA across a variety of pre-exascale systems.

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2022.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2022, LBNL 2001453, doi: 10.25344/S41C7Q

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### Dan Bonachea, Amir Kamil, "UPC++ v1.0 Specification, Revision 2022.3.0", Lawrence Berkeley National Laboratory Tech Report, March 2022, LBNL 2001452, doi: 10.25344/S4530J

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). All communication operations are syntactically explicit and default to non-blocking; asynchrony is managed through the use of futures, promises and continuation callbacks, enabling the programmer to construct a graph of operations to execute asynchronously as high-latency dependencies are satisfied. A global pointer abstraction provides system-wide addressability of shared memory, including host and accelerator memories. The parallelism model is primarily process-based, but the interface is thread-safe and designed to allow efficient and expressive use in multi-threaded applications. The interface is designed for extreme scalability throughout, and deliberately avoids design features that could inhibit scalability.

### Daniel Waters, Colin A. MacLean, Dan Bonachea, Paul H. Hargrove, "Demonstrating UPC++/Kokkos Interoperability in a Heat Conduction Simulation (Extended Abstract)", Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM), November 2021, doi: 10.25344/S4630V

We describe the replacement of MPI with UPC++ in an existing Kokkos code that simulates heat conduction within a rectangular 3D object, as well as an analysis of the new code’s performance on CUDA accelerators. The key challenges were packing the halos in Kokkos data structures in a way that allowed for UPC++ remote memory access, and streamlining synchronization costs. Additional UPC++ abstractions used included global pointers, distributed objects, remote procedure calls, and futures. We also make use of the device allocator concept to facilitate data management in memory with unique properties, such as GPUs. Our results demonstrate that despite the algorithm’s good semantic match to message passing abstractions, straightforward modifications to use UPC++ communication deliver vastly improved performance and scalability in the common case. We find the one-sided UPC++ version written in a natural way exhibits good performance, whereas the message-passing version written in a straightforward way exhibits performance anomalies. We argue this represents a productivity benefit for one-sided communication models.

### Amir Kamil, Dan Bonachea, "Optimization of Asynchronous Communication Operations through Eager Notifications", Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM), November 2021, doi: 10.25344/S42C71

UPC++ is a C++ library implementing the Asynchronous Partitioned Global Address Space (APGAS) model. We propose an enhancement to the completion mechanisms of UPC++ used to synchronize communication operations that is designed to reduce overhead for on-node operations. Our enhancement permits eager delivery of completion notification in cases where the data transfer semantics of an operation happen to complete synchronously, for example due to the use of shared-memory bypass. This semantic relaxation allows removing significant overhead from the critical path of the implementation in such cases. We evaluate our results on three different representative systems using a combination of microbenchmarks and five variations of the HPCChallenge RandomAccess benchmark implemented in UPC++ and run on a single node to accentuate the impact of locality. We find that in RMA versions of the benchmark written in a straightforward manner (without manually optimizing for locality), the new eager notification mode can provide up to a 25% speedup when synchronizing with promises and up to a 13.5x speedup when synchronizing with conjoined futures. We also evaluate our results using a graph matching application written with UPC++ RMA communication, where we measure overall speedups of as much as 11% in single-node runs of the unmodified application code, due to our transparent enhancements.

### Paul H. Hargrove, Dan Bonachea, Colin A. MacLean, Daniel Waters, "GASNet-EX Memory Kinds: Support for Device Memory in PGAS Programming Models", The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'21) Research Poster, November 2021, doi: 10.25344/S4P306

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work includes two major components: UPC++ (a C++ template library) and GASNet-EX (a portable, high-performance communication library). This poster describes recent advances in GASNet-EX to efficiently implement Remote Memory Access (RMA) operations to and from memory on accelerator devices such as GPUs. Performance is illustrated via benchmark results from UPC++ and the Legion programming system, both using GASNet-EX as their communications library.

### Katherine A. Yelick, Amir Kamil, Damian Rouson, Dan Bonachea, Paul H. Hargrove, "UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (SC21)", Tutorial at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC21), November 15, 2021

UPC++ is a C++ library supporting Partitioned Global Address Space (PGAS) programming. UPC++ offers low-overhead one-sided Remote Memory Access (RMA) and Remote Procedure Calls (RPC), along with future/promise-based asynchrony to express dependencies between computation and asynchronous data movement. UPC++ supports simple/regular data structures as well as more elaborate distributed applications where communication is fine-grained and/or irregular. UPC++ provides a uniform abstraction for one-sided RMA between host and GPU/accelerator memories anywhere in the system. UPC++'s support for aggressive asynchrony enables applications to effectively overlap communication and reduce latency stalls, while the underlying GASNet-EX communication library delivers efficient low-overhead RMA/RPC on HPC networks.

This tutorial introduces UPC++, covering the memory and execution models and basic algorithm implementations. Participants gain hands-on experience incorporating UPC++ features into application proxy examples. We examine a few UPC++ applications with irregular communication (metagenomic assembler and COVID-19 simulation) and describe how they utilize UPC++ to optimize communication performance.

### Devarshi Ghoshal, Ludovico Bianchi, Abdelilah Essiari, Drew Paine, Sarah Poon, Michael Beach, Alpha N'Diaye, Patrick Huck, Lavanya Ramakrishnan, "Science Capsule: Towards Sharing and Reproducibility of Scientific Workflows", 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), November 15, 2021, doi: 10.1109/WORKS54523.2021.00014

Workflows are increasingly processing large volumes of data from scientific instruments, experiments and sensors. These workflows often consist of complex data processing and analysis steps that might include a diverse ecosystem of tools and also often involve human-in-the-loop steps. Sharing and reproducing these workflows with collaborators and the larger community is critical but hard to do without the entire context of the workflow including user notes and execution environment. In this paper, we describe Science Capsule, which is a framework to capture, share, and reproduce scientific workflows. Science Capsule captures, manages and represents both computational and human elements of a workflow. It automatically captures and processes events associated with the execution and data life cycle of workflows, and lets users add other types and forms of scientific artifacts. Science Capsule also allows users to create workflow "snapshots" that keep track of the different versions of a workflow and their lineage, allowing scientists to incrementally share and extend workflows between users. Our results show that Science Capsule is capable of processing and organizing events in near real-time for high-throughput experimental and data analysis workflows without incurring any significant performance overheads.

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Daniel Waters, "UPC++ v1.0 Programmer’s Guide, Revision 2021.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2021, LBNL 2001424, doi: 10.25344/S4SW2T

UPC++ is a C++ library that supports Partitioned Global Address Space (PGAS) programming. It is designed for writing efficient, scalable parallel programs on distributed-memory parallel computers. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). The UPC++ control model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. The PGAS memory model additionally provides one-sided RMA communication to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ also features Remote Procedure Call (RPC) communication, making it easy to move computation to operate on data that resides on remote processes.

UPC++ was designed to support exascale high-performance computing, and the library interfaces and implementation are focused on maximizing scalability. In UPC++, all communication operations are syntactically explicit, which encourages programmers to consider the costs associated with communication and data movement. Moreover, all communication operations are asynchronous by default, encouraging programmers to seek opportunities for overlapping communication latencies with other useful work. UPC++ provides expressive and composable abstractions designed for efficiently managing aggressive use of asynchrony in programs. Together, these design principles are intended to enable programmers to write applications using UPC++ that perform well even on hundreds of thousands of cores.

### Dan Bonachea, Amir Kamil, "UPC++ v1.0 Specification, Revision 2021.9.0", Lawrence Berkeley National Laboratory Tech Report, September 2021, LBNL 2001425, doi: 10.25344/S4XK53

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). All communication operations are syntactically explicit and default to non-blocking; asynchrony is managed through the use of futures, promises and continuation callbacks, enabling the programmer to construct a graph of operations to execute asynchronously as high-latency dependencies are satisfied. A global pointer abstraction provides system-wide addressability of shared memory, including host and accelerator memories. The parallelism model is primarily process-based, but the interface is thread-safe and designed to allow efficient and expressive use in multi-threaded applications. The interface is designed for extreme scalability throughout, and deliberately avoids design features that could inhibit scalability.

### Dan Bonachea, "UPC++ as_eager Working Group Draft, Revision 2020.6.2", Lawrence Berkeley National Laboratory Tech Report, August 9, 2021, LBNL 2001416, doi: 10.25344/S4FK5R

This draft proposes an extension for a new future-based completion variant that can be more effectively streamlined for RMA and atomic access operations that happen to be satisfied at runtime using purely node-local resources. Many such operations are most efficiently performed synchronously using load/store instructions on shared-memory mappings, where the actual access may only require a few CPU instructions. In such cases we believe it’s critical to minimize the overheads imposed by the UPC++ runtime and completion queues, in order to enable efficient operation on hierarchical node hardware using shared-memory bypass.

The new upcxx::{source,operation}_cx::as_eager_future() completion variant accomplishes this goal by relaxing the current restriction that future-returning access operations must return a non-ready future whose completion is deferred until a subsequent explicit invocation of user-level progress. This relaxation allows access operations that are completed synchronously to instead return a ready future, thereby avoiding most or all of the runtime costs associated with deferment of future completion and subsequent mandatory entry into the progress engine.

We additionally propose to make this new as_eager_future() completion variant the new default completion for communication operations that currently default to returning a future. This should encourage use of the streamlined variant, and may provide performance improvements to some codes without source changes. A mechanism is proposed to restore the legacy behavior on-demand for codes that might happen to rely on deferred completion for correctness.

Finally, we propose a new as_eager_promise() completion variant that extends analogous improvements to promise-based completion, and corresponding changes to the default behavior of as_promise().

### Drew Paine, Sarah Poon, Lavanya Ramakrishnan, "Investigating User Experiences with Data Abstractions on High Performance Computing Systems", June 29, 2021, LBNL-2001374

Scientific exploration generates expanding volumes of data that commonly require High Performance Computing (HPC) systems to facilitate research. HPC systems are complex ecosystems of hardware and software that frequently are not user friendly. The Usable Data Abstractions (UDA) project set out to build usable software for scientific workflows in HPC environments by undertaking multiple rounds of qualitative user research. Qualitative research investigates how individuals accomplish their work and our interview-based study surfaced a variety of insights about the experiences of working in and with HPC ecosystems. This report examines multiple facets to the experiences of scientists and developers using and supporting HPC systems. We discuss how stakeholders grasp the design and configuration of these systems, the impacts of abstraction layers on their ability to successfully do work, and the varied perceptions of time that shape this work. Examining the adoption of the Cori HPC at NERSC we explore the anticipations and lived experiences of users interacting with this system's novel storage feature, the Burst Buffer. We present lessons learned from across these insights to illustrate just some of the challenges HPC facilities and their stakeholders need to account for when procuring and supporting these essential scientific resources to ensure their usability and utility to a variety of scientific practices.

### Devarshi Ghoshal, Drew Paine, Gilberto Pastorello, Abdelrahman Elbashandy, Dan Gunter, Oluwamayowa Amusat, Lavanya Ramakrishnan, "Experiences with Reproducibility: Case Studies from Scientific Workflows", Proceedings of the 4th International Workshop on Practical Reproducible Evaluation of Computer Systems (P-RECS'21), ACM, June 21, 2021, doi: 10.1145/3456287.3465478

Reproducible research is becoming essential for science to ensure transparency and for building trust. Additionally, reproducibility provides the cornerstone for sharing of methodology that can improve efficiency. Although several tools and studies focus on computational reproducibility, we need a better understanding about the gaps, issues, and challenges for enabling reproducibility of scientific results beyond the computational stages of a scientific pipeline. In this paper, we present five different case studies that highlight the reproducibility needs and challenges under various system and environmental conditions. Through the case studies, we present our experiences in reproducing different types of data and methods that exist in an experimental or analysis pipeline. We examine the human aspects of reproducibility while highlighting the things that worked, that did not work, and that could have worked better for each of the cases. Our experiences capture a wide range of scenarios and are applicable to a much broader audience who aim to integrate reproducibility in their everyday pipelines.

### Paul H. Hargrove, Dan Bonachea, Max Grossman, Amir Kamil, Colin A. MacLean, Daniel Waters, "UPC++ and GASNet: PGAS Support for Exascale Apps and Runtimes (ECP'21)", Poster at Exascale Computing Project (ECP) Annual Meeting 2021, April 2021

We present UPC++ and GASNet-EX, which together enable one-sided, lightweight communication such as arises in irregular applications, libraries and frameworks running on exascale systems.

UPC++ is a C++ PGAS library, featuring APIs for Remote Memory Access (RMA) and Remote Procedure Call (RPC).  The combination of these two features yields performant, scalable solutions to problems of interest within ECP.

GASNet-EX is PGAS communication middleware, providing the foundation for UPC++ and Legion, plus numerous non-ECP clients.  GASNet-EX RMA interfaces match or exceed the performance of MPI-RMA across a variety of pre-exascale systems.

### Marco Pritoni, Drew Paine, Gabriel Fierro, Cory Mosiman, Michael Poplawski, Joel Bender, Jessica Granderson, "Metadata Schemas and Ontologies for Building Energy Applications: A Critical Review and Use Case Analysis", Energies, April 6, 2021, doi: 10.3390/en14072024

Digital and intelligent buildings are critical to realizing efficient building energy operations and a smart grid. With the increasing digitalization of processes throughout the life cycle of buildings, data exchanged between stakeholders and between building systems have grown significantly. However, a lack of semantic interoperability between data in different systems is still prevalent and hinders the development of energy-oriented applications that can be reused across buildings, limiting the scalability of innovative solutions. Addressing this challenge, our review paper systematically reviews metadata schemas and ontologies that are at the foundation of semantic interoperability necessary to move toward improved building energy operations. The review finds 40 schemas that span different phases of the building life cycle, most of which cover commercial building operations and, in particular, control and monitoring systems. The paper’s deeper review and analysis of five popular schemas identify several gaps in their ability to fully facilitate the work of a building modeler attempting to support three use cases: energy audits, automated fault detection and diagnosis, and optimal control. Our findings demonstrate that building modelers focused on energy use cases will find it difficult, labor intensive, and costly to create, sustain, and use semantic models with existing ontologies. This underscores the significant work still to be done to enable interoperable, usable, and maintainable building models. We make three recommendations for future work by the building modeling and energy communities: a centralized repository with a search engine for relevant schemas, the development of more use cases, and better harmonization and standardization of schemas in collaboration with industry to facilitate their adoption by stakeholders addressing varied energy-focused use cases.

### Dan Bonachea, Amir Kamil, "UPC++ v1.0 Specification, Revision 2021.3.0", Lawrence Berkeley National Laboratory Tech Report, March 31, 2021, LBNL 2001388, doi: 10.25344/S4K881

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). All communication operations are syntactically explicit and default to non-blocking; asynchrony is managed through the use of futures, promises and continuation callbacks, enabling the programmer to construct a graph of operations to execute asynchronously as high-latency dependencies are satisfied. A global pointer abstraction provides system-wide addressability of shared memory, including host and accelerator memories. The parallelism model is primarily process-based, but the interface is thread-safe and designed to allow efficient and expressive use in multi-threaded applications. The interface is designed for extreme scalability throughout, and deliberately avoids design features that could inhibit scalability.

### Ed Younis, Koushik Sen, Katherine Yelick, Costin Iancu, "QFAST: Quantum Synthesis Using a Hierarchical Continuous Circuit Space", Bulletin of the American Physical Society, March 2021

We present QFAST, a quantum synthesis tool designed to produce short circuits and to scale well in practice. Our contributions are: 1) a novel representation of circuits able to encode placement and topology; 2) a hierarchical approach with an iterative refinement formulation that combines "coarse-grained" fast optimization during circuit structure search with a good, but slower, optimization stage only in the final circuit instantiation. When compared against state-of-the-art techniques, although not always optimal, QFAST can reduce circuits for "time-dependent evolution" algorithms, as used by domain scientists, by 60x in depth. On typical circuits, it provides 4x better depth reduction than the widely used Qiskit and UniversalQ compilers. We also show the composability and tunability of our formulation in terms of circuit depth and running time. For example, we show how to generate shorter circuits by plugging in the best available third party synthesis algorithm at a given hierarchy level. Composability enables portability across chip architectures, which is missing from similar approaches.
QFAST is integrated with Qiskit and available at github.com/bqskit.

### Akel Hashim, Ravi Naik, Alexis Morvan, Jean-Loup Ville, Brad Mitchell, John Mark Kreikebaum, Marc Davis, Ethan Smith, Costin Iancu, Kevin O'Brien, Ian Hincks, Joel Wallman, Joseph V. Emerson, David Ivan Santiago, Irfan Siddiqi, "Scalable Quantum Computing on a Noisy Superconducting Quantum Processor via Randomized Compiling", Bulletin of the American Physical Society, 2021

Coherent errors in quantum hardware severely limit the performance of quantum algorithms in an unpredictable manner, and mitigating their impact is necessary for realizing reliable, large-scale quantum computations. Randomized compiling achieves this goal by converting coherent errors into stochastic noise, dramatically reducing unpredictable errors in quantum algorithms and enabling accurate predictions of aggregate performance via cycle benchmarking estimates. In this work, we demonstrate significant performance gains under randomized compiling for both the four-qubit quantum Fourier transform algorithm and for random circuits of variable depth on a superconducting quantum processor. We also validate solution accuracy using experimentally-measured error rates. Our results demonstrate that randomized compiling can be utilized to maximally-leverage and predict the capabilities of modern-day noisy quantum processors, paving the way forward for scalable quantum computing.

### Dan Bonachea, "GASNet-EX: A High-Performance, Portable Communication Library for Exascale", Berkeley Lab – CS Seminar, March 10, 2021

Partitioned Global Address Space (PGAS) models, pioneered by languages such as Unified Parallel C (UPC) and Co-Array Fortran, expose one-sided communication as a key building block for High Performance Computing (HPC) applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in future exascale machines. The library is an evolution of the popular GASNet communication system, building on 20 years of lessons learned. We describe several features and enhancements that have been introduced to address the needs of modern runtimes and exploit the hardware capabilities of emerging systems. Microbenchmark results demonstrate the RMA performance of GASNet-EX is competitive with several MPI implementations on current systems. GASNet-EX provides communication services that help to deliver speedups in HPC applications written using the UPC++ library, enabling new science on pre-exascale systems.

### Thijs Steel, Daan Camps, Karl Meerbergen, Raf Vandebril, "A Multishift, Multipole Rational QZ Method with Aggressive Early Deflation", SIAM Journal on Matrix Analysis and Applications, February 19, 2021, 42:753-774, doi: 10.1137/19M1249631

In the article “A Rational QZ Method” by D. Camps, K. Meerbergen, and R. Vandebril [SIAM J. Matrix Anal. Appl., 40 (2019), pp. 943--972], we introduced rational QZ (RQZ) methods. Our theoretical examinations revealed that the convergence of the RQZ method is governed by rational subspace iteration, thereby generalizing the classical QZ method, whose convergence relies on polynomial subspace iteration. Moreover the RQZ method operates on a pencil more general than Hessenberg---upper triangular, namely, a Hessenberg pencil, which is a pencil consisting of two Hessenberg matrices. However, the RQZ method can only be made competitive to advanced QZ implementations by using crucial add-ons such as small bulge multishift sweeps, aggressive early deflation, and optimal packing. In this paper we develop these techniques for the RQZ method. In the numerical experiments we compare the results with state-of-the-art routines for the generalized eigenvalue problem and show that the presented method is competitive in terms of speed and accuracy.

### Daan Camps, Roel Van Beeumen, "Approximate quantum circuit synthesis using block encodings", Physical Review A, November 11, 2020, 102, doi: 10.1103/PhysRevA.102.052411

One of the challenges in quantum computing is the synthesis of unitary operators into quantum circuits with polylogarithmic gate complexity. Exact synthesis of generic unitaries requires an exponential number of gates in general. We propose a novel approximate quantum circuit synthesis technique by relaxing the unitary constraints and interchanging them for ancilla qubits via block encodings. This approach combines smaller block encodings, which are easier to synthesize, into quantum circuits for larger operators. Due to the use of block encodings, our technique is not limited to unitary operators and can be applied for the synthesis of arbitrary operators. We show that operators which can be approximated by a canonical polyadic expression with a polylogarithmic number of terms can be synthesized with polylogarithmic gate complexity with respect to the matrix dimension.

### Katherine A. Yelick, Amir Kamil, Dan Bonachea, Paul H. Hargrove, "UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (SC20)", Tutorial at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC20), November 10, 2020

UPC++ is a C++ library supporting Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided Remote Memory Access (RMA) and Remote Procedure Calls (RPC), along with future/promise-based asynchrony to express dependencies between asynchronous computations and data movement. UPC++ supports simple, regular data structures as well as more elaborate distributed structures where communication is fine-grained, irregular, or both. UPC++'s support for aggressive asynchrony enables the application to overlap communication to reduce communication wait times, and the GASNet communication layer provides efficient low-overhead RMA/RPC on HPC networks.

This tutorial introduces basic concepts and advanced optimization techniques of UPC++. We discuss the UPC++ memory and execution models and examine basic algorithm implementations. Participants gain hands-on experience incorporating UPC++ features into several application examples. We also examine two irregular applications (metagenomic assembler and multifrontal sparse solver) and describe how they leverage UPC++ features to optimize communication performance.

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Programmer’s Guide, Revision 2020.10.0",Lawrence Berkeley National Laboratory Tech Report,October 2020,LBNL 2001368, doi: 10.25344/S4HG6Q

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Drew Paine, Lavanya Ramakrishnan,"Understanding Interactive and Reproducible Computing With Jupyter Tools at Facilities",LBNL Technical Report,October 31, 2020,LBNL-2001355,

Increasingly, Jupyter tools are being adopted and incorporated into High Performance Computing (HPC) and scientific user facilities. Adopting Jupyter tools enables more interactive and reproducible computational work at facilities across data life cycles. As the volume, variety, and scope of data grow, scientists need to be able to analyze and share results in user-friendly ways. Human-centered research highlights design challenges around computational notebooks, and our qualitative user study shifts focus to better characterize how Jupyter tools are being used in HPC and science user facilities today. We conducted twenty-nine interviews, and obtained 103 survey responses from NERSC Jupyter users, to better understand the increasing role of interactive computing tools in DOE-sponsored scientific work. We examine a range of issues that emerge when using and supporting Jupyter in HPC ecosystems, including: how Jupyter is being used by scientists in HPC and user facility ecosystems; how facilities are purposefully supporting Jupyter in their ecosystems; and feedback NERSC users have about the facility’s deployment, including features they indicated would be helpful. We offer a variety of takeaways for staff supporting Jupyter at facilities, Project Jupyter and related open source communities, and funding agencies supporting interactive computing work.

### Dan Bonachea, Amir Kamil,"UPC++ v1.0 Specification, Revision 2020.10.0",Lawrence Berkeley National Laboratory Tech Report,October 30, 2020,LBNL 2001367, doi: 10.25344/S4CS3F

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). All communication operations are syntactically explicit and default to non-blocking; asynchrony is managed through the use of futures, promises and continuation callbacks, enabling the programmer to construct a graph of operations to execute asynchronously as high-latency dependencies are satisfied. A global pointer abstraction provides system-wide addressability of shared memory, including host and accelerator memories. The parallelism model is primarily process-based, but the interface is thread-safe and designed to allow efficient and expressive use in multi-threaded applications. The interface is designed for extreme scalability throughout, and deliberately avoids design features that could inhibit scalability.
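The future/promise style of asynchrony described above can be illustrated with a small Python analogy (this is not the UPC++ API; the `then` combinator and the `fetch_remote` stand-in are invented to mirror the idea of chaining continuations onto non-blocking operations):

```python
from concurrent.futures import ThreadPoolExecutor, Future

def then(fut, fn):
    # Minimal 'then' combinator: schedule fn to run once fut completes,
    # producing a new future -- analogous in spirit to UPC++'s future::then().
    out = Future()
    fut.add_done_callback(lambda f: out.set_result(fn(f.result())))
    return out

def fetch_remote(addr):
    # Stand-in for a non-blocking one-sided get of remote data.
    return {"a1": 41}[addr]

with ThreadPoolExecutor() as pool:
    f1 = pool.submit(fetch_remote, "a1")     # launch the high-latency operation
    f2 = then(f1, lambda v: v + 1)           # continuation: runs when f1 is done
    f3 = then(f2, lambda v: v * 2)           # a small graph of dependent steps
    result = f3.result()

print(result)  # 84
```

The point of the pattern is that dependencies form an explicit graph, so each step fires as soon as its predecessor's latency is hidden, rather than blocking eagerly.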

### Marc G. Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu,"Towards Optimal Topology Aware Quantum Circuit Synthesis",2020 IEEE International Conference on Quantum Computing and Engineering (QCE),Denver, CO, USA,IEEE,October 12, 2020,doi: 10.1109/QCE49297.2020.00036

We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As CNOT gates are error-prone on Noisy Intermediate-Scale Quantum (NISQ) devices for the foreseeable future, our A*-inspired algorithm minimizes their count while accounting for connectivity. We discuss the formulation of synthesis as a search problem as well as an algorithm to find solutions. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in the literature and match or exceed hand-tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show a 2.4× average (up to 5.3×) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set while obtaining similar-quality results. We believe that tools like ours can facilitate algorithmic exploration and guide gate set discovery for quantum processor designers, as well as being useful for optimization in the quantum compilation tool-chain.
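The structure of such an A*-inspired search can be sketched generically (this is an illustrative skeleton, not the authors' synthesis code; in the paper's setting a state would be a partial circuit, `expand` would append native gates, and the step cost would count error-prone CNOTs, whereas here we search a toy integer state space):

```python
import heapq

def a_star(start, is_goal, expand, heuristic):
    # Frontier entries: (f = g + h, g, state, path); best-known cost per state.
    frontier = [(heuristic(start), 0, start, [start])]
    best = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path, g
        for nxt, step_cost in expand(state):
            ng = g + step_cost
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                heapq.heappush(frontier, (ng + heuristic(nxt), ng, nxt, [*path, nxt]))
    return None, float("inf")

# Toy instance: reach 10 from 1 using +1 (cost 1) or *2 (cost 1).
path, cost = a_star(
    1,
    lambda s: s == 10,
    lambda s: [(s + 1, 1), (s * 2, 1)],
    lambda s: 0,  # trivially admissible heuristic (degenerates to Dijkstra)
)
print(cost)  # 4, via 1 -> 2 -> 4 -> 5 -> 10
```

The quality of the heuristic is what keeps the frontier tractable; with an uninformative heuristic, as here, A* reduces to uniform-cost search.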

### Daan Camps, Thomas Mach, Raf Vandebril, David Watkins,"On pole-swapping algorithms for the eigenvalue problem",ETNA - Electronic Transactions on Numerical Analysis,September 18, 2020,52:480-508,doi: 10.1553/etna_vol52s480

Pole-swapping algorithms, which are generalizations of the QZ algorithm for the generalized eigenvalue problem, are studied. A new modular (and therefore more flexible) convergence theory that applies to all pole-swapping algorithms is developed. A key component of all such algorithms is a procedure that swaps two adjacent eigenvalues in a triangular pencil. An improved swapping routine is developed, and its superiority over existing methods is demonstrated by a backward error analysis and numerical tests. The modularity of the new convergence theory and the generality of the pole-swapping approach shed new light on bi-directional chasing algorithms, optimally packed shifts, and bulge pencils, and allow the design of novel algorithms.

### Drew Paine, Devarshi Ghoshal, Lavanya Ramakrishnan,"Experiences with a Flexible User Research Process to Build Data Change Tools",Journal of Open Research Software,September 1, 2020,doi: 10.5334/jors.284

Scientific software development processes are understood to be distinct from commercial software development practices due to uncertain and evolving states of scientific knowledge. Sustaining these software products is a recognized challenge, but under-examined is the usability and usefulness of such tools to their scientific end users. User research is a well-established set of techniques (e.g., interviews, mockups, usability tests) applied in commercial software projects to develop foundational, generative, and evaluative insights about products and the people who use them. Currently these approaches are not commonly applied and discussed in scientific software development work. The use of user research techniques in scientific environments can be challenging due to the nascent, fluid problem spaces of scientific work, varying scope of projects and their user communities, and funding/economic constraints on projects.

In this paper, we reflect on our experiences undertaking a multi-method user research process in the Deduce project. The Deduce project is investigating data change to develop metrics, methods, and tools that will help scientists make decisions around data change. There is a lack of common terminology since the concept of systematically measuring and managing data change is under explored in scientific environments. To bridge this gap we conducted user research that focuses on user practices, needs, and motivations to help us design and develop metrics and tools for data change. This paper contributes reflections and the lessons we have learned from our experiences. We offer key takeaways for scientific software project teams to effectively and flexibly incorporate similar processes into their projects.

### Oguz Selvitopi*, Saliya Ekanayake*, Giulia Guidi, Georgios Pavlopoulos, Ariful Azad, Aydın Buluç,"Distributed Many-to-Many Protein Sequence Alignment Using Sparse Matrices",Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20),2020,

(*: joint first authors)

### Patricia Gonzalez-Guerrero, Tommy Tracy II, Xinfei Guo, Rahul Sreekumar, Marzieh Lenjani, Kevin Skadron, Mircea R Stan,"Towards on-node Machine Learning for Ultra-low-power Sensors Using Asynchronous Σ Δ Streams",Journal on Emerging Technologies in Computing Systems (JETC),August 26, 2020,doi: 10.1145/3404975

We propose a novel architecture to enable low-power, complex on-node data processing for the next generation of sensors for the internet of things (IoT), smartdust, or edge intelligence. Our architecture combines near-analog-memory-computing (NAM) and asynchronous-computing-with-streams (ACS), eliminating the need for ADCs. ACS enables the ultra-low-power, massive computational resources required to execute on-node complex Machine Learning (ML) algorithms, while NAM addresses the memory wall that represents a common bottleneck for ML and other complex functions. In ACS an analog value is mapped to an asynchronous stream that can take one of two logic levels (vh, vl). This stream-based data representation enables area/power-efficient computing units such as a multiplier implemented as an AND gate, yielding power savings of ∼90% compared to digital approaches. Generating streams for NAM and ACS in a brute-force manner, using analog-to-digital converters (ADCs) and digital-to-stream converters, would sky-rocket the power-latency-energy cost, making the approach impractical. Our NAM-ACS architecture eliminates these expensive conversions, enabling an end-to-end data-path that processes asynchronous streams. We tailor the NAM-ACS architecture for random forest (RaF), an ML algorithm chosen for its ability to classify using a reduced number of features. Simulations show that our NAM-ACS architecture enables 75% power savings compared with a single ADC, obtaining a classification accuracy of 85% using an RaF-inspired algorithm.
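The key property behind the AND-gate multiplier can be demonstrated in a few lines of Python (an illustrative software model, not the paper's circuit): encode each value in [0, 1] as the duty cycle of a random bit stream, and the AND of two independent streams has a duty cycle approximating the product.

```python
import random

def to_stream(p, n, rng):
    # Encode value p in [0,1] as a random bit stream with duty cycle p.
    return [rng.random() < p for _ in range(n)]

rng = random.Random(0)   # seeded for reproducibility
n = 100_000
a = to_stream(0.6, n, rng)
b = to_stream(0.5, n, rng)

# The "multiplier": a single AND per stream bit.
product_stream = [x and y for x, y in zip(a, b)]
estimate = sum(product_stream) / n
print(estimate)  # close to 0.6 * 0.5 = 0.30
```

The accuracy improves with stream length at a statistical rate of O(1/sqrt(n)), which is the usual trade-off of stream-based (stochastic) computing: tiny logic in exchange for longer streams.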

### Drew Paine, Devarshi Ghoshal, Lavanya Ramakrishnan,"Investigating Scientific Data Change with User Research Methods",August 20, 2020,LBNL-2001347,

Scientific datasets are continually expanding and changing due to fluctuations with instruments, quality assessment and quality control processes, and modifications to software pipelines. Datasets include minimal information about these changes or their effects, requiring scientists to manually assess modifications through a number of labor-intensive and ad-hoc steps. The Deduce project is investigating data change to develop metrics, methods, and tools that will help scientists systematically identify and make decisions around data changes. Currently, there is a lack of understanding, and of common practices, for identifying and evaluating changes in datasets, since systematically measuring and managing data change is under-explored in scientific work. We are conducting user research to address this need by exploring scientists' conceptualizations, behaviors, needs, and motivations when dealing with changing datasets. Our user research utilizes multiple methods to produce foundational, generative insights and to evaluate research products produced by our team. In this paper, we detail our user research process and outline our findings about data change that emerge from our studies. Our work illustrates how scientific software teams can push beyond just usability-testing user interfaces or tools to better probe the underlying ideas they are developing solutions to address.

### Matthew Li, Nicolas Chan, Viraat Chandra, Krishna Muriki,"Cluster Usage Policy Enforcement Using Slurm Plugins and an HTTP API",Practice and Experience in Advanced Research Computing,New York, NY, USA,Association for Computing Machinery,July 26, 2020,232–238,doi: 10.1145/3311790.3397341

Managing and limiting cluster resource usage is a critical task for computing clusters with a large number of users. By enforcing usage limits, cluster managers are able to ensure fair availability for all users, bill users accordingly, and prevent the abuse of cluster resources. As this is such a common problem, there are naturally many existing solutions. However, to allow for greater control over usage accounting and submission behavior in Slurm, we present a system composed of: a web API which exposes accounting data; Slurm plugins that communicate with a REST-like HTTP implementation of that API; and client tools that use it to report usage. Key advantages of our system include a customizable resource accounting formula based on job parameters, preemptive blocking of user jobs at submission time, project-level and user-level resource limits, and support for the development of other web and command-line clients that query the extensible web API. We deployed this system on Berkeley Research Computing’s institutional cluster, Savio, allowing us to automatically collect and store accounting data, and thereby easily enforce our cluster usage policy.
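The core policy check such a system performs can be sketched as follows (the function names and the accounting formula here are invented for illustration and are not Savio's actual rules): compute a job's charge in service units from its parameters, and preemptively block submission when the project would exceed its allocation.

```python
def service_units(cpus, hours, partition_rate):
    """Charge = CPUs x wall-clock hours x a per-partition rate (illustrative)."""
    return cpus * hours * partition_rate

def may_submit(project_usage, project_limit, job_su):
    """Block, at submission time, any job that would push the project over its limit."""
    return project_usage + job_su <= project_limit

job = service_units(cpus=32, hours=2, partition_rate=1.0)   # 64 SUs
print(may_submit(project_usage=950, project_limit=1000, job_su=job))  # False
```

In the paper's architecture this check lives in a Slurm job-submit plugin, with the usage totals fetched from the HTTP accounting API rather than held locally.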

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick,UPC++: An Asynchronous RMA/RPC Library for Distributed C++ Applications (ALCF'20),Argonne Leadership Computing Facility (ALCF) Webinar Series,May 27, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this webinar, hosted by DOE’s Exascale Computing Project and the ALCF, we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

ALCF'20 Event page

ALCF'20 Video recording

### John Bachan, Scott B. Baden, Dan Bonachea, Max Grossman, Paul H. Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Programmer’s Guide, Revision 2020.3.0",Lawrence Berkeley National Laboratory Tech Report,March 2020,LBNL 2001269, doi: 10.25344/S4P88Z

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Dan Bonachea, Amir Kamil,"UPC++ v1.0 Specification, Revision 2020.3.0",Lawrence Berkeley National Laboratory Tech Report,March 12, 2020,LBNL 2001268, doi: 10.25344/S4T01S

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The key communication facilities in UPC++ are one-sided Remote Memory Access (RMA) and Remote Procedure Call (RPC). All communication operations are syntactically explicit and default to non-blocking; asynchrony is managed through the use of futures, promises and continuation callbacks, enabling the programmer to construct a graph of operations to execute asynchronously as high-latency dependencies are satisfied. A global pointer abstraction provides system-wide addressability of shared memory, including host and accelerator memories. The parallelism model is primarily process-based, but the interface is thread-safe and designed to allow efficient and expressive use in multi-threaded applications. The interface is designed for extreme scalability throughout, and deliberately avoids design features that could inhibit scalability.

### Muammar El Khatib, Wibe De Jong,Feature Extraction Using Semi-Supervised Deep Learning,APS March 2020,March 5, 2020,

Features are defined as measurable properties that characterize observed phenomena and represent a key part of machine learning (ML) algorithms. In materials sciences, ML has successfully accelerated atomistic simulations using man-engineered features for tasks such as energy or atomic force predictions. These features fulfill physics constraints such as rotational and translational invariance, uniqueness, and locality (the sum of local contributions reconstructs a global quantity). However, these ML models are known to perform poorly when operating out of the training-set regime because the features are not representative of the underlying structure of the data. This could be improved if features are extracted with advanced hybrid architectures, e.g., a variational autoencoder that is trained with physics constraints introduced with an external task and a loss function. We will explore how the use of semi-supervised learning techniques can be a powerful tool for the extraction of features for atomistic simulations. All results shown herein can be reproduced with ML4Chem: a free software package for machine learning in chemistry and materials sciences.

### Marzieh Lenjani, Patricia Gonzalez, Elaheh Sadredini, Shuangchen Li, Yuan Xie, Ameen Akel, Sean Eilert, Mircea R Stan, Kevin Skadron,"Fulcrum: a simplified control and access mechanism toward flexible and practical in-situ accelerators",International Symposium on High Performance Computer Architecture (HPCA),San Diego, CA, USA,IEEE,February 22, 2020,doi: 10.1109/HPCA47549.2020.00052

In-situ approaches process data very close to the memory cells, in the row buffer of each subarray. This minimizes data movement costs and affords parallelism across subarrays. However, current in-situ approaches are limited to only row-wide bitwise (or few-bit) operations applied uniformly across the row buffer. They impose a significant overhead of multiple row activations for emulating 32-bit addition and multiplications using bitwise operations and cannot support operations with data dependencies or based on predicates. Moreover, with current peripheral logic, communication among subarrays is inefficient, and with typical data layouts, bits in a word are not physically adjacent. The key insight of this work is that in-situ, single-word ALUs outperform in-situ, parallel, row-wide, bitwise ALUs by reducing the number of row activations and enabling new operations and optimizations. Our proposed lightweight access and control mechanism, Fulcrum, sequentially feeds data into the single-word ALU and enables operations with data dependencies and operations based on a predicate. For algorithms that require communication among subarrays, we augment the peripheral logic with broadcasting capabilities and a previously-proposed method for low-cost inter-subarray data movement. The sequential processor also enables overlapping of broadcasting and computation, and reuniting bits that are not physically adjacent. In order to realize true subarray-level parallelism, we introduce a lightweight column-selection mechanism through shifting one-hot encoded values. This technique enables independent column selection in each subarray. We integrate Fulcrum with Compute Express Link (CXL), a new interconnect standard.
Fulcrum with one memory stack delivers (i) on average (up to) 23.4 (76) times speedup over a server-class GPU, NVIDIA P100, with three stacks of HBM2 memory, (ii) 70 (228) times speedup per memory stack over the GPU, and (iii) 19 (178.9) times speedup per memory stack over an ideal model of the GPU, which only accounts for the overhead of data movement.
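The shift-based one-hot column selection can be modeled in a few lines (a toy software model invented for illustration, not the hardware design): instead of a full column decoder, each subarray keeps a one-hot register and shifts it to walk the row buffer one word at a time, feeding each word sequentially into a single-word ALU.

```python
row_buffer = [3, 1, 4, 1, 5, 9, 2, 6]          # words latched in a row buffer
select = [1] + [0] * (len(row_buffer) - 1)     # one-hot register: column 0 selected

acc = 0
for _ in range(len(row_buffer)):
    # Gated read: only the word under the one-hot bit reaches the ALU.
    word = sum(w for w, s in zip(row_buffer, select) if s)
    acc += word                                # single-word ALU operation (here: add)
    select = [0] + select[:-1]                 # shift the one-hot selector right

print(acc)  # 31, the running sum of the row
```

Because the selector is just a shift register, each subarray can advance its own one-hot pattern independently, which is what enables true subarray-level parallelism at low area cost.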

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick,UPC++: A PGAS/RPC Library for Asynchronous Exascale Communication in C++ (ECP'20),Tutorial at Exascale Computing Project (ECP) Annual Meeting 2020,February 6, 2020,

UPC++ is a C++ library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. The UPC++ API offers low-overhead one-sided RMA communication and Remote Procedure Calls (RPC), along with futures and promises. These constructs enable the programmer to express dependencies between asynchronous computations and data movement. UPC++ supports the implementation of simple, regular data structures as well as more elaborate distributed data structures where communication is fine-grained, irregular, or both. The library’s support for asynchrony enables the application to aggressively overlap and schedule communication and computation to reduce wait times.

UPC++ is highly portable and runs on platforms from laptops to supercomputers, with native implementations for HPC interconnects. As a C++ library, it interoperates smoothly with existing numerical libraries and on-node programming models (e.g., OpenMP, CUDA).

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through basic algorithm implementations. We will also look at irregular applications and show how they can take advantage of UPC++ features to optimize their performance.

ECP'20 Event page

### J Corbino, J Castillo,"High-order mimetic finite-difference operators satisfying the extended Gauss divergence theorem",Journal of Computational and Applied Mathematics,2020,doi: 10.1016/j.cam.2019.06.042

We present high-order mimetic finite-difference operators that satisfy the extended Gauss theorem. These operators have the same order of accuracy in the interior and at the boundary, no free parameters and optimal bandwidth. They are defined over staggered grids, using weighted inner products with a diagonal norm. We present several examples to demonstrate that mimetic finite-difference schemes using these operators produce excellent results.
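A minimal 1D staggered-grid example (a second-order sketch of our own, not the paper's high-order operators) shows the kind of discrete Gauss/integration-by-parts identity that mimetic divergence D and gradient G preserve exactly: sum v·(Du)·h + sum (Gv)·u·h equals the boundary flux term.

```python
# Staggered grid on [0,1]: u lives at the N+1 nodes/faces, v at the N cell centers.
N, h = 8, 1.0 / 8
u = [(j * h) ** 2 for j in range(N + 1)]
v = [(i + 0.5) * h for i in range(N)]

Du = [(u[i + 1] - u[i]) / h for i in range(N)]        # discrete divergence at centers
Gv = [(v[j] - v[j - 1]) / h for j in range(1, N)]     # discrete gradient at interior nodes

lhs = sum(vi * d * h for vi, d in zip(v, Du)) \
    + sum(g * u[j] * h for g, j in zip(Gv, range(1, N)))
rhs = v[-1] * u[-1] - v[0] * u[0]                     # discrete boundary term

print(abs(lhs - rhs) < 1e-12)  # True: the telescoping sums cancel exactly
```

The mimetic program is to extend this exact discrete identity to high order, with matching interior/boundary accuracy and weighted (diagonal-norm) inner products, which is what the paper's operators achieve.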

### A Boada, J Corbino, J Castillo,"High-order mimetic difference simulation of unsaturated flow using Richards equation",Mathematics in Applied Sciences and Engineering,2020,doi: 10.5206/mase/10874

The vadose zone is the portion of the subsurface above the water table and its pore space usually contains air and water. Due to the presence of infiltration, erosion, plant growth, microbiota, contaminant transport, aquifer recharge, and discharge to surface water, it is crucial to predict the transport rate of water and other substances within this zone. However, flow in the vadose zone has many complications as the parameters that control it are extremely sensitive to the saturation of the media, leading to a nonlinear problem. This flow is referred to as unsaturated flow and is governed by Richards equation. Analytical solutions for this equation exist only for simplified cases, so most practical situations require a numerical solution. Nevertheless, the nonlinear nature of Richards equation introduces challenges that cause numerical solutions for this problem to be computationally expensive and, in some cases, unreliable. High-order mimetic finite-difference operators are discrete analogs of the continuous differential operators and have been extensively used in the fields of fluid and solid mechanics. In this work, we present a numerical approach involving high-order mimetic operators along with a Newton root-finding algorithm for the treatment of the nonlinear component. A fully implicit time-discretization scheme is used to deal with the problem's stiffness.
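The Newton linearization at the heart of such a solver can be sketched on a scalar equation (the paper solves the full discretized PDE system; this toy reduction, with an invented nonlinearity, only shows the shape of the iteration):

```python
import math

def newton(f, df, x0, tol=1e-12, max_iter=50):
    # Classic Newton root-finding: repeatedly solve the linearized problem.
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy nonlinearity standing in for the saturation-dependent terms:
# solve exp(x) - 2 = 0, whose exact root is ln(2).
root = newton(lambda x: math.exp(x) - 2, lambda x: math.exp(x), x0=1.0)
print(abs(root - math.log(2)) < 1e-10)  # True
```

In the PDE setting, f becomes the residual of the fully implicit time step and df its Jacobian, so each Newton iteration requires a linear solve rather than a scalar division.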

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Kathy Yelick,UPC++ Tutorial (NERSC Dec 2019),National Energy Research Scientific Computing Center (NERSC),December 16, 2019,

This event was a repeat of the tutorial delivered on November 1, but with the restoration of the hands-on component which was omitted due to uncertainty surrounding the power outage at NERSC.

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we introduced basic concepts and advanced optimization techniques of UPC++. We discussed the UPC++ memory and execution models and walked through implementing basic algorithms in UPC++. We also discussed irregular applications and how to take advantage of UPC++ features to optimize their performance. The tutorial included hands-on exercises with basic UPC++ constructs. Registrants were given access to run their UPC++ exercises on NERSC’s Cori (currently the #14 fastest computer in the world).

NERSC Dec 2019 Event page

### Marc Grau Davis, Ethan Smith, Ana Tudor, Koushik Sen, Irfan Siddiqi, Costin Iancu,"Heuristics for Quantum Compiling with a Continuous Gate Set",3rd International Workshop on Quantum Compilation as part of the International Conference On Computer Aided Design 2019,December 5, 2019,

We present an algorithm for compiling arbitrary unitaries into a sequence of gates native to a quantum processor. As accurate CNOT gates are hard to realize on Noisy Intermediate-Scale Quantum (NISQ) devices for the foreseeable future, our A*-inspired algorithm attempts to minimize their count while accounting for connectivity. We discuss the search strategy together with metrics to expand the solution frontier. For a workload of circuits with complexity appropriate for the NISQ era, we produce solutions well within the best upper bounds published in the literature and match or exceed hand-tuned implementations, as well as other existing synthesis alternatives. In particular, when comparing against state-of-the-art available synthesis packages we show a 2.4x average (up to 5.3x) reduction in CNOT count. We also show how to re-target the algorithm for a different chip topology and native gate set while obtaining similar-quality results. We believe that empirical tools like ours can facilitate algorithmic exploration and gate set discovery for quantum processor designers, as well as providing useful optimization blocks within the quantum compilation tool-chain.

### Paul H. Hargrove, Dan Bonachea,"Efficient Active Message RMA in GASNet Using a Target-Side Reassembly Protocol (Extended Abstract)",IEEE/ACM Parallel Applications Workshop, Alternatives To MPI+X (PAW-ATM),Lawrence Berkeley National Laboratory Technical Report,November 17, 2019,LBNL 2001238, doi: 10.25344/S4PC7M

GASNet is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper investigates strategies for efficient implementation of GASNet’s “AM Long” API that couples an RMA (Remote Memory Access) transfer with an Active Message (AM) delivery.
We discuss several network-level protocols for AM Long and propose a new target-side reassembly protocol. We present a microbenchmark evaluation on the Cray XC Aries network hardware. The target-side reassembly protocol on this network improves AM Long end-to-end latency by up to 33%, and the effective bandwidth by up to 49%, while also enabling asynchronous source completion that drastically reduces injection overheads.
The improved AM Long implementation for Aries is available in GASNet-EX release v2019.9.0 and later.

### Dan Bonachea, Paul H. Hargrove,"GASNet-EX: A High-Performance, Portable Communication Library for Exascale",LNCS 11882: Proceedings of Languages and Compilers for Parallel Computing (LCPC'18),edited by Hall M., Sundar H.,November 2019,11882:138-158,doi: 10.1007/978-3-030-34627-0_11

Partitioned Global Address Space (PGAS) models, typified by such languages as Unified Parallel C (UPC) and Co-Array Fortran, expose one-sided communication as a key building block for High Performance Computing (HPC) applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in future exascale machines. The library is an evolution of the popular GASNet communication system, building upon over 15 years of lessons learned. We describe and evaluate several features and enhancements that have been introduced to address the needs of modern client systems. Microbenchmark results demonstrate the RMA performance of GASNet-EX is competitive with several MPI-3 implementations on current HPC systems.

### Christine T. Wolf, Julia Bullard, Stacy Wood, Amelia Acker, Drew Paine, Charlotte P. Lee,"Mapping the “How” of Collaborative Action: Research Methods for Studying Contemporary Sociotechnical Processes",CSCW '19: Conference Companion Publication of the 2019 on Computer Supported Cooperative Work and Social Computing,November 10, 2019,doi: 10.1145/3311957.3359441

Process has been a concern since the beginning of CSCW. Developments in sociotechnical landscapes raise new challenges for studying processes (e.g., massive online communities bringing together vast crowds; Big Data technologies connecting many through the flow of data). This re-opens questions about how we study, document, conceptualize, and design to support processes in complex, contemporary sociotechnical systems. This one-day workshop will bring together researchers to discuss the CSCW community’s unique focus and methodological toolkit for studying process and workflow; provide a collaborative space for the improvement and extension of research projects within this space; and catalyze a network of scholars with expertise and interest in addressing challenging methodological questions around studying process in contemporary, sociotechnical systems.

### Marzieh Lenjani, Patricia Gonzalez, Elaheh Sadredini, M Arif Rahman, Mircea R Stan,"An overflow-free quantized memory hierarchy in general-purpose processors",International Symposium on Workload Characterization (IISWC),Orlando, FL, USA,IEEE,November 3, 2019,doi: 10.1109/IISWC47752.2019.9042035

Data movement comprises a significant portion of energy consumption and execution time in modern applications. Accelerator designers exploit quantization to reduce the bitwidth of values and reduce the cost of data movement. However, any value that does not fit in the reduced bitwidth results in an overflow (we refer to these values as outliers). Therefore, accelerators use quantization only for applications that are tolerant to overflows. We observe that in most applications the rate of outliers is low and values are often within a narrow range, providing the opportunity to exploit quantization in general-purpose processors. However, a software implementation of quantization in general-purpose processors has three problems. First, the programmer has to manually implement conversions and the additional instructions that quantize and dequantize values, imposing programmer effort and performance overhead. Second, to cover outliers, the bitwidth of the quantized values often becomes greater than or equal to that of the original values. Third, the programmer has to use standard bitwidths; otherwise, extracting non-standard bitwidths (i.e., 1-7, 9-15, and 17-31) for representing narrow integers exacerbates the overhead of software-based quantization. The key idea of this paper is to propose hardware support in the memory hierarchy of general-purpose processors for quantization, which represents values with few and flexible numbers of bits and stores outliers in their original format in a separate space, preventing any overflow. We minimize metadata and the overhead of locating quantized values using a software-hardware interaction that transfers quantization parameters and data layout to hardware. As a result, our approach has three advantages over cache compression techniques: (i) less metadata, (ii) a higher compression ratio for floating-point values and cache blocks with multiple data types, and (iii) lower overhead for locating the compressed blocks.
It delivers on average 1.40/1.45/1.56× speedup and 24/26/30% energy reduction compared to a baseline that uses full-length variables in a 4/8/16-core system. Our approach also provides a 1.23× speedup, in a 4-core system, over state-of-the-art cache compression techniques, and adds only 0.25% area overhead to the baseline processor.
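The core idea (narrow-bitwidth codes, with outliers kept at full width in a separate space so nothing ever overflows) can be illustrated with a toy software sketch. This is our own software-only analogy with hypothetical names, not the paper's hardware mechanism; it reserves the all-ones code as an outlier marker:

```python
def quantize(values, bits):
    """Map integers to `bits`-bit codes. The all-ones code is reserved
    as an outlier marker; outliers keep their original format in a
    separate side table, so no value is ever lost to overflow."""
    marker = (1 << bits) - 1          # reserved "outlier" code
    codes, outliers = [], {}
    for i, v in enumerate(values):
        if 0 <= v < marker:
            codes.append(v)           # fits in the narrow bitwidth
        else:
            codes.append(marker)      # points to the side table
            outliers[i] = v           # full-width original value
    return codes, outliers

def dequantize(codes, outliers, bits):
    """Recover the original values exactly."""
    marker = (1 << bits) - 1
    return [outliers[i] if c == marker else c
            for i, c in enumerate(codes)]
```

With 4-bit codes, a list like `[3, 1, 1000, 2, -5]` stores three values in 4 bits each and only the two outliers (1000 and -5) at full width, round-tripping losslessly.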

### Patricia Gonzalez-Guerrero, Mircea R. Stan,"Asynchronous Stochastic Computing",53rd Asilomar Conference on Signals, Systems, and Computers,Pacific Grove, CA, USA,IEEE,November 3, 2019,doi: 10.1109/IEEECONF44664.2019.9049011

Asynchronous Stochastic Computing (ASC) leverages the advantages of Synchronous Stochastic Computing (SSC) while addressing its drawbacks. In SSC a multiplier is a single AND gate, saving ~90% of power and area compared with a typical 8-bit binary multiplier. The key to SSC's power-area efficiency comes from mapping numbers to streams of 1s and 0s. Despite this efficiency, SSC drawbacks such as long latency, a costly clock distribution network (CDN), and expensive stream generation cause the energy consumption to grow prohibitively large. In this work, we introduce the foundations for ASC using continuous-time Markov chains, and analyze the computing error due to random fluctuations. In ASC, data is mapped to asynchronous continuous-time streams, which yields two advantages over the synchronous counterpart: (1) CDN elimination, and (2) better accuracy. We compare ASC with SSC for three applications: (1) multiplication, (2) an image processing algorithm, gamma correction, and (3) a single layer of a fully-connected artificial neural network (ANN), using a FinFET1X technology. Our Matlab and Spice-level simulations and post-place&route (P&R) reports demonstrate that ASC yields savings of 10%-55%, 33%-44%, and 50% in latency, power, and energy, respectively. These savings make ASC a good candidate to address the ultra-low-power requirements of machine learning for the IoT.
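The AND-gate multiplier mentioned above is easy to see in software: a value in [0, 1] becomes the density of 1s in a random bitstream, and ANDing two independent streams yields a stream whose density is the product. A minimal synchronous-SC sketch (our own illustration; the continuous-time, clockless aspect of ASC is not modeled here):

```python
import random

def to_stream(p, n, rng):
    """Encode a probability p in [0, 1] as an n-bit stochastic stream
    whose density of 1s is p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def sc_multiply(a, b, n=100_000, seed=1):
    """Stochastic multiplication: the AND of two independent streams
    has a 1-density of approximately a * b."""
    rng = random.Random(seed)
    sa = to_stream(a, n, rng)
    sb = to_stream(b, n, rng)
    return sum(x & y for x, y in zip(sa, sb)) / n
```

For example, `sc_multiply(0.5, 0.4)` converges toward 0.2 as the stream length grows; the long streams needed for accuracy are exactly the latency drawback the abstract describes.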

### Amir Kamil, John Bachan, Scott B. Baden, Dan Bonachea, Rob Egan, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Kathy Yelick,UPC++ Tutorial (NERSC Nov 2019),National Energy Research Scientific Computing Center (NERSC),November 1, 2019,

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. UPC++ provides mechanisms for low-overhead one-sided communication, moving computation to data through remote-procedure calls, and expressing dependencies between asynchronous computations and data movement. It is particularly well-suited for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces are designed to be composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds.

In this tutorial we will introduce basic concepts and advanced optimization techniques of UPC++. We will discuss the UPC++ memory and execution models and walk through implementing basic algorithms in UPC++. We will also look at irregular applications and how to take advantage of UPC++ features to optimize their performance.

NERSC Nov 2019 Event Page

### Patricia Gonzalez-Guerrero, Tommy Tracy II, Xinfei Guo, Mircea R Stan,"Towards low-power random forest using asynchronous computing with streams",Tenth International Green and Sustainable Computing Conference (IGSC),EEE Computer Society,October 1, 2019,doi: 10.1109/IGSC48788.2019.8957193

We propose a sensor architecture for the internet of things (IoT), smartdust, or edge-intelligence (EI) that combines near-analog-memory (NAM) processing and asynchronous computing with streams (ACS), addressing the need for machine learning (ML) capabilities at low power budgets. In ACS an analog value is mapped to an asynchronous stream that can take one of two values (vh, vl). This stream-based data representation enables area- and power-efficient computing units, such as a multiplier implemented as an AND gate, yielding power savings of 90% compared with digital approaches. However, a major bottleneck for computing on streams, vision sensors, and NAM approaches is the cost of analog-to-digital (ADC) and digital-to-stream-to-digital converters. Our NAM-ACS architecture simplifies the sensor and eliminates the need for these expensive conversions. The architecture is tailored for random forest (Raf), an ML algorithm chosen for its ability to classify using a reduced number of features. Our simulations show that, using an analog-memory array of 256×512, the power consumption of the ACS core combined with the memory interface is comparable with that of an ADC-based memory interface, obtaining an accuracy of 83%.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0",Lawrence Berkeley National Laboratory Tech Report,September 2019,LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Specification, Revision 2019.9.0",Lawrence Berkeley National Laboratory Tech Report,September 14, 2019,LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
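The future/continuation chaining described above can be loosely illustrated with Python's `concurrent.futures`: a rough analogy only, not the UPC++ API, and `fetch` is a hypothetical stand-in for a one-sided get against remote memory:

```python
from concurrent.futures import ThreadPoolExecutor

STORE = {"a": 2, "b": 3}           # stands in for remote shared memory

def fetch(key):
    """Hypothetical stand-in for a one-sided get (high-latency)."""
    return STORE[key]

with ThreadPoolExecutor(max_workers=3) as pool:
    fa = pool.submit(fetch, "a")   # launch asynchronous "gets"
    fb = pool.submit(fetch, "b")
    # A continuation: its body runs to completion only once both
    # dependencies are satisfied, as in a DAG of chained operations.
    total = pool.submit(lambda: fa.result() + fb.result())
```

In UPC++ the analogous chaining is expressed with futures returned by communication calls and `.then()` continuations, letting work proceed as high-latency dependencies complete.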

### Patricia Gonzalez-Guerrero, Stephen G Wilson, Mircea R Stan,"Error-latency Trade-off for Asynchronous Stochastic Computing with ΣΔ Streams for the IoT",International System-on-Chip Conference (SOCC),Singapore,IEEE,September 3, 2019,doi: 10.1109/SOCC46988.2019.1570548453

Asynchronous stochastic computing (ASC) using continuous-time asynchronous ΣΔ modulators (SC-AΣΔM) has the potential to enable ultra-low-power, on-node machine learning algorithms for the next generation of sensors for the Internet of Things (IoT). As in synchronous stochastic computing (SSC), in SC-AΣΔM complex processing units can be implemented with simple gates because numbers are represented with streams. For example, a multiplier is implemented with an XNOR gate, yielding savings in power and area of 90% compared with the typical binary approach. Previous work demonstrated that SC-AΣΔM leverages SSC's advantages and addresses its drawbacks, achieving significant savings in energy, power, and latency. In this work, we study a theoretical model to determine the fundamental limits of accuracy and computing time for SC-AΣΔM. Since the ΣΔ streams are periodic, the final computing error is non-zero and depends on the period of the input streams. We validate our theoretical model with Spice-level simulations and evaluate the power and energy consumption using a standard FinFET1X technology for two cases: 1) multiplication and 2) gamma correction, an image processing algorithm. Our work determines circuit design guidelines for SC-AΣΔM and shows that multiplication with SC-AΣΔM requires at least 6X less time than SSC. The latency reduction and novel architecture positively impact the overall energy consumption in the IoT node, enabling energy savings of 79% compared with the binary approach.

### Patricia Gonzalez-Guerrero, Mircea R Stan,"Asynchronous Stream Computing for Low Power IoT",International Midwest Symposium on Circuits and Systems (MWSCAS),Dallas, TX, USA,IEEE,August 4, 2019,doi: 10.1109/MWSCAS.2019.8885388

Asynchronous circuits have many advantages over their synchronous counterparts in terms of robustness to parameter variations, wide supply-voltage ranges, and potentially low power from not needing a clock, yet their promise has not yet translated into commercial success due to several issues related to design methodologies and the need for handshake signals. Stochastic computing is another processing paradigm that has shown promise of low power and extremely compact circuits, but it has yet to become a commercial success, mainly because of the need for a fast clock to generate the random streams. The Asynchronous Stream Processing circuits described in this paper combine the best features of asynchronous circuits (lack of clock, robustness) with the best features of stochastic computing (processing on streams) to enable extremely compact, low-power IoT sensing nodes that can finally fulfill the promise of smart dust, another concept that was ahead of its time and has yet to achieve commercial success.

### J. Choi, A. Sim,Data reduction methods, systems and devices,U.S. Patent No. 10,366,078,2019,

U.S. Patent No. 10,366,078, “DATA REDUCTION METHODS, SYSTEMS, AND DEVICES”, LBNL IB2013-133.

### John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed,"UPC++: A High-Performance Communication Framework for Asynchronous Computation",33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19),Rio de Janeiro, Brazil,IEEE,May 2019,doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

### Francois P. Hamon, Martin Schreiber, Michael L. Minion,"Parallel-in-Time Multi-Level Integration of the Shallow-Water Equations on the Rotating Sphere",April 12, 2019,

Submitted to Journal of Computational Physics

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer's Guide, v1.0-2019.3.0",Lawrence Berkeley National Laboratory Tech Report,March 2019,LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Specification v1.0, Draft 10",Lawrence Berkeley National Laboratory Tech Report,March 15, 2019,LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Patricia Gonzalez-Guerrero, Xinfei Guo, Mircea R Stan,"ASC-FFT: Area-efficient low-latency FFT design based on asynchronous stochastic computing",10th Latin American Symposium on Circuits & Systems (LASCAS),Armenia, Colombia,IEEE,February 24, 2019,doi: 10.1109/LASCAS.2019.8667599

Asynchronous Stochastic Computing (ASC) is a new paradigm that addresses the drawbacks of Synchronous Stochastic Computing (SSC), namely expensive stochastic number generation (SNG) and long latency, by using continuous-time streams (CTS). To go beyond the basic operations of addition and multiplication in ASC we need to incorporate a memory element. Although for SSC the natural memory element is a clocked flip-flop, using the same approach with unsynchronized data leads to unacceptably large error. In this paper, we propose to use a capacitor embedded in a feedback loop as the ASC memory element. Based on this idea, we design a low-error asynchronous adder that stores the carry information in the capacitor. Our adder enables the implementation of more complex computation logic. As an example, we implement an asynchronous stochastic Fast Fourier Transform (ASC-FFT) using a FinFET1X technology. The proposed adder requires 76% and 24% less hardware cost than conventional and SSC adders, respectively. Moreover, the ASC-FFT shows 3X less latency than SSC-FFT approaches and significant improvements in latency and area over conventional FFT architectures, with no degradation of the computation accuracy as measured by the FFT Signal-to-Noise Ratio (SNR).


### Daniel F. Martin, Stephen L. Cornford, Antony J. Payne,"Millennial‐scale Vulnerability of the Antarctic Ice Sheet to Regional Ice Shelf Collapse",Geophysical Research Letters,January 9, 2019,doi: 10.1029/2018gl081229

Abstract:

The Antarctic Ice Sheet (AIS) remains the largest uncertainty in projections of future sea level rise. A likely climate‐driven vulnerability of the AIS is thinning of floating ice shelves resulting from surface‐melt‐driven hydrofracture or incursion of relatively warm water into subshelf ocean cavities. The resulting melting, weakening, and potential ice‐shelf collapse reduces shelf buttressing effects. Upstream ice flow accelerates, causing thinning, grounding‐line retreat, and potential ice sheet collapse. While high‐resolution projections have been performed for localized Antarctic regions, full‐continent simulations have typically been limited to low‐resolution models. Here we quantify the vulnerability of the entire present‐day AIS to regional ice‐shelf collapse on millennial timescales treating relevant ice flow dynamics at the necessary ∼1km resolution. Collapse of any of the ice shelves dynamically connected to the West Antarctic Ice Sheet (WAIS) is sufficient to trigger ice sheet collapse in marine‐grounded portions of the WAIS. Vulnerability elsewhere appears limited to localized responses.

Plain Language Summary:

The biggest uncertainty in near‐future sea level rise (SLR) comes from the Antarctic Ice Sheet. Antarctic ice flows in relatively fast‐moving ice streams. At the ocean, ice flows into enormous floating ice shelves which push back on their feeder ice streams, buttressing them and slowing their flow. Melting and loss of ice shelves due to climate changes can result in faster‐flowing, thinning and retreating ice leading to accelerated rates of global sea level rise. To learn where Antarctica is vulnerable to ice‐shelf loss, we divided it into 14 sectors, applied extreme melting to each sector's floating ice shelves in turn, then ran our ice flow model 1000 years into the future for each case. We found three levels of vulnerability. The greatest vulnerability came from attacking any of the three ice shelves connected to West Antarctica, where much of the ice sits on bedrock lying below sea level. Those dramatic responses contributed around 2m of sea level rise. The second level came from four other sectors, each with a contribution between 0.5‐1m. The remaining sectors produced little to no contribution. We examined combinations of sectors, determining that sectors behave independently of each other for at least a century.

### BA Brock, Y Chen, J Yan, J Owens, A Buluç, K Yelick,"RDMA vs. RPC for implementing distributed data structures",2019 IEEE/ACM 9th Workshop on Irregular Applications: Architectures and Algorithms, IA3 2019,January 1, 2019,17--22,doi: 10.1109/IA349570.2019.00009

Distributed data structures are key to implementing scalable applications for scientific simulations and data analysis. In this paper we look at two implementation styles for distributed data structures: remote direct memory access (RDMA) and remote procedure call (RPC). We focus on operations that require individual accesses to remote portions of a distributed data structure, e.g., accessing a hash table bucket or distributed queue, rather than global operations in which all processors collectively exchange information. We look at the trade-offs between the two styles through microbenchmarks and a performance model that approximates the cost of each. The RDMA operations have direct hardware support in the network and therefore lower latency and overhead, while the RPC operations are more expressive but higher cost and can suffer from lack of attentiveness from the remote side. We also run experiments to compare the real-world performance of RDMA- and RPC-based data structure operations with the predicted performance to evaluate the accuracy of our model, and show that while the model does not always precisely predict running time, it allows us to choose the best implementation in the examples shown. We believe this analysis will assist developers in designing data structures that will perform well on current network architectures, as well as network architects in providing better support for this class of distributed data structures.
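The trade-off the paper models can be sketched with a toy latency model. All constants and the functional form here are illustrative assumptions of ours, not the paper's measured parameters:

```python
ALPHA = 2e-6   # assumed one-way network latency, seconds
CPU   = 1e-6   # assumed remote CPU time to run an RPC handler, seconds

def rdma_cost(round_trips):
    """RDMA: executed in network hardware with low overhead, but a
    data-structure operation may need several dependent round trips
    (e.g. read a hash-table bucket pointer, then write the bucket)."""
    return round_trips * 2 * ALPHA

def rpc_cost(attentiveness_delay=0.0):
    """RPC: one round trip plus remote CPU time, plus any delay before
    the remote process polls the network (its "attentiveness")."""
    return 2 * ALPHA + CPU + attentiveness_delay
```

Under these assumed constants, a single-round-trip RDMA access beats an RPC, a two-round-trip dependent RDMA sequence loses to an attentive RPC, and an inattentive remote side (say, 1 ms before it polls) makes RPC far slower, mirroring the qualitative trade-offs the paper evaluates.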

### Beytullah Yildiz, Kesheng Wu, Suren Byna, Arie Shoshani,"Parallel membership queries on very large scientific data sets using bitmap indexes",Concurrency and Computation: Practice and Experience,January 1, 2019,31:e5157,

Many scientific applications produce very large amounts of data as advances in hardware fuel computing and experimental facilities. Managing and analyzing massive quantities of scientific data is challenging as data are often stored in specific formatted files, such as HDF5 and NetCDF, which do not offer appropriate search capabilities. In this research, we investigated a special class of search capability, called membership query, to identify whether queried elements of a set are members of an attribute. Attributes that naturally have classification values appear frequently in scientific domains such as category and object type as well as in daily life such as zip code and occupation. Because classification attribute values are discrete and require random data access, performing a membership query on a large scientific data set creates challenges. We applied bitmap indexing and parallelization to membership queries to overcome these challenges. Bitmap indexing provides high performance not only for low cardinality attributes but also for high cardinality attributes, such as floating‐point variables, electric charge, or momentum in a particle physics data set, due to compression algorithms such as Word‐Aligned Hybrid. We conducted experiments, in a highly parallelized environment, on data obtained from a particle accelerator model and a synthetic data set.
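The bitmap-index approach to membership queries can be sketched in a few lines, using Python integers as uncompressed bit vectors (this is our minimal illustration; production systems add compression schemes such as Word-Aligned Hybrid):

```python
def build_bitmap_index(values):
    """One bitmap per distinct attribute value; bit r of a bitmap is
    set when row r holds that value."""
    index = {}
    for row, v in enumerate(values):
        index[v] = index.get(v, 0) | (1 << row)
    return index

def membership_query(index, wanted):
    """Rows whose value is a member of `wanted`: OR together the
    bitmaps of the queried values, then enumerate the set bits."""
    mask = 0
    for v in wanted:
        mask |= index.get(v, 0)
    return [row for row in range(mask.bit_length()) if (mask >> row) & 1]
```

For a column `["ca", "ny", "ca", "tx"]`, querying membership in `{"ca", "tx"}` reduces to a bitwise OR of two bitmaps, which is what makes the query fast and easy to parallelize across partitions of the data.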

### Kesheng Wu, Alex Sim, Jonathan Wang, Seongwook Hwangbo,Methods, systems, and devices for accurate signal timing of power component events,2019,

US Patent app no. 20190138371, “Methods, systems, and devices for accurate signal timing of power component events”

### B Brock, A Buluç, K Yelick,"BCL: A cross-platform distributed data structures library",ACM International Conference Proceeding Series,January 2019,doi: 10.1145/3337821.3337912

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.
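The ObjectContainer idea, transparent serialization of complex types while primitives stay cheap, can be sketched in plain Python. This toy encoding is our own analogy, not BCL's C++ mechanism:

```python
import pickle
import struct

def serialize(value):
    """Fixed-width primitives are stored raw (here, a 1-byte tag plus
    4 bytes for a 32-bit int); anything else falls back to a general
    serializer (here, pickle)."""
    if type(value) is int and -2**31 <= value < 2**31:
        return b"i" + struct.pack("<i", value)
    return b"p" + pickle.dumps(value)

def deserialize(buf):
    """Dispatch on the tag byte to undo serialize()."""
    if buf[:1] == b"i":
        return struct.unpack("<i", buf[1:])[0]
    return pickle.loads(buf[1:])
```

A small int costs 5 bytes with no general-purpose serializer involved, while a nested dict round-trips through the fallback path; the point is that supporting complex types does not tax the common primitive case.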

### Paul H. Hargrove, Dan Bonachea,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network",Parallel Applications Workshop, Alternatives To MPI (PAW-ATM),Dallas, Texas, USA,IEEE,November 16, 2018,23-33,doi: 10.25344/S44S38

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as "specializations", primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are all available in GASNet-EX version 2018.3.0 and later.

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes",The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'18) Research Poster,November 2018,

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### Patricia Gonzalez-Guerrero, Xinfei Guo, Mircea Stan,"SC-SD: Towards low power stochastic computing using sigma delta streams",International Conference on Rebooting Computing (ICRC),McLean, VA, USA,IEEE,November 7, 2018,doi: 10.1109/ICRC.2018.8638611

Processing data using Stochastic Computing (SC) requires only ~7% of the area and power of the typical binary approach. However, SC has two major drawbacks that eclipse any area and power savings. First, it takes ~99% more time to finish a computation when compared with the binary approach, since data is represented as streams of bits. Second, the Linear Feedback Shift Registers (LFSRs) required to generate the stochastic streams add to the power and area of the overall SC-LFSR system. These drawbacks result in similar or higher area, power, and energy numbers when compared with the binary counterpart. In this work, we address these drawbacks by applying SC directly on Pulse Density Modulated (PDM) streams. Most modern Systems on Chip (SoCs) already include Analog-to-Digital Converters (ADCs). The core of ΣΔ-ADCs is the ΣΔ modulator, whose output is a PDM stream. Our approach (SC-SD) simplifies the system hardware in two ways: first, we drop the filter stage at the ADC; second, we replace the costly Stochastic Number Generators (SNGs) with ΣΔ modulators. To further lower the system complexity, we adopt an Asynchronous ΣΔ Modulator (AΣΔM) architecture. We design and simulate the AΣΔM using an industry-standard 1×FinFET technology with foundry models. (In modern technologies the node number does not refer to any one feature in the process, and foundries use slightly different conventions; we use 1x to denote the 14/16nm FinFET nodes offered by the foundry.) We achieve power savings of 81% in the SNG compared to the LFSR approach. To evaluate how these area and power savings scale to more complex applications, we implement gamma correction, a popular image processing algorithm. For this application, our simulations show that SC-SD can save 98%-11% in total system latency and 50%-38% in power consumption when compared with the SC-LFSR approach or the binary counterpart.

### Dan Bonachea, Paul H. Hargrove,"GASNet-EX: A High-Performance, Portable Communication Library for Exascale",Languages and Compilers for Parallel Computing (LCPC'18),Salt Lake City, Utah, USA,October 11, 2018,LBNL 2001174, doi: 10.25344/S4QP4W

Partitioned Global Address Space (PGAS) models, typified by such languages as Unified Parallel C (UPC) and Co-Array Fortran, expose one-sided communication as a key building block for High Performance Computing (HPC) applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in future exascale machines. The library is an evolution of the popular GASNet communication system, building upon over 15 years of lessons learned. We describe and evaluate several features and enhancements that have been introduced to address the needs of modern client systems. Microbenchmark results demonstrate the RMA performance of GASNet-EX is competitive with several MPI-3 implementations on current HPC systems.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer's Guide, v1.0-2018.9.0",Lawrence Berkeley National Laboratory Tech Report,September 2018,LBNL 2001180, doi: 10.25344/S49G6V

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.
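
The "explicit and asynchronous by default" remote-access model described above can be sketched in plain Python; `GlobalPtr` and `rget` here are hypothetical stand-ins (loosely modeled on UPC++'s global pointers and rget), not the actual UPC++ API:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-ins for illustration only -- not the UPC++ API.
class GlobalPtr:
    """A handle naming memory owned by some (simulated) remote process."""
    def __init__(self, owner_rank, value):
        self.owner_rank = owner_rank
        self._value = value

pool = ThreadPoolExecutor(max_workers=4)

def rget(gptr):
    """Explicit remote read: returns a future immediately, not the value."""
    def fetch():
        time.sleep(0.01)  # simulated network latency
        return gptr._value
    return pool.submit(fetch)

g = GlobalPtr(owner_rank=1, value=42)
fut = rget(g)                  # communication starts; caller keeps working
local_work = sum(range(100))   # overlap computation with communication
print(fut.result())            # block only when the value is actually needed
```

The point of the model: because every remote access is a visible call that returns a future, the cost of communication is explicit in the source, and overlap with local computation is the default rather than an optimization.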

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Specification v1.0, Draft 8",Lawrence Berkeley National Laboratory Tech Report,September 26, 2018,LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.
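
The future/continuation chaining described above (completion handling via callbacks, forming a DAG of asynchronous operations) can be mimicked with a tiny Python sketch; `then` here is illustrative, loosely inspired by UPC++'s continuation facility, not the actual API:

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def then(fut, fn):
    """Attach a continuation: run fn on fut's result once it is ready."""
    return pool.submit(lambda: fn(fut.result()))

# Chain a small pipeline: fetch -> scale -> offset, each stage running
# only once its high-latency predecessor has completed.
fetch = pool.submit(lambda: 10)        # stands in for a remote get
scaled = then(fetch, lambda x: x * 3)
final = then(scaled, lambda x: x + 1)
print(final.result())  # 31
```

Each `then` call records data dependence without blocking the caller, which is exactly what lets a runtime schedule the whole chain as latencies resolve.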

### Dan Bonachea, Paul Hargrove,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network (tech report version)",Lawrence Berkeley National Laboratory Tech Report,March 27, 2018,LBNL 2001134, doi: 10.2172/1430690

This document is a deliverable for milestone STPM17-6 of the Exascale Computing Project, delivered by WBS 2.3.1.14. It reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as “specializations”, primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are available in the GASNet-EX 2018.3.0 release.

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian Van Straalen,"UPC++ Specification v1.0, Draft 6",Lawrence Berkeley National Laboratory Tech Report,March 26, 2018,LBNL 2001135, doi: 10.2172/1430689

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian Van Straalen,"UPC++ Programmer’s Guide, v1.0-2018.3.0",Lawrence Berkeley National Laboratory Tech Report,March 2018,LBNL 2001136, doi: 10.2172/1430693

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Xinfei Guo, Vaibhav Verma, Patricia Gonzalez-Guerrero, Mircea R Stan,"When “things” get older: exploring circuit aging in IoT applications",International Symposium on Quality Electronic Design (ISQED),Santa Clara, CA, USA,IEEE,March 13, 2018,doi: 10.1109/ISQED.2018.8357304

The Internet of Things (IoT) brings a paradigm where humans and “things” are connected. The reliability of these devices becomes extremely critical. Circuit aging has become a limiting factor in technology scaling and a significant challenge in designing IoT systems for reliability-critical applications. As IoT becomes a general-purpose technology that is beginning to adopt advanced process nodes, it is necessary to understand how, and at what level, aging affects different categories of IoT applications. Since aging is highly dependent on operating conditions and switching activities, this paper classifies IoT applications based on aging-related metrics and studies aging using foundry-provided FinFET aging models. We show that for many IoT applications, aging will indeed add to the already tight design margin. As the expected chip lifetime of IoT devices becomes much longer and the failure-tolerance requirements of these applications become much stricter, we conclude that aging needs to be considered in the full design cycle and that IoT lifetime estimation needs to incorporate aging as an important factor. We also present application-specific solutions to mitigate circuit aging in IoT systems.

### Vincent Dumont,"Tests of Fundamental Physics Using Quasar Spectroscopy",PhD Thesis,2018,

Quasar absorption lines are used extensively in astrophysics to place constraints on cosmological models. In this thesis, we focus on the measurement of two important parameters in cosmology, the electromagnetic coupling constant, or fine-structure constant, α ≡ e²/(4πε₀ℏc), and the primordial deuterium-to-hydrogen ratio, D/H. Any cosmological variation of α would cause its value to differ in the early stages of the Universe, thereby impacting the production of the primordial light elements during Big Bang Nucleosynthesis. We provide updated Δα/α measurements from 280 absorption systems previously published in the literature, using the non-linear least-squares Voigt-profile fitting program VPFIT10. We also investigate the impact of long-range wavelength-scale distortions on those measurements. We find that long-range distortions are unlikely to explain the 4.1σ evidence for an α dipole reported in the literature, even though they do affect the α-dipole significance. The above work led us to examine the kinematics of each absorption system in our sample. We report a correlation between the complexity of the velocity structure of Damped Lyman-α systems and the apparent position of the Lyman limit break in quasar spectra. We develop a new technique, based on this correlation, to identify suitable Damped Lyman-α systems for D/H measurements.
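
As a quick numeric check of the definition above, evaluating α from the CODATA values of the constants recovers the familiar ≈ 1/137:

```python
import math

# CODATA 2018 values (SI units)
e = 1.602176634e-19      # elementary charge, C
eps0 = 8.8541878128e-12  # vacuum permittivity, F/m
hbar = 1.054571817e-34   # reduced Planck constant, J*s
c = 2.99792458e8         # speed of light, m/s

alpha = e**2 / (4 * math.pi * eps0 * hbar * c)
print(alpha, 1 / alpha)  # ~0.0072974, ~137.036
```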

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden,"The UPC++ PGAS library for Exascale Computing",Proceedings of the Second Annual PGAS Applications Workshop (PAW17),November 13, 2017,doi: 10.1145/3144779.3169108

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### N. Sanderson, E. Shugerman, S. Molnar, J. Meiss, E. Bradley,"Computational Topology Techniques for Characterizing Time-Series Data",Advances in Intelligent Data Analysis XVI 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings,October 2017,pp.284-296,doi: 10.1007/978-3-319-68765-0_24

Topological data analysis (TDA), while abstract, allows a characterization of time-series data obtained from nonlinear and complex dynamical systems. Though it is surprising that such an abstract measure of structure—counting pieces and holes—could be useful for real-world data, TDA lets us compare different systems, and even do membership testing or change-point detection. However, TDA is computationally expensive and involves a number of free parameters. This complexity can be obviated by coarse-graining, using a construct called the witness complex. The parametric dependence gives rise to the concept of persistent homology: how shape changes with scale. Its results allow us to distinguish time-series data from different systems—e.g., the same note played on different musical instruments.
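
The "counting pieces" idea at a given scale can be sketched with a union-find over pairwise distances, i.e., 0-dimensional homology of a Vietoris-Rips-style complex. This is a toy illustration of how component count changes with scale, not the coarse-grained witness-complex construction the paper uses:

```python
import math

def components_at_scale(points, eps):
    """Count connected pieces when points within distance eps are joined."""
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i, p in enumerate(points):
        for j, q in enumerate(points[:i]):
            if math.dist(p, q) <= eps:
                parent[find(i)] = find(j)  # union the two pieces
    return len({find(i) for i in range(len(points))})

# Two well-separated clusters: one piece per cluster at a small scale,
# a single piece once the scale exceeds the gap between them.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5)]
print(components_at_scale(pts, 0.2))  # 2
print(components_at_scale(pts, 8.0))  # 1
```

Tracking how this count changes as eps sweeps from small to large is precisely the 0-dimensional case of persistent homology: "how shape changes with scale."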

### Patricia Gonzalez-Guerrero, Mircea Stan,"Ultra-low-power dual-phase latch based digital accelerator for continuous monitoring of wheezing episodes",SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S),Burlingame, CA, USA,IEEE,October 16, 2017,doi: 10.1109/S3S.2017.8308752

We designed an ultra-low-power accelerator for the calculation of the Short Time Fourier Transform (STFT) optimized for wheezing detection. The low power consumption of our accelerator relies on optimizations at different stages of the design process. Post-layout simulations show that at the minimum energy point our accelerator consumes 3.3 pJ/cycle at 0.5 V and 163 kHz. We compare the energy consumption of our implementation with its flip-flop version. Simulations show that we can save up to 50% in energy consumption for a latch-based design vs. a flip-flop-based design, making dual-phase latch-based implementations excellent candidates for ultra-low-power devices.
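
For reference, the STFT the accelerator computes is a windowed DFT applied per frame of the audio signal; a minimal software sketch of one frame (plain Python with a rectangular window, not the accelerator's fixed-point datapath):

```python
import cmath
import math

def stft_frame_mag(frame):
    """Magnitude spectrum of one (rectangular-windowed) frame via a direct DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

# A 64-sample frame containing a pure tone at bin 5: the spectrum peaks
# at index 5 -- the kind of spectral feature a wheeze detector would
# track across successive frames.
n = 64
frame = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
mags = stft_frame_mag(frame)
print(mags.index(max(mags)))  # 5
```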

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer’s Guide, v1.0-2017.9",Lawrence Berkeley National Laboratory Tech Report,September 2017,LBNL 2001065, doi: 10.2172/1398522

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul H. Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Bryce Lelbach, Brian Van Straalen,"UPC++ Specification v1.0, Draft 4",Lawrence Berkeley National Laboratory Tech Report,September 27, 2017,LBNL 2001066, doi: 10.2172/1398521

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, which enable the programmer to express ownership information for improving locality; one-sided communication, both put/get and RPC; and futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Xinfei Guo, Vaibhav Verma, Patricia Gonzalez-Guerrero, Sergiu Mosanu, Mircea R Stan,"Back to the future: Digital circuit design in the finfet era",Journal of Low Power Electronics,September 1, 2017,doi: 10.1166/jolpe.2017.1489

It has been almost a decade since FinFET devices were introduced to full production; they allowed scaling below 20 nm, thus helping to extend Moore's law by a precious decade, with another decade likely as scaling continues to 5 nm and below. Due to superior electrical parameters and a unique structure, these 3-D transistors offer significant performance improvements and power reduction compared to planar CMOS devices. As we enter the sub-10 nm era, FinFETs have become dominant in most high-end products; as the transition from planar to FinFET technologies is still ongoing, it is important for digital circuit designers to understand the challenges and opportunities brought in by the new technology characteristics. In this paper, we study these aspects from the device to the circuit level, and we make detailed comparisons across multiple technology nodes ranging from conventional bulk, to advanced planar technology nodes such as Fully Depleted Silicon-on-Insulator (FDSOI), to FinFETs. In the simulations we used both state-of-the-art industry-standard models for current nodes and predictive models for future nodes. Our study shows that besides the performance and power benefits, FinFET devices show a significant reduction of short-channel effects and extremely low leakage, and many of their electrical characteristics are close to ideal, as in old long-channel technology nodes; FinFETs seem to have put scaling back on track! However, the combination of new device structures, double/multi-patterning, many more complex rules, and unique thermal/reliability behaviors is creating new technical challenges. Moving forward, FinFETs still offer a bright future and are an indispensable technology for a wide range of applications, from high-end performance-critical computing to energy-constrained mobile applications and smart Internet-of-Things (IoT) devices.

### Dan Bonachea, Paul Hargrove,"GASNet Specification, v1.8.1",Lawrence Berkeley National Laboratory Tech Report,August 31, 2017,LBNL 2001064, doi: 10.2172/1398512

GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, UPC++, Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".