# All Publications

### Paul H. Hargrove, Dan Bonachea,"Efficient Active Message RMA in GASNet Using a Target-Side Reassembly Protocol (Extended Abstract)",IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM),Lawrence Berkeley National Laboratory Technical Report,November 17, 2019,LBNL 2001238, doi: 10.25344/S4PC7M

GASNet is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper investigates strategies for efficient implementation of GASNet’s “AM Long” API that couples an RMA (Remote Memory Access) transfer with an Active Message (AM) delivery.
We discuss several network-level protocols for AM Long and propose a new target-side reassembly protocol. We present a microbenchmark evaluation on the Cray XC Aries network hardware. The target-side reassembly protocol on this network improves AM Long end-to-end latency by up to 33%, and the effective bandwidth by up to 49%, while also enabling asynchronous source completion that drastically reduces injection overheads.
The improved AM Long implementation for Aries is available in GASNet-EX release v2019.9.0 and later.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Specification, Revision 2019.9.0",Lawrence Berkeley National Laboratory Tech Report,September 14, 2019,LBNL 2001237, doi: 10.25344/S4ZW2C

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ v1.0 Programmer’s Guide, Revision 2019.9.0",Lawrence Berkeley National Laboratory Tech Report,September 14, 2019,LBNL 2001236, doi: 10.25344/S4V30R

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott B. Baden, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Dan Bonachea, Paul H. Hargrove, Hadia Ahmed,"UPC++: A High-Performance Communication Framework for Asynchronous Computation",33rd IEEE International Parallel & Distributed Processing Symposium (IPDPS'19),Rio de Janeiro, Brazil,IEEE,May 2019,doi: 10.25344/S4V88H

UPC++ is a C++ library that supports high-performance computation via an asynchronous communication framework. This paper describes a new incarnation that differs substantially from its predecessor, and we discuss the reasons for our design decisions. We present new design features, including future-based asynchrony management, distributed objects, and generalized Remote Procedure Call (RPC).
We show microbenchmark performance results demonstrating that one-sided Remote Memory Access (RMA) in UPC++ is competitive with MPI-3 RMA; on a Cray XC40 UPC++ delivers up to a 25% improvement in the latency of blocking RMA put, and up to a 33% bandwidth improvement in an RMA throughput test. We showcase the benefits of UPC++ with irregular applications through a pair of application motifs, a distributed hash table and a sparse solver component. Our distributed hash table in UPC++ delivers near-linear weak scaling up to 34816 cores of a Cray XC40. Our UPC++ implementation of the sparse solver component shows robust strong scaling up to 2048 cores, where it outperforms variants communicating using MPI by up to 3.1x.
UPC++ encourages the use of aggressive asynchrony in low-overhead RMA and RPC, improving programmer productivity and delivering high performance in irregular applications.

### Francois P. Hamon, Martin Schreiber, Michael L. Minion,"Parallel-in-Time Multi-Level Integration of the Shallow-Water Equations on the Rotating Sphere",April 12, 2019,

Submitted to Journal of Computational Physics

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer's Guide, v1.0-2019.3.0",Lawrence Berkeley National Laboratory Tech Report,March 15, 2019,LBNL 2001191, doi: 10.25344/S4F301

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Specification v1.0, Draft 10",Lawrence Berkeley National Laboratory Tech Report,March 15, 2019,LBNL 2001192, doi: 10.25344/S4JS30

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

In submission

### Daniel F. Martin, Stephen L. Cornford, Antony J. Payne,"Millennial‐scale Vulnerability of the Antarctic Ice Sheet to Regional Ice Shelf Collapse",Geophysical Research Letters,January 9, 2019,doi: 10.1029/2018gl081229

Abstract:

The Antarctic Ice Sheet (AIS) remains the largest uncertainty in projections of future sea level rise. A likely climate‐driven vulnerability of the AIS is thinning of floating ice shelves resulting from surface‐melt‐driven hydrofracture or incursion of relatively warm water into subshelf ocean cavities. The resulting melting, weakening, and potential ice‐shelf collapse reduces shelf buttressing effects. Upstream ice flow accelerates, causing thinning, grounding‐line retreat, and potential ice sheet collapse. While high‐resolution projections have been performed for localized Antarctic regions, full‐continent simulations have typically been limited to low‐resolution models. Here we quantify the vulnerability of the entire present‐day AIS to regional ice‐shelf collapse on millennial timescales treating relevant ice flow dynamics at the necessary ∼1km resolution. Collapse of any of the ice shelves dynamically connected to the West Antarctic Ice Sheet (WAIS) is sufficient to trigger ice sheet collapse in marine‐grounded portions of the WAIS. Vulnerability elsewhere appears limited to localized responses.

Plain Language Summary:

The biggest uncertainty in near‐future sea level rise (SLR) comes from the Antarctic Ice Sheet. Antarctic ice flows in relatively fast‐moving ice streams. At the ocean, ice flows into enormous floating ice shelves which push back on their feeder ice streams, buttressing them and slowing their flow. Melting and loss of ice shelves due to climate changes can result in faster‐flowing, thinning and retreating ice leading to accelerated rates of global sea level rise. To learn where Antarctica is vulnerable to ice‐shelf loss, we divided it into 14 sectors, applied extreme melting to each sector's floating ice shelves in turn, then ran our ice flow model 1000 years into the future for each case. We found three levels of vulnerability. The greatest vulnerability came from attacking any of the three ice shelves connected to West Antarctica, where much of the ice sits on bedrock lying below sea level. Those dramatic responses contributed around 2m of sea level rise. The second level came from four other sectors, each with a contribution between 0.5‐1m. The remaining sectors produced little to no contribution. We examined combinations of sectors, determining that sectors behave independently of each other for at least a century.

### Paul H. Hargrove, Dan Bonachea,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network",Parallel Applications Workshop, Alternatives To MPI (PAW-ATM),Dallas, Texas, USA,IEEE,November 16, 2018,23-33,doi: 10.1109/PAW-ATM.2018.00008

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models on future exascale machines. This paper reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as "specializations", primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are all available in GASNet-EX version 2018.3.0 and later.

### Scott B. Baden, Paul H. Hargrove, Hadia Ahmed, John Bachan, Dan Bonachea, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ and GASNet-EX: PGAS Support for Exascale Applications and Runtimes",The International Conference for High Performance Computing, Networking, Storage and Analysis (SC'18),November 13, 2018,

Lawrence Berkeley National Lab is developing a programming system to support HPC application development using the Partitioned Global Address Space (PGAS) model. This work is driven by the emerging need for adaptive, lightweight communication in irregular applications at exascale. We present an overview of UPC++ and GASNet-EX, including examples and performance results.

GASNet-EX is a portable, high-performance communication library, leveraging hardware support to efficiently implement Active Messages and Remote Memory Access (RMA). UPC++ provides higher-level abstractions appropriate for PGAS programming such as: one-sided communication (RMA), remote procedure call, locality-aware APIs for user-defined distributed objects, and robust support for asynchronous execution to hide latency. Both libraries have been redesigned relative to their predecessors to meet the needs of exascale computing. While both libraries continue to evolve, the system already demonstrates improvements in microbenchmarks and application proxies.

### Dan Bonachea, Paul H. Hargrove,"GASNet-EX: A High-Performance, Portable Communication Library for Exascale",Languages and Compilers for Parallel Computing (LCPC'18),Salt Lake City, Utah, USA,October 11, 2018,LBNL 2001174, doi: 10.25344/S4QP4W

Partitioned Global Address Space (PGAS) models, typified by such languages as Unified Parallel C (UPC) and Co-Array Fortran, expose one-sided communication as a key building block for High Performance Computing (HPC) applications. Architectural trends in supercomputing make such programming models increasingly attractive, and newer, more sophisticated models such as UPC++, Legion and Chapel that rely upon similar communication paradigms are gaining popularity.

GASNet-EX is a portable, open-source, high-performance communication library designed to efficiently support the networking requirements of PGAS runtime systems and other alternative models in future exascale machines. The library is an evolution of the popular GASNet communication system, building upon over 15 years of lessons learned. We describe and evaluate several features and enhancements that have been introduced to address the needs of modern client systems. Microbenchmark results demonstrate the RMA performance of GASNet-EX is competitive with several MPI-3 implementations on current HPC systems.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer's Guide, v1.0-2018.9.0",Lawrence Berkeley National Laboratory Tech Report,September 26, 2018,LBNL 2001180, doi: 10.25344/S49G6V

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate constituent process having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the processes. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Specification v1.0, Draft 8",Lawrence Berkeley National Laboratory Tech Report,September 26, 2018,LBNL 2001179, doi: 10.25344/S45P4X

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### J Bachan, S Baden, D Bonachea, PH Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B van Straalen,"UPC++ Programmer’s Guide, v1.0-2018.3.0",March 31, 2018,LBNL 2001136, doi: 10.2172/1430693

UPC++ is a C++11 library that provides Partitioned Global Address Space (PGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The PGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, PGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### Dan Bonachea, Paul Hargrove,"GASNet-EX Performance Improvements Due to Specialization for the Cray Aries Network",Lawrence Berkeley National Laboratory Tech Report,March 27, 2018,LBNL 2001134, doi: 10.2172/1430690

This document is a deliverable for milestone STPM17-6 of the Exascale Computing Project, delivered by WBS 2.3.1.14. It reports on the improvements in performance observed on Cray XC-series systems due to enhancements made to the GASNet-EX software. These enhancements, known as “specializations”, primarily consist of replacing network-independent implementations of several recently added features with implementations tailored to the Cray Aries network. Performance gains from specialization include (1) Negotiated-Payload Active Messages improve bandwidth of a ping-pong test by up to 14%, (2) Immediate Operations reduce running time of a synthetic benchmark by up to 93%, (3) non-bulk RMA Put bandwidth is increased by up to 32%, (4) Remote Atomic performance is 70% faster than the reference on a point-to-point test and allows a hot-spot test to scale robustly, and (5) non-contiguous RMA interfaces see up to 8.6x speedups for an intra-node benchmark and 26% for inter-node. These improvements are available in the GASNet-EX 2018.3.0 release.

### J Bachan, S Baden, D Bonachea, P Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B Lelbach, B van Straalen,"UPC++ Specification v1.0, Draft 6",March 26, 2018,LBNL 2001135, doi: 10.2172/1430689

UPC++ is a C++11 library providing classes and functions that support Partitioned Global Address Space (PGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### John Bachan, Dan Bonachea, Paul H Hargrove, Steve Hofmeyr, Mathias Jacquelin, Amir Kamil, Brian van Straalen, Scott B Baden,"The UPC++ PGAS library for Exascale Computing",Proceedings of the Second Annual PGAS Applications Workshop,November 13, 2017,7,

We describe UPC++ V1.0, a C++11 library that supports APGAS programming. UPC++ targets distributed data structures where communication is irregular or fine-grained. The key abstractions are global pointers, asynchronous programming via RPC, and futures. Global pointers incorporate ownership information useful in optimizing for locality. Futures capture data readiness state, are useful for scheduling and also enable the programmer to chain operations to execute asynchronously as high-latency dependencies become satisfied, via continuations. The interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and closely resemble those used in modern C++. Communication in UPC++ runs at close to hardware speeds by utilizing the low-overhead GASNet-EX communication library.

### N. Sanderson, E. Shugerman, S. Molnar, J. Meiss, E. Bradley,"Computational Topology Techniques for Characterizing Time-Series Data",Advances in Intelligent Data Analysis XVI 16th International Symposium, IDA 2017, London, UK, October 26–28, 2017, Proceedings,October 2017,pp.284-296,doi: 10.1007/978-3-319-68765-0_24

Topological data analysis (TDA), while abstract, allows a characterization of time-series data obtained from nonlinear and complex dynamical systems. Though it is surprising that such an abstract measure of structure—counting pieces and holes—could be useful for real-world data, TDA lets us compare different systems, and even do membership testing or change-point detection. However, TDA is computationally expensive and involves a number of free parameters. This complexity can be obviated by coarse-graining, using a construct called the witness complex. The parametric dependence gives rise to the concept of persistent homology: how shape changes with scale. Its results allow us to distinguish time-series data from different systems—e.g., the same note played on different musical instruments.

### John Bachan, Scott Baden, Dan Bonachea, Paul Hargrove, Steven Hofmeyr, Khaled Ibrahim, Mathias Jacquelin, Amir Kamil, Brian van Straalen,"UPC++ Programmer’s Guide, v1.0-2017.9",Lawrence Berkeley National Laboratory Tech Report,September 29, 2017,LBNL 2001065, doi: 10.2172/1398522

This document has been superseded by: UPC++ Programmer’s Guide, v1.0-2018.3.0 (LBNL-2001136)

UPC++ is a C++11 library that provides Asynchronous Partitioned Global Address Space (APGAS) programming. It is designed for writing parallel programs that run efficiently and scale well on distributed-memory parallel computers. The APGAS model is single program, multiple-data (SPMD), with each separate thread of execution (referred to as a rank, a term borrowed from MPI) having access to local memory as it would in C++. However, APGAS also provides access to a global address space, which is allocated in shared segments that are distributed over the ranks. UPC++ provides numerous methods for accessing and using global memory. In UPC++, all operations that access remote memory are explicit, which encourages programmers to be aware of the cost of communication and data movement. Moreover, all remote-memory access operations are by default asynchronous, to enable programmers to write code that scales well even on hundreds of thousands of cores.

### J Bachan, S Baden, D Bonachea, P Hargrove, S Hofmeyr, K Ibrahim, M Jacquelin, A Kamil, B Lelbach, B van Straalen,"UPC++ Specification v1.0, Draft 4",September 27, 2017,LBNL 2001066, doi: 10.2172/1398521

This document has been superseded by: UPC++ Specification v1.0, Draft 6 (LBNL-2001135)

UPC++ is a C++11 library providing classes and functions that support Asynchronous Partitioned Global Address Space (APGAS) programming. We are revising the library under the auspices of the DOE’s Exascale Computing Project, to meet the needs of applications requiring PGAS support. UPC++ is intended for implementing elaborate distributed data structures where communication is irregular or fine-grained. The UPC++ interfaces for moving non-contiguous data and handling memories with different optimal access methods are composable and similar to those used in conventional C++. The UPC++ programmer can expect communication to run at close to hardware speeds. The key facilities in UPC++ are global pointers, that enable the programmer to express ownership information for improving locality, one-sided communication, both put/get and RPC, futures and continuations. Futures capture data readiness state, which is useful in making scheduling decisions, and continuations provide for completion handling via callbacks. Together, these enable the programmer to chain together a DAG of operations to execute asynchronously as high-latency dependencies become satisfied.

### Dan Bonachea, Paul Hargrove,"GASNet Specification, v1.8.1",Lawrence Berkeley National Laboratory Tech Report,August 31, 2017,LBNL 2001064, doi: 10.2172/1398512

GASNet is a language-independent, low-level networking layer that provides network-independent, high-performance communication primitives tailored for implementing parallel global address space SPMD languages and libraries such as UPC, UPC++, Co-Array Fortran, Legion, Chapel, and many others. The interface is primarily intended as a compilation target and for use by runtime library writers (as opposed to end users), and the primary goals are high performance, interface portability, and expressiveness. GASNet stands for "Global-Address Space Networking".

### Dongeun Lee, Alex Sim, Jaesik Choi, Kesheng Wu,"Improving Statistical Similarity Based Data Reduction for Non-Stationary Data",29th International Conference on Scientific and Statistical Database Management (SSDBM2017),2017,doi: 10.1145/3085504.3085583

Updated experiment version: https://sdm.lbl.gov/oapapers/ssdbm17-lee-upd.pdf
Original version: http://dl.acm.org/citation.cfm?doid=3085504.3085583

### Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli,"Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs",Open Source Supercomputing Workshop,Frankfurt, Germany,Springer’s Lecture Notes in Computer Science (LNCS),June 22, 2017,

ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. However, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider different symmetric/asymmetric architecture configurations in terms of core frequency and core count between big and LITTLE clusters. Experiments are conducted on both a real Samsung Exynos 5 Octa system-on-chip and the gem5/McPAT simulation frameworks. Results show that OmpSs implementations are more sensitive to loop scheduling parameters than OpenMP 3.0. In most cases, best OmpSs configurations significantly outperform OpenMP ones. While cluster frequency asymmetry provides uninteresting results, asymmetric cluster configuration with single high-performance core and multiple low-power cores provides better performance/energy trade-offs in many cases.

### Dilip Vasudevan, Anastasiia Butko, George Michelogiannakis, David Donofrio, John Shalf,"Towards an Integrated Strategy to Preserve Digital Computing Performance Scaling Using Emerging Technologies",Workshop on HPC computing in a Post Moore’s law world (HCPM),June 22, 2017,

With the decline and eventual end of historical rates of lithographic scaling, we arrive at a crossroad where synergistic and holistic decisions are required to preserve Moore's law technology scaling. Numerous emerging technologies aim to extend digital electronics scaling of performance, energy efficiency, and computational power/density, ranging from devices (transistors), memories, 3D integration capabilities, specialized architectures, photonics, and others.
The wide range of technology options creates the need for an integrated strategy to understand the impact of these emerging technologies on future large-scale digital systems for diverse application requirements and optimization metrics.
In this paper, we argue for a comprehensive methodology that spans the different levels of abstraction -- from materials, to devices, to complex digital systems and applications. Our approach integrates compact models of low-level characteristics of the emerging technologies to inform higher-level simulation models to evaluate their responsiveness to application requirements.
The integrated framework can then automate the search for an optimal architecture using available emerging technologies to maximize a targeted optimization metric.

### SM Martin, MJ Berger, SB Baden,"Toucan-A Translator for Communication Tolerant MPI Applications",Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017,June 2017,998-1007,doi: 10.1109/IPDPS.2017.44

We discuss early results with Toucan, a source-to-source translator that automatically restructures C/C++ MPI applications to overlap communication with computation. We co-designed the translator and runtime system to enable dynamic, dependence-driven execution of MPI applications, and require only a modest amount of programmer annotation. Co-design was essential to realizing overlap through dynamic code block reordering and avoiding the limitations of static code relocation and inlining. We demonstrate that Toucan hides significant communication in four representative applications running on up to 24K cores of NERSC's Edison platform. Using Toucan, we have hidden from 33% to 85% of the communication overhead, with performance meeting or exceeding that of painstakingly hand-written overlap variants.

### Dai Wang, Junyu Gao, Pan Li, Bin Wang, Cong Zhang, Samveg Saxena,"Modeling of plug-in electric vehicle travel patterns and charging load based on trip chain generation",Journal of Power Sources,May 13, 2017,359:468 - 479,doi: 10.1016/j.jpowsour.2017.05.036

Modeling PEV travel and charging behavior is the key to estimate the charging demand and further explore the potential of providing grid services. This paper presents a stochastic simulation methodology to generate itineraries and charging load profiles for a population of PEVs based on real-world vehicle driving data. In order to describe the sequence of daily travel activities, we use the trip chain model which contains the detailed information of each trip, namely start time, end time, trip distance, start location and end location. A trip chain generation method is developed based on the Naive Bayes model to generate a large number of trips which are temporally and spatially coupled. We apply the proposed methodology to investigate the multi-location charging loads in three different scenarios. Simulation results show that home charging can meet the energy demand of the majority of PEVs in an average condition. In addition, we calculate the lower bound of charging load peak on the premise of lowest charging cost. The results are instructive for the design and construction of charging facilities to avoid excessive infrastructure.

### Yingqi Xiong, Bin Wang, Chi-cheng Chu, Rajit Gadh,"Distributed Optimal Vehicle Grid Integration Strategy with User Behavior Prediction",IEEE PES General Meeting 2017,March 13, 2017,

With increasing electric vehicle (EV) adoption in recent years, the impact of EV charging activities on the power grid has become more and more significant. In this article, an optimal scheduling algorithm that combines smart EV charging and V2G grid service is developed to integrate EVs into the power grid as distributed energy resources, with improved system cost performance. Specifically, an optimization problem is formulated and solved at each EV charging station according to the control signal from the aggregated control center and a user charging behavior prediction obtained by mean estimation and linear regression. The control center periodically collects the distributed optimization results and updates the control signal. The iteration continues until it converges to the optimal schedule. Experimental results show that this algorithm helps fill the valley and shave the peak in electric load profiles within a microgrid, while the energy demand of individual drivers is satisfied.

### Yubo Wang, Wenbo Shi, Bin Wang, Chi-Cheng Chu, Rajit Gadh,"Optimal operation of stationary and mobile batteries in distribution grids",Applied Energy,January 28, 2017,190:1289 - 130,doi: 10.1016/j.apenergy.2016.12.139

The growing integration of Battery Energy Storage Systems (BESS, stationary battery) and Electric Vehicles (EV, mobile battery) into distribution grids calls for advanced Demand Side Management (DSM) techniques that address the scalability concerns of the system and the stochastic availability of EVs. Towards this goal, a stochastic DSM is proposed to capture the uncertainties in EVs. Numerical approximation is then used to make the problem tractable. To accelerate the computation, the proposed DSM is tightly relaxed to a convex form using second-order cone programming. Furthermore, in light of the continuously increasing problem size, a distributed method with guaranteed convergence is applied to shift the centralized computational burden to distributed controllers. To verify the proposed DSM, real-life EV data collected on the UCLA campus is used to test it in an IEEE benchmark test system. Numerical results demonstrate the correctness and merits of the proposed approach.

### Yingqi Xiong, Bin Wang, Zhiyuan Cao, Chi-cheng Chu, Hemanshu Pota, Rajit Gadh,"Extension of IEC61850 with smart EV charging",Innovative Smart Grid Technologies - Asia (ISGT-Asia), 2016 IEEE,Melbourne, VIC, Australia,IEEE,December 26, 2016,294 - 299,doi: 10.1109/ISGT-Asia.2016.7796401

Un-coordinated Electric Vehicle (EV) charging behaviors will impact the power grid and degrade power quality. Smart charging system can optimally allocate the energy among EVs and minimize the impact on grid. IEC 61850 is an international standard for distribution & substation automation and intelligence device communication in smart grid. This paper introduces the extension of IEC 61850 with smart EV scheduling algorithms, considering current multiplexing, power sharing strategies and user behaviors. We have developed an IEC 61850 abstract data model for the information exchanged among components of smart EV charging infrastructures, including the EV charging control center, intelligent Electric Vehicle Supply Equipment (EVSE) and the EV users with mobile applications. Real-world EV usage data on UCLA campus is utilized in the simulation experiment, which is based on the predictive control paradigm. The data model is instantiated by a web service on EV control center by converting the raw data streams from EVSEs in various formats, such as JSON or raw string, etc. into a standardized IEC 61850 SCL file, which contains the critical meter data and charging session parameters. The system cost performance, the interoperability and scalability of smart EV charging infrastructure are greatly improved.

### E. Vecharynski, A. Knyazev,"Preconditioned steepest descent-like methods for symmetric indefinite systems",Linear Algebra and its Applications, Vol. 511, pp. 274–295,2016,

We construct preconditioned steepest descent (PSD)-like methods for iterative solution of symmetric indefinite linear systems using symmetric and positive definite (SPD) preconditioners. Our construction is based on a locally optimal residual minimization over two-dimensional subspaces, mathematically equivalent in exact arithmetic to preconditioned MINRES (PMINRES) restarted after every two steps. A convergence bound is derived. If certain information on the spectrum of the preconditioned system is available, we present a simpler PSD-like algorithm that performs only one-dimensional residual minimization. Search direction randomization for accelerating this algorithm is discussed. Our primary goal is to bridge the theoretical gap between the optimal (PMINRES) and PSD-like methods for solving symmetric indefinite systems. We also demonstrate situations where the suggested PSD-like schemes can be preferable to the optimal PMINRES iteration.

### S.V. Venkatakrishnan, Jeffrey Donatelli, Dinesh Kumar, Abhinav Sarje, Sunil K. Sinha, Xiaoye S. Li, Alexander Hexemer,"A Multi-slice Simulation Algorithm for Grazing-Incidence Small-Angle X-ray Scattering",Journal of Applied Crystallography,December 2016,49-6, doi: 10.1107/S1600576716013273

Grazing-incidence small-angle X-ray scattering (GISAXS) is an important technique in the characterization of samples at the nanometre scale. A key aspect of GISAXS data analysis is the accurate simulation of samples to match the measurement. The distorted-wave Born approximation (DWBA) is a widely used model for the simulation of GISAXS patterns. For certain classes of sample such as nanostructures embedded in thin films, where the electric field intensity variation is significant relative to the size of the structures, a multi-slice DWBA theory is more accurate than the conventional DWBA method. However, simulating complex structures in the multi-slice setting is challenging and the algorithms typically used are designed on a case-by-case basis depending on the structure to be simulated. In this paper, an accurate algorithm for GISAXS simulations based on the multi-slice DWBA theory is presented. In particular, fundamental properties of the Fourier transform have been utilized to develop an algorithm that accurately computes the average refractive index profile as a function of depth and the Fourier transform of the portion of the sample within a given slice, which are key quantities required for the multi-slice DWBA simulation. The results from this method are compared with the traditionally used approximations, demonstrating that the proposed algorithm can produce more accurate results. Furthermore, this algorithm is general with respect to the sample structure, and does not require any sample-specific approximations to perform the simulations.

Best Paper Award

### Bin Wang, Yubo Wang, Hamidreza Nazaripouya, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Predictive Scheduling Framework for Electric Vehicles with Uncertainties of User Behaviors",IEEE Internet of Things Journal,October 13, 2016,4:52 - 63,doi: 10.1109/JIOT.2016.2617314

The randomness of user behaviors plays a significant role in electric vehicle (EV) scheduling problems, especially when the power supply for EV supply equipment (EVSE) is limited. Existing EV scheduling methods do not consider this limitation and assume charging session parameters, such as stay duration and energy demand values, are perfectly known, which is not realistic in practice. In this paper, based on real-world implementations of networked EVSEs on University of California at Los Angeles campus, we developed a predictive scheduling framework, including a predictive control paradigm and a kernel-based session parameter estimator. Specifically, the scheduling service periodically computes for cost-efficient solutions, considering the predicted session parameters, by the adaptive kernel-based estimator with improved estimation accuracies. We also consider the power sharing strategy of existing EVSEs and formulate the virtual load constraint to handle the future EV arrivals with unexpected energy demand. To validate the proposed framework, 20-fold cross validation is performed on the historical dataset of charging behaviors for over one-year period. The simulation results demonstrate that average unit energy cost per kWh can be reduced by 29.42% with the proposed scheduling framework and 66.71% by further integrating solar generations with the given capacity, after the initial infrastructure investment. The effectiveness of kernel-based estimator, virtual load constraint, and event-based control scheme are also discussed in detail.

### Anastasiia Butko, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, David Novo, Lionel Torres, Michel Robert,"Full-System Simulation of big.LITTLE Multicore Architecture for Performance and Energy Exploration",Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2016 IEEE 10th International Symposium on,Lyon, France,IEEE,September 21, 2016,doi: 10.1109/MCSoC.2016.20

Single-ISA heterogeneous multicore processors have gained increasing popularity with the introduction of recent technologies such as ARM big.LITTLE. These processors offer increased energy efficiency through combining low power in-order cores with high performance out-of-order cores. Efficiently exploiting this attractive feature requires careful management so as to meet the demands of targeted applications. In this paper, we explore the design of those architectures based on the ARM big.LITTLE technology by modeling performance and power in gem5 and McPAT frameworks. Our models are validated w.r.t. the Samsung Exynos 5 Octa (5422) chip. We show average errors of 20% in execution time, 13% for power consumption and 24% for energy-to-solution.

### R. Li, Y. Xi, E. Vecharynski, C. Yang, and Y. Saad,"A Thick-Restart Lanczos algorithm with polynomial filtering for Hermitian eigenvalue problems",SIAM Journal on Scientific Computing, Vol. 38, Issue 4, pp. A2512–A2534,2016,doi: 10.1137/15M1054493

Polynomial filtering can provide a highly effective means of computing all eigenvalues of a real symmetric (or complex Hermitian) matrix that are located in a given interval, anywhere in the spectrum. This paper describes a technique for tackling this problem by combining a Thick-Restart version of the Lanczos algorithm with deflation ('locking') and a new type of polynomial filters obtained from a least-squares technique. The resulting algorithm can be utilized in a 'spectrum-slicing' approach whereby a very large number of eigenvalues and associated eigenvectors of the matrix are computed by extracting eigenpairs located in different sub-intervals independently from one another.

### Rafael Garibotti, Anastasiia Butko, Luciano Ost, Abdoulaye Gamatié, Gilles Sassatelli, Chris Adeniyi-Jones,"Efficient Embedded Software Migration towards Clusterized Distributed-Memory Architectures",IEEE Transactions on Computers,August 1, 2016,doi: 10.1109/TC.2015.2485202

A large portion of existing multithreaded embedded software has been programmed according to symmetric shared memory platforms where a monolithic memory block is shared by all cores. Such platforms accommodate popular parallel programming models such as POSIX threads and OpenMP. However with the growing number of cores in modern manycore embedded architectures, they present a bottleneck related to their centralized memory accesses. This paper proposes a solution tailored for an efficient execution of applications defined with shared-memory programming models onto on-chip distributed-memory multicore architectures. It shows how performance, area and energy consumption are significantly improved thanks to the scalability of these architectures. This is illustrated in an open-source realistic design framework, including tools from ASIC to microkernel.

### Bin Wang, Rui Huang, Yubo Wang, Hamidreza Nazaripouya, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Predictive Scheduling for Electric Vehicles Considering Uncertainty of Load and User Behaviors",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7520018

Un-coordinated Electric Vehicle (EV) charging can create unexpected load in the local distribution grid, which may degrade power quality and system reliability. The uncertainty of EV load, user behaviors and other base load in the distribution grid is one of the challenges that impede optimal control of EV charging. Previous research did not fully solve this problem due to the lack of real-world EV charging data and of a proper stochastic model to describe these behaviors. In this paper, we propose a new predictive EV scheduling algorithm (PESA) inspired by Model Predictive Control (MPC), which includes a dynamic load estimation module and a predictive optimization module. The user-related EV load and base load are dynamically estimated based on the historical data. At each time interval, the predictive optimization program is solved for optimal schedules given the estimated parameters. Only the first element of the algorithm output is implemented, according to the MPC paradigm. The current-multiplexing function in each Electric Vehicle Supply Equipment (EVSE) is considered, and accordingly a virtual load is modeled to handle the uncertainties of future EV energy demands. This system is validated with real-world EV charging data collected on the UCLA campus, and the experimental results indicate that our proposed model not only reduces load variation by up to 40% but also maintains a high level of robustness. Finally, the IEC 61850 standard is utilized to standardize the data models involved, which supports more reliable and large-scale implementation.
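
The receding-horizon idea in this abstract (re-estimate, re-plan, implement only the first element of the plan) can be illustrated with a minimal Python sketch. This is not the PESA implementation: `plan_charging` is a hypothetical stand-in that spreads the remaining energy evenly instead of solving an optimization problem, and the one-hour time step, the function names, and the `demand_updates` argument are assumptions made for illustration only.

```python
def plan_charging(remaining_energy, steps_left, max_power):
    """Hypothetical stand-in for the optimizer: spread energy evenly (kWh per 1 h step)."""
    per_step = min(max_power, remaining_energy / steps_left)
    return [per_step] * steps_left


def receding_horizon(energy_demand, horizon, max_power, demand_updates=None):
    """Re-plan at every step and implement only the first element of each plan."""
    demand_updates = demand_updates or {}
    remaining = energy_demand
    applied = []
    for step in range(horizon):
        # Revised estimate of the energy still needed (assumed input).
        remaining += demand_updates.get(step, 0.0)
        plan = plan_charging(remaining, horizon - step, max_power)
        action = plan[0]  # MPC paradigm: the rest of the plan is discarded
        applied.append(action)
        remaining -= action
    return applied


if __name__ == "__main__":
    print(receding_horizon(12.0, 4, 5.0))
    print(receding_horizon(12.0, 4, 5.0, {2: 2.0}))
```

Because the plan is recomputed at every step, a demand revision that arrives mid-session (the second call) is absorbed by the remaining steps rather than invalidating a schedule fixed in advance.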

### H. Nazaripouya, B. Wang, Y. Wang, P. Chu, H. R. Pota, R. Gadh,"Univariate Time Series Prediction of Solar Power Using a Hybrid Wavelet-ARMA-NARX Prediction Method",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7519959

This paper proposes a new hybrid method for super short-term solar power prediction. Solar output power usually has a complex, nonstationary, and nonlinear characteristic due to the intermittent and time-varying behavior of solar radiance. In addition, solar power dynamics are fast and inertialess. An accurate super short-term prediction is required to compensate for the fluctuations and reduce the impact of solar power penetration on the power system. The objective is to predict one-step-ahead solar power generation based only on historical solar power time series data. The proposed method incorporates discrete wavelet transform (DWT), Auto-Regressive Moving Average (ARMA) models, and Recurrent Neural Networks (RNN), where the RNN architecture is based on Nonlinear Auto-Regressive models with eXogenous inputs (NARX). The wavelet transform is utilized to decompose the solar power time series into a set of richer-behaved forming series for prediction. The ARMA model is employed as a linear predictor, while NARX is used as a nonlinear pattern recognition tool to estimate and compensate for the error of the wavelet-ARMA prediction. The proposed method is applied to data captured from UCLA solar PV panels, and the results are compared with some of the common and most recent solar power prediction methods. The results validate the effectiveness of the proposed approach and show a considerable improvement in prediction precision.

### Yubo Wang, Bin Wang, Tianyang Zhang, Hamidreza Nazaripouya, Chi-Cheng Chu, Rajit Gadh,"Optimal Energy Management for Microgrid with Stationary and Mobile Storages",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7520004

This paper studies energy management in a Microgrid (MG) with solar generation, Battery Energy Management System (BESS) and gridable (V2G) Electric Vehicles (EVs). A two-stage stochastic optimization method is proposed to capture the intermittent solar generation and random EV user behaviors. It is subsequently formulated as a Mixed Integer Linear Programming (MILP) problem. To evaluate the proposed method, real solar generation, loads, BESS and EV data is used in Sample Average Approximation (SAA). Computational results show the correctness of the proposed method as well as steady and tightly bounded optimality gap. Comparisons demonstrate that the proposed stochastic method outperforms its deterministic counterpart at the expense of higher computational cost. It is also observed that moderate number of EVs helps to reduce the overall operational cost of the MG, which sheds light on future EV integration to the smart grid.

### Yubo Wang, Bin Wang, Rui Huang, Chi-Cheng Chu, Hemanshu R. Pota, Rajit Gadh,"Two-Tier Prediction of Solar Power Generation with Limited Sensing Resource",2016 IEEE/PES Transmission and Distribution Conference and Exposition (T&D),Dallas, TX, USA,IEEE,July 25, 2016,1 - 5,doi: 10.1109/TDC.2016.7519968

This paper considers a typical solar installations scenario with limited sensing resources. In the literature, there exist either day-ahead solar generation prediction methods with limited accuracy, or high accuracy short timescale methods that are not suitable for applications requiring longer term prediction. We propose a two-tier (global-tier and local-tier) prediction method to improve accuracy for long term (24 hour) solar generation prediction using only the historical power data. In global-tier, we examine two popular heuristic methods: weighted k-Nearest Neighbors (k-NN) and Neural Network (NN). In local-tier, the global-tier results are adaptively updated using real-time analytical residual analysis. The proposed method is validated using the UCLA Microgrid with 35kW of solar generation capacity. Experimental results show that the proposed two-tier prediction method achieves higher accuracy compared to day-ahead predictions while providing the same prediction length. The difference in the overall prediction performance using either weighted k-NN based or NN based in the global-tier are carefully discussed and reasoned. Case studies with a typical sunny day and a cloudy day are carried out to demonstrate the effectiveness of the proposed two-tier predictions.
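
As a rough illustration of the weighted k-NN idea mentioned for the global tier, the sketch below forecasts a next-day profile from the most similar past days. The abstract does not specify the features, distance, or weights, so the Euclidean distance, the inverse-distance weighting, and the name `weighted_knn_forecast` are assumptions of this sketch, not the authors' method.

```python
import math


def weighted_knn_forecast(history, query, k=3):
    """Inverse-distance-weighted mean of the next-day profiles that
    followed the k past days most similar to `query`.

    history: list of (day_profile, next_day_profile) pairs of equal-length lists.
    """
    scored = []
    for day, next_day in history:
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(day, query)))
        scored.append((dist, next_day))
    scored.sort(key=lambda item: item[0])  # closest days first
    nearest = scored[:k]

    # A small constant avoids division by zero for an exact match.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    total = sum(weights)
    horizon = len(nearest[0][1])
    return [
        sum(w * profile[t] for w, (_, profile) in zip(weights, nearest)) / total
        for t in range(horizon)
    ]


if __name__ == "__main__":
    past = [([0, 1, 0], [0, 2, 0]), ([0, 5, 0], [0, 6, 0]), ([0, 9, 0], [0, 9, 0])]
    print(weighted_knn_forecast(past, [0, 1, 0], k=2))
```

In the two-tier scheme described in the abstract, a forecast of this kind would then be adjusted by the local tier as real-time measurements arrive; that correction step is not shown here.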

### Nils E. R. Zimmermann, Maciej Haranczyk,"History and Utility of Zeolite Framework-Type Discovery from a Data-Science Perspective",Crystal Growth & Design,May 2, 2016,16:3043-3048,

Mature applications such as fluid catalytic cracking and hydrocracking rely critically on early zeolite structures. With a data-driven approach, we find that the discovery of exceptional zeolite framework types around the new millennium was spurred by exciting new utilization routes. The promising processes have yet not been successfully implemented (“valley of death” effect), mainly because of the lack of thermal stability of the crystals. This foreshadows limited deployability of recent zeolite discoveries that were achieved by novel crystal synthesis routes.

### George Michelogiannakis, John Shalf, David Donofrio, John Bachan,"Continuing the Scaling of Digital Computing Post Moore’s Law",LBNL report,April 2016,LBNL 1005126,

Traditional CMOS technology scaling, which up until now has followed Moore's law, is coming to an end in the next decade. However, the DOE has come to depend on the rapid, predictable, and cheap scaling of computing performance to meet mission needs for scientific theory, large scale experiments, and national security. Moving forward, performance scaling of digital computing will need to originate from energy and cost reductions that are a result of novel architectures, devices, manufacturing technologies, and programming models. The deeper issue presented by these changes is the threat to DOE’s mission and to the future economic growth of the U.S. computing industry and to society as a whole. With the impending end of Moore’s law, it is imperative for the Office of Advanced Scientific Computing Research (ASCR) to develop a balanced research agenda to assess the viability of novel semiconductor technologies and navigate the ensuing challenges. This report identifies four areas and research directions for ASCR and how each can be used to preserve performance scaling of digital computing beyond exascale and after Moore's law ends.

### J. R. Jones, F.-H. Rouet, K. V. Lawler, E. Vecharynski, K. Z. Ibrahim, S. Williams, B. Abeln, C. Yang, C. W. McCurdy, D. J. Haxton, X. S. Li, T. N. Rescigno,"An efficient basis set representation for calculating electrons in molecules",Journal of Molecular Physics,2016,doi: 10.1080/00268976.2016.1176262

The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.

### Bin Wang, Yubo Wang, Charlie Qiu, Chi-Cheng Chu, Rajit Gadh,"Event-based Electric Vehicle Scheduling Considering Random User Behaviors",2015 IEEE International Conference on Smart Grid Communications (SmartGridComm),Miami, FL, USA,IEEE,March 21, 2016,313-318,doi: 10.1109/SmartGridComm.2015.7436319

Uncontrolled Electric Vehicle (EV) and Plug-in Hybrid Electric Vehicle (PHEV) charging within a local distribution grid may cause unexpected high load, which further results in power quality degradation. However, coordinating the charging behaviors of a number of EVs is a challenging task, which involves not only deterministic schedule computation but also nondeterministic EV driver behaviors with random arrival times and energy demands. Previous research in this area rarely considers these random behaviors of real EV users. In this paper, an implementable event-based cost optimal scheduling algorithm (ECSA) is developed, which solves the EV scheduling problem by dynamically estimating the stay duration and energy demand for each participating EV user. Datasets, including users' historical charging records and time series meter data collected from Electric Vehicle Supply Equipment (EVSE) on the UCLA campus, are utilized for feature extraction. Based on that, a proper inference technique is employed to determine parameters within each charging session. In addition, solar generation integration into EVSEs is also considered in our problem formulation. The proposed approaches are tested and validated with real EV charging schedules of users on the UCLA campus. The results from the simulation experiment demonstrate that the proposed algorithm performs better in cost minimization and load shifting than the existing equal-sharing scheduling algorithm (ESSA).

### E. Vecharynski, C. Yang, and F. Xue,"Generalized preconditioned locally harmonic residual method for non-Hermitian eigenproblems",SIAM Journal on Scientific Computing, Vol. 38, No. 1, pp. A500–A527,2016,doi: 10.1137/15M1027413

We introduce the Generalized Preconditioned Locally Harmonic Residual (GPLHR) method for solving standard and generalized non-Hermitian eigenproblems. The method is particularly useful for computing a subset of eigenvalues, and their eigen- or Schur vectors, closest to a given shift. The proposed method is based on block iterations and can take advantage of a preconditioner if it is available. It does not need to perform exact shift-and-invert transformation. Standard and generalized eigenproblems are handled in a unified framework. Our numerical experiments demonstrate that GPLHR is generally more robust and efficient than existing methods, especially if the available memory is limited.

### E. Vecharynski and C. Yang,"Preconditioned iterative methods for eigenvalue counts",to appear in Proceedings of International Workshop on Eigenvalue Problems: Algorithms, Software and Applications in Petascale Computing, in Lecture Notes in Computational Science and Engineering, Springer,2016,

We describe preconditioned iterative methods for estimating the number of eigenvalues of a Hermitian matrix within a given interval. Such estimation is useful in a number of applications. In particular, it can be used to develop an efficient spectrum-slicing strategy to compute many eigenpairs of a Hermitian matrix. Our method is based on the Lanczos- and Arnoldi-type of iterations. We show that with a properly defined preconditioner, only a few iterations may be needed to obtain a good estimate of the number of eigenvalues within a prescribed interval. We also demonstrate that the number of iterations required by the proposed preconditioned schemes is independent of the size and condition number of the matrix. The efficiency of the methods is illustrated on several problems arising from density functional theory based electronic structure calculations.

### Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, Michel Robert,"Position Paper: OpenMP scheduling on ARM big.LITTLE architecture",9th Int’l Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG),Prague, Czech Republic,January 18, 2016,

Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, proper utilization of these architectures is essential to reach the energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies when executing the Rodinia benchmark on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.

### E. Vecharynski,"A generalization of Saad's bound on harmonic Ritz vectors of Hermitian matrices",Linear Algebra and its Applications, Vol. 494, pp. 219-235,2016,doi: 10.1016/j.laa.2016.01.013

We prove a Saad-type bound for harmonic Ritz vectors of a Hermitian matrix. The new bound reveals a dependence of the harmonic Rayleigh-Ritz procedure on the condition number of a shifted problem operator. Several practical implications are discussed. In particular, the bound motivates incorporation of preconditioning into the harmonic Rayleigh-Ritz scheme.

### Yubo Wang, Bin Wang, Chi-Cheng Chu, Hemanshu Pota, Rajit Gadh,"Energy management for a commercial building microgrid with stationary and mobile battery storage",Energy and Buildings,December 30, 2015,116:141 - 150,doi: 10.1016/j.enbuild.2015.12.055

This paper investigates Demand Side Management (DSM) in a commercial building microgrid with solar generation, a stationary Battery Energy Storage System (BESS) and gridable (V2G) Electric Vehicle (EV) integration. Taking a comprehensive pricing model into consideration, we first formulate a deterministic DSM as a mixed integer linear programming problem, assuming perfect knowledge of the uncertainties in the system. A two-stage stochastic DSM is further developed that addresses the stochastic nature of solar generation, loads, EV availabilities and EV energy demands. The proposed DSMs are validated with real solar generation, loads, BESS and EV data using sample average approximation. Detailed case studies show that the stochastic DSM outperforms its deterministic counterpart in cost savings for a wide range of prices, though at the expense of higher computational time. Computational results also demonstrate that a moderate number of EVs helps to cut down the overall operation cost, which sheds light on the benefit of future large-scale EV integration into smart buildings.
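
For orientation, a generic two-stage stochastic program and its sample average approximation (SAA) can be written as follows. This is the textbook form under assumed notation, not the specific model from the paper: $x$ stands for first-stage decisions, $\xi$ for the uncertain quantities (here solar generation, loads, EV availabilities and EV energy demands), $Q(x,\xi)$ for the second-stage recourse cost, and $\xi_1,\dots,\xi_N$ for sampled scenarios.

$$
\min_{x \in X} \; c^\top x + \mathbb{E}_{\xi}\bigl[\,Q(x,\xi)\,\bigr]
\;\;\approx\;\;
\min_{x \in X} \; c^\top x + \frac{1}{N}\sum_{i=1}^{N} Q(x,\xi_i)
$$

The deterministic formulation corresponds to replacing the expectation with the recourse cost of a single, perfectly known scenario.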

### A. Roy, A. Klinefelter, F. B. Yahya, X. Chen, L. P. Gonzalez-Guerrero, C. J. Lukas, D. A. Kamakshi, J. Boley, K. Craig, M. Faisal, S. Oh, N. E. Roberts, Y. Shakhsheer, A. Shrivastava, D. P. Vasudevan, D. D. Wentzloff, B. H. Calhoun,"A 6.45 μW Self-Powered SoC With Integrated Energy-Harvesting Power Management and ULP Asymmetric Radios for Portable Biomedical Systems",IEEE Transactions on Biomedical Circuits and Systems,December 28, 2015,9:862 - 874,doi: 10.1109/TBCAS.2015.2498643

This paper presents a batteryless system-on-chip (SoC) that operates off energy harvested from indoor solar cells and/or thermoelectric generators (TEGs) on the body. Fabricated in a commercial 0.13 μm process, this SoC sensing platform consists of an integrated energy harvesting and power management unit (EH-PMU) with maximum power point tracking, multiple sensing modalities, programmable core and a low power microcontroller with several hardware accelerators to enable energy-efficient digital signal processing, ultra-low-power (ULP) asymmetric radios for wireless transmission, and a 100 nW wake-up radio. The EH-PMU achieves a peak end-to-end efficiency of 75% delivering power to a 100 μA load. In an example motion detection application, the SoC reads data from an accelerometer through SPI, processes it, and sends it over the radio. The SPI and digital processing consume only 2.27 μW, while the integrated radio consumes 4.18 μW when transmitting at 187.5 kbps for a total of 6.45 μW.

### D. B. Szyld, E. Vecharynski, and F. Xue,"Preconditioned eigensolvers for large-scale nonlinear Hermitian eigenproblems with variational characterizations. II. Interior eigenvalues.",SIAM Journal on Scientific Computing, Vol. 37, Issue 6, pp. A2969-A2997,2015,

We consider the solution of large-scale nonlinear algebraic Hermitian eigenproblems of the form $T(\lambda)v=0$ that admit a variational characterization of eigenvalues. These problems arise in a variety of applications and are generalizations of linear Hermitian eigenproblems $Av\!=\!\lambda Bv$. In this paper, we propose a Preconditioned Locally Minimal Residual (PLMR) method for efficiently computing interior eigenvalues of problems of this type. We discuss the development of search subspaces, preconditioning, and eigenpair extraction procedure based on the refined Rayleigh-Ritz projection. Extension to the block methods is presented, and a moving-window style soft deflation is described. Numerical experiments demonstrate that PLMR methods provide a rapid and robust convergence towards interior eigenvalues. The approach is also shown to be efficient and reliable for computing a large number of extreme eigenvalues, dramatically outperforming standard preconditioned conjugate gradient methods.

### Andrew A. Chien, Tung Thanh-Hoang, Dilip Vasudevan, Yuanwei Fang, Amirali Shambayati,"10x10: A Case Study in Highly-Programmable and Energy-Efficient Heterogeneous Federated Architecture",SIGARCH Comput. Archit. News,December 2015,43:2 - 9,doi: 10.1145/2856113.2856115

Customized architecture is widely recognized as an important approach for improved performance and energy efficiency. To balance generality and customization benefit, researchers have proposed to federate heterogeneous micro-engines. Using the 10x10 architecture and an integrated image and vision benchmark as a case study, we explore the performance and energy benefits achievable. Results for current 32nm technology and DDR3 memory show 10x10 architecture benefits of 140x performance and 72x energy overall. Adding 3D-stacked DRAM increases benefits to 171x (performance) and 100x (energy). Finally, considering a future 7nm transistor process, benefits as large as 597x (performance) and 137x (energy) are observed.

### Abhinav Sarje, Xiaoye S Li, Slim Chourou, Dinesh Kumar, Singanallur Venkatakrishnan, Alexander Hexemer,"Inverse Modeling Nanostructures from X-Ray Scattering Data through Massive Parallelism",Supercomputing (SC'15),November 2015,

We consider the problem of reconstructing material nanostructures from grazing-incidence small-angle X-ray scattering (GISAXS) data obtained through experiments at synchrotron light-sources. This is an important tool for characterization of macromolecules and nano-particle systems, with applications such as the design of energy-relevant nano-devices. Computational analysis of experimentally collected scattering data has been the primary bottleneck in this process.
We exploit the availability of massive parallelism in leadership-class supercomputers with multi-core and graphics processors to realize the compute-intensive reconstruction process. To develop a solution, we employ various optimization algorithms including gradient-based LMVM, derivative-free trust region-based POUNDerS, and particle swarm optimization, and apply these in a massively parallel fashion.
We compare their performance in terms of both quality of solution and computational speed. We demonstrate the effective utilization of up to 8,000 GPU nodes of the Titan supercomputer for inverse modeling of organic-photovoltaics (OPVs) in less than 15 minutes.

### E. Vecharynski, J. Brabec, M. Shao, N. Govind, C. Yang,"Efficient Block Preconditioned Eigensolvers for Linear Response Time-dependent Density Functional Theory",submitted to JCC,2015,

We present two efficient iterative algorithms for solving the linear response eigenvalue problem arising from the time dependent density functional theory. Although the matrix to be diagonalized is nonsymmetric, it has a special structure that can be exploited to save both memory and floating point operations. In particular, the nonsymmetric eigenvalue problem can be transformed into a product eigenvalue problem that is self-adjoint with respect to a K-inner product. This product eigenvalue problem can be solved efficiently by a modified Davidson algorithm and a modified locally optimal block preconditioned conjugate gradient (LOBPCG) algorithm that make use of the K-inner product. The solution of the product eigenvalue problem yields one component of the eigenvector associated with the original eigenvalue problem. However, the other component of the eigenvector can be easily recovered in a postprocessing procedure. Therefore, the algorithms we present here are more efficient than existing algorithms that try to approximate both components of the eigenvectors simultaneously. The efficiency of the new algorithms is demonstrated by numerical examples.

### E. Vecharynski, A. Knyazev,"Preconditioned Locally Harmonic Residual Method for computing interior eigenpairs of certain classes of Hermitian matrices",SIAM Journal on Scientific Computing, Vol. 37, Issue 5, pp. S3–S29,2015,

We propose a Preconditioned Locally Harmonic Residual (PLHR) method for computing several interior eigenpairs of a generalized Hermitian eigenvalue problem, without traditional spectral transformations, matrix factorizations, or inversions. PLHR is based on a short-term recurrence, easily extended to a block form, computing eigenpairs simultaneously. PLHR can take advantage of Hermitian positive definite preconditioning, e.g., based on an approximate inverse of an absolute value of a shifted matrix, introduced in [SISC, 35 (2013), pp. A696–A718]. Our numerical experiments demonstrate that PLHR is efficient and robust for certain classes of large-scale interior eigenvalue problems, involving Laplacian and Hamiltonian operators, especially if memory requirements are tight.

### Jason Adams, Monica Lieng, Brooks Kuhn, Edward Guo, Edik Simonian, Sean Peisert, JP Delplanque, Nick Anderson,"Automated Mechanical Ventilator Waveform Analysis of Patient-Ventilator Asynchrony",CHEST Journal,Pages: 175A,October 2015,doi: 10.1378/chest.2281731

PURPOSE: Mechanical ventilation is a life-saving intervention but is associated with adverse effects including ventilator-induced lung injury (VILI). Patient-ventilator asynchrony (PVA) is thought to contribute to VILI, but the study of PVA has been hampered by limited access to the high frequency, large volume data streams produced by modern ventilators and a lack of robust analytics. To address these limitations, we developed an automated pipeline for breath-by-breath analysis of ventilator waveform data.

METHODS: Simulated pressure and flow time series data representing normal breaths and common forms of PVA were generated on PB840 ventilators, collected unobtrusively using small, customized wireless peripheral devices, and transmitted to a networked server for storage and analysis. Two critical care physicians reviewed all waveforms to generate gold standards. Rule-based algorithms were developed to quantify inspiratory and expiratory tidal volumes (TV) and identify PVA subtypes including double trigger and delayed termination asynchrony. Data were split randomly into derivation and validation sets. Algorithm performance was compared to ventilator reported values and clinician annotation.

RESULTS: The mean difference between algorithm-determined and ventilator-reported TVs was 3.1% (99% CI ± 1.36%). Algorithm agreement with clinician annotation was excellent for double trigger PVA and moderate for delayed termination PVA, with Kappa statistics of 0.85 and 0.58, respectively. In the validation data set (n = 492 breaths), double trigger asynchrony was detected with an overall accuracy of 94.1%, sensitivity of 100%, and specificity of 92.8%.

CONCLUSIONS: A pipeline combining wireless ventilator data acquisition and rule-based analytic algorithms informed by the principles of bedside ventilator waveform analysis allows for automated, quantitative breath-by-breath analysis of patient-ventilator interactions.

CLINICAL IMPLICATIONS: We have recently deployed this system in the medical intensive care unit of the UC Davis Medical Center, which will enable further development of mechanical ventilation analytics. We have begun to explore the use of supervised machine learning and dynamic time series modeling to improve the classification of other common types of PVA and of clinical phenotypes associated with respiratory failure. This system will help to better define the epidemiology and clinical impact of PVA and other forms of off-target mechanical ventilation, and may lead to improved decision support and patient outcomes.
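
The accuracy, sensitivity, specificity, and Kappa figures in RESULTS are standard confusion-matrix statistics. A minimal sketch of how such figures are computed, assuming binary labels (PVA present or absent per breath). The function names are illustrative and the counts are hypothetical; neither is taken from the study.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    Assumes at least one positive and one negative reference label.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa for two raters (e.g. algorithm vs. clinician) on a binary label."""
    total = tp + fp + tn + fn
    observed = (tp + tn) / total
    # Chance agreement: both raters say "positive" or both say "negative",
    # computed from each rater's marginal rates.
    p_pos = ((tp + fp) / total) * ((tp + fn) / total)
    p_neg = ((tn + fn) / total) * ((tn + fp) / total)
    expected = p_pos + p_neg
    return (observed - expected) / (1 - expected)


# Hypothetical counts for illustration only (NOT the study's data).
counts = dict(tp=45, fp=5, tn=45, fn=5)
acc, sens, spec = confusion_metrics(**counts)
kappa = cohen_kappa(**counts)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f} kappa={kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be noticeably lower than accuracy when one class dominates.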

### Tobias Titze, Alexander Lauerer, Lars Heinke, Christian Chmelik, Nils E. R. Zimmermann, Frerich J. Keil, Douglas M. Ruthven, Jörg Kärger,"Transport in Nanoporous Materials Including MOFs: The Applicability of Fick’s Laws",Angew. Chem. Int. Ed.,2015,doi: 10.1002/anie.201506954

Diffusion in nanoporous host–guest systems is often considered to be too complicated to comply with such “simple” relationships as Fick’s first and second law of diffusion. However, it is shown herein that the microscopic techniques of diffusion measurement, notably the pulsed field gradient (PFG) technique of NMR spectroscopy and microimaging by interference microscopy (IFM) and IR microscopy (IRM), provide direct experimental evidence of the applicability of Fick’s laws to such systems. This remains true in many situations, even when the detailed mechanism is complex. The limitations of the diffusion model are also discussed with reference to the extensive literature on this subject.

### Burke, Daniel R.,"Neuromorphic Hardware for HPC",Conference,October 7, 2015,

Presented at Simons Institute Theory of Neural Computation Workshop.

### Nils E. R. Zimmermann, Bart Vorselaars, David Quigley, Baron Peters,"Nucleation of NaCl from Aqueous Solution: Critical Sizes, Ion-Attachment Kinetics, and Rates",J. Am. Chem. Soc.,2015,doi: 10.1021/jacs.5b08098

Nucleation and crystal growth are important in material synthesis, climate modeling, biomineralization, and pharmaceutical formulation. Despite tremendous efforts, the mechanisms and kinetics of nucleation remain elusive to both theory and experiment. Here we investigate sodium chloride (NaCl) nucleation from supersaturated brines using seeded atomistic simulations, polymorph-specific order parameters, and elements of classical nucleation theory. We find that NaCl nucleates via the common rock salt structure. Ion desolvation—not diffusion—is identified as the limiting resistance to attachment. Two different analyses give approximately consistent attachment kinetics: diffusion along the nucleus size coordinate and reaction-diffusion analysis of approach-to-coexistence simulation data from Aragones et al. (J. Chem. Phys. 2012, 136, 244508). Our simulations were performed at realistic supersaturations to enable the first direct comparison to experimental nucleation rates for this system. The computed and measured rates converge to a common upper limit at extremely high supersaturation. However, our rate predictions are between 15 and 30 orders of magnitude too fast. We comment on possible origins of the large discrepancy.

### Nathan Hanford, Vishal Ahuja, Mehmet Balman, Matthew Farrens, Dipak Ghosal, Eric Pouyoul, Brian Tierney,"Improving Network Performance on Multicore Systems: Impact of Core Affinities on High Throughput Flows",The International Journal of eScience, Elsevier,2015,doi: 10.1016/j.future.2015.09.012

Network throughput is scaling up to higher data rates while end-system processors are scaling out to multiple cores. In order to optimize high speed data transfer into multicore end-systems, techniques such as network adaptor offloads and performance tuning have received a great deal of attention. Furthermore, several methods of multi-threading the network receive process have been proposed. However, thus far attention has been focused on how to set the tuning parameters and which offloads to select for higher performance, and little has been done to understand why the various parameter settings do (or do not) work. In this paper, we build on previous research to track down the sources of the end-system bottleneck for high-speed TCP flows. We define protocol processing efficiency to be the amount of system resources (such as CPU and cache) used per unit of achieved throughput (in Gbps). The amounts of the various system resources consumed are measured using low-level system event counters. In a multicore end-system, affinitization, or core binding, is the decision regarding how the various tasks of the network receive process, including interrupt, network, and application processing, are assigned to the different processor cores. We conclude that affinitization has a significant impact on protocol processing efficiency, and that the performance bottleneck of the network receive process changes significantly with different affinitization.
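
The efficiency metric defined above is a simple ratio, sketched below. The function name, configuration labels, and all numbers are hypothetical and serve only to illustrate the definition; they are not measurements from the paper.

```python
def protocol_processing_efficiency(resource_used, throughput_gbps):
    """Resource consumed per Gbps of achieved throughput (lower is better)."""
    if throughput_gbps <= 0:
        raise ValueError("throughput must be positive")
    return resource_used / throughput_gbps


# Hypothetical event-counter readings for two core-binding configurations.
runs = {
    "config_a": {"cpu_cycles": 9.0e9, "gbps": 20.0},
    "config_b": {"cpu_cycles": 6.0e9, "gbps": 30.0},
}

cycles_per_gbps = {
    name: protocol_processing_efficiency(r["cpu_cycles"], r["gbps"])
    for name, r in runs.items()
}
most_efficient = min(cycles_per_gbps, key=cycles_per_gbps.get)
print(cycles_per_gbps, most_efficient)
```

Normalizing by throughput lets configurations that reach different data rates be compared on the cost of each delivered Gbps instead of on raw counter totals.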

### Anastasiia Butko, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, Michel Robert,"Design exploration for next generation high-performance manycore on-chip systems: Application to big.LITTLE architectures",VLSI (ISVLSI), 2015 IEEE Computer Society Annual Symposium on,Montpellier, France,IEEE,July 8, 2015,doi: 10.1109/ISVLSI.2015.28

Next generation embedded systems will massively adopt on-chip manycore architectures to provide both performance and energy-efficiency. This trend will definitely establish the convergence of embedded computing and high-performance computing. In such a context, one major design challenge will concern the choice of adequate architecture parameters given system requirements. Moreover, it will affect the way applications can suitably exploit architecture resources for an efficient execution. This paper deals with manycore on-chip system design exploration via simulation. It presents an approach enabling one to study central design parameters in an accurate and cost-effective manner. This approach is illustrated through the design exploration for ARM big.LITTLE heterogeneous multicore technology in the gem5 framework.

### Štěpán Timr, Jiří Brabec, Alexey Bondar, Tomáš Ryba, Miloš Železný, Josef Lazar, Pavel Jungwirth,"Non-Linear Optical Properties of Fluorescent Dyes Allow for Accurate Determination of Their Molecular Orientations in Phospholipid Membranes",The Journal of Physical Chemistry,July 6, 2015,

Several methods based on single- and two-photon fluorescence detected linear dichroism have recently been used to determine the orientational distributions of fluorescent dyes in lipid membranes. However, these determinations relied on simplified descriptions of non-linear anisotropic properties of the dye molecules, using a transition dipole moment-like vector instead of an absorptivity tensor. To investigate the validity of the vector approximation, we have now carried out a combination of computer simulations and polarization microscopy experiments on two representative fluorescent dyes (DiI and F2N12S) embedded in aqueous phosphatidylcholine bilayers. Our results indicate that a simplified vector-like treatment of the two-photon transition tensor is applicable for molecular geometries sampled in the membrane at ambient conditions. Furthermore, our results allow evaluation of several distinct polarization microscopy techniques. In combination, our results point to a robust and accurate experimental and computational treatment of orientational distributions of DiI, F2N12S and related dyes (including Cy3, Cy5, and others), with implications to monitoring physiologically relevant processes in cellular membranes in a novel way.

### Bin Wang, Boyang Hu, Charlie Qiu, Peter Chu, Rajit Gadh,"EV charging algorithm implementation with user price preference",2015 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT),Washington, DC, USA,IEEE,June 25, 2015,1 - 5,doi: 10.1109/ISGT.2015.7131895

In this paper, we propose and implement a smart Electric Vehicle (EV) charging algorithm to control the EV charging infrastructures according to users' price preferences. Charging boxes, equipped with bi-directional communication devices and smart meters, can be remotely monitored by the proposed charging algorithm applied to the EV control center and mobile app. On the server side, an ARIMA model is utilized to fit historical charging load data and perform day-ahead prediction. A pricing strategy with energy bidding policy is proposed and implemented to generate a charging price list to be broadcast to EV users through the mobile app. On the user side, EV drivers can submit their price preferences and daily travel schedules to negotiate with the Control Center to consume the expected energy and minimize charging cost simultaneously. The proposed algorithm is tested and validated through experimental implementations in UCLA parking lots.

### Abhinav Sarje,Computing Nanostructures at Scale,OLCF User Meeting,June 2015,

The inverse modeling, or structural fitting, problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. X-ray scattering based extraction of structural information from material samples is an important tool for nanostructure prediction through characterization of macromolecules and nanoparticle systems, applicable to numerous applications such as design of energy-relevant nano-devices. At Berkeley Lab, we are developing high-performance solutions for analysis of such raw data. In our work we exploit the use of massive parallelism available in clusters of GPUs, such as the Titan supercomputer, to gain efficiency in the reconstruction process. We explore the application of various numerical optimization algorithms ranging from simple gradient-based quasi-Newton methods, derivative-free trust-region-based methods, to the stochastic algorithms of Particle Swarm Optimization in a massively parallel fashion. https://vimeo.com/133558018

### Abhinav Sarje, Sukhyun Song, Douglas Jacobsen, Kevin Huck, Jeffrey Hollingsworth, Allen Malony, Samuel Williams, and Leonid Oliker,"Parallel Performance Optimizations on Unstructured Mesh-Based Simulations",Procedia Computer Science,1877-0509,June 2015,51:2016-2025,doi: 10.1016/j.procs.2015.05.466

This paper addresses two key parallelization challenges of the unstructured mesh-based ocean modeling code, MPAS-Ocean, which uses a mesh based on Voronoi tessellations: (1) load imbalance across processes, and (2) unstructured data access patterns that inhibit intra- and inter-node performance. Our work analyzes the load imbalance due to naive partitioning of the mesh, and develops methods to generate mesh partitioning with better load balance and reduced communication. Furthermore, we present methods that minimize both inter- and intra-node data movement and maximize data reuse. Our techniques include predictive ordering of data elements for higher cache efficiency, as well as communication reduction approaches. We present detailed performance data when running on thousands of cores using the Cray XC30 supercomputer and show that our optimization strategies can exceed the original performance by over 2×. Additionally, many of these solutions can be broadly applied to a wide variety of unstructured grid-based computations.

### Abhinav Sarje,Recovering Structural Information about Nanoparticle Systems,Nvidia GPU Technology Conference,March 19, 2015,

The inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at light-source synchrotrons is an ideal example of a Big Data and Big Compute application. This session will give an introduction and overview to this problem and its solutions as being developed at the Berkeley Lab. X-ray scattering based extraction of structural information from material samples is an important tool applicable to numerous applications such as design of energy-relevant nano-devices. We exploit the use of parallelism available in clusters of GPUs to gain efficiency in the reconstruction process. To develop a solution, we apply Particle Swarm Optimization (PSO) in a massively parallel fashion, and develop high-performance codes and analyze the performance.

### Abhinav Sarje, Xiaoye S. Li, Dinesh Kumar, Alexander Hexemer,"Recovering Nanostructures from X-Ray Scattering Data",Nvidia GPU Technology Conference (GTC),March 2015,

We consider the inverse modeling problem of recovering nanostructures from X-ray scattering data obtained through experiments at synchrotrons. This has been a primary bottleneck problem in such data analysis. X-ray scattering based extraction of structural information from material samples is an important tool for the characterization of macromolecules and nano-particle systems applicable to numerous applications such as design of energy-relevant nano-devices. We exploit massive parallelism available in clusters of graphics processors to gain efficiency in the reconstruction process. To solve this numerical optimization problem, here we show the application of the stochastic algorithms of Particle Swarm Optimization (PSO) in a massively parallel fashion. We develop high-performance codes for various flavors of the PSO class of algorithms and analyze their performance with respect to the application at hand. We also briefly show the use of two other optimization methods as solutions.
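
For readers unfamiliar with PSO, a minimal serial sketch of one common variant (global-best PSO with an inertia weight) follows. It is not the authors' GPU implementation: the function names, parameter values, and the `sphere` test objective are assumptions chosen for illustration. In the setting described above, the objective would instead measure the mismatch between simulated and measured scattering data.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 inertia=0.7, c_cognitive=1.5, c_social=1.5, seed=0):
    """Global-best particle swarm minimization over a box [lo, hi]^dim."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]               # each particle's best position
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]   # swarm-wide best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own best,
                # and pull toward the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + c_social * r2 * (g_pos[d] - pos[i][d]))
                # Move and clamp to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val < g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val


def sphere(x):
    """Stand-in objective with its minimum (0) at the origin."""
    return sum(v * v for v in x)


solution, value = pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))
print(solution, value)
```

Each particle's objective evaluation is independent of the others within an iteration, which is the property that makes the method amenable to the massively parallel execution the abstract describes.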

### E. Vecharynski, C. Yang, J. E. Pask,"A projected preconditioned conjugate gradient algorithm for computing many extreme eigenpairs of a Hermitian matrix",Journal of Computational Physics, Vol. 290, pp. 73–89,2015,

We present an iterative algorithm for computing an invariant subspace associated with the algebraically smallest eigenvalues of a large sparse or structured Hermitian matrix A. We are interested in the case in which the dimension of the invariant subspace is large (e.g., over several hundreds or thousands) even though it may still be small relative to the dimension of A. These problems arise from, for example, density functional theory (DFT) based electronic structure calculations for complex materials. The key feature of our algorithm is that it performs fewer Rayleigh–Ritz calculations compared to existing algorithms such as the locally optimal block preconditioned conjugate gradient or the Davidson algorithm. It is a block algorithm, and hence can take advantage of efficient BLAS3 operations and be implemented with multiple levels of concurrency. We discuss a number of practical issues that must be addressed in order to implement the algorithm efficiently on a high performance computer.

### Wei Hu, Lin Lin and Chao Yang,"Edge reconstruction in armchair phosphorene nanoribbons revealed by discontinuous Galerkin density functional theory",Phys. Chem. Chem. Phys., 2015, Advance Article,February 11, 2015,doi: 10.1039/C5CP00333D

With the help of our recently developed massively parallel DGDFT (Discontinuous Galerkin Density Functional Theory) methodology, we perform large-scale Kohn–Sham density functional theory calculations on phosphorene nanoribbons with armchair edges (ACPNRs) containing a few thousand to ten thousand atoms. The use of DGDFT allows us to systematically achieve a conventional plane wave basis set type of accuracy, but with a much smaller number (about 15) of adaptive local basis (ALB) functions per atom for this system. The relatively small number of degrees of freedom required to represent the Kohn–Sham Hamiltonian, together with the use of the pole expansion and selected inversion (PEXSI) technique that circumvents the need to diagonalize the Hamiltonian, results in a highly efficient and scalable computational scheme for analyzing the electronic structures of ACPNRs as well as their dynamics. The total wall clock time for calculating the electronic structures of large-scale ACPNRs containing 1080–10 800 atoms is only 10–25 s per self-consistent field (SCF) iteration, with accuracy fully comparable to that obtained from conventional planewave DFT calculations. For the ACPNR system, we observe that the DGDFT methodology can scale to 5000–50 000 processors. We use DGDFT based ab initio molecular dynamics (AIMD) calculations to study the thermodynamic stability of ACPNRs. Our calculations reveal that a 2 × 1 edge reconstruction appears in ACPNRs at room temperature.

### Thorsten Kurth, Andrew Pochinsky, Abhinav Sarje, Sergey Syritsyn, Andre Walker-Loud,"High-Performance I/O: HDF5 for Lattice QCD",arXiv:1501.06992,January 2015,

Practitioners of lattice QCD/QFT have been some of the primary pioneer users of the state-of-the-art high-performance-computing systems, and contribute towards the stress tests of such new machines as soon as they become available. As with all aspects of high-performance-computing, I/O is becoming an increasingly specialized component of these systems. In order to take advantage of the latest available high-performance I/O infrastructure, to ensure reliability and backwards compatibility of data files, and to help unify the data structures used in lattice codes, we have incorporated parallel HDF5 I/O into the SciDAC supported USQCD software stack. Here we present the design and implementation of this I/O framework. Our HDF5 implementation outperforms optimized QIO at the 10-20% level and leaves room for further improvement by utilizing appropriate dataset chunking.

### Anastasiia Butko, Rafael Garibotti, Luciano Ost, Vianney Lapotre, Abdoulaye Gamatie, Gilles Sassatelli, Chris Adeniyi-Jones,"A trace-driven approach for fast and accurate simulation of manycore architectures",Design Automation Conference (ASP-DAC), 2015 20th Asia and South Pacific,Chiba, Japan,IEEE,January 19, 2015,doi: 10.1109/ASPDAC.2015.7059093

The evolution of manycore systems, forecasted to feature hundreds of cores by the end of the decade, calls for efficient solutions for design space exploration and debugging. Among the relevant existing solutions, the well-known gem5 simulator provides a rich architecture description framework. However, these features come at the price of prohibitive simulation time that limits the scope of possible explorations to configurations made of tens of cores. To address this limitation, this paper proposes a novel trace-driven simulation approach for efficient exploration of manycore architectures.

### D. Zuev, E. Vecharynski, C. Yang, N. Orms, and A.I. Krylov,"New algorithms for iterative matrix-free eigensolvers in quantum chemistry",Journal of Computational Chemistry, Vol. 36, Issue 5, pp. 273–284,2015,

New algorithms for iterative diagonalization procedures that solve for a small set of eigen-states of a large matrix are described. The performance of the algorithms is illustrated by calculations of low and high-lying ionized and electronically excited states using equation-of-motion coupled-cluster methods with single and double substitutions (EOM-IP-CCSD and EOM-EE-CCSD). We present two algorithms suitable for calculating excited states that are close to a specified energy shift (interior eigenvalues). One solver is based on the Davidson algorithm, a diagonalization procedure commonly used in quantum-chemical calculations. The second is a recently developed solver, called the “Generalized Preconditioned Locally Harmonic Residual (GPLHR) method.” We also present a modification of the Davidson procedure that allows one to solve for a specific transition. The details of the algorithms, their computational scaling, and memory requirements are described. The new algorithms are implemented within the EOM-CC suite of methods in the Q-Chem electronic structure program.

### David H. Bailey, Stephanie Ger, Marcos Lopez de, Alexander Sim, Kesheng Wu,"Statistical Overfitting and Backtest Performance",Quantitative Finance,2015,

http://ssrn.com/abstract=2507040